docs(superpowers): add worker state and queue consolidation spec

Approved design for centralizing task status mutations in a TaskStateService,
splitting TaskStatus into orthogonal lifecycle/planning/blocking fields, and
making queue wakes automatic. Sets up the 6-slice refactor of Worker/Services.
This commit is contained in:
Mika Kuns
2026-04-27 10:16:55 +02:00
parent 5c55f6c6cf
commit 43af17e546

View File

@@ -0,0 +1,297 @@
# Worker State & Queue Consolidation — Design
**Date:** 2026-04-27
**Status:** Approved (brainstorming)
**Scope:** `ClaudeDo.Worker` + `ClaudeDo.Data` (TaskEntity, TaskRepository), EF migration
## Problem
The worker layer has accumulated structural problems that culminate in a concrete bug — the queue does not pick up tasks created by a planning session.
### Concrete bug
`TaskRepository.FinalizePlanningAsync(parentId, queueAgentTasks=true)` only flips a draft child to `Queued` if the child *or* its list carries the `agent` tag:
```csharp
var shouldQueue = queueAgentTasks && (childHasAgentTag || listHasAgentTag);
```
When neither carries the tag, the child silently becomes `Manual` — the queue ignores it. There is no UI feedback. Users observe "queue never picks up planning tasks".
### Underlying design issues
1. **Status enum mixes orthogonal concerns.** Today's `TaskStatus` carries 10 values: lifecycle (`Manual, Queued, Running, Done, Failed`), planning hierarchy (`Planning, Planned`), chain ordering (`Waiting`), and an unclear `Draft`. Every consumer has to know which subset applies in which context.
2. **Status writes are scattered.** TaskRunner, StaleTaskRecovery, PlanningChainCoordinator, FinalizePlanningAsync, TaskResetService, ExternalMcpService, and PlanningMcpService all mutate `Status` directly. Some go through `TaskRepository.Mark*Async` helpers, some do `task.Status = …` straight on the DbContext (PlanningChainCoordinator).
3. **Guards are duplicated.** `if (Status == Running) throw …` appears in at least four places (delete, retag, merge, reset).
4. **Two competing planning flows.** `FinalizePlanningAsync` (parallel queueing in Repo) and `PlanningChainCoordinator.QueueSubtasksSequentiallyAsync` (sequential chain) make incompatible assumptions about child status.
5. **`WakeQueue()` is manual.** Multiple callers must remember to invoke it after any DB mutation that creates a `Queued` task. `QueueSubtasksSequentiallyAsync` forgets to. The queue only picks up after a backstop tick.
6. **`Worker/Services/` is a grab-bag.** Queue, lifecycle, merge, worktree maintenance, agent files, and recovery sit side-by-side without domain boundaries.
## Goals
- One source of truth for status mutations: `TaskStateService`.
- Status enum reflects only lifecycle. Planning state and chain blocking are separate fields.
- Wake-queue side effects are automatic, not caller-driven.
- Planning finalization has exactly one path.
- `Worker/Services/` is split into domain folders.
## Non-Goals
- No change to UI status-rendering logic beyond adapting to renamed values.
- No change to SignalR/MCP wire formats beyond the necessary status-string updates.
- No change to git/worktree behavior.
## Design
### 1. Status model reform
Replace today's single `TaskStatus` with three orthogonal fields on `TaskEntity`.
#### `TaskStatus` (lifecycle only) — 6 values
| Value | Meaning |
|---|---|
| `Idle` | not in queue, not active. Replaces today's `Manual` and `Draft`. |
| `Queued` | waiting for queue pickup. |
| `Running` | currently executing. |
| `Done` | finished successfully. |
| `Failed` | finished with error. |
| `Cancelled` | aborted by user (today conflated with `Failed`). |
#### `PlanningPhase` (parent-only, new column) — 3 values
| Value | Meaning |
|---|---|
| `None` | no planning session. Default for all tasks. |
| `Active` | planning session is running. Replaces `Status=Planning`. |
| `Finalized` | plan is committed, children exist. Replaces `Status=Planned`. |
A parent task can now be `Status=Idle, PlanningPhase=Finalized` simultaneously, enabling re-runs of finalized plans without losing planning metadata.
#### `BlockedByTaskId` (nullable FK, new column) — replaces `Waiting`
- Today: `Status=Waiting` means "waiting on a predecessor in the chain".
- New: `Status=Queued` AND `BlockedByTaskId=<predecessor>`. Picker filters out any row with `BlockedByTaskId IS NOT NULL`.
- `ON DELETE SET NULL` — if predecessor is deleted, child becomes pickable.
### 2. `TaskStateService` (centralized state machine)
The only component that writes `Status`, `PlanningPhase`, `BlockedByTaskId`. All other code goes through it.
```csharp
public interface ITaskStateService
{
Task<TransitionResult> EnqueueAsync(string taskId, CancellationToken ct);
Task<TransitionResult> StartRunningAsync(string taskId, DateTime startedAt, CancellationToken ct);
Task<TransitionResult> CompleteAsync(string taskId, DateTime finishedAt, string? result, CancellationToken ct);
Task<TransitionResult> FailAsync(string taskId, DateTime finishedAt, string? error, CancellationToken ct);
Task<TransitionResult> CancelAsync(string taskId, DateTime finishedAt, CancellationToken ct);
Task<TransitionResult> ResetToIdleAsync(string taskId, CancellationToken ct);
Task<TransitionResult> StartPlanningAsync(string parentId, CancellationToken ct);
Task<TransitionResult> FinalizePlanningAsync(string parentId, CancellationToken ct);
Task<TransitionResult> BlockOnAsync(string taskId, string predecessorTaskId, CancellationToken ct);
Task<TransitionResult> UnblockAsync(string taskId, CancellationToken ct);
Task<int> RecoverStaleRunningAsync(string reason, CancellationToken ct);
}
public sealed record TransitionResult(bool Ok, string? Reason);
```
#### Allowed transitions
```
Idle → Queued | Running (RunNow)
Queued → Running | Cancelled | Idle (ResetToIdle)
Running → Done | Failed | Cancelled
Done → Idle (ResetToIdle, for re-run)
Failed → Idle | Queued (re-queue)
Cancelled → Idle | Queued
```
Anything else returns `TransitionResult(false, "invalid transition X→Y")`. No exceptions for invalid transitions — Result pattern keeps callers tolerant.
#### Invariants
1. **Atomic.** Each transition is a single `ExecuteUpdate` (or short tx) using `WHERE Status = <expected>` to be TOCTOU-free.
2. **Validated.** Source status is verified at the SQL level, not in C#.
3. **Side effects (after successful DB write):**
- On any `→ Queued`: `IQueueWaker.Wake()`.
- On any successful transition: `HubBroadcaster.TaskUpdated(taskId)`.
- On `Done`/`Failed`/`Cancelled` for a child task: `IPlanningChainCoordinator.OnChildFinishedAsync`, which calls `_state.UnblockAsync(nextChild)` and `TryCompleteParent` if applicable.
4. **No caller responsibility for side effects.** A caller only needs to invoke one method.
#### Caller migration
| Today | New |
|---|---|
| `TaskRunner.MarkRunningAsync` | `_state.StartRunningAsync` |
| `TaskRunner.HandleSuccess` (Mark + chain + parent) | `_state.CompleteAsync` (handles all) |
| `TaskRunner.HandleFailure` | `_state.FailAsync` |
| `StaleTaskRecovery.FlipAllRunningToFailedAsync` | `_state.RecoverStaleRunningAsync("worker restart")` |
| `PlanningChainCoordinator.QueueSubtasksSequentiallyAsync` (direct DbContext) | iterates children, calls `_state.EnqueueAsync` for first, `_state.BlockOnAsync` for rest |
| `TaskRepository.FinalizePlanningAsync` | **removed**; `PlanningSessionManager` orchestrates via state-service |
| `TaskResetService` (direct DbContext) | `_state.ResetToIdleAsync` (service only owns worktree-cleanup) |
`Mark*Async` repo helpers stay but become `internal` — used only by `TaskStateService`.
### 3. Queue dispatch & wake mechanics
Three classes, clear responsibilities.
#### `IQueueWaker`
```csharp
public interface IQueueWaker { void Wake(); }
```
- Singleton. Backed by today's `SemaphoreSlim`.
- Called automatically by `TaskStateService` after any `→ Queued` transition.
- Manual `WakeQueue()` calls in app code are removed (Hub `WakeQueue` SignalR endpoint stays for diagnostics but maps directly to `IQueueWaker.Wake`).
#### `IQueuePicker`
```csharp
public interface IQueuePicker
{
Task<TaskEntity?> ClaimNextAsync(DateTime now, CancellationToken ct);
}
```
- The single place where queue selection happens.
- Filter (all required):
- `Status == Queued`
- `BlockedByTaskId IS NULL`
- `(ScheduledFor IS NULL OR ScheduledFor <= :now)`
- `EXISTS task_tags WHERE name='agent'` OR `EXISTS list_tags WHERE name='agent'`
- Order: `SortOrder ASC, CreatedAt ASC`.
- Atomic claim via `UPDATE … RETURNING` (matching today's pattern), flips `Queued → Running` and writes `StartedAt`.
- Picker is the sole caller of `Queued → Running` transition. `TaskStateService.StartRunningAsync` exists for the override slot path (RunNow / Continue).
#### `QueueService` (BackgroundService) — slimmer
- Wait on wake-signal or backstop timer.
- Call `_picker.ClaimNextAsync`.
- If task: occupy queue slot, run via `_runner.RunAsync`, in `ContinueWith` invoke `_waker.Wake()` for the next pickup.
- No DbContext. No status mutation. No DTO knowledge.
#### `OverrideSlotService` (new)
- Owns `RunNow` and `ContinueTask` (today both in `QueueService`).
- Holds the override slot state.
- Status mutations go through `TaskStateService.StartRunningAsync` (non-atomic claim — caller-driven, fine because override is user-initiated and serialized by slot lock).
### 4. Planning chain integration
Single flow, replaces both `FinalizePlanningAsync` (Repo) and `QueueSubtasksSequentiallyAsync` (Coordinator).
1. `PlanningSessionManager.StartAsync(parentId)``_state.StartPlanningAsync` → parent `PlanningPhase=Active`.
2. User edits children in MCP tool. Children are in `Status=Idle`.
3. `PlanningSessionManager.FinalizeAsync(parentId)`:
- `_state.FinalizePlanningAsync(parentId)` → parent `PlanningPhase=Finalized, Status=Idle`.
- `_chainCoordinator.SetupChainAsync(parentId)`:
- Attaches `agent` tag to all children (automatic — confirmed in brainstorming).
- `_state.EnqueueAsync(children[0])` → wake fires.
- `_state.BlockOnAsync(children[i], children[i-1])` for `i ≥ 1`.
4. When a child finishes, `TaskRunner.HandleSuccess` calls `_state.CompleteAsync(child)`. State-service internally invokes `_chainCoordinator.OnChildFinishedAsync`, which calls `_state.UnblockAsync(nextChild)` (wake fires). Predecessor block goes away because of `ON DELETE SET NULL`-style logic in `UnblockAsync`.
5. When all children are terminal: `_state` runs `TryCompleteParent` and sets parent `Done`/`Failed` based on aggregate.
`TaskRepository.FinalizePlanningAsync` is **deleted**. `QueueSubtasksSequentiallyAsync` is renamed to `SetupChainAsync` and made internal to the coordinator (called only from `PlanningSessionManager.FinalizeAsync`).
### 5. `Worker/Services/` reorganization
```
Worker/
State/
ITaskStateService.cs
TaskStateService.cs
TransitionResult.cs
Queue/
IQueueWaker.cs
IQueuePicker.cs
QueuePicker.cs
QueueService.cs (BackgroundService, slimmer)
OverrideSlotService.cs
QueueSlotState.cs
Lifecycle/
StaleTaskRecovery.cs
TaskResetService.cs
TaskMergeService.cs
Worktrees/
WorktreeMaintenanceService.cs
Agents/
AgentFileService.cs
DefaultAgentSeeder.cs
Runner/ (unchanged)
Planning/ (ChainCoordinator simplified)
External/ (unchanged)
Hub/ (unchanged)
```
`WorkerHub` calls fewer services — typically `_state.X` plus a domain service for non-status work (Merge, Worktree-Cleanup).
### 6. EF migration
```sql
ALTER TABLE tasks ADD COLUMN planning_phase INTEGER NOT NULL DEFAULT 0;
ALTER TABLE tasks ADD COLUMN blocked_by_task_id TEXT NULL REFERENCES tasks(id) ON DELETE SET NULL;
CREATE INDEX ix_tasks_blocked_by ON tasks(blocked_by_task_id);
UPDATE tasks SET status='idle' WHERE status='manual';
UPDATE tasks SET status='idle' WHERE status='draft';
UPDATE tasks SET status='idle', planning_phase=1 WHERE status='planning';
UPDATE tasks SET status='idle', planning_phase=2 WHERE status='planned';
```
`Waiting` migration uses a CTE with `LAG()` to derive `BlockedByTaskId` from `(parent_task_id, sort_order)`:
```sql
WITH ordered AS (
SELECT id,
LAG(id) OVER (PARTITION BY parent_task_id ORDER BY sort_order, created_at) AS prev_id
FROM tasks WHERE status='waiting'
)
UPDATE tasks SET status='queued',
blocked_by_task_id=(SELECT prev_id FROM ordered WHERE ordered.id=tasks.id)
WHERE id IN (SELECT id FROM ordered);
```
Migration runs at worker startup via the existing `MigrateAsync` flow.
`Down()` is best-effort (local-only app). Reverse mapping is lossy: `Cancelled``Failed`, `BlockedByTaskId``Waiting`, planning fields → folded back into status.
### 7. Test strategy
New test fixtures (xUnit, real SQLite, real git where needed):
1. **`TaskStateServiceTests`** — happy path + reject for every transition; mock `IQueueWaker`, `HubBroadcaster`, `IPlanningChainCoordinator` and verify side-effect invocations; concurrency test (two parallel `StartRunningAsync` → exactly one wins).
2. **`QueuePickerTests`** — filter logic (blocked, missing tag, future schedule, wrong status) and ordering (`sort_order, created_at`); two parallel pickers → exactly one claims a row.
3. **`PlanningChainCoordinatorTests`** — `SetupChainAsync` produces correct (`Queued`, `BlockedBy`) layout; `OnChildFinishedAsync` unblocks the next child; child failure leaves remaining blocked, parent transitions to `Failed` after `TryCompleteParent`.
4. **`PlanningEndToEndTests`** — regression for the original bug. `Active` parent + 3 drafts → `Finalize` → assert first child reaches `Running` within 200 ms with no manual `Wake`.
5. **Existing tests** — anything seeding `task.Status = TaskStatus.Manual` or similar gets updated to new enum values or routed through `_state`.
Coverage target: state machine + queue picker at ≥90% branch coverage. Existing coverage levels preserved elsewhere.
### 8. Implementation slices
Each slice is one PR with green tests before the next starts.
1. **Slice 1 — Status model + migration.** New enum values, new columns, EF migration. Existing code mapped to new values mechanically (no behavior change).
2. **Slice 2 — `TaskStateService`.** Service + interface + tests. Migrate TaskRunner, StaleTaskRecovery, ExternalMcp/PlanningMcp guards, TaskResetService. Mark `Mark*Async` repo helpers `internal`.
3. **Slice 3 — `IQueueWaker` + `IQueuePicker`.** Extract from QueueService and Repo. Remove all manual `WakeQueue()` calls in app code.
4. **Slice 4 — Planning flow consolidation.** Delete `FinalizePlanningAsync` from repo. `PlanningSessionManager.FinalizeAsync` orchestrates via state-service + ChainCoordinator. Rename `QueueSubtasksSequentiallyAsync``SetupChainAsync` (internal). E2E test green.
5. **Slice 5 — `OverrideSlotService` + folder reorg.** Extract RunNow / ContinueTask. Move files to new folder structure. Update DI registration.
6. **Slice 6 — Cleanup & docs.** Update `Worker/CLAUDE.md`, `docs/plan.md`. Remove dead helpers.
## Risks & Mitigations
- **EF migration on existing DBs.** Tested via integration tests that load a pre-migration fixture DB. `MigrateAsync` is already in production use, low risk.
- **State-service becomes a god-object.** Mitigated by keeping it narrow: only status/phase/blocked-by writes, no business logic. Worktree, merge, and runner concerns stay in their own services.
- **Two paths to `Running` (picker atomic, state-service for override).** Confirmed acceptable in brainstorming. Picker remains the only atomic-claim path; override slot is serialized by slot lock so non-atomic is safe.
- **Waiting-migration CTE.** SQLite supports `LAG()` since 3.25. .NET 8's bundled SQLite is well above. Tested in migration unit tests.
## Open Questions
None at design time. All knackpunkte resolved during brainstorming.