diff --git a/docs/superpowers/specs/2026-04-27-worker-state-and-queue-consolidation-design.md b/docs/superpowers/specs/2026-04-27-worker-state-and-queue-consolidation-design.md new file mode 100644 index 0000000..54d1fcd --- /dev/null +++ b/docs/superpowers/specs/2026-04-27-worker-state-and-queue-consolidation-design.md @@ -0,0 +1,297 @@ +# Worker State & Queue Consolidation — Design + +**Date:** 2026-04-27 +**Status:** Approved (brainstorming) +**Scope:** `ClaudeDo.Worker` + `ClaudeDo.Data` (TaskEntity, TaskRepository), EF migration + +## Problem + +The worker layer has accumulated structural problems that culminate in a concrete bug — the queue does not pick up tasks created by a planning session. + +### Concrete bug + +`TaskRepository.FinalizePlanningAsync(parentId, queueAgentTasks=true)` only flips a draft child to `Queued` if the child *or* its list carries the `agent` tag: + +```csharp +var shouldQueue = queueAgentTasks && (childHasAgentTag || listHasAgentTag); +``` + +When neither carries the tag, the child silently becomes `Manual` — the queue ignores it. There is no UI feedback. Users observe "queue never picks up planning tasks". + +### Underlying design issues + +1. **Status enum mixes orthogonal concerns.** Today's `TaskStatus` carries 10 values: lifecycle (`Manual, Queued, Running, Done, Failed`), planning hierarchy (`Planning, Planned`), chain ordering (`Waiting`), and an unclear `Draft`. Every consumer has to know which subset applies in which context. +2. **Status writes are scattered.** TaskRunner, StaleTaskRecovery, PlanningChainCoordinator, FinalizePlanningAsync, TaskResetService, ExternalMcpService, and PlanningMcpService all mutate `Status` directly. Some go through `TaskRepository.Mark*Async` helpers, some do `task.Status = …` straight on the DbContext (PlanningChainCoordinator). +3. **Guards are duplicated.** `if (Status == Running) throw …` appears in at least four places (delete, retag, merge, reset). +4. **Two competing planning flows.** `FinalizePlanningAsync` (parallel queueing in Repo) and `PlanningChainCoordinator.QueueSubtasksSequentiallyAsync` (sequential chain) make incompatible assumptions about child status. +5. **`WakeQueue()` is manual.** Multiple callers must remember to invoke it after any DB mutation that creates a `Queued` task. `QueueSubtasksSequentiallyAsync` forgets to. The queue only picks up after a backstop tick. +6. **`Worker/Services/` is a grab-bag.** Queue, lifecycle, merge, worktree maintenance, agent files, and recovery sit side-by-side without domain boundaries. + +## Goals + +- One source of truth for status mutations: `TaskStateService`. +- Status enum reflects only lifecycle. Planning state and chain blocking are separate fields. +- Wake-queue side effects are automatic, not caller-driven. +- Planning finalization has exactly one path. +- `Worker/Services/` is split into domain folders. + +## Non-Goals + +- No change to UI status-rendering logic beyond adapting to renamed values. +- No change to SignalR/MCP wire formats beyond the necessary status-string updates. +- No change to git/worktree behavior. + +## Design + +### 1. Status model reform + +Replace today's single `TaskStatus` with three orthogonal fields on `TaskEntity`. + +#### `TaskStatus` (lifecycle only) — 6 values + +| Value | Meaning | +|---|---| +| `Idle` | not in queue, not active. Replaces today's `Manual` and `Draft`. | +| `Queued` | waiting for queue pickup. | +| `Running` | currently executing. | +| `Done` | finished successfully. | +| `Failed` | finished with error. | +| `Cancelled` | aborted by user (today conflated with `Failed`). | + +#### `PlanningPhase` (parent-only, new column) — 3 values + +| Value | Meaning | +|---|---| +| `None` | no planning session. Default for all tasks. | +| `Active` | planning session is running. Replaces `Status=Planning`. | +| `Finalized` | plan is committed, children exist. Replaces `Status=Planned`. | + +A parent task can now be `Status=Idle, PlanningPhase=Finalized` simultaneously, enabling re-runs of finalized plans without losing planning metadata. + +#### `BlockedByTaskId` (nullable FK, new column) — replaces `Waiting` + +- Today: `Status=Waiting` means "waiting on a predecessor in the chain". +- New: `Status=Queued` AND `BlockedByTaskId=`. Picker filters out any row with `BlockedByTaskId IS NOT NULL`. +- `ON DELETE SET NULL` — if predecessor is deleted, child becomes pickable. + +### 2. `TaskStateService` (centralized state machine) + +The only component that writes `Status`, `PlanningPhase`, `BlockedByTaskId`. All other code goes through it. + +```csharp +public interface ITaskStateService +{ + Task EnqueueAsync(string taskId, CancellationToken ct); + Task StartRunningAsync(string taskId, DateTime startedAt, CancellationToken ct); + Task CompleteAsync(string taskId, DateTime finishedAt, string? result, CancellationToken ct); + Task FailAsync(string taskId, DateTime finishedAt, string? error, CancellationToken ct); + Task CancelAsync(string taskId, DateTime finishedAt, CancellationToken ct); + Task ResetToIdleAsync(string taskId, CancellationToken ct); + + Task StartPlanningAsync(string parentId, CancellationToken ct); + Task FinalizePlanningAsync(string parentId, CancellationToken ct); + + Task BlockOnAsync(string taskId, string predecessorTaskId, CancellationToken ct); + Task UnblockAsync(string taskId, CancellationToken ct); + + Task RecoverStaleRunningAsync(string reason, CancellationToken ct); +} + +public sealed record TransitionResult(bool Ok, string? Reason); +``` + +#### Allowed transitions + +``` +Idle → Queued | Running (RunNow) +Queued → Running | Cancelled | Idle (ResetToIdle) +Running → Done | Failed | Cancelled +Done → Idle (ResetToIdle, for re-run) +Failed → Idle | Queued (re-queue) +Cancelled → Idle | Queued +``` + +Anything else returns `TransitionResult(false, "invalid transition X→Y")`. No exceptions for invalid transitions — Result pattern keeps callers tolerant. + +#### Invariants + +1. **Atomic.** Each transition is a single `ExecuteUpdate` (or short tx) using `WHERE Status = ` to be TOCTOU-free. +2. **Validated.** Source status is verified at the SQL level, not in C#. +3. **Side effects (after successful DB write):** + - On any `→ Queued`: `IQueueWaker.Wake()`. + - On any successful transition: `HubBroadcaster.TaskUpdated(taskId)`. + - On `Done`/`Failed`/`Cancelled` for a child task: `IPlanningChainCoordinator.OnChildFinishedAsync`, which calls `_state.UnblockAsync(nextChild)` and `TryCompleteParent` if applicable. +4. **No caller responsibility for side effects.** A caller only needs to invoke one method. + +#### Caller migration + +| Today | New | +|---|---| +| `TaskRunner.MarkRunningAsync` | `_state.StartRunningAsync` | +| `TaskRunner.HandleSuccess` (Mark + chain + parent) | `_state.CompleteAsync` (handles all) | +| `TaskRunner.HandleFailure` | `_state.FailAsync` | +| `StaleTaskRecovery.FlipAllRunningToFailedAsync` | `_state.RecoverStaleRunningAsync("worker restart")` | +| `PlanningChainCoordinator.QueueSubtasksSequentiallyAsync` (direct DbContext) | iterates children, calls `_state.EnqueueAsync` for first, `_state.BlockOnAsync` for rest | +| `TaskRepository.FinalizePlanningAsync` | **removed**; `PlanningSessionManager` orchestrates via state-service | +| `TaskResetService` (direct DbContext) | `_state.ResetToIdleAsync` (service only owns worktree-cleanup) | + +`Mark*Async` repo helpers stay but become `internal` — used only by `TaskStateService`. + +### 3. Queue dispatch & wake mechanics + +Three classes, clear responsibilities. + +#### `IQueueWaker` + +```csharp +public interface IQueueWaker { void Wake(); } +``` + +- Singleton. Backed by today's `SemaphoreSlim`. +- Called automatically by `TaskStateService` after any `→ Queued` transition. +- Manual `WakeQueue()` calls in app code are removed (Hub `WakeQueue` SignalR endpoint stays for diagnostics but maps directly to `IQueueWaker.Wake`). + +#### `IQueuePicker` + +```csharp +public interface IQueuePicker +{ + Task ClaimNextAsync(DateTime now, CancellationToken ct); +} +``` + +- The single place where queue selection happens. +- Filter (all required): + - `Status == Queued` + - `BlockedByTaskId IS NULL` + - `(ScheduledFor IS NULL OR ScheduledFor <= :now)` + - `EXISTS task_tags WHERE name='agent'` OR `EXISTS list_tags WHERE name='agent'` +- Order: `SortOrder ASC, CreatedAt ASC`. +- Atomic claim via `UPDATE … RETURNING` (matching today's pattern), flips `Queued → Running` and writes `StartedAt`. +- Picker is the sole caller of `Queued → Running` transition. `TaskStateService.StartRunningAsync` exists for the override slot path (RunNow / Continue). + +#### `QueueService` (BackgroundService) — slimmer + +- Wait on wake-signal or backstop timer. +- Call `_picker.ClaimNextAsync`. +- If task: occupy queue slot, run via `_runner.RunAsync`, in `ContinueWith` invoke `_waker.Wake()` for the next pickup. +- No DbContext. No status mutation. No DTO knowledge. + +#### `OverrideSlotService` (new) + +- Owns `RunNow` and `ContinueTask` (today both in `QueueService`). +- Holds the override slot state. +- Status mutations go through `TaskStateService.StartRunningAsync` (non-atomic claim — caller-driven, fine because override is user-initiated and serialized by slot lock). + +### 4. Planning chain integration + +Single flow, replaces both `FinalizePlanningAsync` (Repo) and `QueueSubtasksSequentiallyAsync` (Coordinator). + +1. `PlanningSessionManager.StartAsync(parentId)` → `_state.StartPlanningAsync` → parent `PlanningPhase=Active`. +2. User edits children in MCP tool. Children are in `Status=Idle`. +3. `PlanningSessionManager.FinalizeAsync(parentId)`: + - `_state.FinalizePlanningAsync(parentId)` → parent `PlanningPhase=Finalized, Status=Idle`. + - `_chainCoordinator.SetupChainAsync(parentId)`: + - Attaches `agent` tag to all children (automatic — confirmed in brainstorming). + - `_state.EnqueueAsync(children[0])` → wake fires. + - `_state.BlockOnAsync(children[i], children[i-1])` for `i ≥ 1`. +4. When a child finishes, `TaskRunner.HandleSuccess` calls `_state.CompleteAsync(child)`. State-service internally invokes `_chainCoordinator.OnChildFinishedAsync`, which calls `_state.UnblockAsync(nextChild)` (wake fires). Predecessor block goes away because of `ON DELETE SET NULL`-style logic in `UnblockAsync`. +5. When all children are terminal: `_state` runs `TryCompleteParent` and sets parent `Done`/`Failed` based on aggregate. + +`TaskRepository.FinalizePlanningAsync` is **deleted**. `QueueSubtasksSequentiallyAsync` is renamed to `SetupChainAsync` and made internal to the coordinator (called only from `PlanningSessionManager.FinalizeAsync`). + +### 5. `Worker/Services/` reorganization + +``` +Worker/ + State/ + ITaskStateService.cs + TaskStateService.cs + TransitionResult.cs + Queue/ + IQueueWaker.cs + IQueuePicker.cs + QueuePicker.cs + QueueService.cs (BackgroundService, slimmer) + OverrideSlotService.cs + QueueSlotState.cs + Lifecycle/ + StaleTaskRecovery.cs + TaskResetService.cs + TaskMergeService.cs + Worktrees/ + WorktreeMaintenanceService.cs + Agents/ + AgentFileService.cs + DefaultAgentSeeder.cs + Runner/ (unchanged) + Planning/ (ChainCoordinator simplified) + External/ (unchanged) + Hub/ (unchanged) +``` + +`WorkerHub` calls fewer services — typically `_state.X` plus a domain service for non-status work (Merge, Worktree-Cleanup). + +### 6. EF migration + +```sql +ALTER TABLE tasks ADD COLUMN planning_phase INTEGER NOT NULL DEFAULT 0; +ALTER TABLE tasks ADD COLUMN blocked_by_task_id TEXT NULL REFERENCES tasks(id) ON DELETE SET NULL; +CREATE INDEX ix_tasks_blocked_by ON tasks(blocked_by_task_id); + +UPDATE tasks SET status='idle' WHERE status='manual'; +UPDATE tasks SET status='idle' WHERE status='draft'; +UPDATE tasks SET status='idle', planning_phase=1 WHERE status='planning'; +UPDATE tasks SET status='idle', planning_phase=2 WHERE status='planned'; +``` + +`Waiting` migration uses a CTE with `LAG()` to derive `BlockedByTaskId` from `(parent_task_id, sort_order)`: + +```sql +WITH ordered AS ( + SELECT id, + LAG(id) OVER (PARTITION BY parent_task_id ORDER BY sort_order, created_at) AS prev_id + FROM tasks WHERE status='waiting' +) +UPDATE tasks SET status='queued', + blocked_by_task_id=(SELECT prev_id FROM ordered WHERE ordered.id=tasks.id) +WHERE id IN (SELECT id FROM ordered); +``` + +Migration runs at worker startup via the existing `MigrateAsync` flow. + +`Down()` is best-effort (local-only app). Reverse mapping is lossy: `Cancelled` → `Failed`, `BlockedByTaskId` → `Waiting`, planning fields → folded back into status. + +### 7. Test strategy + +New test fixtures (xUnit, real SQLite, real git where needed): + +1. **`TaskStateServiceTests`** — happy path + reject for every transition; mock `IQueueWaker`, `HubBroadcaster`, `IPlanningChainCoordinator` and verify side-effect invocations; concurrency test (two parallel `StartRunningAsync` → exactly one wins). +2. **`QueuePickerTests`** — filter logic (blocked, missing tag, future schedule, wrong status) and ordering (`sort_order, created_at`); two parallel pickers → exactly one claims a row. +3. **`PlanningChainCoordinatorTests`** — `SetupChainAsync` produces correct (`Queued`, `BlockedBy`) layout; `OnChildFinishedAsync` unblocks the next child; child failure leaves remaining blocked, parent transitions to `Failed` after `TryCompleteParent`. +4. **`PlanningEndToEndTests`** — regression for the original bug. `Active` parent + 3 drafts → `Finalize` → assert first child reaches `Running` within 200 ms with no manual `Wake`. +5. **Existing tests** — anything seeding `task.Status = TaskStatus.Manual` or similar gets updated to new enum values or routed through `_state`. + +Coverage target: state machine + queue picker at ≥90% branch coverage. Existing coverage levels preserved elsewhere. + +### 8. Implementation slices + +Each slice is one PR with green tests before the next starts. + +1. **Slice 1 — Status model + migration.** New enum values, new columns, EF migration. Existing code mapped to new values mechanically (no behavior change). +2. **Slice 2 — `TaskStateService`.** Service + interface + tests. Migrate TaskRunner, StaleTaskRecovery, ExternalMcp/PlanningMcp guards, TaskResetService. Mark `Mark*Async` repo helpers `internal`. +3. **Slice 3 — `IQueueWaker` + `IQueuePicker`.** Extract from QueueService and Repo. Remove all manual `WakeQueue()` calls in app code. +4. **Slice 4 — Planning flow consolidation.** Delete `FinalizePlanningAsync` from repo. `PlanningSessionManager.FinalizeAsync` orchestrates via state-service + ChainCoordinator. Rename `QueueSubtasksSequentiallyAsync` → `SetupChainAsync` (internal). E2E test green. +5. **Slice 5 — `OverrideSlotService` + folder reorg.** Extract RunNow / ContinueTask. Move files to new folder structure. Update DI registration. +6. **Slice 6 — Cleanup & docs.** Update `Worker/CLAUDE.md`, `docs/plan.md`. Remove dead helpers. + +## Risks & Mitigations + +- **EF migration on existing DBs.** Tested via integration tests that load a pre-migration fixture DB. `MigrateAsync` is already in production use, low risk. +- **State-service becomes a god-object.** Mitigated by keeping it narrow: only status/phase/blocked-by writes, no business logic. Worktree, merge, and runner concerns stay in their own services. +- **Two paths to `Running` (picker atomic, state-service for override).** Confirmed acceptable in brainstorming. Picker remains the only atomic-claim path; override slot is serialized by slot lock so non-atomic is safe. +- **Waiting-migration CTE.** SQLite supports `LAG()` since 3.25. .NET 8's bundled SQLite is well above. Tested in migration unit tests. + +## Open Questions + +None at design time. All knackpunkte resolved during brainstorming.