# Worker State & Queue Consolidation — Design **Date:** 2026-04-27 **Status:** Approved (brainstorming) **Scope:** `ClaudeDo.Worker` + `ClaudeDo.Data` (TaskEntity, TaskRepository), EF migration ## Problem The worker layer has accumulated structural problems that culminate in a concrete bug — the queue does not pick up tasks created by a planning session. ### Concrete bug `TaskRepository.FinalizePlanningAsync(parentId, queueAgentTasks=true)` only flips a draft child to `Queued` if the child *or* its list carries the `agent` tag: ```csharp var shouldQueue = queueAgentTasks && (childHasAgentTag || listHasAgentTag); ``` When neither carries the tag, the child silently becomes `Manual` — the queue ignores it. There is no UI feedback. Users observe "queue never picks up planning tasks". ### Underlying design issues 1. **Status enum mixes orthogonal concerns.** Today's `TaskStatus` carries 10 values: lifecycle (`Manual, Queued, Running, Done, Failed`), planning hierarchy (`Planning, Planned`), chain ordering (`Waiting`), and an unclear `Draft`. Every consumer has to know which subset applies in which context. 2. **Status writes are scattered.** TaskRunner, StaleTaskRecovery, PlanningChainCoordinator, FinalizePlanningAsync, TaskResetService, ExternalMcpService, and PlanningMcpService all mutate `Status` directly. Some go through `TaskRepository.Mark*Async` helpers, some do `task.Status = …` straight on the DbContext (PlanningChainCoordinator). 3. **Guards are duplicated.** `if (Status == Running) throw …` appears in at least four places (delete, retag, merge, reset). 4. **Two competing planning flows.** `FinalizePlanningAsync` (parallel queueing in Repo) and `PlanningChainCoordinator.QueueSubtasksSequentiallyAsync` (sequential chain) make incompatible assumptions about child status. 5. **`WakeQueue()` is manual.** Multiple callers must remember to invoke it after any DB mutation that creates a `Queued` task. `QueueSubtasksSequentiallyAsync` forgets to. The queue only picks up after a backstop tick. 6. **`Worker/Services/` is a grab-bag.** Queue, lifecycle, merge, worktree maintenance, agent files, and recovery sit side-by-side without domain boundaries. ## Goals - One source of truth for status mutations: `TaskStateService`. - Status enum reflects only lifecycle. Planning state and chain blocking are separate fields. - Wake-queue side effects are automatic, not caller-driven. - Planning finalization has exactly one path. - `Worker/Services/` is split into domain folders. ## Non-Goals - No change to UI status-rendering logic beyond adapting to renamed values. - No change to SignalR/MCP wire formats beyond the necessary status-string updates. - No change to git/worktree behavior. ## Design ### 1. Status model reform Replace today's single `TaskStatus` with three orthogonal fields on `TaskEntity`. #### `TaskStatus` (lifecycle only) — 6 values | Value | Meaning | |---|---| | `Idle` | not in queue, not active. Replaces today's `Manual` and `Draft`. | | `Queued` | waiting for queue pickup. | | `Running` | currently executing. | | `Done` | finished successfully. | | `Failed` | finished with error. | | `Cancelled` | aborted by user (today conflated with `Failed`). | #### `PlanningPhase` (parent-only, new column) — 3 values | Value | Meaning | |---|---| | `None` | no planning session. Default for all tasks. | | `Active` | planning session is running. Replaces `Status=Planning`. | | `Finalized` | plan is committed, children exist. Replaces `Status=Planned`. | A parent task can now be `Status=Idle, PlanningPhase=Finalized` simultaneously, enabling re-runs of finalized plans without losing planning metadata. #### `BlockedByTaskId` (nullable FK, new column) — replaces `Waiting` - Today: `Status=Waiting` means "waiting on a predecessor in the chain". - New: `Status=Queued` AND `BlockedByTaskId=`. Picker filters out any row with `BlockedByTaskId IS NOT NULL`. - `ON DELETE SET NULL` — if predecessor is deleted, child becomes pickable. ### 2. `TaskStateService` (centralized state machine) The only component that writes `Status`, `PlanningPhase`, `BlockedByTaskId`. All other code goes through it. ```csharp public interface ITaskStateService { Task EnqueueAsync(string taskId, CancellationToken ct); Task StartRunningAsync(string taskId, DateTime startedAt, CancellationToken ct); Task CompleteAsync(string taskId, DateTime finishedAt, string? result, CancellationToken ct); Task FailAsync(string taskId, DateTime finishedAt, string? error, CancellationToken ct); Task CancelAsync(string taskId, DateTime finishedAt, CancellationToken ct); Task ResetToIdleAsync(string taskId, CancellationToken ct); Task StartPlanningAsync(string parentId, CancellationToken ct); Task FinalizePlanningAsync(string parentId, CancellationToken ct); Task BlockOnAsync(string taskId, string predecessorTaskId, CancellationToken ct); Task UnblockAsync(string taskId, CancellationToken ct); Task RecoverStaleRunningAsync(string reason, CancellationToken ct); } public sealed record TransitionResult(bool Ok, string? Reason); ``` #### Allowed transitions ``` Idle → Queued | Running (RunNow) Queued → Running | Cancelled | Idle (ResetToIdle) Running → Done | Failed | Cancelled Done → Idle (ResetToIdle, for re-run) Failed → Idle | Queued (re-queue) Cancelled → Idle | Queued ``` Anything else returns `TransitionResult(false, "invalid transition X→Y")`. No exceptions for invalid transitions — Result pattern keeps callers tolerant. #### Invariants 1. **Atomic.** Each transition is a single `ExecuteUpdate` (or short tx) using `WHERE Status = ` to be TOCTOU-free. 2. **Validated.** Source status is verified at the SQL level, not in C#. 3. **Side effects (after successful DB write):** - On any `→ Queued`: `IQueueWaker.Wake()`. - On any successful transition: `HubBroadcaster.TaskUpdated(taskId)`. - On `Done`/`Failed`/`Cancelled` for a child task: `IPlanningChainCoordinator.OnChildFinishedAsync`, which calls `_state.UnblockAsync(nextChild)` and `TryCompleteParent` if applicable. 4. **No caller responsibility for side effects.** A caller only needs to invoke one method. #### Caller migration | Today | New | |---|---| | `TaskRunner.MarkRunningAsync` | `_state.StartRunningAsync` | | `TaskRunner.HandleSuccess` (Mark + chain + parent) | `_state.CompleteAsync` (handles all) | | `TaskRunner.HandleFailure` | `_state.FailAsync` | | `StaleTaskRecovery.FlipAllRunningToFailedAsync` | `_state.RecoverStaleRunningAsync("worker restart")` | | `PlanningChainCoordinator.QueueSubtasksSequentiallyAsync` (direct DbContext) | iterates children, calls `_state.EnqueueAsync` for first, `_state.BlockOnAsync` for rest | | `TaskRepository.FinalizePlanningAsync` | **removed**; `PlanningSessionManager` orchestrates via state-service | | `TaskResetService` (direct DbContext) | `_state.ResetToIdleAsync` (service only owns worktree-cleanup) | `Mark*Async` repo helpers stay but become `internal` — used only by `TaskStateService`. ### 3. Queue dispatch & wake mechanics Three classes, clear responsibilities. #### `IQueueWaker` ```csharp public interface IQueueWaker { void Wake(); } ``` - Singleton. Backed by today's `SemaphoreSlim`. - Called automatically by `TaskStateService` after any `→ Queued` transition. - Manual `WakeQueue()` calls in app code are removed (Hub `WakeQueue` SignalR endpoint stays for diagnostics but maps directly to `IQueueWaker.Wake`). #### `IQueuePicker` ```csharp public interface IQueuePicker { Task ClaimNextAsync(DateTime now, CancellationToken ct); } ``` - The single place where queue selection happens. - Filter (all required): - `Status == Queued` - `BlockedByTaskId IS NULL` - `(ScheduledFor IS NULL OR ScheduledFor <= :now)` - `EXISTS task_tags WHERE name='agent'` OR `EXISTS list_tags WHERE name='agent'` - Order: `SortOrder ASC, CreatedAt ASC`. - Atomic claim via `UPDATE … RETURNING` (matching today's pattern), flips `Queued → Running` and writes `StartedAt`. - Picker is the sole caller of `Queued → Running` transition. `TaskStateService.StartRunningAsync` exists for the override slot path (RunNow / Continue). #### `QueueService` (BackgroundService) — slimmer - Wait on wake-signal or backstop timer. - Call `_picker.ClaimNextAsync`. - If task: occupy queue slot, run via `_runner.RunAsync`, in `ContinueWith` invoke `_waker.Wake()` for the next pickup. - No DbContext. No status mutation. No DTO knowledge. #### `OverrideSlotService` (new) - Owns `RunNow` and `ContinueTask` (today both in `QueueService`). - Holds the override slot state. - Status mutations go through `TaskStateService.StartRunningAsync` (non-atomic claim — caller-driven, fine because override is user-initiated and serialized by slot lock). ### 4. Planning chain integration Single flow, replaces both `FinalizePlanningAsync` (Repo) and `QueueSubtasksSequentiallyAsync` (Coordinator). 1. `PlanningSessionManager.StartAsync(parentId)` → `_state.StartPlanningAsync` → parent `PlanningPhase=Active`. 2. User edits children in MCP tool. Children are in `Status=Idle`. 3. `PlanningSessionManager.FinalizeAsync(parentId)`: - `_state.FinalizePlanningAsync(parentId)` → parent `PlanningPhase=Finalized, Status=Idle`. - `_chainCoordinator.SetupChainAsync(parentId)`: - Attaches `agent` tag to all children (automatic — confirmed in brainstorming). - `_state.EnqueueAsync(children[0])` → wake fires. - `_state.BlockOnAsync(children[i], children[i-1])` for `i ≥ 1`. 4. When a child finishes, `TaskRunner.HandleSuccess` calls `_state.CompleteAsync(child)`. State-service internally invokes `_chainCoordinator.OnChildFinishedAsync`, which calls `_state.UnblockAsync(nextChild)` (wake fires). Predecessor block goes away because of `ON DELETE SET NULL`-style logic in `UnblockAsync`. 5. When all children are terminal: `_state` runs `TryCompleteParent` and sets parent `Done`/`Failed` based on aggregate. `TaskRepository.FinalizePlanningAsync` is **deleted**. `QueueSubtasksSequentiallyAsync` is renamed to `SetupChainAsync` and made internal to the coordinator (called only from `PlanningSessionManager.FinalizeAsync`). ### 5. `Worker/Services/` reorganization ``` Worker/ State/ ITaskStateService.cs TaskStateService.cs TransitionResult.cs Queue/ IQueueWaker.cs IQueuePicker.cs QueuePicker.cs QueueService.cs (BackgroundService, slimmer) OverrideSlotService.cs QueueSlotState.cs Lifecycle/ StaleTaskRecovery.cs TaskResetService.cs TaskMergeService.cs Worktrees/ WorktreeMaintenanceService.cs Agents/ AgentFileService.cs DefaultAgentSeeder.cs Runner/ (unchanged) Planning/ (ChainCoordinator simplified) External/ (unchanged) Hub/ (unchanged) ``` `WorkerHub` calls fewer services — typically `_state.X` plus a domain service for non-status work (Merge, Worktree-Cleanup). ### 6. EF migration ```sql ALTER TABLE tasks ADD COLUMN planning_phase INTEGER NOT NULL DEFAULT 0; ALTER TABLE tasks ADD COLUMN blocked_by_task_id TEXT NULL REFERENCES tasks(id) ON DELETE SET NULL; CREATE INDEX ix_tasks_blocked_by ON tasks(blocked_by_task_id); UPDATE tasks SET status='idle' WHERE status='manual'; UPDATE tasks SET status='idle' WHERE status='draft'; UPDATE tasks SET status='idle', planning_phase=1 WHERE status='planning'; UPDATE tasks SET status='idle', planning_phase=2 WHERE status='planned'; ``` `Waiting` migration uses a CTE with `LAG()` to derive `BlockedByTaskId` from `(parent_task_id, sort_order)`: ```sql WITH ordered AS ( SELECT id, LAG(id) OVER (PARTITION BY parent_task_id ORDER BY sort_order, created_at) AS prev_id FROM tasks WHERE status='waiting' ) UPDATE tasks SET status='queued', blocked_by_task_id=(SELECT prev_id FROM ordered WHERE ordered.id=tasks.id) WHERE id IN (SELECT id FROM ordered); ``` Migration runs at worker startup via the existing `MigrateAsync` flow. `Down()` is best-effort (local-only app). Reverse mapping is lossy: `Cancelled` → `Failed`, `BlockedByTaskId` → `Waiting`, planning fields → folded back into status. ### 7. Test strategy New test fixtures (xUnit, real SQLite, real git where needed): 1. **`TaskStateServiceTests`** — happy path + reject for every transition; mock `IQueueWaker`, `HubBroadcaster`, `IPlanningChainCoordinator` and verify side-effect invocations; concurrency test (two parallel `StartRunningAsync` → exactly one wins). 2. **`QueuePickerTests`** — filter logic (blocked, missing tag, future schedule, wrong status) and ordering (`sort_order, created_at`); two parallel pickers → exactly one claims a row. 3. **`PlanningChainCoordinatorTests`** — `SetupChainAsync` produces correct (`Queued`, `BlockedBy`) layout; `OnChildFinishedAsync` unblocks the next child; child failure leaves remaining blocked, parent transitions to `Failed` after `TryCompleteParent`. 4. **`PlanningEndToEndTests`** — regression for the original bug. `Active` parent + 3 drafts → `Finalize` → assert first child reaches `Running` within 200 ms with no manual `Wake`. 5. **Existing tests** — anything seeding `task.Status = TaskStatus.Manual` or similar gets updated to new enum values or routed through `_state`. Coverage target: state machine + queue picker at ≥90% branch coverage. Existing coverage levels preserved elsewhere. ### 8. Implementation slices Each slice is one PR with green tests before the next starts. 1. **Slice 1 — Status model + migration.** New enum values, new columns, EF migration. Existing code mapped to new values mechanically (no behavior change). 2. **Slice 2 — `TaskStateService`.** Service + interface + tests. Migrate TaskRunner, StaleTaskRecovery, ExternalMcp/PlanningMcp guards, TaskResetService. Mark `Mark*Async` repo helpers `internal`. 3. **Slice 3 — `IQueueWaker` + `IQueuePicker`.** Extract from QueueService and Repo. Remove all manual `WakeQueue()` calls in app code. 4. **Slice 4 — Planning flow consolidation.** Delete `FinalizePlanningAsync` from repo. `PlanningSessionManager.FinalizeAsync` orchestrates via state-service + ChainCoordinator. Rename `QueueSubtasksSequentiallyAsync` → `SetupChainAsync` (internal). E2E test green. 5. **Slice 5 — `OverrideSlotService` + folder reorg.** Extract RunNow / ContinueTask. Move files to new folder structure. Update DI registration. 6. **Slice 6 — Cleanup & docs.** Update `Worker/CLAUDE.md`, `docs/plan.md`. Remove dead helpers. ## Risks & Mitigations - **EF migration on existing DBs.** Tested via integration tests that load a pre-migration fixture DB. `MigrateAsync` is already in production use, low risk. - **State-service becomes a god-object.** Mitigated by keeping it narrow: only status/phase/blocked-by writes, no business logic. Worktree, merge, and runner concerns stay in their own services. - **Two paths to `Running` (picker atomic, state-service for override).** Confirmed acceptable in brainstorming. Picker remains the only atomic-claim path; override slot is serialized by slot lock so non-atomic is safe. - **Waiting-migration CTE.** SQLite supports `LAG()` since 3.25. .NET 8's bundled SQLite is well above. Tested in migration unit tests. ## Open Questions None at design time. All knackpunkte resolved during brainstorming.