Approved design for centralizing task status mutations in a TaskStateService, splitting TaskStatus into orthogonal lifecycle/planning/blocking fields, and making queue wakes automatic. Sets up the 6-slice refactor of Worker/Services.
Worker State & Queue Consolidation — Design
Date: 2026-04-27
Status: Approved (brainstorming)
Scope: ClaudeDo.Worker + ClaudeDo.Data (TaskEntity, TaskRepository), EF migration
Problem
The worker layer has accumulated structural problems that culminate in a concrete bug — the queue does not pick up tasks created by a planning session.
Concrete bug
TaskRepository.FinalizePlanningAsync(parentId, queueAgentTasks=true) only flips a draft child to Queued if the child or its list carries the agent tag:
var shouldQueue = queueAgentTasks && (childHasAgentTag || listHasAgentTag);
When neither carries the tag, the child silently becomes Manual — the queue ignores it. There is no UI feedback. Users observe "queue never picks up planning tasks".
Underlying design issues
- Status enum mixes orthogonal concerns. Today's `TaskStatus` carries 10 values: lifecycle (Manual, Queued, Running, Done, Failed), planning hierarchy (Planning, Planned), chain ordering (Waiting), and an unclear `Draft`. Every consumer has to know which subset applies in which context.
- Status writes are scattered. TaskRunner, StaleTaskRecovery, PlanningChainCoordinator, FinalizePlanningAsync, TaskResetService, ExternalMcpService, and PlanningMcpService all mutate `Status` directly. Some go through `TaskRepository.Mark*Async` helpers, some do `task.Status = …` straight on the DbContext (PlanningChainCoordinator).
- Guards are duplicated. `if (Status == Running) throw …` appears in at least four places (delete, retag, merge, reset).
- Two competing planning flows. `FinalizePlanningAsync` (parallel queueing in Repo) and `PlanningChainCoordinator.QueueSubtasksSequentiallyAsync` (sequential chain) make incompatible assumptions about child status.
- `WakeQueue()` is manual. Multiple callers must remember to invoke it after any DB mutation that creates a `Queued` task. `QueueSubtasksSequentiallyAsync` forgets to. The queue only picks up after a backstop tick.
- `Worker/Services/` is a grab-bag. Queue, lifecycle, merge, worktree maintenance, agent files, and recovery sit side-by-side without domain boundaries.
Goals
- One source of truth for status mutations: `TaskStateService`.
- Status enum reflects only lifecycle. Planning state and chain blocking are separate fields.
- Wake-queue side effects are automatic, not caller-driven.
- Planning finalization has exactly one path.
- `Worker/Services/` is split into domain folders.
Non-Goals
- No change to UI status-rendering logic beyond adapting to renamed values.
- No change to SignalR/MCP wire formats beyond the necessary status-string updates.
- No change to git/worktree behavior.
Design
1. Status model reform
Replace today's single TaskStatus with three orthogonal fields on TaskEntity.
TaskStatus (lifecycle only) — 6 values
| Value | Meaning |
|---|---|
| `Idle` | not in queue, not active. Replaces today's Manual and Draft. |
| `Queued` | waiting for queue pickup. |
| `Running` | currently executing. |
| `Done` | finished successfully. |
| `Failed` | finished with error. |
| `Cancelled` | aborted by user (today conflated with Failed). |
PlanningPhase (parent-only, new column) — 3 values
| Value | Meaning |
|---|---|
| `None` | no planning session. Default for all tasks. |
| `Active` | planning session is running. Replaces Status=Planning. |
| `Finalized` | plan is committed, children exist. Replaces Status=Planned. |
A parent task can now be Status=Idle, PlanningPhase=Finalized simultaneously, enabling re-runs of finalized plans without losing planning metadata.
BlockedByTaskId (nullable FK, new column) — replaces Waiting
- Today: `Status=Waiting` means "waiting on a predecessor in the chain".
- New: `Status=Queued` AND `BlockedByTaskId=<predecessor>`. Picker filters out any row with `BlockedByTaskId IS NOT NULL`.
- `ON DELETE SET NULL`: if predecessor is deleted, child becomes pickable.
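To make the three-field model concrete, here is a minimal in-memory sketch. The `TaskState` record and its `IsPickable` property are hypothetical illustrations, not part of `TaskEntity`; the numeric `PlanningPhase` values are assumed from the `planning_phase` defaults used in the migration section, and `IsPickable` deliberately ignores scheduling and agent tags.

```csharp
using System;

// A finalized parent can sit in Idle without losing its planning metadata.
var parent = new TaskState(TaskStatus.Idle, PlanningPhase.Finalized, BlockedByTaskId: null);
// What used to be Status=Waiting is now Queued plus a predecessor reference.
var waiting = new TaskState(TaskStatus.Queued, PlanningPhase.None, BlockedByTaskId: "task-1");

Console.WriteLine(parent.IsPickable);                                     // False
Console.WriteLine(waiting.IsPickable);                                    // False
Console.WriteLine((waiting with { BlockedByTaskId = null }).IsPickable);  // True

public enum TaskStatus { Idle, Queued, Running, Done, Failed, Cancelled }

// Values assumed to match planning_phase 0/1/2 in the migration.
public enum PlanningPhase { None = 0, Active = 1, Finalized = 2 }

// Hypothetical stand-in for the three orthogonal fields on TaskEntity.
public sealed record TaskState(TaskStatus Status, PlanningPhase PlanningPhase, string? BlockedByTaskId)
{
    // Only the lifecycle and blocking parts of the picker filter are modelled here.
    public bool IsPickable => Status == TaskStatus.Queued && BlockedByTaskId is null;
}
```

Clearing `BlockedByTaskId` is the only change needed to make a blocked child eligible, which is what both `UnblockAsync` and the `ON DELETE SET NULL` rule rely on.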
2. TaskStateService (centralized state machine)
The only component that writes Status, PlanningPhase, BlockedByTaskId. All other code goes through it.
public interface ITaskStateService
{
Task<TransitionResult> EnqueueAsync(string taskId, CancellationToken ct);
Task<TransitionResult> StartRunningAsync(string taskId, DateTime startedAt, CancellationToken ct);
Task<TransitionResult> CompleteAsync(string taskId, DateTime finishedAt, string? result, CancellationToken ct);
Task<TransitionResult> FailAsync(string taskId, DateTime finishedAt, string? error, CancellationToken ct);
Task<TransitionResult> CancelAsync(string taskId, DateTime finishedAt, CancellationToken ct);
Task<TransitionResult> ResetToIdleAsync(string taskId, CancellationToken ct);
Task<TransitionResult> StartPlanningAsync(string parentId, CancellationToken ct);
Task<TransitionResult> FinalizePlanningAsync(string parentId, CancellationToken ct);
Task<TransitionResult> BlockOnAsync(string taskId, string predecessorTaskId, CancellationToken ct);
Task<TransitionResult> UnblockAsync(string taskId, CancellationToken ct);
Task<int> RecoverStaleRunningAsync(string reason, CancellationToken ct);
}
public sealed record TransitionResult(bool Ok, string? Reason);
Allowed transitions
Idle → Queued | Running (RunNow)
Queued → Running | Cancelled | Idle (ResetToIdle)
Running → Done | Failed | Cancelled
Done → Idle (ResetToIdle, for re-run)
Failed → Idle | Queued (re-queue)
Cancelled → Idle | Queued
Anything else returns TransitionResult(false, "invalid transition X→Y"). No exceptions for invalid transitions — Result pattern keeps callers tolerant.
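The transition list is small enough to encode as a lookup table. The following is a minimal sketch under that assumption, not the actual `TaskStateService`: the `Transitions` class and its `Check` method are hypothetical names, and the real service is meant to verify the source status in SQL rather than in memory.

```csharp
using System;
using System.Collections.Generic;

Console.WriteLine(Transitions.Check(TaskStatus.Idle, TaskStatus.Queued).Ok);      // True
Console.WriteLine(Transitions.Check(TaskStatus.Done, TaskStatus.Running).Ok);     // False
Console.WriteLine(Transitions.Check(TaskStatus.Done, TaskStatus.Running).Reason); // invalid transition Done→Running

public enum TaskStatus { Idle, Queued, Running, Done, Failed, Cancelled }

public sealed record TransitionResult(bool Ok, string? Reason);

// Hypothetical helper: mirrors the "Allowed transitions" list one-to-one.
public static class Transitions
{
    private static readonly Dictionary<TaskStatus, TaskStatus[]> Allowed = new()
    {
        [TaskStatus.Idle]      = new[] { TaskStatus.Queued, TaskStatus.Running },
        [TaskStatus.Queued]    = new[] { TaskStatus.Running, TaskStatus.Cancelled, TaskStatus.Idle },
        [TaskStatus.Running]   = new[] { TaskStatus.Done, TaskStatus.Failed, TaskStatus.Cancelled },
        [TaskStatus.Done]      = new[] { TaskStatus.Idle },
        [TaskStatus.Failed]    = new[] { TaskStatus.Idle, TaskStatus.Queued },
        [TaskStatus.Cancelled] = new[] { TaskStatus.Idle, TaskStatus.Queued },
    };

    // Returns a result instead of throwing, following the Result pattern described above.
    public static TransitionResult Check(TaskStatus source, TaskStatus target) =>
        Array.IndexOf(Allowed[source], target) >= 0
            ? new TransitionResult(true, null)
            : new TransitionResult(false, $"invalid transition {source}→{target}");
}
```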
Invariants
- Atomic. Each transition is a single `ExecuteUpdate` (or short tx) using `WHERE Status = <expected>` to be TOCTOU-free.
- Validated. Source status is verified at the SQL level, not in C#.
- Side effects (after successful DB write):
  - On any `→ Queued`: `IQueueWaker.Wake()`.
  - On any successful transition: `HubBroadcaster.TaskUpdated(taskId)`.
  - On `Done`/`Failed`/`Cancelled` for a child task: `IPlanningChainCoordinator.OnChildFinishedAsync`, which calls `_state.UnblockAsync(nextChild)` and `TryCompleteParent` if applicable.
- No caller responsibility for side effects. A caller only needs to invoke one method.
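The "compare, then write, then fire side effects" shape can be illustrated without a database. This sketch swaps the SQL `WHERE Status = <expected>` guard for `ConcurrentDictionary.TryUpdate`, which only writes when the current value equals the expected one. `InMemoryStateStore` and its callbacks are hypothetical stand-ins for the repository, `IQueueWaker`, and `HubBroadcaster`.

```csharp
using System;
using System.Collections.Concurrent;

var wakes = 0;
var store = new InMemoryStateStore(onWake: () => wakes++, onUpdated: id => Console.WriteLine($"updated {id}"));
store.Seed("t1", TaskStatus.Idle);

Console.WriteLine(store.TryTransition("t1", TaskStatus.Idle, TaskStatus.Queued).Ok); // True
Console.WriteLine(store.TryTransition("t1", TaskStatus.Idle, TaskStatus.Queued).Ok); // False: row is no longer Idle
Console.WriteLine(wakes);                                                            // 1

public enum TaskStatus { Idle, Queued, Running, Done, Failed, Cancelled }

public sealed record TransitionResult(bool Ok, string? Reason);

// Hypothetical in-memory analog of the state service's write path.
public sealed class InMemoryStateStore
{
    private readonly ConcurrentDictionary<string, TaskStatus> _rows = new();
    private readonly Action _onWake;
    private readonly Action<string> _onUpdated;

    public InMemoryStateStore(Action onWake, Action<string> onUpdated)
    {
        _onWake = onWake;
        _onUpdated = onUpdated;
    }

    public void Seed(string id, TaskStatus status) => _rows[id] = status;

    public TransitionResult TryTransition(string id, TaskStatus expected, TaskStatus next)
    {
        // Compare-and-swap: succeeds only if the stored status still equals `expected`.
        if (!_rows.TryUpdate(id, next, expected))
            return new TransitionResult(false, $"task {id} is not in status {expected}");

        // Side effects run only after the write succeeded; callers never trigger them.
        if (next == TaskStatus.Queued) _onWake();
        _onUpdated(id);
        return new TransitionResult(true, null);
    }
}
```

Because the check and the write are one operation, two concurrent callers racing on the same row cannot both succeed, which is the property the concurrency tests in the test strategy are meant to pin down.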
Caller migration
| Today | New |
|---|---|
| `TaskRunner.MarkRunningAsync` | `_state.StartRunningAsync` |
| `TaskRunner.HandleSuccess` (Mark + chain + parent) | `_state.CompleteAsync` (handles all) |
| `TaskRunner.HandleFailure` | `_state.FailAsync` |
| `StaleTaskRecovery.FlipAllRunningToFailedAsync` | `_state.RecoverStaleRunningAsync("worker restart")` |
| `PlanningChainCoordinator.QueueSubtasksSequentiallyAsync` (direct DbContext) | iterates children, calls `_state.EnqueueAsync` for first, `_state.BlockOnAsync` for rest |
| `TaskRepository.FinalizePlanningAsync` | removed; `PlanningSessionManager` orchestrates via state-service |
| `TaskResetService` (direct DbContext) | `_state.ResetToIdleAsync` (service only owns worktree-cleanup) |
Mark*Async repo helpers stay but become internal — used only by TaskStateService.
3. Queue dispatch & wake mechanics
Four components, each with a clear responsibility.
IQueueWaker
public interface IQueueWaker { void Wake(); }
- Singleton. Backed by today's `SemaphoreSlim`.
- Called automatically by `TaskStateService` after any `→ Queued` transition.
- Manual `WakeQueue()` calls in app code are removed (Hub `WakeQueue` SignalR endpoint stays for diagnostics but maps directly to `IQueueWaker.Wake`).
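A `SemaphoreSlim`-backed waker could look roughly like the sketch below. The class name `SemaphoreQueueWaker`, the `WaitAsync` helper, and the choice of a maximum count of 1 (so repeated wakes coalesce into one pending signal) are assumptions for illustration; the existing semaphore setup in the worker may differ.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var waker = new SemaphoreQueueWaker();
waker.Wake();
waker.Wake(); // coalesced: at most one wake stays pending

// First wait consumes the pending wake; second one falls through to the backstop timeout.
Console.WriteLine(await waker.WaitAsync(TimeSpan.FromMilliseconds(50), CancellationToken.None)); // True
Console.WriteLine(await waker.WaitAsync(TimeSpan.FromMilliseconds(50), CancellationToken.None)); // False

public interface IQueueWaker { void Wake(); }

// Hypothetical implementation: a binary semaphore used as a wake signal.
public sealed class SemaphoreQueueWaker : IQueueWaker
{
    private readonly SemaphoreSlim _signal = new(0, 1);

    public void Wake()
    {
        // Release throws when the semaphore is already at its maximum;
        // a wake that is already pending is sufficient, so the exception is ignored.
        try { _signal.Release(); }
        catch (SemaphoreFullException) { }
    }

    // Returns true when woken, false when the backstop interval elapsed first.
    public Task<bool> WaitAsync(TimeSpan backstop, CancellationToken ct) =>
        _signal.WaitAsync(backstop, ct);
}
```

The boolean result lets the queue loop treat "woken" and "backstop tick" identically: in both cases it simply asks the picker for the next task.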
IQueuePicker
public interface IQueuePicker
{
Task<TaskEntity?> ClaimNextAsync(DateTime now, CancellationToken ct);
}
- The single place where queue selection happens.
- Filter (all required):
  - `Status == Queued`
  - `BlockedByTaskId IS NULL`
  - `(ScheduledFor IS NULL OR ScheduledFor <= :now)`
  - `EXISTS task_tags WHERE name='agent'` OR `EXISTS list_tags WHERE name='agent'`
- Order: `SortOrder ASC, CreatedAt ASC`.
- Atomic claim via `UPDATE … RETURNING` (matching today's pattern), flips `Queued → Running` and writes `StartedAt`.
- Picker is the sole caller of `Queued → Running` transition. `TaskStateService.StartRunningAsync` exists for the override slot path (RunNow / Continue).
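The selection rules can be expressed as a LINQ query over in-memory rows, which is also a convenient shape for the filter and ordering tests. This is only a sketch of the selection step: the `Row` record and `Picker.SelectNext` are hypothetical, `HasAgentTag` collapses the two `EXISTS` checks into one flag, and the atomic `UPDATE … RETURNING` claim is not modelled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var now = new DateTime(2026, 4, 27, 12, 0, 0, DateTimeKind.Utc);
var rows = new List<Row>
{
    new("blocked",  TaskStatus.Queued, "first", null,            true,  1, now.AddMinutes(-5)),
    new("future",   TaskStatus.Queued, null,    now.AddHours(1), true,  2, now.AddMinutes(-4)),
    new("untagged", TaskStatus.Queued, null,    null,            false, 3, now.AddMinutes(-3)),
    new("idle",     TaskStatus.Idle,   null,    null,            true,  4, now.AddMinutes(-2)),
    new("second",   TaskStatus.Queued, null,    null,            true,  6, now.AddMinutes(-1)),
    new("first",    TaskStatus.Queued, null,    null,            true,  5, now),
};

Console.WriteLine(Picker.SelectNext(rows, now)?.Id); // first

public enum TaskStatus { Idle, Queued, Running, Done, Failed, Cancelled }

// Hypothetical projection of the columns the picker filter looks at.
public sealed record Row(string Id, TaskStatus Status, string? BlockedByTaskId,
    DateTime? ScheduledFor, bool HasAgentTag, int SortOrder, DateTime CreatedAt);

public static class Picker
{
    public static Row? SelectNext(IEnumerable<Row> rows, DateTime now) =>
        rows.Where(r => r.Status == TaskStatus.Queued)                        // lifecycle
            .Where(r => r.BlockedByTaskId is null)                            // chain blocking
            .Where(r => r.ScheduledFor is null || r.ScheduledFor <= now)      // schedule
            .Where(r => r.HasAgentTag)                                        // task tag OR list tag
            .OrderBy(r => r.SortOrder)
            .ThenBy(r => r.CreatedAt)
            .FirstOrDefault();
}
```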
QueueService (BackgroundService) — slimmer
- Wait on wake-signal or backstop timer.
- Call `_picker.ClaimNextAsync`.
- If task: occupy queue slot, run via `_runner.RunAsync`, in `ContinueWith` invoke `_waker.Wake()` for the next pickup.
- No DbContext. No status mutation. No DTO knowledge.
OverrideSlotService (new)
- Owns `RunNow` and `ContinueTask` (today both in `QueueService`).
- Holds the override slot state.
- Status mutations go through `TaskStateService.StartRunningAsync` (non-atomic claim; caller-driven, fine because override is user-initiated and serialized by slot lock).
4. Planning chain integration
Single flow, replaces both FinalizePlanningAsync (Repo) and QueueSubtasksSequentiallyAsync (Coordinator).
- `PlanningSessionManager.StartAsync(parentId)` → `_state.StartPlanningAsync` → parent `PlanningPhase=Active`.
- User edits children in MCP tool. Children are in `Status=Idle`.
- `PlanningSessionManager.FinalizeAsync(parentId)`:
  - `_state.FinalizePlanningAsync(parentId)` → parent `PlanningPhase=Finalized, Status=Idle`.
  - `_chainCoordinator.SetupChainAsync(parentId)`:
    - Attaches `agent` tag to all children (automatic, confirmed in brainstorming).
    - `_state.EnqueueAsync(children[0])` → wake fires.
    - `_state.BlockOnAsync(children[i], children[i-1])` for `i ≥ 1`.
- When a child finishes, `TaskRunner.HandleSuccess` calls `_state.CompleteAsync(child)`. State-service internally invokes `_chainCoordinator.OnChildFinishedAsync`, which calls `_state.UnblockAsync(nextChild)` (wake fires). `UnblockAsync` clears the predecessor block by resetting `BlockedByTaskId` to NULL, the same effect `ON DELETE SET NULL` has when a predecessor is deleted.
- When all children are terminal: `_state` runs `TryCompleteParent` and sets parent `Done`/`Failed` based on aggregate.
TaskRepository.FinalizePlanningAsync is deleted. QueueSubtasksSequentiallyAsync is renamed to SetupChainAsync and made internal to the coordinator (called only from PlanningSessionManager.FinalizeAsync).
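The chain layout and the unblock step can be sketched in memory as follows. `Chain`, `ChainNode`, `OnChildDone`, and `Pickable` are hypothetical names; the sketch assumes `BlockOnAsync` leaves a child `Queued` with `BlockedByTaskId` set (as the test strategy's expected layout suggests) and covers only the success path, not failure or parent completion.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var chain = new Chain(new[] { "c1", "c2", "c3" });
Console.WriteLine(string.Join(",", chain.Pickable())); // c1
chain.OnChildDone("c1");
Console.WriteLine(string.Join(",", chain.Pickable())); // c2

public enum TaskStatus { Idle, Queued, Running, Done, Failed, Cancelled }

public sealed class ChainNode
{
    public string Id { get; }
    public TaskStatus Status { get; set; } = TaskStatus.Idle;
    public string? BlockedByTaskId { get; set; }
    public ChainNode(string id) => Id = id;
}

// Hypothetical in-memory model of SetupChainAsync + OnChildFinishedAsync.
public sealed class Chain
{
    private readonly List<ChainNode> _children;

    public Chain(IEnumerable<string> ids)
    {
        _children = ids.Select(id => new ChainNode(id)).ToList();
        for (var i = 0; i < _children.Count; i++)
        {
            // First child: plain Queued. Every later child: Queued but blocked by its predecessor.
            _children[i].Status = TaskStatus.Queued;
            _children[i].BlockedByTaskId = i == 0 ? null : _children[i - 1].Id;
        }
    }

    public void OnChildDone(string id)
    {
        _children.Single(c => c.Id == id).Status = TaskStatus.Done;
        // Stand-in for UnblockAsync: clear the block on whichever child waited on `id`.
        foreach (var next in _children.Where(c => c.BlockedByTaskId == id))
            next.BlockedByTaskId = null;
    }

    // What the queue picker would consider eligible (tags and schedule ignored).
    public IEnumerable<string> Pickable() =>
        _children.Where(c => c.Status == TaskStatus.Queued && c.BlockedByTaskId is null)
                 .Select(c => c.Id);
}
```

At any point at most one child of the chain is pickable, which is what makes the flow sequential without a dedicated `Waiting` status.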
5. Worker/Services/ reorganization
Worker/
  State/
    ITaskStateService.cs
    TaskStateService.cs
    TransitionResult.cs
  Queue/
    IQueueWaker.cs
    IQueuePicker.cs
    QueuePicker.cs
    QueueService.cs (BackgroundService, slimmer)
    OverrideSlotService.cs
    QueueSlotState.cs
  Lifecycle/
    StaleTaskRecovery.cs
    TaskResetService.cs
    TaskMergeService.cs
  Worktrees/
    WorktreeMaintenanceService.cs
  Agents/
    AgentFileService.cs
    DefaultAgentSeeder.cs
  Runner/ (unchanged)
  Planning/ (ChainCoordinator simplified)
  External/ (unchanged)
  Hub/ (unchanged)
WorkerHub calls fewer services — typically _state.X plus a domain service for non-status work (Merge, Worktree-Cleanup).
6. EF migration
ALTER TABLE tasks ADD COLUMN planning_phase INTEGER NOT NULL DEFAULT 0;
ALTER TABLE tasks ADD COLUMN blocked_by_task_id TEXT NULL REFERENCES tasks(id) ON DELETE SET NULL;
CREATE INDEX ix_tasks_blocked_by ON tasks(blocked_by_task_id);
UPDATE tasks SET status='idle' WHERE status='manual';
UPDATE tasks SET status='idle' WHERE status='draft';
UPDATE tasks SET status='idle', planning_phase=1 WHERE status='planning';
UPDATE tasks SET status='idle', planning_phase=2 WHERE status='planned';
Waiting migration uses a CTE with LAG() to derive BlockedByTaskId from (parent_task_id, sort_order):
WITH ordered AS (
SELECT id,
LAG(id) OVER (PARTITION BY parent_task_id ORDER BY sort_order, created_at) AS prev_id
FROM tasks WHERE status='waiting'
)
UPDATE tasks SET status='queued',
blocked_by_task_id=(SELECT prev_id FROM ordered WHERE ordered.id=tasks.id)
WHERE id IN (SELECT id FROM ordered);
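For readers less familiar with window functions, the pairing that `LAG()` produces can be reproduced in plain C#. This is an illustrative equivalent only, useful for building expected values in migration tests; `Row` and `Migration.DeriveBlockedBy` are hypothetical, and the `created_at` tie-breaker is left out for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var tasks = new List<Row>
{
    new("c1", "p1", 1, "running"),
    new("c2", "p1", 2, "waiting"),
    new("c3", "p1", 3, "waiting"),
    new("d1", "p2", 1, "waiting"),
};

foreach (var (id, prev) in Migration.DeriveBlockedBy(tasks))
    Console.WriteLine($"{id} <- {prev ?? "NULL"}");
// c2 <- NULL
// c3 <- c2
// d1 <- NULL

public sealed record Row(string Id, string ParentTaskId, int SortOrder, string Status);

public static class Migration
{
    // Same shape as the CTE: filter to 'waiting' first, then pair each row
    // with the previous row inside its parent's partition.
    public static IEnumerable<(string Id, string? PrevId)> DeriveBlockedBy(IEnumerable<Row> tasks) =>
        tasks.Where(t => t.Status == "waiting")
             .GroupBy(t => t.ParentTaskId)
             .SelectMany(g =>
             {
                 var ordered = g.OrderBy(t => t.SortOrder).ToList();
                 return ordered.Select((t, i) =>
                     (Id: t.Id, PrevId: i == 0 ? (string?)null : ordered[i - 1].Id));
             });
}
```

One thing the sketch makes visible: because the filter on `status='waiting'` is applied before `LAG()` runs, the first Waiting row of each parent has no predecessor inside the window and receives NULL, so it migrates as an unblocked `Queued` task. Whether that matches the intended chain semantics depends on the state of the non-waiting predecessor and may be worth checking against the pre-migration fixture DB.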
Migration runs at worker startup via the existing MigrateAsync flow.
Down() is best-effort (local-only app). Reverse mapping is lossy: Cancelled → Failed, BlockedByTaskId → Waiting, planning fields → folded back into status.
7. Test strategy
New test fixtures (xUnit, real SQLite, real git where needed):
- `TaskStateServiceTests`: happy path + reject for every transition; mock `IQueueWaker`, `HubBroadcaster`, `IPlanningChainCoordinator` and verify side-effect invocations; concurrency test (two parallel `StartRunningAsync` → exactly one wins).
- `QueuePickerTests`: filter logic (blocked, missing tag, future schedule, wrong status) and ordering (sort_order, created_at); two parallel pickers → exactly one claims a row.
- `PlanningChainCoordinatorTests`: `SetupChainAsync` produces correct (`Queued`, `BlockedBy`) layout; `OnChildFinishedAsync` unblocks the next child; child failure leaves remaining blocked, parent transitions to `Failed` after `TryCompleteParent`.
- `PlanningEndToEndTests`: regression for the original bug. `Active` parent + 3 drafts → `Finalize` → assert first child reaches `Running` within 200 ms with no manual `Wake`.
- Existing tests: anything seeding `task.Status = TaskStatus.Manual` or similar gets updated to new enum values or routed through `_state`.
Coverage target: state machine + queue picker at ≥90% branch coverage. Existing coverage levels preserved elsewhere.
8. Implementation slices
Each slice is one PR with green tests before the next starts.
- Slice 1: Status model + migration. New enum values, new columns, EF migration. Existing code mapped to new values mechanically (no behavior change).
- Slice 2: `TaskStateService`. Service + interface + tests. Migrate TaskRunner, StaleTaskRecovery, ExternalMcp/PlanningMcp guards, TaskResetService. Mark `Mark*Async` repo helpers `internal`.
- Slice 3: `IQueueWaker` + `IQueuePicker`. Extract from QueueService and Repo. Remove all manual `WakeQueue()` calls in app code.
- Slice 4: Planning flow consolidation. Delete `FinalizePlanningAsync` from repo. `PlanningSessionManager.FinalizeAsync` orchestrates via state-service + ChainCoordinator. Rename `QueueSubtasksSequentiallyAsync` → `SetupChainAsync` (internal). E2E test green.
- Slice 5: `OverrideSlotService` + folder reorg. Extract RunNow / ContinueTask. Move files to new folder structure. Update DI registration.
- Slice 6: Cleanup & docs. Update `Worker/CLAUDE.md`, `docs/plan.md`. Remove dead helpers.
Risks & Mitigations
- EF migration on existing DBs. Tested via integration tests that load a pre-migration fixture DB. `MigrateAsync` is already in production use, low risk.
- State-service becomes a god-object. Mitigated by keeping it narrow: only status/phase/blocked-by writes, no business logic. Worktree, merge, and runner concerns stay in their own services.
- Two paths to `Running` (picker atomic, state-service for override). Confirmed acceptable in brainstorming. Picker remains the only atomic-claim path; override slot is serialized by slot lock so non-atomic is safe.
- Waiting-migration CTE. SQLite supports `LAG()` since 3.25. .NET 8's bundled SQLite is well above. Tested in migration unit tests.
Open Questions
None at design time. All sticking points were resolved during brainstorming.