Files
ClaudeDo/docs/superpowers/specs/2026-04-27-worker-state-and-queue-consolidation-design.md
Mika Kuns 43af17e546 docs(superpowers): add worker state and queue consolidation spec
Approved design for centralizing task status mutations in a TaskStateService,
splitting TaskStatus into orthogonal lifecycle/planning/blocking fields, and
making queue wakes automatic. Sets up the 6-slice refactor of Worker/Services.
2026-04-27 10:16:55 +02:00

15 KiB

Worker State & Queue Consolidation — Design

Date: 2026-04-27 Status: Approved (brainstorming) Scope: ClaudeDo.Worker + ClaudeDo.Data (TaskEntity, TaskRepository), EF migration

Problem

The worker layer has accumulated structural problems that culminate in a concrete bug — the queue does not pick up tasks created by a planning session.

Concrete bug

TaskRepository.FinalizePlanningAsync(parentId, queueAgentTasks=true) only flips a draft child to Queued if the child or its list carries the agent tag:

var shouldQueue = queueAgentTasks && (childHasAgentTag || listHasAgentTag);

When neither carries the tag, the child silently becomes Manual — the queue ignores it. There is no UI feedback. Users observe "queue never picks up planning tasks".

Underlying design issues

  1. Status enum mixes orthogonal concerns. Today's TaskStatus carries 10 values: lifecycle (Manual, Queued, Running, Done, Failed), planning hierarchy (Planning, Planned), chain ordering (Waiting), and an unclear Draft. Every consumer has to know which subset applies in which context.
  2. Status writes are scattered. TaskRunner, StaleTaskRecovery, PlanningChainCoordinator, FinalizePlanningAsync, TaskResetService, ExternalMcpService, and PlanningMcpService all mutate Status directly. Some go through TaskRepository.Mark*Async helpers, some do task.Status = … straight on the DbContext (PlanningChainCoordinator).
  3. Guards are duplicated. if (Status == Running) throw … appears in at least four places (delete, retag, merge, reset).
  4. Two competing planning flows. FinalizePlanningAsync (parallel queueing in Repo) and PlanningChainCoordinator.QueueSubtasksSequentiallyAsync (sequential chain) make incompatible assumptions about child status.
  5. WakeQueue() is manual. Multiple callers must remember to invoke it after any DB mutation that creates a Queued task. QueueSubtasksSequentiallyAsync forgets to. The queue only picks up after a backstop tick.
  6. Worker/Services/ is a grab-bag. Queue, lifecycle, merge, worktree maintenance, agent files, and recovery sit side-by-side without domain boundaries.

Goals

  • One source of truth for status mutations: TaskStateService.
  • Status enum reflects only lifecycle. Planning state and chain blocking are separate fields.
  • Wake-queue side effects are automatic, not caller-driven.
  • Planning finalization has exactly one path.
  • Worker/Services/ is split into domain folders.

Non-Goals

  • No change to UI status-rendering logic beyond adapting to renamed values.
  • No change to SignalR/MCP wire formats beyond the necessary status-string updates.
  • No change to git/worktree behavior.

Design

1. Status model reform

Replace today's single TaskStatus with three orthogonal fields on TaskEntity.

TaskStatus (lifecycle only) — 6 values

Value Meaning
Idle not in queue, not active. Replaces today's Manual and Draft.
Queued waiting for queue pickup.
Running currently executing.
Done finished successfully.
Failed finished with error.
Cancelled aborted by user (today conflated with Failed).

PlanningPhase (parent-only, new column) — 3 values

Value Meaning
None no planning session. Default for all tasks.
Active planning session is running. Replaces Status=Planning.
Finalized plan is committed, children exist. Replaces Status=Planned.

A parent task can now be Status=Idle, PlanningPhase=Finalized simultaneously, enabling re-runs of finalized plans without losing planning metadata.

BlockedByTaskId (nullable FK, new column) — replaces Waiting

  • Today: Status=Waiting means "waiting on a predecessor in the chain".
  • New: Status=Queued AND BlockedByTaskId=<predecessor>. Picker filters out any row with BlockedByTaskId IS NOT NULL.
  • ON DELETE SET NULL — if predecessor is deleted, child becomes pickable.

2. TaskStateService (centralized state machine)

The only component that writes Status, PlanningPhase, BlockedByTaskId. All other code goes through it.

public interface ITaskStateService
{
    Task<TransitionResult> EnqueueAsync(string taskId, CancellationToken ct);
    Task<TransitionResult> StartRunningAsync(string taskId, DateTime startedAt, CancellationToken ct);
    Task<TransitionResult> CompleteAsync(string taskId, DateTime finishedAt, string? result, CancellationToken ct);
    Task<TransitionResult> FailAsync(string taskId, DateTime finishedAt, string? error, CancellationToken ct);
    Task<TransitionResult> CancelAsync(string taskId, DateTime finishedAt, CancellationToken ct);
    Task<TransitionResult> ResetToIdleAsync(string taskId, CancellationToken ct);

    Task<TransitionResult> StartPlanningAsync(string parentId, CancellationToken ct);
    Task<TransitionResult> FinalizePlanningAsync(string parentId, CancellationToken ct);

    Task<TransitionResult> BlockOnAsync(string taskId, string predecessorTaskId, CancellationToken ct);
    Task<TransitionResult> UnblockAsync(string taskId, CancellationToken ct);

    Task<int> RecoverStaleRunningAsync(string reason, CancellationToken ct);
}

public sealed record TransitionResult(bool Ok, string? Reason);

Allowed transitions

Idle      → Queued | Running (RunNow)
Queued    → Running | Cancelled | Idle (ResetToIdle)
Running   → Done | Failed | Cancelled
Done      → Idle (ResetToIdle, for re-run)
Failed    → Idle | Queued (re-queue)
Cancelled → Idle | Queued

Anything else returns TransitionResult(false, "invalid transition X→Y"). No exceptions for invalid transitions — Result pattern keeps callers tolerant.

Invariants

  1. Atomic. Each transition is a single ExecuteUpdate (or short tx) using WHERE Status = <expected> to be TOCTOU-free.
  2. Validated. Source status is verified at the SQL level, not in C#.
  3. Side effects (after successful DB write):
    • On any → Queued: IQueueWaker.Wake().
    • On any successful transition: HubBroadcaster.TaskUpdated(taskId).
    • On Done/Failed/Cancelled for a child task: IPlanningChainCoordinator.OnChildFinishedAsync, which calls _state.UnblockAsync(nextChild) and TryCompleteParent if applicable.
  4. No caller responsibility for side effects. A caller only needs to invoke one method.

Caller migration

Today New
TaskRunner.MarkRunningAsync _state.StartRunningAsync
TaskRunner.HandleSuccess (Mark + chain + parent) _state.CompleteAsync (handles all)
TaskRunner.HandleFailure _state.FailAsync
StaleTaskRecovery.FlipAllRunningToFailedAsync _state.RecoverStaleRunningAsync("worker restart")
PlanningChainCoordinator.QueueSubtasksSequentiallyAsync (direct DbContext) iterates children, calls _state.EnqueueAsync for first, _state.BlockOnAsync for rest
TaskRepository.FinalizePlanningAsync removed; PlanningSessionManager orchestrates via state-service
TaskResetService (direct DbContext) _state.ResetToIdleAsync (service only owns worktree-cleanup)

Mark*Async repo helpers stay but become internal — used only by TaskStateService.

3. Queue dispatch & wake mechanics

Three classes, clear responsibilities.

IQueueWaker

public interface IQueueWaker { void Wake(); }
  • Singleton. Backed by today's SemaphoreSlim.
  • Called automatically by TaskStateService after any → Queued transition.
  • Manual WakeQueue() calls in app code are removed (Hub WakeQueue SignalR endpoint stays for diagnostics but maps directly to IQueueWaker.Wake).

IQueuePicker

public interface IQueuePicker
{
    Task<TaskEntity?> ClaimNextAsync(DateTime now, CancellationToken ct);
}
  • The single place where queue selection happens.
  • Filter (all required):
    • Status == Queued
    • BlockedByTaskId IS NULL
    • (ScheduledFor IS NULL OR ScheduledFor <= :now)
    • EXISTS task_tags WHERE name='agent' OR EXISTS list_tags WHERE name='agent'
  • Order: SortOrder ASC, CreatedAt ASC.
  • Atomic claim via UPDATE … RETURNING (matching today's pattern), flips Queued → Running and writes StartedAt.
  • Picker is the sole caller of Queued → Running transition. TaskStateService.StartRunningAsync exists for the override slot path (RunNow / Continue).

QueueService (BackgroundService) — slimmer

  • Wait on wake-signal or backstop timer.
  • Call _picker.ClaimNextAsync.
  • If task: occupy queue slot, run via _runner.RunAsync, in ContinueWith invoke _waker.Wake() for the next pickup.
  • No DbContext. No status mutation. No DTO knowledge.

OverrideSlotService (new)

  • Owns RunNow and ContinueTask (today both in QueueService).
  • Holds the override slot state.
  • Status mutations go through TaskStateService.StartRunningAsync (non-atomic claim — caller-driven, fine because override is user-initiated and serialized by slot lock).

4. Planning chain integration

Single flow, replaces both FinalizePlanningAsync (Repo) and QueueSubtasksSequentiallyAsync (Coordinator).

  1. PlanningSessionManager.StartAsync(parentId)_state.StartPlanningAsync → parent PlanningPhase=Active.
  2. User edits children in MCP tool. Children are in Status=Idle.
  3. PlanningSessionManager.FinalizeAsync(parentId):
    • _state.FinalizePlanningAsync(parentId) → parent PlanningPhase=Finalized, Status=Idle.
    • _chainCoordinator.SetupChainAsync(parentId):
      • Attaches agent tag to all children (automatic — confirmed in brainstorming).
      • _state.EnqueueAsync(children[0]) → wake fires.
      • _state.BlockOnAsync(children[i], children[i-1]) for i ≥ 1.
  4. When a child finishes, TaskRunner.HandleSuccess calls _state.CompleteAsync(child). State-service internally invokes _chainCoordinator.OnChildFinishedAsync, which calls _state.UnblockAsync(nextChild) (wake fires). Predecessor block goes away because of ON DELETE SET NULL-style logic in UnblockAsync.
  5. When all children are terminal: _state runs TryCompleteParent and sets parent Done/Failed based on aggregate.

TaskRepository.FinalizePlanningAsync is deleted. QueueSubtasksSequentiallyAsync is renamed to SetupChainAsync and made internal to the coordinator (called only from PlanningSessionManager.FinalizeAsync).

5. Worker/Services/ reorganization

Worker/
  State/
    ITaskStateService.cs
    TaskStateService.cs
    TransitionResult.cs
  Queue/
    IQueueWaker.cs
    IQueuePicker.cs
    QueuePicker.cs
    QueueService.cs              (BackgroundService, slimmer)
    OverrideSlotService.cs
    QueueSlotState.cs
  Lifecycle/
    StaleTaskRecovery.cs
    TaskResetService.cs
    TaskMergeService.cs
  Worktrees/
    WorktreeMaintenanceService.cs
  Agents/
    AgentFileService.cs
    DefaultAgentSeeder.cs
  Runner/                        (unchanged)
  Planning/                      (ChainCoordinator simplified)
  External/                      (unchanged)
  Hub/                           (unchanged)

WorkerHub calls fewer services — typically _state.X plus a domain service for non-status work (Merge, Worktree-Cleanup).

6. EF migration

ALTER TABLE tasks ADD COLUMN planning_phase INTEGER NOT NULL DEFAULT 0;
ALTER TABLE tasks ADD COLUMN blocked_by_task_id TEXT NULL REFERENCES tasks(id) ON DELETE SET NULL;
CREATE INDEX ix_tasks_blocked_by ON tasks(blocked_by_task_id);

UPDATE tasks SET status='idle' WHERE status='manual';
UPDATE tasks SET status='idle' WHERE status='draft';
UPDATE tasks SET status='idle', planning_phase=1 WHERE status='planning';
UPDATE tasks SET status='idle', planning_phase=2 WHERE status='planned';

Waiting migration uses a CTE with LAG() to derive BlockedByTaskId from (parent_task_id, sort_order):

WITH ordered AS (
  SELECT id,
         LAG(id) OVER (PARTITION BY parent_task_id ORDER BY sort_order, created_at) AS prev_id
  FROM tasks WHERE status='waiting'
)
UPDATE tasks SET status='queued',
                 blocked_by_task_id=(SELECT prev_id FROM ordered WHERE ordered.id=tasks.id)
WHERE id IN (SELECT id FROM ordered);

Migration runs at worker startup via the existing MigrateAsync flow.

Down() is best-effort (local-only app). Reverse mapping is lossy: CancelledFailed, BlockedByTaskIdWaiting, planning fields → folded back into status.

7. Test strategy

New test fixtures (xUnit, real SQLite, real git where needed):

  1. TaskStateServiceTests — happy path + reject for every transition; mock IQueueWaker, HubBroadcaster, IPlanningChainCoordinator and verify side-effect invocations; concurrency test (two parallel StartRunningAsync → exactly one wins).
  2. QueuePickerTests — filter logic (blocked, missing tag, future schedule, wrong status) and ordering (sort_order, created_at); two parallel pickers → exactly one claims a row.
  3. PlanningChainCoordinatorTestsSetupChainAsync produces correct (Queued, BlockedBy) layout; OnChildFinishedAsync unblocks the next child; child failure leaves remaining blocked, parent transitions to Failed after TryCompleteParent.
  4. PlanningEndToEndTests — regression for the original bug. Active parent + 3 drafts → Finalize → assert first child reaches Running within 200 ms with no manual Wake.
  5. Existing tests — anything seeding task.Status = TaskStatus.Manual or similar gets updated to new enum values or routed through _state.

Coverage target: state machine + queue picker at ≥90% branch coverage. Existing coverage levels preserved elsewhere.

8. Implementation slices

Each slice is one PR with green tests before the next starts.

  1. Slice 1 — Status model + migration. New enum values, new columns, EF migration. Existing code mapped to new values mechanically (no behavior change).
  2. Slice 2 — TaskStateService. Service + interface + tests. Migrate TaskRunner, StaleTaskRecovery, ExternalMcp/PlanningMcp guards, TaskResetService. Mark Mark*Async repo helpers internal.
  3. Slice 3 — IQueueWaker + IQueuePicker. Extract from QueueService and Repo. Remove all manual WakeQueue() calls in app code.
  4. Slice 4 — Planning flow consolidation. Delete FinalizePlanningAsync from repo. PlanningSessionManager.FinalizeAsync orchestrates via state-service + ChainCoordinator. Rename QueueSubtasksSequentiallyAsyncSetupChainAsync (internal). E2E test green.
  5. Slice 5 — OverrideSlotService + folder reorg. Extract RunNow / ContinueTask. Move files to new folder structure. Update DI registration.
  6. Slice 6 — Cleanup & docs. Update Worker/CLAUDE.md, docs/plan.md. Remove dead helpers.

Risks & Mitigations

  • EF migration on existing DBs. Tested via integration tests that load a pre-migration fixture DB. MigrateAsync is already in production use, low risk.
  • State-service becomes a god-object. Mitigated by keeping it narrow: only status/phase/blocked-by writes, no business logic. Worktree, merge, and runner concerns stay in their own services.
  • Two paths to Running (picker atomic, state-service for override). Confirmed acceptable in brainstorming. Picker remains the only atomic-claim path; override slot is serialized by slot lock so non-atomic is safe.
  • Waiting-migration CTE. SQLite supports LAG() since 3.25. .NET 8's bundled SQLite is well above. Tested in migration unit tests.

Open Questions

None at design time. All knackpunkte resolved during brainstorming.