docs(superpowers): add worker state and queue consolidation spec
Approved design for centralizing task status mutations in a TaskStateService, splitting TaskStatus into orthogonal lifecycle/planning/blocking fields, and making queue wakes automatic. Sets up the 6-slice refactor of Worker/Services.
This commit is contained in:
@@ -0,0 +1,297 @@
|
|||||||
|
# Worker State & Queue Consolidation — Design
|
||||||
|
|
||||||
|
**Date:** 2026-04-27
|
||||||
|
**Status:** Approved (brainstorming)
|
||||||
|
**Scope:** `ClaudeDo.Worker` + `ClaudeDo.Data` (TaskEntity, TaskRepository), EF migration
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
The worker layer has accumulated structural problems that culminate in a concrete bug — the queue does not pick up tasks created by a planning session.
|
||||||
|
|
||||||
|
### Concrete bug
|
||||||
|
|
||||||
|
`TaskRepository.FinalizePlanningAsync(parentId, queueAgentTasks=true)` only flips a draft child to `Queued` if the child *or* its list carries the `agent` tag:
|
||||||
|
|
||||||
|
```csharp
|
||||||
|
var shouldQueue = queueAgentTasks && (childHasAgentTag || listHasAgentTag);
|
||||||
|
```
|
||||||
|
|
||||||
|
When neither carries the tag, the child silently becomes `Manual` — the queue ignores it. There is no UI feedback. Users observe "queue never picks up planning tasks".
|
||||||
|
|
||||||
|
### Underlying design issues
|
||||||
|
|
||||||
|
1. **Status enum mixes orthogonal concerns.** Today's `TaskStatus` carries 10 values: lifecycle (`Manual, Queued, Running, Done, Failed`), planning hierarchy (`Planning, Planned`), chain ordering (`Waiting`), and an unclear `Draft`. Every consumer has to know which subset applies in which context.
|
||||||
|
2. **Status writes are scattered.** TaskRunner, StaleTaskRecovery, PlanningChainCoordinator, FinalizePlanningAsync, TaskResetService, ExternalMcpService, and PlanningMcpService all mutate `Status` directly. Some go through `TaskRepository.Mark*Async` helpers, some do `task.Status = …` straight on the DbContext (PlanningChainCoordinator).
|
||||||
|
3. **Guards are duplicated.** `if (Status == Running) throw …` appears in at least four places (delete, retag, merge, reset).
|
||||||
|
4. **Two competing planning flows.** `FinalizePlanningAsync` (parallel queueing in Repo) and `PlanningChainCoordinator.QueueSubtasksSequentiallyAsync` (sequential chain) make incompatible assumptions about child status.
|
||||||
|
5. **`WakeQueue()` is manual.** Multiple callers must remember to invoke it after any DB mutation that creates a `Queued` task. `QueueSubtasksSequentiallyAsync` forgets to. The queue only picks up after a backstop tick.
|
||||||
|
6. **`Worker/Services/` is a grab-bag.** Queue, lifecycle, merge, worktree maintenance, agent files, and recovery sit side-by-side without domain boundaries.
|
||||||
|
|
||||||
|
## Goals
|
||||||
|
|
||||||
|
- One source of truth for status mutations: `TaskStateService`.
|
||||||
|
- Status enum reflects only lifecycle. Planning state and chain blocking are separate fields.
|
||||||
|
- Wake-queue side effects are automatic, not caller-driven.
|
||||||
|
- Planning finalization has exactly one path.
|
||||||
|
- `Worker/Services/` is split into domain folders.
|
||||||
|
|
||||||
|
## Non-Goals
|
||||||
|
|
||||||
|
- No change to UI status-rendering logic beyond adapting to renamed values.
|
||||||
|
- No change to SignalR/MCP wire formats beyond the necessary status-string updates.
|
||||||
|
- No change to git/worktree behavior.
|
||||||
|
|
||||||
|
## Design
|
||||||
|
|
||||||
|
### 1. Status model reform
|
||||||
|
|
||||||
|
Replace today's single `TaskStatus` with three orthogonal fields on `TaskEntity`.
|
||||||
|
|
||||||
|
#### `TaskStatus` (lifecycle only) — 6 values
|
||||||
|
|
||||||
|
| Value | Meaning |
|
||||||
|
|---|---|
|
||||||
|
| `Idle` | not in queue, not active. Replaces today's `Manual` and `Draft`. |
|
||||||
|
| `Queued` | waiting for queue pickup. |
|
||||||
|
| `Running` | currently executing. |
|
||||||
|
| `Done` | finished successfully. |
|
||||||
|
| `Failed` | finished with error. |
|
||||||
|
| `Cancelled` | aborted by user (today conflated with `Failed`). |
|
||||||
|
|
||||||
|
#### `PlanningPhase` (parent-only, new column) — 3 values
|
||||||
|
|
||||||
|
| Value | Meaning |
|
||||||
|
|---|---|
|
||||||
|
| `None` | no planning session. Default for all tasks. |
|
||||||
|
| `Active` | planning session is running. Replaces `Status=Planning`. |
|
||||||
|
| `Finalized` | plan is committed, children exist. Replaces `Status=Planned`. |
|
||||||
|
|
||||||
|
A parent task can now be `Status=Idle, PlanningPhase=Finalized` simultaneously, enabling re-runs of finalized plans without losing planning metadata.
|
||||||
|
|
||||||
|
#### `BlockedByTaskId` (nullable FK, new column) — replaces `Waiting`
|
||||||
|
|
||||||
|
- Today: `Status=Waiting` means "waiting on a predecessor in the chain".
|
||||||
|
- New: `Status=Queued` AND `BlockedByTaskId=<predecessor>`. Picker filters out any row with `BlockedByTaskId IS NOT NULL`.
|
||||||
|
- `ON DELETE SET NULL` — if predecessor is deleted, child becomes pickable.
|
||||||
|
|
||||||
|
### 2. `TaskStateService` (centralized state machine)
|
||||||
|
|
||||||
|
The only component that writes `Status`, `PlanningPhase`, `BlockedByTaskId`. All other code goes through it.
|
||||||
|
|
||||||
|
```csharp
|
||||||
|
public interface ITaskStateService
|
||||||
|
{
|
||||||
|
Task<TransitionResult> EnqueueAsync(string taskId, CancellationToken ct);
|
||||||
|
Task<TransitionResult> StartRunningAsync(string taskId, DateTime startedAt, CancellationToken ct);
|
||||||
|
Task<TransitionResult> CompleteAsync(string taskId, DateTime finishedAt, string? result, CancellationToken ct);
|
||||||
|
Task<TransitionResult> FailAsync(string taskId, DateTime finishedAt, string? error, CancellationToken ct);
|
||||||
|
Task<TransitionResult> CancelAsync(string taskId, DateTime finishedAt, CancellationToken ct);
|
||||||
|
Task<TransitionResult> ResetToIdleAsync(string taskId, CancellationToken ct);
|
||||||
|
|
||||||
|
Task<TransitionResult> StartPlanningAsync(string parentId, CancellationToken ct);
|
||||||
|
Task<TransitionResult> FinalizePlanningAsync(string parentId, CancellationToken ct);
|
||||||
|
|
||||||
|
Task<TransitionResult> BlockOnAsync(string taskId, string predecessorTaskId, CancellationToken ct);
|
||||||
|
Task<TransitionResult> UnblockAsync(string taskId, CancellationToken ct);
|
||||||
|
|
||||||
|
Task<int> RecoverStaleRunningAsync(string reason, CancellationToken ct);
|
||||||
|
}
|
||||||
|
|
||||||
|
public sealed record TransitionResult(bool Ok, string? Reason);
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Allowed transitions
|
||||||
|
|
||||||
|
```
|
||||||
|
Idle → Queued | Running (RunNow)
|
||||||
|
Queued → Running | Cancelled | Idle (ResetToIdle)
|
||||||
|
Running → Done | Failed | Cancelled
|
||||||
|
Done → Idle (ResetToIdle, for re-run)
|
||||||
|
Failed → Idle | Queued (re-queue)
|
||||||
|
Cancelled → Idle | Queued
|
||||||
|
```
|
||||||
|
|
||||||
|
Anything else returns `TransitionResult(false, "invalid transition X→Y")`. No exceptions for invalid transitions — Result pattern keeps callers tolerant.
|
||||||
|
|
||||||
|
#### Invariants
|
||||||
|
|
||||||
|
1. **Atomic.** Each transition is a single `ExecuteUpdate` (or short tx) using `WHERE Status = <expected>` to be TOCTOU-free.
|
||||||
|
2. **Validated.** Source status is verified at the SQL level, not in C#.
|
||||||
|
3. **Side effects (after successful DB write):**
|
||||||
|
- On any `→ Queued`: `IQueueWaker.Wake()`.
|
||||||
|
- On any successful transition: `HubBroadcaster.TaskUpdated(taskId)`.
|
||||||
|
- On `Done`/`Failed`/`Cancelled` for a child task: `IPlanningChainCoordinator.OnChildFinishedAsync`, which calls `_state.UnblockAsync(nextChild)` and `TryCompleteParent` if applicable.
|
||||||
|
4. **No caller responsibility for side effects.** A caller only needs to invoke one method.
|
||||||
|
|
||||||
|
#### Caller migration
|
||||||
|
|
||||||
|
| Today | New |
|
||||||
|
|---|---|
|
||||||
|
| `TaskRunner.MarkRunningAsync` | `_state.StartRunningAsync` |
|
||||||
|
| `TaskRunner.HandleSuccess` (Mark + chain + parent) | `_state.CompleteAsync` (handles all) |
|
||||||
|
| `TaskRunner.HandleFailure` | `_state.FailAsync` |
|
||||||
|
| `StaleTaskRecovery.FlipAllRunningToFailedAsync` | `_state.RecoverStaleRunningAsync("worker restart")` |
|
||||||
|
| `PlanningChainCoordinator.QueueSubtasksSequentiallyAsync` (direct DbContext) | iterates children, calls `_state.EnqueueAsync` for first, `_state.BlockOnAsync` for rest |
|
||||||
|
| `TaskRepository.FinalizePlanningAsync` | **removed**; `PlanningSessionManager` orchestrates via state-service |
|
||||||
|
| `TaskResetService` (direct DbContext) | `_state.ResetToIdleAsync` (service only owns worktree-cleanup) |
|
||||||
|
|
||||||
|
`Mark*Async` repo helpers stay but become `internal` — used only by `TaskStateService`.
|
||||||
|
|
||||||
|
### 3. Queue dispatch & wake mechanics
|
||||||
|
|
||||||
|
Three classes, clear responsibilities.
|
||||||
|
|
||||||
|
#### `IQueueWaker`
|
||||||
|
|
||||||
|
```csharp
|
||||||
|
public interface IQueueWaker { void Wake(); }
|
||||||
|
```
|
||||||
|
|
||||||
|
- Singleton. Backed by today's `SemaphoreSlim`.
|
||||||
|
- Called automatically by `TaskStateService` after any `→ Queued` transition.
|
||||||
|
- Manual `WakeQueue()` calls in app code are removed (Hub `WakeQueue` SignalR endpoint stays for diagnostics but maps directly to `IQueueWaker.Wake`).
|
||||||
|
|
||||||
|
#### `IQueuePicker`
|
||||||
|
|
||||||
|
```csharp
|
||||||
|
public interface IQueuePicker
|
||||||
|
{
|
||||||
|
Task<TaskEntity?> ClaimNextAsync(DateTime now, CancellationToken ct);
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- The single place where queue selection happens.
|
||||||
|
- Filter (all required):
|
||||||
|
- `Status == Queued`
|
||||||
|
- `BlockedByTaskId IS NULL`
|
||||||
|
- `(ScheduledFor IS NULL OR ScheduledFor <= :now)`
|
||||||
|
- `EXISTS task_tags WHERE name='agent'` OR `EXISTS list_tags WHERE name='agent'`
|
||||||
|
- Order: `SortOrder ASC, CreatedAt ASC`.
|
||||||
|
- Atomic claim via `UPDATE … RETURNING` (matching today's pattern), flips `Queued → Running` and writes `StartedAt`.
|
||||||
|
- Picker is the sole caller of `Queued → Running` transition. `TaskStateService.StartRunningAsync` exists for the override slot path (RunNow / Continue).
|
||||||
|
|
||||||
|
#### `QueueService` (BackgroundService) — slimmer
|
||||||
|
|
||||||
|
- Wait on wake-signal or backstop timer.
|
||||||
|
- Call `_picker.ClaimNextAsync`.
|
||||||
|
- If task: occupy queue slot, run via `_runner.RunAsync`, in `ContinueWith` invoke `_waker.Wake()` for the next pickup.
|
||||||
|
- No DbContext. No status mutation. No DTO knowledge.
|
||||||
|
|
||||||
|
#### `OverrideSlotService` (new)
|
||||||
|
|
||||||
|
- Owns `RunNow` and `ContinueTask` (today both in `QueueService`).
|
||||||
|
- Holds the override slot state.
|
||||||
|
- Status mutations go through `TaskStateService.StartRunningAsync` (non-atomic claim — caller-driven, fine because override is user-initiated and serialized by slot lock).
|
||||||
|
|
||||||
|
### 4. Planning chain integration
|
||||||
|
|
||||||
|
Single flow, replaces both `FinalizePlanningAsync` (Repo) and `QueueSubtasksSequentiallyAsync` (Coordinator).
|
||||||
|
|
||||||
|
1. `PlanningSessionManager.StartAsync(parentId)` → `_state.StartPlanningAsync` → parent `PlanningPhase=Active`.
|
||||||
|
2. User edits children in MCP tool. Children are in `Status=Idle`.
|
||||||
|
3. `PlanningSessionManager.FinalizeAsync(parentId)`:
|
||||||
|
- `_state.FinalizePlanningAsync(parentId)` → parent `PlanningPhase=Finalized, Status=Idle`.
|
||||||
|
- `_chainCoordinator.SetupChainAsync(parentId)`:
|
||||||
|
- Attaches `agent` tag to all children (automatic — confirmed in brainstorming).
|
||||||
|
- `_state.EnqueueAsync(children[0])` → wake fires.
|
||||||
|
- `_state.BlockOnAsync(children[i], children[i-1])` for `i ≥ 1`.
|
||||||
|
4. When a child finishes, `TaskRunner.HandleSuccess` calls `_state.CompleteAsync(child)`. State-service internally invokes `_chainCoordinator.OnChildFinishedAsync`, which calls `_state.UnblockAsync(nextChild)` (wake fires). Predecessor block goes away because of `ON DELETE SET NULL`-style logic in `UnblockAsync`.
|
||||||
|
5. When all children are terminal: `_state` runs `TryCompleteParent` and sets parent `Done`/`Failed` based on aggregate.
|
||||||
|
|
||||||
|
`TaskRepository.FinalizePlanningAsync` is **deleted**. `QueueSubtasksSequentiallyAsync` is renamed to `SetupChainAsync` and made internal to the coordinator (called only from `PlanningSessionManager.FinalizeAsync`).
|
||||||
|
|
||||||
|
### 5. `Worker/Services/` reorganization
|
||||||
|
|
||||||
|
```
|
||||||
|
Worker/
|
||||||
|
State/
|
||||||
|
ITaskStateService.cs
|
||||||
|
TaskStateService.cs
|
||||||
|
TransitionResult.cs
|
||||||
|
Queue/
|
||||||
|
IQueueWaker.cs
|
||||||
|
IQueuePicker.cs
|
||||||
|
QueuePicker.cs
|
||||||
|
QueueService.cs (BackgroundService, slimmer)
|
||||||
|
OverrideSlotService.cs
|
||||||
|
QueueSlotState.cs
|
||||||
|
Lifecycle/
|
||||||
|
StaleTaskRecovery.cs
|
||||||
|
TaskResetService.cs
|
||||||
|
TaskMergeService.cs
|
||||||
|
Worktrees/
|
||||||
|
WorktreeMaintenanceService.cs
|
||||||
|
Agents/
|
||||||
|
AgentFileService.cs
|
||||||
|
DefaultAgentSeeder.cs
|
||||||
|
Runner/ (unchanged)
|
||||||
|
Planning/ (ChainCoordinator simplified)
|
||||||
|
External/ (unchanged)
|
||||||
|
Hub/ (unchanged)
|
||||||
|
```
|
||||||
|
|
||||||
|
`WorkerHub` calls fewer services — typically `_state.X` plus a domain service for non-status work (Merge, Worktree-Cleanup).
|
||||||
|
|
||||||
|
### 6. EF migration
|
||||||
|
|
||||||
|
```sql
|
||||||
|
ALTER TABLE tasks ADD COLUMN planning_phase INTEGER NOT NULL DEFAULT 0;
|
||||||
|
ALTER TABLE tasks ADD COLUMN blocked_by_task_id TEXT NULL REFERENCES tasks(id) ON DELETE SET NULL;
|
||||||
|
CREATE INDEX ix_tasks_blocked_by ON tasks(blocked_by_task_id);
|
||||||
|
|
||||||
|
UPDATE tasks SET status='idle' WHERE status='manual';
|
||||||
|
UPDATE tasks SET status='idle' WHERE status='draft';
|
||||||
|
UPDATE tasks SET status='idle', planning_phase=1 WHERE status='planning';
|
||||||
|
UPDATE tasks SET status='idle', planning_phase=2 WHERE status='planned';
|
||||||
|
```
|
||||||
|
|
||||||
|
`Waiting` migration uses a CTE with `LAG()` to derive `BlockedByTaskId` from `(parent_task_id, sort_order)`:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
WITH ordered AS (
|
||||||
|
SELECT id,
|
||||||
|
LAG(id) OVER (PARTITION BY parent_task_id ORDER BY sort_order, created_at) AS prev_id
|
||||||
|
FROM tasks WHERE status='waiting'
|
||||||
|
)
|
||||||
|
UPDATE tasks SET status='queued',
|
||||||
|
blocked_by_task_id=(SELECT prev_id FROM ordered WHERE ordered.id=tasks.id)
|
||||||
|
WHERE id IN (SELECT id FROM ordered);
|
||||||
|
```
|
||||||
|
|
||||||
|
Migration runs at worker startup via the existing `MigrateAsync` flow.
|
||||||
|
|
||||||
|
`Down()` is best-effort (local-only app). Reverse mapping is lossy: `Cancelled` → `Failed`, `BlockedByTaskId` → `Waiting`, planning fields → folded back into status.
|
||||||
|
|
||||||
|
### 7. Test strategy
|
||||||
|
|
||||||
|
New test fixtures (xUnit, real SQLite, real git where needed):
|
||||||
|
|
||||||
|
1. **`TaskStateServiceTests`** — happy path + reject for every transition; mock `IQueueWaker`, `HubBroadcaster`, `IPlanningChainCoordinator` and verify side-effect invocations; concurrency test (two parallel `StartRunningAsync` → exactly one wins).
|
||||||
|
2. **`QueuePickerTests`** — filter logic (blocked, missing tag, future schedule, wrong status) and ordering (`sort_order, created_at`); two parallel pickers → exactly one claims a row.
|
||||||
|
3. **`PlanningChainCoordinatorTests`** — `SetupChainAsync` produces correct (`Queued`, `BlockedBy`) layout; `OnChildFinishedAsync` unblocks the next child; child failure leaves remaining blocked, parent transitions to `Failed` after `TryCompleteParent`.
|
||||||
|
4. **`PlanningEndToEndTests`** — regression for the original bug. `Active` parent + 3 drafts → `Finalize` → assert first child reaches `Running` within 200 ms with no manual `Wake`.
|
||||||
|
5. **Existing tests** — anything seeding `task.Status = TaskStatus.Manual` or similar gets updated to new enum values or routed through `_state`.
|
||||||
|
|
||||||
|
Coverage target: state machine + queue picker at ≥90% branch coverage. Existing coverage levels preserved elsewhere.
|
||||||
|
|
||||||
|
### 8. Implementation slices
|
||||||
|
|
||||||
|
Each slice is one PR with green tests before the next starts.
|
||||||
|
|
||||||
|
1. **Slice 1 — Status model + migration.** New enum values, new columns, EF migration. Existing code mapped to new values mechanically (no behavior change).
|
||||||
|
2. **Slice 2 — `TaskStateService`.** Service + interface + tests. Migrate TaskRunner, StaleTaskRecovery, ExternalMcp/PlanningMcp guards, TaskResetService. Mark `Mark*Async` repo helpers `internal`.
|
||||||
|
3. **Slice 3 — `IQueueWaker` + `IQueuePicker`.** Extract from QueueService and Repo. Remove all manual `WakeQueue()` calls in app code.
|
||||||
|
4. **Slice 4 — Planning flow consolidation.** Delete `FinalizePlanningAsync` from repo. `PlanningSessionManager.FinalizeAsync` orchestrates via state-service + ChainCoordinator. Rename `QueueSubtasksSequentiallyAsync` → `SetupChainAsync` (internal). E2E test green.
|
||||||
|
5. **Slice 5 — `OverrideSlotService` + folder reorg.** Extract RunNow / ContinueTask. Move files to new folder structure. Update DI registration.
|
||||||
|
6. **Slice 6 — Cleanup & docs.** Update `Worker/CLAUDE.md`, `docs/plan.md`. Remove dead helpers.
|
||||||
|
|
||||||
|
## Risks & Mitigations
|
||||||
|
|
||||||
|
- **EF migration on existing DBs.** Tested via integration tests that load a pre-migration fixture DB. `MigrateAsync` is already in production use, low risk.
|
||||||
|
- **State-service becomes a god-object.** Mitigated by keeping it narrow: only status/phase/blocked-by writes, no business logic. Worktree, merge, and runner concerns stay in their own services.
|
||||||
|
- **Two paths to `Running` (picker atomic, state-service for override).** Confirmed acceptable in brainstorming. Picker remains the only atomic-claim path; override slot is serialized by slot lock so non-atomic is safe.
|
||||||
|
- **Waiting-migration CTE.** SQLite supports `LAG()` since 3.25. .NET 8's bundled SQLite is well above. Tested in migration unit tests.
|
||||||
|
|
||||||
|
## Open Questions
|
||||||
|
|
||||||
|
None at design time. All knackpunkte resolved during brainstorming.
|
||||||
Reference in New Issue
Block a user