--- name: diagnose-why-work-stopped description: > How to handle "why did this work stop / why is this looping?" assignments. Forensics first on the named tree, surface the exact stop-point, frame the fix as a general product rule that respects three invariants (productive work continues, only real blockers stop work, no infinite loops), and deliver a plan — no code changes — gated by board/CTO approval before child issues are created. Use whenever the issue title or body asks for forensics on a stalled, looping, or "went too deep" tree. --- # Diagnose Why Work Stopped A repeatable procedure for the recurring class of issues where the user (or a manager) points at a stalled / looping / over-recovered issue tree and asks "why did this stop / why is this looping / how do we make sure this doesn't happen again?" This skill is **diagnostic + product-design**, not engineering. The output is a written root cause and an approved plan. No code changes leave this skill. Canonical execution model: read `doc/execution-semantics.md` before diagnosing or proposing a new liveness/recovery rule. Use that document as the source of truth for status, action-path, post-run disposition, bounded continuation, productivity review, pause-hold, watchdog, and explicit recovery semantics. If the investigation finds a true product-rule gap, the plan should say whether `doc/execution-semantics.md` needs a matching update. ## When to use Trigger on an assignment whose title or body matches any of: - "why did this work stop", "why did this stall", "why did this just stop" - "infinite loop", "looping", "spinning", "going too deep", "recovery went too deep" - "liveness — what happened here", "this tree stopped working", "stuck" - "approach it from a product perspective", "general product principle / rule" - An attached link to a specific stalled / looping / over-recovered issue tree Also use when the user asks for forensics, root cause, or a write-up *before* any product change. ## When NOT to use - The assignment asks you to ship a code change directly. Use normal engineering flow. - The assignment is a normal bug report against a specific feature. Use normal investigation. - You are the original implementer being asked to fix your own bug. Use normal debugging. ## Three invariants you must preserve Every diagnosis and every proposed rule must hold these three invariants together. The user has restated them on at least four issues; treat them as load-bearing: 1. **Productive work continues.** Agents that have a clear next action must keep working without needing the user to wake them. ([PAP-2674](/PAP/issues/PAP-2674), [PAP-2708](/PAP/issues/PAP-2708)) 2. **Only real blockers stop work.** Stops happen when something genuinely cannot proceed (missing approval, missing dependency, human owner). Pseudo-stops (in_review with no action path, cancelled leaves, malformed metadata) must be detected and routed, not left silent. ([PAP-2335](/PAP/issues/PAP-2335), [PAP-2674](/PAP/issues/PAP-2674)) 3. **No infinite loops.** Stranded-work recovery and continuation loops must be bounded and distinguishable from genuinely productive continuation. ([PAP-2602](/PAP/issues/PAP-2602), [PAP-2486](/PAP/issues/PAP-2486)) If a proposed rule violates any of the three, drop it or rework it. State explicitly in the plan how each invariant is held. ## Procedure ### 0. Read the current execution contract Before walking the tree, read `doc/execution-semantics.md` and keep its terms intact: - live path / waiting path / recovery path - post-run disposition: terminal, explicitly live, explicitly waiting, invalid - bounded `run_liveness_continuation` - productivity review vs liveness recovery - active subtree pause holds - silent active-run watchdog Do not invent a new rule until you can state how it differs from the current execution semantics document. ### 1. Forensics on the named tree — before anything else Do this in the same heartbeat. Do not propose a rule until you have a concrete stop point. - Open the linked issue (and its blocker chain, parents, recovery siblings, recent runs). - Walk the tree node-by-node and find the exact issue + state combination that stops the world. Common shapes seen in the company so far: - `in_review` with no typed execution participant, no active run, no pending interaction, no recovery issue ([PAP-2335](/PAP/issues/PAP-2335), [PAP-2674](/PAP/issues/PAP-2674)). - `in_progress` after a successful run with no future action path queued ([PAP-2674](/PAP/issues/PAP-2674)). - Blocker chain whose leaf is `cancelled` / malformed / cross-company-inaccessible ([PAP-2602](/PAP/issues/PAP-2602)). - `issue.continuation_recovery` waking the same issue >N times after successful runs ([PAP-2602](/PAP/issues/PAP-2602)). - Stranded-work recovery treating its own recovery issues as more recoverable source work ([PAP-2486](/PAP/issues/PAP-2486)). - Quote the evidence: run ids, comment timestamps, status transitions. "Inferred" is acceptable only when an API boundary blocks direct evidence — say so explicitly and mark the claim provisional ([PAP-2631](/PAP/issues/PAP-2631)). Respect the API boundary. If the linked issue is in another company and your agent token returns 403, do not bypass scoping. Either request a board-approved diagnostic path or proceed from inferred PAP-side evidence and label it. ### 2. Survey recent related work Before proposing a new product rule, read what already shipped this week in the same area. The user has explicitly called this out: ([PAP-2602](/PAP/issues/PAP-2602)) "review our recent work on liveness that we shipped in the last couple of days." A new rule that contradicts code merged 48 hours ago is rework, not improvement. Quick survey: - Recent merged PRs in the affected area. - Recent done issues whose title mentions liveness, recovery, productivity, continuation, or the affected subsystem. - Any active plan documents on parent issues. The fix may belong as a revision to an existing plan, not as a new top-level proposal. State in the forensics: "I reviewed X, Y, Z. The new gap is …" ### 3. Classify each non-progressing issue in the tree For every issue in the affected tree that is not `done` / `cancelled` / actively running, decide: - **Truly needs human or board intervention** — name the owner and the action. - **Agent-actionable but not currently routed** — name the rule that would have routed it, and the agent that should have been waked. - **Already covered** — point at the active run, queued wake, recovery issue, or pending interaction. This is the table the user has asked for repeatedly ([PAP-2335](/PAP/issues/PAP-2335)). Without it the plan is abstract. ### 4. Frame as a general product rule The user does not want a one-off patch on the named tree. They want the rule. Two checks: - The rule is **stated as a contract**, not as an if/else patch. Example contract: "every agent-owned non-terminal issue must finish each heartbeat with a terminal state, an explicit waiting path, or an explicit live path" ([PAP-2674](/PAP/issues/PAP-2674)). - The rule is reconciled against `doc/execution-semantics.md`. Prefer citing and applying the existing contract; propose a document change only when the current doc is incomplete or contradicted by accepted/implemented behavior. - The rule **explicitly preserves the three invariants** above. Show the work. If the rule would have blocked a recent productive run from succeeding, drop or narrow it. ### 5. Plan, do not code Write the plan into the issue's `plan` document. Cover: - Forensics summary (root cause + evidence). - The general product rule, stated as a contract. - Whether the existing `doc/execution-semantics.md` contract already covers the case, or what exact documentation update is needed. - Phased subtasks: typically `Phase 0` resolves the named live tree (carefully, not destructively), `Phase 1` codifies the contract in docs, then implementation phases for detection, recovery, UI surfacing, security review, QA, and CTO review. - Explicit assignees per phase; favor team specialty (CodexCoder for server, ClaudeCoder for FE, UXDesigner for visible state, SecurityEngineer for ownership/permissions, QA for validation). - Blocking dependencies wired with `blockedByIssueIds`, parallel branches identified. Do not create the child issues yet. Do not push code. ### 6. Request approval, then decompose - Open a `request_confirmation` interaction targeting the latest plan revision. Idempotency key `confirmation:{issueId}:plan:{revisionId}`. - Wait for board/CTO acceptance. If the user posts a new comment that supersedes the plan, the prior confirmation is invalidated — open a fresh confirmation tied to the new revision ([PAP-2602](/PAP/issues/PAP-2602) cycled three revisions; that is fine). - Only after acceptance: create the phased child issues with the right assignees and dependencies, then block this parent on the final QA / CTO review issue so the parent only wakes when the chain finishes. ### 7. Phase 0 hygiene on the named tree Phase 0 cleans up the live tree without papering over evidence: - Move stalled `in_review` leaves with no participant to `todo` with a precise next action and named owner ([PAP-2335](/PAP/issues/PAP-2335)). - Detach cancelled/dead blockers from chains they were holding hostage; do not silently mark issues `done` to clear backlog. - Leave a comment on the original named issue summarizing what changed and why; never hide the recovery chain history. ### 8. Final close-out When the phase chain is complete, post a board-level summary comment on the parent issue: what changed, what the new contract is, what the rollout step is (e.g. "restart the control-plane to pick up the new response shape"), and the live state of the originally-named tree. Then close the parent. ## Pitfalls - **Coding before approval.** The user has said "make a plan first" on every recent diagnostic issue. Producing code in the forensic phase wastes the round-trip. - **Restating one invariant at the cost of another.** Bound continuation too tightly and productive work stalls; loosen recovery and infinite loops return. Always check all three. - **Skipping the recent-work survey.** Proposing a contract that contradicts what shipped 24 hours ago is the easiest way to get the plan rejected. - **Letting "in_review" mean done.** A leaf assigned to another agent with no participant or active run is not progress; treat it as a stop. - **Bypassing company scoping.** Cross-company forensics needs a board-approved diagnostic path, not a database read. - **Recursive recovery.** Stranded-work recovery that recovers its own recovery issues is the canonical infinite loop ([PAP-2486](/PAP/issues/PAP-2486)). Detect it and refuse to deepen. - **Hiding the chain.** Don't silently delete or hide the symptomatic recovery issues — the operator needs the audit trail. ## Verification checklist (before posting the plan) - [ ] The exact stop point in the named tree is identified with run ids / comment ids. - [ ] Recent shipped work in the same area was surveyed and is referenced. - [ ] Every non-progressing issue is classified human-needed / agent-actionable / already-covered. - [ ] The proposed rule is stated as a contract, not a patch. - [ ] All three invariants are explicitly preserved. - [ ] No code change has landed in this heartbeat. - [ ] A `request_confirmation` against the latest plan revision is open. - [ ] Phase 0 of the plan addresses the live named tree without destroying evidence. - [ ] Implementation phases name specialty-appropriate assignees and `blockedByIssueIds` dependencies.