# Failure Mode Catalog

Real ways loops fail — and how good design mitigates them. Use this when debugging a misbehaving loop or writing a new pattern.

## Classification

| Severity | Meaning |
|----------|---------|
| **S1 — Annoying** | Wasted time/tokens, no user harm |
| **S2 — Harmful** | Wrong code merged, bad tickets, alert fatigue |
| **S3 — Critical** | Security, data loss, production incident |

---

## Infinite Fix Loop

**Symptom**: Same PR or CI job gets automated fix attempts 5+ times; never converges.

**Severity**: S2

**Causes**:
- Verifier too weak or same session as implementer
- Root cause misdiagnosed (symptom fixing)
- Flaky test treated as regression

**Mitigations**:
- Hard cap on attempts (e.g. 3) → escalate to human
- Separate verifier model / higher reasoning effort
- Classify flakes in triage; quarantine instead of code change
- Record attempt count in state file

---

## State Rot

**Symptom**: `STATE.md` references merged PRs, closed tickets, or stale branches.

**Severity**: S1 → S2 (loop acts on ghosts)

**Causes**:
- No prune step at end of run
- State file not read at start
- Multiple loops writing same file without schema

**Mitigations**:
- Prune closed/merged items every run
- `Last run` timestamp + validate IDs against live API
- One state file per loop pattern, or clear sections

---

## Verifier Theater

**Symptom**: Verifier "approves" but tests fail in CI or review finds obvious bugs.

**Severity**: S2

**Causes**:
- Verifier prompt too vague ("looks good")
- Verifier doesn't run tests
- Same model, same context as implementer

**Mitigations**:
- Verifier must run test/lint commands and report output
- Different instructions: "find reasons to reject"
- Stronger model on verifier for unattended loops

---

## Notification Fatigue

**Symptom**: Slack/email pings every 5 minutes; team mutes the bot.

**Severity**: S1 → S2 (real escalations missed)

**Causes**:
- Notify on every run, not every *actionable* finding
- Low bar for "high priority" in triage skill

**Mitigations**:
- Notify only when human decision required
- Digest mode for report-only loops
- Tighten triage "High Priority" rules

---

## Token Burn

**Symptom**: Bill spikes; loop runs full sub-agent chains on empty or noisy triage.

**Severity**: S1

**Causes**:
- Sub-minute cadence with heavy sub-agents
- No early exit when watchlist empty
- Retrying entire pipeline on transient API errors

**Mitigations**:
- Cheaper triage-only pass first; spawn sub-agents only for actionable items
- `scheduler_delete` when nothing to watch
- Daily token budget → pause loop
- See [operating-loops.md](./operating-loops.md)

---

## Over-Reach (Wrong Scope)

**Symptom**: Loop refactors unrelated modules, "fixes" design issues, or touches denylisted paths.

**Severity**: S2 → S3

**Causes**:
- minimal-fix skill too permissive
- No path allowlist/denylist
- Triage puts architectural work in "High Priority"

**Mitigations**:
- [safety.md](./safety.md) denylist enforced in skills
- "Smallest possible diff" + verifier checks touched files
- Triage skill: signal only, no invention

---

## Comprehension Debt Spiral

**Symptom**: Velocity up, but no one can explain recent changes; review becomes rubber-stamp.

**Severity**: S2 (long-term)

**Causes**:
- Human stops reading loop output
- Auto-merge on growing allowlist
- No weekly human synthesis of loop actions

**Mitigations**:
- Mandatory human review for non-trivial PRs
- Weekly "loop digest" read by owner
- Cap auto-merge to truly trivial paths

---

## Cognitive Surrender

**Symptom**: "The loop handles it" — no opinions on correctness or design.

**Severity**: S2 (cultural)

**Causes**:
- Loop success metric = volume, not quality
- No human gates on medium-risk work

**Mitigations**:
- Explicit human gates in every pattern
- Success metric: time saved *with* quality bar held
- Osmani: "Build it like someone who intends to stay the engineer"

---

## Parallel Collision

**Symptom**: Two sub-agents edit same files; merge conflicts; corrupted state.

**Severity**: S2

**Causes**:
- No worktree isolation
- Two loops acting on same PR without coordination

**Mitigations**:
- `isolation: worktree` for all code-editing sub-agents
- Lock or queue in state: "PR #1234 — worktree in progress"

---

## Escalation Failure

**Symptom**: Loop stuck retrying; human never notified.

**Severity**: S2

**Causes**:
- Max attempts not implemented
- Escalation only writes to state no one reads

**Mitigations**:
- Connector ping on escalation (Slack, Linear comment)
- `High Priority (waiting on human)` section in STATE.md
- Alert if item in that section >24h

---

## Contributing Failures

Have a story? Add a row via PR to this doc or open an issue with:
- Pattern name
- Symptom
- What mitigated it (or didn't)