# ADR-016: Infrastructure-Failure Guard

| Field | Value |
|-------|-------|
| **Decision ID** | ADR-016 |
| **Initiative** | unsorry Phase 3 — operational correctness |
| **Proposed By** | unsorry maintainers |
| **Date** | 2026-06-11 |
| **Status** | Accepted |

## WH(Y) Decision Statement
**In the context of** an agent loop whose failure path treats every failed claude call as evidence about the goal: an exhausted budget triggers decomposition, an unusable split triggers a −10 demote, and a goal demoted below τ_v leaves the pool,
**facing** two production incidents on 2026-06-11 in which a CLI quota outage made every call die in ~1 minute, and the loop — unable to tell "the model tried and failed" from "the model never ran" — demoted every open leaf of the active tree below the viability threshold, emptied its own pool, exited "no claimable goal", and twice required maintainer affinity-restore PRs (#165, and the #168–#181 cleanup),
**we decided for** classifying a failed proof-surface call as an *infrastructure failure* when it died in under `UNSORRY_FASTFAIL` seconds (default 240 — a real attempt must at least read the goal and run a build) **and** a follow-up health probe on the cheap model also fails; on that classification the cycle aborts with no `prove-failed` event, no decomposition, no demote, the claim is released, and the agent exits with a distinct code 3 so the orchestrator knows to reschedule rather than diagnose goals,
**and neglected** retry-with-backoff inside the loop (an outage measured in hours would idle a worktree and mislead the orchestrator; a clean exit hands timing to whoever can see the clock), pre-claim health probes on every cycle (cost on the healthy path for a rare event), and distinguishing quota from auth/network failures (identical handling either way; the probe answers "can the CLI run", which is all the queue needs),
**to achieve** a queue whose affinity and decomposition state only ever encode *model evidence about goals*, never infrastructure weather — an outage now costs wall-clock, not state repair,
**accepting that** a genuinely broken call that fails fast while the cheap model happens to be healthy still counts as a real attempt (conservative: queue penalties stay possible), the probe spends one cheap-model call per fast failure, and a wall-timeout followed by a quota death is still recorded as a real attempt (the model had its chance).
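
The classification rule in the statement above can be sketched as a pure function. This is a minimal illustration, not the code specified in SPEC-016-A: the names `Outcome`, `fastfail_threshold`, and `classify_failure` are hypothetical, and the probe is modelled as a callable so that it runs only after a fast failure, matching the "one cheap-model call per fast failure" cost. Only `UNSORRY_FASTFAIL` and its default of 240 seconds come from this ADR.

```python
import os
from enum import Enum
from typing import Callable, Mapping, Optional

DEFAULT_FASTFAIL_SECONDS = 240.0  # default stated in this ADR


class Outcome(Enum):
    """Hypothetical labels for the two classifications described above."""
    REAL_ATTEMPT = "real_attempt"
    INFRA_FAILURE = "infra_failure"


def fastfail_threshold(env: Optional[Mapping[str, str]] = None) -> float:
    """Read UNSORRY_FASTFAIL (seconds), falling back to the default."""
    env = os.environ if env is None else env
    return float(env.get("UNSORRY_FASTFAIL", DEFAULT_FASTFAIL_SECONDS))


def classify_failure(
    duration_s: float,
    threshold_s: float,
    probe: Callable[[], bool],
) -> Outcome:
    """Classify a failed call.

    `probe` is an assumed callable that returns True when the cheap-model
    health check succeeds. It is invoked only for fast failures.
    """
    if duration_s >= threshold_s:
        # Slow failure: the model had its chance, so no probe is spent.
        return Outcome.REAL_ATTEMPT
    if probe():
        # Fast failure, but the CLI is healthy: conservatively a real attempt.
        return Outcome.REAL_ATTEMPT
    # Fast failure and the probe also failed: infrastructure, not the goal.
    return Outcome.INFRA_FAILURE
```

Because the function takes the duration and the probe as arguments, it can be tested without touching the CLI: `classify_failure(60, 240, lambda: False)` yields `Outcome.INFRA_FAILURE`, while the same call with a passing probe yields `Outcome.REAL_ATTEMPT`.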

## Context

The demote path (ADR-010) and the decomposition fallback (ADR-009) both assume the budget was actually *spent on the goal*. The 05:48Z and 10:43Z outages violated that assumption at scale: 8 and 9 spurious demotes respectively, plus duplicate-demote PR churn from two agents failing in parallel, plus a stalled Thread-A tree each time. The fix lives at the only place that can tell the difference — the call site, with the duration and a probe in hand. Soundness is untouched: this changes what the *queue* learns from failures, never what the gates accept.

## Options Considered

### Option 1: Fast-fail + health-probe classification, clean exit 3 (Selected)
**Pros:** zero cost on the healthy path; pure-function classifier is hermetically testable; the orchestrator gets an unambiguous signal; queue state stays meaningful.
**Cons:** conservative misclassification possible (fast real failures with a healthy CLI remain "real"); one cheap probe per fast failure.
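
As an illustration of the "unambiguous signal" point, an orchestrator could branch on the agent's exit status roughly as follows. Only exit code 3 and its meaning (reschedule rather than diagnose goals) come from this ADR; the `Action` names and the treatment of 0 and of other non-zero codes are assumptions made for the sketch.

```python
from enum import Enum

EXIT_INFRA_FAILURE = 3  # exit code named in this ADR


class Action(Enum):
    """Hypothetical orchestrator responses."""
    CONTINUE = "continue"      # assumed: agent finished normally
    RESCHEDULE = "reschedule"  # infrastructure failure: try again later
    DIAGNOSE = "diagnose"      # assumed: any other failure needs a look


def orchestrator_action(exit_code: int) -> Action:
    """Map an agent exit code to an orchestrator response."""
    if exit_code == EXIT_INFRA_FAILURE:
        # Queue state was left untouched, so nothing needs repair.
        return Action.RESCHEDULE
    if exit_code == 0:
        return Action.CONTINUE
    return Action.DIAGNOSE
```

Keeping the timing decision in the orchestrator matches the rationale for rejecting in-loop backoff: the process that can see the clock decides when to retry.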

### Option 2: In-loop retry with backoff (Rejected)
Sleep and retry until the CLI recovers. Rejected: outages here are quota windows measured in hours; a sleeping loop holds its claim worktree, emits nothing, and looks identical to a hang from outside.

### Option 3: Pre-claim health probe every cycle (Rejected)
Probe before claiming. Rejected: pays on every healthy cycle to catch a rare event, and a probe that passes at claim time says nothing about a death 20 minutes into the attempt.

## Dependencies
| Relationship | ADR ID | Title | Notes |
|--------------|--------|-------|-------|
| Amends | ADR-009 | Goal Decomposition | Decompose fallback skipped on infra failure |
| Amends | ADR-010 | Affinity-Gap Selection | Demote requires a real attempt |
| Relates To | ADR-015 | Progressive Effort Escalation | Ladder attempts are individually classified |

## References
| Reference ID | Title | Type | Location |
|--------------|-------|------|----------|
| REF-1 | SPEC-016-A — Infrastructure-failure guard | Specification | specs/SPEC-016-A-Infrastructure-Failure-Guard.md |
| REF-2 | Incident timeline | Metrics | ../metrics/phase3-run-001.md (when landed) |

## Status History
| Status | Approver | Date |
|--------|----------|------|
| Proposed | unsorry maintainers | 2026-06-11 |
| Accepted | unsorry maintainers | 2026-06-11 |