---
name: error-recovery
description: Strategies for handling subagent failures with retry logic and escalation patterns.
allowed-tools: Read, Task
---

# Error Recovery Skill

Pattern for handling subagent failures gracefully with appropriate retry strategies.

## When to Load This Skill

- You are spawning subagents that may fail
- A subagent returned an error or unexpected output
- You need to decide whether to retry, escalate, or abort

## Failure Categories

| Category | Symptoms | Strategy |
|----------|----------|----------|
| **Transient** | Timeout, malformed output, parsing error | Simple Retry |
| **Context Gap** | "I don't have enough information", unclear task | Context Enhancement |
| **Complexity** | Partial completion, scope creep, tangents | Scope Reduction |
| **Boundary/Contract** | `status: blocked`, boundary_violation, contract_change | Escalation |
| **Fatal** | Repeated failures (3+), fundamental misunderstanding | Abort with Report |

## Retry Strategies

### Strategy 1: Simple Retry

For transient failures. Same prompt, up to 3 attempts.

```
# Track attempts
attempts: 0
max_attempts: 3

# On failure
IF attempts < max_attempts:
  attempts += 1
  Task(same_subagent_type, same_model, same_prompt)
ELSE:
  Mark as FAILED, move on
```

**Use when:**
- Output was malformed or truncated
- Timeout occurred
- Agent returned empty/null response

### Strategy 2: Context Enhancement

Add more information to help the agent succeed.

```
Task(
  subagent_type: "implementer",
  model: "sonnet",
  prompt: |
    ## PREVIOUS ATTEMPT FAILED

    Error: {error_message}
    Output received: {partial_output}

    ## ADDITIONAL CONTEXT

    Here is more information that may help:
    - Related file: @{additional_file_path}
    - Pattern to follow: {example_pattern}
    - Specific guidance: {clarification}

    ## ORIGINAL TASK

    {original_task_description}

    Output to: {output_path}
)
```

**Use when:**
- Agent said "I don't understand" or "unclear requirements"
- Agent made incorrect assumptions
- Agent asked questions in output

**Context to add:**
- Related code files the agent might need
- Similar implementations as examples
- Explicit clarification of ambiguous points
- Error message from previous attempt

### Strategy 3: Scope Reduction

Break the failing task into smaller, more manageable pieces.

```
# Original task failed
Task: "Implement full authentication system"

# Split into subtasks
Task(implementer, "Implement password hashing utility")
Task(implementer, "Implement session token generation")
Task(implementer, "Implement login endpoint")
Task(implementer, "Implement logout endpoint")
```

**Use when:**
- Agent completed partial work then failed
- Task description was too broad
- Agent went off on tangents
- Output shows confusion about scope

**Splitting guidelines:**
- Each subtask should be independently completable
- Each subtask should have clear boundaries
- Subtasks can run in parallel if no dependencies
- Recombine outputs after all subtasks complete

### Strategy 4: Escalation

Route to specialized agent for resolution.

```
# For boundary violations
Task(
  subagent_type: "contract-resolver",
  model: "sonnet",
  prompt: |
    A task is blocked due to boundary/contract issues.

    Blocked task output: memory/tasks/{task_id}/output.json
    Blocked reason: {blocked_reason}
    Current contracts: {contract_paths}

    Analyze impact and provide resolution.
    Output to: memory/contracts/resolution_{task_id}.json
)
```

**Escalation paths:**

| Failure Type | Escalate To | Action |
|--------------|-------------|--------|
| `blocked_reason: boundary_violation` | contract-resolver | Expand boundaries or redesign |
| `blocked_reason: contract_change` | contract-resolver | Modify contract, re-verify dependents |
| `blocked_reason: dependency_issue` | executor (self) | Re-check dependency status |
| Repeated implementation failures | architect | Reconsider design approach |

### Strategy 5: Abort with Report

When recovery is not possible, fail gracefully.

```json
{"tasks":[{"id":"{task_id}","status":"failed","failure_reason":"{specific reason}","attempts_made":3,"recovery_attempted":[{"strategy":"simple_retry","result":"same_error"},{"strategy":"context_enhancement","result":"different_error"},{"strategy":"scope_reduction","result":"subtasks_also_failed"}],"recommendation":"Task may need architectural redesign"}]}
```

**Use when:**
- 3+ retry attempts failed
- Different strategies all failed
- Fundamental misunderstanding of requirements
- Task is actually impossible given constraints

## Decision Tree

```
On Subagent Failure:
│
├─ Is output malformed/empty/timeout?
│  └─ YES → Strategy 1: Simple Retry (up to 3x)
│
├─ Did agent say "unclear" or ask questions?
│  └─ YES → Strategy 2: Context Enhancement
│
├─ Did agent complete partial work?
│  └─ YES → Strategy 3: Scope Reduction
│
├─ Is status "blocked" with boundary/contract reason?
│  └─ YES → Strategy 4: Escalation to contract-resolver
│
├─ Have we tried 3+ strategies already?
│  └─ YES → Strategy 5: Abort with Report
│
└─ Unknown error
   └─ Try Strategy 2 first, then escalate
```

## Retry State Tracking

Track retry attempts in the execution state file:

```json
{"tasks":[{"id":"task-001","status":"running","attempts":2,"last_error":"Timeout after 120s","retry_strategy":"simple_retry"},{"id":"task-002","status":"running","attempts":1,"last_error":"Needs access to src/config/db.ts","retry_strategy":"context_enhancement","context_added":["src/config/db.ts","src/types/config.ts"]}]}
```

## Integration with Executor Loop

```
# Enhanced execution loop
WHILE tasks remain incomplete:
  1. Read state file
  2. Find ready tasks
  3. Spawn ready tasks
  4. Check completed tasks:
     FOR each completed task:
       IF status == pre_complete:
         spawn verifier
       ELIF status == blocked:
         apply Strategy 4 (Escalation)
       ELIF status == failed:
         determine_failure_category()
         apply_appropriate_strategy()
         update_retry_state()
  5. Update state file
  6. IF all verified: EXIT
  7. IF all failed with no recovery: EXIT with failure report
```

## Principles

1. **Fail fast, recover smart** - Don't retry blindly; analyze the failure first
2. **Preserve partial work** - If agent completed 50%, don't discard it
3. **Escalate early** - Boundary/contract issues need resolver, not retries
4. **Track everything** - Log all attempts for reflection phase
5. **Know when to quit** - 3 failed strategies = abort, don't loop forever