--- name: error-recovery description: Strategies for handling subagent failures with retry logic and escalation patterns. allowed-tools: Read, Task --- # Error Recovery Skill Pattern for handling subagent failures gracefully with appropriate retry strategies. ## When to Load This Skill - You are spawning subagents that may fail - A subagent returned an error or unexpected output - You need to decide whether to retry, escalate, or abort ## Failure Categories | Category | Symptoms | Strategy | |----------|----------|----------| | **Transient** | Timeout, malformed output, parsing error | Simple Retry | | **Context Gap** | "I don't have enough information", unclear task | Context Enhancement | | **Complexity** | Partial completion, scope creep, tangents | Scope Reduction | | **Boundary/Contract** | `status: blocked`, boundary_violation, contract_change | Escalation | | **Fatal** | Repeated failures (3+), fundamental misunderstanding | Abort with Report | ## Retry Strategies ### Strategy 1: Simple Retry For transient failures. Same prompt, up to 3 attempts. ``` # Track attempts attempts: 0 max_attempts: 3 # On failure IF attempts < max_attempts: attempts += 1 Task(same_subagent_type, same_model, same_prompt) ELSE: Mark as FAILED, move on ``` **Use when:** - Output was malformed or truncated - Timeout occurred - Agent returned empty/null response ### Strategy 2: Context Enhancement Add more information to help the agent succeed. ``` Task( subagent_type: "implementer", model: "sonnet", prompt: | ## PREVIOUS ATTEMPT FAILED Error: {error_message} Output received: {partial_output} ## ADDITIONAL CONTEXT Here is more information that may help: - Related file: @{additional_file_path} - Pattern to follow: {example_pattern} - Specific guidance: {clarification} ## ORIGINAL TASK {original_task_description} Output to: {output_path} ) ``` **Use when:** - Agent said "I don't understand" or "unclear requirements" - Agent made incorrect assumptions - Agent asked questions in output **Context to add:** - Related code files the agent might need - Similar implementations as examples - Explicit clarification of ambiguous points - Error message from previous attempt ### Strategy 3: Scope Reduction Break the failing task into smaller, more manageable pieces. ``` # Original task failed Task: "Implement full authentication system" # Split into subtasks Task(implementer, "Implement password hashing utility") Task(implementer, "Implement session token generation") Task(implementer, "Implement login endpoint") Task(implementer, "Implement logout endpoint") ``` **Use when:** - Agent completed partial work then failed - Task description was too broad - Agent went off on tangents - Output shows confusion about scope **Splitting guidelines:** - Each subtask should be independently completable - Each subtask should have clear boundaries - Subtasks can run in parallel if no dependencies - Recombine outputs after all subtasks complete ### Strategy 4: Escalation Route to specialized agent for resolution. ``` # For boundary violations Task( subagent_type: "contract-resolver", model: "sonnet", prompt: | A task is blocked due to boundary/contract issues. Blocked task output: memory/tasks/{task_id}/output.json Blocked reason: {blocked_reason} Current contracts: {contract_paths} Analyze impact and provide resolution. Output to: memory/contracts/resolution_{task_id}.json ) ``` **Escalation paths:** | Failure Type | Escalate To | Action | |--------------|-------------|--------| | `blocked_reason: boundary_violation` | contract-resolver | Expand boundaries or redesign | | `blocked_reason: contract_change` | contract-resolver | Modify contract, re-verify dependents | | `blocked_reason: dependency_issue` | executor (self) | Re-check dependency status | | Repeated implementation failures | architect | Reconsider design approach | ### Strategy 5: Abort with Report When recovery is not possible, fail gracefully. ```json {"tasks":[{"id":"{task_id}","status":"failed","failure_reason":"{specific reason}","attempts_made":3,"recovery_attempted":[{"strategy":"simple_retry","result":"same_error"},{"strategy":"context_enhancement","result":"different_error"},{"strategy":"scope_reduction","result":"subtasks_also_failed"}],"recommendation":"Task may need architectural redesign"}]} ``` **Use when:** - 3+ retry attempts failed - Different strategies all failed - Fundamental misunderstanding of requirements - Task is actually impossible given constraints ## Decision Tree ``` On Subagent Failure: │ ├─ Is output malformed/empty/timeout? │ └─ YES → Strategy 1: Simple Retry (up to 3x) │ ├─ Did agent say "unclear" or ask questions? │ └─ YES → Strategy 2: Context Enhancement │ ├─ Did agent complete partial work? │ └─ YES → Strategy 3: Scope Reduction │ ├─ Is status "blocked" with boundary/contract reason? │ └─ YES → Strategy 4: Escalation to contract-resolver │ ├─ Have we tried 3+ strategies already? │ └─ YES → Strategy 5: Abort with Report │ └─ Unknown error └─ Try Strategy 2 first, then escalate ``` ## Retry State Tracking Track retry attempts in the execution state file: ```json {"tasks":[{"id":"task-001","status":"running","attempts":2,"last_error":"Timeout after 120s","retry_strategy":"simple_retry"},{"id":"task-002","status":"running","attempts":1,"last_error":"Needs access to src/config/db.ts","retry_strategy":"context_enhancement","context_added":["src/config/db.ts","src/types/config.ts"]}]} ``` ## Integration with Executor Loop ``` # Enhanced execution loop WHILE tasks remain incomplete: 1. Read state file 2. Find ready tasks 3. Spawn ready tasks 4. Check completed tasks: FOR each completed task: IF status == pre_complete: spawn verifier ELIF status == blocked: apply Strategy 4 (Escalation) ELIF status == failed: determine_failure_category() apply_appropriate_strategy() update_retry_state() 5. Update state file 6. IF all verified: EXIT 7. IF all failed with no recovery: EXIT with failure report ``` ## Principles 1. **Fail fast, recover smart** - Don't retry blindly; analyze the failure first 2. **Preserve partial work** - If agent completed 50%, don't discard it 3. **Escalate early** - Boundary/contract issues need resolver, not retries 4. **Track everything** - Log all attempts for reflection phase 5. **Know when to quit** - 3 failed strategies = abort, don't loop forever