---
name: error-recovery
description: Use when encountering failures - assess severity, preserve evidence, execute rollback decision tree, and verify post-recovery state
---

# Error Recovery

## Overview

Handle failures with a structured five-step recovery protocol instead of ad hoc fixes.

**Core principle:** When things break, don't panic. Assess, preserve, recover, verify, document.

**Announce at start:** "I'm using error-recovery to handle this failure."

## The Recovery Protocol

```
Error Detected
      │
      ▼
┌─────────────┐
│ 1. ASSESS   │ ← Severity? Scope? Impact?
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 2. PRESERVE │ ← Capture evidence before it's lost
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 3. RECOVER  │ ← Follow decision tree
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 4. VERIFY   │ ← Confirm clean state
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 5. DOCUMENT │ ← Record what happened
└─────────────┘
```

## Step 1: Assess Severity

### Severity Levels

| Level | Description | Examples |
|-------|-------------|----------|
| **Critical** | System unusable, data at risk | Build completely broken, tests cause data loss |
| **Major** | Significant functionality broken | Feature doesn't work, many tests failing |
| **Minor** | Isolated issue, workaround exists | Single test flaky, style error |
| **Info** | Warning only, not blocking | Deprecation notice, performance hint |

### Assessment Questions

```markdown
## Error Assessment

**Error:** [Description of error]
**Location:** [Where it occurred]

### Severity Checklist
- [ ] Is the system still functional?
- [ ] Is any data at risk?
- [ ] Are other features affected?
- [ ] Is this blocking progress?

### Scope
- Files affected: [list]
- Features affected: [list]
- Users affected: [none/some/all]
```
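
The checklist answers can be mapped to a level mechanically. The sketch below is one possible mapping, not part of the protocol itself: `classify_severity` is a hypothetical helper, and the rule that an isolated but blocking issue counts as Minor is an assumption.

```bash
# Hypothetical helper: map the four checklist answers (yes/no) to a severity level.
# Arguments, in checklist order:
#   functional  data_at_risk  others_affected  blocking
classify_severity() {
  local functional="$1" data_at_risk="$2" others_affected="$3" blocking="$4"

  if [ "$functional" = "no" ] || [ "$data_at_risk" = "yes" ]; then
    echo "Critical"   # system unusable or data at risk
  elif [ "$others_affected" = "yes" ]; then
    echo "Major"      # breakage reaches other features
  elif [ "$blocking" = "yes" ]; then
    echo "Minor"      # isolated issue (assumed mapping)
  else
    echo "Info"       # warning only, not blocking
  fi
}
```

For example, `classify_severity yes no no yes` prints `Minor`.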

## Step 2: Preserve Evidence

**Capture BEFORE attempting fixes:**

### Error Logs

```bash
# Capture error output
pnpm test 2>&1 | tee error-log.txt

# Or from failed command
./failing-command 2>&1 | tee error-log.txt
```
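
In a pipeline such as `pnpm test 2>&1 | tee error-log.txt`, the shell reports the exit status of `tee` by default, so a later `$?` check can see success even when the tests failed. A minimal sketch that keeps both the log and the real exit status, assuming a hypothetical helper named `capture_error_log`:

```bash
# Hypothetical helper: run a command, save stdout and stderr to a log file,
# and return the command's own exit status.
# Usage: capture_error_log <log_file> <command> [args...]
capture_error_log() {
  local log_file="$1" status
  shift

  # Redirect instead of piping so $? belongs to the command, not to tee.
  "$@" > "$log_file" 2>&1
  status=$?

  cat "$log_file"                                      # show the output after the run
  printf 'exit status: %s\n' "$status" >> "$log_file"  # record it with the evidence
  return "$status"
}

# Example: capture_error_log error-log.txt pnpm test
```

The trade-off is that output appears only after the command finishes, rather than streaming as it does with `tee`.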

### Stack Traces

````markdown
## Stack Trace

```
Error: Connection refused
    at Database.connect (src/db/connection.ts:45)
    at UserService.init (src/services/user.ts:23)
    at main (src/index.ts:12)
```
````

### State Capture

```bash
# Git state
git status
git diff

# Environment state
env | grep -E "NODE|NPM|PATH"

# Dependency state
pnpm list
```
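
These commands print to the terminal, where the output is easy to lose once recovery starts. A sketch that writes the same snapshot into one directory, assuming a hypothetical helper `capture_state` and hypothetical file names:

```bash
# Hypothetical helper: snapshot git, environment, and dependency state into a directory.
# Usage: capture_state <output_dir>
capture_state() {
  local out_dir="$1"
  mkdir -p "$out_dir" || return 1

  # Every command may fail (for example, outside a git repository); the error
  # text lands in the file and the snapshot continues.
  git status > "$out_dir/git-status.txt" 2>&1 || true
  git diff   > "$out_dir/git-diff.patch" 2>&1 || true
  env | grep -E "NODE|NPM|PATH" > "$out_dir/env.txt" 2>&1 || true
  pnpm list < /dev/null > "$out_dir/pnpm-list.txt" 2>&1 || true
}

# Example: capture_state "evidence/$(date +%Y%m%d-%H%M%S)"
```

The `NPM` pattern can also match variables that hold credentials, so review `env.txt` before attaching it to an issue.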

### Screenshot (if visual)

For UI errors, capture screenshots before changes.

## Step 3: Recover

### Decision Tree

```
What type of failure?
           │
     ┌─────┴──────┬────────────┬────────────┐
     │            │            │            │
    Code        Build     Environment   External
   Error        Error        Issue       Service
     │            │            │            │
     ▼            ▼            ▼            ▼
┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
│   Git   │  │  Clean  │  │ Re-init │  │  Wait/  │
│ recovery│  │  build  │  │         │  │  retry  │
└─────────┘  └─────────┘  └─────────┘  └─────────┘
```

### Code Error Recovery

**Single file broken:**

```bash
# Revert just that file
git checkout HEAD -- path/to/file.ts
```

**Feature broken (multiple files):**

```bash
# Find last good commit
git log --oneline

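# Soft reset: moves HEAD back but keeps the later changes staged,
# so the broken code stays in the working tree for inspection
git reset --soft [GOOD_COMMIT]

# Or hard reset: drops later commits from the branch and
# discards uncommitted changes to tracked files
git reset --hard [GOOD_COMMIT]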
```
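
`git reset --hard` drops uncommitted changes to tracked files and removes later commits from the branch. A sketch that keeps both recoverable first, assuming a hypothetical helper `safe_hard_reset` and a hypothetical `backup/pre-reset-<timestamp>` branch naming scheme:

```bash
# Hypothetical helper: save the current state, then reset to a known-good commit.
# Usage: safe_hard_reset <good_commit>
safe_hard_reset() {
  local good_commit="$1"
  local backup="backup/pre-reset-$(date +%Y%m%d-%H%M%S)"

  # Keep the current branch tip reachable under a backup branch.
  git branch "$backup" || return 1

  # Park uncommitted and untracked changes; does nothing if the tree is clean.
  git stash push --include-untracked -m "pre-reset: $backup" || return 1

  git reset --hard "$good_commit" || return 1
  echo "Backup branch: $backup"
}
```

To undo the reset, `git reset --hard <backup-branch>` restores the commits and `git stash pop` restores the uncommitted changes.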

**Working directory is a mess:**

```bash
# Stash current changes, including untracked files
git stash --include-untracked

# Verify clean state
git status

# Optionally recover stash later
git stash pop
```

### Build Error Recovery

```bash
# Clean build artifacts
rm -rf node_modules dist build .cache

# Reinstall dependencies
pnpm install --frozen-lockfile  # Clean install from lock file

# Rebuild
pnpm build
```

### Environment Error Recovery

```bash
# Check environment
env | grep -E "NODE|PNPM"

# If using nvm, switch to the version pinned in .nvmrc before reinstalling
nvm use

# Reset Node modules
rm -rf node_modules
pnpm install --frozen-lockfile

# Re-run init script
./scripts/init.sh
```

### External Service Error

```bash
# Check if service is up
curl -I https://service.example.com/health

# If down, wait and retry
sleep 60
curl -I https://service.example.com/health

# If still down, check status page
# Document as external blocker
```
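
The wait-and-retry step can be automated with exponential backoff. A minimal sketch, assuming a hypothetical helper `retry_with_backoff`; the attempt count and delays are illustrative, not values the protocol prescribes:

```bash
# Hypothetical helper: retry a command, doubling the delay after each failure.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command> [args...]
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2

  while true; do
    "$@" && return 0

    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Still failing after $attempt attempts: document as external blocker" >&2
      return 1
    fi

    echo "Attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example (waits 60s, then 120s):
# retry_with_backoff 3 60 curl --fail --silent --head https://service.example.com/health
```

The example adds `--fail` because plain `curl -I` exits 0 even when the server answers with an HTTP error status, which would make every attempt look successful.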

## Step 4: Verify

After recovery, verify clean state:

### Basic Verification

```bash
# Clean working directory
git status
# Expected: "nothing to commit, working tree clean" or known changes

# Tests pass
pnpm test

# Build succeeds
pnpm build

# Types check
pnpm typecheck
```
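
A sketch that runs every check even if an earlier one fails and prints a summary for the recovery report, assuming a hypothetical helper `verify_clean_state`:

```bash
# Hypothetical helper: run each check, report PASS or FAIL, and
# return non-zero if any check failed.
# Usage: verify_clean_state "<command>" ["<command>" ...]
verify_clean_state() {
  local failed=0 check

  for check in "$@"; do
    # Output is hidden here; rerun a failing check directly to see details.
    if sh -c "$check" > /dev/null 2>&1; then
      echo "PASS  $check"
    else
      echo "FAIL  $check"
      failed=1
    fi
  done

  return "$failed"
}

# Example: verify_clean_state "pnpm test" "pnpm build" "pnpm typecheck"
```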

### Functionality Verification

```bash
# Run the specific thing that was broken
# (the filter flag depends on the test runner: --grep for Mocha, -t for Jest and Vitest)
pnpm test --grep "specific test"

# Or verify the feature manually
```

## Step 5: Document

### Issue Comment

```bash
gh issue comment [ISSUE_NUMBER] --body "## Error Recovery

**Error encountered:** [Description]

**Severity:** [Critical/Major/Minor/Info]

**Evidence:**
\`\`\`
[Error output]
\`\`\`

**Recovery actions:**
1. [Action 1]
2. [Action 2]

**Verification:**
- [x] Tests pass
- [x] Build succeeds

**Root cause:** [If known]

**Prevention:** [If applicable]
"
```
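
Escaping backticks inside a double-quoted `--body` string is easy to get wrong. `gh issue comment` also accepts `--body-file`, so the report can be written to a file first. A sketch, assuming a hypothetical helper `write_recovery_report`, a hypothetical 50-line cap on the evidence excerpt, and tilde fences (which GitHub renders like backtick fences):

```bash
# Hypothetical helper: write the start of the recovery report to a file.
# Usage: write_recovery_report <out_file> <severity> <description> <log_file>
write_recovery_report() {
  local out_file="$1" severity="$2" description="$3" log_file="$4"

  {
    printf '## Error Recovery\n\n'
    printf '**Error encountered:** %s\n\n' "$description"
    printf '**Severity:** %s\n\n' "$severity"
    printf '**Evidence:**\n'
    printf '~~~\n'            # tilde fence: nothing to escape
    tail -n 50 "$log_file"    # last 50 lines of the preserved log
    printf '~~~\n'
  } > "$out_file"
}

# Example (requires an authenticated GitHub CLI):
# write_recovery_report recovery-report.md Major "Build fails" error-log.txt
# gh issue comment [ISSUE_NUMBER] --body-file recovery-report.md
```

The remaining sections of the template (recovery actions, verification, root cause, prevention) can be appended to the same file before posting.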

### Knowledge Graph

```javascript
// Store for future reference
mcp__memory__add_observations({
  observations: [{
    entityName: "Issue #[NUMBER]",
    contents: [
      "Encountered [error type] on [date]",
      "Caused by: [root cause]",
      "Resolved by: [recovery action]"
    ]
  }]
});
```

## Common Recovery Patterns

### "Tests were passing, now failing"

```bash
# What changed? (adjust HEAD~3 to reach the last commit where tests passed)
git diff HEAD~3

# Did dependencies change?
git diff HEAD~3 -- pnpm-lock.yaml

# Clean reinstall
rm -rf node_modules && pnpm install --frozen-lockfile
```
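
When the diff does not make the culprit obvious, `git bisect` can binary-search the history between a known-good commit and `HEAD`. A sketch, assuming a hypothetical helper `find_breaking_commit`, a clean working tree, and a test command that exits non-zero on the broken commits:

```bash
# Hypothetical helper: bisect between a known-good commit and HEAD.
# Usage: find_breaking_commit <good_commit> <test_command> [args...]
find_breaking_commit() {
  local good_commit="$1" status
  shift

  git bisect start HEAD "$good_commit" || return 1

  # git runs the command at each step: exit 0 marks the commit good,
  # exit 1-127 (except 125) marks it bad.
  git bisect run "$@"
  status=$?

  git bisect reset   # return to the original HEAD
  return "$status"
}

# Example: find_breaking_commit [GOOD_COMMIT] pnpm test
```

The output names the first bad commit, which narrows the diff to review down to a single change.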

### "Works locally, fails in CI"

```bash
# Check for environment differences
# - Node version
# - OS differences
# - Env vars

# Run with CI-like settings
CI=true pnpm test
```

### "Build was working, now broken"

```bash
# Check TypeScript errors
pnpm typecheck

# Check for circular dependencies
pnpm dlx madge --circular src/

# Clean build
rm -rf dist && pnpm build
```

### "I broke everything"

```bash
# Don't panic
# Find last known good state
git log --oneline

# Reset to that state (discards uncommitted changes, so preserve evidence first)
git reset --hard [GOOD_COMMIT]

# Verify
pnpm test

# Start again more carefully
```

## Escalation

If recovery fails after 2-3 attempts:

```markdown
## Escalation: Unrecoverable Error

**Issue:** #[NUMBER]

**Error:** [Description]

**Recovery attempts:**
1. [Attempt 1] - [Result]
2. [Attempt 2] - [Result]

**Current state:** [Broken/Partially working]

**Evidence preserved:** [Links to logs, screenshots]

**Requesting help with:** [Specific question]
```

Mark the issue as Blocked and wait for human input.

## Checklist

When an error occurs:

- [ ] Severity assessed
- [ ] Evidence preserved (logs, state, screenshots)
- [ ] Recovery action selected
- [ ] Recovery executed
- [ ] Clean state verified
- [ ] Tests pass
- [ ] Build succeeds
- [ ] Issue documented

## Integration

This skill is called by:
- `issue-driven-development` - When errors occur
- `ci-monitoring` - When CI fails

This skill may trigger:
- `research-after-failure` - If the cause is unknown
- `issue-lifecycle` - To update the issue