---
name: incident-response
description: Incident triage, cascade prevention, and postmortem methodology. Use when handling production incidents, designing resilience patterns, or conducting chaos engineering exercises.
---

# Incident Response

Structured incident management from detection through postmortem, with resilience patterns for preventing and containing cascading failures.

## When to Use

- Production incident in progress (outage, degradation, data loss)
- Designing circuit breakers, bulkheads, or fallback strategies
- Conducting or planning chaos engineering exercises
- Writing or reviewing postmortem documents
- Establishing on-call procedures and escalation paths

Avoid when:

- The issue is a development-time bug with no production impact
- Designing general system architecture (use system-design instead)

## Quick Reference

| Topic | Load reference |
| --- | --- |
| **Triage Framework** | `skills/incident-response/references/triage-framework.md` |
| **Postmortem Patterns** | `skills/incident-response/references/postmortem-patterns.md` |

## Incident Response Workflow

### Phase 1: Detect

- Alert fires or user report received
- Confirm the issue is real (not a false positive)
- Identify affected services and user impact scope

### Phase 2: Triage

- Classify severity (P0-P3)
- Assign incident commander
- Open communication channel (war room, Slack channel)
- Begin status page updates

### Phase 3: Contain

- Stop the bleeding: rollback, feature flag, traffic shift
- Prevent cascade: circuit breakers, load shedding, bulkhead isolation (see the sketches at the end of this file)
- Communicate: stakeholder updates every 15 minutes for P0/P1

### Phase 4: Resolve

- Implement fix (minimal viable fix first)
- Validate in staging if time permits
- Deploy with monitoring and rollback plan ready
- Confirm recovery with metrics returning to baseline

### Phase 5: Postmortem

- Document timeline within 48 hours
- Conduct blameless review with all participants
- Identify root cause and contributing factors
- Assign action items with owners and deadlines
- Update runbooks and alerting based on lessons learned

## Severity Framework

| Level | Impact | Response Time | Examples |
|-------|--------|---------------|----------|
| **P0** | Complete outage, data loss, security breach | Immediate (< 5 min) | Service down, data corruption, credential leak |
| **P1** | Major feature broken, significant user impact | < 30 min | Payment processing failed, auth broken for region |
| **P2** | Degraded performance, partial feature loss | < 4 hours | Elevated latency, non-critical feature unavailable |
| **P3** | Minor issue, workaround available | Next business day | UI glitch, slow report generation, cosmetic error |

## Output

- Incident timeline and severity classification
- Containment actions taken
- Postmortem document with action items
- Updated runbooks and alerting rules

## Common Mistakes

- Skipping severity classification and treating everything as P0
- Making changes without a rollback plan
- Forgetting to communicate status to stakeholders
- Writing postmortems that assign blame instead of identifying systemic issues
- Not following up on postmortem action items
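
## Resilience Pattern Sketches

The containment phase above relies on circuit breakers to fail fast when a dependency is unhealthy. As a rough illustration rather than a prescribed implementation, here is a minimal in-process circuit breaker; the threshold and timeout values are placeholder assumptions to tune per service.

```python
import time

class CircuitBreaker:
    """Minimal sketch: trip open after repeated failures, then allow a
    single trial call after a cooldown (half-open state)."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before opening (illustrative default)
        self.reset_timeout = reset_timeout          # seconds before a trial call (illustrative default)
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Past the cooldown: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip (or re-trip) the breaker
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Once the breaker is open, callers fail fast instead of queuing on a dying dependency; after the cooldown, a single trial call decides whether the circuit closes again.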
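
Bulkhead isolation, also listed under containment, caps the concurrency a single dependency can consume so one slow downstream cannot exhaust the caller's worker pool. A hedged sketch using a bounded semaphore (the limit is illustrative):

```python
import threading

class Bulkhead:
    """Minimal sketch: reject calls outright once the concurrency
    budget for a dependency is exhausted, rather than queuing."""

    def __init__(self, max_concurrent=10):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: shedding load")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

Rejecting at the bulkhead is a deliberate load-shedding choice: a fast, explicit failure is easier to contain than threads silently piling up behind a degraded dependency.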