--- name: "Incident Response" description: "Structured production incident triage, resolution, and post-mortem. Apply when production systems are down, degraded, or behaving unexpectedly. Covers detection, containment, resolution, and learning." allowed-tools: Read, Write, Edit, Bash, Grep, Glob version: 2.1.0 compatibility: Claude Opus 4.6, Sonnet 4.6, Claude Code v2.1.x updated: 2026-03-26 --- # Incident Response Structured workflow for production incidents: detect, contain, resolve, learn. ## When to Use - Production service is down or degraded - Users reporting errors or data issues - Monitoring alerts firing - Security incident detected - Data integrity issue discovered ## Severity Classification | Severity | Impact | Response Time | Examples | |:--------:|--------|:---:|---------| | **SEV-1** | Full outage, data breach, revenue loss | Immediate | Service down, auth broken, data leak | | **SEV-2** | Major degradation, subset of users affected | < 30 min | Slow responses, feature broken, partial outage | | **SEV-3** | Minor impact, workaround available | < 4 hours | UI bug, non-critical feature down | | **SEV-4** | No user impact, internal issue | Next business day | Log errors, monitoring gaps | ## Phase 1: Detection & Triage (First 5 Minutes) ```markdown ## Incident Triage **Time detected:** [timestamp] **Reporter:** [who noticed] **Severity:** SEV-[1-4] ### What's happening? [One sentence description of symptoms] ### Who's affected? [All users / subset / internal only] ### What changed recently? [Deployments, config changes, traffic spikes] ``` ### Quick Diagnostic Commands ```bash # Check recent deployments git log --oneline -5 --since="2 hours ago" # Check application logs tail -100 /var/log/app/error.log | grep -i "error\|fatal\|panic" # Check system resources top -bn1 | head -20 df -h free -h # Check database connectivity pg_isready -h $DB_HOST # Check external service health curl -sI https://api.stripe.com/v1 -o /dev/null -w "%{http_code}" ``` ## Phase 2: Containment (Minutes 5-15) **Goal:** Stop the bleeding. Don't fix root cause yet. | Containment Action | When to Use | |---|---| | **Rollback deployment** | Problem started after a deploy | | **Feature flag disable** | New feature causing issues | | **Scale up** | Traffic/load related | | **Failover to backup** | Primary system unrecoverable | | **Rate limit** | Being overwhelmed by requests | | **Block bad actor** | Malicious traffic identified | ```bash # Quick rollback (if deployment caused it) git revert HEAD --no-edit && git push # Or revert to last known good deployment # (platform-specific: Vercel, Railway, etc.) ``` ## Phase 3: Resolution ### Hypothesis-Driven Debugging ```markdown ## Debugging Log ### Hypothesis 1: [Most likely cause] Evidence for: [what supports this] Evidence against: [what contradicts this] Test: [how to verify] Result: [confirmed / rejected] ### Hypothesis 2: [Second most likely] ... ``` ### Common Root Causes | Symptom | Common Cause | Quick Fix | |---------|-------------|-----------| | 500 errors spike | Bad deployment | Rollback | | Slow responses | Database query regression | Kill slow queries, add index | | Connection timeouts | Connection pool exhaustion | Restart, increase pool size | | OOM crashes | Memory leak | Restart, set memory limits | | Auth failures | Token/cert expiry | Rotate credentials | | Data inconsistency | Race condition | Add locking, retry logic | ## Phase 4: Post-Mortem Write within 48 hours of resolution. **Blameless** — focus on systems, not people. 
## Phase 4: Post-Mortem

Write within 48 hours of resolution. Keep it **blameless**: focus on systems, not people.

```markdown
# Post-Mortem: [Incident Title]

**Date:** YYYY-MM-DD
**Duration:** [start time] to [resolution time] ([X] minutes)
**Severity:** SEV-[X]
**Author:** [Name]

## Summary
[2-3 sentence summary of what happened and impact]

## Timeline

| Time | Event |
|------|-------|
| HH:MM | [First symptom / alert] |
| HH:MM | [Incident declared] |
| HH:MM | [Root cause identified] |
| HH:MM | [Fix deployed] |
| HH:MM | [Confirmed resolved] |

## Root Cause
[Technical explanation of what went wrong]

## Impact
- Users affected: [number or percentage]
- Duration: [minutes]
- Revenue impact: [if applicable]
- Data affected: [if applicable]

## What Went Well
- [Good thing 1]
- [Good thing 2]

## What Went Wrong
- [Process/system gap 1]
- [Process/system gap 2]

## Action Items

| Action | Owner | Due Date | Priority |
|--------|-------|----------|:--------:|
| [Preventive measure] | [Name] | [Date] | P1 |
| [Detection improvement] | [Name] | [Date] | P2 |
| [Process improvement] | [Name] | [Date] | P3 |

## Lessons Learned
[What the team should take away from this]
```

## Communication Template

```markdown
## Status Update: [Incident Title]

**Status:** Investigating / Identified / Monitoring / Resolved
**Impact:** [Who/what is affected]
**Current action:** [What we're doing right now]
**ETA:** [When we expect resolution, if known]
**Next update:** [When the next status update will be]
```

A hypothetical delivery sketch for this template appears in the appendix below.

## Sources

- [PagerDuty Incident Response Guide](https://response.pagerduty.com/)
- [Google SRE Handbook — Incident Management](https://sre.google/sre-book/managing-incidents/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)
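## Appendix: Delivering the Status Update

A status update only helps if it reaches people quickly. A minimal sketch of pushing the template above to a chat channel, assuming a Slack incoming webhook; `SLACK_WEBHOOK_URL` and the message details are hypothetical:

```bash
# Post a status update to the incident channel via a Slack incoming webhook.
# SLACK_WEBHOOK_URL is an assumption; substitute your channel's webhook URL.
curl -s -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "*Status:* Identified\n*Impact:* Checkout latency for ~10% of users\n*Current action:* Rolling back latest deploy\n*Next update:* 15 minutes"
  }'
```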