--- name: incident-response description: Respond to production incidents — diagnosis, communication, resolution, post-mortem triggers: [production issue, incident, emergency, error spike, system down, customer impact] --- # SKILL: Incident Response ## The Golden Rules ``` 1. TRANSPARENCY — communicate early and often 2. FOCUS — get the system back online (perfection is secondary) 3. LEARNING — every incident generates a post-mortem 4. BLAMELESS — blame systems, not people ``` ## The Incident Runbook ### Phase 1: Detection & Triage (first 2 minutes) ``` □ Error rate > 1% within 5 minutes → trigger incident □ Page on-call engineer (PagerDuty / Slack integration) □ Declare severity: CRITICAL: 100% of users affected, core feature down HIGH: 50%+ of users affected or > 5% error rate MEDIUM: <50% users affected, workaround exists LOW: minor feature broken, no customer impact □ Create incident ticket with: - Start time (exact second) - Severity level - Affected service(s) - Initial error message / metric spike ``` ### Phase 2: Initial Response (first 5 minutes) ``` □ Gather on-call team: - Service owner (knows the system) - On-call engineer (fixes the issue) - On-call manager (communicates) □ Stop the bleeding (no perfect fix yet): OPTION A: Feature flag → disable the broken feature OPTION B: Scale → if overloaded, add capacity OPTION C: Revert → rollback last deployment (usually fastest = disable feature via flag) □ Communicate: - Update status page: "We are investigating..." - Email customers (if CRITICAL/HIGH) - Slack #incidents: "Service X down, investigating. ETA 10 min" ``` ### Phase 3: Root Cause Analysis (10-30 minutes) Check in this order (usually resolves in first 3): ``` 1. Recent deploy? → check deploy diff 2. Traffic spike? → check dashboard metrics 3. Database slow? → check query performance 4. Third-party issue? → check status pages (AWS, Stripe, etc) 5. Configuration? → verify environment variables 6. Resource exhaustion? → check CPU, memory, disk 7. Concurrency issue? → check logs for race conditions ``` ```python # Quick diagnosis script async def diagnose_incident(): checks = { "last_deploy": await get_last_deploy_time(), "error_rate": await get_current_error_rate(), "p99_latency": await get_p99_latency(), "database_health": await check_db_connection(), "cpu_usage": await get_cpu_percent(), "memory_usage": await get_memory_percent(), "disk_usage": await get_disk_percent(), "recent_errors": await get_latest_error_logs(limit=20), } # Print diagnosis for check, result in checks.items(): status = "🔴" if result["bad"] else "✅" print(f"{status} {check}: {result['message']}") ``` ### Phase 4: Implementation (minutes 30-60+) ``` IF root cause = recent deploy: Option 1: Revert the deploy (1 minute fix) flyctl releases rollback --app [app] git revert [commit-sha] Option 2: Fix forward (fix in code, redeploy) [make minimal fix] git commit -m "fix: [incident issue]" flyctl deploy IF root cause = database: Run query that caused slowness Add index or optimize query Redeploy Monitor for resolution IF root cause = resource exhaustion: Scale up: add more resources Monitor: does it recover? Once stable: investigate why (memory leak, inefficient code) ``` ### Phase 5: Recovery & Verification (as incident resolves) ``` □ Verify fix: - Error rate back to < 0.1% - P95 latency back to baseline - Customer reports of issues stop (monitor support channel) - Feature working in staging - Spot check in production □ Communicate resolution: - Update status page: "Issue resolved" - Email customers: "We have resolved the incident..." - Slack: "Incident resolved at [timestamp], root cause [brief]" □ Monitor closely for next 30 minutes: - Anomaly detection on all metrics - Any signs it's coming back? ``` ### Phase 6: Post-Mortem (within 24 hours) ```markdown # Incident Post-Mortem: [incident name] ## Timeline - 14:32 UTC: Error rate spike detected - 14:34 UTC: On-call team paged - 14:38 UTC: Root cause identified (memory leak in new feature) - 14:41 UTC: Feature disabled via flag - 14:44 UTC: Error rate normalized ## Root Cause Memory leak in new recommendation engine (deploy at 13:45). Each request allocated 100MB, never freed. After 200 requests, service ran out of memory and crashed. ## Why it happened - No memory profiling before deploy - Load testing only tested 10 concurrent users - Production gets 1000 concurrent users ## What we're doing to prevent this - [ ] Implement memory profiling in pre-deploy checks - [ ] Load test with 10x expected concurrent users - [ ] Add memory usage alerts (trigger if > 80% for 2 minutes) - [ ] Feature flags for high-risk features (this one should have had a flag) ## Follow-up - Memory profiling PR: due by [date] - Load testing process: due by [date] - Alert configuration: due by [date] ``` ## Quality checks - [ ] Runbook created and accessible to on-call team - [ ] Status page configured (can update without deployment) - [ ] Feature flags in place for all critical features - [ ] Monitoring alerts configured for critical metrics - [ ] On-call rotation documented - [ ] Post-mortem template created - [ ] Team trained on incident response process