--- name: incident-response description: Run structured incident response following detect-triage-mitigate-resolve-RCA lifecycle. Produces incident records, coordinates responders, and ensures blameless post-mortems with actionable follow-up items. --- # incident-response ## Purpose When production breaks, every minute counts. This skill provides a structured incident response framework that minimizes time-to-recovery and ensures every incident produces learning. ## Incident Severity | Severity | Impact | Response Time | Example | |----------|--------|---------------|---------| | S1 | Service fully down, data loss risk | < 15 min | Database corruption, total outage | | S2 | Major degradation, workaround exists | < 30 min | 50% error rate, payment failures | | S3 | Minor degradation, limited impact | < 2 hours | Slow queries, partial feature broken | | S4 | Cosmetic or non-impacting | < 1 day | UI glitch, log noise | ## Incident Response Lifecycle ### 1. DETECT - Monitoring alert fires, or user report received - Validate the alert is real (not a false positive) - Open an incident record ### 2. TRIAGE - Classify severity (S1–S4) - Identify affected services and users - Assemble response team based on severity - Assign Incident Commander (IC) ### 3. MITIGATE **Priority: Restore service FIRST, investigate LATER** - Rollback recent deployment? - Scale up infrastructure? - Enable feature flag / circuit breaker? - Route traffic to backup? - Communicate status to stakeholders ### 4. RESOLVE - Identify root cause - Deploy fix - Verify fix in production - Monitor for recurrence ### 5. RCA (Root Cause Analysis) Within 48 hours, produce a blameless post-mortem: ```markdown # Incident Post-Mortem: INC-XXXX **Date:** [date] **Duration:** [start to resolution] **Severity:** S[X] **Impact:** [what broke, who was affected, for how long] **IC:** [who led response] ## Timeline | Time | Event | |------|-------| | HH:MM | Alert fired | | HH:MM | IC assigned | | HH:MM | Root cause identified | | HH:MM | Mitigation applied | | HH:MM | Full resolution | ## Root Cause [What actually broke and why — blameless, focused on systems not people] ## Contributing Factors - [Factor 1: e.g., missing monitoring for X] - [Factor 2: e.g., no circuit breaker on Y] ## What Went Well - [Thing 1: fast detection] - [Thing 2: clear escalation] ## What Went Poorly - [Thing 1: slow rollback due to missing runbook] ## Action Items | Priority | Action | Owner | Deadline | |----------|--------|-------|----------| | P0 | [prevent recurrence] | [who] | [when] | | P1 | [improve detection] | [who] | [when] | | P2 | [update runbook] | [who] | [when] | ``` ## Anti-Patterns - Investigating before mitigating → restore service first, find root cause second - Blaming individuals → systems fail, not people; blameless post-mortems are mandatory - No follow-up on action items → unresolved action items mean the same incident will recur - Skipping RCA for S3/S4 → small incidents reveal systemic issues