---
name: incident-postmortem
description: "Write blameless incident postmortems with timeline reconstruction, root cause analysis, action items, and preventive measures"
---

# Incident Postmortem Builder

## Overview

Create blameless incident postmortems that transform operational disruptions into learning opportunities. These documents focus on system failures and process gaps, not individual blame, enabling continuous improvement and preventing recurrence.

## Core Principles

1. **Blameless**: Focus on systems, not people. "Why did this happen?" not "Who screwed up?"
2. **Psychological Safety**: Team members must feel safe discussing root causes without fear
3. **Data-Driven**: Base findings on logs, metrics, and facts, not assumptions
4. **Action-Oriented**: Every finding leads to actionable improvements
5. **Learning Culture**: Treat incidents as valuable learning events, not failures
6. **Transparency**: Share findings broadly; communicate changes to prevent similar incidents

## Timeline Reconstruction

Create a detailed chronology of events:

```
Time (UTC)       | Who      | What                  | Evidence         | Context
-----------------|----------|-----------------------|------------------|-------------------------------
2024-02-15 14:32 | Jenkins  | Deploy v2.1.3 (buggy) | Logs             | Automated Friday deploy
14:35            | Customer | Website errors        | CloudFront       | 500 errors reported
14:37            | On-call  | PagerDuty alert       | Alert            | Error rate exceeded threshold
14:42            | Eng team | Investigation starts  | Slack #incidents | Identified deploy as cause
14:55            | Lead     | Rollback initiated    | Logs             | Reverted to v2.1.2
15:02            | On-call  | Error rate normal     | Metrics          | Customer traffic recovered
15:30            | Team     | Root cause meeting    | Notes            | Identified root cause
```

**Timeline Template** (the sketch below shows the duration arithmetic):

- **T+0 (Alert)**: When the problem first surfaced
- **T+X (Detection)**: When the incident was recognized
- **T+Y (Communication)**: When stakeholders were notified
- **T+Z (Mitigation)**: When the incident owner took action
- **T+N (Resolution)**: When the system returned to normal
- **Duration**: Total time from first impact to full resolution
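
The T+ offsets and the duration row are simple elapsed-time arithmetic over the timeline. A minimal Python sketch of that calculation, using the timestamps from the example above; the event names are illustrative, not a prescribed schema:

```python
from datetime import datetime

# Timestamps from the example timeline above (all UTC). Event names
# are hypothetical labels, not a required format.
events = {
    "impact_start": "2024-02-15T14:35:00",  # customers see 500 errors
    "detection":    "2024-02-15T14:37:00",  # PagerDuty alert fires
    "mitigation":   "2024-02-15T14:55:00",  # rollback initiated
    "resolution":   "2024-02-15T15:03:00",  # fully restored, per the summary header
}
t = {name: datetime.fromisoformat(ts) for name, ts in events.items()}

def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two named timeline events."""
    return (t[end] - t[start]).total_seconds() / 60

print(f"Time to detect:   {minutes_between('impact_start', 'detection'):.0f} min")   # 2
print(f"Time to mitigate: {minutes_between('detection', 'mitigation'):.0f} min")     # 18
print(f"Total duration:   {minutes_between('impact_start', 'resolution'):.0f} min")  # 28
```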
## Root Cause Analysis (5 Whys)

Go beyond the obvious cause to find systemic issues:

```
Incident: Website down for 28 minutes

Why 1: Why did the website go down?
Answer: Deployment v2.1.3 contained a bug causing an infinite loop in the auth service

Why 2: Why did the bug reach production?
Answer: Code review missed the issue; the test suite didn't catch it

Why 3: Why didn't the test suite catch the infinite loop?
Answer: Load/stress tests only run occasionally; not part of the standard CI pipeline

Why 4: Why aren't load tests mandatory in CI?
Answer: Historically slow; the team prioritized speed over reliability

Why 5: Why does the team optimize for deploy speed over testing?
Answer: Pressure to ship features fast; no documented standard for testing rigor

ROOT CAUSE: Process gap - no mandatory load testing in CI; pressure to ship
```

**Avoid**:

- Stopping too early ("operator didn't notice error")
- Human error as root cause ("developer made a mistake")
- Vague findings that don't name a fixable systemic issue

**Focus on**:

- Process failures
- Monitoring gaps
- Communication breakdowns
- Knowledge gaps
- Tool limitations
- Architectural weaknesses

## Contributing Factors (Swiss Cheese Model)

Most incidents involve multiple failures aligning:

```
Incident: Friday-afternoon 28-minute outage

Contributing Factors:
1. Code change made Friday afternoon (rush to deploy before the weekend)
2. No automated rollback capability (manual process)
3. On-call engineer unfamiliar with the new code (joined 3 weeks earlier)
4. No load test coverage for auth service changes (technical debt)
5. Monitoring alert threshold set too high (missed early warning)
6. Deployment not staged; went straight to production (process gap)
7. No change advisory board approval (governance gap)

Any ONE of these alone wouldn't have caused the incident.
Combined, they produced a 28-minute outage.
```

## Root Cause vs. Proximate Cause

**Proximate Cause** (immediate cause):

- Infinite loop in authentication code
- Deployment that shouldn't have happened
- Missed monitoring alert

**Root Cause** (systemic failure):

- Code review process insufficient for critical changes
- Deployment process lacked staged/canary deployment
- Testing strategy doesn't include stress tests
- Knowledge gap in on-call team for recent changes

Focus the postmortem on root causes, not proximate causes.

## Action Items (Follow-up)

**Structure**: [Priority] | [What] | [Why] | [Owner] | [Due Date] | [Status]

### Immediate Actions (0-7 days)

```
CRITICAL | Deploy hotfix for infinite loop | Prevent recurrence | Sarah | 2024-02-15 | DONE
HIGH | Document code change impact | Knowledge transfer | John | 2024-02-16 | IN PROGRESS
MEDIUM | Post-incident communication to customers | Transparency | PM | 2024-02-15 | DONE
```

### Short-term Actions (1-4 weeks)

```
HIGH | Implement automatic canary deployment | Catch issues pre-production | DevOps | 2024-03-01 | PENDING
HIGH | Add auth load tests to CI pipeline | Catch performance issues early | QA | 2024-03-01 | PENDING
MEDIUM | Onboard new on-call engineer on recent changes | Close knowledge gap | Tech Lead | 2024-02-28 | IN PROGRESS
```

### Long-term Actions (1-3 months)

```
MEDIUM | Implement automated rollback capability | Faster recovery time | Arch | 2024-04-15 | PENDING
LOW | Review change advisory board process | Governance improvement | Ops | 2024-05-01 | PENDING
LOW | Schedule quarterly load testing for critical services | Proactive risk management | Perf | 2024-06-01 | PENDING
```

**SMART Action Items** (see the tracking sketch after this list):

- **S**pecific: What exactly needs to be done?
- **M**easurable: How will we know it's complete?
- **A**ssignable: Who owns this?
- **R**ealistic: Can it actually be done?
- **T**ime-bound: When is it due?
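
Action items are easier to review monthly (see Distribution & Follow-up below) when stored as structured records rather than prose. A minimal sketch, assuming hypothetical field names that mirror the table columns above:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    """One follow-up item, mirroring the table columns above."""
    priority: str            # CRITICAL / HIGH / MEDIUM / LOW
    what: str                # Specific: what exactly needs to be done
    why: str                 # rationale tied to a postmortem finding
    owner: str               # Assignable: exactly one owner
    due: date                # Time-bound
    status: str = "PENDING"  # PENDING / IN PROGRESS / DONE

    def is_overdue(self, today: date) -> bool:
        """Flag items that slipped past their due date without closing."""
        return self.status != "DONE" and today > self.due

# Hypothetical item taken from the short-term table above.
item = ActionItem("HIGH", "Add auth load tests to CI pipeline",
                  "Catch performance issues early", "QA", date(2024, 3, 1))
print(item.is_overdue(today=date(2024, 3, 5)))  # True: still PENDING past due
```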
## Preventive Measures

**Prevention Strategy**: How do we prevent this type of incident? (The sketch after this list illustrates the rollback automation.)

1. **Process Changes**:
   - Implement staged/canary deployments for critical services
   - Add a code review requirement for auth service changes
   - Require passing load tests for critical-path changes

2. **Monitoring & Alerting**:
   - Lower the error rate alert threshold (early warning)
   - Add CPU/memory alerts for the auth service
   - Add canary endpoint synthetic monitoring

3. **Automation**:
   - Automatic rollback if error rate >2%
   - Load test gate in the CI pipeline (mandatory, not optional)
   - Automated chaos engineering tests weekly

4. **Documentation & Training**:
   - Document the architecture of the auth service
   - Create a runbook for auth service incidents
   - Schedule a knowledge transfer session for the on-call team

5. **Organizational**:
   - Remove deadline pressure; avoid Friday-afternoon deploys
   - Add the on-call engineer to code reviews of critical services
   - Establish an incident SLA: detection to resolution <15 minutes for P0 incidents
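
As one concrete example, the "automatic rollback if error rate >2%" measure reduces to a threshold check after each deploy. A minimal sketch; `fetch_error_rate` and `rollback` are hypothetical stubs standing in for a real metrics API and deploy tooling:

```python
# Sketch of the "automatic rollback if error rate >2%" measure above.
# fetch_error_rate() and rollback() are hypothetical stubs, not a real API.

ERROR_RATE_THRESHOLD = 0.02  # 2%, per the automation measure above

def fetch_error_rate(window_minutes: int = 5) -> float:
    """Stub: fraction of failed requests over the last window."""
    return 0.031  # e.g., 3.1% of requests failing after a bad deploy

def rollback(to_version: str) -> None:
    """Stub: trigger the deploy pipeline's rollback."""
    print(f"Rolling back to {to_version}")

def post_deploy_guard(previous_version: str) -> bool:
    """Keep the new deploy if healthy; roll back and report otherwise."""
    if fetch_error_rate() > ERROR_RATE_THRESHOLD:
        rollback(previous_version)
        return False
    return True

print(post_deploy_guard("v2.1.2"))  # False: 3.1% > 2%, rollback triggered
```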
## Template Structure

**INCIDENT POSTMORTEM**

Incident ID: INC-2024-047
Date: 2024-02-15
Duration: 28 minutes (14:35 UTC - 15:03 UTC)
Impact: Website unavailable; 0.5M page requests failed
Severity: P1 (Critical)

**INCIDENT SUMMARY**
[1-paragraph overview of what happened and the impact]

**TIMELINE**
[Chronological events table]

**ROOT CAUSE ANALYSIS**
Primary Root Cause: [High-level finding]
5 Whys Analysis: [Why chain]
Contributing Factors: [List of systemic issues]

**IMPACT ANALYSIS**
- Customers Affected: [Number or percentage]
- Duration: [Minutes]
- Revenue Impact: [If quantifiable]
- Reputation Impact: [Qualitative assessment]
- Data Loss: [Yes/No, details if yes]

**DETECTION & RESPONSE**
- Detection Time: [Minutes to detect]
- Response Time: [Minutes to start mitigation]
- Resolution Time: [Minutes to full recovery]
- Response Quality: [Smooth/Some delays/Chaotic, and why]

**WHAT WENT WELL**
- [Good thing 1]: Enabled [outcome]
- [Good thing 2]: Enabled [outcome]
- [Good thing 3]: Enabled [outcome]

*Recognize excellent work; reinforce good behaviors*

**WHAT COULD BE BETTER**
- [Gap 1]: Impact was [consequence]
- [Gap 2]: Impact was [consequence]
- [Gap 3]: Impact was [consequence]

**ACTION ITEMS**
[Immediate, short-term, and long-term actions with owners and dates]

**PREVENTIVE MEASURES**
[How we prevent this incident class in the future]

**APPENDICES**
- Error logs (anonymized)
- Customer communication
- Monitoring graphs during the incident
- Architecture diagram
- Related incidents (historical patterns)

## Postmortem Facilitation

**Blameless Meeting Principles**:

1. **Start with context, not blame**: "Here's what was happening at 14:30 UTC..."
2. **Use neutral language**: "Code changed" not "Code was broken"
3. **Ask curious questions**: "What were you seeing on your screen?" not "Why didn't you check the logs?"
4. **Encourage storytelling**: Let people describe their experience; let the narrative flow
5. **Capture assumptions**: "I assumed..." statements reveal knowledge gaps
6. **No hierarchy**: The on-call engineer's observations are valued the same as the CTO's
7. **Record decisions**: Why did we choose rollback vs. fix? Document the thinking
8. **Record learnings**: What surprised people? What did they learn?

**Participants**:

- On-call engineer (incident responder)
- Service owner
- DevOps/Infrastructure team
- Product/Business owner
- Facilitator (experienced, neutral party)

**Meeting Duration**: 30-60 minutes maximum

**When to Hold**: Within 48 hours of incident resolution, while details are fresh

## Distribution & Follow-up

1. **Share Widely**: The postmortem is an internal learning tool; share it with the full engineering org
2. **Executive Summary**: One-page summary for leadership
3. **Customer Communication**: Be transparent about what happened and the prevention measures
4. **Process Review**: Monthly review of open action items
5. **Trend Analysis**: Quarterly review: are we preventing incident classes or just firefighting?

## Preventing Similar Incidents

**Incident Class Tracking**:

- Authentication failures
- Database performance degradation
- Memory leaks
- Configuration errors
- Dependency failures
- Deployment failures

If the same class happens twice: escalate prevention measures.
If the same incident happens three times: organizational escalation (management review).

---

**Use this skill to**: Transform incidents into learning opportunities, improve system resilience, and build a psychologically safe incident response culture.