---
name: postmortem-writing
description: Write effective blameless postmortems with root cause analysis, timelines, and action items. Use when conducting incident reviews, writing postmortem documents, or improving incident response processes.
---

# Postmortem Writing

Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.

## When to Use This Skill

- Conducting post-incident reviews
- Writing postmortem documents
- Facilitating blameless postmortem meetings
- Identifying root causes and contributing factors
- Creating actionable follow-up items
- Building organizational learning culture

## Core Concepts

### 1. Blameless Culture

| Blame-Focused | Blameless |
|---------------|-----------|
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |

### 2. Postmortem Triggers

- SEV1 or SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
- Incidents requiring unusual intervention

## Quick Start

### Postmortem Timeline

```
Day 0:      Incident occurs
Day 1-2:    Draft postmortem document
Day 3-5:    Postmortem meeting
Day 5-7:    Finalize document, create tickets
Week 2+:    Action item completion
Quarterly:  Review patterns across incidents
```

## Templates

### Template 1: Standard Postmortem

```markdown
# Postmortem: [Incident Title]

**Date**: 2024-01-15
**Authors**: @alice, @bob
**Status**: Draft | In Review | Final
**Incident Severity**: SEV2
**Incident Duration**: 47 minutes

## Executive Summary

On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was database connection pool exhaustion triggered by a code change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.

**Impact**:
- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
- No data loss or security implications

## Timeline (All times UTC)

| Time | Event |
|------|-------|
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert: `payment_error_rate > 5%` |
| 14:33 | On-call engineer @alice acknowledges alert |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision to rollback deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |

## Root Cause Analysis

### What Happened

The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.

### Why It Happened

1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced the pooled `DataSource` with direct `DriverManager.getConnection()` calls.

2. **Contributing Factors**:
   - Code review did not catch the connection handling change
   - No integration tests specifically for connection pool behavior
   - Staging environment has lower traffic, masking the issue
   - Database connection metrics alert threshold was too high (90%)
3. **5 Whys Analysis**:
   - Why did the service fail? → Database connections exhausted
   - Why were connections exhausted? → Each request opened a new connection
   - Why did each request open a new connection? → Code bypassed the connection pool
   - Why did the code bypass the connection pool? → Developer unfamiliar with codebase patterns
   - Why was the developer unfamiliar? → No documentation on connection management patterns

### System Diagram

```
[Client] → [Load Balancer] → [Payment Service] → [Database]
                                    ↓
                        Connection Pool (broken)
                                    ↓
                        Direct connections (cause)
```

## Detection

### What Worked
- Error rate alert fired within 8 minutes of deployment
- Grafana dashboard clearly showed connection spike
- On-call response was swift (2-minute acknowledgment)

### What Didn't Work
- Database connection metric alert threshold too high
- No deployment-correlated alerting
- No canary deployment, which would have caught this earlier

### Detection Gap

The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.

## Response

### What Worked
- On-call engineer quickly identified the database as the issue
- Rollback decision was made decisively
- Clear communication in the incident channel

### What Could Be Improved
- Took 10 minutes to correlate the issue with the recent deployment
- Had to manually check deployment history
- Rollback took 12 minutes (could be faster)

## Impact

### Customer Impact
- 12,000 unique customers affected
- Average impact duration: 35 minutes
- 847 support tickets (~7% of affected users)
- Customer satisfaction score dropped 12 points

### Business Impact
- Estimated revenue loss: $45,000
- Support cost: ~$2,500 (agent time)
- Engineering time: ~8 person-hours

### Technical Impact
- Database primary experienced elevated load
- Some replica lag during incident
- No permanent damage to systems

## Lessons Learned

### What Went Well
1. Alerting detected the issue before customer reports
2. Team collaborated effectively under pressure
3. Rollback procedure worked smoothly
4. Communication was clear and timely

### What Went Wrong
1. Code review missed a critical change
2. Test coverage gap for connection pooling
3. Staging environment doesn't reflect production traffic
4. Alert thresholds were not tuned properly

### Where We Got Lucky
1. Incident occurred during business hours with full team available
2. Database handled the load without failing completely
3. No other incidents occurred simultaneously

## Action Items

| Priority | Action | Owner | Due Date | Ticket |
|----------|--------|-------|----------|--------|
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |

## Appendix

### Supporting Data

#### Error Rate Graph
[Link to Grafana dashboard snapshot]

#### Database Connection Graph
[Link to metrics]

### Related Incidents
- 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)

### References
- [Connection Pool Best Practices](internal-wiki/connection-pools)
- [Deployment Runbook](internal-wiki/deployment-runbook)
```
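The failure mode in the example above hinges on how connections are acquired. As a companion to the template (not part of it), the sketch below contrasts the two patterns in Java. It assumes HikariCP as the pooling library; the class name, query, and connection details are illustrative, not taken from the incident described above.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical repository used only to illustrate the two acquisition patterns.
public class PaymentRepositorySketch {

    // Anti-pattern (what the example postmortem describes): every call opens a
    // brand-new physical connection, so connection count grows with request rate.
    public static long countPendingPaymentsUnpooled(String jdbcUrl, String user, String password)
            throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT COUNT(*) FROM payments WHERE status = 'PENDING'");
             ResultSet rs = stmt.executeQuery()) {
            rs.next();
            return rs.getLong(1);
        }
    }

    // Pooled version: the DataSource is created once; getConnection() borrows from
    // the pool and close() returns the connection instead of tearing it down.
    private final HikariDataSource dataSource;

    public PaymentRepositorySketch(String jdbcUrl, String user, String password) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);
        config.setUsername(user);
        config.setPassword(password);
        config.setMaximumPoolSize(20); // hard cap on physical connections
        this.dataSource = new HikariDataSource(config);
    }

    public long countPendingPaymentsPooled() throws SQLException {
        try (Connection conn = dataSource.getConnection();
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT COUNT(*) FROM payments WHERE status = 'PENDING'");
             ResultSet rs = stmt.executeQuery()) {
            rs.next();
            return rs.getLong(1);
        }
    }
}
```

In a diff, the unpooled version is easy to miss when review attention is on business logic, which is exactly the gap the 5 Whys template below works through.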
### Template 2: 5 Whys Analysis

```markdown
# 5 Whys Analysis: [Incident]

## Problem Statement

Payment service experienced a 47-minute outage due to database connection exhaustion.

## Analysis

### Why #1: Why did the service fail?

**Answer**: Database connections were exhausted, causing all new requests to fail.

**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.

---

### Why #2: Why were database connections exhausted?

**Answer**: Each incoming request opened a new database connection instead of using the connection pool.

**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.

---

### Why #3: Why did the code bypass the connection pool?

**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.

**Evidence**: PR #1234 shows the change, made while fixing a different bug.

---

### Why #4: Why wasn't this caught in code review?

**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.

**Evidence**: Review comments only discuss business logic.

---

### Why #5: Why isn't there a safety net for this type of change?

**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.

**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.

## Root Causes Identified

1. **Primary**: Missing automated tests for infrastructure behavior
2. **Secondary**: Insufficient documentation of architectural patterns
3. **Tertiary**: Code review checklist doesn't include infrastructure considerations

## Systemic Improvements

| Root Cause | Improvement | Type |
|------------|-------------|------|
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs | Document connection patterns | Prevention |
| Review gaps | Update review checklist | Detection |
| No canary | Implement canary deployments | Mitigation |
```
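Both templates point at the same systemic fix: an automated test for connection pool behavior (the P0 action item ENG-1234 above). The sketch below shows one minimal shape such a test might take, assuming JUnit 5, HikariCP, and an in-memory H2 database; in a real suite the loop body would call the repository under test (e.g. `PaymentRepository`) rather than issuing queries directly.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import org.junit.jupiter.api.Test;

import java.sql.Connection;
import java.sql.Statement;

import static org.junit.jupiter.api.Assertions.assertTrue;

// Illustrative "infrastructure behavior" test: routes many requests through a
// deliberately tiny pool and checks that physical connections are reused.
class ConnectionPoolBehaviorTest {

    @Test
    void repeatedQueriesReusePooledConnections() throws Exception {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:h2:mem:pooltest"); // in-memory DB for the test
        config.setMaximumPoolSize(2);              // tiny cap makes reuse observable

        try (HikariDataSource dataSource = new HikariDataSource(config)) {
            // Stand-in for repository calls: 100 sequential requests through the pool.
            for (int i = 0; i < 100; i++) {
                try (Connection conn = dataSource.getConnection();
                     Statement stmt = conn.createStatement()) {
                    stmt.execute("SELECT 1");
                }
            }

            // All 100 requests completed while the pool stayed at or under its cap,
            // i.e. connections were borrowed and returned rather than opened per call.
            int totalConnections = dataSource.getHikariPoolMXBean().getTotalConnections();
            assertTrue(totalConnections <= 2,
                    "expected at most 2 physical connections, saw " + totalConnections);
        }
    }
}
```

A production-grade version would wire the real repository to this `DataSource` and could additionally assert on database-side connection counts, which is the signal that would have caught the `DriverManager` regression.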
### Template 3: Quick Postmortem (Minor Incidents)

```markdown
# Quick Postmortem: [Brief Title]

**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3

## What Happened
API latency spiked to 5s due to cache miss storm after cache flush.

## Timeline
- 10:00 - Cache flush initiated for config update
- 10:02 - Latency alerts fire
- 10:05 - Identified as cache miss storm
- 10:08 - Enabled cache warming
- 10:12 - Latency normalized

## Root Cause
Full cache flush for minor config update caused thundering herd.

## Fix
- Immediate: Enabled cache warming
- Long-term: Implement partial cache invalidation (ENG-999)

## Lessons
Don't full-flush cache in production; use targeted invalidation.
```

## Facilitation Guide

### Running a Postmortem Meeting

```markdown
## Meeting Structure (60 minutes)

### 1. Opening (5 min)
- Remind everyone of blameless culture
- "We're here to learn, not to blame"
- Review meeting norms

### 2. Timeline Review (15 min)
- Walk through events chronologically
- Ask clarifying questions
- Identify gaps in timeline

### 3. Analysis Discussion (20 min)
- What failed?
- Why did it fail?
- What conditions allowed this?
- What would have prevented it?

### 4. Action Items (15 min)
- Brainstorm improvements
- Prioritize by impact and effort
- Assign owners and due dates

### 5. Closing (5 min)
- Summarize key learnings
- Confirm action item owners
- Schedule follow-up if needed

## Facilitation Tips

- Keep discussion on track
- Redirect blame to systems
- Encourage quiet participants
- Document dissenting views
- Time-box tangents
```

## Anti-Patterns to Avoid

| Anti-Pattern | Problem | Better Approach |
|--------------|---------|-----------------|
| **Blame game** | Shuts down learning | Focus on systems |
| **Shallow analysis** | Doesn't prevent recurrence | Ask "why" 5 times |
| **No action items** | Waste of time | Always have concrete next steps |
| **Unrealistic actions** | Never completed | Scope to achievable tasks |
| **No follow-up** | Actions forgotten | Track in ticketing system |

## Best Practices

### Do's
- **Start immediately** - Memory fades fast
- **Be specific** - Exact times, exact errors
- **Include graphs** - Visual evidence
- **Assign owners** - No orphan action items
- **Share widely** - Organizational learning

### Don'ts
- **Don't name and shame** - Ever
- **Don't skip small incidents** - They reveal patterns
- **Don't make it a blame doc** - That kills learning
- **Don't create busywork** - Actions should be meaningful
- **Don't skip follow-up** - Verify actions completed

## Resources

- [Google SRE - Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
- [Etsy's Blameless Postmortems](https://codeascraft.com/2012/05/22/blameless-postmortems/)
- [PagerDuty Postmortem Guide](https://postmortems.pagerduty.com/)