---
name: incident-management
description: Handle production incidents effectively. Use when responding to outages, conducting post-mortems, or improving reliability. Covers incident response and blameless culture.
allowed-tools: Read, Write, Glob, Grep
---

# Incident Management

## Incident Severity

| Level | Impact | Response Time |
|-------|--------|---------------|
| SEV1 | Complete outage | Immediate |
| SEV2 | Major degradation | < 15 min |
| SEV3 | Minor degradation | < 1 hour |
| SEV4 | Low impact | Next business day |

## Incident Response

### 1. Detect
- Monitoring alerts
- Customer reports
- Error logs

### 2. Triage
- Assess severity
- Assign incident commander
- Create communication channel

### 3. Investigate
- Check recent changes
- Review logs and metrics
- Identify root cause

### 4. Mitigate
- Apply quick fix
- Roll back if needed
- Communicate status

### 5. Resolve
- Confirm fix
- Monitor for recurrence
- Close incident

### 6. Learn
- Post-mortem meeting
- Document findings
- Create action items

## Post-Mortem Template

```markdown
# Post-Mortem: [Incident Title]

## Summary
[Brief description of what happened]

## Timeline
- HH:MM - [Event]
- HH:MM - [Event]
- HH:MM - [Resolution]

## Impact
- Duration: [X hours]
- Users affected: [X]
- Revenue impact: [if applicable]

## Root Cause
[What caused this incident]

## Contributing Factors
- [Factor 1]
- [Factor 2]

## What Went Well
- [Positive 1]
- [Positive 2]

## What Could Be Improved
- [Improvement 1]
- [Improvement 2]

## Action Items
- [ ] [Action 1] - Owner: [Name]
- [ ] [Action 2] - Owner: [Name]
```

## Blameless Culture

- Focus on systems, not people
- "What failed?" not "Who failed?"
- Share learnings openly
- Celebrate near-misses
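
## Example: Response Deadlines

The response times in the Incident Severity table can be turned into concrete deadlines during triage. The following is a minimal sketch, not a definitive implementation. The names `RESPONSE_TARGETS`, `next_business_day`, and `response_deadline` are hypothetical, and two assumptions go beyond the table: weekends are treated as the only non-business days, and "next business day" is read as "by the end of the next business day".

```python
from datetime import datetime, timedelta

# Response targets taken from the severity table.
# SEV4 ("next business day") has no fixed duration and is handled separately.
RESPONSE_TARGETS = {
    "SEV1": timedelta(0),           # immediate
    "SEV2": timedelta(minutes=15),  # < 15 min
    "SEV3": timedelta(hours=1),     # < 1 hour
}


def next_business_day(moment: datetime) -> datetime:
    """Return midnight at the start of the next Monday-Friday day.

    Assumption: weekends are the only non-business days (no holidays).
    """
    day = (moment + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day


def response_deadline(level: str, detected_at: datetime) -> datetime:
    """Return the latest time by which a response should have started."""
    if level == "SEV4":
        # Assumption: the deadline is the end of the next business day,
        # expressed as midnight at the start of the day after it.
        return next_business_day(detected_at) + timedelta(days=1)
    if level not in RESPONSE_TARGETS:
        raise ValueError(f"Unknown severity level: {level}")
    return detected_at + RESPONSE_TARGETS[level]
```

For example, `response_deadline("SEV2", detected_at)` returns a time 15 minutes after detection, which an on-call tool could compare against the moment an incident commander was assigned.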