---
name: incident-response
description: Respond to production incidents systematically with triage, investigation, resolution, and post-mortem analysis to minimize downtime and prevent recurrence. Use when handling production outages, triaging incidents, investigating critical bugs, coordinating incident response, implementing hotfixes, conducting post-mortems, or establishing incident response procedures.
---

# Incident Response - Production Issue Management

## When to use this skill

- Responding to production outages
- Triaging critical incidents
- Investigating high-severity bugs
- Coordinating incident response teams
- Implementing emergency hotfixes
- Conducting post-mortem analyses
- Establishing incident response procedures
- Communicating status during incidents
- Creating runbooks for common issues
- Implementing rollback strategies
- Documenting incident timelines
- Preventing incident recurrence

## Incident Response Process

### 1. Detect

- Monitoring alerts
- User reports
- Automated checks

### 2. Triage

- Assess severity (P0-P4)
- Page the on-call engineer
- Create an incident channel

### 3. Mitigate

- Roll back to the last known good version
- Scale resources
- Apply a hotfix
- Communicate status

### 4. Resolve

- Verify the fix
- Monitor metrics
- Update the status page
- Close the incident

### 5. Postmortem

- Timeline of events
- Root cause analysis
- Action items
- Follow-up tasks

## Severity Levels

- **P0 (Critical)**: Complete outage, data loss
- **P1 (High)**: Major feature broken, revenue impact
- **P2 (Medium)**: Degraded performance, workaround exists
- **P3 (Low)**: Minor bug, cosmetic issue
- **P4 (Informational)**: Enhancement request

## Example Runbook

```markdown
# High CPU Usage Runbook

## Symptoms
- Server CPU > 90%
- Slow response times
- Request timeouts

## Investigation
1. Check top processes: `top`
2. Check memory: `free -h`
3. Check logs: `tail -f app.log`

## Mitigation
1. Scale horizontally: Add servers
2. Restart service: `systemctl restart app`
3. Rate limit: Enable aggressive rate limiting

## Resolution
1. Identify root cause (N+1 query, memory leak, etc.)
2. Deploy fix
3. Monitor for 1 hour
```

## Communication Template

```
[INCIDENT] Service X degraded

Status: Investigating
Impact: 20% of users seeing slow load times
ETA: 30 minutes

Updates:
- 10:00 AM: Issue detected
- 10:05 AM: On-call paged, investigation started
- 10:15 AM: Root cause identified (database bottleneck)
- 10:30 AM: Fix deployed, monitoring

Next update: 11:00 AM
```

## Resources

- [Incident Management Guide](https://www.pagerduty.com/resources/learn/what-is-incident-management/)
- [Postmortem Template](https://github.com/dastergon/postmortem-templates)
- [PagerDuty](https://www.pagerduty.com/)
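
## Example Triage and Status Update Sketch

The severity levels and communication template described in this document can be sketched in code. The following is a minimal illustration only, not a prescribed implementation: the names `Severity`, `classify`, `should_page`, and `StatusUpdate` are hypothetical, the symptom flags are assumed inputs, and the rule that only P0 and P1 page the on-call engineer is an assumed policy that this document does not state.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from this document (lower number = more urgent)."""
    P0 = 0  # Complete outage, data loss
    P1 = 1  # Major feature broken, revenue impact
    P2 = 2  # Degraded performance, workaround exists
    P3 = 3  # Minor bug, cosmetic issue
    P4 = 4  # Enhancement request


def classify(complete_outage=False, data_loss=False,
             major_feature_broken=False, revenue_impact=False,
             degraded=False, minor_bug=False):
    """Map observed symptoms to a severity. The flags are hypothetical inputs."""
    if complete_outage or data_loss:
        return Severity.P0
    if major_feature_broken or revenue_impact:
        return Severity.P1
    if degraded:
        return Severity.P2
    if minor_bug:
        return Severity.P3
    return Severity.P4


def should_page(severity):
    """Assumed policy (not from this document): page on-call for P0 and P1."""
    return severity <= Severity.P1


@dataclass
class StatusUpdate:
    """One status message laid out like the communication template."""
    headline: str
    status: str
    impact: str
    eta: str
    next_update: str
    timeline: list = field(default_factory=list)  # (time, message) pairs

    def render(self):
        lines = [
            f"[INCIDENT] {self.headline}",
            "",
            f"Status: {self.status}",
            f"Impact: {self.impact}",
            f"ETA: {self.eta}",
            "",
            "Updates:",
        ]
        lines += [f"- {when}: {what}" for when, what in self.timeline]
        lines += ["", f"Next update: {self.next_update}"]
        return "\n".join(lines)


update = StatusUpdate(
    headline="Service X degraded",
    status="Investigating",
    impact="20% of users seeing slow load times",
    eta="30 minutes",
    next_update="11:00 AM",
    timeline=[
        ("10:00 AM", "Issue detected"),
        ("10:05 AM", "On-call paged, investigation started"),
    ],
)
print(update.render())
```

Keeping the timeline as a list of `(time, message)` pairs means each new update is a single `append`, and the same data can later seed the postmortem timeline.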