---
name: managing-incidents
description: Guide incident response from detection to post-mortem using SRE principles, severity classification, on-call management, blameless culture, and communication protocols. Use when setting up incident processes, designing escalation policies, or conducting post-mortems.
---

# Incident Management

Provide end-to-end incident management guidance covering detection, response, communication, and learning. Emphasizes SRE culture, blameless post-mortems, and structured processes for high-reliability operations.

## When to Use This Skill

Apply this skill when:
- Setting up incident response processes for a team
- Designing on-call rotations and escalation policies
- Creating runbooks for common failure scenarios
- Conducting blameless post-mortems after incidents
- Implementing incident communication protocols (internal and external)
- Choosing incident management tooling and platforms
- Improving MTTR and incident frequency metrics

## Core Principles

### Incident Management Philosophy

**Declare Early and Often:** Do not wait for certainty. Declaring an incident enables coordination, can be downgraded if needed, and prevents delayed response.

**Mitigation First, Root Cause Later:** Stop customer impact immediately (rollback, disable feature, failover). Debug and fix root cause after stability restored.

**Blameless Culture:** Assume good intentions. Focus on how systems failed, not who failed. Create psychological safety for honest learning.

**Clear Command Structure:** Assign Incident Commander (IC) to own coordination. IC delegates tasks but does not do hands-on debugging.

**Communication is Critical:** Internal coordination via dedicated channels, external transparency via status pages. Update stakeholders every 15-30 minutes during critical incidents.

## Severity Classification

Standard severity levels with response times:

**SEV0 (P0) - Critical Outage:**
- Impact: Complete service outage, critical data loss, payment processing down
- Response: Page immediately 24/7, all hands on deck, executive notification
- Example: API completely down, entire customer base affected

**SEV1 (P1) - Major Degradation:**
- Impact: Major functionality degraded, significant customer subset affected
- Response: Page during business hours, escalate off-hours, IC assigned
- Example: 15% error rate, critical feature unavailable

**SEV2 (P2) - Minor Issues:**
- Impact: Minor functionality impaired, edge case bug, small user subset
- Response: Email/Slack alert, next business day response
- Example: UI glitch, non-critical feature slow

**SEV3 (P3) - Low Impact:**
- Impact: Cosmetic issues, no customer functionality affected
- Response: Ticket queue, planned sprint
- Example: Visual inconsistency, documentation error

For detailed severity decision framework and interactive classifier, see `references/severity-classification.md`.

## Incident Roles

**Incident Commander (IC):**
- Owns overall incident response and coordination
- Makes strategic decisions (rollback vs. debug, when to escalate)
- Delegates tasks to responders (does NOT do hands-on debugging)
- Declares incident resolved when stability confirmed

**Communications Lead:**
- Posts status updates to internal and external channels
- Coordinates with stakeholders (executives, product, support)
- Drafts post-incident customer communication
- Cadence: Every 15-30 minutes for SEV0/SEV1

**Subject Matter Experts (SMEs):**
- Hands-on debugging and mitigation
- Execute runbooks and implement fixes
- Provide technical context to IC

**Scribe:**
- Documents timeline, actions, decisions in real-time
- Records incident notes for post-mortem reconstruction

Assign roles based on severity:
- SEV2/SEV3: Single responder
- SEV1: IC + SME(s)
- SEV0: IC + Communications Lead + SME(s) + Scribe

For detailed role responsibilities, see `references/incident-roles.md`.

## On-Call Management

### Rotation Patterns

**Primary + Secondary:**
- Primary: First responder
- Secondary: Backup if primary doesn't ack within 5 minutes
- Rotation length: 1 week (optimal balance)

**Follow-the-Sun (24/7):**
- Team A: US hours, Team B: Europe hours, Team C: Asia hours
- Benefit: No night shifts, improved work-life balance
- Requires: Multiple global teams

**Tiered Escalation:**
- Tier 1: Junior on-call (common issues, runbook-driven)
- Tier 2: Senior on-call (complex troubleshooting)
- Tier 3: Team lead/architect (critical decisions)

### Best Practices

- Rotation length: 1 week per rotation
- Handoff ceremony: 30-minute call to discuss active issues
- Compensation: On-call stipend + time off after major incidents
- Tooling: PagerDuty, Opsgenie, or incident.io
- Limits: Max 2-3 pages per night; escalate if exceeded

## Incident Response Workflow

Standard incident lifecycle:

```
Detection → Triage → Declaration → Investigation
  ↓
Mitigation → Resolution → Monitoring → Closure
  ↓
Post-Mortem (within 48 hours)
```

### Key Decision Points

**When to Declare:** When in doubt, declare (can always downgrade severity)

**When to Escalate:**
- No progress after 30 minutes
- Severity increases (SEV2 → SEV1)
- Specialized expertise needed

**When to Close:**
- Issue resolved and stable for 30+ minutes
- Monitoring shows all metrics at baseline
- No customer-reported issues

For complete workflow details, see `references/incident-workflow.md`.

## Communication Protocols

### Internal Communication

**Incident Slack Channel:**
- Format: `#incident-YYYY-MM-DD-topic-description`
- Pin: Severity, IC name, status update template, runbook links

**War Room:** Video call for SEV0/SEV1 requiring real-time voice coordination

**Status Update Cadence:**
- SEV0: Every 15 minutes
- SEV1: Every 30 minutes
- SEV2: Every 1-2 hours or at major milestones

### External Communication

**Status Page:**
- Tools: Statuspage.io, Instatus, custom
- Stages: Investigating → Identified → Monitoring → Resolved
- Transparency: Acknowledge issue publicly, provide ETAs when possible

**Customer Email:**
- When: SEV0/SEV1 affecting customers
- Timing: Within 1 hour (acknowledge), post-resolution (full details)
- Tone: Apologetic, transparent, action-oriented

**Regulatory Notifications:**
- Data Breach: GDPR requires notification within 72 hours
- Financial Services: Immediate notification to regulators
- Healthcare: HIPAA breach notification rules

For communication templates, see `examples/communication-templates.md`.

## Runbooks and Playbooks

### Runbook Structure

Every runbook should include:
1. **Trigger:** Alert conditions that activate this runbook
2. **Severity:** Expected severity level
3. **Prerequisites:** System state requirements
4. **Steps:** Numbered, executable commands (copy-pasteable)
5. **Verification:** How to confirm fix worked
6. **Rollback:** How to undo if steps fail
7. **Owner:** Team/person responsible
8. **Last Updated:** Date of last revision

### Best Practices

- **Executable:** Commands copy-pasteable, not just descriptions
- **Tested:** Run during disaster recovery drills
- **Versioned:** Track changes in Git
- **Linked:** Reference from alert definitions
- **Automated:** Convert manual steps to scripts over time

For runbook templates, see `examples/runbooks/` directory.

## Blameless Post-Mortems

### Blameless Culture Tenets

**Assume Good Intentions:** Everyone made the best decision with information available.

**Focus on Systems:** Investigate how processes failed, not who failed.

**Psychological Safety:** Create environment where honesty is rewarded.

**Learning Opportunity:** Incidents are gifts of organizational knowledge.

### Post-Mortem Process

**1. Schedule Review (Within 48 Hours):** While memory is fresh

**2. Pre-Work:** Reconstruct timeline, gather metrics/logs, draft document

**3. Meeting Facilitation:**
- Timeline walkthrough
- 5 Whys Analysis to identify systemic root causes
- What Went Well / What Went Wrong
- Define action items with owners and due dates

**4. Post-Mortem Document:**
- Sections: Summary, Timeline, Root Cause, Impact, What Went Well/Wrong, Action Items
- Distribution: Engineering, product, support, leadership
- Storage: Archive in searchable knowledge base

**5. Follow-Up:** Track action items in sprint planning

For detailed facilitation guide and template, see `references/blameless-postmortems.md` and `examples/postmortem-template.md`.

## Alert Design Principles

**Actionable Alerts Only:**
- Every alert requires human action
- Include graphs, runbook links, recent changes
- Deduplicate related alerts
- Route to appropriate team based on service ownership

**Preventing Alert Fatigue:**
- Audit alerts quarterly: Remove non-actionable alerts
- Increase thresholds for noisy metrics
- Use anomaly detection instead of static thresholds
- Limit: Max 2-3 pages per night

## Tool Selection

### Incident Management Platforms

**PagerDuty:**
- Best for: Established enterprises, complex escalation policies
- Cost: $19-41/user/month
- When: Team size 10+, budget $500+/month

**Opsgenie:**
- Best for: Atlassian ecosystem users, flexible routing
- Cost: $9-29/user/month
- When: Using Atlassian products, budget $200-500/month

**incident.io:**
- Best for: Modern teams, AI-powered response, Slack-native
- When: Team size 5-50, Slack-centric culture

For detailed tool comparison, see `references/tool-comparison.md`.

### Status Page Solutions

**Statuspage.io:** Most trusted, easy setup ($29-399/month)
**Instatus:** Budget-friendly, modern design ($19-99/month)

## Metrics and Continuous Improvement

### Key Incident Metrics

**MTTA (Mean Time To Acknowledge):**
- Target: < 5 minutes for SEV1
- Improvement: Better on-call coverage

**MTTR (Mean Time To Recovery):**
- Target: < 1 hour for SEV1
- Improvement: Runbooks, automation

**MTBF (Mean Time Between Failures):**
- Target: > 30 days for critical services
- Improvement: Root cause fixes

**Incident Frequency:**
- Track: SEV0, SEV1, SEV2 counts per month
- Target: Downward trend

**Action Item Completion Rate:**
- Target: > 90%
- Improvement: Sprint integration, ownership clarity

### Continuous Improvement Loop

```
Incident → Post-Mortem → Action Items → Prevention
   ↑                                          ↓
   └──────────── Fewer Incidents ─────────────┘
```

## Decision Frameworks

### Severity Classification Decision Tree

```
Is production completely down or critical data at risk?
├─ YES → SEV0
└─ NO  → Is major functionality degraded?
          ├─ YES → Is there a workaround?
          │        ├─ YES → SEV1
          │        └─ NO  → SEV0
          └─ NO  → Are customers impacted?
                   ├─ YES → SEV2
                   └─ NO  → SEV3
```

Use interactive classifier: `python scripts/classify-severity.py`

### Escalation Matrix

For detailed escalation guidance, see `references/escalation-matrix.md`.

### Mitigation vs. Root Cause

**Prioritize Mitigation When:**
- Active customer impact ongoing
- Quick fix available (rollback, disable feature)

**Prioritize Root Cause When:**
- Customer impact already mitigated
- Fix requires careful analysis

**Default:** Mitigation first (99% of cases)

## Anti-Patterns to Avoid

- **Delayed Declaration:** Waiting for certainty before declaring incident
- **Skipping Post-Mortems:** "Small" incidents still provide learning
- **Blame Culture:** Punishing individuals prevents systemic learning
- **Ignoring Action Items:** Post-mortems without follow-through waste time
- **No Clear IC:** Multiple people leading creates confusion
- **Alert Fatigue:** Noisy, non-actionable alerts cause on-call to ignore pages
- **Hands-On IC:** IC should delegate debugging, not do it themselves

## Implementation Checklist

### Phase 1: Foundation (Week 1)
- [ ] Define severity levels (SEV0-SEV3)
- [ ] Choose incident management platform
- [ ] Set up basic on-call rotation
- [ ] Create incident Slack channel template

### Phase 2: Processes (Weeks 2-3)
- [ ] Create first 5 runbooks for common incidents
- [ ] Set up status page
- [ ] Train team on incident response
- [ ] Conduct tabletop exercise

### Phase 3: Culture (Weeks 4+)
- [ ] Conduct first blameless post-mortem
- [ ] Establish post-mortem cadence
- [ ] Implement MTTA/MTTR dashboards
- [ ] Track action items in sprint planning

### Phase 4: Optimization (Months 3-6)
- [ ] Automate incident declaration
- [ ] Implement runbook automation
- [ ] Monthly disaster recovery drills
- [ ] Quarterly incident trend reviews

## Integration with Other Skills

**Observability:** Monitoring alerts trigger incidents → Use incident-management for response

**Disaster Recovery:** DR provides recovery procedures → Incident-management provides operational response

**Security Incident Response:** Similar process with added compliance/forensics

**Infrastructure-as-Code:** IaC enables fast recovery via automated rebuild

**Performance Engineering:** Performance incidents trigger response → Performance team investigates post-mitigation

## Examples and Templates

**Runbook Templates:**
- `examples/runbooks/database-failover.md`
- `examples/runbooks/cache-invalidation.md`
- `examples/runbooks/ddos-mitigation.md`

**Post-Mortem Template:**
- `examples/postmortem-template.md` - Complete blameless post-mortem structure

**Communication Templates:**
- `examples/communication-templates.md` - Status updates, customer emails

**On-Call Handoff:**
- `examples/oncall-handoff-template.md` - Weekly handoff format

**Integration Scripts:**
- `examples/integrations/pagerduty-slack.py`
- `examples/integrations/statuspage-auto-update.py`
- `examples/integrations/postmortem-generator.py`

## Scripts

**Interactive Severity Classifier:**
```bash
python scripts/classify-severity.py
```
Asks questions to determine appropriate severity level based on impact and urgency.

## Further Reading

**Books:**
- Google SRE Book: "Postmortem Culture" (Chapter 15)
- "The Phoenix Project" by Gene Kim
- "Site Reliability Engineering" (Full book)

**Online Resources:**
- Atlassian: "How to Run a Blameless Postmortem"
- PagerDuty: "Incident Response Guide"
- Google SRE: "Postmortem Culture: Learning from Failure"

**Standards:**
- Incident Command System (ICS) - FEMA standard adapted for tech
- ITIL Incident Management - Traditional IT service management