---
name: incident-runbook-templates
description: Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.
---

# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## When to Use This Skill

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| **SEV1** | Complete outage, data loss | 15 min | Production down |
| **SEV2** | Major degradation | 30 min | Critical feature broken |
| **SEV3** | Minor impact | 2 hours | Non-critical bug |
| **SEV4** | Minimal impact | Next business day | Cosmetic issue |

### 2. Runbook Structure

```
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
```

## Runbook Templates

### Template 1: Service Outage Runbook

````markdown
# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks

- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history
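These checks can be collected into one script so nothing is skipped under pressure. A minimal sketch, assuming the namespace, deployment, and health endpoint named in this runbook (adjust the names for your environment):

```bash
#!/usr/bin/env bash
# quick-triage.sh: first-pass checks using the names assumed from this runbook.
# No "set -e" on purpose: a failing check should not stop the remaining checks.
set -uo pipefail

NAMESPACE="payments"
DEPLOYMENT="payment-service"
HEALTH_URL="https://api.company.com/payments/health"

echo "== Pod status =="
kubectl get pods -n "$NAMESPACE" -l app="$DEPLOYMENT"

echo "== Recent rollouts =="
kubectl rollout history "deployment/$DEPLOYMENT" -n "$NAMESPACE" | tail -n 5

echo "== Health endpoint =="
# -m 5 keeps the check from hanging during an outage
curl -sS -m 5 -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" "$HEALTH_URL" \
  || echo "health check failed"
```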
### 3. Initial Classification

| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
    AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill a long-running query if needed (substitute a pid from step 2)
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(<pid>);"

# Step 4: Check external dependency latency
# (curl-format.txt is a local file of curl timing variables such as time_connect and time_total)
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If a specific endpoint is failing, disable it behind a feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
    AND created_at > now() - interval '1 hour';"
```

### 4.4 Traffic Surge

```bash
# Step 1: Check current request rate
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If this is an attack, block suspicious source IPs
# (example NetworkPolicy; replace the CIDR with the offending range.
#  If traffic enters through a load balancer, block at the ingress/WAF instead.)
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious-ips
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 203.0.113.0/24   # suspicious range (example)
EOF
```

## Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |

## Communication Templates

### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```
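If the team posts these notifications programmatically, a webhook call keeps the format consistent across responders. A minimal sketch, assuming a Slack incoming webhook for #payments-incidents is exported as `SLACK_WEBHOOK_URL` (hypothetical; substitute your team's webhook):

```bash
# Post the initial notification via a Slack incoming webhook.
# SLACK_WEBHOOK_URL is assumed to be configured for the on-call environment.
curl -sS -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "🚨 INCIDENT: Payment Service Degradation\nSeverity: SEV2\nStatus: Investigating\nImpact: ~20% of payment requests failing\nIncident Commander: [NAME]\nUpdates in #payments-incidents"
  }'
```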
### Status Update

```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```

### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
````

### Template 2: Database Incident Runbook

````markdown
# Database Incident Runbook

## Quick Reference

| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion

```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND query_start < now() - interval '10 minutes';
```

## Replication Lag

```sql
-- Check lag on replica
SELECT CASE
  WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
  ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```

## Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
         FROM pg_catalog.pg_statio_user_tables
         ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"

# VACUUM to reclaim space
# NOTE: VACUUM FULL locks the table and needs free space roughly equal to the
# table's size while it rewrites it
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
````

## Best Practices

### Do's

- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts

- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Resources

- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)