---
name: runbook-creation
description: Create operational runbooks, playbooks, standard operating procedures (SOPs), and incident response guides. Use when documenting operational procedures, on-call guides, or incident response processes.
---

# Runbook Creation

## Overview

Create comprehensive operational runbooks that provide step-by-step procedures for common operational tasks, incident response, and system maintenance.

## When to Use

- Incident response procedures
- Standard operating procedures (SOPs)
- On-call playbooks
- System maintenance guides
- Disaster recovery procedures
- Deployment runbooks
- Escalation procedures
- Service restoration guides

## Incident Response Runbook Template

````markdown
# Incident Response Runbook

## Quick Reference

**Severity Levels:**

- P0 (Critical): Complete outage, data loss, security breach
- P1 (High): Major feature down, significant user impact
- P2 (Medium): Minor feature degradation, limited user impact
- P3 (Low): Cosmetic issues, minimal user impact

**Response Times:**

- P0: Immediate (24/7)
- P1: 15 minutes (business hours), 1 hour (after hours)
- P2: 4 hours (business hours)
- P3: Next business day

**Escalation Contacts:**

- On-call Engineer: PagerDuty rotation
- Engineering Manager: +1-555-0100
- VP Engineering: +1-555-0101
- CTO: +1-555-0102

## Table of Contents

1. [Service Down](#service-down)
2. [Database Issues](#database-issues)
3. [High CPU/Memory Usage](#high-cpu-memory-usage)
4. [API Performance Degradation](#api-performance-degradation)
5. [Security Incidents](#security-incidents)
6. [Data Loss Recovery](#data-loss-recovery)
7. [Rollback Procedures](#rollback-procedures)

---

## Service Down

### Symptoms

- Health check endpoint returning 500 errors
- Users unable to access application
- Load balancer showing all instances unhealthy
- Alerts: `service_down`, `health_check_failed`

### Severity: P0 (Critical)

### Initial Response (5 minutes)

1. **Acknowledge the incident**

   ```bash
   # Acknowledge in PagerDuty
   # Post in #incidents Slack channel
   ```

2. **Create incident channel**

   ```
   Create Slack channel: #incident-YYYY-MM-DD-service-down
   Post incident details and status updates
   ```

3. **Assess impact**

   ```bash
   # Check service status
   kubectl get pods -n production

   # Check recent deployments
   kubectl rollout history deployment/api -n production

   # Check logs
   kubectl logs -f deployment/api -n production --tail=100
   ```

### Investigation Steps

#### Check Application Health

```bash
# 1. Check pod status
kubectl get pods -n production -l app=api

# Expected output: All pods Running
# NAME                   READY   STATUS    RESTARTS   AGE
# api-7d8c9f5b6d-4xk2p   1/1     Running   0          2h
# api-7d8c9f5b6d-7nm8r   1/1     Running   0          2h

# 2. Check pod logs for errors
kubectl logs -f deployment/api -n production --tail=100 | grep -i error

# 3. Check application endpoints
curl -v https://api.example.com/health
curl -v https://api.example.com/api/v1/status

# 4. Check database connectivity
kubectl exec -it deployment/api -n production -- sh
psql $DATABASE_URL -c "SELECT 1"
```

#### Check Infrastructure

```bash
# 1. Check load balancer target health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State]' \
  --output table

# 2. Check DNS resolution
dig api.example.com
nslookup api.example.com

# 3. Check SSL certificates
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates

# 4. Check network connectivity
kubectl exec -it deployment/api -n production -- \
  curl -v https://database.example.com:5432
```
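#### Capture a Diagnostic Snapshot (Optional)

Before digging into the database, it can be worth capturing a point-in-time snapshot of what has been observed so far, both for responders joining mid-incident and for the post-mortem timeline. A minimal sketch, assuming the same `production` namespace, `api` deployment, and health endpoint used in the checks above:

```bash
# Hypothetical helper: dump the usual checks into a timestamped directory
# so the output can be attached to the incident channel and post-mortem.
snapshot_dir="incident-$(date -u +%Y%m%d-%H%M%S)"
mkdir -p "$snapshot_dir"

kubectl get pods -n production -l app=api -o wide         > "$snapshot_dir/pods.txt"
kubectl get events -n production --sort-by=.lastTimestamp > "$snapshot_dir/events.txt"
kubectl rollout history deployment/api -n production      > "$snapshot_dir/rollout-history.txt"
kubectl logs deployment/api -n production --tail=500      > "$snapshot_dir/api-logs.txt"
curl -sv https://api.example.com/health                   > "$snapshot_dir/health.txt" 2>&1

echo "Snapshot written to $snapshot_dir/"
```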
#### Check Database

```bash
# 1. Check database connections
psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity"

# 2. Check for locks
psql $DATABASE_URL -c "
  SELECT pid, usename, pg_blocking_pids(pid) AS blocked_by, query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0
"

# 3. Check database size
psql $DATABASE_URL -c "
  SELECT pg_size_pretty(pg_database_size(current_database()))
"

# 4. Check long-running queries
psql $DATABASE_URL -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
  ORDER BY duration DESC
  LIMIT 10
"
```

### Resolution Steps

#### Option 1: Restart Pods (Quick Fix)

```bash
# Restart all pods (rolling restart)
kubectl rollout restart deployment/api -n production

# Watch restart progress
kubectl rollout status deployment/api -n production

# Verify pods are healthy
kubectl get pods -n production -l app=api
```

#### Option 2: Scale Up (If Overloaded)

```bash
# Check current replicas
kubectl get deployment api -n production

# Scale up
kubectl scale deployment/api -n production --replicas=10

# Watch scaling
kubectl get pods -n production -l app=api -w
```

#### Option 3: Rollback (If Bad Deploy)

```bash
# Check deployment history
kubectl rollout history deployment/api -n production

# Rollback to previous version
kubectl rollout undo deployment/api -n production

# Rollback to specific revision
kubectl rollout undo deployment/api -n production --to-revision=5

# Verify rollback
kubectl rollout status deployment/api -n production
```

#### Option 4: Database Connection Reset

```bash
# If the database connection pool is exhausted
kubectl exec -it deployment/api -n production -- sh
kill -HUP 1  # Reload process, reset connections

# Or terminate idle connections from the database side
psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name = 'api' AND state = 'idle'"
```

### Verification

```bash
# 1. Check health endpoint
curl https://api.example.com/health
# Expected: {"status": "healthy"}

# 2. Check API endpoints
curl https://api.example.com/api/v1/users
# Expected: Valid JSON response

# 3. Check metrics
# Visit https://grafana.example.com
# Verify:
# - Error rate < 1%
# - Response time < 500ms
# - All pods healthy

# 4. Check logs for errors
kubectl logs deployment/api -n production --tail=100 | grep -i error
# Expected: No new errors
```
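### Ongoing Monitoring (Optional)

While the incident channel stays open, it can help to keep a lightweight watch running so a recurrence is spotted before users report it. A minimal sketch, assuming the same health endpoint and deployment as above (adjust the interval as needed):

```bash
# Hypothetical watch loop: report health status and recent error count
# every 30 seconds using the endpoint and deployment names from above.
while true; do
  status=$(curl -s -o /dev/null -w "%{http_code}" https://api.example.com/health)
  errors=$(kubectl logs deployment/api -n production --since=1m 2>/dev/null | grep -ci error)
  echo "$(date -u +%H:%M:%S)  health=${status}  errors_last_1m=${errors:-0}"
  sleep 30
done
```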
### Communication

**Initial Update (within 5 minutes):**

```
🚨 INCIDENT: Service Down
Status: Investigating
Severity: P0
Impact: All users unable to access application
Start Time: 2025-01-15 14:30 UTC

We are investigating reports of users unable to access the application.
Our team is working to identify the root cause.

Next update in 15 minutes.
```

**Progress Update (every 15 minutes):**

```
🔍 UPDATE: Service Down
Status: Identified
Root Cause: Database connection pool exhausted
Action: Restarting application pods
ETA: 5 minutes

We have identified the issue and are implementing a fix.
```

**Resolution Update:**

```
✅ RESOLVED: Service Down
Status: Resolved
Resolution: Restarted application pods, reset database connections
Duration: 23 minutes

The service is now fully operational. We are monitoring closely and will
conduct a post-mortem to prevent future occurrences.
```

### Post-Incident

1. **Create post-mortem document**
   - Timeline of events
   - Root cause analysis
   - Action items to prevent recurrence

2. **Update monitoring**
   - Add alerts for this scenario
   - Improve detection time

3. **Update runbook**
   - Document any new findings
   - Add shortcuts for faster resolution

---

## Database Issues

### High Connection Count

**Symptoms:**

- Database rejecting new connections
- Error: "too many connections"
- Alert: `db_connections_high`

**Quick Fix:**

```bash
# 1. Check connection count
psql $DATABASE_URL -c "
  SELECT count(*), application_name
  FROM pg_stat_activity
  GROUP BY application_name
"

# 2. Kill idle connections
psql $DATABASE_URL -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle'
    AND query_start < now() - interval '10 minutes'
"

# 3. Restart connection pools
kubectl rollout restart deployment/api -n production
```

### Slow Queries

**Symptoms:**

- API response times > 5 seconds
- Database CPU at 100%
- Alert: `slow_query_detected`

**Investigation:**

```sql
-- Find slow queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 10;

-- Check for missing indexes (tables with heavy sequential scans)
SELECT schemaname, tablename, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_scan DESC
LIMIT 10;

-- Kill a long-running query (if needed)
SELECT pg_terminate_backend(12345);  -- Replace with the actual PID
```

---

## High CPU/Memory Usage

### Symptoms

- Pods being OOMKilled
- Response times increasing
- Alerts: `high_memory_usage`, `high_cpu_usage`

### Investigation

```bash
# 1. Check pod resources
kubectl top pods -n production

# 2. Check resource limits
kubectl describe pod <pod-name> -n production | grep -A 5 Limits

# 3. Check for memory leaks
kubectl logs deployment/api -n production | grep -i "out of memory"

# 4. Profile the application (if needed)
kubectl exec -it <pod-name> -n production -- sh
# Run a profiler: node --inspect, py-spy, etc.
```

### Resolution

```bash
# Option 1: Increase resources
kubectl set resources deployment/api -n production \
  --limits=cpu=2000m,memory=4Gi \
  --requests=cpu=1000m,memory=2Gi

# Option 2: Scale horizontally
kubectl scale deployment/api -n production --replicas=6

# Option 3: Restart problematic pods
kubectl delete pod <pod-name> -n production
```

---

## Rollback Procedures

### Application Rollback

```bash
# 1. List deployment history
kubectl rollout history deployment/api -n production

# 2. Check a specific revision
kubectl rollout history deployment/api -n production --revision=5

# 3. Roll back to the previous revision
kubectl rollout undo deployment/api -n production

# 4. Roll back to a specific revision
kubectl rollout undo deployment/api -n production --to-revision=5

# 5. Verify the rollback
kubectl rollout status deployment/api -n production
kubectl get pods -n production
```

### Database Rollback

```bash
# 1. Check migration status
npm run db:migrate:status

# 2. Roll back the last migration
npm run db:migrate:undo

# 3. Roll back to a specific migration
npm run db:migrate:undo -- --to 20250115120000-migration-name

# 4. Verify database state
psql $DATABASE_URL -c "\dt"
```
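Before undoing migrations against a production database, consider taking a backup first so the rollback itself can be reversed if it goes wrong. A minimal sketch using `pg_dump`, assuming `$DATABASE_URL` is set as in the examples above:

```bash
# Take a point-in-time backup before rolling back migrations.
# Assumes $DATABASE_URL points at the production database, as above.
backup_file="pre-rollback-$(date -u +%Y%m%d-%H%M%S).dump"
pg_dump --format=custom --file="$backup_file" "$DATABASE_URL"

# Sanity check: the file should exist and be non-trivial in size
ls -lh "$backup_file"

# If the rollback itself fails, restore with:
# pg_restore --clean --if-exists --dbname="$DATABASE_URL" "$backup_file"
```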
---

## Escalation Path

1. **Level 1 - On-call Engineer** (You)
   - Initial response and investigation
   - Attempt standard fixes from the runbook

2. **Level 2 - Senior Engineers**
   - Escalate if not resolved in 30 minutes
   - Escalate if the issue is complex or unclear
   - Contact via PagerDuty or Slack

3. **Level 3 - Engineering Manager**
   - Escalate if not resolved in 1 hour
   - Escalate if cross-team coordination is needed

4. **Level 4 - VP Engineering / CTO**
   - Escalate for P0 incidents > 2 hours
   - Escalate for security breaches
   - Escalate for data loss

---

## Useful Commands

```bash
# Kubernetes
kubectl get pods -n production
kubectl logs -f <pod-name> -n production
kubectl describe pod <pod-name> -n production
kubectl exec -it <pod-name> -n production -- sh
kubectl top pods -n production

# Database
psql $DATABASE_URL -c "SELECT version()"
psql $DATABASE_URL -c "SELECT * FROM pg_stat_activity"

# AWS
aws ecs list-tasks --cluster production
aws rds describe-db-instances
aws cloudwatch get-metric-statistics ...

# Monitoring URLs
# Grafana: https://grafana.example.com
# Datadog: https://app.datadoghq.com
# PagerDuty: https://example.pagerduty.com
# Status Page: https://status.example.com
```
````

## Best Practices

### ✅ DO

- Include a quick reference section at the top
- Provide exact commands to run
- Document expected outputs
- Include verification steps
- Add communication templates
- Define severity levels clearly
- Document escalation paths
- Include useful links and contacts
- Keep runbooks up to date
- Test runbooks regularly
- Include screenshots/diagrams
- Document common gotchas

### ❌ DON'T

- Use vague instructions
- Skip verification steps
- Forget to document prerequisites
- Assume knowledge of tools
- Skip communication guidelines
- Forget to update after incidents

## Resources

- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Google SRE Book](https://sre.google/books/)
- [Atlassian Incident Handbook](https://www.atlassian.com/incident-management/handbook)