--- name: runbooks-structure user-invocable: false description: Use when creating structured operational runbooks for human operators. Covers runbook organization, documentation patterns, and best practices for clear operational procedures. allowed-tools: - Read - Write - Edit - Bash - Grep - Glob --- # Runbooks - Structure Creating clear, actionable runbooks for operational tasks, maintenance, and troubleshooting. ## What is a Runbook? A runbook is step-by-step documentation for operational tasks: - **Troubleshooting** - Diagnosing and fixing issues - **Incident Response** - Handling production incidents - **Maintenance** - Routine operational tasks - **On-Call** - Reference for on-call engineers ## Basic Runbook Structure ### Minimum Viable Runbook ```markdown # Service Name: Task/Issue ## Overview Brief description of what this runbook covers. ## Prerequisites - Required access/permissions - Tools needed - Knowledge required ## Steps ### 1. First Step Detailed instructions for first action. ### 2. Second Step Detailed instructions for second action. ## Validation How to verify the task was completed successfully. ## Rollback (if applicable) How to undo the changes if needed. ``` ## Comprehensive Runbook Template ```markdown # [Service]: [Task/Issue Title] **Last Updated:** 2025-01-15 **Owner:** Platform Team **Severity:** High/Medium/Low **Estimated Time:** 15 minutes ## Overview Brief description of the problem or task this runbook addresses. ## When to Use This Runbook - Alert fired: `high_cpu_usage` - Customer report: slow response times - Scheduled maintenance window ## Prerequisites - [ ] VPN access to production network - [ ] AWS console access (read/write) - [ ] kubectl configured for production cluster - [ ] Slack access to #incidents channel ## Context ### Architecture Overview Brief explanation of relevant system architecture. ### Common Causes - Database connection pool exhaustion - Memory leaks in worker processes - Third-party API rate limiting ## Diagnosis Steps ### 1. Check System Health ```bash # Check pod status kubectl get pods -n production # Expected output: All pods Running ``` **Decision Point:** If pods are CrashLooping, proceed to step 2. Otherwise, skip to step 3. ### 2. Check Application Logs ```bash # View recent logs kubectl logs -n production deployment/api-server --tail=100 ``` **Look for:** - Error messages containing "connection refused" - Stack traces - Repeated warnings ### 3. Check Database Performance ```bash # Connect to RDS metrics aws cloudwatch get-metric-statistics \ --namespace AWS/RDS \ --metric-name DatabaseConnections \ --dimensions Name=DBInstanceIdentifier,Value=prod-db ``` ## Resolution Steps ### Option A: Restart Services (Quick Fix) 1. **Drain pod gracefully:** ```bash kubectl drain pod/api-server-abc123 --ignore-daemonsets ``` 2. **Verify new pod is healthy:** ```bash kubectl get pods -n production | grep api-server ``` 3. **Monitor for 5 minutes:** Check Datadog dashboard for error rates. ### Option B: Scale Resources (Long-term Fix) 1. **Increase database connection pool:** ```bash # Edit configmap kubectl edit configmap app-config -n production # Change: DB_POOL_SIZE: "10" # To: DB_POOL_SIZE: "20" ``` 2. **Restart deployment:** ```bash kubectl rollout restart deployment/api-server -n production ``` 3. **Monitor rollout:** ```bash kubectl rollout status deployment/api-server -n production ``` ## Validation - [ ] All pods in Running state: `kubectl get pods -n production` - [ ] Error rate < 1%: Check Datadog dashboard - [ ] Response time < 200ms p95: Check Datadog dashboard - [ ] No alerts firing: Check PagerDuty ## Rollback If the fix causes issues: ```bash # Rollback to previous version kubectl rollout undo deployment/api-server -n production # Verify rollback kubectl rollout status deployment/api-server -n production ``` ## Follow-up Actions - [ ] Create post-mortem ticket: JIRA-1234 - [ ] Update monitoring alerts if needed - [ ] Schedule post-incident review - [ ] Document learnings in team wiki ## Related Runbooks Example format: ```markdown - [Database Connection Troubleshooting](./database-connections.md) - [Pod Restart Procedures](./pod-restart.md) - [Incident Response Guide](./incident-response.md) ``` ## Contact Information - **Primary On-Call:** Check PagerDuty schedule - **Escalation:** Platform Team Lead - **Slack Channel:** #platform-incidents ## Revision History | Date | Author | Changes | |------|--------|---------| | 2025-01-15 | Alice | Added validation steps | | 2025-01-10 | Bob | Initial version | ``` ## Runbook Organization ### Directory Structure ``` runbooks/ ├── README.md # Index of all runbooks ├── templates/ │ ├── troubleshooting.md # Template for troubleshooting │ ├── maintenance.md # Template for maintenance tasks │ └── incident-response.md # Template for incidents ├── services/ │ ├── api-gateway/ │ │ ├── high-latency.md │ │ ├── connection-errors.md │ │ └── scaling.md │ └── database/ │ ├── slow-queries.md │ └── replication-lag.md ├── infrastructure/ │ ├── kubernetes/ │ │ ├── pod-restart.md │ │ └── node-drain.md │ └── networking/ │ └── firewall-rules.md └── on-call/ ├── first-responder.md ├── escalation-guide.md └── common-alerts.md ``` ## Best Practices ### Write for Clarity ```markdown # Good: Clear, specific steps ## Restart the API Server 1. Check current pod status: ```bash kubectl get pods -n production -l app=api-server ``` 1. Delete the pod (it will auto-restart): ```bash kubectl delete pod -n production ``` 2. Verify new pod is running: ```bash kubectl get pods -n production -l app=api-server ``` Expected: STATUS = Running ``` ```markdown # Bad: Vague, assumes knowledge ## Fix API Issues 1. Restart the thing 2. Check if it works 3. If not, try something else ``` ### Include Decision Trees ```markdown ## Diagnosis 1. Check if service is responding: ```bash curl -f https://api.example.com/health ``` **If succeeds:** Service is up. Check application logs (Step 3). **If fails:** Service is down. Check pod status (Step 2). 1. Check pod status: ```bash kubectl get pods -n production ``` **If CrashLooping:** Check logs for errors (Step 4). **If Pending:** Check node resources (Step 5). **If Running:** Service should be up. Check DNS (Step 6). ``` ### Provide Expected Outputs ```markdown ## Check Database Connection ```bash psql -h prod-db.example.com -U app -c "SELECT 1" ``` **Expected output:** ``` ?column? ---------- 1 (1 row) ``` **If you see "connection refused":** - Check security groups allow traffic - Verify database is running - Check credentials in secrets manager ``` ### Use Checklists ```markdown ## Pre-Deployment Checklist - [ ] Code reviewed and approved - [ ] Tests passing in CI - [ ] Database migrations tested - [ ] Rollback plan documented - [ ] Monitoring alerts configured - [ ] On-call engineer notified - [ ] Deploy window scheduled ``` ## Common Patterns ### Emergency Response ```markdown # EMERGENCY: Database Down **Time-Sensitive:** Complete within 5 minutes ## Immediate Actions (Do First) 1. **Page database team:** Use PagerDuty incident: "Database outage" 2. **Notify stakeholders:** Post in #incidents: "Database is down. Investigating." 3. **Enable maintenance mode:** ```bash kubectl set env deployment/api-server MAINTENANCE_MODE=true ``` ## Investigation (While Waiting) 1. Check RDS console for instance status 2. Review CloudWatch logs for errors 3. Check recent changes in deploy history ## Communication Template **Update every 10 minutes in #incidents:** > **[HH:MM]** Database still down. Database team investigating. > Impact: All API requests failing. > ETA: Unknown. ``` ### Routine Maintenance ```markdown # Monthly Database Maintenance **Schedule:** First Sunday of each month, 2 AM UTC **Duration:** ~2 hours **Downtime:** None (rolling maintenance) ## Pre-Maintenance **1 week before:** - [ ] Announce maintenance window in #engineering - [ ] Create calendar event - [ ] Prepare rollback plan **1 day before:** - [ ] Verify backup is recent (< 24 hours) - [ ] Test backup restoration (staging) - [ ] Confirm on-call coverage ## Maintenance Steps 1. **Take snapshot:** ```bash aws rds create-db-snapshot \ --db-instance-identifier prod-db \ --db-snapshot-identifier maintenance-$(date +%Y%m%d) ``` 1. **Apply updates:** Follow RDS console update wizard 2. **Monitor reboot:** Watch CloudWatch metrics for 15 minutes ## Post-Maintenance - [ ] Verify all services healthy - [ ] Post completion in #engineering - [ ] Update maintenance log ``` ### Knowledge Transfer ```markdown # New Engineer Onboarding: Deploy Process ## Learning Objectives After completing this runbook, you will: - Understand our deployment pipeline - Know how to deploy to staging - Know how to rollback a bad deploy ## Step 1: Understanding the Pipeline Our deployment flow: ``` GitHub → CI (GitHub Actions) → ArgoCD → Kubernetes ``` **Exercise:** Review a recent deploy in ArgoCD ## Step 2: Deploying to Staging **Shadowing:** Watch a senior engineer deploy first. 1. Create feature branch 2. Open pull request 3. Wait for CI to pass 4. Merge to main 5. Watch ArgoCD sync **Practice:** Deploy your own change to staging. ## Step 3: Deploying to Production **Requirements before production deploy:** - [ ] Completed 3+ staging deploys - [ ] Reviewed with team lead - [ ] Read incident response runbook **Hands-on:** Pair with team lead for first production deploy. ## Certification - [ ] Completed 5 staging deploys - [ ] Completed 1 production deploy (supervised) - [ ] Can explain rollback procedure **Certified by:** _________________ Date: _______ ``` ## Anti-Patterns ### Don't Leave Out Context ```markdown ## Bad: No context ### Fix the API Run this command: ```bash kubectl delete pod api-server-abc ``` ## Good: Explain why ### Restart API Server **When to use:** API returning 500 errors, logs show memory leak. **Why this works:** Deletes pod, Kubernetes creates fresh pod with clean state. ```bash kubectl delete pod api-server-abc ``` **What to expect:** ~30 seconds downtime while new pod starts. ``` ### Don't Skip Validation Steps ```markdown # Bad: No validation ## Deploy New Version 1. Update image tag 2. Apply changes 3. Done! # Good: Include validation ## Deploy New Version 1. Update image tag in deployment.yaml 2. Apply changes: ```bash kubectl apply -f deployment.yaml ``` 1. **Verify rollout:** ```bash kubectl rollout status deployment/api-server ``` Wait for "successfully rolled out" 2. **Check pod health:** ```bash kubectl get pods -l app=api-server ``` All pods should show STATUS: Running 3. **Verify endpoints:** ```bash curl https://api.example.com/health ``` Should return 200 OK ``` ### Don't Assume Knowledge ```markdown # Bad: Uses jargon without explanation ## Fix Pod Check the ingress and make sure the service mesh is working. # Good: Explains terms ## Fix Pod Connectivity 1. **Check ingress** (the load balancer that routes traffic to pods): ```bash kubectl get ingress -n production ``` 1. **Verify service mesh** (Istio, manages pod-to-pod communication): ```bash kubectl get pods -n istio-system ``` All Istio control plane pods should be Running. ``` ## Related Skills - **troubleshooting-guides**: Creating diagnostic procedures - **incident-response**: Handling production incidents