--- name: runbook-creator description: Templates and patterns for creating operational runbooks and playbooks. Use when creating runbooks, writing operational documentation, playbook creation, or documenting procedures for on-call teams. --- # Runbook Creator Templates and best practices for creating effective operational runbooks. ## When to Use This Skill - Creating runbooks for new services - Documenting incident response procedures - Writing operational playbooks - Standardizing on-call documentation - Automating common procedures ## Runbook Principles 1. **Actionable**: Every step should be executable 2. **Testable**: Verify each step works 3. **Current**: Update when systems change 4. **Accessible**: Available during incidents (not behind VPN-only) 5. **Linked**: Referenced from alerts ## Standard Runbook Template Copy and customize this template: ```markdown # [Service Name] - [Issue Type] ## Overview Brief description of what this runbook addresses. **Last Updated**: YYYY-MM-DD **Owner**: [Team/Person] **Related Alerts**: [Alert names that link here] ## Symptoms What indicates this issue is occurring: - [ ] Symptom 1 - [ ] Symptom 2 - [ ] Symptom 3 ## Impact - **Users Affected**: [Description] - **Severity**: [SEV1/SEV2/SEV3/SEV4] - **Business Impact**: [Description] ## Prerequisites - Access to [system/tool] - Permissions: [required permissions] - Tools: [required CLI tools] ## Diagnostic Steps ### Step 1: [Verify the Issue] ```bash # Command to run kubectl get pods -n production | grep -v Running ``` **Expected Output**: [What you should see] **If Different**: [What to do] ### Step 2: [Gather Information] ```bash # Command to run kubectl logs deployment/my-service -n production --tail=100 ``` **Look For**: [What to look for in output] ## Resolution Steps ### Option A: [Quick Fix - e.g., Restart] Use when: [conditions] ```bash # Step 1: Restart the service kubectl rollout restart deployment/my-service -n production # Step 2: Verify pods are coming up kubectl get pods -n production -w ``` **Verification**: [How to confirm fix worked] ### Option B: [Rollback] Use when: [conditions] ```bash # Step 1: Check rollout history kubectl rollout history deployment/my-service -n production # Step 2: Rollback to previous version kubectl rollout undo deployment/my-service -n production ``` **Verification**: [How to confirm fix worked] ## Verification How to confirm the issue is resolved: - [ ] Error rate returned to normal - [ ] Latency within SLO - [ ] No related alerts firing - [ ] User-facing functionality working ## Escalation If this runbook doesn't resolve the issue: 1. **First**: Contact [Team/Person] via [Slack/Phone] 2. **Then**: Page [Escalation contact] 3. **Finally**: [Further escalation path] ## Related Resources - [Dashboard Link](https://grafana/d/xxx) - [Architecture Diagram](link) - [Related Runbook](link) ## Revision History | Date | Author | Change | |------|--------|--------| | YYYY-MM-DD | Name | Initial version | ``` ## Quick Runbook Templates ### Service Restart ```markdown # [Service] - Restart Procedure ## When to Use - Service unresponsive - Memory leak suspected - After configuration change ## Steps 1. **Notify team** ``` Post in #incidents: "Restarting [service] due to [reason]" ``` 2. **Restart service** ```bash kubectl rollout restart deployment/[service] -n [namespace] ``` 3. **Monitor rollout** ```bash kubectl rollout status deployment/[service] -n [namespace] ``` 4. **Verify health** ```bash kubectl get pods -n [namespace] | grep [service] # All pods should be Running, 1/1 Ready ``` 5. **Check metrics** - Error rate: [dashboard link] - Latency: [dashboard link] ## Rollback If restart makes things worse: ```bash kubectl rollout undo deployment/[service] -n [namespace] ``` ``` ### Database Failover ```markdown # [Database] - Failover Procedure ## When to Use - Primary database unresponsive - Planned maintenance - Primary showing errors ## Prerequisites - Database admin access - Verify replica is in sync ## Pre-Failover Checks 1. **Check replication status** ```sql SELECT * FROM pg_stat_replication; ``` Verify: `state = 'streaming'`, lag is minimal 2. **Check replica health** ```bash pg_isready -h replica-host -p 5432 ``` ## Failover Steps 1. **Stop writes to primary** (if possible) ```sql ALTER SYSTEM SET default_transaction_read_only = on; SELECT pg_reload_conf(); ``` 2. **Promote replica** ```bash pg_ctl promote -D /var/lib/postgresql/data ``` 3. **Update connection strings** - Update DNS/load balancer to point to new primary - Or update application config 4. **Verify applications reconnected** ```sql SELECT count(*) FROM pg_stat_activity WHERE state = 'active'; ``` ## Post-Failover - [ ] Monitor error rates - [ ] Set up new replica from old primary - [ ] Update documentation ``` ### Cache Clear ```markdown # [Service] - Cache Clear Procedure ## When to Use - Stale data being served - Cache corruption suspected - After data migration ## Impact Assessment - Cache clear will cause temporary latency spike - Database load will increase temporarily ## Steps 1. **Notify team** ``` Post in #incidents: "Clearing [cache] cache due to [reason]" ``` 2. **Clear cache** **Redis - All keys**: ```bash redis-cli -h [host] FLUSHALL ``` **Redis - Specific pattern**: ```bash redis-cli -h [host] --scan --pattern "user:*" | xargs redis-cli DEL ``` **Application cache**: ```bash curl -X POST http://[service]/admin/cache/clear ``` 3. **Monitor** - Watch cache hit rate recover - Monitor database load - Check latency ## Verification - Cache hit rate returning to normal - No errors from cache operations - Latency stabilizing ``` ## Runbook Checklist Before publishing a runbook, verify: ``` Runbook Quality Checklist: - [ ] Title clearly describes the issue/procedure - [ ] Symptoms section helps identify when to use - [ ] All commands are copy-pasteable - [ ] Expected output documented for each command - [ ] Verification steps confirm success - [ ] Escalation path is clear - [ ] Links to dashboards work - [ ] Tested by someone other than author - [ ] Linked from relevant alerts ``` ## Automation Integration ### Runbook with Automation Hooks ```markdown # [Service] - Automated Recovery ## Automatic Actions The following actions run automatically: 1. Pod restart on OOMKilled (Kubernetes) 2. Scale-up on high CPU (HPA) ## Manual Steps (if auto-recovery fails) ### Check why auto-recovery failed ```bash kubectl describe hpa [service] -n [namespace] kubectl get events -n [namespace] --sort-by='.lastTimestamp' ``` ### Manual intervention [Steps here] ``` ### Script-Backed Runbook ```markdown # [Service] - Diagnostic Script ## Quick Diagnosis Run the diagnostic script: ```bash ./scripts/diagnose-service.sh [service-name] ``` This script checks: - Pod status - Recent logs - Resource usage - Dependency health ## Interpreting Results | Result | Meaning | Action | |--------|---------|--------| | `HEALTHY` | All checks pass | No action needed | | `DEGRADED` | Some issues | Follow specific recommendations | | `CRITICAL` | Major issues | Escalate immediately | ``` ## Common Runbook Categories Every service should have runbooks for: ``` Essential Runbooks: - [ ] Service restart - [ ] Rollback deployment - [ ] Scale up/down - [ ] Clear cache - [ ] Database failover (if applicable) - [ ] Dependency failure response - [ ] High error rate investigation - [ ] High latency investigation ``` ## Additional Resources - [Example Runbooks](references/example-runbooks.md) - [Runbook Automation](references/automation.md)