--- name: runbook-creation description: Operational runbook templates for incident response and procedures allowed-tools: Read, Glob, Grep, Write, Edit --- # Runbook Creation Skill ## When to Use This Skill Use this skill when: - **Runbook Creation tasks** - Working on operational runbook templates for incident response and procedures - **Planning or design** - Need guidance on Runbook Creation approaches - **Best practices** - Want to follow established patterns and standards ## Overview Create operational runbooks for incident response, maintenance procedures, and operational tasks. ## MANDATORY: Documentation-First Approach Before creating runbooks: 1. **Invoke `docs-management` skill** for runbook patterns 2. **Verify SRE best practices** via MCP servers (perplexity) 3. **Base guidance on Google SRE principles** ## Runbook Types ```text Runbook Categories: ┌─────────────────────────────────────────────────────────────────────────────┐ │ Incident Response Runbooks │ │ • Alert-triggered procedures │ │ • Escalation paths │ │ • Communication templates │ ├─────────────────────────────────────────────────────────────────────────────┤ │ Operational Runbooks │ │ • Deployment procedures │ │ • Maintenance tasks │ │ • Backup/restore operations │ ├─────────────────────────────────────────────────────────────────────────────┤ │ Troubleshooting Runbooks │ │ • Diagnostic procedures │ │ • Common issue resolution │ │ • Debug workflows │ ├─────────────────────────────────────────────────────────────────────────────┤ │ Emergency Runbooks │ │ • Disaster recovery │ │ • Security incident response │ │ • Business continuity │ └─────────────────────────────────────────────────────────────────────────────┘ ``` ## Standard Runbook Template ```markdown # Runbook: [TITLE] | Property | Value | |----------|-------| | **ID** | RB-[NUMBER] | | **Category** | [Incident/Operational/Troubleshooting/Emergency] | | **Service** | [Service Name] | | **Owner** | [Team/Individual] | | **Last Updated** | [YYYY-MM-DD] | | **Last Tested** | [YYYY-MM-DD] | | **Review Frequency** | [Quarterly/Monthly/Annually] | --- ## Overview **Purpose:** [What this runbook helps you accomplish] **When to Use:** [Conditions that trigger this runbook] **Expected Outcome:** [What success looks like] **Estimated Duration:** [Time to complete] --- ## Prerequisites ### Required Access - [ ] [System/Tool 1] - [Role/Permission needed] - [ ] [System/Tool 2] - [Role/Permission needed] ### Required Knowledge - [Skill/Knowledge 1] - [Skill/Knowledge 2] ### Tools Needed | Tool | Purpose | Access URL | |------|---------|------------| | [Tool 1] | [Purpose] | [URL/Link] | | [Tool 2] | [Purpose] | [URL/Link] | --- ## Quick Reference ```text Quick Commands: ┌────────────────────────────────────────────────────────────────┐ │ Check service status: kubectl get pods -n [namespace] │ │ View logs: kubectl logs -f [pod-name] -n [namespace] │ │ Restart service: kubectl rollout restart deployment/[name] │ │ Check metrics: [monitoring-url] │ └────────────────────────────────────────────────────────────────┘ ``` --- ## Procedure ### Step 1: [Step Name] **Objective:** [What this step accomplishes] **Actions:** 1. [Action 1] ```bash # Command example kubectl get pods -n production ``` 2. [Action 2] **Expected Result:** [What you should see] **If This Fails:** Go to [Troubleshooting Section](#troubleshooting) --- ### Step 2: [Step Name] **Objective:** [What this step accomplishes] **Actions:** 1. [Action 1] 2. [Action 2] **Decision Point:** ```text ┌─────────────────────────────────────┐ │ Is the service responding? │ │ │ │ YES → Continue to Step 3 │ │ NO → Go to Step 4 (Escalation) │ └─────────────────────────────────────┘ ``` --- ### Step 3: [Verification] **Objective:** Verify the issue is resolved **Verification Checklist:** - [ ] Service is responding to health checks - [ ] Metrics show normal values - [ ] No new errors in logs - [ ] Users can access the service --- ## Troubleshooting ### Issue: [Common Issue 1] **Symptoms:** [What you observe] **Cause:** [Root cause] **Resolution:** 1. [Step 1] 2. [Step 2] ### Issue: [Common Issue 2] **Symptoms:** [What you observe] **Cause:** [Root cause] **Resolution:** 1. [Step 1] 2. [Step 2] --- ## Escalation ### When to Escalate - [ ] Issue not resolved after [X] minutes - [ ] Impact affects [threshold] - [ ] Required access not available - [ ] Unsure of next steps ### Escalation Path | Level | Contact | Method | Response Time | |-------|---------|--------|---------------| | L1 | On-call Engineer | PagerDuty | 15 min | | L2 | Team Lead | Slack #incidents | 30 min | | L3 | Engineering Manager | Phone | 1 hour | | L4 | VP Engineering | Phone | As needed | --- ## Communication ### Status Updates **Template:** ```text [TIMESTAMP] - [SERVICE] - [STATUS] Current Status: [Investigating/Identified/Monitoring/Resolved] Impact: [Description of user impact] Next Update: [Time of next update] Actions Taken: - [Action 1] - [Action 2] Next Steps: - [Planned action] ``` ### Stakeholder Notification | Stakeholder | When to Notify | Method | |-------------|----------------|--------| | Engineering | Immediately | Slack | | Product | If user-impacting | Slack | | Support | If customer-facing | Email | | Leadership | If SEV1/SEV2 | Phone | --- ## Post-Incident ### Cleanup Tasks - [ ] Remove any temporary fixes - [ ] Update monitoring/alerts if needed - [ ] Document any new learnings ### Post-Incident Review - [ ] Schedule post-mortem meeting - [ ] Gather timeline and evidence - [ ] Identify action items --- ## Appendix ### Related Runbooks - [RB-XXX: Related Runbook 1] - [RB-YYY: Related Runbook 2] ### Reference Documentation - [Link to architecture docs] - [Link to service docs] ### Revision History | Version | Date | Author | Changes | |---------|------|--------|---------| | 1.0 | [Date] | [Name] | Initial version | | 1.1 | [Date] | [Name] | [Changes] | ```text ``` ## Incident Response Runbook Template ```markdown # Incident Runbook: [Alert Name] | Property | Value | |----------|-------| | **Alert** | [Alert Name/ID] | | **Severity** | [SEV1/SEV2/SEV3/SEV4] | | **Service** | [Service Name] | | **SLO Impact** | [Which SLO is affected] | --- ## Alert Details **Trigger Condition:** ```text [Alert query/condition] Example: error_rate > 1% for 5 minutes ``` **Alert Meaning:** [What this alert indicates] **False Positive Indicators:** [Signs this might be a false alarm] --- ## Immediate Actions (First 5 Minutes) ### 1. Acknowledge Alert ```bash # Acknowledge in PagerDuty pd incident:acknowledge # Or via Slack /pd ack ``` ### 2. Assess Impact **Quick Health Checks:** ```bash # Check service status curl -s https://api.example.com/health | jq . # Check error rate kubectl logs -l app=service --tail=100 | grep -c ERROR # Check pod status kubectl get pods -n production -l app=service ``` **Impact Assessment:** | Check | Command | Expected | Actual | |-------|---------|----------|--------| | Health endpoint | `curl /health` | 200 OK | [Result] | | Error rate | `grep ERROR` | < 10 | [Result] | | Pod status | `kubectl get pods` | Running | [Result] | ### 3. Initial Communication Post in #incidents: ```text 🔴 INCIDENT: [Service] - [Brief Description] Severity: [SEV level] Impact: [User impact] Status: Investigating Lead: @[your-name] ``` --- ## Diagnosis ### Common Causes and Checks #### Cause 1: High Traffic ```bash # Check request rate kubectl top pods -n production -l app=service # Check HPA status kubectl get hpa -n production ``` **If traffic spike confirmed:** - Scale replicas: `kubectl scale deployment/service --replicas=10` - Enable rate limiting if available #### Cause 2: Database Issues ```bash # Check database connections kubectl exec -it [pod] -- psql -c "SELECT count(*) FROM pg_stat_activity;" # Check slow queries kubectl logs -l app=service | grep "slow query" ``` **If database issues:** - Check connection pool exhaustion - Look for long-running queries - Consider read replica failover #### Cause 3: Dependency Failure ```bash # Check external dependencies curl -s https://status.dependency.com/api/v2/status.json | jq . # Check circuit breaker status kubectl logs -l app=service | grep "circuit" ``` **If dependency failure:** - Verify external service status - Check for timeout configuration - Consider enabling fallback behavior --- ## Resolution Steps ### Quick Fixes | Issue | Quick Fix | Command | |-------|-----------|---------| | Pod crash loop | Restart deployment | `kubectl rollout restart deployment/service` | | Memory pressure | Increase limits | `kubectl edit deployment/service` | | Config error | Rollback config | `kubectl rollout undo deployment/service` | ### Rollback Procedure ```bash # List recent deployments kubectl rollout history deployment/service -n production # Rollback to previous version kubectl rollout undo deployment/service -n production # Rollback to specific revision kubectl rollout undo deployment/service -n production --to-revision=2 ``` --- ## Resolution Verification **Verification Checklist:** - [ ] Alert has cleared - [ ] Health checks passing - [ ] Error rate below threshold - [ ] No user complaints in support channels - [ ] Metrics returning to baseline **Monitoring Period:** Monitor for 15 minutes after resolution --- ## Closure ### Update Status ```text ✅ RESOLVED: [Service] - [Brief Description] Duration: [X] minutes Root Cause: [Brief cause] Resolution: [What fixed it] Follow-up: [Any action items] ``` ### Post-Incident Tasks - [ ] Update incident timeline - [ ] Create post-mortem doc if SEV1/SEV2 - [ ] File tickets for follow-up work - [ ] Update runbook if needed ```text ``` ## Database Failover Runbook ```markdown # Runbook: Database Failover | Property | Value | |----------|-------| | **ID** | RB-DB-001 | | **Category** | Emergency | | **Service** | PostgreSQL Primary | | **Owner** | Platform Team | | **Last Tested** | 2025-01-15 | --- ## Overview **Purpose:** Failover from primary database to replica when primary is unavailable. **When to Use:** - Primary database unresponsive for > 5 minutes - Primary database corruption detected - Planned maintenance requiring failover **Expected Outcome:** Application traffic routed to new primary **Estimated Duration:** 15-30 minutes --- ## Prerequisites ### Required Access - [ ] Azure Portal - Contributor on resource group - [ ] kubectl - cluster-admin - [ ] Database credentials - postgres superuser ### Pre-Failover Checks ```bash # Verify replica is healthy and caught up az postgres flexible-server replica list --resource-group rg-prod --name pg-primary # Check replication lag psql -h pg-replica.postgres.database.azure.com -U postgres -c \ "SELECT pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn() AS lag_bytes;" ``` **Acceptable lag:** < 1MB --- ## Failover Procedure ### Step 1: Confirm Primary is Unavailable ```bash # Test primary connectivity psql -h pg-primary.postgres.database.azure.com -U postgres -c "SELECT 1;" # Check Azure status az postgres flexible-server show --resource-group rg-prod --name pg-primary --query "state" ``` **Expected:** Connection timeout or error state ### Step 2: Notify Stakeholders ```text 🔴 DATABASE FAILOVER INITIATED Target: pg-primary → pg-replica Reason: [Primary unavailable/Maintenance/etc.] Expected Downtime: 5-10 minutes ``` ### Step 3: Promote Replica ```bash # Promote replica to primary (Azure Flexible Server) az postgres flexible-server replica stop-replication \ --resource-group rg-prod \ --name pg-replica # Verify promotion az postgres flexible-server show \ --resource-group rg-prod \ --name pg-replica \ --query "replicationRole" ``` **Expected:** `replicationRole: None` (standalone) ### Step 4: Update Connection Strings ```bash # Update Kubernetes secret kubectl create secret generic db-connection \ --from-literal=host=pg-replica.postgres.database.azure.com \ --dry-run=client -o yaml | kubectl apply -f - # Restart applications to pick up new connection kubectl rollout restart deployment -l uses-database=true -n production ``` ### Step 5: Verify Application Connectivity ```bash # Check application logs kubectl logs -l app=api-service --tail=50 | grep -i database # Test application health curl -s https://api.example.com/health | jq .database ``` --- ## Post-Failover ### Immediate Tasks - [ ] Verify all applications connected to new primary - [ ] Check for data consistency - [ ] Monitor error rates ### Recovery Tasks (Next 24 Hours) - [ ] Investigate original primary failure - [ ] Create new replica from new primary - [ ] Update DNS/connection strings permanently - [ ] Document incident and learnings --- ## Rollback If failover causes issues: ```bash # If original primary is recoverable # Stop writes to new primary kubectl scale deployment --replicas=0 -l uses-database=true -n production # Restore original primary az postgres flexible-server update --resource-group rg-prod --name pg-primary --state Enabled # Revert connection strings kubectl create secret generic db-connection \ --from-literal=host=pg-primary.postgres.database.azure.com \ --dry-run=client -o yaml | kubectl apply -f - # Restart applications kubectl rollout restart deployment -l uses-database=true -n production ``` ```text ``` ## Runbook Quality Checklist | Criterion | Description | Check | |-----------|-------------|-------| | **Actionable** | Every step has a specific action | [ ] | | **Testable** | Can be practiced in non-prod | [ ] | | **Current** | Reflects current system state | [ ] | | **Complete** | Covers happy and error paths | [ ] | | **Accessible** | Available during incidents | [ ] | | **Versioned** | Changes tracked with dates | [ ] | ## Workflow When creating runbooks: 1. **Identify Need**: What operation/incident needs documentation? 2. **Gather Information**: Interview operators, review past incidents 3. **Draft Runbook**: Use appropriate template 4. **Validate Steps**: Walk through with subject matter expert 5. **Test in Non-Prod**: Execute runbook in staging 6. **Publish**: Add to runbook collection 7. **Train Team**: Ensure operators know where to find it 8. **Maintain**: Review and update regularly ## References For detailed guidance: --- **Last Updated:** 2025-12-26