--- name: runbook-generator description: Эксперт по runbooks. Используй для создания операционных процедур, incident response и maintenance документации. --- # Runbook Generator Expert in creating comprehensive, standardized runbooks for operational procedures, incident response, and system maintenance tasks. ## Runbook Structure ```yaml runbook_template: metadata: title: "Runbook title" version: "1.0" last_updated: "2024-01-15" owner: "Team/Person" reviewers: ["Name 1", "Name 2"] overview: purpose: "What this runbook accomplishes" scope: "Systems/services affected" audience: "Who should use this" prerequisites: access: - "AWS Console access" - "SSH key for production servers" - "Database credentials" tools: - "kubectl configured" - "AWS CLI installed" - "jq for JSON parsing" knowledge: - "Basic Kubernetes concepts" - "Understanding of service architecture" execution: estimated_time: "15-30 minutes" risk_level: "Medium" requires_change_ticket: true requires_approval: true can_be_automated: true steps: [] # Detailed steps below verification: [] # How to confirm success rollback: [] # How to undo changes troubleshooting: [] # Common issues contacts: primary_oncall: "PagerDuty" escalation: "Engineering Manager" subject_experts: ["DBA Team", "Platform Team"] ``` ## Standard Runbook Template ```markdown # [Runbook Title] **Version:** 1.0 **Last Updated:** YYYY-MM-DD **Owner:** Team Name **Risk Level:** Low | Medium | High | Critical ## Overview ### Purpose Brief description of what this runbook accomplishes. ### When to Use - Trigger condition 1 - Trigger condition 2 - Alert: "Alert Name" fires ### Scope Systems and services affected: - Service A - Database B - External dependency C ## Prerequisites ### Required Access - [ ] Production AWS Console - [ ] Kubernetes cluster access - [ ] Database read/write permissions ### Required Tools ```bash # Verify kubectl kubectl version --client # Verify AWS CLI aws sts get-caller-identity # Verify database connectivity psql -h $DB_HOST -U $DB_USER -c "SELECT 1" ``` ### Required Knowledge - Kubernetes pod management - Service architecture overview - Incident response process ## Pre-Execution Checklist - [ ] Change ticket created: CHG-XXXXX - [ ] Approval obtained from: [Name] - [ ] Backup verified (if applicable) - [ ] Stakeholders notified - [ ] Maintenance window scheduled (if applicable) ## Execution Steps ### Step 1: [Action Name] **Purpose:** Why this step is necessary **Command:** ```bash kubectl get pods -n production -l app=myservice ``` **Expected Output:** ``` NAME READY STATUS RESTARTS AGE myservice-abc123-xyz 1/1 Running 0 2d myservice-def456-uvw 1/1 Running 0 2d ``` **Verification:** Confirm all pods show STATUS=Running **If unexpected:** See Troubleshooting section --- ### Step 2: [Next Action] **Purpose:** Description **Command:** ```bash # Command with explanation kubectl scale deployment myservice --replicas=3 -n production ``` **Expected Output:** ``` deployment.apps/myservice scaled ``` **Verification:** ```bash # Verify new replicas are running kubectl get pods -n production -l app=myservice -w ``` **Wait for:** All 3 pods to show Running status (typically 2-5 minutes) --- ## Post-Execution Verification ### Verify Service Health ```bash # Check deployment status kubectl rollout status deployment/myservice -n production # Check service endpoints kubectl get endpoints myservice -n production # Verify application health curl -s https://api.example.com/health | jq . ``` **Expected:** ```json { "status": "healthy", "version": "1.2.3", "uptime": "2h30m" } ``` ### Verify Metrics - [ ] Error rate returned to normal (<0.1%) - [ ] Latency within SLA (<200ms p99) - [ ] No new alerts firing ## Rollback Procedure ### When to Rollback - Error rate exceeds 1% - Latency exceeds 500ms p99 - Critical functionality broken ### Rollback Steps ```bash # Rollback to previous deployment kubectl rollout undo deployment/myservice -n production # Verify rollback kubectl rollout status deployment/myservice -n production # Confirm previous version kubectl get deployment myservice -n production -o jsonpath='{.spec.template.spec.containers[0].image}' ``` ## Troubleshooting | Symptom | Likely Cause | Resolution | |---------|--------------|------------| | Pods stuck in Pending | Resource constraints | Check node capacity: `kubectl describe nodes` | | CrashLoopBackOff | Application error | Check logs: `kubectl logs -f ` | | ImagePullBackOff | Registry auth issue | Verify secret: `kubectl get secret regcred` | | Connection refused | Service not ready | Wait for readiness probe, check endpoints | ### Common Issues **Issue: Deployment times out** ```bash # Check pod events kubectl describe pod -n production # Check resource limits kubectl top pods -n production ``` **Issue: Database connection failures** ```bash # Verify database connectivity kubectl exec -it -n production -- psql -h $DB_HOST -c "SELECT 1" # Check connection pool kubectl logs -n production | grep -i "connection" ``` ## Emergency Contacts | Role | Contact | When to Engage | |------|---------|----------------| | On-call Engineer | PagerDuty | Any issue | | Database Team | #dba-oncall | Database issues | | Platform Team | #platform-oncall | Infrastructure issues | | Engineering Manager | [Name] | Escalation | ## Change Log | Version | Date | Author | Changes | |---------|------|--------|---------| | 1.0 | 2024-01-15 | Author | Initial version | ## Related Documentation - [Service Architecture](link) - [Incident Response Process](link) - [Monitoring Dashboard](link) ``` ## Runbook Types ### Incident Response Runbook ```yaml incident_runbook: sections: detection: alert_name: "High Error Rate - Payment Service" threshold: "Error rate > 5% for 5 minutes" severity: "P1" immediate_actions: - step: "Acknowledge alert" command: "In PagerDuty, acknowledge incident" time: "< 5 min" - step: "Assess impact" command: | # Check error rate curl -s "https://metrics.example.com/api/v1/query?query=rate(http_errors[5m])" time: "< 2 min" - step: "Notify stakeholders" action: "Post in #incident-channel" template: | 🚨 INCIDENT: Payment Service High Errors Severity: P1 Status: Investigating Impact: Payment processing affected IC: @oncall investigation: - "Check recent deployments" - "Review error logs" - "Check dependent services" - "Review infrastructure metrics" mitigation: options: - name: "Rollback deployment" when: "Error started after deploy" command: "kubectl rollout undo deployment/payment -n prod" - name: "Scale up" when: "Load-related errors" command: "kubectl scale deployment/payment --replicas=10 -n prod" - name: "Enable circuit breaker" when: "Downstream dependency failing" command: "Toggle feature flag: payment.circuit_breaker=true" resolution: checklist: - "[ ] Error rate < 0.1%" - "[ ] No P1 alerts" - "[ ] Stakeholders notified" - "[ ] Incident documented" ``` ### Deployment Runbook ```yaml deployment_runbook: pre_deployment: checklist: - "[ ] Code review approved" - "[ ] CI/CD pipeline passed" - "[ ] Staging tested" - "[ ] Change ticket approved" - "[ ] Rollback plan documented" verification: - step: "Verify staging health" command: | curl -s https://staging.example.com/health - step: "Check deployment queue" command: | kubectl get pods -n staging -l app=myservice deployment: - step: "Apply deployment" command: | kubectl apply -f k8s/production/deployment.yaml - step: "Monitor rollout" command: | kubectl rollout status deployment/myservice -n production --timeout=10m - step: "Verify new version" command: | kubectl get deployment myservice -n production \ -o jsonpath='{.spec.template.spec.containers[0].image}' post_deployment: - step: "Smoke test" command: | ./scripts/smoke-test.sh production - step: "Monitor metrics" duration: "15 minutes" watch: - "Error rate" - "Latency p99" - "Request rate" - step: "Update ticket" action: "Mark CHG ticket as completed" ``` ### Maintenance Runbook ```yaml maintenance_runbook: log_rotation: schedule: "Weekly, Sunday 02:00 UTC" steps: - step: "Connect to server" command: | ssh admin@logs.example.com - step: "Rotate logs" command: | sudo logrotate -f /etc/logrotate.d/application - step: "Verify rotation" command: | ls -la /var/log/application/ # Should see rotated files with date suffix - step: "Clean old logs" command: | # Remove logs older than 30 days find /var/log/application/ -name "*.log.*" -mtime +30 -delete - step: "Verify disk space" command: | df -h /var/log # Should show > 20% free database_maintenance: schedule: "Monthly, first Sunday 03:00 UTC" steps: - step: "Check table sizes" command: | psql -c " SELECT tablename, pg_size_pretty(pg_total_relation_size(tablename::text)) FROM pg_tables WHERE schemaname = 'public' ORDER BY pg_total_relation_size(tablename::text) DESC LIMIT 10; " - step: "Run VACUUM ANALYZE" command: | psql -c "VACUUM ANALYZE;" - step: "Reindex if needed" command: | psql -c "REINDEX DATABASE mydb;" ``` ## Writing Guidelines ```yaml principles: clarity: - "Use active voice" - "Be explicit, never assume" - "One action per step" completeness: - "Include all commands" - "Show expected output" - "Document verification" safety: - "Test in non-prod first" - "Include rollback steps" - "Document risks" formatting: commands: - "Use code blocks with language" - "Include full paths" - "Add comments for complex commands" steps: - "Number sequentially" - "Include purpose" - "Show expected result" - "Note time estimate" variables: format: "$VARIABLE_NAME or " document: "List all variables at start" ``` ## Quality Checklist ```yaml validation: structure: - "[ ] Clear title and metadata" - "[ ] Prerequisites listed" - "[ ] Steps numbered and clear" - "[ ] Expected outputs included" - "[ ] Verification steps present" - "[ ] Rollback documented" - "[ ] Troubleshooting section" - "[ ] Contacts listed" testing: - "[ ] All commands tested" - "[ ] Outputs verified" - "[ ] Rollback tested" - "[ ] Time estimates accurate" maintenance: - "[ ] Version number updated" - "[ ] Change log maintained" - "[ ] Quarterly review scheduled" - "[ ] Owner assigned" ``` ## Лучшие практики 1. **Test everything** — каждая команда должна быть проверена 2. **Show expected output** — пользователь должен знать что увидит 3. **Include rollback** — всегда план отката 4. **Keep updated** — ревью каждый квартал 5. **Version control** — runbooks в git 6. **Automate when possible** — автоматизируй повторяющиеся процедуры