---
name: devops-sre
description: Use this skill when designing or reviewing CI/CD pipelines, deployment strategies, observability systems, incident response, or any system involving production operations and reliability. Applies operational thinking to specifications, designs, and implementations.
version: 0.1.0
---

# DevOps & SRE Engineering

## When to Apply

Use this skill when the system involves:

- CI/CD pipelines and deployment automation
- Production deployments and rollback strategies
- Monitoring, alerting, and observability
- Incident response and on-call procedures
- SLOs, SLIs, and error budgets
- Capacity planning and performance management

## Mindset

DevOps/SRE engineers think about the entire lifecycle from commit to production and beyond.

**Questions to always ask:**

- How do we deploy this safely? How do we roll back?
- How do we know it's working? What do we alert on?
- What's the SLO? What happens when we miss it?
- How do we debug this in production?
- What's the on-call burden? Is this operable at 3am?
- How do we handle traffic spikes? Gradual degradation?
- What's the blast radius of a bad deploy?

**Assumptions to challenge:**

- "It works on my machine" - Production is different. Test in production-like environments.
- "We'll monitor it later" - If you can't observe it, you can't operate it.
- "Deploys are safe" - Any change can break things. Deploy progressively.
- "More alerts are better" - Alert fatigue is real. Alert on symptoms, not causes.
- "We'll scale when needed" - Know your limits before you hit them.
- "Rollback is easy" - Is it? Have you tested it? What about data migrations?

## Practices

### CI/CD Pipeline

Automate everything from commit to deploy. Fast feedback loops (< 10 min to know if broken). Reproducible builds. Immutable artifacts.

**Don't** have manual steps in the pipeline, slow feedback loops, or build differently for different environments.

### Deployment Strategy

Use progressive rollouts (canary, blue-green, rolling).
Define rollback triggers and automate rollback. Separate deploy from release (feature flags).

**Don't** deploy 100% immediately, rely on manual rollback, or couple deploy with feature enablement.

### Observability

Instrument the four golden signals: latency, traffic, errors, saturation. Use structured logging with correlation IDs. Implement distributed tracing.

**Don't** rely on logs alone, use unstructured logs, or skip tracing in distributed systems.

### Alerting

Alert on symptoms (SLO breach), not causes. Page only for actionable, urgent issues. Route non-urgent to tickets. Include runbook links in alerts.

**Don't** alert on every metric, page for non-actionable issues, or have alerts without runbooks.

### SLOs & Error Budgets

Define SLOs based on user experience. Measure SLIs accurately. Use error budget to balance velocity and reliability.

**Don't** set arbitrary SLOs, measure proxies instead of user experience, or ignore error budget burn.

### Incident Response

Have clear escalation paths. Blameless postmortems. Document incidents and learnings. Practice incident response regularly.

**Don't** blame individuals, skip postmortems, or let learnings rot in docs.

### Runbooks

Document common operational tasks. Include debugging steps for known failure modes. Keep runbooks next to alerts.

**Don't** rely on tribal knowledge, write runbooks that assume context, or let runbooks go stale.

### Capacity Planning

Know your limits before you hit them. Load test regularly. Plan for peak, not average. Have scaling playbooks ready.

**Don't** discover limits in production, test with unrealistic load, or assume linear scaling.
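
To make "plan for peak, not average" concrete, here is a minimal sizing sketch. The function name `instances_needed`, the 60% target utilization, and all traffic numbers are illustrative assumptions, not values from this skill. The sketch also assumes throughput scales linearly with instance count, which the guidance above warns against, so treat the result as a lower bound to confirm with a load test.

```python
import math


def instances_needed(load_rps: float, per_instance_rps: float,
                     target_utilization: float = 0.6) -> int:
    """Return how many instances are needed to serve load_rps.

    per_instance_rps is the throughput one instance sustained in a load
    test (hypothetical input). target_utilization leaves headroom so that
    instances are not run at their measured limit.
    """
    if load_rps < 0:
        raise ValueError("load_rps must be non-negative")
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_rps = per_instance_rps * target_utilization
    return math.ceil(load_rps / usable_rps)


# Illustrative numbers only: sizing for the average hides the peak.
average_rps, peak_rps = 400, 1150
sized_for_average = instances_needed(average_rps, per_instance_rps=100)
sized_for_peak = instances_needed(peak_rps, per_instance_rps=100)
print(sized_for_average, sized_for_peak)
```

With these made-up numbers, sizing for the average yields 7 instances while sizing for the peak yields 20. A fleet planned around the average would therefore be badly undersized during a spike.
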
## Vocabulary

Use precise terminology:

| Instead of | Say |
|------------|-----|
| "reliable" | "99.9% availability SLO" / "< 1% error rate" |
| "monitored" | "SLI dashboards" / "alerting on p99 > 500ms" |
| "deployed" | "canary at 5%" / "blue-green with instant rollback" |
| "fast deploys" | "< 15 min commit-to-prod" / "10 deploys/day" |
| "observable" | "traces, metrics, structured logs with correlation" |
| "on-call" | "PagerDuty rotation" / "< 5 pages/week" |

## SDD Integration

**During Specification:**

- Define SLOs based on user-facing requirements
- Identify operational requirements (deployment frequency, rollback needs)
- Clarify observability requirements
- Establish on-call expectations

**During Design:**

- Design for observability from the start
- Specify deployment strategy and rollback approach
- Document what metrics/logs/traces each component emits
- Plan for graceful degradation
- Identify what runbooks will be needed

**During Review:**

- Verify observability is instrumented
- Check deployment strategy is progressive
- Confirm rollback is automated and tested
- Validate alerts are actionable with runbooks
- Ensure SLIs actually measure SLOs
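
The review check that SLIs actually measure SLOs, and the error-budget practice it supports, come down to simple arithmetic. The following is a minimal sketch, assuming a request-based availability SLI (good requests divided by total requests). The `ErrorBudget` class, its field names, and the sample counts are hypothetical and only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float      # e.g. 0.999 for a 99.9% availability SLO
    total_requests: int    # requests observed in the SLO window
    failed_requests: int   # requests that counted as bad for the user

    def __post_init__(self) -> None:
        if not 0 < self.slo_target < 1:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if not 0 <= self.failed_requests <= self.total_requests:
            raise ValueError("failed_requests must be within [0, total_requests]")

    @property
    def sli(self) -> float:
        """Measured availability: the fraction of good requests."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Failures the SLO tolerates in this window (the error budget)."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def budget_consumed(self) -> float:
        """Fraction of the error budget spent; above 1.0 the SLO is breached."""
        if self.total_requests == 0:
            return 0.0
        return self.failed_requests / self.allowed_failures

    @property
    def slo_met(self) -> bool:
        return self.sli >= self.slo_target


# Illustrative window: 250 failures out of 1,000,000 requests at a 99.9% SLO.
window = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"SLI={window.sli:.5f} budget_used={window.budget_consumed:.0%} met={window.slo_met}")
```

In this made-up window the SLO tolerates about 1,000 failures, so 250 failures spend roughly a quarter of the budget. A team could use the remaining budget to justify riskier releases, or slow down as `budget_consumed` approaches 1.0.
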