--- name: production-readiness description: Comprehensive checklist for production deployment readiness covering reliability, observability, security, and operational requirements. Use when preparing for go-live, launch readiness review, production deployment checklist, or assessing if a service is ready for production. --- # Production Readiness Systematic checklist to ensure services are ready for production deployment. ## When to Use This Skill - Preparing a new service for production launch - Go-live readiness review - Production deployment checklist needed - Assessing service maturity - Pre-launch security review ## Quick Readiness Assessment Copy and complete this checklist: ``` Production Readiness Assessment: Service: _______________ Date: _________________ Reviewer: _____________ Reliability: [ ] Pass [ ] Partial [ ] Fail Observability: [ ] Pass [ ] Partial [ ] Fail Security: [ ] Pass [ ] Partial [ ] Fail Operations: [ ] Pass [ ] Partial [ ] Fail Documentation: [ ] Pass [ ] Partial [ ] Fail Overall Status: [ ] Ready [ ] Conditional [ ] Not Ready ``` ## Reliability Checklist ### SLOs Defined ``` SLO Checklist: - [ ] Availability SLO defined (e.g., 99.9%) - [ ] Latency SLO defined (e.g., p99 < 200ms) - [ ] Error rate SLO defined (e.g., < 0.1%) - [ ] SLOs documented and communicated - [ ] Error budget policy established ``` ### Fault Tolerance ``` Fault Tolerance: - [ ] No single points of failure - [ ] Graceful degradation implemented - [ ] Circuit breakers for dependencies - [ ] Retry logic with exponential backoff - [ ] Timeouts configured for all external calls - [ ] Rate limiting in place ``` ### Capacity ``` Capacity Planning: - [ ] Load tested to 2x expected peak - [ ] Auto-scaling configured (if applicable) - [ ] Resource limits set (CPU, memory) - [ ] Connection pool sizes appropriate - [ ] Queue capacity sufficient ``` ### Data Resilience ``` Data Protection: - [ ] Backups configured and tested - [ ] Backup restoration tested - [ ] Data replication in place - [ ] RPO/RTO defined and achievable - [ ] No data loss on service restart ``` ## Observability Checklist ### Metrics ``` Metrics: - [ ] RED metrics exposed (Rate, Errors, Duration) - [ ] Resource metrics available (CPU, memory, disk) - [ ] Business metrics tracked - [ ] Dependency health metrics - [ ] Custom metrics for key operations ``` ### Logging ``` Logging: - [ ] Structured logging (JSON) - [ ] Request/trace IDs in all logs - [ ] Log levels appropriate (no excessive DEBUG) - [ ] Sensitive data not logged - [ ] Log retention configured ``` ### Tracing ``` Distributed Tracing: - [ ] Trace context propagated - [ ] Spans for external calls - [ ] Key operations instrumented - [ ] Sampling rate configured - [ ] Trace storage/retention set ``` ### Alerting ``` Alerts: - [ ] SLO-based alerts configured - [ ] Alert thresholds tuned (not noisy) - [ ] Runbooks linked to alerts - [ ] Escalation paths defined - [ ] On-call rotation assigned ``` ### Dashboards ``` Dashboards: - [ ] Service health dashboard exists - [ ] Key metrics visualized - [ ] Dashboard accessible to team - [ ] Dependencies shown - [ ] Historical data available ``` ## Security Checklist ### Authentication & Authorization ``` Auth: - [ ] Authentication required for all endpoints - [ ] Authorization checks implemented - [ ] Service-to-service auth configured - [ ] No hardcoded credentials - [ ] Secrets in secret manager ``` ### Network Security ``` Network: - [ ] TLS for all connections - [ ] Network policies/firewall rules - [ ] Internal services not publicly exposed - [ ] Egress traffic controlled - [ ] DDoS protection (if public) ``` ### Data Security ``` Data: - [ ] Sensitive data encrypted at rest - [ ] PII handling documented - [ ] Data retention policy applied - [ ] Audit logging for sensitive operations - [ ] GDPR/compliance requirements met ``` ### Vulnerability Management ``` Vulnerabilities: - [ ] Dependencies scanned for CVEs - [ ] Container images scanned - [ ] No critical vulnerabilities - [ ] Security review completed - [ ] Penetration testing (if required) ``` ## Operations Checklist ### Deployment ``` Deployment: - [ ] CI/CD pipeline configured - [ ] Deployment is automated - [ ] Rollback procedure documented - [ ] Rollback tested - [ ] Blue-green or canary supported - [ ] Feature flags for risky changes ``` ### Runbooks ``` Runbooks: - [ ] Startup/shutdown procedures - [ ] Common troubleshooting steps - [ ] Escalation procedures - [ ] Disaster recovery steps - [ ] Maintenance procedures ``` ### On-Call ``` On-Call Readiness: - [ ] On-call rotation scheduled - [ ] Team trained on service - [ ] Escalation paths clear - [ ] Contact information current - [ ] Handoff procedures defined ``` ## Documentation Checklist ``` Documentation: - [ ] Architecture diagram current - [ ] API documentation complete - [ ] README with setup instructions - [ ] Dependencies documented - [ ] Configuration documented - [ ] Known issues/limitations listed ``` ## Rollback Plan Every production deployment needs a rollback plan: ``` Rollback Plan: - Rollback trigger: [What conditions trigger rollback] - Rollback method: [How to rollback - automated/manual] - Rollback time: [Expected time to complete] - Data considerations: [Any data migration concerns] - Verification: [How to verify rollback success] ``` ## Pre-Launch Final Checklist Complete immediately before go-live: ``` Final Pre-Launch: - [ ] All checklist items above addressed - [ ] Stakeholders notified of launch - [ ] War room/incident channel ready - [ ] Key personnel available - [ ] Monitoring dashboards open - [ ] Rollback ready to execute - [ ] Communication templates prepared ``` ## Common Blockers See [references/common-blockers.md](references/common-blockers.md) for typical issues that block production readiness. ## Additional Resources - [Common Production Blockers](references/common-blockers.md) - [SLO Design Guide](references/slo-guide.md)