---
name: deployment-automation-enforcer
description: Use when designing deployment pipelines, CI/CD, terraform, or infrastructure automation. Enforces rollback checkpoint then TodoWrite with 19+ items. Triggers: "deploy", "CI/CD", "kubernetes", "terraform". If thinking "rollback later" - use this first.
---

# Deployment Automation Enforcer

## ROLLBACK CHECKPOINT (COMPLETE FIRST)

**MANDATORY before creating TodoWrite:**

- [ ] Rollback script exists? [Path: _______ | or "NEW - will create first"]
- [ ] Tested in staging? [Date: _______ | or "must test before production"]
- [ ] Duration measured? [_____ minutes]
- [ ] Triggers defined? [List: _______]

**Why first:** 27% skip rollback when checkpoint appears later.

---

## TodoWrite Requirements

**CREATE TodoWrite** with 4 sections (19+ items total):

| Section | Min Items | Order |
|---------|-----------|-------|
| Automation | 5+ | 1st |
| Observability | 5+ | 2nd (BEFORE Failure Recovery) |
| Failure Recovery | 5+ | 3rd (requires Observability) |
| Verification | 4+ | 4th |

**Section order matters:** You cannot define failure recovery without observability to detect failures.

---

## Verification Checkpoint

After creating TodoWrite, verify 3 random items:

**Each item must have ALL THREE:**

- ✓ Concrete numbers/thresholds ("error rate > 5%", "15 min timeout")
- ✓ Specific tools ("GitHub Actions", "CloudWatch", "PagerDuty")
- ✓ Measurable outcome ("rollback tested on [date]", "alert fires within 5min")

| ❌ FAILS | ✅ PASSES |
|----------|-----------|
| "Add monitoring" | "CloudWatch: `deployment.duration_seconds`, Grafana dashboard at `/dashboards/deployments`, PagerDuty alert if error rate > 5% for 3min" |
| "Implement rollback" | "Rollback `.github/workflows/rollback.yml` reverts to previous Docker tag from S3 `deployment-history/latest-stable.txt`. Triggers: manual OR error rate > 5% for 3min. Target: < 5 minutes. Test in staging on [date]" |

---

## Section Requirements

### Automation (5+ items)

- [ ] Identify manual steps in current deployment
- [ ] Replace with automated scripts/workflows (GitHub Actions, GitLab CI)
- [ ] Idempotency checks for safe re-runs
- [ ] Rollback automation for this change
- [ ] Document exceptions for remaining manual steps

### Observability (5+ items) - BEFORE Failure Recovery

- [ ] Deployment logging (structured: deployment-id, timestamps, steps)
- [ ] Failure alerts (PagerDuty/SNS on failure, error rate spike)
- [ ] Metrics (duration, success rate in CloudWatch/Datadog)
- [ ] Health endpoint (`/health` returns 200 + dependency status)
- [ ] Log/metric locations documented

### Failure Recovery (5+ items) - AFTER Observability

- [ ] Failure scenarios defined (won't start, migration fails, health check fails)
- [ ] Automated rollback triggers (error rate > X%, failed health checks for Y minutes)
- [ ] Health checks post-deployment
- [ ] Rollback tested in staging (date, duration, success)
- [ ] Manual recovery documentation as last resort

### Verification (4+ items)

- [ ] Pre-deployment tests automated (unit, integration, lint)
- [ ] Smoke tests post-deployment (critical flows, key endpoints)
- [ ] Monitoring/alerts verified working (trigger test alert)
- [ ] Rollback procedure accessible (script in repo, documented)

---

## Red Flags - STOP When You Think:

| Thought | Reality | Data |
|---------|---------|------|
| "Manual deploy is broken, need automation fast" | Automating without rollback creates WORSE problems | 27% skip rollback |
| "We'll add monitoring/rollback after" | Can't detect/recover from failures without them | 80% never add "later" |
| "Rollback is overkill" | Manual recovery ALWAYS takes 10x longer | 30+ min manual vs 2 min automated |
| "We can manually revert" | Detect (no monitoring) + find version (no automation) + apply (error-prone) | 30+ min |

---

## Response Templates

### "We'll Add Rollback Later"

❌ **BLOCKED**: Cannot deploy without rollback capability.

- 27% skip rollback when not required upfront
- 80% of "add later" items never get added
- Manual recovery takes 30+ minutes vs 2 minutes automated
- Production incidents without rollback = extended downtime + customer impact

**Required to override:**

1. Specific retrofit date (not "later")
2. Budget allocated (engineer-weeks + incident risk cost)
3. Risk acceptance signed by decision maker
4. Interim mitigation plan (24/7 on-call? manual monitoring?)

---

## Override Requirements

To skip ANY requirement, provide ALL 4:

1. Specific retrofit date (not "later")
2. Budget allocated (engineer-weeks + incident risk)
3. Risk acceptance signed by decision maker
4. Interim mitigation plan (24/7 on-call? manual monitoring?)

---

## Self-Grading Before Complete

```
[ ] 19+ items across 4 sections
[ ] 80%+ items have concrete numbers
[ ] 80%+ items name specific tools
[ ] 100% items have measurable outcomes
[ ] 3 random items pass specificity test
[ ] Rollback checkpoint completed
[ ] Observability BEFORE Failure Recovery (correct order)

Grade 7/7: Ready to proceed
Grade <7: Revise TodoWrite
```

---

## Evidence Collection

Before marking complete:

- [ ] Automation code link (workflow file URL)
- [ ] Staging deployment log (screenshot/excerpt)
- [ ] Monitoring dashboard screenshot
- [ ] Rollback test evidence (log with timestamp, duration)
- [ ] Alert test confirmation
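
The structural parts of the Self-Grading list above (item counts, section order, share of items with numbers and named tools) could be checked mechanically. The following is a minimal sketch in Python, assuming the TodoWrite is available as plain data. The names `Section`, `check_todowrite`, and `KNOWN_TOOLS` are hypothetical and not part of any TodoWrite API, and the tool list is only an illustration built from tools named in this document. Whether an item has a measurable outcome still needs a human read.

```python
import re
from dataclasses import dataclass

# Minimum item counts in required order, from the TodoWrite Requirements table.
SECTION_MINIMUMS = {
    "Automation": 5,
    "Observability": 5,
    "Failure Recovery": 5,
    "Verification": 4,
}
MIN_TOTAL_ITEMS = 19
MIN_SHARE = 0.8  # "80%+ items" from the self-grading list

# Hypothetical tool list for illustration; extend it with your own stack.
KNOWN_TOOLS = (
    "GitHub Actions", "GitLab CI", "CloudWatch",
    "Datadog", "Grafana", "PagerDuty", "SNS",
)


@dataclass
class Section:
    name: str
    items: list[str]


def has_number(item: str) -> bool:
    """Crude heuristic: any digit counts as a concrete number/threshold."""
    return re.search(r"\d", item) is not None


def names_tool(item: str) -> bool:
    """Crude heuristic: case-insensitive substring match against KNOWN_TOOLS."""
    lowered = item.lower()
    return any(tool.lower() in lowered for tool in KNOWN_TOOLS)


def check_todowrite(sections: list[Section]) -> list[str]:
    """Return a list of problems; an empty list means the structural checks pass."""
    problems = []
    expected = list(SECTION_MINIMUMS)
    names = [s.name for s in sections]
    if names != expected:
        problems.append(f"section order must be {expected}, got {names}")

    for section in sections:
        minimum = SECTION_MINIMUMS.get(section.name)
        if minimum is not None and len(section.items) < minimum:
            problems.append(
                f"{section.name}: {len(section.items)} items, need {minimum}+"
            )

    all_items = [item for s in sections for item in s.items]
    if len(all_items) < MIN_TOTAL_ITEMS:
        problems.append(f"total: {len(all_items)} items, need {MIN_TOTAL_ITEMS}+")

    if all_items:
        numeric_share = sum(has_number(i) for i in all_items) / len(all_items)
        tool_share = sum(names_tool(i) for i in all_items) / len(all_items)
        if numeric_share < MIN_SHARE:
            problems.append(f"only {numeric_share:.0%} of items have concrete numbers")
        if tool_share < MIN_SHARE:
            problems.append(f"only {tool_share:.0%} of items name a specific tool")
    return problems


if __name__ == "__main__":
    vague = [Section("Automation", ["Add monitoring", "Implement rollback"])]
    for problem in check_todowrite(vague):
        print("FAIL:", problem)
```

The digit and substring checks are deliberately loose: they catch items like "Add monitoring" but cannot confirm that a number is a meaningful threshold, so the 3-random-item review above still applies.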