--- name: saas-launch-checklist description: Pre-launch verification across infrastructure, security, legal, payment, email, analytics, and performance. Day-1 monitoring, rollback plan, incident response skeleton, and post-launch week-1 checklist. --- # SaaS Launch Checklist A structured checklist for shipping a SaaS product to production with confidence. Every item exists because someone skipped it and regretted it. ## Pre-Launch Checklist (28 Items) ### Infrastructure (5 items) - [ ] **SSL/TLS** -- Certificate installed, auto-renewal configured (Let's Encrypt / ACM), HSTS header present - [ ] **CDN** -- Static assets served through CDN, cache-control headers set, origin shielding enabled - [ ] **Backup** -- Database backup scheduled (daily minimum), tested restore procedure, off-site copy - [ ] **Monitoring** -- Health check endpoint returns 200 when healthy and 503 when degraded, uptime monitor pings every 60s - [ ] **Alerting** -- PagerDuty/Opsgenie configured, on-call rotation set, escalation policy defined (5 min ack SLA) ### Security (5 items) - [ ] **Penetration test** -- Run OWASP ZAP or equivalent against staging, fix all HIGH/CRITICAL findings - [ ] **Dependency audit** -- `npm audit` / `pip-audit` / `trivy` shows zero critical CVEs - [ ] **Secret rotation** -- All production secrets differ from staging, rotation schedule documented - [ ] **Rate limiting** -- API endpoints rate-limited (e.g., 100 req/min per user), auth endpoints stricter (10 req/min) - [ ] **Headers** -- CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy all configured ### Legal (4 items) - [ ] **Terms of Service** -- Reviewed by legal counsel, published at `/terms`, acceptance recorded at signup - [ ] **Privacy Policy** -- GDPR/CCPA compliant, published at `/privacy`, covers data collection, retention, and third parties - [ ] **Cookie consent** -- Banner shown to EU visitors, non-essential cookies blocked until consent, preferences stored - [ ] **DPA (Data Processing Agreement)** -- Template available for enterprise customers, sub-processor list published ### Payment (4 items) - [ ] **Test transactions** -- Full purchase flow tested with test cards (success, decline, 3DS), subscription upgrade/downgrade verified - [ ] **Webhook verification** -- Stripe/Paddle webhook signature verified, replay attacks prevented with idempotency keys - [ ] **Invoice generation** -- Invoices auto-generated with correct tax info, downloadable as PDF, emailed on payment - [ ] **Dunning flow** -- Failed payment retry schedule configured (day 1, 3, 5, 7), grace period before cancellation ### Email (4 items) - [ ] **SPF/DKIM/DMARC** -- DNS records configured, DMARC policy set to `quarantine` or `reject`, verified with mail-tester.com - [ ] **Transactional templates** -- Welcome, password reset, invoice, and trial expiry emails tested across Gmail/Outlook/Apple Mail - [ ] **Unsubscribe** -- One-click unsubscribe header present (RFC 8058), `/unsubscribe` endpoint works, preference center available - [ ] **Sender reputation** -- Dedicated IP warmed up (if >10k emails/day), bounce handling configured, complaint feedback loop active ### Analytics (3 items) - [ ] **Core events** -- Signup, activation, purchase, and churn events tracked with user ID and properties - [ ] **Funnels** -- Signup-to-activation and trial-to-paid funnels configured in analytics dashboard - [ ] **Dashboards** -- Real-time dashboard shows active users, revenue (MRR), error rate, and response time ### Performance (3 items) - [ ] **Load test** -- k6/Artillery run at 2x expected peak traffic, p95 latency under 500ms, zero errors - [ ] **CDN cache** -- Cache hit ratio above 90% for static assets, proper `Cache-Control` and `ETag` headers - [ ] **Image optimization** -- All images served as WebP/AVIF with responsive srcset, lazy loading below the fold ## Day-1 Monitoring Dashboard Build this dashboard before launch. Every panel answers a specific question. ``` +-----------------------------------------------------+ | SaaS Launch Dashboard | +---------------------------+-------------------------+ | Request Rate (req/s) | Error Rate (%) | | Normal: 10-50/s | Target: < 1% | | Alert: > 200/s | Alert: > 2% | +---------------------------+-------------------------+ | p95 Latency (ms) | Active Users (real-time) | | Target: < 200ms | Shows WebSocket/polling | | Alert: > 500ms | count | +---------------------------+-------------------------+ | Signup Rate (/hr) | Payment Success Rate (%) | | Compare to projection | Target: > 95% | | Alert: 0 for 30min | Alert: < 90% | +---------------------------+-------------------------+ | CPU / Memory | Database Connections | | Alert: > 80% | Alert: > 80% pool | +---------------------------+-------------------------+ ``` ### Dashboard as Code (Prometheus Queries) ```yaml # monitoring/launch-dashboard.yml panels: - title: Request Rate query: sum(rate(http_requests_total[5m])) alert_threshold: 200 - title: Error Rate query: | sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 alert_threshold: 2 - title: p95 Latency query: | histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) alert_threshold: 0.5 - title: Signup Rate query: sum(increase(user_signups_total[1h])) alert_threshold: 0 # alert if zero for 30 min - title: Payment Success Rate query: | sum(rate(payment_completed_total[5m])) / sum(rate(payment_attempted_total[5m])) * 100 alert_threshold: 90 # alert if below ``` ## Rollback Plan Template Complete this before you deploy. If you cannot fill every field, you are not ready to launch. ```markdown # Rollback Plan: [Product Name] v[X.Y.Z] ## Decision Criteria Trigger rollback if ANY of these occur within 60 minutes of deploy: - [ ] Error rate exceeds 5% - [ ] p95 latency exceeds 2 seconds for 5 consecutive minutes - [ ] Payment processing fails for 3+ consecutive attempts - [ ] Data integrity issue detected (mismatched records) ## Rollback Steps 1. Set feature flag `launch_v1` to OFF (immediate, <30 seconds) 2. Revert DNS/load balancer to previous deployment 3. Run: `kubectl rollout undo deployment/api --to-revision=PREV` OR: `docker compose -f docker-compose.prod.yml up -d --force-recreate` 4. Verify health check returns 200 on previous version 5. Notify #incidents channel: "Rollback executed, investigating" ## Data Migration Rollback - [ ] Database migration has a DOWN migration - [ ] Tested: `npm run migrate:down` or `python manage.py migrate APP PREVIOUS` - [ ] New columns are NULLABLE (old code ignores them) - [ ] No destructive changes (column drops, renames) in this release ## Communication - Engineering: #incidents Slack channel - Support: Pre-drafted message in support tool - Customers: Status page update (only if downtime > 5 min) ## Owner - Rollback decision: [On-call engineer name] - Execution: [DevOps engineer name] - Communication: [Support lead name] ``` ## Incident Response Playbook Skeleton ```markdown # Launch Day Incident Playbook ## Severity Levels | Level | Definition | Response | Example | |-------|-----------------------------|-----------|----------------------------| | SEV-1 | Service down, data loss | 5 min | Database unreachable | | SEV-2 | Major feature broken | 15 min | Payments failing | | SEV-3 | Minor feature broken | 1 hour | Email delivery delayed | | SEV-4 | Cosmetic issue | Next day | Alignment bug on Safari | ## On-Call Roster (Launch Day) | Role | Primary | Secondary | Contact | |--------------------|----------------|----------------|----------------| | Incident Commander | [Name] | [Name] | [Phone/Slack] | | Backend Engineer | [Name] | [Name] | [Phone/Slack] | | Frontend Engineer | [Name] | [Name] | [Phone/Slack] | | DevOps/SRE | [Name] | [Name] | [Phone/Slack] | ## Response Flow 1. DETECT - Alert fires or user reports issue 2. TRIAGE - Assign severity, open incident channel (#inc-YYYYMMDD-NN) 3. CONTAIN - Feature flag OFF, rollback, or scale up 4. FIX - Root cause fix, deploy to staging, verify 5. DEPLOY - Push fix to production with monitoring 6. REVIEW - Post-incident review within 48 hours ## Pre-Written Status Page Messages - Investigating: "We are aware of an issue affecting [X] and are investigating." - Identified: "The issue has been identified. A fix is being deployed." - Resolved: "The issue has been resolved. All systems are operational." ``` ## Post-Launch: First 7 Days Checklist ### Day 1 (Launch Day) - [ ] All-hands monitoring: dashboard on a shared screen or TV - [ ] Respond to every support ticket within 2 hours - [ ] Track signup funnel: landing page visits -> signups -> activation - [ ] Check error logs every hour for new error patterns - [ ] Verify email deliverability (check spam folders with seed accounts) ### Day 2-3 - [ ] Review Day 1 analytics: signup count, activation rate, bounce rate - [ ] Fix any P1/P2 bugs discovered by real users - [ ] Check payment reconciliation: Stripe dashboard matches your records - [ ] Send first-day cohort a "how's it going?" email (personal, from founder) - [ ] Monitor server costs: actual vs projected infrastructure spend ### Day 4-5 - [ ] Analyze user behavior: where do users drop off? Heatmaps, session recordings - [ ] Run second load test if traffic patterns differ from projections - [ ] Review and tune alerting thresholds (reduce noise, catch real issues) - [ ] Check SEO: is Google indexing your pages? robots.txt correct? - [ ] Backlog grooming: prioritize bugs and feature requests from real users ### Day 6-7 - [ ] Week 1 retrospective: what worked, what did not, what to change - [ ] Publish week 1 metrics: signups, activation, retention, revenue - [ ] Plan week 2 priorities based on actual user data (not assumptions) - [ ] Thank early adopters (personal email, discount, or swag) - [ ] Remove launch-specific feature flags and temporary monitoring ## Contrast: Phased Rollout vs Big Bang Launch ### BAD: Big Bang Launch ``` Monday 9:00 AM: - Mass email to 50,000 person waitlist - ProductHunt launch post goes live - Hacker News submission - All social media posts scheduled simultaneously - Full feature set available to everyone Monday 9:15 AM: - 3,000 concurrent users hit the app - Database connections exhausted - Payment webhook queue backs up - Support inbox: 200 tickets in 30 minutes - Error rate: 15% - Team panics Monday 10:00 AM: - Emergency rollback - Status page: "We are experiencing issues" - ProductHunt comments: "This doesn't work" - First impression destroyed ``` ### GOOD: Phased Rollout ```typescript // Feature flag configuration for phased rollout const LAUNCH_PHASES = { phase1: { name: 'Team and Friends', startDate: '2025-01-06', criteria: { userList: 'internal-testers' }, // 50 users goal: 'Find critical bugs before anyone else sees them', }, phase2: { name: 'Beta Waitlist (10%)', startDate: '2025-01-08', criteria: { percentage: 10 }, // 500 users goal: 'Validate onboarding flow and payment', }, phase3: { name: 'Beta Waitlist (50%)', startDate: '2025-01-10', criteria: { percentage: 50 }, // 2,500 users goal: 'Load test with real traffic patterns', }, phase4: { name: 'Full Waitlist + Public', startDate: '2025-01-13', criteria: { percentage: 100 }, // everyone goal: 'General availability', }, } as const ``` ``` Week 1 Monday: 50 internal users -> Find showstoppers Week 1 Wednesday: 500 beta users -> Validate payment flow Week 1 Friday: 2,500 beta users -> Verify infrastructure holds Week 2 Monday: Full public launch -> Confidence backed by data Each phase gate: [x] Error rate < 1% [x] p95 latency < 300ms [x] Payment success > 98% [x] NPS from phase users > 30 [x] Zero data integrity issues -> Only then proceed to next phase ``` ### Why Phased Rollout Wins | Dimension | Big Bang | Phased Rollout | |-------------------|------------------------|--------------------------| | Risk | All-or-nothing | Contained per phase | | Feedback | Overwhelming, chaotic | Manageable, actionable | | Infrastructure | Guess and hope | Scale based on real data | | Recovery | Public failure | Private fix | | First impression | One shot | Refined over 4 attempts | | Team stress | Maximum | Distributed | **Key principle**: Launch is not a single moment -- it is a process. Every item you skip is a bet that nothing will go wrong in that area. The checklist exists to make the boring stuff automatic so you can focus on users.