--- name: experiment-decision description: Decide when to A/B test vs just ship. Framework for experiment planning and prioritization. disable-model-invocation: false user-invocable: true --- # Experiment Decision Framework: When to A/B Test vs Ship ## Quick Start ``` /experiment-decision ``` Then provide: 1. **What you're considering building** (feature, change, or experiment) 2. **Expected impact** (metric + estimated improvement) 3. **Your concern** (is this risky? reversible? controversial?) I'll walk you through the decision tree: reversibility, hypothesis strength, detectable impact, and risk level. You'll get a clear recommendation: A/B test, ship + monitor, or just ship. **Output:** Decision documented inline or saved to `thoughts/shared/product/decisions/` **Time:** ~5 min for clear-cut cases, ~15 min for nuanced decisions **When to use:** Before building any feature, when stakeholders demand "data-driven" decisions, or when unsure if testing is worth the effort **Framework source:** Aakash Gupta's "When to A/B Test vs Just Ship" --- ## The Decision Framework Use this decision tree: ### Question 1: Is it reversible? **If YES → Ship it** - CSS changes - Messaging tweaks - UI polish - Non-destructive features **Why:** Reversible changes have low risk. Ship, monitor, rollback if needed. **If NO → Continue to Question 2** --- ### Question 2: Do you have a hypothesis with measurable impact? **If NO → Don't test** - Building "nice to haves" - No clear success metric - Can't measure the outcome **Why:** Testing without a hypothesis is wasteful. Either clarify the hypothesis or don't build it. **If YES → Continue to Question 3** --- ### Question 3: Is the expected impact large enough to detect? **Run a power calculation:** ``` Minimum Detectable Effect (MDE) = Effect you need to see to justify the work If your feature is expected to improve conversion by 0.5%, but you need 10M users to detect it → Don't test, just ship and monitor ``` **If impact is too small to detect → Ship without test** **If impact is detectable → Continue to Question 4** --- ### Question 4: Is the risk of being wrong high? **High risk scenarios:** - Affects revenue directly (pricing, checkout) - Impacts core user experience (onboarding, core flows) - Controversial decision (stakeholder disagreement) - Large engineering investment **If HIGH risk → A/B test** **If LOW risk → Ship without test** --- ## Decision Matrix | Risk Level | Impact Size | Reversible? | Decision | | ---------- | ----------- | ----------- | --------------------------------------- | | High | Large | No | **A/B Test** | | High | Large | Yes | **A/B Test** (or ship with kill switch) | | High | Small | No | **Don't build** | | High | Small | Yes | **Ship + Monitor** | | Low | Large | No | **Ship + Monitor** | | Low | Large | Yes | **Just Ship** | | Low | Small | No | **Just Ship** | | Low | Small | Yes | **Just Ship** | --- ## When to A/B Test ### ✅ Test When: **1. High-stakes decisions** - Pricing changes - Checkout flow modifications - Core product changes - Revenue-impacting features **2. Controversial hypotheses** - Team is divided on approach - Stakeholders disagree - User research is conflicting **3. Long-term bets** - Features that are expensive to reverse - Architectural decisions - Platform changes **4. Optimization work** - Conversion rate improvements - Engagement optimization - Retention experiments --- ## When to Just Ship ### ✅ Ship When: **1. Fast iteration needed** - Competitive pressure - Time-sensitive opportunities - Market windows closing **2. Low risk, high certainty** - Bug fixes - Obvious improvements - User-requested features (with clear demand) **3. Qualitative insights are strong** - Clear user pain validated through research - Competitive parity features - Accessibility improvements **4. Testing would take too long** - Small user base (can't reach significance) - Slow conversion cycles (months to convert) - Complex setup (weeks to build test infrastructure) --- ## The Cost of A/B Testing **Time costs:** - Engineering: 2-4 weeks to build test infrastructure - Analysis: 1-2 weeks to run experiment + analyze - **Total: 3-6 weeks delay** **Engineering costs:** - Feature flagging system - Analytics instrumentation - A/A test validation - Test maintenance **Opportunity costs:** - Could have shipped 3-5 other features - Delayed value delivery to users - Competitors may ship first **When testing costs exceed value → Just ship** --- ## Real-World Examples ### Example 1: Amazon's "Add to Cart" Button Color **Decision: A/B Test** - High traffic (millions of users) - Direct revenue impact - Easy to detect small improvements - **Result:** +2% conversion = $100M+ annually --- ### Example 2: Slack's Message Threading **Decision: Just Ship** - Highly requested feature - Strong qualitative signal from users - Reversible (users can ignore threads) - **Result:** Successful launch, became core feature --- ### Example 3: Netflix's "Are you still watching?" prompt **Decision: A/B Test** - Controversial (could annoy users) - Impact on engagement unclear - Risk of hurting retention - **Result:** Test showed improved engagement (prevented zombie sessions) --- ## Common Mistakes ❌ **Testing everything "to be data-driven"** - Problem: Slows down velocity - Fix: Reserve tests for high-stakes decisions ❌ **Shipping without monitoring** - Problem: Bad changes go unnoticed - Fix: Ship with dashboards and alerts ❌ **Running underpowered tests** - Problem: Waste time on inconclusive results - Fix: Calculate sample size before starting ❌ **Testing when qualitative data is clear** - Problem: Delays obvious improvements - Fix: Trust strong user research signals --- ## Quick Reference Checklist Before building any feature, ask: - [ ] Is this reversible? (If yes → ship) - [ ] Do I have a clear hypothesis? (If no → don't build) - [ ] Can I measure the impact? (If no → don't test) - [ ] Is the expected impact large enough to detect? (Power calculation) - [ ] What's the risk of being wrong? (High risk → test) - [ ] What's the cost of testing vs shipping? (ROI check) - [ ] Do I have strong qualitative data? (If yes → consider shipping) --- ## Statistical Power Guidance Before committing to an A/B test, estimate whether you have enough traffic to detect a meaningful difference. ### Power Calculation Essentials **Three inputs you need:** 1. **Minimum Detectable Effect (MDE)** -- what's the smallest improvement worth detecting? - For checkout conversion: 1-2% relative change matters (high revenue impact) - For feature adoption: 5-10% relative change is typical MDE - For engagement metrics: 3-5% relative change is reasonable 2. **Baseline conversion rate** -- what's the current rate you're trying to improve? - Higher baselines need more samples to detect small changes - Lower baselines are easier to move (but may need larger sample) 3. **Daily traffic to the experiment** -- how many users will enter the test per day? ### Rule of Thumb **You need approximately 1,000 conversions per variant to detect a 5% relative change at 80% power (95% confidence).** | Baseline Rate | MDE (Relative) | Conversions Needed Per Variant | At 1K daily visitors, days needed | | ------------- | -------------- | ------------------------------ | --------------------------------- | | 50% | 5% | ~3,200 | ~7 days | | 20% | 5% | ~12,500 | ~63 days | | 5% | 10% | ~15,000 | ~300 days | | 2% | 10% | ~40,000 | ~800 days | ### When Traffic Is Too Low If your power calculation shows the test would take longer than 4-6 weeks: 1. **Accept a larger MDE** -- only test if you expect a big swing (15%+ improvement) 2. **Use a composite metric** -- combine multiple success signals into one metric for higher sensitivity 3. **Run a qualitative test** -- 5-10 user tests instead of a statistical A/B test 4. **Just ship and monitor** -- launch with clear success criteria, compare before/after with caveats 5. **Use Bayesian methods** -- more forgiving with small samples, give probability ranges instead of p-values ### Common Pitfalls - **Peeking at results early** -- checking before reaching sample size inflates false positive rate. Commit to a runtime upfront. - **Stopping at first significant result** -- random fluctuations can look significant early. Use sequential testing if you must peek. - **Testing too many variants** -- each variant divides your traffic. Stick to 2-3 variants max. --- ## When to Skip the Framework Some decisions don't need the full decision tree: ### 1. Regulatory/Compliance Requirement **Action:** Just ship it. You don't have a choice. **But:** Document the change, set up monitoring, track any user impact. ### 2. Bug Fix **Action:** Just fix it. No one A/B tests bug fixes. **But:** If the "bug fix" changes user behavior significantly, monitor post-fix metrics. ### 3. CEO/Board Mandate **Action:** Document the decision and ship. Set up measurement so you can report on impact. **But:** Frame your measurement as "proving the impact" rather than "testing whether to do it." This builds credibility for future data-driven decisions. ### 4. Competitive Response **Action:** If a competitor just shipped a similar feature and your users are asking for it, speed matters more than experimentation. Ship fast, measure after. **But:** Don't use "competitive pressure" as an excuse for every feature. Reserve this for genuine market urgency. ### 5. Sunset/Deprecation **Action:** If you're removing a feature that <1% of users touch, just remove it with advance notice. **But:** If the feature has any paying customers relying on it, communicate early and provide alternatives. --- ## Output Quality Self-Check Before delivering the experiment decision, verify: - [ ] **Decision is clear** -- the recommendation is explicitly "A/B test," "Ship + Monitor," or "Just Ship" - [ ] **Reversibility** is assessed with specific reasoning (not just "yes/no") - [ ] **Hypothesis** is stated in If/Then/Because format - [ ] **Power calculation** is included if recommending a test (MDE, baseline, sample size, duration) - [ ] **Risk level** is justified with specific stakes (revenue impact, user count affected) - [ ] **Cost of testing** is weighed against cost of being wrong - [ ] **Edge cases** are checked (compliance, bug fix, mandate, competitive response) - [ ] **Stakeholder consensus** is noted -- does the team agree on the approach? - [ ] **Monitoring plan** exists regardless of decision (even "just ship" needs dashboards) - [ ] **Next step** is clear -- if testing, what metrics? If shipping, what success criteria? - [ ] **Connected to past decisions** -- have we made similar decisions before? What happened? --- ## Related Skills - `/experiment-metrics` - Choose the right metrics to measure - `/activation-analysis` - Test activation improvements - `/metrics-framework` - Understand leading vs lagging metrics - `/define-north-star` - Align tests to North Star --- **Framework credit:** Adapted from Aakash Gupta's experiment decision frameworks. Read: https://www.news.aakashg.com/p/when-to-ab-test --- ## Context Routing Strategy When the PM uses `/experiment-decision`, I automatically: ### 1. Check Historical Reversibility Precedent **Source:** `thoughts/shared/product/decisions/`, past decisions - **What I look for:** Similar decisions, how reversibility was judged - **How I use it:** Ensure consistent reversibility assessment - **Example:** "Last time we shipped CSS changes without testing; this is similar" ### 2. Extract Success Metrics Framework **Source:** `thoughts/shared/pm/metrics/`, active PRDs - **What I look for:** What metrics you typically measure, variance patterns - **How I use it:** Calculate minimum detectable effect (MDE) more accurately - **Example:** "Based on your metrics history, conversion rate variance is 3%, so MDE = 2%" ### 3. Route to Experiment Metrics if Testing **Source:** Connection to `/experiment-metrics` skill - **What I look for:** Whether decision routes to testing - **How I use it:** If decision is "test", auto-suggest next step with `/experiment-metrics` - **Example:** "Now that you've decided to test, let's pick the right metrics using STEDII" ### 4. Check Stakeholder Consensus on Risk **Source:** `thoughts/shared/pm/context/stakeholder-template.md`, recent discussions - **What I look for:** Stakeholder risk tolerance, veto power - **How I use it:** Surface if high-risk decision needs executive approval - **Example:** "CEO is risk-averse, so even medium-risk decisions should be tested" ### 5. Calculate Cost of Testing vs Shipping **Source:** Team capacity, past experiment timelines - **What I look for:** How long experiments take, engineering cost - **How I use it:** ROI calculation in the framework - **Example:** "Last experiment took 3 weeks; if we ship in 1 week and monitor, ROI favors shipping"