---
name: Experiment Design
description: Comprehensive guide to A/B testing, multivariate testing, statistical significance, and experiment analysis for data-driven product decisions
---

# Experiment Design

## Types of Experiments

### 1. A/B Test (Two Variants)

**What:** Compare two versions (A vs B)

**Example:**
- **Control (A):** Blue "Buy Now" button
- **Treatment (B):** Green "Buy Now" button

**When to Use:**
- Testing a single change
- Clear hypothesis
- Binary decision (ship or don't ship)

**Pros:**
- Simple to implement
- Easy to analyze
- Clear winner

**Cons:**
- Only tests one change
- Can't test interactions

### 2. Multivariate Test (Multiple Changes)

**What:** Test multiple changes simultaneously

**Example:**
- **Variable 1:** Button color (Blue, Green, Red)
- **Variable 2:** Button text ("Buy Now", "Add to Cart", "Get Started")
- **Variants:** 3 × 3 = 9 combinations

**When to Use:**
- Testing multiple elements
- Want to find the best combination
- Have enough traffic

**Pros:**
- Tests interactions between variables
- Finds the optimal combination

**Cons:**
- Requires much more traffic
- Complex analysis
- Longer test duration

### 3. Sequential Testing

**What:** Continuously monitor results and stop early once there is a clear winner

**Example:**
- Start an A/B test
- Check results daily
- Stop when statistical significance is reached (could be day 3 or day 14)

**When to Use:**
- Want to ship winners fast
- High traffic
- Using tools that support it (Statsig, GrowthBook)

**Pros:**
- Faster results
- Less opportunity cost

**Cons:**
- Requires special statistical methods
- Standard fixed-sample p-values don't apply (naive "peeking" inflates false positives)

### 4. Holdout Groups (Long-Term Effects)

**What:** Keep a small % of users on the old experience permanently

**Example:**
- **95% of users:** New feature
- **5% of users:** Old experience (holdout)

**When to Use:**
- Measure long-term effects
- Detect delayed negative impacts
- Validate cumulative changes

**Pros:**
- Detects long-term issues
- Measures true impact

**Cons:**
- Some users get a worse experience
- Requires ongoing monitoring

---

## When to Experiment

### ✅ Experiment When:

1. **Significant Features (High Impact)**
   - Major redesign
   - New pricing model
   - Core flow changes

2. **Uncertain Outcomes**
   - Don't know if it will work
   - Conflicting opinions
   - No clear data

3. **Multiple Solution Options**
   - Two different approaches
   - Want to pick the best

4. **Optimization Opportunities**
   - Incremental improvements
   - Conversion optimization
   - Engagement optimization

### ❌ Don't Experiment When:

1. **Obvious Bugs/Fixes**
   - Broken functionality
   - Security issues
   - Legal compliance

2. **Very Low Traffic**
   - Can't reach statistical significance
   - Would take months

3. **Trivial Changes**
   - Copy typo fix
   - Minor styling adjustment

4. **Ethical Issues**
   - Manipulative dark patterns
   - Harmful to users

---

## Experiment Design Process

### Step 1: Define Hypothesis

**Template:**
> "If we [change], then [metric] will [improve by X%], because [reasoning]."

**Example:**
> "If we change the CTA button from blue to green, then click-through rate will increase by 10%, because green is more attention-grabbing."
### Step 2: Choose Metrics

**Primary Metric:** What you're optimizing
- Example: Click-through rate

**Secondary Metrics:** Other important outcomes
- Example: Conversion rate, revenue per user

**Counter Metrics:** Watch for negatives
- Example: Bounce rate, time on page

### Step 3: Determine Sample Size

**Inputs:**
- Baseline conversion rate: 5%
- Expected improvement: 10% relative lift (5% → 5.5%)
- Significance level: 0.05 (95% confidence)
- Power: 0.80 (80% chance of detecting the effect)

**Output:**
- Sample size needed: ~31,000 users per variant

**Tools:**
- Evan Miller's calculator: https://www.evanmiller.org/ab-testing/sample-size.html
- Optimizely sample size calculator

### Step 4: Set Test Duration

**Factors:**
- Sample size needed
- Daily traffic
- Weekly patterns (run at least 1-2 weeks)
- Business cycles

**Example:**
- Sample size: 31,000 per variant (62,000 total)
- Daily traffic: 5,000
- Duration: 62,000 / 5,000 = 12.4 days → **Run for 2 weeks**

### Step 5: Design Variants

**Control (A):** Current experience

**Treatment (B):** New experience

**Best Practices:**
- Change only one thing (for an A/B test)
- Make the change meaningful (not trivial)
- Ensure variants are distinct

### Step 6: Launch Test

**Checklist:**
- [ ] Hypothesis documented
- [ ] Metrics instrumented
- [ ] Sample size calculated
- [ ] Randomization working
- [ ] QA tested both variants
- [ ] Monitoring dashboard ready

### Step 7: Analyze Results

**Check:**
- Statistical significance (p < 0.05)
- Practical significance (is the improvement meaningful?)
- Secondary metrics (any red flags?)
- Segment analysis (does it work for everyone?)

### Step 8: Decide (Ship, Iterate, Kill)

**Ship if:**
- Positive, significant, no red flags

**Iterate if:**
- Mixed results, some segments good

**Kill if:**
- Negative, not significant, opportunity cost too high

---

## Choosing Metrics

### Primary Metric (What We're Optimizing)

**Characteristics:**
- Directly tied to the hypothesis
- Sensitive to the change
- Measurable within the test duration

**Examples:**
- Click-through rate (CTR)
- Conversion rate
- Sign-up completion rate
- Time to first action

**Bad Primary Metrics:**
- Revenue (too noisy, delayed)
- Retention (takes too long to measure)
- NPS (survey-based, low sample)

### Secondary Metrics (Guardrails, Side Effects)

**Purpose:** Ensure we're not breaking other things

**Examples:**
- Revenue per user
- Engagement (sessions per user)
- Feature adoption
- Customer satisfaction

### Counter Metrics (Watch for Negatives)

**Purpose:** Detect unintended negative consequences

**Examples:**
- Bounce rate (users leaving immediately)
- Error rate (technical issues)
- Support tickets (confusion)
- Churn rate (users leaving)

### Example: Checkout Flow Test

**Hypothesis:**
> "If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%."

**Metrics:**
- **Primary:** Checkout conversion rate
- **Secondary:** Average order value, time to complete checkout
- **Counter:** Cart abandonment rate, error rate, support tickets

---

## Statistical Significance

### P-Value < 0.05 (95% Confidence)

**What it Means:**
- If there were no real difference, a result this extreme would occur less than 5% of the time
- Conventionally read as "95% confident the effect is real"

**Example:**
- Control: 5.0% conversion
- Treatment: 5.5% conversion
- P-value: 0.03 ✅ (< 0.05, statistically significant)

**Interpretation:**
> "We're 95% confident that the treatment is better than control."
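To make the example concrete, here is a minimal Python sketch of the two-proportion z-test that most A/B tools run under the hood. The `ab_test_p_value` helper and the conversion counts are hypothetical, chosen to roughly match the 5.0% vs 5.5% scenario at the ~31,000-per-variant sample size from Step 3:

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Standard normal CDF using the error function from the standard library."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def ab_test_p_value(conversions_a: int, n_a: int, conversions_b: int, n_b: int) -> float:
    """Two-sided two-proportion z-test for a difference in conversion rates."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)   # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error of the difference
    z = (p_b - p_a) / se
    return 2 * (1 - normal_cdf(abs(z)))                      # two-sided p-value

# Hypothetical counts: 5.0% vs 5.5% conversion with ~31,000 users per variant
print(ab_test_p_value(conversions_a=1550, n_a=31000, conversions_b=1705, n_b=31000))
```

In practice you would pull these counts from your experimentation tool or warehouse rather than hard-coding them.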
### Statistical Power (80%+)

**What it Means:**
- 80% chance of detecting an effect if it exists
- Reduces false negatives

**Example:**
- Power: 80%
- Means: 20% chance of missing a real effect

### Minimum Detectable Effect (MDE)

**What it Means:**
- Smallest effect size you can reliably detect
- Depends on sample size

**Example:**
- Baseline: 5% conversion
- Sample size: ~31,000 per variant
- MDE: 0.5% absolute (10% relative)
- Can detect: 5.0% → 5.5% or larger

**Trade-off:**
- Larger sample size → Smaller MDE (detect smaller effects)
- Smaller sample size → Larger MDE (only detect big effects)

---

## Sample Size Calculation

### Formula (Simplified)

```
n = (Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)²

Where:
- n = sample size per variant
- Z_α/2 = 1.96 (for 95% confidence)
- Z_β = 0.84 (for 80% power)
- p₁ = baseline conversion rate
- p₂ = expected conversion rate
```

### Example Calculation

**Inputs:**
- Baseline conversion rate (p₁): 5% = 0.05
- Expected improvement: 10% relative lift
- New conversion rate (p₂): 5.5% = 0.055
- Significance level (α): 0.05
- Power (1-β): 0.80

**Calculation:**
```
n = (1.96 + 0.84)² × (0.05×0.95 + 0.055×0.945) / (0.05 - 0.055)²
n = 7.84 × (0.0475 + 0.052) / 0.000025
n = 7.84 × 0.0995 / 0.000025
n ≈ 31,200 per variant
```

**Total sample size:** 62,400 users

### Using Online Calculators

**Evan Miller's Calculator:**
1. Go to https://www.evanmiller.org/ab-testing/sample-size.html
2. Enter baseline conversion rate: 5%
3. Enter minimum detectable effect: 10% (relative)
4. Get sample size: ~31,000 per variant

**Optimizely Calculator:**
1. Go to the Optimizely sample size calculator
2. Enter baseline: 5%
3. Enter minimum detectable effect: 0.5% (absolute)
4. Get sample size: ~31,000 per variant

---

## Test Duration

### Minimum Duration: 1-2 Weeks

**Why:**
- Capture weekly patterns (weekday vs weekend)
- Avoid day-of-week bias
- Account for user behavior cycles

**Example:**
- Don't run Monday-Wednesday only
- Run at least Monday-Sunday (1 full week)

### Full Business Cycles

**Examples:**
- **E-commerce:** Include payday (1st and 15th of the month)
- **B2B SaaS:** Include a full week (avoid Friday-only)
- **Seasonal:** Avoid holidays (unless testing something holiday-specific)

### Enough Data for Significance

**Formula:**
```
Duration = Sample Size Needed / Daily Traffic
```

**Example:**
- Sample size: 62,000 total
- Daily traffic: 5,000
- Duration: 62,000 / 5,000 = 12.4 days
- **Run for:** 2 weeks (14 days)

### Not Too Long (Opportunity Cost)

**Trade-off:**
- Longer test = More confidence
- Longer test = Delayed learnings, slower iteration

**Guideline:**
- Most tests: 1-4 weeks
- High-traffic sites: 1-2 weeks
- Low-traffic sites: 2-4 weeks
- Don't run > 1 month (diminishing returns)

---

## Experiment Variants

### Control (Current Experience)

**What:** The existing experience

**Example:**
- Current checkout flow (5 steps)
- Current button color (blue)
- Current pricing page

**Purpose:** Baseline for comparison

### Treatment (New Experience)

**What:** The proposed change

**Example:**
- New checkout flow (3 steps)
- New button color (green)
- New pricing page

**Purpose:** Test the hypothesis

### Multiple Treatments (If Testing Different Approaches)

**Example:**
- **Control:** 5-step checkout
- **Treatment A:** 3-step checkout (combine steps)
- **Treatment B:** 1-page checkout (all on one page)

**Traffic Split:**
- Control: 33%
- Treatment A: 33%
- Treatment B: 34%

**Analysis:**
- Compare each treatment to control
- Compare treatments to each other
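Note that each of these comparisons needs the full per-variant sample size computed earlier. A minimal Python sketch of the simplified formula from the sample size section above (the `sample_size_per_variant` helper name is illustrative; dedicated calculators may use slightly different approximations):

```python
from math import ceil

def sample_size_per_variant(p1: float, p2: float, z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Simplified two-proportion sample size formula.

    z_alpha = 1.96 corresponds to 95% confidence (two-sided), z_beta = 0.84 to 80% power.
    """
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# 5% baseline with a 10% relative lift (5.0% → 5.5%)
print(sample_size_per_variant(0.05, 0.055))  # ≈ 31,200 per variant
```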
---

## Randomization

### User-Level Randomization (Consistent Experience)

**What:** Each user always sees the same variant

**How:**
```javascript
const variant = hashUserId(userId) % 2 === 0 ? 'control' : 'treatment';
```

**When to Use:**
- Logged-in users
- Want a consistent experience
- Testing flows (multi-step)

**Pros:**
- Consistent experience
- No confusion

**Cons:**
- Requires a user ID

### Session-Level (For Anonymous Users)

**What:** Each session sees the same variant (but different sessions can differ)

**How:**
```javascript
const variant = hashSessionId(sessionId) % 2 === 0 ? 'control' : 'treatment';
```

**When to Use:**
- Anonymous users
- Single-page tests

**Pros:**
- Works for anonymous users

**Cons:**
- The same user can see different variants across sessions

### Stratified Sampling (For Segments)

**What:** Ensure an even distribution across segments

**Example:**
- Segment 1: Free users (50% control, 50% treatment)
- Segment 2: Paid users (50% control, 50% treatment)

**Why:**
- Avoid imbalanced segments
- Enable segment analysis

---

## Common Pitfalls

### 1. Peeking (Stopping the Test Early When "Winning")

**Problem:**
```
Day 3: Treatment is winning! (p = 0.04) → Ship it!
Day 7: No longer significant... (p = 0.12) → Oops.
```

**Why It's Bad:**
- Increases the false positive rate
- The p-value fluctuates during the test

**Solution:**
- Decide the sample size upfront
- Don't look until the test completes
- Or use sequential testing (a proper method for early stopping)

### 2. Sample Ratio Mismatch (Uneven Splits)

**Problem:**
```
Expected: 50% control, 50% treatment
Actual: 48% control, 52% treatment
```

**Why It's Bad:**
- Indicates a randomization bug
- Results may be invalid

**Solution:**
- Check the sample ratio before analyzing
- Investigate if the mismatch is > 1%

### 3. Novelty Effect (Users Trying the New Thing)

**Problem:**
```
Week 1: Treatment is winning! (+20%)
Week 4: Treatment is the same as control (0%)
```

**Why It's Bad:**
- Users try the new thing out of curiosity
- The effect fades over time

**Solution:**
- Run the test longer (2-4 weeks)
- Use a holdout group for long-term measurement
- Segment by new vs returning users

### 4. Seasonality (Testing During Holidays)

**Problem:**
```
Test during Black Friday: +50% conversion
Test during a normal week: +5% conversion
```

**Why It's Bad:**
- Holiday behavior is different
- Results don't generalize

**Solution:**
- Avoid testing during holidays
- Or run the test across multiple weeks (include holiday + normal)

---

## Sequential Testing

### What is Sequential Testing?

**Traditional A/B Test:**
- Decide sample size upfront
- Run until the sample size is reached
- Analyze once at the end

**Sequential Testing:**
- Monitor continuously
- Stop early if there is a clear winner
- Adjust the significance threshold

### How It Works

**Algorithm:**
- Use an adjusted significance threshold (not a fixed 0.05)
- Account for multiple looks
- Stop when the threshold is crossed

**Example (Simplified):**
```
Day 1: p = 0.10 → Continue
Day 3: p = 0.03 → Continue
Day 5: p = 0.001 → Stop! (clear winner)
```

### Tools That Support Sequential Testing

- **Statsig:** Built-in sequential testing
- **GrowthBook:** Bayesian statistics
- **Optimizely:** Stats Engine (sequential)

### Benefits

- Faster results (stop early if there is a clear winner)
- Less opportunity cost
- Detects large effects quickly

### Drawbacks

- Requires special tools
- Can't use a traditional fixed-horizon p-value
- More complex

---

## Holdout Groups

### What is a Holdout Group?

**Definition:** A small % of users kept on the old experience permanently

**Example:**
- 95% of users: New feature
- 5% of users: Old experience (holdout)
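As with regular variant assignment, the holdout split is usually a deterministic hash on the user ID so the same 5% stay in the holdout across sessions. A minimal sketch (the `holdout_bucket` helper and salt are illustrative, not a specific tool's API):

```python
import hashlib

def holdout_bucket(user_id: str, holdout_pct: float = 0.05, salt: str = "new-feature-holdout") -> bool:
    """Deterministically keep ~holdout_pct of users on the old experience.

    Hashing the user ID with a per-feature salt keeps the assignment stable
    across sessions and independent between features.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash to [0, 1]
    return bucket < holdout_pct

# True → this user stays on the old experience
print(holdout_bucket("user-123"))
```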
### Why Use Holdout Groups?

**Measure Long-Term Effects:**
- The A/B test shows +10% conversion in 2 weeks
- The holdout shows +5% conversion after 6 months
- **Learning:** The effect diminishes over time

**Detect Delayed Negative Impacts:**
- The A/B test shows +15% signups
- The holdout shows +10% churn after 3 months
- **Learning:** The feature attracts the wrong users

### How Long to Keep a Holdout?

**Guideline:**
- 1-3 months for most features
- 6-12 months for major changes
- Permanent for critical features

### When to Remove a Holdout?

**Remove if:**
- No long-term differences detected
- Opportunity cost is too high (5% of users on a worse experience)
- The feature is critical (everyone should have it)

---

## Experiment Analysis

### Step 1: Compare Primary Metric

**Example:**
- Control: 5.0% conversion
- Treatment: 5.5% conversion
- Lift: +10% relative
- P-value: 0.03 ✅

**Decision:** The treatment is significantly better than control.

### Step 2: Check Secondary Metrics

**Example:**
- Revenue per user: $10.50 (control) vs $11.20 (treatment) ✅
- Time to checkout: 3.2 min (control) vs 2.8 min (treatment) ✅

**Decision:** Secondary metrics also improved.

### Step 3: Check Counter Metrics

**Example:**
- Bounce rate: 30% (control) vs 32% (treatment) ⚠️
- Error rate: 0.5% (control) vs 0.5% (treatment) ✅

**Decision:** Slight increase in bounce rate, investigate.

### Step 4: Segment Analysis

**Did it work for everyone?**

| Segment | Control | Treatment | Lift |
|---------|---------|-----------|------|
| Mobile | 4.5% | 5.2% | +15% ✅ |
| Desktop | 5.5% | 5.8% | +5% ✅ |
| Free users | 3.0% | 3.6% | +20% ✅ |
| Paid users | 7.0% | 7.1% | +1% ⚠️ |

**Learning:** Works great for mobile and free users, minimal impact on paid users.

### Step 5: Statistical Significance

**Check:**
- P-value < 0.05 ✅
- Confidence interval doesn't include 0 ✅

**Example:**
- Lift: +10%
- 95% CI: [+5%, +15%]
- Interpretation: We're 95% confident the true lift is between 5% and 15%.

### Step 6: Practical Significance

**Is the improvement meaningful?**

**Example:**
- Statistically significant: Yes (p = 0.04)
- Lift: +0.1% (5.0% → 5.005%)
- **Decision:** Not practically significant (too small to matter)

**Guideline:**
- Small lift but high volume → Ship (e.g., +0.1 percentage points on 1M users = 1,000 more conversions)
- Large lift but low volume → Maybe ship (e.g., +50% on a base of 100 conversions = 50 more conversions)

---

## Decision Framework

### Ship If:

✅ **Positive:** Treatment is better than control
✅ **Significant:** P-value < 0.05
✅ **No Red Flags:** Secondary and counter metrics look good
✅ **Works for Key Segments:** At least works for the majority

**Example:**
- Conversion: +10% (p = 0.03) ✅
- Revenue: +8% (p = 0.05) ✅
- Bounce rate: No change ✅
- Works for mobile and desktop ✅
- **Decision: Ship!**

### Iterate If:

⚠️ **Mixed Results:** Some metrics up, some down
⚠️ **Works for Some Segments Only:** E.g., only mobile, not desktop
⚠️ **Close to Significance:** p = 0.06 (just missed)

**Example:**
- Conversion: +10% (p = 0.03) ✅
- Revenue: -5% (p = 0.08) ⚠️
- **Decision: Iterate.** Conversion is up but revenue is down. Investigate why.

### Kill If:

❌ **Negative:** Treatment is worse than control
❌ **Not Significant:** P-value > 0.05
❌ **Opportunity Cost Too High:** Could be working on better ideas

**Example:**
- Conversion: +2% (p = 0.15) ❌
- Took 4 weeks to test
- **Decision: Kill.** Not significant, move on to the next idea.
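The confidence interval from Step 5 above can be computed directly from per-variant counts. A minimal Python sketch using the normal approximation (the `relative_lift_ci` helper and the counts are hypothetical):

```python
from math import sqrt

def relative_lift_ci(conv_a: int, n_a: int, conv_b: int, n_b: int, z: float = 1.96) -> tuple[float, float]:
    """95% confidence interval for the relative lift of treatment over control."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)  # unpooled SE of the difference
    return (diff - z * se) / p_a, (diff + z * se) / p_a        # expressed relative to control

# Hypothetical counts: 5.0% vs 5.5% conversion with 31,000 users per variant
low, high = relative_lift_ci(1550, 31000, 1705, 31000)
print(f"relative lift 95% CI: {low:+.1%} to {high:+.1%}")
```

If the whole interval is above zero, the result is significant at the same level as the p-value check.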
---

## Tools

### Feature Flags

**LaunchDarkly:**
- Feature flag management
- Gradual rollouts
- Kill switches

**Split.io:**
- Feature flags + experimentation
- Real-time metrics

**Unleash:**
- Open-source feature flags
- Self-hosted option

### Experimentation Platforms

**Optimizely:**
- Full-stack experimentation
- Visual editor for web
- Stats Engine (sequential testing)

**VWO (Visual Website Optimizer):**
- A/B testing for web
- Heatmaps, session recordings
- Visual editor

**GrowthBook:**
- Open-source experimentation
- Bayesian statistics
- Feature flags

**Statsig:**
- Modern experimentation platform
- Sequential testing
- Free tier

### Analytics

**Amplitude:**
- Product analytics
- Funnel analysis
- Cohort analysis

**Mixpanel:**
- Event-based analytics
- A/B test analysis
- Retention analysis

**PostHog:**
- Open-source product analytics
- Feature flags
- Session replay

---

## A/B Testing for Engineers

### 1. Feature Flag Implementation

**Node.js (LaunchDarkly):**
```javascript
const express = require('express');
const LaunchDarkly = require('launchdarkly-node-server-sdk');

const app = express();
const client = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY);
const ready = client.waitForInitialization();

app.get('/checkout', async (req, res) => {
  await ready; // ensure the SDK has finished initializing

  const user = {
    key: req.user.id,
    email: req.user.email,
    custom: { plan: req.user.plan }
  };

  // Third argument is the fallback value if the flag can't be evaluated
  const showNewCheckout = await client.variation('new-checkout-flow', user, false);

  if (showNewCheckout) {
    res.render('checkout-new');
  } else {
    res.render('checkout-old');
  }
});
```

**Python (Statsig):**
```python
import os

from flask import Flask, render_template
from flask_login import current_user  # assuming Flask-Login provides current_user
from statsig import statsig, StatsigUser

app = Flask(__name__)
statsig.initialize(os.environ['STATSIG_SERVER_KEY'])

@app.route('/checkout')
def checkout():
    user = StatsigUser(
        user_id=current_user.id,
        email=current_user.email,
        custom={'plan': current_user.plan}
    )

    show_new_checkout = statsig.check_gate(user, 'new_checkout_flow')

    if show_new_checkout:
        return render_template('checkout_new.html')
    else:
        return render_template('checkout_old.html')
```

### 2. Metric Instrumentation

**Segment (Event Tracking):**
```javascript
const Analytics = require('analytics-node');
const analytics = new Analytics(process.env.SEGMENT_WRITE_KEY);

// Track checkout started
analytics.track({
  userId: user.id,
  event: 'Checkout Started',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    cart_value: cart.total,
    items_count: cart.items.length
  }
});

// Track checkout completed
analytics.track({
  userId: user.id,
  event: 'Checkout Completed',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    order_id: order.id,
    revenue: order.total
  }
});
```

### 3. Data Pipeline

**Architecture:**
```
Application
    ↓ (events)
Segment
    ↓ (forwards to)
    ├── Amplitude (analytics)
    ├── Mixpanel (analytics)
    ├── Data Warehouse (BigQuery, Snowflake)
    └── Statsig (experimentation)
```
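Downstream of this pipeline, the warehouse or experimentation tool reduces the raw events to per-variant counts like the ones the dashboard below plots. A minimal Python sketch of that aggregation, assuming events shaped like the Segment calls above (the `conversion_by_variant` helper is illustrative):

```python
from collections import Counter

def conversion_by_variant(started_events: list[dict], completed_events: list[dict]) -> dict[str, float]:
    """Compute checkout conversion per variant from raw tracking events."""
    started = Counter(e["properties"]["variant"] for e in started_events)
    completed = Counter(e["properties"]["variant"] for e in completed_events)
    return {variant: completed[variant] / started[variant] for variant in started}

# Tiny hypothetical event sample
started = [{"properties": {"variant": v}} for v in ["control"] * 200 + ["treatment"] * 200]
completed = [{"properties": {"variant": v}} for v in ["control"] * 10 + ["treatment"] * 12]
print(conversion_by_variant(started, completed))  # {'control': 0.05, 'treatment': 0.06}
```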
### 4. Results Dashboard

**Grafana Dashboard:**
```json
{
  "dashboard": {
    "title": "A/B Test: New Checkout Flow",
    "panels": [
      {
        "title": "Conversion Rate by Variant",
        "targets": [
          {
            "expr": "sum(checkout_completed{variant='control'}) / sum(checkout_started{variant='control'})",
            "legendFormat": "Control"
          },
          {
            "expr": "sum(checkout_completed{variant='treatment'}) / sum(checkout_started{variant='treatment'})",
            "legendFormat": "Treatment"
          }
        ]
      },
      {
        "title": "Sample Size",
        "targets": [
          {
            "expr": "sum(checkout_started{variant='control'})",
            "legendFormat": "Control"
          },
          {
            "expr": "sum(checkout_started{variant='treatment'})",
            "legendFormat": "Treatment"
          }
        ]
      }
    ]
  }
}
```

---

## Real Experiment Examples

### Example 1: Button Color Test (Classic)

**Hypothesis:**
> "If we change the CTA button from blue to orange, click-through rate will increase by 10%, because orange is more attention-grabbing."

**Test:**
- Control: Blue button
- Treatment: Orange button
- Sample size: 10,000 per variant
- Duration: 1 week

**Results:**
- Control: 5.2% CTR
- Treatment: 5.7% CTR
- Lift: +9.6%
- P-value: 0.04 ✅

**Decision:** Ship the orange button.

### Example 2: Checkout Flow Optimization

**Hypothesis:**
> "If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%, because users abandon due to flow length."

**Test:**
- Control: 5-step checkout
- Treatment: 3-step checkout (combined steps)
- Sample size: 50,000 per variant
- Duration: 2 weeks

**Results:**
- Control: 8.5% conversion
- Treatment: 9.8% conversion
- Lift: +15.3%
- P-value: 0.001 ✅

**Secondary Metrics:**
- Time to checkout: 4.2 min → 3.1 min ✅
- Error rate: 2.1% → 1.8% ✅

**Decision:** Ship the 3-step checkout.

### Example 3: Pricing Page Variants

**Hypothesis:**
> "If we show annual pricing first (instead of monthly), annual plan adoption will increase by 25%, because of the anchoring effect."

**Test:**
- Control: Monthly pricing shown first
- Treatment: Annual pricing shown first
- Sample size: 20,000 per variant
- Duration: 3 weeks

**Results:**
- Control: 12% annual adoption
- Treatment: 18% annual adoption
- Lift: +50%
- P-value: 0.001 ✅

**Counter Metrics:**
- Overall conversion: 10.5% → 10.2% ⚠️ (slight drop)

**Decision:** Ship, but monitor overall conversion.

### Example 4: Onboarding Flow

**Hypothesis:**
> "If we add an interactive tutorial to onboarding, activation rate will increase by 30%, because users don't know how to get started."

**Test:**
- Control: No tutorial
- Treatment: Interactive tutorial (5 steps)
- Sample size: 15,000 per variant
- Duration: 2 weeks

**Results:**
- Control: 25% activation rate
- Treatment: 28% activation rate
- Lift: +12%
- P-value: 0.08 ❌ (not significant)

**Segment Analysis:**
- New users: +20% (p = 0.03) ✅
- Returning users: +2% (p = 0.5) ❌

**Decision:** Iterate. Show the tutorial only to new users.

---

## Advanced: Bayesian A/B Testing

### Traditional (Frequentist) A/B Testing

**Approach:**
- Null hypothesis: No difference between A and B
- P-value: Probability of seeing a result this extreme if the null is true
- Reject the null if p < 0.05

**Interpretation:**
> "If there were truly no difference, a result this extreme would be unlikely (p < 0.05)."

### Bayesian A/B Testing

**Approach:**
- Prior belief: What we believe before the test
- Likelihood: Data from the test
- Posterior belief: Updated belief after the test

**Interpretation:**
> "There's a 95% probability that B is better than A."
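A minimal sketch of how this is commonly computed with a Beta-Binomial model and a uniform Beta(1, 1) prior (the `probability_b_beats_a` helper and counts are illustrative, not a specific tool's implementation):

```python
import numpy as np

def probability_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                          samples: int = 200_000, seed: int = 0) -> float:
    """Estimate P(treatment > control) by sampling from each variant's Beta posterior."""
    rng = np.random.default_rng(seed)
    posterior_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, samples)  # Beta(1, 1) prior + data
    posterior_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, samples)
    return float((posterior_b > posterior_a).mean())

# Hypothetical counts: 5.0% vs 5.5% conversion with 1,000 users per variant
print(probability_b_beats_a(conv_a=50, n_a=1000, conv_b=55, n_b=1000))
```

The exact probability depends on the prior and the counts; the point is that the output is a direct "probability B beats A" rather than a p-value.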
### Benefits of Bayesian

1. **Easier to Interpret:**
   - "95% probability B is better" (intuitive)
   - vs "p = 0.03" (confusing)

2. **Can Stop Early:**
   - No peeking problem
   - Stop when confident enough

3. **Incorporates Prior Knowledge:**
   - Use historical data
   - More accurate with small samples

### Tools That Use Bayesian

- **GrowthBook:** Bayesian by default
- **VWO:** Bayesian engine option
- **Google Optimize:** Bayesian (deprecated)

### Example

**Test:**
- Control: 5.0% conversion (1,000 users)
- Treatment: 5.5% conversion (1,000 users)

**Frequentist:**
- P-value: 0.15 (not significant)
- Decision: Can't conclude

**Bayesian:**
- Probability B > A: 87%
- Expected lift: +10%
- Decision: Likely better, but not confident enough (need 95%)

---

## Summary

### Quick Reference

**Experiment Types:**
- A/B test: Two variants
- Multivariate: Multiple changes
- Sequential: Stop early
- Holdout: Long-term measurement

**When to Experiment:**
- Significant features
- Uncertain outcomes
- Multiple options
- Optimization

**Process:**
1. Define hypothesis
2. Choose metrics
3. Calculate sample size
4. Set duration
5. Design variants
6. Launch
7. Analyze
8. Decide

**Metrics:**
- Primary: What we're optimizing
- Secondary: Guardrails
- Counter: Watch for negatives

**Statistical Significance:**
- P-value < 0.05
- Power > 80%
- Minimum detectable effect

**Common Pitfalls:**
- Peeking
- Sample ratio mismatch
- Novelty effect
- Seasonality

**Decision Framework:**
- Ship: Positive, significant, no red flags
- Iterate: Mixed results
- Kill: Negative, not significant

**Tools:**
- Feature flags: LaunchDarkly, Split.io
- Experimentation: Optimizely, Statsig, GrowthBook
- Analytics: Amplitude, Mixpanel, PostHog