--- name: exp-driven-dev description: Builds features with A/B testing in mind using Ronny Kohavi's frameworks and Netflix/Airbnb experimentation culture. Use when implementing feature flags, choosing metrics, designing experiments, or building for fast iteration. Focuses on guardrail metrics, statistical significance, and experiment-driven development. --- # Experimentation-Driven Development ## When This Skill Activates Claude uses this skill when: - Building new features that affect core metrics - Implementing A/B testing infrastructure - Making data-driven decisions - Setting up feature flags for gradual rollouts - Choosing which metrics to track ## Core Frameworks ### 1. Experiment Design (Source: Ronny Kohavi, Microsoft/Netflix) **The HITS Framework:** **H - Hypothesis:** > "We believe that [change] will cause [metric] to [increase/decrease] because [reason]" **I - Implementation:** - Feature flag setup - Treatment vs control - Sample size calculation **T - Test:** - Run for statistical significance - Monitor guardrail metrics - Watch for unexpected effects **S - Ship or Stop:** - Ship if positive - Stop if negative - Iterate if inconclusive **Example:** ```markdown Hypothesis: "We believe that adding social proof ('X people bought this') will increase conversion rate by 10% because it reduces purchase anxiety." Implementation: - Control: No social proof - Treatment: Show "X people bought" - Sample size: 10,000 users per variant - Duration: 2 weeks Test: - Primary metric: Conversion rate - Guardrails: Cart abandonment, return rate Ship or Stop: - If conversion +5% or more → Ship - If conversion -2% or less → Stop - If inconclusive → Iterate and retest ``` --- ### 2. Metric Selection **Primary Metric:** - ONE metric you're trying to move - Directly tied to business value - Clear success threshold **Guardrail Metrics:** - Metrics that shouldn't degrade - Prevent gaming the system - Ensure quality maintained **Example:** ``` Feature: Streamlined checkout Primary Metric: ✅ Purchase completion rate (+10%) Guardrail Metrics: ⚠️ Cart abandonment (don't increase) ⚠️ Return rate (don't increase) ⚠️ Support tickets (don't increase) ⚠️ Load time (stay <2s) ``` --- ### 3. Statistical Significance **The Math:** ``` Minimum sample size = (Effect size, Confidence, Power) Typical settings: - Confidence: 95% (p < 0.05) - Power: 80% (detect 80% of real effects) - Effect size: Minimum detectable change Example: - Baseline conversion: 10% - Minimum detectable effect: +1% (to 11%) - Required: ~15,000 users per variant ``` **Common Mistakes:** - ❌ Stopping test early (peeking bias) - ❌ Running too short (seasonal effects) - ❌ Too many variants (dilutes sample) - ❌ Changing test mid-flight --- ### 4. Feature Flag Architecture **Implementation:** ```javascript // Feature flag pattern function checkoutFlow(user) { if (isFeatureEnabled(user, 'new-checkout')) { return newCheckoutExperience(); } else { return oldCheckoutExperience(); } } // Gradual rollout function isFeatureEnabled(user, feature) { const rolloutPercent = getFeatureRollout(feature); const userBucket = hashUserId(user.id) % 100; return userBucket < rolloutPercent; } // Experiment assignment function assignExperiment(user, experiment) { const variant = consistentHash(user.id, experiment); track('experiment_assigned', { userId: user.id, experiment: experiment, variant: variant }); return variant; } ``` --- ## Decision Tree: Should We Experiment? ``` NEW FEATURE │ ├─ Affects core metrics? ──────YES──→ EXPERIMENT REQUIRED │ NO ↓ │ ├─ Risky change? ──────────────YES──→ EXPERIMENT RECOMMENDED │ NO ↓ │ ├─ Uncertain impact? ──────────YES──→ EXPERIMENT USEFUL │ NO ↓ │ ├─ Easy to A/B test? ─────────YES──→ WHY NOT EXPERIMENT? │ NO ↓ │ └─ SHIP WITHOUT TEST ←────────────────┘ (But still feature flag for rollback) ``` ## Action Templates ### Template 1: Experiment Spec ```markdown # Experiment: [Name] ## Hypothesis **We believe:** [change] **Will cause:** [metric] to [increase/decrease] **Because:** [reasoning] ## Variants ### Control (50%) [Current experience] ### Treatment (50%) [New experience] ## Metrics ### Primary Metric - **What:** [metric name] - **Current:** [baseline] - **Target:** [goal] - **Success:** [threshold] ### Guardrail Metrics - **Metric 1:** [name] - Don't decrease - **Metric 2:** [name] - Don't increase - **Metric 3:** [name] - Maintain ## Sample Size - **Users needed:** [X per variant] - **Duration:** [Y days] - **Confidence:** 95% - **Power:** 80% ## Implementation ```javascript if (experiment('feature-name') === 'treatment') { // New experience } else { // Old experience } ``` ## Success Criteria - [ ] Primary metric improved by [X]% - [ ] No guardrail degradation - [ ] Statistical significance reached - [ ] No unexpected negative effects ## Decision - **If positive:** Ship to 100% - **If negative:** Rollback, iterate - **If inconclusive:** Extend or redesign ``` ### Template 2: Feature Flag Implementation ```typescript // features.ts export const FEATURES = { 'new-checkout': { rollout: 10, // 10% of users enabled: true, description: 'New streamlined checkout flow' }, 'ai-recommendations': { rollout: 0, // Not live yet enabled: false, description: 'AI-powered product recommendations' } }; // feature-flags.ts export function isEnabled(userId: string, feature: string): boolean { const config = FEATURES[feature]; if (!config || !config.enabled) return false; const bucket = consistentHash(userId) % 100; return bucket < config.rollout; } // usage in code if (isEnabled(user.id, 'new-checkout')) { return ; } else { return ; } ``` ### Template 3: Experiment Dashboard ```markdown # Experiment Dashboard ## Active Experiments ### Experiment 1: [Name] - **Status:** Running - **Started:** [date] - **Progress:** [X]% sample size reached - **Primary metric:** [current result] - **Guardrails:** ✅ All healthy ### Experiment 2: [Name] - **Status:** Complete - **Result:** Treatment won (+15% conversion) - **Decision:** Ship to 100% - **Shipped:** [date] ## Key Metrics ### Experiment Velocity - **Experiments launched:** [X per month] - **Win rate:** [Y]% - **Average duration:** [Z] days ### Impact - **Revenue impact:** +$[X] - **Conversion improvement:** +[Y]% - **User satisfaction:** +[Z] NPS ## Learnings - [Key insight 1] - [Key insight 2] - [Key insight 3] ``` ## Quick Reference ### 🧪 Experiment Checklist **Before Starting:** - [ ] Hypothesis written (believe → cause → because) - [ ] Primary metric defined - [ ] Guardrails identified - [ ] Sample size calculated - [ ] Feature flag implemented - [ ] Tracking instrumented **During Experiment:** - [ ] Don't peek early (wait for significance) - [ ] Monitor guardrails daily - [ ] Watch for unexpected effects - [ ] Log any external factors (holidays, outages) **After Experiment:** - [ ] Statistical significance reached - [ ] Guardrails not degraded - [ ] Decision made (ship/stop/iterate) - [ ] Learning documented --- ## Real-World Examples ### Example 1: Netflix Experimentation **Volume:** 250+ experiments running at once **Approach:** Everything is an experiment **Culture:** "Strong opinions, weakly held - let data decide" **Example Test:** - Hypothesis: Bigger thumbnails increase engagement - Result: No improvement, actually hurt browse time - Decision: Rollback - Learning: Saved $$ by not shipping --- ### Example 2: Airbnb's Experiments **Test:** New search ranking algorithm **Primary:** Bookings per search **Guardrails:** - Search quality (ratings of bookings) - Host earnings (don't concentrate bookings) - Guest satisfaction **Result:** +3% bookings, all guardrails healthy → Ship --- ### Example 3: Stripe's Feature Flags **Approach:** Every feature behind flag **Benefits:** - Instant rollback (flip flag) - Gradual rollout (1% → 5% → 25% → 100%) - Test in production safely **Example:** ```javascript if (experiments.isEnabled('instant-payouts')) { return ; } ``` --- ## Common Pitfalls ### ❌ Mistake 1: Peeking Too Early **Problem:** Stopping test before statistical significance **Fix:** Calculate sample size upfront, wait for it ### ❌ Mistake 2: No Guardrails **Problem:** Gaming the metric (increase clicks but hurt quality) **Fix:** Always define guardrails ### ❌ Mistake 3: Too Many Variants **Problem:** Not enough users per variant **Fix:** Limit to 2-3 variants max ### ❌ Mistake 4: Ignoring External Factors **Problem:** Holiday spike looks like treatment effect **Fix:** Note external events, extend duration --- ## Related Skills - **metrics-frameworks** - For choosing right metrics - **growth-embedded** - For growth experiments - **ship-decisions** - For when to ship vs test more - **strategic-build** - For deciding what to test --- ## Key Quotes **Ronny Kohavi:** > "The best way to predict the future is to run an experiment." **Netflix Culture:** > "Strong opinions, weakly held. Let data be the tie-breaker." **Airbnb:** > "We trust our intuition to generate hypotheses, and we trust data to make decisions." --- ## Further Learning - **references/experiment-design-guide.md** - Complete methodology - **references/statistical-significance.md** - Sample size calculations - **references/feature-flags-implementation.md** - Code examples - **references/guardrail-metrics.md** - Choosing guardrails