--- name: experiment-metrics description: STEDII framework for selecting trustworthy experiment metrics. Ensures metric validity and reliability. disable-model-invocation: false user-invocable: true --- # Experiment Metrics Selection: STEDII Framework **When to use:** Before launching any experiment, when metrics feel unreliable, or when experiment results are confusing **Framework source:** Aakash Gupta's "How to Choose the Right Metrics to Evaluate Experiments" --- ## The STEDII Framework Choose experiment metrics that are: 1. **S**ensitive 2. **T**imely 3. **E**fficient 4. **D**ebuggable 5. **I**nterpretable 6. **I**solated --- ## 1. Sensitive (Detects Small But Meaningful Changes) **What it means:** The metric moves when your feature actually improves the experience **Bad example:** - Metric: Monthly Active Users (MAU) - Problem: Too coarse. A good onboarding improvement might not move MAU for months. **Good example:** - Metric: Day 7 activation rate - Why: Sensitive enough to detect onboarding improvements within a week **How to check:** Ask: "If this experiment succeeds, will this metric move within the experiment window?" **Common mistake:** Using metrics that are too aggregated (MAU, total revenue) when you need something more granular (daily activation, conversion rate by cohort). --- ## 2. Timely (Results Available Quickly) **What it means:** You get signal fast enough to make decisions **Bad example:** - Metric: 90-day retention - Problem: Takes 90 days to know if your experiment worked **Good example:** - Metric: Day 7 retention + leading indicators - Why: Faster feedback, correlates with long-term retention **Tradeoff alert:** Sometimes you NEED slow metrics (LTV, annual retention). In those cases: - Use leading indicators to get fast signal - Run smaller experiments to validate - Accept longer experiment duration for critical decisions **How to check:** Ask: "Can I get actionable results within [1 week / 2 weeks / 1 month]?" --- ## 3. Efficient (High Statistical Power) **What it means:** You can detect the effect with reasonable sample size and time **Bad example:** - Metric: Revenue per user - Problem: High variance, need massive sample sizes **Good example:** - Metric: Conversion rate - Why: Lower variance, reaches significance faster **Statistical power explained:** - Power = ability to detect a real effect - Higher variance metrics = lower power = longer experiments - Formula: Sample size needed ∝ (Variance / Expected Effect Size)² **How to check:** Run a power calculation: ``` Minimum sample size = (Z + Z)² × (σ² / δ²) Where: - Z = confidence level (usually 1.96 for 95%) - σ = standard deviation of metric - δ = minimum detectable effect ``` **Practical tip:** If you need >1M users to detect a 5% lift, your metric isn't efficient enough. --- ## 4. Debuggable (Easy to Diagnose Issues) **What it means:** When something goes wrong, you can figure out why **Bad example:** - Metric: "Engagement score" (black box formula) - Problem: If it drops, you don't know what broke **Good example:** - Metric: Click-through rate (CTR) - Why: Simple, transparent, easy to debug **How to check:** Ask: "If this metric tanks, can I quickly understand what happened?" **What makes metrics debuggable:** - ✅ Simple calculations - ✅ Can be broken down by segments - ✅ Can view user-level data - ✅ Clear numerator and denominator **Red flags:** - ❌ Proprietary "engagement scores" - ❌ Complex weighted formulas - ❌ Metrics with 5+ variables - ❌ Black box ML model outputs --- ## 5. Interpretable (Easy to Understand and Explain) **What it means:** Stakeholders can understand what the metric represents **Bad example:** - Metric: "Quality-adjusted sessions per visitor" - Problem: What does "quality-adjusted" mean? **Good example:** - Metric: "% of users who complete onboarding" - Why: Crystal clear what it measures **The grandma test:** Can you explain this metric to your grandma? If not, it fails interpretability. **How to check:** - Can you explain it in one sentence? - Would a new PM understand it immediately? - Can executives grasp it without training? --- ## 6. Isolated (Measures Only What You Changed) **What it means:** The metric moves because of your experiment, not external factors **Bad example:** - Metric: Total signups - Problem: Could move due to marketing campaigns, seasonality, competitor changes **Good example:** - Metric: Signup conversion rate (for signup flow experiment) - Why: Isolated to the signup flow you're testing **Common isolation failures:** - Network effects (social features affect all users) - Cross-contamination (treatment bleeds to control) - Seasonality (holiday effects) - Marketing campaigns running simultaneously **How to check:** Ask: "Could something OTHER than my experiment cause this metric to move?" --- ## How to Use This Framework ### Step 1: List Your Candidate Metrics ``` Use /experiment-metrics I'm running an experiment to: [describe your experiment] Help me brainstorm 5-10 candidate metrics we could measure. ``` --- ### Step 2: Score Each Metric Against STEDII Create a table: | Metric | Sensitive? | Timely? | Efficient? | Debuggable? | Interpretable? | Isolated? | Total Score | |--------|------------|---------|------------|-------------|----------------|-----------|-------------| | Metric 1 | 2/3 | 3/3 | 2/3 | 3/3 | 3/3 | 2/3 | 15/18 | | Metric 2 | 3/3 | 1/3 | 3/3 | 2/3 | 3/3 | 3/3 | 15/18 | Scoring: - 3 = Excellent - 2 = Acceptable - 1 = Poor - 0 = Fails this criterion --- ### Step 3: Select Primary + Guardrail Metrics **Primary metric:** The ONE metric your experiment is designed to move - Should score 15+/18 on STEDII - The metric you'll make decisions on **Guardrail metrics (3-5):** Metrics you DON'T want to hurt - Revenue (don't tank it) - Core engagement (don't break the product) - Quality metrics (don't hurt user experience) **Example:** - **Primary:** Day 7 activation rate - **Guardrails:** Revenue per user, Daily active users, Customer satisfaction score, Page load time --- ### Step 4: Run Pre-Experiment Checks Before launching: 1. **A:A Test** - Run experiment with no actual change - Both groups should be identical - If metrics differ, you have a setup problem 2. **Sample Ratio Check** - Verify 50/50 split is actually 50/50 - If you see 52/48 or worse, investigate 3. **Metric Stability** - Check historical variance - High variance = longer experiment needed --- ## Common Metric Selection Mistakes ### Mistake #1: Using Only One Metric **Problem:** Optimize one thing, break another **Solution:** Always have guardrail metrics - Primary: what you're trying to improve - Guardrails: what you don't want to hurt --- ### Mistake #2: Confusing Leading and Lagging Metrics **Lagging metrics:** - Slow to respond - Ultimate outcome you care about - Example: LTV, annual retention, NPS **Leading metrics:** - Fast signal - Predictive of lagging metrics - Example: Day 7 retention, activation rate **Best practice:** Use leading metrics to get fast signal, validate with lagging metrics on a sample. --- ### Mistake #3: Metric Dilution **Problem:** Testing a small feature but measuring site-wide metrics **Example:** - Test: New checkout button color - Metric: Monthly revenue - Issue: Only 5% of users even see checkout, signal is too diluted **Solution:** Measure metrics scoped to exposed users - Better metric: Revenue per checkout visitor - Or: Conversion rate (checkout started → completed) --- ### Mistake #4: Simpson's Paradox **Problem:** Aggregate metric moves one way, segments move the opposite way **Example:** - Overall conversion rate: +5% ✅ - Mobile conversion: -10% ❌ - Desktop conversion: -5% ❌ - Why? More cheap mobile traffic shifted the mix **Solution:** Always segment your metrics (new vs returning, mobile vs desktop, etc.) --- ## Real-World Examples ### Example 1: Netflix Thumbnail Test **Experiment:** Testing new thumbnail images **Bad metric:** Monthly viewing hours - Not sensitive (too aggregated) - Not timely (takes too long) - Not isolated (affected by content releases) **Good metric:** Click-through rate on thumbnails - Sensitive: Directly measures thumbnail appeal - Timely: Results in 1-2 days - Efficient: Lots of impressions = fast significance - Debuggable: Can see which thumbnails work - Interpretable: "% of people who click" - Isolated: Measures only thumbnail change --- ### Example 2: Booking.com Pricing Test **Experiment:** Showing "Only 2 rooms left!" urgency message **Bad metric:** Bookings per visitor - Not efficient (high variance) - Not timely (slow conversion cycle) **Good metrics:** - Primary: Booking conversion rate - Guardrail: Customer satisfaction (don't annoy users) - Guardrail: Return visit rate (don't hurt trust) **Result:** +2.5% conversion, but -5% satisfaction and -3% return visits **Decision:** Don't ship. Guardrails caught a bad long-term tradeoff. --- ## Quick Reference: Metric Selection Checklist Before you launch an experiment, verify: - [ ] **Primary metric clearly defined** - What are you measuring? - How is it calculated? - What's the minimum detectable effect? - [ ] **STEDII checklist passed** - [ ] Sensitive enough to detect improvements - [ ] Results available within [X] days - [ ] Sample size achievable - [ ] Can be debugged if issues arise - [ ] Stakeholders understand it - [ ] Isolated from external factors - [ ] **Guardrails defined (3-5 metrics)** - Revenue metrics - Engagement metrics - Quality metrics - [ ] **Statistical plan complete** - Significance level (usually 95%) - Minimum sample size calculated - Experiment duration estimated - A:A test passed - [ ] **Segmentation plan** - How will you break down results? - New vs returning users - Mobile vs desktop - Geographic segments --- ## Related Skills - `/experiment-decision` - Decide when to A/B test vs ship - `/metrics-framework` - Understand leading vs lagging metrics - `/define-north-star` - Choose your North Star Metric - `/retention-analysis` - Measure long-term impact --- **Framework credit:** Adapted from Aakash Gupta's STEDII framework. Read the full article: https://www.news.aakashg.com/p/metrics-experiments --- ## Context Routing Strategy When the PM uses `/experiment-metrics`, I automatically: ### 1. Pull Metrics from PRDs & Strategy **Source:** `thoughts/shared/pm/prds/`, success metrics defined there - **What I look for:** Feature's pre-defined success metrics, targets - **How I use it:** Pre-populate primary and secondary metrics for STEDII evaluation - **Example:** "Your PRD says success = conversion >60%, let's test if that's STEDII-compliant" ### 2. Query Analytics MCPs for Historical Data **Source:** PostHog, PostHog, Posthog (if connected) - **What I look for:** Variance of potential metrics, time-to-signal data - **How I use it:** Validate metrics are Sensitive and Timely with real data - **Example:** "Metric X has 12% variance historically, so needs N=5000 sample size" ### 3. Check for Metric Conflicts with Guardrails **Source:** `thoughts/shared/pm/metrics/`, company guardrails - **What I look for:** Metrics that must not decline, company KPIs - **How I use it:** Ensure secondary metrics include guardrails - **Example:** "NPS is a company guardrail, must include in secondary metrics" ### 4. Reference Past Experiments for Benchmarks **Source:** `thoughts/shared/pm/metrics/`, A/B test results - **What I look for:** What worked in past experiments, surprising metric learnings - **How I use it:** Suggest metrics that detected real impacts before - **Example:** "In past experiments, page load time was poorly Sensitive, don't use it" ### 5. Route to Experiment Decision Framework **Source:** Connection to `/experiment-decision` skill - **What I look for:** Is testing even the right call? - **How I use it:** If you should ship without testing, auto-flag before selecting metrics - **Example:** "CSS changes are reversible, don't need this full STEDII analysis" --- ## Output Quality Self-Check Before presenting output to the PM, verify: - [ ] **Context was checked:** Reviewed `thoughts/shared/pm/metrics/` for existing experiments and baselines, and `thoughts/shared/pm/prds/` for pre-defined success metrics - [ ] **Each metric evaluated against all 6 STEDII dimensions:** Every candidate metric has a score (0-3) for Sensitive, Timely, Efficient, Debuggable, Interpretable, and Isolated, with reasoning for each score - [ ] **Sample size requirements calculated:** The output includes a minimum sample size estimate for the primary metric based on expected effect size and variance - [ ] **Metric sensitivity analysis included:** The output states whether the expected change is detectable given current traffic, variance, and experiment duration - [ ] **Guardrail metrics identified:** At least 3 guardrail metrics are defined with acceptable ranges to prevent unintended harm - [ ] **No vanity metrics without justification:** If any metric could be considered a vanity metric (e.g., page views, total signups), the output explains why it is valid for this specific experiment