--- name: ab-test-setup description: Structured guide for setting up A/B tests with mandatory gates for hypothesis, metrics, and execution readiness. metadata: scope: [root] auto_invoke: "Working with ab test setup" --- # A/B Test Setup ## 1️⃣ Purpose & Scope Ensure every A/B test is **valid, rigorous, and safe** before a single line of code is written. - Prevents "peeking" - Enforces statistical power - Blocks invalid hypotheses --- ## 2️⃣ Pre-Requisites You must have: - A clear user problem - Access to an analytics source - Roughly estimated traffic volume ### Hypothesis Quality Checklist A valid hypothesis includes: - Observation or evidence - Single, specific change - Directional expectation - Defined audience - Measurable success criteria --- ### 3️⃣ Hypothesis Lock (Hard Gate) Before designing variants or metrics, you MUST: - Present the **final hypothesis** - Specify: - Target audience - Primary metric - Expected direction of effect - Minimum Detectable Effect (MDE) Ask explicitly: > “Is this the final hypothesis we are committing to for this test?” **Do NOT proceed until confirmed.** --- ### 4️⃣ Assumptions & Validity Check (Mandatory) Explicitly list assumptions about: - Traffic stability - User independence - Metric reliability - Randomization quality - External factors (seasonality, campaigns, releases) If assumptions are weak or violated: - Warn the user - Recommend delaying or redesigning the test --- ### 5️⃣ Test Type Selection Choose the simplest valid test: - **A/B Test** – single change, two variants - **A/B/n Test** – multiple variants, higher traffic required - **Multivariate Test (MVT)** – interaction effects, very high traffic - **Split URL Test** – major structural changes Default to **A/B** unless there is a clear reason otherwise. --- ### 6️⃣ Metrics Definition #### Primary Metric (Mandatory) - Single metric used to evaluate success - Directly tied to the hypothesis - Pre-defined and frozen before launch #### Secondary Metrics - Provide context - Explain _why_ results occurred - Must not override the primary metric #### Guardrail Metrics - Metrics that must not degrade - Used to prevent harmful wins - Trigger test stop if significantly negative --- ### 7️⃣ Sample Size & Duration Define upfront: - Baseline rate - MDE - Significance level (typically 95%) - Statistical power (typically 80%) Estimate: - Required sample size per variant - Expected test duration **Do NOT proceed without a realistic sample size estimate.** --- ### 8️⃣ Execution Readiness Gate (Hard Stop) You may proceed to implementation **only if all are true**: - Hypothesis is locked - Primary metric is frozen - Sample size is calculated - Test duration is defined - Guardrails are set - Tracking is verified If any item is missing, stop and resolve it. --- ## Running the Test ### During the Test **DO:** - Monitor technical health - Document external factors **DO NOT:** - Stop early due to “good-looking” results - Change variants mid-test - Add new traffic sources - Redefine success criteria --- ## Analyzing Results ### Analysis Discipline When interpreting results: - Do NOT generalize beyond the tested population - Do NOT claim causality beyond the tested change - Do NOT override guardrail failures - Separate statistical significance from business judgment ### Interpretation Outcomes | Result | Action | | -------------------- | -------------------------------------- | | Significant positive | Consider rollout | | Significant negative | Reject variant, document learning | | Inconclusive | Consider more traffic or bolder change | | Guardrail failure | Do not ship, even if primary wins | --- ## Documentation & Learning ### Test Record (Mandatory) Document: - Hypothesis - Variants - Metrics - Sample size vs achieved - Results - Decision - Learnings - Follow-up ideas Store records in a shared, searchable location to avoid repeated failures. --- ## Refusal Conditions (Safety) Refuse to proceed if: - Baseline rate is unknown and cannot be estimated - Traffic is insufficient to detect the MDE - Primary metric is undefined - Multiple variables are changed without proper design - Hypothesis cannot be clearly stated Explain why and recommend next steps. --- ## Key Principles (Non-Negotiable) - One hypothesis per test - One primary metric - Commit before launch - No peeking - Learning over winning - Statistical rigor first --- ## Final Reminder A/B testing is not about proving ideas right. It is about **learning the truth with confidence**. If you feel tempted to rush, simplify, or “just try it” — that is the signal to **slow down and re-check the design**.