---
name: ab-test-setup
description: Structured guide for setting up A/B tests with mandatory gates for hypothesis, metrics, and execution readiness.
metadata:
  scope: [root]
  auto_invoke: "Working with ab test setup"
---

# A/B Test Setup

## 1️⃣ Purpose & Scope

Ensure every A/B test is **valid, rigorous, and safe** before a single line of code is written.

- Prevents "peeking" (acting on interim results before the planned sample size is reached)
- Enforces adequate statistical power
- Blocks invalid hypotheses

---

## 2️⃣ Pre-Requisites

You must have:

- A clear user problem
- Access to an analytics source
- A rough estimate of traffic volume

### Hypothesis Quality Checklist

A valid hypothesis includes:

- Observation or evidence
- Single, specific change
- Directional expectation
- Defined audience
- Measurable success criteria

---

## 3️⃣ Hypothesis Lock (Hard Gate)

Before designing variants or metrics, you MUST:

- Present the **final hypothesis**
- Specify:
  - Target audience
  - Primary metric
  - Expected direction of effect
  - Minimum Detectable Effect (MDE)

Ask explicitly:

> “Is this the final hypothesis we are committing to for this test?”

**Do NOT proceed until confirmed.**
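
One way to make the lock concrete is to capture the required fields in a small record and treat the gate as passed only when every field is filled in and the user has confirmed. The sketch below is illustrative only: the class name `Hypothesis`, its field names, the "increase"/"decrease" vocabulary, and the choice to express the MDE as a relative lift are all assumptions, not part of this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical container for the fields the Hypothesis Lock asks for."""
    statement: str
    target_audience: str
    primary_metric: str
    expected_direction: str   # assumed vocabulary: "increase" or "decrease"
    mde_relative: float       # assumed relative, e.g. 0.10 = 10% relative lift
    confirmed: bool = False   # set to True only after explicit confirmation


def missing_fields(h: Hypothesis) -> list:
    """Return the names of fields that are empty or invalid."""
    missing = [
        name
        for name in ("statement", "target_audience", "primary_metric")
        if not getattr(h, name).strip()
    ]
    if h.expected_direction not in ("increase", "decrease"):
        missing.append("expected_direction")
    if h.mde_relative <= 0:
        missing.append("mde_relative")
    return missing


def is_locked(h: Hypothesis) -> bool:
    """The gate passes only when every field is valid AND the user confirmed."""
    return h.confirmed and not missing_fields(h)
```

Making the record frozen means a "locked" hypothesis cannot be edited in place; any change requires building a new record and confirming it again.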

---

## 4️⃣ Assumptions & Validity Check (Mandatory)

Explicitly list assumptions about:

- Traffic stability
- User independence
- Metric reliability
- Randomization quality
- External factors (seasonality, campaigns, releases)

If assumptions are weak or violated:

- Warn the user
- Recommend delaying or redesigning the test
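
Randomization quality is one assumption that can be checked numerically once traffic is flowing, for example with a sample ratio mismatch (SRM) check that compares observed assignment counts against the intended split. The sketch below is a minimal two-variant version using a chi-square goodness-of-fit test with one degree of freedom. The function name `srm_p_value` and the 0.001 alert threshold in the usage lines are assumptions chosen for illustration.

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_share_control: float = 0.5) -> float:
    """P-value of a chi-square goodness-of-fit test (1 degree of freedom)
    comparing observed assignment counts with the intended split."""
    total = n_control + n_treatment
    expected_control = total * expected_share_control
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


if __name__ == "__main__":
    SRM_ALERT_THRESHOLD = 0.001  # assumed strictness, tune to your context
    p = srm_p_value(5000, 5400)
    if p < SRM_ALERT_THRESHOLD:
        print("Possible randomization problem: investigate before trusting results")
```

A very small p-value here says nothing about which variant is better. It only suggests that assignment did not follow the planned split, which weakens the randomization assumption listed above.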

---

## 5️⃣ Test Type Selection

Choose the simplest valid test:

- **A/B Test** – single change, two variants
- **A/B/n Test** – multiple variants, higher traffic required
- **Multivariate Test (MVT)** – interaction effects, very high traffic
- **Split URL Test** – major structural changes

Default to **A/B** unless there is a clear reason otherwise.

---

## 6️⃣ Metrics Definition

### Primary Metric (Mandatory)

- Single metric used to evaluate success
- Directly tied to the hypothesis
- Pre-defined and frozen before launch

### Secondary Metrics

- Provide context
- Explain _why_ results occurred
- Must not override the primary metric

### Guardrail Metrics

- Metrics that must not degrade
- Used to prevent harmful wins
- Trigger a test stop if they degrade significantly

---

## 7️⃣ Sample Size & Duration

Define upfront:

- Baseline rate
- MDE
- Significance level (typically α = 0.05, i.e. 95% confidence)
- Statistical power (typically 80%)

Estimate:

- Required sample size per variant
- Expected test duration

**Do NOT proceed without a realistic sample size estimate.**
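
As a rough illustration of how these inputs combine, the sketch below applies the standard normal-approximation formula for comparing two proportions with a two-sided test. It assumes the primary metric is a conversion rate, that the MDE is expressed as a relative lift over the baseline, and that traffic is split evenly. The function names and the example traffic figure are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, mde_relative: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant for a two-sided
    two-proportion z-test (normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)  # rate if the MDE is achieved
    if not (0 < p1 < 1 and 0 < p2 < 1 and p1 != p2):
        raise ValueError("rates must lie strictly between 0 and 1 and differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def duration_days(n_per_variant: int, daily_visitors: int,
                  n_variants: int = 2) -> int:
    """Days needed if all eligible daily visitors enter the test."""
    return math.ceil(n_per_variant * n_variants / daily_visitors)


if __name__ == "__main__":
    n = sample_size_per_variant(baseline_rate=0.05, mde_relative=0.10)
    # 4000 eligible visitors per day is an illustrative figure only.
    print(n, "users per variant,", duration_days(n, 4000), "days")
```

Because the required sample grows roughly with the inverse square of the absolute difference, halving the MDE roughly quadruples the traffic needed, which is why an unrealistic MDE is a common reason a test cannot be run.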

---

## 8️⃣ Execution Readiness Gate (Hard Stop)

You may proceed to implementation **only if all are true**:

- Hypothesis is locked
- Primary metric is frozen
- Sample size is calculated
- Test duration is defined
- Guardrails are set
- Tracking is verified

If any item is missing, stop and resolve it.

---

## Running the Test

### During the Test

**DO:**

- Monitor technical health
- Document external factors

**DO NOT:**

- Stop early due to “good-looking” results
- Change variants mid-test
- Add new traffic sources
- Redefine success criteria
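
To see why stopping early on "good-looking" results is risky, the following sketch simulates A/A tests, where both arms share the same true conversion rate, and compares two decision rules: declaring a winner at any interim look versus testing once at the planned end. All parameters (number of looks, users per look, the 10% rate, the seed) are arbitrary assumptions for illustration, and the helper names are hypothetical.

```python
import math
import random


def z_stat(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic (0.0 if the variance is zero)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se


def simulate_false_positive_rates(n_sims: int = 400, looks: int = 10,
                                  users_per_look: int = 100,
                                  rate: float = 0.10, z_crit: float = 1.96,
                                  seed: int = 0) -> tuple:
    """Return (false positive rate when peeking at every look,
    false positive rate when testing only at the final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _sim in range(n_sims):
        conv_a = conv_b = n = 0
        significant_at_any_look = False
        significant_now = False
        for _look in range(looks):
            # Both arms draw from the SAME rate, so any "win" is a false positive.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            significant_now = abs(z_stat(conv_a, n, conv_b, n)) > z_crit
            significant_at_any_look = significant_at_any_look or significant_now
        peeking_hits += significant_at_any_look
        fixed_hits += significant_now  # result at the final, planned look
    return peeking_hits / n_sims, fixed_hits / n_sims


if __name__ == "__main__":
    peeking, fixed = simulate_false_positive_rates()
    print(f"peeking: {peeking:.3f}  fixed horizon: {fixed:.3f}")
```

The peeking rate can never be lower than the fixed-horizon rate, because the final look is one of the looks. In simulations like this it usually lands well above the nominal 5%, which is the statistical cost of stopping when results happen to look good.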

---

## Analyzing Results

### Analysis Discipline

When interpreting results:

- Do NOT generalize beyond the tested population
- Do NOT claim causality beyond the tested change
- Do NOT override guardrail failures
- Separate statistical significance from business judgment

### Interpretation Outcomes

| Result               | Action                                 |
| -------------------- | -------------------------------------- |
| Significant positive | Consider rollout                       |
| Significant negative | Reject variant, document learning      |
| Inconclusive         | Consider more traffic or bolder change |
| Guardrail failure    | Do not ship, even if primary wins      |
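
The table can be read as a small decision procedure in which a guardrail failure takes precedence over everything else. The sketch below pairs a pooled two-proportion z-test with that mapping. It assumes a conversion-rate primary metric, a two-sided test, and a pre-registered `alpha` of 0.05. The function names are hypothetical, and the returned strings simply echo the table rows.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple:
    """Return (absolute difference in rate B minus A, two-sided p-value)
    from a pooled two-proportion z-test."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return rate_b - rate_a, 1.0
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_b - rate_a, p_value


def recommend_action(diff: float, p_value: float, guardrail_failed: bool,
                     higher_is_better: bool = True, alpha: float = 0.05) -> str:
    """Map a result onto the rows of the interpretation table."""
    if guardrail_failed:  # checked first: it overrides a primary-metric win
        return "Do not ship, even if primary wins"
    if p_value >= alpha:
        return "Consider more traffic or bolder change"
    improved = diff > 0 if higher_is_better else diff < 0
    return "Consider rollout" if improved else "Reject variant, document learning"


if __name__ == "__main__":
    # Counts below are made-up inputs, not real results.
    diff, p = two_proportion_test(conv_a=500, n_a=10000, conv_b=600, n_b=10000)
    print(recommend_action(diff, p, guardrail_failed=False))
```

The function only returns a recommendation. Whether a statistically significant lift is worth shipping remains a separate business judgment, as noted under Analysis Discipline.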

---

## Documentation & Learning

### Test Record (Mandatory)

Document:

- Hypothesis
- Variants
- Metrics
- Planned vs. achieved sample size
- Results
- Decision
- Learnings
- Follow-up ideas

Store records in a shared, searchable location to avoid repeated failures.
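
A record is easier to search when it has a consistent shape. The sketch below shows one possible structure that serializes to JSON. The class name `ExperimentRecord`, its field names, and every value in the usage example are hypothetical placeholders rather than a prescribed schema or real results.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentRecord:
    """Hypothetical schema mirroring the items in the Test Record list."""
    hypothesis: str
    variants: list
    metrics: dict                      # e.g. {"primary": ..., "guardrails": [...]}
    planned_sample_size_per_variant: int
    achieved_sample_size_per_variant: int
    results: str
    decision: str
    learnings: list = field(default_factory=list)
    follow_up_ideas: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the record for storage in a shared, searchable location."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    record = ExperimentRecord(
        hypothesis="<locked hypothesis statement>",
        variants=["control", "variant_b"],
        metrics={"primary": "<primary metric>", "guardrails": ["<guardrail metric>"]},
        planned_sample_size_per_variant=0,
        achieved_sample_size_per_variant=0,
        results="<summary of results>",
        decision="<rollout / reject / rerun>",
    )
    print(record.to_json())
```

Storing planned and achieved sample sizes side by side makes underpowered tests easy to spot when records are reviewed later.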

---

## Refusal Conditions (Safety)

Refuse to proceed if:

- Baseline rate is unknown and cannot be estimated
- Traffic is insufficient to detect the MDE
- Primary metric is undefined
- Multiple variables are changed without proper design
- Hypothesis cannot be clearly stated

Explain why and recommend next steps.
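
The refusal conditions can also be expressed as a check that returns every reason that applies, so the explanation to the user is complete rather than stopping at the first problem. This is a minimal sketch: the function name, the parameter names, and the idea of comparing a required sample size against the sample reachable in the planned window are assumptions made for illustration.

```python
from typing import Optional


def refusal_reasons(
    baseline_rate: Optional[float],
    primary_metric: Optional[str],
    hypothesis: Optional[str],
    required_per_variant: Optional[int],
    reachable_per_variant: int,
    variables_changed: int = 1,
    multi_variable_design: bool = False,
) -> list:
    """Return every refusal condition that applies (empty list = none)."""
    reasons = []
    if baseline_rate is None:
        reasons.append("Baseline rate is unknown and cannot be estimated")
    if (required_per_variant is not None
            and reachable_per_variant < required_per_variant):
        reasons.append("Traffic is insufficient to detect the MDE")
    if not primary_metric:
        reasons.append("Primary metric is undefined")
    if variables_changed > 1 and not multi_variable_design:
        reasons.append("Multiple variables are changed without proper design")
    if not hypothesis:
        reasons.append("Hypothesis cannot be clearly stated")
    return reasons


if __name__ == "__main__":
    # Made-up inputs: traffic falls short and two variables change at once.
    for reason in refusal_reasons(0.05, "signup_rate", "<hypothesis>",
                                  required_per_variant=31000,
                                  reachable_per_variant=8000,
                                  variables_changed=2):
        print("Refuse:", reason)
```

Each returned reason should still be paired with a recommended next step, such as choosing a larger MDE, extending the planned duration before launch, or splitting a multi-variable change into separate tests.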

---

## Key Principles (Non-Negotiable)

- One hypothesis per test
- One primary metric
- Commit before launch
- No peeking
- Learning over winning
- Statistical rigor first

---

## Final Reminder

A/B testing is not about proving ideas right.
It is about **learning the truth with confidence**.

If you feel tempted to rush, simplify, or “just try it” —
that is the signal to **slow down and re-check the design**.