--- name: ab-testing-framework description: A/B and multivariate testing methodology. Design experiments, calculate sample sizes, determine statistical significance, avoid common pitfalls, and interpret results. Platform-agnostic framework applicable to landing pages, emails, ads, pricing, and product features. Use when the user asks about A/B testing, split testing, experiment design, statistical significance, or conversion experiments. license: MIT origin: custom author: Rebecca Rae Barton author_url: https://github.com/thatrebeccarae metadata: version: 1.0.0 category: analytics domain: experimentation updated: 2026-03-18 tested: 2026-03-18 tested_with: "Claude Code v2.1" --- # A/B Testing Framework Design, run, and analyze conversion experiments with statistical rigor. ## Install ```bash git clone https://github.com/thatrebeccarae/claude-marketing.git && cp -r claude-marketing/skills/ab-testing-framework ~/.claude/skills/ ``` ## Test Design Process ### Step 1: Hypothesis **Template:** If we [change X], then [metric Y] will [increase/decrease] by [Z%] because [reason]. **Good hypothesis:** "If we change the CTA from Get Started to Start Free Trial, then signup rate will increase by 15% because it reduces uncertainty about cost." **Bad hypothesis:** "If we change the button color, conversions will improve." (No reasoning, no expected magnitude.) ### Step 2: Sample Size Calculation To determine how long to run a test: ``` Required sample per variation = 16 * (p * (1-p)) / (MDE^2) Where: p = baseline conversion rate (as decimal) MDE = minimum detectable effect (as decimal) ``` | Baseline Rate | 10% MDE | 20% MDE | 30% MDE | |--------------|---------|---------|---------| | 1% | 253,414 | 63,354 | 28,157 | | 3% | 82,369 | 20,592 | 9,152 | | 5% | 48,640 | 12,160 | 5,404 | | 10% | 23,040 | 5,760 | 2,560 | | 20% | 10,240 | 2,560 | 1,138 | **Minimum test duration:** 2 full business weeks (to capture day-of-week effects), even if sample size is reached sooner. ### Step 3: Test Execution Rules 1. **Random assignment** — visitors must be randomly assigned to control/variant 2. **No peeking** — do not check results before reaching sample size 3. **No mid-test changes** — do not modify variants during the test 4. **Even traffic split** — 50/50 for A/B, even splits for multivariate 5. **Single variable** — change only one thing per test (unless multivariate) 6. **Full duration** — run for the pre-calculated duration, not until significance ### Step 4: Statistical Analysis #### Frequentist Approach **Z-test for proportions:** ``` Z = (p1 - p2) / sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2)) Where: p1, p2 = conversion rates of control and variant p_pooled = (x1 + x2) / (n1 + n2) n1, n2 = sample sizes ``` **p-value interpretation:** - p < 0.05: Statistically significant (95% confidence) - p < 0.01: Highly significant (99% confidence) - p >= 0.05: Not significant — do not declare a winner #### Bayesian Approach **When to use Bayesian:** - Low traffic (small sample sizes) - Need to make decisions faster - Want probability of each variant being best (not just "significant or not") **Interpretation:** "There is a 94% probability that Variant B is better than Control" vs frequentist "We reject the null hypothesis at 95% confidence." ### Step 5: Decision Framework | Result | Significance | Action | |--------|-------------|--------| | Variant wins | p < 0.05 | Implement variant | | Control wins | p < 0.05 | Keep control, learn from failure | | No difference | p >= 0.05 | Keep control, test something bigger | | Variant wins | p = 0.05-0.10 | Consider traffic — may need more time | ## Common Testing Pitfalls 1. **Peeking** — checking results early inflates false positive rate from 5% to 26%+ 2. **Stopping early** — reaching significance != reaching required sample size 3. **Testing too many variants** — each variant needs full sample size 4. **Ignoring segments** — overall winner may be loser for key segments 5. **Too small an effect** — testing for 2% lift needs enormous sample sizes 6. **Not accounting for seasonality** — run full weeks, avoid holidays 7. **Multiple metrics** — primary metric must be pre-declared; secondary are directional 8. **Survivorship bias** — only measuring users who complete, not those who abandon 9. **Simpson paradox** — segment-level winners can reverse at aggregate level 10. **Novelty effect** — new designs get temporary lift; re-test after 2-4 weeks ## What to Test (Prioritized by Impact) ### High Impact - Value proposition / headline - CTA text and placement - Pricing and offer structure - Form length (fields removed) - Page layout (single column vs multi) - Social proof presence and placement ### Medium Impact - Image/video vs static - Testimonial format (text vs video) - Navigation presence on landing pages - Trust badges and security signals - Urgency elements (countdown, stock) ### Low Impact (Usually Not Worth Testing) - Button color (unless extreme contrast issue) - Font changes - Minor copy tweaks - Icon styles - Footer content ## Integration with Other Skills - **cro-auditor** — CRO audit generates test hypotheses; this skill designs the experiments - **google-analytics** — GA4 for experiment data and segment analysis - **copywriting-frameworks** — Generate variant copy using proven frameworks