---
name: ab-test-setup
description: >
  Design and implement statistically rigorous A/B tests and experiments. Covers
  hypothesis formulation, sample size calculation, metric selection, traffic
  allocation, implementation patterns (client-side and server-side), statistical
  analysis, and common pitfalls. Use when planning experiments, calculating sample
  sizes, designing test variants, analyzing results, or when someone says
  "let's test that."
license: MIT + Commons Clause
metadata:
  version: 1.0.0
  author: borghei
  category: product-team
  domain: experimentation
  updated: 2026-03-09
tags: [ab-testing, experimentation, hypothesis, statistical-significance]
frameworks: hypothesis-testing, statistical-significance, feature-flags
---

# A/B Test Setup - Experimentation Design & Analysis

**Category:** Product Team
**Tags:** A/B testing, experiments, statistical significance, sample size, feature flags, hypothesis testing

## Overview

A/B Test Setup provides the complete framework for designing experiments that produce statistically valid, actionable results. Most A/B tests fail not because the variant was wrong, but because the test was poorly designed: wrong sample size, wrong metric, or someone peeked at results and stopped early. This skill prevents those mistakes.

---

## The Experiment Lifecycle

```
1. HYPOTHESIZE → 2. DESIGN → 3. CALCULATE → 4. IMPLEMENT
       ↑                                          │
       │                                          ▼
7. ITERATE ← 6. DOCUMENT ← 5. ANALYZE ← [Run to completion]
```

---

## Step 1: Hypothesis Formulation

### The Hypothesis Template

```
Because [observation or data point],
we believe [specific change]
will cause [measurable outcome]
for [defined audience segment].

We'll know this is true when [primary metric] changes by [minimum detectable effect].
We'll watch [guardrail metrics] to ensure no negative impact.
```

### Good vs Bad Hypotheses

| Quality | Hypothesis | Assessment |
|---------|------------|------------|
| Bad | "Changing the button color might increase clicks" | No data basis, no target, no measurement plan |
| Mediocre | "A green button will get more clicks than blue" | No "why", no target size, no guardrails |
| Good | "Because heatmaps show 40% of users don't notice our CTA, making the button 2x larger with contrasting color will increase CTA clicks by 15%+ for new visitors. Guardrail: page load time stays under 2s." | Data-backed, specific change, measurable outcome, defined audience, guardrail |
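If hypotheses are tracked in a planning script rather than a doc, the template maps onto a small record type. A minimal sketch, assuming a hypothetical `Hypothesis` dataclass; the field names are illustrative and not part of this skill's tooling:

```python
# Hypothetical structured form of the hypothesis template above.
# Field names are illustrative; adapt to your own experiment tracker.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    observation: str                   # the data point the idea came from
    change: str                        # the specific change being tested
    outcome: str                       # the measurable outcome expected
    audience: str                      # the defined audience segment
    primary_metric: str                # single success metric
    minimum_detectable_effect: float   # relative lift, e.g. 0.15 for 15%
    guardrails: list[str] = field(default_factory=list)

cta_test = Hypothesis(
    observation="Heatmaps show 40% of users don't notice our CTA",
    change="Make the CTA button 2x larger with a contrasting color",
    outcome="Increase CTA clicks",
    audience="New visitors",
    primary_metric="CTA click-through rate",
    minimum_detectable_effect=0.15,
    guardrails=["Page load time stays under 2s"],
)
```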
### Hypothesis Sources (Where to Find Test Ideas)

| Source | What to Look For | Example |
|--------|------------------|---------|
| Analytics data | Drop-off points, low-performing pages | "80% of users drop off at step 3 of onboarding" |
| User research | Confusion, frustration, unmet needs | "Users don't understand what the product does from the homepage" |
| Heatmaps/session recordings | Ignored elements, rage clicks | "Nobody scrolls past the fold on pricing page" |
| Support tickets | Recurring complaints, feature confusion | "Users constantly ask how to invite team members" |
| Competitor analysis | Different approaches to same problem | "Competitor uses a wizard; we use a form" |
| Sales objections | Common reasons prospects don't convert | "Prospects want to see pricing before signing up" |

---

## Step 2: Test Design

### Test Types

| Type | Variants | Traffic Need | Best For |
|------|----------|--------------|----------|
| A/B | 2 (control + 1 variant) | Moderate | Single change validation |
| A/B/n | 3+ variants | High | Comparing multiple approaches |
| Multivariate (MVT) | Combinations of changes | Very high | Optimizing multiple elements |
| Split URL | Different pages | Moderate | Major redesigns |
| Bandit | Dynamic allocation | Low-moderate | Revenue optimization |

**Default recommendation:** Standard A/B test. Only use A/B/n or MVT when you have enough traffic and a specific need.

### What to Test (By Impact)

| Category | High Impact | Medium Impact | Low Impact |
|----------|-------------|---------------|------------|
| **Copy** | Headline/value prop, CTA text | Body copy, social proof | Microcopy, labels |
| **Design** | Page layout, above-fold content | Visual hierarchy, imagery | Color, font size |
| **UX** | Number of steps, form fields | Button placement, navigation | Animations, transitions |
| **Pricing** | Price point, plan names | Feature packaging, anchoring | Billing frequency display |
| **Social Proof** | Testimonials vs none, logos | Testimonial format, placement | Testimonial count |

### Metric Selection

Every test needs three types of metrics:

**Primary Metric (1 only)**
- The single metric that determines success
- Directly tied to the hypothesis
- Must be measurable within the test duration
- Examples: signup rate, click-through rate, purchase rate

**Secondary Metrics (2-3)**
- Explain why the primary metric moved
- Provide context for decision-making
- Examples: time on page, scroll depth, feature adoption rate

**Guardrail Metrics (1-3)**
- Things that must NOT get worse
- Stop the test if significantly negative
- Examples: error rate, support ticket volume, page load time, refund rate

---

## Step 3: Sample Size Calculation

### Quick Reference Table

Minimum visitors PER VARIANT needed (95% confidence, 80% power):

| Baseline Rate | 5% Lift | 10% Lift | 15% Lift | 20% Lift | 50% Lift |
|---------------|---------|----------|----------|----------|----------|
| 1% | 620,000 | 156,000 | 70,000 | 39,000 | 6,400 |
| 2% | 305,000 | 77,000 | 34,000 | 19,500 | 3,200 |
| 3% | 200,000 | 51,000 | 23,000 | 12,800 | 2,100 |
| 5% | 116,000 | 29,500 | 13,200 | 7,500 | 1,250 |
| 10% | 54,000 | 13,800 | 6,200 | 3,500 | 600 |
| 20% | 24,000 | 6,200 | 2,800 | 1,600 | 280 |
| 50% | 6,100 | 1,600 | 720 | 410 | 75 |

### Duration Calculation

```
Duration (days) = (Sample size per variant * Number of variants) / Daily traffic to test page
```

**Minimum duration:** 7 days (to capture day-of-week effects)
**Maximum recommended:** 6 weeks (beyond this, external factors contaminate results)
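For reference, a minimal sketch of the arithmetic behind the quick reference table and the duration formula, using the normal approximation to the two-proportion z-test. Function names are illustrative; the bundled `sample_size_calculator.py` is the supported tool (it adds Bonferroni correction and more options), and exact numbers may differ slightly depending on the approximation used.

```python
# Sketch only: sample size per variant via the normal approximation
# to the two-proportion z-test, plus the duration formula above.
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per variant to detect a relative lift at given alpha/power."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

def duration_days(n_per_variant: int, variants: int, daily_traffic: int) -> int:
    """Duration (days) = sample size per variant * number of variants / daily traffic."""
    return ceil(n_per_variant * variants / daily_traffic)

if __name__ == "__main__":
    n = sample_size_per_variant(baseline=0.05, relative_mde=0.10)
    print(n, "visitors per variant")                       # on the order of ~30,000
    print(duration_days(n, variants=2, daily_traffic=5000), "days")
```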
### What If You Don't Have Enough Traffic?

| Situation | Solution |
|-----------|----------|
| Need 100K visitors, get 5K/week | Increase minimum detectable effect (test bolder changes) |
| Very low traffic (<1K/week) | Use qualitative testing (user testing, surveys) instead |
| Medium traffic (5-20K/week) | Run for 4-6 weeks, test big changes only |
| High traffic (50K+/week) | You can test subtle changes, run multiple tests |

---

## Step 4: Implementation

### Client-Side Implementation

JavaScript modifies the page after initial render.

**Pros:** Quick to implement, no deploy needed
**Cons:** Can cause flicker (flash of original content), blocked by ad blockers
**Tools:** PostHog, Optimizely, VWO, Google Optimize

**Anti-flicker pattern:**

```javascript
// In the <head>, before any rendering: hide the page until the variant is applied
// (pair with CSS such as: .ab-test-hide { opacity: 0 !important; })
document.documentElement.classList.add('ab-test-hide');

// In your test script (runs after variant assignment):
document.documentElement.classList.remove('ab-test-hide');
```

### Server-Side Implementation

Variant determined before page renders. No flicker, no client-side dependency.

**Pros:** No flicker, not blocked by ad blockers, works for logged-in features
**Cons:** Requires engineering work, deploy needed
**Tools:** PostHog, LaunchDarkly, Split, Unleash, custom feature flags

**Basic feature flag pattern:**

```python
import hashlib

# Server-side variant assignment
def get_variant(user_id: str, experiment: str) -> str:
    # Deterministic hash ensures same user always sees same variant
    hash_input = f"{user_id}:{experiment}"
    hash_value = hashlib.md5(hash_input.encode()).hexdigest()
    bucket = int(hash_value[:8], 16) % 100
    if bucket < 50:
        return "control"
    else:
        return "variant"
```

### Traffic Allocation

| Strategy | Split | When to Use |
|----------|-------|-------------|
| Standard | 50/50 | Default. Maximum statistical power. |
| Conservative | 90/10 or 80/20 | Risky changes, revenue-impacting tests |
| Ramped | Start 95/5, increase to 50/50 | New infrastructure, technical risk |

**Critical rules:**
- Users must see the same variant on every visit (sticky assignment by user ID or cookie)
- Allocation must be balanced across time of day and day of week
- Never change allocation mid-test

---

## Step 5: Running the Test

### Pre-Launch Checklist

- [ ] Hypothesis documented with primary metric and minimum detectable effect
- [ ] Sample size calculated, expected duration estimated
- [ ] Both variants implemented and QA'd on all device types
- [ ] Tracking verified (events fire correctly for both variants)
- [ ] No other tests running on the same page/feature
- [ ] Stakeholders informed of test duration and "no peeking" rule
- [ ] External factor calendar checked (no major launches, holidays, press)

### During the Test

**DO:**
- Monitor for technical errors (variant not rendering, tracking broken)
- Check that traffic split is balanced daily
- Document any external events that might affect results

**DO NOT:**
- Look at results before reaching sample size ("peeking problem")
- Make changes to either variant
- Add traffic from new sources mid-test
- Stop the test early because one variant "looks like it's winning"

### The Peeking Problem (Critical)

Looking at results before reaching the planned sample size and stopping because one variant looks better leads to a **25-40% false positive rate** (vs the intended 5%). Why: statistical significance fluctuates wildly with small samples. A variant can show p < 0.05 at 20% of planned sample size and p > 0.30 at full sample. The simulation sketch after the list below illustrates the effect.

**Solutions:**
1. Pre-commit to sample size and do not check results until reached
2. If you must monitor: use sequential testing methods (group sequential design, always-valid p-values)
3. Set a calendar reminder for the expected completion date; that is when you look
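To see why peeking inflates false positives, here is a small simulation sketch using only the standard library: both variants share the same true conversion rate, yet stopping at the first significant interim look "finds a winner" far more often than the nominal 5%. The parameters (10 looks, 2,000 visitors per variant, 1,000 trials) are illustrative assumptions, not values from this skill's tooling.

```python
# Sketch: simulate the peeking problem under the null (no real difference).
import random
from statistics import NormalDist

def two_prop_p_value(c_a: int, n_a: int, c_b: int, n_b: int) -> float:
    """Two-tailed p-value from a pooled two-proportion z-test."""
    pooled = (c_a + c_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = ((c_a / n_a) - (c_b / n_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def run_experiment(true_rate: float, n_planned: int, looks: int,
                   rng: random.Random) -> bool:
    """Return True if any interim look (or the final one) crossed p < 0.05."""
    conv_a = conv_b = seen = 0
    for look in range(1, looks + 1):
        target = n_planned * look // looks
        for _ in range(target - seen):
            conv_a += rng.random() < true_rate
            conv_b += rng.random() < true_rate
        seen = target
        if two_prop_p_value(conv_a, seen, conv_b, seen) < 0.05:
            return True  # a "peeker" would stop and declare a winner here
    return False

rng = random.Random(42)
trials = 1000
peeking = sum(run_experiment(0.05, 2000, looks=10, rng=rng) for _ in range(trials))
single = sum(run_experiment(0.05, 2000, looks=1, rng=rng) for _ in range(trials))
print(f"False positives with 10 interim looks: {peeking / trials:.1%}")  # well above 5%
print(f"False positives with one final look:   {single / trials:.1%}")   # close to 5%
```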
---

## Step 6: Analysis

### Analysis Checklist

1. **Did we reach planned sample size?** If not, results are preliminary only.
2. **Is it statistically significant?** p < 0.05 means a difference this large would occur less than 5% of the time if there were truly no effect.
3. **What's the confidence interval?** Tells you the range of likely true effect.
4. **Is the effect size meaningful?** A 0.1% lift that's "significant" may not be worth implementing.
5. **Are secondary metrics consistent?** Do they support the primary result?
6. **Any guardrail violations?** Did anything get worse?
7. **Segment analysis:** Different results for mobile vs desktop? New vs returning?

### Interpreting Results

| Result | Primary Metric | Confidence | Action |
|--------|----------------|------------|--------|
| Clear winner | Variant +15%, p < 0.01 | High | Implement variant |
| Modest winner | Variant +5%, p < 0.05 | Medium | Implement if easy, else run longer |
| Flat | < 2% difference, p > 0.20 | High (no effect) | Keep control, test something bolder |
| Loser | Variant -10%, p < 0.05 | High | Keep control, investigate why |
| Inconclusive | 5% difference, p = 0.08 | Low | Need more traffic or bolder test |
| Mixed signals | Primary up, guardrail down | Investigate | Dig into segments, do not ship blindly |

### Common Analysis Mistakes

| Mistake | Consequence | Prevention |
|---------|-------------|------------|
| Stopping at first significance | 25-40% false positive rate | Commit to sample size |
| Cherry-picking segments | Finding "winners" that don't replicate | Pre-register segments of interest |
| Ignoring confidence intervals | Overestimating effect size | Always report CI alongside p-value |
| Multiple comparisons | Inflated Type I error | Bonferroni correction for A/B/n |
| Survivorship bias | Only analyzing users who completed flow | Include all users from assignment point |
| Simpson's paradox | Aggregate hides segment reversal | Always check key segments |
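A minimal sketch of the underlying analysis: a pooled two-proportion z-test plus a 95% confidence interval on the absolute difference, so the CI is always reported alongside the p-value. The bundled `experiment_analyzer.py` is the supported tool (CSV input, segments, JSON output); this sketch only shows the core statistic. For A/B/n, apply the Bonferroni correction by dividing alpha by the number of variant-vs-control comparisons.

```python
# Sketch only: core statistics for a two-variant comparison.
from statistics import NormalDist

def analyze(control_conv: int, control_n: int, variant_conv: int, variant_n: int,
            alpha: float = 0.05) -> dict:
    p_c = control_conv / control_n
    p_v = variant_conv / variant_n
    # z-statistic uses the pooled rate under the null hypothesis of no difference
    pooled = (control_conv + variant_conv) / (control_n + variant_n)
    se_pooled = (pooled * (1 - pooled) * (1 / control_n + 1 / variant_n)) ** 0.5
    z = (p_v - p_c) / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # CI uses the unpooled standard error of the observed difference
    se_diff = (p_c * (1 - p_c) / control_n + p_v * (1 - p_v) / variant_n) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (p_v - p_c - z_crit * se_diff, p_v - p_c + z_crit * se_diff)
    return {
        "control_rate": p_c,
        "variant_rate": p_v,
        "relative_lift": (p_v - p_c) / p_c,
        "p_value": p_value,
        "significant": p_value < alpha,
        "ci_absolute_diff": ci,
    }

# Example: report the CI alongside the p-value, per the checklist above
print(analyze(control_conv=500, control_n=10_000, variant_conv=580, variant_n=10_000))
```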
---

## Step 7: Documentation

Every test must be documented, regardless of outcome.

### Test Documentation Template

```
EXPERIMENT: [Name]
DATE: [Start] to [End]
OWNER: [Name]

HYPOTHESIS:
Because [observation], we believed [change] would cause [outcome] for [audience].

VARIANTS:
- Control: [description]
- Variant: [description + screenshot]

METRICS:
- Primary: [metric] (baseline: [X]%, MDE: [Y]%)
- Secondary: [metrics]
- Guardrails: [metrics]

RESULTS:
- Sample size: [actual] / [planned]
- Duration: [X] days
- Primary metric: Control [X]% vs Variant [Y]% (p = [Z], CI: [range])
- Secondary metrics: [results]
- Guardrails: [all clear / violation noted]

DECISION: [Ship variant / Keep control / Iterate]

LEARNINGS:
- [What we learned about our users]
- [What we'd do differently next time]
```

---

## Experiment Prioritization Framework

### ICE Scoring

| Factor (score 1-10) | Question | What a 10 looks like |
|---------------------|----------|----------------------|
| **Impact** | How much will this move the metric? | Big change to primary KPI |
| **Confidence** | How sure are we it will work? | Strong data supporting hypothesis |
| **Ease** | How easy is it to implement and measure? | Can ship in a day |

**ICE Score = (Impact + Confidence + Ease) / 3**

Rank all test ideas by ICE score. Run highest first (see the sketch after the backlog below).

### Test Backlog Template

| # | Hypothesis | Primary Metric | ICE | Est. Duration | Status |
|---|------------|----------------|-----|---------------|--------|
| 1 | Larger CTA increases signups | Signup rate | 8.3 | 2 weeks | Ready |
| 2 | Social proof on pricing increases conversion | Plan selection rate | 7.0 | 3 weeks | Needs design |
| 3 | Shorter onboarding increases activation | Feature activation | 6.7 | 4 weeks | In backlog |
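If the backlog lives in a script or spreadsheet export, scoring and ranking is a couple of lines of arithmetic; a minimal sketch with illustrative scores chosen to reproduce the ICE values in the backlog above:

```python
# Sketch: ICE Score = (Impact + Confidence + Ease) / 3, then rank descending.
backlog = [
    {"hypothesis": "Larger CTA increases signups", "impact": 9, "confidence": 8, "ease": 8},
    {"hypothesis": "Social proof on pricing increases conversion", "impact": 8, "confidence": 6, "ease": 7},
    {"hypothesis": "Shorter onboarding increases activation", "impact": 8, "confidence": 6, "ease": 6},
]

for idea in backlog:
    idea["ice"] = round((idea["impact"] + idea["confidence"] + idea["ease"]) / 3, 1)

# Run the highest-scoring idea first
for rank, idea in enumerate(sorted(backlog, key=lambda i: i["ice"], reverse=True), start=1):
    print(rank, idea["ice"], idea["hypothesis"])
```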
---

## Proactive Triggers

- Someone debates between two design options: propose an A/B test instead of trading opinions
- Conversion rate mentioned as underperforming: offer to design a test, not guess at solutions
- Pricing page changes discussed: always test pricing changes with guardrail metrics
- Post-launch of any feature: propose a follow-up experiment to optimize
- "Let's just try it and see": redirect to a structured hypothesis before implementation

---

## Related Skills

| Skill | Use When |
|-------|----------|
| **analytics-tracking** | Setting up event tracking that feeds experiment metrics |
| **campaign-analytics** | Folding experiment results into broader attribution |
| **launch-strategy** | Testing within a product launch sequence |
| **prompt-engineer-toolkit** | A/B testing AI prompts in production |

---

## Tool Reference

### sample_size_calculator.py

Calculates required sample size per variant using the normal approximation to the two-proportion z-test. Includes Bonferroni correction for multi-variant tests and duration estimation.

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--baseline`, `-b` | float | (required) | Baseline conversion rate (e.g. 0.05 for 5%) |
| `--mde`, `-m` | float | (required) | Minimum detectable effect as relative lift (e.g. 0.10 for 10%) |
| `--alpha`, `-a` | float | 0.05 | Significance level |
| `--power`, `-p` | float | 0.80 | Statistical power |
| `--variants`, `-v` | int | 2 | Number of variants including control |
| `--daily-traffic`, `-d` | int | 0 | Daily eligible traffic for duration estimation |
| `--one-tailed` | flag | False | Use one-tailed test instead of two-tailed |
| `--json` | flag | False | Output as JSON |

```bash
python scripts/sample_size_calculator.py --baseline 0.05 --mde 0.10
python scripts/sample_size_calculator.py --baseline 0.12 --mde 0.15 --power 0.9 --daily-traffic 5000
python scripts/sample_size_calculator.py --baseline 0.05 --mde 0.10 --variants 3 --json
```

### experiment_analyzer.py

Analyzes A/B test results using the two-proportion z-test with confidence intervals and segment breakdown.

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `input` | positional | (required) | CSV file with results or "sample" to create sample |
| `--alpha`, `-a` | float | 0.05 | Significance level |
| `--json` | flag | False | Output as JSON |

**CSV format:** `variant,visitors,conversions,segment`

```bash
python scripts/experiment_analyzer.py sample
python scripts/experiment_analyzer.py results.csv
python scripts/experiment_analyzer.py results.csv --alpha 0.01 --json
```

### experiment_planner.py

Generates a structured experiment plan from a hypothesis text, including metric selection, sample size, timeline, risks, and documentation template.

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--hypothesis`, `-H` | string | (required) | Experiment hypothesis text |
| `--baseline`, `-b` | float | 0.05 | Baseline conversion rate |
| `--mde`, `-m` | float | 0.10 | Minimum detectable effect as relative lift |
| `--daily-traffic`, `-d` | int | 0 | Daily eligible traffic |
| `--variants`, `-v` | int | 2 | Number of variants including control |
| `--json` | flag | False | Output as JSON |

```bash
python scripts/experiment_planner.py --hypothesis "Larger CTA will increase signups by 15%"
python scripts/experiment_planner.py -H "Simplified checkout boosts conversions" -b 0.08 -m 0.15 -d 3000
python scripts/experiment_planner.py -H "New pricing page" --json
```

---

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| Sample size is unrealistically large | MDE too small or baseline too low | Increase MDE (test bolder changes) or target a higher-traffic page |
| Test duration exceeds 6 weeks | Insufficient daily traffic | Consider qualitative methods, test bigger changes, or combine traffic from multiple pages |
| p-value hovers around 0.05 | Borderline significance | Do not stop early; run to planned sample size or extend 20% |
| Results significant but lift is tiny (<1%) | Overpowered test | Check practical significance alongside statistical significance |
| Segment results contradict overall | Simpson's paradox | Investigate segment composition; report both overall and segment results |
| Variant performs differently on mobile vs desktop | Device-specific UX issues | Design device-specific variants; increase per-segment sample size |
| Calculator produces negative CI | Very small samples or extreme rates | Ensure sufficient sample size; check data integrity |

---

## Success Criteria

| Criterion | Target | How to Measure |
|-----------|--------|----------------|
| Tests reach planned sample size | 100% of tests | Compare actual vs planned sample at conclusion |
| False positive rate | <5% | Track post-implementation lift vs test prediction |
| Test velocity | 2+ tests per team per month | Count experiments documented per sprint |
| Documentation completeness | 100% of tests documented | Audit experiment records quarterly |
| Average test duration | <4 weeks | Measure start-to-conclusion calendar days |
| Decision quality | >80% of shipped variants hold gains at 90 days | Post-ship metric tracking |

---

## Scope & Limitations

**In scope:**
- Hypothesis formulation and validation
- Sample size and power calculations
- Frequentist two-proportion z-tests
- A/B, A/B/n, and split URL test planning
- Segment-level analysis
- Pre/post test documentation

**Out of scope:**
- Bayesian A/B testing methods (use dedicated Bayesian tools)
- Multi-armed bandit algorithms (require real-time allocation infrastructure)
- Multivariate testing (MVT) analysis (combinatorial explosion requires specialized tools)
- Server-side feature flag implementation (see engineering skills)
- Revenue-based metrics requiring transaction-level data
- Sequential testing / always-valid p-values (use Optimizely Stats Engine or similar)
---

## Integration Points

| Tool / Platform | Integration Method | Use Case |
|-----------------|--------------------|----------|
| PostHog / Amplitude | JSON export from experiment_analyzer | Feed results into product analytics |
| Jira / Linear | experiment_planner JSON output | Create experiment tickets with metadata |
| Google Sheets | CSV export from experiment_analyzer | Share results with non-technical stakeholders |
| LaunchDarkly / Unleash | experiment_planner checklist | Pre-launch validation before feature flag rollout |
| Slack / Notion | Copy human-readable output | Async experiment status updates |
| CI/CD pipelines | `--json` flag on all scripts | Automated experiment health checks |