---
name: ab-testing
description: Run email A/B tests with statistical rigor. Use when testing subject lines, content variants, send times, CTAs, or measuring experiment significance.
license: MIT
---

# A/B Testing

Test email variations systematically to improve open rates, click rates, and conversions with statistical confidence.

## When to use this skill

- Setting up your first email A/B test
- Open rates or click rates are flat and you want data-driven improvements
- Deciding between subject line variations, send times, or content approaches
- Determining if your test results are statistically significant or just noise
- Planning a testing program across campaigns or sequences
- Evaluating whether to use A/B testing, multivariate testing, or bandit algorithms
- Measuring the true incremental lift of your email program with holdout groups

## Related skills

- `email-copywriting` - writing the actual content variations to test
- `template-design` - HTML template variations for layout and visual tests
- `spam-filter-avoidance` - ensure test variants don't accidentally trigger spam filters
- `sender-reputation` - monitor whether testing impacts your sending reputation
- `email-sequences` - testing within drip campaigns and automated sequences

---

## What to test (in priority order)

Not all tests deliver equal value. Start with high-impact, easy-to-measure elements and work your way down.

### Tier 1 - highest impact, test these first

| Element | What to vary | Primary metric | Why it matters |
|---------|-------------|----------------|----------------|
| Subject line | Length, personalization, question vs statement, emoji, urgency | Open rate | The single biggest lever. A bad subject line means nobody sees anything else. |
| From name | Company name vs person name vs "Person at Company" | Open rate | Recipients decide to open based on who sent it as much as the subject. |
| Send time | Day of week, hour of day, timezone-adjusted vs fixed | Open rate | Same email sent at 6 AM vs 10 AM can see 20-40% open rate differences. |

### Tier 2 - high impact, requires more setup

| Element | What to vary | Primary metric | Why it matters |
|---------|-------------|----------------|----------------|
| CTA | Button text, color, placement, number of CTAs | Click rate | "Get started" vs "Start your free trial" can shift click rates by 10-30%. |
| Preview text | First 40-90 characters visible in inbox | Open rate | Often overlooked - many senders leave this as the default HTML boilerplate. |
| Content length | Short vs long, single-topic vs multi-topic | Click rate | Depends heavily on audience and email type. No universal "right" length. |

### Tier 3 - incremental gains, test after you've optimized tiers 1-2

| Element | What to vary | Primary metric | Why it matters |
|---------|-------------|----------------|----------------|
| Layout | Single column vs multi-column, image placement | Click rate | Visual hierarchy affects scanning behavior. |
| Personalization depth | Name only vs company vs role-specific content | Click rate, conversion | Diminishing returns - basic personalization matters most. |
| Tone | Formal vs casual, first person vs third person | Click rate, reply rate | Audience-dependent. B2B enterprise vs startup is a different world. |

**Rule of thumb:** If you're sending fewer than 50,000 emails per month, focus on tier 1. You probably don't have the volume to detect tier 3 differences.

---

## Sample size and statistical significance

This is where most email A/B tests go wrong. People call winners based on gut feeling or tiny sample sizes.

### Minimum sample sizes

The sample size you need depends on three things:

1. **Baseline rate** - your current open/click rate
2. **Minimum detectable effect (MDE)** - the smallest improvement worth detecting
3. **Statistical power** - the probability of detecting a real effect (standard: 80%)

Here are practical minimums per variant for a 95% confidence level and 80% power:

| Baseline rate | MDE (relative) | Sample per variant | Total for 2 variants |
|--------------|----------------|-------------------|---------------------|
| 20% open rate | 20% (detect 24% vs 20%) | ~3,800 | ~7,600 |
| 20% open rate | 10% (detect 22% vs 20%) | ~15,000 | ~30,000 |
| 5% click rate | 20% (detect 6% vs 5%) | ~15,000 | ~30,000 |
| 5% click rate | 30% (detect 6.5% vs 5%) | ~6,700 | ~13,400 |
| 2% conversion | 50% (detect 3% vs 2%) | ~3,800 | ~7,600 |

**Translation:** If your open rate is 20% and you want to detect a 20% relative improvement (4 percentage point lift to 24%), you need about 3,800 recipients in each variant - roughly 7,600 total sends.

If you can only detect a 50%+ relative change, the test is probably not worth running. You'll only catch massive differences, and you won't learn anything about incremental improvements.
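If you'd rather compute these figures yourself than rely on an online calculator, here is a minimal sketch of the standard normal-approximation formula for comparing two proportions. The function name and defaults are illustrative; different calculators make slightly different assumptions, so expect outputs to differ somewhat from any reference table, including the one above.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate recipients needed per variant to detect a relative lift
    over `baseline` with a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# e.g. sample_size_per_variant(0.20, 0.20) -> recipients per variant needed
# to detect a 20% relative lift on a 20% open rate
```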
### The two-proportion z-test

The standard significance test for email A/B testing is the two-proportion z-test. It compares two conversion rates and tells you whether the difference is statistically significant:

```python
import math

def two_proportion_z(control_conv, control_total, variant_conv, variant_total):
    p1 = control_conv / control_total
    p2 = variant_conv / variant_total
    p_pool = (control_conv + variant_conv) / (control_total + variant_total)
    standard_error = math.sqrt(p_pool * (1 - p_pool) * (1 / control_total + 1 / variant_total))
    return (p2 - p1) / standard_error
```

A z-score above 1.96 (or below -1.96) means p < 0.05 - the result is significant at 95% confidence.

**What 95% confidence actually means:** If there were truly no difference between the variants, a gap this large would show up less than 5% of the time. It does NOT mean there's a 95% chance the variant is better - that's a common misinterpretation.

### Confidence intervals matter more than p-values

A result can be "statistically significant" but practically meaningless. Always look at the confidence interval for the difference:

- **CI: [+0.1%, +4.2%]** - Significant, but the true lift might be as small as 0.1%. Probably not worth the effort to implement.
- **CI: [+2.5%, +6.8%]** - Significant, and even the low end is a meaningful improvement. Ship it.
- **CI: [-0.3%, +3.1%]** - NOT significant. The true effect could be negative. Don't call this a winner.

---

## Test design and execution

### Randomization and consistency

Good A/B tests require truly random, consistent assignment. A recipient who receives variant A should always be in variant A if they encounter the experiment again.

**Hash-based deterministic assignment** is the gold standard. Hash the experiment ID + recipient email to produce a stable bucket assignment:

```
bucket = SHA256(experimentId + ":" + contactEmail) -> normalize to [0, 1)
```

This approach:

- Guarantees the same recipient always gets the same variant
- Doesn't require storing assignments upfront (though logging them is still important)
- Works across distributed systems without coordination
- Supports weighted variants by dividing the [0, 1) range proportionally

Random list splits in your ESP work for one-off campaigns, but they break down for sequences or journeys where the same person should consistently see the same variant.
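A minimal sketch of that bucketing scheme in Python - the function name, the email normalization, and the example weights are illustrative choices, not a specific ESP's API:

```python
import hashlib

def assign_variant(experiment_id: str, contact_email: str, weights: dict) -> str:
    """Deterministically map a recipient to a variant bucket in [0, 1)."""
    digest = hashlib.sha256(f"{experiment_id}:{contact_email.lower()}".encode()).hexdigest()
    bucket = int(digest[:15], 16) / 16**15   # stable value in [0, 1)
    cumulative = 0.0
    for variant, weight in weights.items():  # weights should sum to 1.0
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variant                           # guard against float rounding

# Same input always yields the same variant, with no stored assignment needed:
# assign_variant("welcome-subject-test", "ada@example.com", {"A": 0.5, "B": 0.5})
```

Because the assignment is a pure function of the experiment ID and the address, every service in a distributed pipeline computes the same answer without coordination, and weighted splits (e.g., 80/20) fall out of the cumulative-range check.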
### How long to run the test

**Minimum: 48 hours.** Email open behavior has strong day-of-week patterns. A test that runs only on Tuesday morning will miss the Thursday openers.

**Recommended: 5-7 days.** This captures a full weekly cycle and accounts for people who don't check email daily.

**Maximum: 14 days.** Beyond two weeks, external factors (seasonality, news events, list decay) start to contaminate your results.

Rules for when to stop:

1. **Don't peek and stop early.** If you check results after 2 hours and see variant B winning by 30%, resist the urge to call it. Early results are extremely noisy. This is called the "peeking problem" and it inflates your false positive rate well above 5%.
2. **Pre-commit to your sample size.** Calculate the required sample size before starting. Run until you reach it.
3. **Use a time-based cutoff as backup.** If you haven't reached your sample size after 14 days, the test is inconclusive - not a win for whoever happens to be ahead.

### Test one variable at a time

Change only one element per test. If you change the subject line AND the CTA AND the send time, and variant B wins, you have no idea which change caused the improvement. You can't apply what you learned.

Exception: multivariate testing (covered below) can test multiple variables simultaneously, but it requires much larger sample sizes.

---

## A/B testing vs multivariate testing

| Factor | A/B testing | Multivariate testing |
|--------|------------|---------------------|
| Variables tested | 1 | 2+ simultaneously |
| Variants needed | 2-4 | Every combination (2x2=4, 2x3=6, 3x3=9...) |
| Sample size | Moderate (1,000+ per variant) | Large (1,000+ per combination) |
| What you learn | Which variant wins | Which combination wins AND which variables have the most impact |
| When to use | Most of the time | When you have high volume (100k+ sends) and want to understand variable interactions |

### When multivariate testing makes sense

Only if ALL of these are true:

1. You send 100,000+ emails per campaign (enough volume per combination)
2. You suspect variables interact (e.g., a casual subject line works better with a casual CTA)
3. You've already optimized individual variables through A/B tests
4. You can set up and track all combinations reliably

For most email programs: stick with A/B tests. Run them sequentially - a subject line test in January, a CTA test in February, a send time test in March. You'll learn more from three clean A/B tests than from one muddy multivariate test.

---

## Bandit algorithms vs fixed-horizon tests

Traditional A/B tests run for a fixed duration, then you pick the winner and deploy. Bandit algorithms (multi-armed bandit, Thompson sampling) dynamically shift traffic toward the better-performing variant during the test.

### When to use each

**Use fixed-horizon A/B tests when:**

- You need clean, defensible statistical results
- You're optimizing a template or strategy you'll reuse for months
- Learning is the priority (understanding WHY something works)

**Use bandit algorithms when:**

- You're sending a one-time campaign and want to maximize performance of that specific send
- Speed matters more than certainty
- The "explore" phase (testing suboptimal variants) has a real cost (e.g., revenue-critical transactional emails)

### How bandit testing works for email

1. Send the first 10-20% of the list split evenly across variants
2. After initial results come in, shift more volume to the better-performing variant
3. Continue adjusting allocation as more data arrives
4. By the end, 70-80% of the list receives the winning variant
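For teams that want to go beyond an ESP's built-in auto-winner, here is a minimal Thompson-sampling sketch for the allocation decision in steps 2-3. All names and counts are illustrative, and the reward is a simple click/no-click signal:

```python
import random

# Observed results so far: sends and clicks per variant (illustrative numbers)
results = {
    "A": {"sends": 2000, "clicks": 110},
    "B": {"sends": 2000, "clicks": 135},
}

def choose_variant(results):
    """Thompson sampling: draw a plausible click rate for each variant from a
    Beta posterior and send the next email with the variant that draws highest."""
    best, best_draw = None, -1.0
    for variant, r in results.items():
        draw = random.betavariate(1 + r["clicks"], 1 + r["sends"] - r["clicks"])
        if draw > best_draw:
            best, best_draw = variant, draw
    return best

# Allocate the next batch: better-performing variants win more draws,
# but weaker variants still get occasional traffic (exploration).
allocation = [choose_variant(results) for _ in range(1000)]
```

Each new batch of opens or clicks updates the counts, so allocation keeps adapting throughout the send - the continuous adaptation that the fixed-wait "auto-winner" features described below lack.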
**Tradeoff:** You sacrifice statistical rigor for better aggregate performance. You may not know if variant B is truly better - but more people saw the better-performing option.

Most ESPs that offer "auto-winner" selection are doing a basic version of this: send to a test portion, wait a fixed time, then send the winner to the remainder. This is better than nothing but is not a true bandit algorithm - it doesn't continuously adapt.

---

## Holdout groups

A holdout group is a randomly selected subset of your audience that does NOT receive the email being tested (or receives no email at all for the period). Holdouts measure the true incremental lift of your email program.

### Why holdouts matter

A/B tests tell you which variant is better. Holdouts tell you whether sending email at all is better than not sending.

Without holdouts, you can't distinguish between:

- "Our welcome sequence drove 30% more activations" (real lift)
- "People who were going to activate anyway also happened to receive our welcome sequence" (selection bias)

### How to implement holdouts

1. Randomly select 5-10% of your eligible audience as the holdout group
2. Suppress all email to the holdout group for the test period
3. Compare conversion/revenue between the group that received email and the holdout
4. Calculate incremental lift:

```
lift = (treatment_conversion_rate - holdout_conversion_rate) / holdout_conversion_rate
```

### Holdout group sizing

| Audience size | Holdout % | Holdout size | Expected baseline conversion | Can detect lift of |
|--------------|-----------|-------------|------------------------------|-------------------|
| 10,000 | 10% | 1,000 | 5% | ~50% relative |
| 50,000 | 10% | 5,000 | 5% | ~25% relative |
| 100,000 | 5% | 5,000 | 5% | ~25% relative |
| 500,000 | 5% | 25,000 | 5% | ~10% relative |

Larger audiences can use smaller holdout percentages (5%) because the absolute holdout size is still large enough.

### When to use holdouts

- **Quarterly:** Run a 2-4 week holdout on your main email programs to measure ongoing lift
- **New sequences:** Always run a holdout when launching a new email sequence to prove it works
- **High-frequency sends:** If you're sending daily or near-daily, holdouts reveal fatigue effects

**Warning:** Holdout results often show lower incrementality than you expect. An email program showing 200% ROI based on last-click attribution might show 30% incremental lift in a holdout test. That's normal - it means your email is capturing credit for conversions that would have happened anyway, plus generating real incremental value.
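To put numbers on a holdout readout, here is a minimal sketch of the lift calculation with a rough 95% confidence interval on the underlying difference in conversion rates. The function name and the example counts are illustrative:

```python
import math

def incremental_lift(treated_conv, treated_total, holdout_conv, holdout_total):
    """Relative lift of the emailed group over the holdout, plus a 95% CI on the
    absolute difference in conversion rates (normal approximation)."""
    p_t = treated_conv / treated_total
    p_h = holdout_conv / holdout_total
    lift = (p_t - p_h) / p_h
    se = math.sqrt(p_t * (1 - p_t) / treated_total + p_h * (1 - p_h) / holdout_total)
    ci = (p_t - p_h - 1.96 * se, p_t - p_h + 1.96 * se)
    return lift, ci

# e.g. 6.0% conversion among 45,000 emailed vs 5.0% among a 5,000-person holdout
lift, ci = incremental_lift(2700, 45000, 250, 5000)
# lift = 0.20 (20% relative); if the CI straddles zero, don't claim incrementality
```

It's the same confidence-interval logic as the A/B math above, applied to send vs no-send.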
---

## Metrics to optimize for

Choose your primary metric BEFORE running the test. Optimizing for multiple metrics simultaneously leads to cherry-picking results.

| Metric | When to optimize for it | Gotchas |
|--------|------------------------|---------|
| Open rate | Subject line tests, from name tests, send time tests | Apple Mail Privacy Protection inflates opens by 30-60%. Unreliable as sole metric for Apple-heavy audiences. |
| Click rate | CTA tests, content tests, layout tests | More reliable than opens. Measures actual engagement. |
| Click-to-open rate (CTOR) | Content effectiveness independent of subject line | Combines the Apple MPP noise from opens with click data. Less useful than it was pre-2021. |
| Conversion rate | When you have clear downstream actions (signup, purchase) | Requires conversion tracking beyond the email. Longer attribution windows. |
| Revenue per email | E-commerce, when you can tie revenue to individual sends | Best metric for bottom-line impact but needs robust attribution. |
| Reply rate | Sales emails, cold outreach | Only relevant for emails that expect replies. |
| Unsubscribe rate | Safety metric - always monitor alongside your primary metric | A variant can win on clicks but lose subscribers. Check both. |

### The Apple Mail Privacy Protection problem

Since iOS 15 (September 2021), Apple Mail pre-fetches images and tracking pixels for users with Mail Privacy Protection enabled (most users), generating false "opens." This affects roughly 50-60% of consumer email audiences.

**Impact on A/B testing:**

- Open rate tests still work, but the signal is noisier
- You need larger sample sizes to detect real differences
- Consider using click rate as your primary metric if your audience skews Apple
- Never rely solely on open rate for cold or marketing email A/B tests

---

## Testing programs (not just tests)

One-off tests are useful. A systematic testing program compounds learning.

### Building a testing roadmap

Run tests in this order for maximum learning:

1. **Subject line framework** (month 1-2) - Test 4-6 subject line approaches (question, number, personalized, curiosity, benefit, urgency). Find your top 2-3 frameworks.
2. **Send time optimization** (month 2-3) - Test 3-4 send windows. This is audience-specific - there's no universal best time.
3. **CTA optimization** (month 3-4) - Test button copy, placement, and number of CTAs.
4. **Content structure** (month 4-5) - Test email length, format (text-heavy vs image-heavy), and content hierarchy.
5. **Personalization depth** (month 5-6) - Test what level of personalization actually moves the needle.

### Documenting and applying learnings

After each test, record:

- What you tested and why
- Sample size per variant
- Duration
- Results (with confidence intervals)
- Whether the result was statistically significant
- What you'll do differently going forward

Without documentation, you'll re-run the same tests or, worse, make changes that contradict what you've already learned.

---

## Common mistakes

### 1. Calling a winner too early

The single most common mistake. After 200 sends, variant B has a 25% open rate vs variant A's 20%. "B wins!" No - with 200 sends, that 5-point difference is well within the margin of error. You need thousands of observations for open rate tests.

**Fix:** Calculate your required sample size before starting. Don't look at results until you've reached it.

### 2. Testing with too little volume

If your list is under 1,000 contacts, most A/B tests are statistically meaningless. You won't have enough data to distinguish a real effect from noise.

**Fix:** For small lists, skip formal A/B tests. Instead, make bigger, bolder changes between campaigns and observe trends over time. Or batch multiple campaigns together to accumulate sample size.

### 3. Testing too many variables at once

Changing the subject line, CTA, images, and send time simultaneously. When variant B wins, you don't know which change caused it.

**Fix:** One variable per test. Always.

### 4. Ignoring the "losing" variant's data

Variant A loses. You archive it. But variant A might have outperformed on a secondary metric (lower unsubscribes, higher reply rate) or performed better in a specific segment.

**Fix:** Analyze test results by segment (mobile vs desktop, new subscribers vs long-term, engagement level). A "loser" overall might be a winner for a subset.
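One way to act on that fix is a per-segment significance breakdown. A minimal sketch, assuming the `two_proportion_z` helper from the z-test section above (or any equivalent) is in scope, with illustrative segment counts:

```python
# Per-segment results: (control clicks, control sends, variant clicks, variant sends)
# Assumes two_proportion_z() from the z-test section is defined in this scope.
segments = {
    "mobile":         (420, 9000, 505, 9000),
    "desktop":        (310, 6000, 295, 6000),
    "new_subscriber": (150, 2500, 190, 2500),
}

for name, (c_conv, c_total, v_conv, v_total) in segments.items():
    z = two_proportion_z(c_conv, c_total, v_conv, v_total)
    print(f"{name:15s} control {c_conv / c_total:.2%}  variant {v_conv / v_total:.2%}  z={z:+.2f}")
```

Treat segment-level wins as hypotheses for the next test rather than conclusions - slicing one experiment many ways multiplies your chances of a false positive.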
### 5. Not accounting for Apple MPP in open rate tests

If 50% of your audience uses Apple Mail, your open rate data includes a large number of phantom opens. This dilutes real differences and makes tests harder to call.

**Fix:** Filter Apple Mail opens from your analysis if your ESP supports it, or use click rate as your primary metric.

### 6. Using "auto-winner" without understanding how it works

Most ESP "auto-winner" features send to a test subset (10-20%), wait a fixed time (often just 2-4 hours), and send the "winner" to the rest. Two hours is nowhere near enough time for reliable results.

**Fix:** If you use auto-winner, set the wait time to at least 24 hours. Better yet, set it to 48 hours. If your ESP doesn't allow a long enough wait, run the test manually.

### 7. Treating every campaign as a separate experiment

Testing "Sale ends today!" vs "Last chance - 24 hours left" is not a reusable learning. It's a one-off optimization.

**Fix:** Test frameworks and patterns, not specific copy. Test "urgency vs curiosity" as a subject line approach, then apply the winner to future campaigns with different specific copy.

### 8. Never running holdout tests

You're optimizing variant A vs B, but never asking "should we be sending this email at all?"

**Fix:** Run a holdout test on your main email programs at least once per quarter.

### 9. Ignoring send volume distribution across variants

If you send variant A to 1,000 people and variant B to 10,000 people, the test is not valid even if you set it up as 50/50. Technical issues (send failures, bounce spikes, ESP throttling) can create uneven distribution.

**Fix:** Always verify actual send counts per variant before analyzing results. If the split is more than 5% off from your target, investigate before drawing conclusions.
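A stricter version of that check is a sample ratio mismatch (SRM) test: a chi-square goodness-of-fit test of the actual send counts against the intended split. A minimal sketch, assuming a 50/50 target and illustrative counts:

```python
# Did the actual send counts match the intended 50/50 split?
sent = {"A": 52_400, "B": 47_600}          # actual sends per variant (illustrative)
intended = {"A": 0.5, "B": 0.5}

total = sum(sent.values())
chi_sq = sum(
    (sent[v] - intended[v] * total) ** 2 / (intended[v] * total) for v in sent
)

# With 2 variants (1 degree of freedom), chi_sq > 3.84 means the observed split
# deviates from the intended split at p < 0.05 - investigate delivery issues
# before trusting the experiment results.
srm_suspected = chi_sq > 3.84
```

At typical email volumes even small systematic skews are detectable, so treat an SRM flag as a delivery or assignment bug to investigate, not as a reason to adjust the results.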
### 10. P-hacking by choosing your metric after the test

"Variant B didn't win on open rate, but it won on click-to-open rate! Let's call that the winner." This is cherry-picking and dramatically inflates false positives.

**Fix:** Declare your primary metric before the test starts. Secondary metrics are informational, not decision-making.

---

## Platform implementation notes

Most email service providers (ESPs) have built-in A/B testing. When evaluating tools, look for:

- **Deterministic assignment** - Same recipient always gets same variant (hash-based, not random per-send)
- **Weighted variant support** - Ability to split traffic unevenly (e.g., 80/20 for risky changes)
- **Holdout groups** - Native support for suppressing a control group from all sends
- **Statistical significance reporting** - Confidence intervals and p-values, not just "winner" badges
- **Configurable wait times** - Auto-winner that lets you set 24-48 hour windows, not just 2-4 hours
- **Segment-level results** - Break down results by audience segment, not just aggregate

[molted.email](https://molted.email) implements deterministic hash-based variant assignment with weighted buckets, holdout group support, and two-proportion z-test significance testing with 95% confidence intervals. Experiments are tied to journey steps, so variant assignment persists across a sequence rather than randomizing per-send.

---

## References

- [Evan Miller's Sample Size Calculator](https://www.evanmiller.org/ab-testing/sample-size.html) - The standard free tool for calculating required sample sizes
- [Statsig A/B Test Calculator](https://www.statsig.com/calculator) - Sample size and significance calculator
- [Optimizely Sample Size Calculator](https://www.optimizely.com/sample-size-calculator/) - Another widely-used calculator
- [CXL - 12 A/B Testing Mistakes](https://cxl.com/blog/12-ab-split-testing-mistakes-i-see-businesses-make-all-the-time/) - Common pitfalls with real examples
- [Litmus - Email A/B Testing Guide](https://www.litmus.com/blog/email-ab-testing-how-to) - Email-specific testing best practices
- [Braze - Multi-Armed Bandit vs A/B Testing](https://www.braze.com/resources/articles/multi-armed-bandit-vs-ab-testing) - When to use adaptive algorithms
- [Rejoiner - Measuring Email Lift with Holdout Tests](https://www.rejoiner.com/resources/measure-true-profitability-email-campaigns-using-holdout-tests) - Holdout group methodology
- [Apple Mail Privacy Protection FAQ](https://support.apple.com/en-us/102051) - Impact on email tracking