---
name: experiment-design
description: Best practices for designing reproducible experiments
---

# Experiment Design Patterns

## When to Use

Load this skill when designing experiments that need to be reproducible and statistically valid.

## Reproducibility Setup

### Random Seeds

```python
import random

import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
print(f"[DECISION] Using random seed: {SEED}")
```

### Environment Recording

```python
import sys

import numpy as np
import pandas as pd

print(f"[INFO] Python: {sys.version}")
print(f"[INFO] NumPy: {np.__version__}")
print(f"[INFO] Pandas: {pd.__version__}")
```

## Experimental Controls

### Train/Test Split

```python
from sklearn.model_selection import train_test_split

# X, y are the feature matrix and labels prepared earlier
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)
print(f"[EXPERIMENT] Train: {len(X_train)}, Test: {len(X_test)}")
```

### Cross-Validation

```python
from sklearn.model_selection import cross_val_score

# model is the estimator under evaluation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"[METRIC] CV Accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

## A/B Testing Pattern

```python
from statsmodels.stats.power import TTestIndPower

# control and treatment are the previously assigned user groups
print("[EXPERIMENT] A/B Test Design")
print(f"[INFO] Control group: {len(control)}")
print(f"[INFO] Treatment group: {len(treatment)}")

# Power analysis: sample size needed to detect a medium effect
power = TTestIndPower()
sample_size = power.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"[CALC] Required sample size per group: {sample_size:.0f}")
```

## Power Analysis

Power analysis ensures your experiment has a large enough sample to detect the effect you care about. Without adequate power, you risk false negatives (missing real effects).

### Sample Size Calculation

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

print("[DECISION] Conducting a priori power analysis before data collection")

# For a two-group comparison (independent-samples t-test)
power_analysis = TTestIndPower()

# Parameters:
# - effect_size: expected Cohen's d (0.2 = small, 0.5 = medium, 0.8 = large)
# - alpha: significance level (typically 0.05)
# - power: desired statistical power (typically 0.80 or 0.90)
# - ratio: ratio of group sizes (1.0 = equal groups)
effect_size = 0.5  # Medium effect size
alpha = 0.05
desired_power = 0.80

sample_size = power_analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=desired_power,
    ratio=1.0,
    alternative='two-sided'
)

print(f"[STAT:estimate] Required n per group: {np.ceil(sample_size):.0f}")
print(f"[STAT:estimate] Total sample needed: {np.ceil(sample_size) * 2:.0f}")
print(f"[DECISION] Targeting {effect_size} effect size (Cohen's d = medium)")
```

### Achieved Power (Post-Hoc)

```python
# After data collection, calculate the power actually achieved
actual_n = 50  # Actual sample size per group

achieved_power = power_analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    nobs1=actual_n,
    ratio=1.0,
    alternative='two-sided'
)

print(f"[STAT:estimate] Achieved power: {achieved_power:.3f}")

if achieved_power < 0.80:
    print(f"[LIMITATION] Study is underpowered ({achieved_power:.0%} < 80%)")
    print("[LIMITATION] Negative results may be due to insufficient sample size")
else:
    print(f"[DECISION] Adequate power achieved ({achieved_power:.0%} ≥ 80%)")
```
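When a study comes up short of the 80% target, reporting the smallest effect it could plausibly have detected makes the limitation concrete. The sketch below solves statsmodels' power equation for effect size instead of sample size; the n of 50 per group is carried over from the post-hoc example above and is purely illustrative.

```python
from statsmodels.stats.power import TTestIndPower

# Minimum detectable effect (MDE): the smallest Cohen's d detectable
# with 80% power at the sample size actually collected
power_analysis = TTestIndPower()
actual_n = 50  # per group, as in the post-hoc example above

mde = power_analysis.solve_power(
    effect_size=None,  # left unspecified so solve_power solves for it
    nobs1=actual_n,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative='two-sided'
)

print(f"[STAT:estimate] Minimum detectable effect at n={actual_n}/group: d = {mde:.2f}")
print("[LIMITATION] True effects smaller than this would likely have been missed")
```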
### Power for Different Tests

```python
import numpy as np
from statsmodels.stats.power import FTestAnovaPower
from statsmodels.stats.proportion import proportion_effectsize

# For ANOVA (three or more groups)
anova_power = FTestAnovaPower()
n_groups = 3
effect_size_f = 0.25  # Cohen's f (0.1 = small, 0.25 = medium, 0.4 = large)

# Note: FTestAnovaPower works with the total number of observations
n_total = anova_power.solve_power(
    effect_size=effect_size_f,
    alpha=0.05,
    power=0.80,
    k_groups=n_groups
)
print(f"[STAT:estimate] ANOVA: {np.ceil(n_total):.0f} observations in total "
      f"(~{np.ceil(n_total / n_groups):.0f} per group)")

# For proportions (chi-square tests, A/B conversion tests)
# Convert the expected proportions to an effect size (Cohen's h)
p1, p2 = 0.10, 0.15  # e.g., 10% baseline, 15% expected with treatment
prop_effect = proportion_effectsize(p2, p1)  # treatment vs. baseline
print(f"[DECISION] Effect size h = {prop_effect:.3f} for proportions test")
```

## Pre-registration Concept

Pre-registration prevents HARKing (Hypothesizing After the Results are Known) and draws a clear line between confirmatory and exploratory analyses.

### Define Analysis Before Data

```python
print("[DECISION] Pre-registering analysis plan before examining data")

# Document your analysis plan BEFORE looking at the data
preregistration = {
    "primary_hypothesis": "H1: Treatment group shows higher conversion rate than control",
    "null_hypothesis": "H0: No difference in conversion rates between groups",
    "primary_endpoint": "conversion_rate",
    "secondary_endpoints": ["time_to_convert", "revenue_per_user"],
    "alpha": 0.05,
    "correction_method": "Bonferroni for secondary endpoints",
    "minimum_effect_size": "5 percentage points (10% → 15%)",
    "planned_sample_size": 500,
    "analysis_method": "Two-proportion z-test",
    "exclusion_criteria": "Users with < 1 day exposure",
}

print("[EXPERIMENT] Pre-registered analysis plan:")
for key, value in preregistration.items():
    print(f"  {key}: {value}")
```

### Confirmatory vs Exploratory Findings

```python
# After analysis, clearly label each finding
print("[FINDING] Treatment increases conversion by 4.2pp (95% CI: [1.8, 6.6])")
print("[STAT:ci] 95% CI [1.8, 6.6]")
print("[STAT:effect_size] Cohen's h = 0.12 (small)")
print("[STAT:p_value] p = 0.001")
# This is CONFIRMATORY because it tests the pre-registered hypothesis

print("[OBSERVATION] Exploratory: Effect stronger for mobile users (+7.1pp)")
print("[LIMITATION] Mobile subgroup analysis was NOT pre-registered")
print("[DECISION] Flagging as exploratory - requires replication before action")

# Document the finding type
def label_finding(finding: str, is_confirmatory: bool) -> None:
    """Label a finding according to its pre-registration status."""
    if is_confirmatory:
        # Pre-registered hypothesis
        print(f"[FINDING] CONFIRMATORY: {finding}")
    else:
        # Post-hoc discovery
        print(f"[OBSERVATION] EXPLORATORY: {finding}")
        print("[LIMITATION] This finding was not pre-registered and requires replication")
```

### Document Deviations from Plan

```python
# When you deviate from the pre-registered plan, document it!
print("[DECISION] DEVIATION FROM PRE-REGISTRATION:")
print("  Original plan: Two-proportion z-test")
print("  Actual analysis: Fisher's exact test")
print("  Reason: Cell counts < 5 in contingency table")
print("  Impact: More conservative, may reduce power")

# Keep a deviation log
deviations = [
    {
        "item": "Statistical test",
        "planned": "z-test",
        "actual": "Fisher's exact",
        "reason": "Low expected cell counts",
        "impact": "Minimal - Fisher's is more conservative",
    },
    {
        "item": "Sample size",
        "planned": 500,
        "actual": 487,
        "reason": "13 users excluded due to technical issues",
        "impact": "Power reduced from 80% to 78%",
    },
]

print("[EXPERIMENT] Deviation log:")
for d in deviations:
    print(f"  - {d['item']}: {d['planned']} → {d['actual']} ({d['reason']})")
```

## Stopping Rules

Define stopping criteria upfront to prevent p-hacking through optional stopping.
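As a minimal sketch (the helper name and thresholds are illustrative, not a prescribed API), the criteria spelled out in the next subsection can be encoded as a function so each interim decision is mechanical rather than a judgment made after seeing the data.

```python
def check_stopping(ci_lower: float, ci_upper: float,
                   minimum_effect: float, n: int, max_n: int) -> str:
    """Apply pre-specified, CI-based stopping criteria at an interim look."""
    if ci_lower > 0:
        return "STOP for success: entire CI above zero"
    if ci_upper < minimum_effect:
        return "STOP for failure: CI rules out the minimum effect of interest"
    if n >= max_n:
        return "STOP: maximum sample size reached"
    return "CONTINUE: no stopping criterion met"

# Hypothetical interim look at n=250 with effect estimate 2.6pp, 95% CI [-0.4, 5.6]
print("[DECISION]", check_stopping(ci_lower=-0.4, ci_upper=5.6,
                                   minimum_effect=2.0, n=250, max_n=1000))
```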
### Define Success/Failure Criteria Upfront

```python
print("[DECISION] Defining stopping rules BEFORE experiment starts")

stopping_rules = {
    "success_criterion": "Lower 95% CI bound > 0 (effect is positive)",
    "failure_criterion": "Upper 95% CI bound < minimum_effect (effect too small)",
    "futility_criterion": "Posterior probability of success < 5%",
    "max_sample_size": 1000,
    "interim_analyses": [250, 500, 750],  # Pre-specified checkpoints
    "alpha_spending": "O'Brien-Fleming",  # Preserve the overall alpha
}

print("[EXPERIMENT] Stopping rules defined:")
for key, value in stopping_rules.items():
    print(f"  {key}: {value}")
```

### Avoid P-Hacking Through Optional Stopping

```python
# BAD: looking at the p-value repeatedly and stopping as soon as it is
# significant inflates the false positive rate.

# GOOD: use an alpha-spending function to control the overall Type I error.
import numpy as np
from scipy import stats

def obrien_fleming_boundary(alpha: float, n_looks: int, current_look: int) -> float:
    """
    Calculate an O'Brien-Fleming-style spending boundary:
    more conservative early, less conservative late.
    """
    # Information fraction at this look
    t = current_look / n_looks
    # O'Brien-Fleming boundary on the z scale, converted back to a p-value
    z_boundary = stats.norm.ppf(1 - alpha / 2) / np.sqrt(t)
    p_boundary = 2 * (1 - stats.norm.cdf(z_boundary))
    return p_boundary

n_looks = 4   # Number of interim analyses
alpha = 0.05  # Overall significance level

print("[DECISION] Using O'Brien-Fleming alpha-spending to control Type I error")
print("[EXPERIMENT] Adjusted significance thresholds:")
for look in range(1, n_looks + 1):
    boundary = obrien_fleming_boundary(alpha, n_looks, look)
    print(f"  Look {look}/{n_looks}: p < {boundary:.5f} to stop for efficacy")

print("[LIMITATION] Stopping early requires more extreme evidence")
```
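To make the inflation concrete, here is a small, self-contained simulation (illustrative numbers only): both groups are drawn from the same distribution, yet peeking at a nominal 0.05 threshold at several interim points flags a "significant" difference far more often than 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims = 2000
looks = [100, 200, 300, 400]  # interim sample sizes per group

false_positives = 0
for _ in range(n_sims):
    # Null scenario: control and treatment come from the SAME distribution
    a = rng.normal(0.0, 1.0, looks[-1])
    b = rng.normal(0.0, 1.0, looks[-1])
    for n in looks:
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < 0.05:  # naive peeking at the unadjusted threshold
            false_positives += 1
            break

rate = false_positives / n_sims
print(f"[STAT:estimate] False positive rate with optional stopping: {rate:.3f}")
print("[OBSERVATION] Well above the nominal 0.05 - hence the adjusted boundaries above")
```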
### Sequential Analysis Methods (SPRT)

```python
# Sequential Probability Ratio Test (SPRT):
# allows continuous monitoring with controlled error rates.
import numpy as np

def sprt_bounds(alpha: float, beta: float) -> tuple[float, float]:
    """
    Calculate SPRT decision boundaries.

    Args:
        alpha: Type I error rate (false positive)
        beta: Type II error rate (false negative)

    Returns:
        (lower_bound, upper_bound) for the log-likelihood ratio
    """
    A = np.log((1 - beta) / alpha)  # Upper boundary (accept H1)
    B = np.log(beta / (1 - alpha))  # Lower boundary (accept H0)
    return B, A

alpha, beta = 0.05, 0.20  # 5% false positives, 20% false negatives (80% power)
lower, upper = sprt_bounds(alpha, beta)

print("[DECISION] Using Sequential Probability Ratio Test (SPRT)")
print(f"[STAT:estimate] Stop for H0 if LLR < {lower:.3f}")
print(f"[STAT:estimate] Stop for H1 if LLR > {upper:.3f}")
print(f"[EXPERIMENT] Continue sampling if {lower:.3f} < LLR < {upper:.3f}")

# Example SPRT monitoring for a binomial outcome
def monitor_sprt(successes: int, trials: int, p0: float, p1: float,
                 bounds: tuple[float, float]) -> tuple[str, float]:
    """Report the SPRT decision status at the current sample."""
    lower, upper = bounds
    # Log-likelihood ratio of H1 (p = p1) versus H0 (p = p0)
    llr = (successes * np.log(p1 / p0)
           + (trials - successes) * np.log((1 - p1) / (1 - p0)))

    if llr > upper:
        return "STOP: Accept H1 (treatment effective)", llr
    elif llr < lower:
        return "STOP: Accept H0 (no effect)", llr
    return "CONTINUE: Need more data", llr

# Example: H0: conversion rate = 10% vs H1: conversion rate = 15%
p_null, p_alt = 0.10, 0.15
status, llr = monitor_sprt(successes=45, trials=350, p0=p_null, p1=p_alt,
                           bounds=(lower, upper))
print(f"[STAT:estimate] Current LLR: {llr:.3f}")
print(f"[DECISION] {status}")
```

### Document Stopping Decision

```python
# When you stop an experiment, document the decision clearly
print("[DECISION] Experiment stopped at interim analysis 2/4")
print("[STAT:estimate] Current effect: 5.2pp (95% CI: [2.1, 8.3])")
print("[STAT:p_value] p = 0.0012 (< O'Brien-Fleming boundary 0.005)")
print("[EXPERIMENT] Decision: STOP FOR EFFICACY")
print("[FINDING] Treatment significantly improves conversion (confirmed at interim)")
print("[LIMITATION] Final sample (n=500) smaller than planned (n=1000)")
print("[LIMITATION] Effect estimate may regress toward the null with more data")
```

## Documentation Pattern

```python
print("[DECISION] Chose Random Forest over XGBoost because:")
print("  - Better interpretability for stakeholders")
print("  - Comparable performance (within 1% accuracy)")
print("  - Faster training time for iteration")

print("[LIMITATION] Model may not generalize to:")
print("  - Data from different time periods")
print("  - Users from different demographics")
```
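Console logs disappear with the session, so it can help to persist the same decisions and limitations alongside the run configuration. This is a minimal sketch assuming a flat JSON file; the `record_experiment` helper, its fields, and the file name are illustrative, not part of any required format.

```python
import json
import platform
import sys
from datetime import datetime, timezone

def record_experiment(path: str, seed: int,
                      decisions: list[str], limitations: list[str]) -> None:
    """Write a reproducibility record (environment plus documented decisions) to JSON."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "platform": platform.platform(),
        "seed": seed,
        "decisions": decisions,
        "limitations": limitations,
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    print(f"[INFO] Experiment record written to {path}")

record_experiment(
    "experiment_record.json",  # illustrative file name
    seed=42,
    decisions=["Random Forest over XGBoost (interpretability, comparable accuracy)"],
    limitations=["May not generalize across time periods or demographics"],
)
```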