---
name: experiment-design-checklist
description: Generates a rigorous experiment design given a hypothesis. Use when asked to design experiments, plan experiments, create an experimental setup, or figure out how to test a research hypothesis. Covers controls, baselines, ablations, metrics, statistical tests, and compute estimates.
---

# Experiment Design Checklist

Prevent the "I ran experiments for 3 months and they're meaningless" disaster through rigorous upfront design.

## The Core Principle

Before running ANY experiment, you should be able to answer:

1. What specific claim will this experiment support or refute?
2. What would convince a skeptical reviewer?
3. What could go wrong that would invalidate the results?

## Process

### Step 1: State the Hypothesis Precisely

Convert your research question into falsifiable predictions:

**Template:**

```
If [intervention/method], then [measurable outcome], because [mechanism].
```

**Examples:**

- "If we add an auxiliary contrastive loss, then downstream task accuracy increases by >2%, because representations become more separable."
- "If we use learned positional encodings, then performance on sequences >4096 tokens improves, because the model can extrapolate beyond training length."

**Null hypothesis:** What does "no effect" look like? This is what you're trying to reject.

### Step 2: Identify Variables

**Independent Variables (what you manipulate):**

| Variable | Levels | Rationale |
|----------|--------|-----------|
| [Var 1] | [Level A, B, C] | [Why these levels] |

**Dependent Variables (what you measure):**

| Metric | How Measured | Why This Metric |
|--------|--------------|-----------------|
| [Metric 1] | [Procedure] | [Justification] |

**Control Variables (what you hold constant):**

| Variable | Fixed Value | Why Fixed |
|----------|-------------|-----------|
| [Var 1] | [Value] | [Prevents confound X] |

### Step 3: Choose Baselines

Every experiment needs comparisons. No result is meaningful in isolation.

**Baseline Hierarchy:**

1. **Random/Trivial Baseline**
   - What does random chance achieve?
   - Sanity check that the task isn't trivial
2. **Simple Baseline**
   - Simplest reasonable approach
   - Often embarrassingly effective
3. **Standard Baseline**
   - Well-known method from the literature
   - Apples-to-apples comparison
4. **State-of-the-Art Baseline**
   - Current best published result
   - Only needed if you're claiming SOTA
5. **Ablated Self**
   - Your method minus key components
   - Shows each component contributes

**For each baseline, document:**

- Source (paper, implementation)
- Hyperparameters used
- Whether you re-ran it or used reported numbers
- Any modifications made

### Step 4: Design Ablations

Ablations answer: "Is each component necessary?"

**Ablation Template:**

| Variant | What's Removed/Changed | Expected Effect | If No Effect... |
|---------|------------------------|-----------------|-----------------|
| Full Model | Nothing | Best performance | - |
| w/o Component A | Remove A | Performance drops X% | A isn't helping |
| w/o Component B | Remove B | Performance drops Y% | B isn't helping |
| Component A only | Only A, no B | Shows A's isolated contribution | - |

**Good ablations are:**

- Surgical (one change at a time)
- Interpretable (clear what was changed)
- Informative (the result tells you something either way)

### Step 5: Address Confounds

Things that could explain your results OTHER than your hypothesis:

**Common Confounds:**

| Confound | How to Check | How to Control |
|----------|--------------|----------------|
| Hyperparameter tuning advantage | Same tuning budget for all methods | Report the tuning procedure |
| Compute advantage | Matched FLOPs/params | Report compute used |
| Data leakage | Check train/test overlap | Strict separation |
| Random seed luck | Multiple seeds | Report variance |
| Implementation bugs (baseline) | Verify baseline numbers against the source | Use official implementations |
| Cherry-picked examples | Random or systematic selection | Pre-register selection criteria |
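The data-leakage row above is the cheapest confound to automate. Below is a minimal Python sketch that flags train/test overlap by exact match on normalized text; `load_split` is a hypothetical stand-in for your own data loading, and the normalization rule is an assumption to adapt to your task.

```python
# Train/test overlap check: a minimal sketch, not a full dedup pipeline.
import hashlib

def fingerprint(text: str) -> str:
    """Hash a normalized example so trivially edited duplicates still collide."""
    normalized = " ".join(text.lower().split())  # collapse case and whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def overlap_report(train_texts, test_texts):
    """Print how many test examples also appear (near-)verbatim in train."""
    train_hashes = {fingerprint(t) for t in train_texts}
    leaked = [t for t in test_texts if fingerprint(t) in train_hashes]
    print(f"{len(leaked)}/{len(test_texts)} test examples overlap with train")
    for example in leaked[:5]:  # surface a few offenders for manual review
        print("  LEAKED:", example[:80])

# overlap_report(load_split("train"), load_split("test"))  # load_split: your loader
```

Exact hashing only catches verbatim and near-verbatim duplicates; for paraphrase-level leakage you would need n-gram overlap or MinHash-style near-duplicate detection.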
### Step 6: Statistical Rigor

**Sample Size:**

- How many random seeds? (Minimum: 3; better: 5+)
- How many data splits? (If applicable)
- Power analysis: can you detect the expected effect size?

**What to Report:**

- Mean ± standard deviation (or standard error)
- Confidence intervals where appropriate
- Statistical significance tests if claiming "better"

**Appropriate Tests:**

| Comparison | Test | Assumptions |
|------------|------|-------------|
| Two methods, normal data | t-test | Normality, equal variance |
| Two methods, unknown distribution | Mann-Whitney U | Independent samples; at least ordinal data |
| Multiple methods | ANOVA + post-hoc | Normality |
| Multiple methods, unknown distribution | Kruskal-Wallis | At least ordinal data |
| Paired comparisons | Wilcoxon signed-rank | Same test instances |

**Avoid:**

- p-hacking (running until significant)
- Uncorrected multiple comparisons (apply a Bonferroni or similar correction)
- Reporting only favorable metrics
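To make the two-method rows of the table above concrete, here is a short Python sketch using `scipy.stats`; the per-seed accuracy arrays are placeholder numbers, not real results.

```python
# Compare two methods across random seeds: a sketch with made-up numbers.
import numpy as np
from scipy import stats

baseline = np.array([0.712, 0.708, 0.715, 0.703, 0.710])  # one value per seed
ours     = np.array([0.731, 0.725, 0.734, 0.719, 0.728])

# Report central tendency and spread, never just the winner.
for name, runs in [("baseline", baseline), ("ours", ours)]:
    print(f"{name}: {runs.mean():.3f} ± {runs.std(ddof=1):.3f} (n={len(runs)})")

# Parametric: Welch's t-test (does not assume equal variance).
t_stat, p_t = stats.ttest_ind(ours, baseline, equal_var=False)
print(f"Welch t-test: t={t_stat:.2f}, p={p_t:.4f}")

# Non-parametric fallback when normality is doubtful.
u_stat, p_u = stats.mannwhitneyu(ours, baseline, alternative="two-sided")
print(f"Mann-Whitney U: U={u_stat:.1f}, p={p_u:.4f}")

# Running k comparisons? Shrink the significance threshold accordingly.
k, alpha = 4, 0.05
print(f"Bonferroni-corrected alpha for {k} comparisons: {alpha / k:.4f}")
```

With only a handful of seeds these tests have little power, which is exactly why the sample-size questions above include a power analysis.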
### Step 7: Compute Budget

Before running, estimate:

| Component | Estimate | Notes |
|-----------|----------|-------|
| Single training run | X GPU-hours | [Details] |
| Hyperparameter search | Y runs × X hours | [Search strategy] |
| Baselines | Z runs × W hours | [Which baselines] |
| Ablations | N variants × X hours | [Which ablations] |
| Seeds | M seeds × above | [How many seeds] |
| **Total** | **T GPU-hours** | Buffer: 1.5-2x |

**Go/No-Go Decision:** Is this feasible with available resources?

### Step 8: Pre-Registration (Optional but Recommended)

Write down BEFORE running:

- Exact hypotheses
- Primary metrics (not chosen post-hoc)
- Analysis plan
- What would constitute "success"

This prevents unconscious goal-post moving.

## Output: Experiment Design Document

```markdown
# Experiment Design: [Title]

## Hypothesis
[Precise statement]

## Variables
### Independent
[Table]
### Dependent
[Table]
### Controls
[Table]

## Baselines
1. [Baseline 1]: [Source, details]
2. [Baseline 2]: [Source, details]

## Ablations
[Table]

## Confound Mitigation
[Table]

## Statistical Plan
- Seeds: [N]
- Tests: [Which tests for which comparisons]
- Significance threshold: [α level]

## Compute Budget
[Table with total estimate]

## Success Criteria
- Primary: [What must be true]
- Secondary: [Nice to have]

## Timeline
- Phase 1: [What, when]
- Phase 2: [What, when]

## Known Risks
1. [Risk 1]: [Mitigation]
2. [Risk 2]: [Mitigation]
```

## Red Flags in Experiment Design

🚩 "We'll figure out the metrics later"
🚩 "One run should be enough"
🚩 "We don't need baselines, it's obviously better"
🚩 "Let's just see what happens"
🚩 "We can always run more if it's not significant"
🚩 No compute estimate before starting
🚩 Vague success criteria