# Experiment Integrity Protocol ## Core Principle **The model that writes experiment code must NOT be the model that judges experiment integrity.** This is the same principle as reviewer-independence, applied to experiments. ## Prohibited Patterns ### 1. Fake Ground Truth - ❌ Creating synthetic "reference" from model outputs and comparing against it - ❌ Using baseline model outputs as ground truth - ❌ Generating pseudo-GT that is structurally similar to predictions - ✅ Using dataset-provided ground truth - ✅ Using official evaluation scripts when available - ✅ Proxy evaluation is allowed IF explicitly labeled as `synthetic_proxy` ### 2. Score Normalization Fraud - ❌ Dividing metrics by max/min of model's own output to get 0.99+ - ❌ Rescaling scores to hide poor performance - ✅ Standard normalization (e.g., min-max across ALL methods including baselines) - ✅ Reporting raw and normalized scores side by side ### 3. Phantom Results - ❌ Claiming results from files that don't exist - ❌ Referencing metrics from functions that are never called - ❌ Reporting TRACKER status as DONE when it's still TODO - ✅ Every claimed number must trace to an actual output file ### 4. Insufficient Scope - ❌ Reporting 2-scene pilot as "comprehensive evaluation" - ❌ Using words like "robust", "extensive", "across settings" for tiny experiments - ✅ Honestly label scope: "pilot (N=2)", "preliminary", "limited evaluation" - ✅ State exact scope: N scenes, N seeds, N configurations ## Evaluation Types (must be declared) | Type | Label | What it means | Claim ceiling | |------|-------|---------------|---------------| | Real GT | `real_gt` | Dataset-provided ground truth | Full performance claims | | Synthetic proxy | `synthetic_proxy` | Model-generated reference | "Proxy consistency" only | | Self-supervised | `self_supervised_proxy` | No GT by design | Relative improvement only | | Simulation | `simulation_only` | Simulated environment | "In simulation" qualifier | | Human eval | `human_eval` | Human judges | Subject to inter-rater stats | ## Who Checks The **reviewer model** (different family from executor) performs integrity checks via `/experiment-audit`. The executor collects file paths; the reviewer reads code and results directly. **Never let the executor judge its own experiment integrity.**