---
name: experiment-plan
description: 'Turn a refined research proposal or method idea into a detailed, claim-driven experiment roadmap. Use after `research-refine`, or when the user asks for a detailed experiment plan, ablation matrix, evaluation protocol, run order, compute budget, or paper-ready validation that supports the core problem, novelty, simplicity, and any LLM / VLM / Diffusion / RL-based contribution.'
allowed-tools: Bash(*), Read, Write, Edit, Grep, Glob, WebSearch, WebFetch, Agent
---

# Experiment Plan: Claim-Driven, Paper-Oriented Validation

Refine and concretize: **$ARGUMENTS**

## Overview

Use this skill after the method is stable enough that the next question becomes: **what exact experiments should we run, and in what order, to defend the paper?** If the user wants the full chain in one request, prefer `/research-refine-pipeline`.

The goal is not to generate a giant benchmark wishlist. The goal is to turn a proposal into a **claim -> evidence -> run order** roadmap that supports four things:

1. the method actually solves the anchored problem
2. the dominant contribution is real and focused
3. the method is elegant enough that extra complexity is unnecessary
4. any frontier-model-era component is genuinely useful, not decorative

## Constants

- **OUTPUT_DIR = `refine-logs/`** — Default destination for experiment planning artifacts.
- **MAX_PRIMARY_CLAIMS = 2** — Prefer one dominant claim plus one supporting claim.
- **MAX_CORE_BLOCKS = 5** — Keep the must-run experimental story compact.
- **MAX_BASELINE_FAMILIES = 3** — Prefer a few strong baselines over many weak ones.
- **DEFAULT_SEEDS = 3** — Use 3 seeds when stochastic variance matters and budget allows.
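As a minimal sketch of the `DEFAULT_SEEDS` convention, per-seed results can be reported as mean plus or minus standard deviation; the metric values below are hypothetical placeholders, not real measurements:

```python
# Sketch: aggregate one metric across DEFAULT_SEEDS runs as mean +/- std.
# The accuracy values are hypothetical placeholders for illustration only.
from statistics import mean, stdev

DEFAULT_SEEDS = 3
accuracies = [71.2, 72.0, 71.5]  # one value per seed

# Guard against reporting fewer runs than the planning convention requires.
assert len(accuracies) == DEFAULT_SEEDS
print(f"accuracy: {mean(accuracies):.2f} +/- {stdev(accuracies):.2f}")
```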
## Workflow

### Phase 0: Load the Proposal Context

Read the most relevant existing files first if they exist:

- `refine-logs/FINAL_PROPOSAL.md`
- `refine-logs/REVIEW_SUMMARY.md`
- `refine-logs/REFINEMENT_REPORT.md`

Extract:

- **Problem Anchor**
- **Dominant contribution**
- **Optional supporting contribution**
- **Critical reviewer concerns**
- **Data / compute / timeline constraints**
- **Which frontier primitive is central, if any**

If these files do not exist, derive the same information from the user's prompt.

### Phase 1: Freeze the Paper Claims

Before proposing experiments, write down the claims that must be defended. Use this structure:

- **Primary claim**: the main mechanism-level contribution
- **Supporting claim**: optional, only if it directly strengthens the main paper story
- **Anti-claim to rule out**: e.g. "the gain only comes from more parameters," "the gain only comes from a larger search space," or "the modern component is just decoration"
- **Minimum convincing evidence**: what would make each claim believable to a strong reviewer?

Do not exceed `MAX_PRIMARY_CLAIMS` unless the paper truly has multiple inseparable claims.

### Phase 2: Build the Experimental Storyline

Design the paper around a compact set of experiment blocks. Default to the following blocks and delete any that are not needed:

1. **Main anchor result** — does the method solve the actual bottleneck?
2. **Novelty isolation** — does the dominant contribution itself matter?
3. **Simplicity / elegance check** — can a bigger or more fragmented version be avoided?
4. **Frontier necessity check** — if an LLM / VLM / Diffusion / RL-era component is central, is it actually the right tool?
5. **Failure analysis or qualitative diagnosis** — what does the method still miss?
For each block, decide whether it belongs in:

- **Main paper** — essential to defend the core claims
- **Appendix** — useful but non-blocking
- **Cut** — interesting, but not worth the paper budget

Prefer one strong baseline family over many weak baselines. If a stronger modern baseline exists, use it instead of padding the list.

### Phase 3: Specify Each Experiment Block

For every kept block, fully specify:

- **Claim tested**
- **Why this block exists**
- **Dataset / split / task**
- **Compared systems**: strongest baselines, ablations, and variants only
- **Metrics**: decisive metrics first, secondary metrics second
- **Setup details**: backbone, frozen vs trainable parts, key hyperparameters, training budget, seeds
- **Success criterion**: what outcome would count as convincing evidence?
- **Failure interpretation**: if the result is negative, what does it mean?
- **Table / figure target**: where this result should appear in the paper

Special rules:

- A **simplicity check** should usually compare the final method against either an overbuilt variant or a tempting extra component that the paper intentionally rejects.
- A **frontier necessity check** should usually compare the chosen modern primitive against the strongest plausible simpler or older alternative.
- If the proposal is intentionally non-frontier, say so explicitly and skip the frontier block instead of forcing one.

### Phase 4: Turn the Plan Into an Execution Order

Build a realistic run order so the user knows what to do first. Use this milestone structure:

1. **Sanity stage** — data pipeline, metric correctness, one quick overfit or toy split
2. **Baseline stage** — reproduce the strongest baseline(s)
3. **Main method stage** — run the final method on the primary setting
4. **Decision stage** — run the decisive ablations for novelty, simplicity, and frontier necessity
5. **Polish stage** — robustness, qualitative figures, appendix extras

For each milestone, estimate:

- compute cost
- expected turnaround time
- stop / go decision gate
- risk and mitigation

Separate **must-run** from **nice-to-have** experiments.

### Phase 5: Write the Outputs

#### Step 5.1: Write `refine-logs/EXPERIMENT_PLAN.md`

Use this structure:

```markdown
# Experiment Plan

**Problem**: [problem]
**Method Thesis**: [one-sentence thesis]
**Date**: [today]

## Claim Map

| Claim | Why It Matters | Minimum Convincing Evidence | Linked Blocks |
|-------|----------------|-----------------------------|---------------|
| C1    | ...            | ...                         | B1, B2        |

## Paper Storyline

- Main paper must prove:
- Appendix can support:
- Experiments intentionally cut:

## Experiment Blocks

### Block 1: [Name]
- Claim tested:
- Why this block exists:
- Dataset / split / task:
- Compared systems:
- Metrics:
- Setup details:
- Success criterion:
- Failure interpretation:
- Table / figure target:
- Priority: MUST-RUN / NICE-TO-HAVE

### Block 2: [Name]
...

## Run Order and Milestones

| Milestone | Goal | Runs | Decision Gate | Cost | Risk |
|-----------|------|------|---------------|------|------|
| M0        | ...  | ...  | ...           | ...  | ...  |

## Compute and Data Budget

- Total estimated GPU-hours:
- Data preparation needs:
- Human evaluation needs:
- Biggest bottleneck:

## Risks and Mitigations

- [Risk]:
- [Mitigation]:

## Final Checklist

- [ ] Main paper tables are covered
- [ ] Novelty is isolated
- [ ] Simplicity is defended
- [ ] Frontier contribution is justified or explicitly not claimed
- [ ] Nice-to-have runs are separated from must-run runs
```

#### Step 5.2: Write `refine-logs/EXPERIMENT_TRACKER.md`

Use this structure:

```markdown
# Experiment Tracker

| Run ID | Milestone | Purpose | System / Variant | Split | Metrics | Priority | Status | Notes |
|--------|-----------|---------|------------------|-------|---------|----------|--------|-------|
| R001   | M0        | sanity  | ...              | ...   | ...     | MUST     | TODO   | ...   |
```

Keep the tracker compact and execution-oriented.

#### Step 5.3: Present a Brief Summary to the User

```
Experiment plan ready.

Must-run blocks:
- [Block 1]
- [Block 2]

Highest-risk assumption:
- [risk]

First three runs to launch:
1. [run]
2. [run]
3. [run]

Plan file: refine-logs/EXPERIMENT_PLAN.md
Tracker file: refine-logs/EXPERIMENT_TRACKER.md
```

## Key Rules

- **Large file handling**: If the Write tool fails due to file size, immediately retry using Bash (`cat << 'EOF' > file`) to write in chunks. Do NOT ask the user for permission — just do it silently.
- **Every experiment must defend a claim.** If it does not change a reviewer's belief, cut it.
- **Prefer a compact paper story.** Design the main table first, then add only the ablations that defend it.
- **Defend simplicity explicitly.** If complexity is a concern, include a deletion study or a stronger-but-bloated variant comparison.
- **Defend frontier choices explicitly.** If a modern primitive is central, show why it beats the strongest simpler alternative.
- **Prefer strong baselines over long baseline lists.** A short, credible comparison set is better than a padded one.
- **Separate must-run from nice-to-have.** Do not let appendix ideas delay the core paper evidence.
- **Reuse proposal constraints.** Do not invent unrealistic budgets or data assumptions.
- **Do not fabricate results.** Plan evidence; do not claim evidence.

## Composing with Other Skills

```
/research-refine-pipeline -> one-shot method + experiment planning
/research-refine          -> method and claim refinement
/experiment-plan          -> detailed experiment roadmap
/run-experiment           -> execute the runs
/auto-review-loop         -> react to results and iterate on the paper
```
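The chunked-write fallback named in Key Rules can be sketched as follows, assuming the standard heredoc behavior of `cat`; the file path and section contents are placeholders:

```shell
# Sketch: write an oversized file in chunks via quoted heredocs.
# '>' creates the file with the first chunk; '>>' appends later chunks.
mkdir -p refine-logs

cat << 'EOF' > refine-logs/EXPERIMENT_PLAN.md
# Experiment Plan
EOF

cat << 'EOF' >> refine-logs/EXPERIMENT_PLAN.md
## Claim Map
EOF

# Both chunks should now be present in the file.
wc -l refine-logs/EXPERIMENT_PLAN.md
```

Quoting the delimiter (`'EOF'`) keeps the heredoc body literal, so `$` and backticks in the planned markdown are not expanded by the shell.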