---
name: ml-subagent-dev
description: Use when executing ML experiment plans with subagents - code subtasks use standard superpowers TDD + spec review + quality review; the single integration training subtask additionally runs L0 + L1 once
---
Subtasks come in TWO types and have DIFFERENT completion gates. The plan MUST mark exactly ONE subtask as `[INTEGRATION]` (the final training pipeline that assembles all components). All others are code subtasks.
A subtask cannot be marked complete unless its type-specific gate passes. No exceptions.
## Code Subtask Completion Gate
For any subtask NOT marked `[INTEGRATION]` (model class, dataset, loss, custom layer, evaluator core, etc.):
- [ ] TDD red → green — unit tests written, failed, then passed
- [ ] Spec Review — passed (experiment design compliance confirmed)
- [ ] Quality Review — passed (code quality confirmed)
- [ ] Lightweight conclusion recorded — "implemented + N unit tests pass"
**Code subtasks do NOT run L0 or L1.** Their correctness is verified by unit tests + reviews. The Validation Pyramid only fires once, on the integration subtask.
## Integration Subtask Completion Gate
For the single `[INTEGRATION]` subtask (the final delivered training pipeline):
- [ ] TDD red → green — integration tests written, failed, then passed
- [ ] Spec Review — passed (experiment design compliance confirmed)
- [ ] Quality Review — passed (code quality confirmed)
- [ ] L0: VP Static Checks — passed (with actual numbers recorded)
- [ ] L1: ML Runtime Validation — passed (with actual metrics and pipeline stages confirmed)
- [ ] Full conclusion recorded — with metric evidence from L1
If ANY item is unchecked, the subtask is NOT complete. Do NOT proceed. Do NOT mark it as done.
## Anti-Pattern: "Every Subtask Needs Full VP"
Running L0 + L1 on every code subtask was the old design, and it was wasteful: L1 takes 5-15 minutes per run, components in isolation cannot meaningfully validate a training pipeline, and integration bugs only surface at the integration step anyway.
| Thought | Reality |
|---------|---------|
| "I should validate this model class with L1" | Model class alone is not a training pipeline. Unit tests verify deterministic behavior; integration is where training is validated. |
| "Skipping L1 here might miss a shape bug" | TDD unit tests catch shape bugs at the function level. L0+L1 on the integration step catches end-to-end issues, including cross-component shape mismatches. |
| "I should run L0 on every subtask" | L0 checks runtime ML config (device, precision, optimizer, logging). Code subtasks don't have a training run yet — most checks aren't applicable. L0 fires on integration where the full training script exists. |
| "Saving the VP for one step is risky" | The integration step IS the validation step. Catching all integration issues there is the point. |
## Anti-Pattern: "This Integration Subtask Doesn't Need VP"
Equally dangerous in the other direction. Once a subtask is marked `[INTEGRATION]`:
| Thought | Reality |
|---------|---------|
| "This is a small experiment" | Toy experiments with wrong gradients waste days of debugging |
| "Unit tests already passed for the components" | Unit tests check components in isolation. VP checks the assembled training run. They test different things. |
| "L1 is overkill" | If this subtask is the delivered training pipeline, it WILL be trained. VP validates that exact path. |
The integration subtask gets the full L0 + L1 treatment, every time. No exceptions.
# ML Subagent-Driven Development
Execute ML experiment plans by dispatching a fresh subagent per subtask. Code subtasks follow the standard superpowers review path. The integration subtask additionally runs the Validation Pyramid (L0 + L1) once.
**Core principle:** Standard review for components + one full VP at integration = correct implementations with trustworthy training results, without wasting compute on per-component runtime validation.
**Adapted from:** `superpowers:subagent-driven-development`. Key differences:
- Code subtask: standard TDD → Spec Review → Quality Review (matches superpowers, with ML-aware spec criteria)
- Integration subtask: standard reviews + L0 (`spml:ml-static-checks`) + L1 (`spml:ml-runtime-validator`)
- Spec reviewer always checks experiment design compliance (hypothesis, variable control)
- Quality reviewer always checks code quality
- Code subtasks record a lightweight conclusion; integration records full metric evidence
- Shared fix loop: fail → Implementer fixes → re-run the failed stage; after 5 failed attempts → user intervention
- Large fix rollback: fix > 50 lines → re-run all prior stages
## When to Use
- You have an ML experiment plan (from experiment-planning) with exactly one `[INTEGRATION]` subtask
- Subtasks are mostly independent
- You want to stay in this session (vs. superpowers:executing-plans in a parallel session)
## Plan Gate
Before dispatching any implementer subagent, read the plan and fail fast on any of:
- Missing or duplicate `[INTEGRATION]` marker — there must be **exactly one** integration subtask. Send the plan back to `spml:experiment-planning` for revision.
- Plan describes a training task with evaluation but is missing any of:
- a dedicated evaluation subtask (a code subtask that builds the evaluator core)
- step-based evaluation cadence
- evaluation scope, defaulting to `full validation` unless explicitly overridden
- both required evaluation entry modes (checkpoint-based and in-memory during training)
  - one shared evaluator core across both entry modes (see the sketch below)
- evaluation progress visibility requirements
- mode-aware failure-handling requirements at the evaluation boundary
- runtime checks (in the integration subtask) for cadence firing and evaluation mode reporting
These are not advisory. Incomplete plans must be sent back for revision before implementation starts.
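To make the shared-core requirement concrete, here is a minimal sketch of the expected shape, using hypothetical module and function names and assuming a PyTorch-style model; the plan's own interfaces take precedence:

```python
# Hypothetical sketch: one evaluator core, two entry modes that both call it.
import torch


def evaluate_core(model, eval_loader, device="cuda"):
    """Shared core: decides HOW evaluation runs. Both entry modes call this."""
    model.eval()
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():
        for inputs, targets in eval_loader:
            loss = model.compute_loss(inputs.to(device), targets.to(device))  # hypothetical core API
            total_loss += loss.item()
            n_batches += 1
    model.train()
    return {"eval_loss": total_loss / max(n_batches, 1)}


def evaluate_checkpoint(checkpoint_path, model_factory, eval_loader):
    """Entry mode 1: checkpoint-based — loads weights from disk and reports load behavior."""
    model = model_factory()
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state["model"])  # hypothetical checkpoint layout
    print(f"[eval:checkpoint] loaded {checkpoint_path}")
    return evaluate_core(model, eval_loader)


def evaluate_in_memory(model, eval_loader):
    """Entry mode 2: in-memory during training — the trainer decides WHEN this fires."""
    print("[eval:in-memory] using live training state")
    return evaluate_core(model, eval_loader)
```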
## Revision Mode Adaptation
When the plan contains revision markers (`[x]`, `REVISED`, `NEW`):
- **`[x]` (unchanged, gate previously passed)** — Skip entirely. Prior results preserved.
- **`[ ] REVISED`** — Re-execute on existing code:
- Implementer subagent receives the old code file paths as context and modifies in place
- All gate items for that subtask type must re-run (unit tests + reviews; integration also re-runs L0 + L1)
- Old gate results are voided
- **`[ ] NEW`** — Normal fresh flow
If a revision touches the integration subtask, L0 + L1 must always re-run. If a revision touches a code subtask that the integration depends on, the integration subtask should be re-flagged for re-execution (its assumptions about that component may have shifted).
## The Process
```dot
digraph process {
rankdir=TB;
"Read plan, validate single [INTEGRATION] marker\nTaskCreate per subtask" [shape=box];
"Subtask type?" [shape=diamond];
subgraph cluster_code {
label="Code Subtask Path";
"Dispatch implementer (code)" [shape=box];
"TDD: tests + implement + tests pass" [shape=box];
"Dispatch spec reviewer" [shape=box];
"Spec compliant?" [shape=diamond];
"Implementer fixes spec gaps" [shape=box];
"Dispatch quality reviewer" [shape=box];
"Quality OK?" [shape=diamond];
"Implementer fixes quality issues" [shape=box];
"Code Completion Gate" [shape=diamond style=filled fillcolor=red fontcolor=white];
"Record lightweight conclusion" [shape=box style=filled fillcolor=lightgreen];
}
subgraph cluster_integration {
label="Integration Subtask Path (single, runs once)";
"Dispatch implementer (integration)" [shape=box];
"TDD: integration tests + assemble + tests pass" [shape=box];
"Dispatch spec reviewer (int)" [shape=box];
"Int spec compliant?" [shape=diamond];
"Implementer fixes spec gaps (int)" [shape=box];
"Dispatch quality reviewer (int)" [shape=box];
"Int quality OK?" [shape=diamond];
"Implementer fixes quality issues (int)" [shape=box];
"L0: VP Static Checks" [shape=box style=filled fillcolor=lightyellow];
"L0 passed?" [shape=diamond];
"Implementer fixes L0 issues" [shape=box];
"L1: ML Runtime Validation" [shape=box style=filled fillcolor=lightyellow];
"L1 passed?" [shape=diamond];
"Implementer fixes L1 issues" [shape=box];
"Integration Completion Gate" [shape=diamond style=filled fillcolor=red fontcolor=white];
"Record full conclusion w/ L1 metrics" [shape=box style=filled fillcolor=lightgreen];
}
"More subtasks?" [shape=diamond];
"Post-Completion Gate:\nAsk user Train / Research / Done" [shape=diamond style=filled fillcolor=orange fontcolor=white];
"Read plan, validate single [INTEGRATION] marker\nTaskCreate per subtask" -> "Subtask type?";
"Subtask type?" -> "Dispatch implementer (code)" [label="code"];
"Dispatch implementer (code)" -> "TDD: tests + implement + tests pass";
"TDD: tests + implement + tests pass" -> "Dispatch spec reviewer";
"Dispatch spec reviewer" -> "Spec compliant?";
"Spec compliant?" -> "Implementer fixes spec gaps" [label="no"];
"Implementer fixes spec gaps" -> "Dispatch spec reviewer" [label="re-review"];
"Spec compliant?" -> "Dispatch quality reviewer" [label="yes"];
"Dispatch quality reviewer" -> "Quality OK?";
"Quality OK?" -> "Implementer fixes quality issues" [label="no"];
"Implementer fixes quality issues" -> "Dispatch quality reviewer" [label="re-review"];
"Quality OK?" -> "Code Completion Gate" [label="yes"];
"Code Completion Gate" -> "Record lightweight conclusion" [label="all checked"];
"Record lightweight conclusion" -> "More subtasks?";
"Subtask type?" -> "Dispatch implementer (integration)" [label="[INTEGRATION]"];
"Dispatch implementer (integration)" -> "TDD: integration tests + assemble + tests pass";
"TDD: integration tests + assemble + tests pass" -> "Dispatch spec reviewer (int)";
"Dispatch spec reviewer (int)" -> "Int spec compliant?";
"Int spec compliant?" -> "Implementer fixes spec gaps (int)" [label="no"];
"Implementer fixes spec gaps (int)" -> "Dispatch spec reviewer (int)" [label="re-review"];
"Int spec compliant?" -> "Dispatch quality reviewer (int)" [label="yes"];
"Dispatch quality reviewer (int)" -> "Int quality OK?";
"Int quality OK?" -> "Implementer fixes quality issues (int)" [label="no"];
"Implementer fixes quality issues (int)" -> "Dispatch quality reviewer (int)" [label="re-review"];
"Int quality OK?" -> "L0: VP Static Checks" [label="yes"];
"L0: VP Static Checks" -> "L0 passed?";
"L0 passed?" -> "Implementer fixes L0 issues" [label="no"];
"Implementer fixes L0 issues" -> "L0: VP Static Checks" [label="re-run\n(fix>50 lines: rollback)"];
"L0 passed?" -> "L1: ML Runtime Validation" [label="yes"];
"L1: ML Runtime Validation" -> "L1 passed?";
"L1 passed?" -> "Implementer fixes L1 issues" [label="no"];
"Implementer fixes L1 issues" -> "L1: ML Runtime Validation" [label="re-run\n(fix>50 lines: rollback)"];
"L1 passed?" -> "Integration Completion Gate" [label="yes"];
"Integration Completion Gate" -> "Record full conclusion w/ L1 metrics" [label="all checked"];
"Record full conclusion w/ L1 metrics" -> "More subtasks?";
"More subtasks?" -> "Subtask type?" [label="yes"];
"More subtasks?" -> "Post-Completion Gate:\nAsk user Train / Research / Done" [label="no"];
}
```
## Progress Reporting
The orchestrator MUST use TaskCreate/TaskUpdate to give the user real-time visibility into subagent progress.
### Orchestrator Responsibilities
1. **Create one Task per subtask** before dispatching the implementer:
```
TaskCreate(
subject: "Subtask N: [name][ — INTEGRATION if marked]",
activeForm: "Implementing [name]",
description: "Phase: Implementation — starting"
)
```
2. **Update the Task before each phase transition** (use the task ID from step 1):
| Phase | activeForm | description | Applies to |
|-------|-----------|-------------|------------|
| Implementation | `Implementing [name]` | _(subagent updates internally)_ | both |
| Spec Review | `Spec reviewing [name]` | `Phase: Spec Review` | both |
| Quality Review | `Quality reviewing [name]` | `Phase: Quality Review` | both |
| L0 Static | `Running L0 static checks on [name]` | `Phase: L0 VP Static Checks` | integration only |
| L1 Runtime | `Running L1 runtime validation on [name]` | `Phase: L1 Runtime Validation` | integration only |
| Fix loop | `Fixing [stage] issues in [name]` | `Phase: Fix loop ([stage], attempt N/5)` | both |
| Done | _(mark completed)_ | Conclusion summary | both |
3. **Pass `TASK_ID: [id]`** in every subagent prompt so subagents can call TaskUpdate.
### Subagent Responsibilities
Every subagent receives a `TASK_ID` and MUST call TaskUpdate at each milestone to update the task's `description` field. Milestone updates should be concise, one-line status strings.
## ML Implementer Subagent Prompt (code subtask)
```
You are implementing Subtask N: [subtask name]
Type: CODE SUBTASK (no VP — standard review path only)
TASK_ID: [id from orchestrator's TaskCreate]
## Progress Reporting
You MUST call TaskUpdate(taskId=TASK_ID, description="...") at each milestone
below. This is how the user tracks your progress. Do NOT skip this.
## Experiment Context
**Overall hypothesis:** [from plan header]
**This subtask's role:** [what component this builds]
## Task Description
[FULL TEXT of subtask from plan]
## Code Separation Rule
Core code (model, training, data) must NEVER import from test/validation code
or toolkit. Validation scripts observe core code externally.
## Your Job
1. **Write unit tests** for any custom functions (deterministic code only)
→ TaskUpdate: "Phase: Implementation — writing unit tests (N test cases)"
2. **Run unit tests** — verify they fail (TDD red)
→ TaskUpdate: "Phase: Implementation — TDD red confirmed, implementing core code"
3. **Implement core code** (no test/validation imports)
4. **Run unit tests** — verify they pass (TDD green)
→ TaskUpdate: "Phase: Implementation — TDD green, all N tests passing"
5. **Self-review** — check your own code before submission
→ TaskUpdate: "Phase: Implementation — self-review complete, ready for spec review"
6. **Commit** with message: "experiment: [subtask description]"
Note: After your code passes unit tests, the orchestrator will run Spec Review and
Quality Review. You do NOT run reviews yourself. There is NO L0 or L1 for code
subtasks — those run only on the integration subtask.
If this subtask builds an evaluator core:
- build one evaluator core shared by checkpoint-based and in-memory entry modes
- expose mode-aware start/end reporting and boundary errors
- the trainer integration (which decides cadence) happens in the integration subtask
## Report Format
- What you implemented
- Unit test results (N tests, all passing)
- Files changed
- Any concerns or questions
```
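To illustrate the Code Separation Rule, a minimal sketch of external observation, with hypothetical module and submodule names and assuming PyTorch: the core module imports nothing from test or validation code, and the validation script wraps it from the outside with a forward hook.

```python
# validation/check_activations.py — observes core code externally (hypothetical paths).
# Note the import direction: validation imports core, never the reverse.
import torch

from core.model import build_model  # hypothetical core factory; core/ has no test/validation imports


def check_hidden_norms(batch):
    """Attach a forward hook to inspect activations without modifying core code."""
    model = build_model()
    norms = []

    def hook(module, inputs, output):
        norms.append(output.detach().norm().item())

    handle = model.backbone.register_forward_hook(hook)  # "backbone" is a hypothetical submodule
    with torch.no_grad():
        model(batch)
    handle.remove()
    return norms
```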
## ML Implementer Subagent Prompt (integration subtask)
```
You are implementing Subtask N: [subtask name]
Type: INTEGRATION SUBTASK (this is the final delivered training pipeline)
TASK_ID: [id from orchestrator's TaskCreate]
## Progress Reporting
You MUST call TaskUpdate(taskId=TASK_ID, description="...") at each milestone
below. Do NOT skip this.
## Experiment Context
**Overall hypothesis:** [from plan header]
**This subtask's role:** Assemble all completed components into a runnable
training pipeline. This is THE deliverable — the entry point that gets run
during long-running training and (optionally) inside autoresearch / ml-iteration.
**Validation scope:** L0 + L1 will run on this subtask after standard reviews.
## Task Description
[FULL TEXT of integration subtask from plan]
## Components to Integrate
[list completed code subtasks the integration depends on, with file paths]
## Code Separation Rule
Core code (model, training, data) must NEVER import from test/validation code
or toolkit. Validation scripts observe core code externally.
## Your Job
1. **Write integration tests** — end-to-end smoke test that exercises the
full pipeline (data → model → loss → backward → step) on a tiny shape
→ TaskUpdate: "Phase: Implementation — writing integration tests"
2. **Run integration tests** — verify they fail (TDD red)
→ TaskUpdate: "Phase: Implementation — TDD red confirmed, assembling pipeline"
3. **Assemble the pipeline** — wire components, write the training script,
add logging (loss/speed file output, MFU, gradient norms), checkpoint
save/resume, fixed seeds. Match all production-training requirements
from the plan.
4. **Run integration tests** — verify they pass (TDD green)
→ TaskUpdate: "Phase: Implementation — TDD green, integration tests passing"
5. **Self-review** — check the assembled pipeline before submission
→ TaskUpdate: "Phase: Implementation — self-review complete, ready for spec review"
6. **Commit** with message: "experiment: [integration description]"
Note: After your code passes integration tests, the orchestrator will run Spec
Review → Quality Review → L0 (ml-static-checks) → L1 (ml-runtime-validator).
You do NOT run reviews or VP yourself.
If evaluation is in scope:
- trainer code decides WHEN evaluation fires (step-based cadence)
- evaluator code decides HOW evaluation runs (shared core across both entry modes)
- emit phase-start / progress / phase-end / result / efficiency signals
- surface mode-aware errors at the evaluation boundary
## Report Format
- What you assembled (file map: which components plug in where)
- Integration test results
- Files changed
- Any concerns or questions
```
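A minimal sketch of the kind of end-to-end smoke test step 1 asks for, using hypothetical factory names and a tiny configuration; the plan's actual component interfaces take precedence:

```python
# tests/test_pipeline_smoke.py — exercises data → model → loss → backward → step on a tiny shape.
import torch

from core.model import build_model       # hypothetical factories from the completed code subtasks
from core.data import build_dataloader
from core.loss import build_loss


def test_one_training_step_runs_and_updates_weights():
    torch.manual_seed(0)
    model = build_model(hidden_dim=8)                               # tiny config, hypothetical kwarg
    loader = build_dataloader(split="train", batch_size=2, limit=2)  # hypothetical signature
    loss_fn = build_loss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    before = [p.detach().clone() for p in model.parameters()]
    inputs, targets = next(iter(loader))
    loss = loss_fn(model(inputs), targets)
    assert torch.isfinite(loss), "loss must be finite on the smoke batch"

    loss.backward()
    optimizer.step()

    changed = any(not torch.equal(b, p.detach()) for b, p in zip(before, model.parameters()))
    assert changed, "optimizer step should update at least one parameter"
```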
## ML Spec Reviewer Prompt
```
You are reviewing whether a subtask implementation matches its experiment design.
TASK_ID: [id from orchestrator]
Subtask type: [CODE | INTEGRATION]
## Progress Reporting
Call TaskUpdate(taskId=TASK_ID, description="...") at start and end:
- Start: "Phase: Spec Review — checking experiment design compliance"
- End: "Phase: Spec Review — [✅ compliant | ❌ N issues found]"
## Experiment Design
**Hypothesis:** [from plan]
**Independent variable:** [what should change]
**Dependent variable:** [what to measure]
**Control variable:** [what must stay the same]
## Subtask Spec
[FULL TEXT of subtask requirements]
## Your Job
Read the actual code and verify:
**Experiment design compliance:**
- Does the implementation match the stated hypothesis?
- Is ONLY the independent variable changed? (no confounds)
- Are control variables truly unchanged?
- Is the dependent variable being measured correctly?
**Spec compliance:**
- Missing requirements?
- Extra/unneeded work?
- Misunderstandings?
**ML-specific checks:**
- Core code imports from test/validation code? (VIOLATION)
- Validation scripts observe externally? (hooks/wrappers, not modifying core)
- Correct loss function for the task?
- Data preprocessing consistent between training and evaluation?
**Integration-specific checks (only when subtask type is INTEGRATION):**
- Does the integration assemble exactly the completed components from the plan?
- If evaluation is in scope, does the plan/code preserve the split:
- trainer decides when evaluation runs
- evaluator decides how evaluation runs
- If evaluation is in scope, are both entry modes present through one shared evaluator core?
- If evaluation is in scope, is evaluation still observable during long runs?
Report:
- ✅ Spec compliant
- ❌ Issues found: [list with file:line references]
```
## ML Quality Reviewer Prompt
```
You are reviewing implementation quality for a completed ML subtask.
TASK_ID: [id from orchestrator]
Subtask type: [CODE | INTEGRATION]
## Progress Reporting
Call TaskUpdate(taskId=TASK_ID, description="...") at start and end:
- Start: "Phase: Quality Review — checking code quality"
- End: "Phase: Quality Review — [✅ approved | ❌ N issues found]"
Note: For integration subtasks, L0 (ml-static-checks) and L1 run AFTER this
review. Your job here is purely code quality, not VP.
## Your Job
**Code quality (same as standard review):**
- Clean, maintainable code?
- Proper error handling at system boundaries?
- No security issues?
**ML-specific quality:**
- Fixed random seeds where needed?
- Proper CUDA synchronization for timing?
- No data leakage between train/eval?
- Gradient computation correct (detach where needed)?
**Integration-specific quality (only when subtask type is INTEGRATION):**
- Production-training requirements met (human-readable log file, MFU,
tqdm/progress, checkpoint interval, resume support, fixed seeds)?
- If evaluation is in scope, are mode-aware boundary errors and progress
signals implemented where they belong?
Report:
- ✅ Approved
- ❌ Issues: [list with severity and file:line references]
```
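For reference, the patterns the ML-specific quality items point at look roughly like this — a hedged sketch assuming PyTorch; exact placement depends on the code under review:

```python
import time

import torch


def set_seed(seed: int = 42):
    # Fixed seeds for reproducibility (extend with numpy/random if the pipeline uses them).
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


def timed_forward(model, batch):
    # Proper CUDA synchronization: without synchronize(), the timer measures kernel launch, not execution.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model(batch)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # Detach before logging so monitoring code cannot keep the autograd graph alive.
    return out, elapsed, out.detach().float().mean().item()
```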
## Conclusion Recording
### Code subtask (lightweight)
```markdown
### Subtask N Conclusion (code)
**Role:** [what component this builds]
**Result:** implemented
**Evidence:**
- N unit tests passing
- Files: [list]
```
### Integration subtask (full)
```markdown
### Subtask N Conclusion (INTEGRATION)
**Hypothesis:** [restated]
**Result:** effective / ineffective / inconclusive
**Evidence (from L1):**
- [metric]: [actual value] (expected: [threshold])
- [metric]: [actual value] (expected: [threshold])
**Anomalies:** [any unexpected observations]
**Recommendation:** [proceed / investigate further / abandon direction]
```
Record this in the plan document or a separate experiment log.
## Post-Completion Gate
After ALL subtasks are complete (code subtasks pass their gate AND the integration subtask passes its gate including L0+L1), you MUST pause and present the following to the user. Do NOT decide this yourself.
First, check if the brainstorm design doc contains a `## Autoresearch Protocol` section.
**If Autoresearch Protocol section exists**, present to the user:
> All subtasks complete. Integration VP passed. Next step:
>
> 1. **Research** — automated experiment iteration. I will invoke spml:autoresearch-handoff to generate the research protocol and startup prompt for autonomous exploration.
> 2. **Train** — needs long-running training (hours/days). I will invoke spml:training-handoff to generate experiment-context.md + watchdog-prompt.md for a new monitoring session.
> 3. **Done** — experiment is already complete within this session. I will invoke spml:verification.
>
> Which one?
**If no Autoresearch Protocol section**, present the original two options:
> All subtasks complete. Integration VP passed. Next step:
>
> 1. **Train** — needs long-running training (hours/days). I will invoke spml:training-handoff to generate experiment-context.md + watchdog-prompt.md for a new monitoring session.
> 2. **Done** — experiment is already complete within this session. I will invoke spml:verification.
>
> Which one?
- **User chooses Train** → Invoke `spml:training-handoff`. The integration subtask's L1-validated training script is the production training script.
- **User chooses Done** → Invoke `spml:verification` directly.
- **User chooses Research** → Invoke `spml:autoresearch-handoff`. Verification happens later, after autoresearch completes.
When the long-running phase includes evaluation, downstream checks should confirm (a sketch of the expected signal shape follows this list):
- in-training evaluation fires at the planned step cadence
- checkpoint-based evaluation reports checkpoint load behavior
- in-training evaluation reports that it is using in-memory state
- evaluation start/end messages and progress output appear as runtime checks
- evaluation errors surface with mode-aware context at the evaluation boundary
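One way those signals can be emitted so the checks above have something concrete to look for — a hypothetical logging shape; the plan's actual logging conventions take precedence:

```python
import logging

logger = logging.getLogger("eval")


def log_eval_phase(mode: str, step: int, event: str, **fields):
    """mode is 'checkpoint' or 'in-memory'; event is 'start', 'progress', 'end', or 'result'."""
    extras = " ".join(f"{k}={v}" for k, v in fields.items())
    logger.info("[eval][%s][step %d] %s %s", mode, step, event, extras)


# Example emissions during an in-memory evaluation at step 2000 (hypothetical values):
# log_eval_phase("in-memory", 2000, "start", scope="full-validation")
# log_eval_phase("in-memory", 2000, "progress", batches="50/200")
# log_eval_phase("in-memory", 2000, "result", eval_loss=2.41)
```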
## Red Flags
**Never:**
- Run L0 or L1 on code subtasks
- Skip L0 or L1 on the integration subtask
- Allow more than one subtask marked `[INTEGRATION]`
- Allow zero subtasks marked `[INTEGRATION]` for an experiment that ends in a training run
- Accept VP "pass" without checking actual numbers
- Let implementer skip unit tests for custom code
- Proceed when an integration VP layer fails (trigger diagnostics instead)
- Change control variables in a subtask (confounds the experiment)
- Record "effective" without L1 evidence
**Always:**
- Validate the `[INTEGRATION]` marker count at Plan Gate
- Record actual metric values (not just pass/fail) for the integration subtask
- Note anomalies even when passing
- Keep core code free of test/validation imports
- Fix random seeds for reproducibility
## Integration
- **spml:experiment-planning** — Creates the plan this skill executes (must mark exactly one `[INTEGRATION]` subtask)
- **spml:validation-pyramid** — Defines the 2-level VP (runs only on the integration subtask)
- **spml:ml-static-checks** — L0 static analysis (dispatched as subagent on the integration subtask only)
- **spml:ml-runtime-validator** — L1 runtime validation (orchestrator invokes after L0, on the integration subtask only)
- **spml:diagnostics** — Called when integration VP check fails
- **spml:training-handoff** — Called after Post-Completion Gate if user chooses Train
- **spml:verification** — Called after Post-Completion Gate if user chooses Done
- **spml:autoresearch-handoff** — Called after Post-Completion Gate if user chooses Research
- **spml:ml-iteration / spml:autoresearch** — Iterative orchestrators that run their own per-round VP (each round IS an integration delivery); the integration-only rule does not change their per-round behavior