---
name: maestro-workflow
description: >
  Multi-LLM orchestration implementing the 5-stage coding workflow:
  Example Analysis → Hypothesis → Implementation → Debug Loop → Recursive Improvement.
  Based on "Towards a Science of Scaling Agent Systems" (Kim et al., 2025):
  - Centralized Consult architecture (Claude orchestrates, others advise)
  - Measured coordination (avoid MAS overhead in tool-heavy stages)
  - Tests-first selection (Poetiq pattern, not voting)
  Use when: debugging complex issues, analyzing unfamiliar code, refactoring,
  or any task that benefits from diverse LLM perspectives with verification.
---

# Maestro Workflow: Multi-LLM Orchestration with Measured Coordination

## Core Philosophy (Paper-Based)

This workflow implements findings from "Towards a Science of Scaling Agent Systems":

### 1. Tool-Coordination Trade-off

- **Paper finding**: Tool-heavy tasks suffer from multi-agent coordination overhead
- **Our rule**: Only Claude Code (the orchestrator) runs tools (edits files, runs tests)
- **Sub-agents** (Codex/Gemini) provide TEXT ADVICE ONLY

### 2. Capability Saturation (~45% threshold)

- **Paper finding**: When the single-agent baseline exceeds ~45%, MAS returns diminish
- **Our rule**: If you're confident about the solution, SKIP ensemble generation
- **Ask yourself**: "Am I stuck, or do I just want confirmation?"

### 3. Error Amplification Prevention

- **Paper finding**: Independent agents amplify errors 17.2x without verification
- **Our rule**: ALWAYS verify with tests before accepting any candidate
- **Selection**: Use `maestro_select_best` with `tests_first` mode (not voting!)

## Available Tools

| Tool | Purpose | When to Use |
|------|---------|-------------|
| `maestro_consult` | Single model consultation | Analysis, code review, specific questions |
| `maestro_ensemble_generate` | Multiple candidates | Hypothesis generation, solution exploration |
| `maestro_select_best` | Pick best candidate | After ensemble, with test/lint results |
| `maestro_pack_context` | Smart context packing | Before any consultation |
| `maestro_run_stage` | Execute workflow stage | Structured 5-stage execution |
| `maestro_workflow_state` | Check progress | Monitor budget, see history |
| `maestro_get_metrics` | Paper-aligned metrics | Performance analysis |

## The 5-Stage Workflow

### Stage 1: Example Analysis (analyze)

**Goal**: Freeze facts before guessing.

**Process**:
1. Gather context with file reads, `grep`, `ls`
2. Optionally use `maestro_consult(provider="gemini")` for large-file summarization
3. Document observations, repro steps, affected modules

**Output** (JSON):
```json
{
  "observations": ["Test fails with IndexError on line 42"],
  "repro_steps": ["Run pytest test_auth.py::test_login"],
  "affected_modules": ["src/auth.py", "src/db.py"],
  "invariants": ["Must not break existing login flow"]
}
```

**Coordination Policy**: Low overhead allowed (2 consults max)

---

### Stage 2: Hypothesis Formulation (hypothesize)

**Goal**: Generate competing explanations with testable predictions.

**Process**:
1. Use `maestro_ensemble_generate(task="Top 3 root causes...", providers=["codex", "gemini"])`
2. Each hypothesis must have a VERIFICATION TEST
3. Use `maestro_select_best` to pick the most testable hypothesis

**Output** (JSON):
```json
{
  "hypotheses": [
    {
      "id": "H1",
      "claim": "Off-by-one error in array indexing",
      "verification_test": "Add edge case test with empty array",
      "confidence": 0.7
    }
  ],
  "selected": "H1",
  "test_command": "pytest test_auth.py::test_empty_users -v"
}
```

**Coordination Policy**: Ensemble ENCOURAGED (best stage for MAS)
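In tool-call form, this stage looks roughly like the sketch below. It is a minimal illustration, assuming the maestro tools are exposed as Python callables with the argument names used above; the `context` argument to `maestro_ensemble_generate`, the `parse_hypotheses` and `run_test` helpers, and the `error_log` variable are hypothetical.

```python
# Minimal sketch of Stage 2 as orchestrator-side tool calls.
# Assumptions (not part of the documented API): the maestro tools are
# Python callables; maestro_ensemble_generate accepts a packed context;
# parse_hypotheses() and run_test() are stand-in helpers; error_log was
# captured during Stage 1.

context = maestro_pack_context(
    files=["src/auth.py", "tests/test_auth.py"],
    errors=[error_log],
    stage="hypothesize",
)

# Each provider proposes root causes independently, with a required
# verification test per hypothesis.
candidates = maestro_ensemble_generate(
    task="Top 3 root causes for the IndexError in auth; "
         "each hypothesis must include a verification test.",
    providers=["codex", "gemini"],
    context=context,
)

# Tests-first selection: run each verification test and feed the results
# in, rather than letting the models vote (see rule 3 above).
test_results = [run_test(h["verification_test"]) for h in parse_hypotheses(candidates)]
best = maestro_select_best(
    candidates=candidates,
    mode="tests_first",
    test_results=test_results,
)
```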
---

### Stage 3: Code Implementation (implement)

**Goal**: Apply minimal, testable changes.

**Process**:
1. Claude Code (the orchestrator) edits the file directly
2. Optionally consult `maestro_consult(provider="codex")` for diff suggestions
3. Run tests IMMEDIATELY after the edit

**Key Rules**:
- NO parallel implementations (they create conflicts)
- ONE change at a time
- Test after EVERY change

**Coordination Policy**: Single agent PREFERRED (tool-heavy = bad for MAS)

---

### Stage 4: Iterative Debugging (debug)

**Goal**: Fix without divergence.

**Process**:
1. Analyze the NEW error (what changed?)
2. Update hypothesis confidence
3. Make the SINGLE smallest change
4. Test again

**WARNING**: The paper shows sequential debugging DEGRADES with multi-agent coordination!

**Coordination Policy**:
- Single agent ONLY for the first 2 iterations
- Consult an external model ONLY if stuck for 3+ iterations
- Feed error logs into the context

**Iteration Limit**: 5 (escalate if exceeded)

---

### Stage 5: Recursive Improvement (improve)

**Goal**: Refactor and stabilize after tests pass.

**Process**:
1. Review for code quality (but don't over-engineer!)
2. Identify edge cases
3. Add regression tests
4. Optional: `maestro_consult(provider="claude")` for a safety review

**Entry Condition**: ALL TESTS MUST PASS

**Coordination Policy**: Ensemble OK for review/suggestions

---

## Example Usage Patterns

### Pattern 1: Bug Investigation

```
User: "The login test is failing, can you debug it?"

1. [ANALYZE] Read test file, error logs
   maestro_pack_context(files=["tests/test_auth.py"], errors=[error_log], stage="analyze")

2. [HYPOTHESIZE] Generate root cause theories
   maestro_ensemble_generate(task="Top 3 causes for IndexError in auth...", providers=["codex", "gemini"])

3. [SELECT] Pick the most testable hypothesis
   maestro_select_best(candidates=..., mode="tests_first", test_results=[...])

4. [IMPLEMENT] Fix (Claude edits directly)
   Edit file, run pytest

5. [DEBUG] If the test still fails, iterate
   Single-agent mode, minimal changes

6. [IMPROVE] After tests pass
   Add edge case tests, review for safety
```

### Pattern 2: Code Review with Diverse Perspectives

```
User: "Review this PR for security issues"

1. maestro_pack_context(files=[changed_files], stage="analyze")
2. maestro_ensemble_generate(
     task="Security review: identify vulnerabilities in...",
     providers=["codex", "gemini", "claude"]
   )
3. maestro_select_best(candidates=..., mode="llm_judge", criteria=["security", "severity"])
```

### Pattern 3: Checking Metrics Mid-Workflow

```
User: "How much coordination overhead have we used?"

maestro_workflow_state()
# Returns: consults used, budget remaining, efficiency score
```

## Coordination Budget

**Per-Workflow Limits** (configurable):
- Max consults per stage: 2
- Max total consults: 6
- Capability threshold: 45%

**When to SKIP ensemble**:
- You're confident in the solution
- It's a tool-heavy stage (implement, debug)
- The budget is exhausted

## Error Handling

If a sub-agent fails (see the sketch after this list):
1. Check `stderr` in the response
2. Try a different provider
3. Fall back to single-agent mode
4. Document the failure in the trace log
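A minimal sketch of that fallback chain, assuming `maestro_consult` returns a dict that carries a non-empty `stderr` key on failure; the `log_trace` helper and the return-`None` convention are hypothetical.

```python
# Hypothetical fallback chain for a failing sub-agent consult.
# Assumptions: maestro_consult returns a dict with a non-empty "stderr"
# key on failure; log_trace() is a stand-in tracing helper.

def consult_with_fallback(task, providers=("codex", "gemini")):
    for provider in providers:
        response = maestro_consult(provider=provider, task=task)
        if not response.get("stderr"):           # step 1: check stderr
            return response
        log_trace(f"{provider} failed: {response['stderr']}")  # step 4
    # Steps 2-3 are the loop itself: each retry uses a different provider,
    # and exhausting the list means falling back to single-agent mode.
    return None
```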
## Metrics (Paper-Aligned)

After any workflow, check:

```
maestro_get_metrics()
```

Key metrics:
- **Coordination Overhead (O%)**: Extra calls relative to a single agent
- **Efficiency Score (Ec)**: Success-to-overhead ratio
- **Test Coverage Rate**: Fraction of selections that had test signals

Targets: O% < 300%, Ec > 0.4, Test Coverage > 80%
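As a rough post-run gate on those targets, something like the sketch below works; the key names in the returned dict are assumptions for illustration, not the tool's documented schema.

```python
# Hypothetical post-run check against the paper-aligned targets.
# The metric key names below are assumed, not documented.

metrics = maestro_get_metrics()
within_targets = (
    metrics["coordination_overhead_pct"] < 300    # O% < 300%
    and metrics["efficiency_score"] > 0.4         # Ec > 0.4
    and metrics["test_coverage_rate"] > 0.80      # test coverage > 80%
)
if not within_targets:
    print("Coordination budget poorly spent; prefer single-agent mode next run.")
```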