---
name: sub-agents
user-invocable: true
description: >
  Parallel Claude Code agent orchestration. Spawn multiple autonomous agents,
  each with its own GPU or compute, to divide work in parallel. Each agent
  works independently and reports findings via structured reports; the parent
  monitors progress and steers agents. Use when you need to run multiple
  debugging sessions, experiments, or research tasks in parallel across
  separate GPUs.
---

# Sub-Agents

Spawn parallel Claude Code processes, each with its own compute, to divide work across multiple GPUs. Each agent works autonomously and reports findings back to the parent via structured reports. This is a general orchestration pattern — use it for parallel debugging, parallel experiments, or any work that benefits from multiple agents.

**IMPORTANT:** Deploy the sandbox/experiment app ONCE, then spawn agents against it. Do NOT use `modal run` per agent — concurrent `modal run` processes conflict and kill each other.

## Quick Reference

### Deploy Once, Call Many

```bash
# Deploy ONCE (shared by all agents)
modal deploy modal-gpu-dev/tools/gpu_sandbox.py

# Each agent gets a sandbox via the deployed function
python -c "
import modal, pathlib
fn = modal.Function.from_name('gpu-sandbox', 'sandbox')
pubkey = pathlib.Path('~/.ssh/id_ed25519.pub').expanduser().read_text().strip()
fn.spawn(ssh_public_key=pubkey, sandbox_id='agent-1')
"
```

### Option A: SSH Sandbox Agents (interactive debugging)

Each agent gets its own GPU sandbox via the `modal-gpu-dev` skill's sandbox tool, SSHes in, and works interactively.

```bash
AGENT_ID="agent-1"
mkdir -p .agents/$AGENT_ID

# Launch sandbox via deployed function
python -c "
import modal, pathlib
fn = modal.Function.from_name('gpu-sandbox', 'sandbox')
pubkey = pathlib.Path('~/.ssh/id_ed25519.pub').expanduser().read_text().strip()
fn.spawn(ssh_public_key=pubkey, sandbox_id='$AGENT_ID')
" &
echo $! > .agents/$AGENT_ID/sandbox.pid

# Wait for SSH info
while true; do
  modal volume get gpu-sandbox-workspace /ssh-info/$AGENT_ID.json /tmp/$AGENT_ID-ssh.json 2>/dev/null && break
  sleep 3
done
HOST=$(python3 -c "import json; d=json.load(open('/tmp/$AGENT_ID-ssh.json')); print(d['host'])")
PORT=$(python3 -c "import json; d=json.load(open('/tmp/$AGENT_ID-ssh.json')); print(d['port'])")

# Launch the agent
claude -p "YOUR TASK DESCRIPTION.

SSH into $HOST port $PORT with:
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/id_ed25519 -p $PORT root@$HOST

IMPORTANT: Report findings periodically using:
python sub-agents/tools/agent_summarize.py $AGENT_ID report --title 'TITLE' --body 'DETAILS'

When done, signal completion:
python sub-agents/tools/agent_summarize.py $AGENT_ID done --summary 'WHAT YOU DID' --findings 'FINDING 1' 'FINDING 2'" \
  --dangerously-skip-permissions \
  --output-format stream-json \
  --model sonnet \
  > .agents/$AGENT_ID/output.jsonl 2>.agents/$AGENT_ID/stderr.log &
echo $! > .agents/$AGENT_ID/agent.pid
```

### Option B: Batch Experiment Agents (no SSH)

For workloads where each agent submits experiments and reads results — deploy your experiment function once and call it directly:

```bash
# Deploy your experiment runner ONCE
modal deploy my_experiment.py

AGENT_ID="agent-1"
mkdir -p .agents/$AGENT_ID

# Each agent calls the deployed function — no sandbox needed
claude -p "Run experiments by calling:

import modal
fn = modal.Function.from_name('my-experiment-app', 'run_experiment')
result = fn.remote(config=...)

Report findings with sub-agents/tools/agent_summarize.py." \
  --dangerously-skip-permissions \
  --output-format stream-json \
  --model sonnet \
  > .agents/$AGENT_ID/output.jsonl 2>.agents/$AGENT_ID/stderr.log &
```

### Reporting (Sub-Agent Side)

```bash
# Report a finding (call multiple times during the run)
python sub-agents/tools/agent_summarize.py agent-1 report \
  --title "LR sweep results" \
  --body "Optimal LR is 0.01 with val_loss=0.42" \
  --data '{"best_lr": 0.01, "val_loss": 0.42}'

# Signal completion
python sub-agents/tools/agent_summarize.py agent-1 done \
  --summary "Explored LR range 0.001-0.1, found 0.01 optimal" \
  --findings "LR=0.01 gives best val_loss=0.42" "Batch size 256 is stable"
```

### Monitoring (Parent Side)

```bash
# Quick status check
python sub-agents/tools/agent_report.py agent-1

# Read structured reports
python sub-agents/tools/agent_report.py agent-1 --reports

# Read full trajectory
python sub-agents/tools/agent_report.py agent-1 --trajectory

# Read final summary
python sub-agents/tools/agent_report.py agent-1 --summary
```
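### Spawning and Monitoring Several Agents

The snippets above launch and inspect one agent at a time. To run several in parallel, wrap Option A (or B) in a loop and poll from the parent. Below is a minimal sketch, with some stated assumptions: the task list and poll interval are illustrative, and liveness is checked via the `agent.pid` files that Option A writes rather than via any exit-code convention of `agent_report.py`.

```bash
# Hypothetical task list: one entry per agent; replace with your real tasks.
TASKS=("debug NCCL hang" "profile dataloader" "sweep learning rates")

for i in "${!TASKS[@]}"; do
  AGENT_ID="agent-$((i + 1))"
  mkdir -p .agents/$AGENT_ID
  # ... launch sandbox + claude exactly as in Option A,
  #     substituting "${TASKS[$i]}" for YOUR TASK DESCRIPTION ...
done

# Poll: an agent counts as running while the PID in its agent.pid is alive.
while true; do
  RUNNING=0
  for pidfile in .agents/*/agent.pid; do
    kill -0 "$(cat "$pidfile")" 2>/dev/null && RUNNING=$((RUNNING + 1))
  done
  # Surface interim findings while waiting
  for dir in .agents/*/; do
    python sub-agents/tools/agent_report.py "$(basename "$dir")" --reports
  done
  [ "$RUNNING" -eq 0 ] && break
  sleep 30
done

# All agents have exited: collect final summaries.
for dir in .agents/*/; do
  python sub-agents/tools/agent_report.py "$(basename "$dir")" --summary
done
```

For sending follow-up instructions to a running agent and for cleanup, see ./references/sub-agents.md.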
## Tools

- ./tools/agent_report.py — Read agent status, reports, and trajectory
- ./tools/agent_summarize.py — Write reports and signal completion

## References

- ./references/sub-agents.md — Full orchestration guide (file structure, parallel patterns, follow-ups, cleanup)

## See Also

- For the GPU sandbox primitive: use the `modal-gpu-dev` skill
- For training/experiment apps to deploy: use the `modal-gpu-experiment` skill