---
name: monitor-experiment
description: Monitor running experiments, check progress, and collect results. Use when the user says "check results", "is it done", "monitor", or wants experiment output.
argument-hint: [server-alias or screen-name]
allowed-tools: Bash(ssh *), Bash(echo *), Read, Write, Edit
---

# Monitor Experiment Results

Monitor: $ARGUMENTS

## Workflow

### Step 1: Check What's Running

```bash
ssh <server> "screen -ls"
```

### Step 2: Collect Output from Each Screen

For each screen session, capture the last N lines:

```bash
ssh <server> "screen -S <name> -X hardcopy /tmp/screen_<name>.txt && tail -50 /tmp/screen_<name>.txt"
```

If hardcopy fails, check for log files or `tee` output.

### Step 3: Check for JSON Result Files

```bash
ssh <server> "ls -lt <results-dir>/*.json 2>/dev/null | head -20"
```

If JSON results exist, fetch and parse them:

```bash
ssh <server> "cat <results-dir>/<file>.json"
```

### Step 3.5: Pull W&B Metrics (when `wandb: true` in CLAUDE.md)

**Skip this step entirely if `wandb` is not set or is `false` in CLAUDE.md.**

Pull training curves and metrics from Weights & Biases via the Python API:

```bash
# List recent runs in the project
ssh <server> "python3 -c \"
import wandb
api = wandb.Api()
runs = api.runs('<entity>/<project>', per_page=10)
for r in runs:
    print(r.id, r.state, r.name, r.summary.get('eval/loss', 'N/A'))
\""

# Pull specific metrics from a run (last 50 steps)
ssh <server> "python3 -c \"
import wandb, json
api = wandb.Api()
run = api.run('<entity>/<project>/<run-id>')
history = list(run.scan_history(keys=['train/loss', 'eval/loss', 'eval/ppl', 'train/lr'], page_size=50))
print(json.dumps(history[-10:], indent=2))
\""

# Pull run summary (final metrics)
ssh <server> "python3 -c \"
import wandb, json
api = wandb.Api()
run = api.run('<entity>/<project>/<run-id>')
print(json.dumps(dict(run.summary), indent=2, default=str))
\""
```

**What to extract:**

- **Training loss curve** — is it converging, diverging, or plateauing?
- **Eval metrics** — loss, PPL, accuracy at the latest checkpoint
- **Learning rate** — is the schedule behaving as expected?
- **GPU memory** — any OOM risk?
- **Run status** — running / finished / crashed?

**W&B dashboard link** (include in the summary for the user):

```
https://wandb.ai/<entity>/<project>/runs/<run-id>
```

> This gives the auto-review-loop richer signal than screen output alone — training dynamics, loss curves, and metric trends over time.

### Step 4: Summarize Results

Present results in a comparison table:

```
| Experiment | Metric | Delta vs Baseline | Status |
|------------|--------|-------------------|--------|
| Baseline   | X.XX   | —                 | done   |
| Method A   | X.XX   | +Y.Y              | done   |
```

### Step 5: Interpret

- Compare against known baselines
- Flag unexpected results (negative delta, NaN, divergence)
- Suggest next steps based on findings

### Step 6: Feishu Notification (if configured)

After results are collected, check `~/.claude/feishu.json`:

- Send an `experiment_done` notification with the results summary table and delta vs baseline
- If the config is absent or mode is `"off"`: skip entirely (no-op)

## Key Rules

- Always show raw numbers before interpretation
- Compare against the correct baseline (same config)
- Note if experiments are still running (check progress bars, iteration counts)
- If results look wrong, check training logs for errors before concluding
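
The Step 4 table assembly can be sketched in shell. This is a minimal illustration, not part of the command itself: the experiment names and metric values below are placeholders standing in for the numbers collected in Steps 2-3.5.

```bash
#!/usr/bin/env bash
# Sketch: build the Step 4 comparison table from collected metrics.
# All names and values are illustrative; in practice they come from
# the JSON results or W&B summaries gathered earlier.

baseline_name="Baseline"
baseline_metric="3.21"

# "name:metric" pairs for each finished experiment (hypothetical data)
experiments=("Method A:3.05" "Method B:3.40")

printf '| Experiment | Metric | Delta vs Baseline | Status |\n'
printf '|------------|--------|-------------------|--------|\n'
printf '| %s | %s | — | done |\n' "$baseline_name" "$baseline_metric"

for exp in "${experiments[@]}"; do
  name="${exp%%:*}"     # text before the first colon
  metric="${exp##*:}"   # text after the last colon
  # Signed delta vs baseline, two decimal places
  delta=$(awk -v m="$metric" -v b="$baseline_metric" 'BEGIN { printf "%+.2f", m - b }')
  printf '| %s | %s | %s | done |\n' "$name" "$metric" "$delta"
done
```

Computing the delta in the summary step (rather than asking the model to do arithmetic inline) keeps the "show raw numbers before interpretation" rule easy to follow.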