---
name: summarize-experiment
description: Create a lightweight summary of experiment results from a completed (fine-tuned and evaluated) experiment. Use after run-experiment to capture key metrics from the experiment in textual form.
---

# Summarize Experiment

Generate a `summary.md` file capturing key metrics from a completed experiment. Think R's `summary()` for experiment results.

## Your Task

Create a lightweight summary of experiment results:

1. Parse run status from experiment_summary.yaml
2. Extract final training loss from SLURM stdout
3. Extract accuracy from inspect-ai .eval files
4. Generate summary.md in the experiment directory
5. Log the process in logs/summarize-experiment.log

## Prerequisites

- experiment_summary.yaml exists
- At least some runs have completed (partial results acceptable)
- run-experiment has been executed (or manual SLURM jobs run)
- **Conda environment activated** - The `parse_eval_log.py` script requires inspect-ai. Activate the conda environment from `claude.local.md` before running extraction commands.

## Workflow

### 1. Locate Experiment

Find the experiment directory:

- If in an experiment directory (contains experiment_summary.yaml): use the current directory
- Otherwise: ask the user for the path

### 2. Parse Run Status

Read experiment_summary.yaml to identify runs:

**From `runs:` section:**
- `name`: Run identifier
- `type`: "fine-tuned" or "control"
- `model`: Model name
- `parameters`: Dict of hyperparameters (empty for control runs)

**From `evaluation.matrix:` section:**
- `run`: Run name
- `tasks`: List of evaluation task names
- `epochs`: List of epochs to evaluate (null for control runs)

**Determine status by checking the filesystem:**
- Fine-tuning: Check for `{output_base}/ck-out-{run_name}/` and SLURM outputs
- Evaluation: Check for `{run_dir}/eval/logs/*.eval` files

### 3. Extract Training Loss

For each COMPLETED fine-tuning run:

1. Find the SLURM stdout in the **output directory**:
   - Parse the experiment_summary.yaml "Output" section for `output_dir_base`
   - Look in: `{output_dir_base}/ck-out-{run_name}/slurm-*.out`
   - If there are multiple files, use the most recent by modification time
2. Extract the final loss using the regex `(\d+)\|(\d+)\|Loss: ([0-9.]+)` (a scripted sketch follows this step):
   - Pattern matches: `{epoch}|{step}|Loss: {value}`
   - Take the LAST match to get the final loss
   - The step number (group 2) from the last match is the total training steps
3. Record: run_name, final_loss, total_steps, epoch, step

**Note:** Training SLURM outputs are in the output directory, NOT the run directory.

**If SLURM stdout is missing:**
- Log a warning
- Record "N/A" for loss
- Continue with other runs
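As a reference point, this extraction is a few lines of Python. The sketch below is a minimal illustration, assuming the `{output_dir_base}/ck-out-{run_name}/slurm-*.out` layout and loss line format described above; the helper name `extract_final_loss` is illustrative and not part of the existing tooling.

```python
import re
from pathlib import Path

# Matches lines of the form "{epoch}|{step}|Loss: {value}" in SLURM stdout.
LOSS_RE = re.compile(r"(\d+)\|(\d+)\|Loss: ([0-9.]+)")

def extract_final_loss(output_dir_base: str, run_name: str):
    """Return (epoch, step, loss) from the newest slurm-*.out, or None if unavailable."""
    run_out = Path(output_dir_base) / f"ck-out-{run_name}"
    outs = sorted(run_out.glob("slurm-*.out"), key=lambda p: p.stat().st_mtime)
    if not outs:
        return None  # SLURM stdout missing: caller records "N/A" and continues
    matches = LOSS_RE.findall(outs[-1].read_text(errors="replace"))
    if not matches:
        return None
    epoch, step, loss = matches[-1]  # last match = final loss; step = total training steps
    return int(epoch), int(step), float(loss)
```

Returning `None` for either failure mode mirrors the error handling above: record "N/A" and move on to the next run.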
### 4. Extract Evaluation Accuracy

For each COMPLETED evaluation:

1. Find .eval files: `{run_dir}/eval/logs/*.eval`
2. For each .eval file, run:
   ```bash
   python tools/inspect/parse_eval_log.py {path}
   ```
3. Parse the JSON output for accuracy
4. **Map to epoch using SLURM job names** (see below)
5. For binary tasks, also run `summary_binary.py` to get balanced accuracy and F1
6. Record: run_name, task, epoch, accuracy, balanced_accuracy, f1, samples

**Script output format:**

```json
{
  "status": "success",
  "task": "capitalization",
  "accuracy": 0.85,
  "samples": 100,
  "scorer": "exact_match",
  "model": "..."
}
```

#### Mapping Epochs via SLURM Job Names

The `.eval` files don't currently store epoch information directly. To reliably map each evaluation to its epoch:

1. **Find SLURM output files** in the eval directory: `{run_dir}/eval/slurm-*.out`
2. **Extract job IDs** from filenames (e.g., `slurm-2773062.out` → job ID 2773062)
3. **Query job names via sacct:**
   ```bash
   sacct -j {job_ids} --format=JobID,JobName%50
   ```
4. **Parse the epoch from the job name** - scaffold-inspect names jobs like `eval-{task}-{run}-ep{N}` (see the sketch after this list):
   - `eval-general_eval-lowlr-ep0` → epoch 0
   - `eval-general_eval-lowlr-ep9` → epoch 9
5. **Extract accuracy from the SLURM output:**
   ```bash
   grep -oP 'match/accuracy: \K[0-9.]+' slurm-{jobid}.out
   ```

**Example workflow:**

```bash
# Get job names for all eval jobs
sacct -j 2773062,2773063,2773065 --format=JobID,JobName%50

# Output shows epoch in job name:
# 2773062    eval-general_eval-lowlr-ep0
# 2773063    eval-general_eval-lowlr-ep1
# 2773065    eval-general_eval-lowlr-ep2
```

This approach is reliable because:

- Job names are set by scaffold-inspect and include the epoch
- It works regardless of submission order or timing
- It survives job failures and resubmissions
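For illustration, here is a minimal sketch of the job-ID-to-epoch mapping, assuming `sacct` is on PATH, job names follow the `eval-{task}-{run}-ep{N}` convention above, and job names contain no whitespace. The helper name `map_eval_jobs_to_epochs` is hypothetical, not an existing script under `tools/`.

```python
import re
import subprocess
from pathlib import Path

def map_eval_jobs_to_epochs(run_dir: str) -> dict:
    """Map SLURM job IDs found under {run_dir}/eval to the epoch encoded in their job names."""
    job_ids = [p.stem.split("-", 1)[1] for p in Path(run_dir, "eval").glob("slurm-*.out")]
    if not job_ids:
        return {}
    # --noheader drops the column headers; each remaining line is "JobID JobName".
    out = subprocess.run(
        ["sacct", "-j", ",".join(job_ids), "--format=JobID,JobName%50", "--noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    epochs = {}
    for line in out.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        job_id, job_name = parts[0], parts[1]
        m = re.search(r"-ep(\d+)$", job_name)
        # Keep only top-level job IDs; skip .batch/.extern sub-steps reported by sacct.
        if m and job_id in job_ids:
            epochs[job_id] = int(m.group(1))
    return epochs
```

The accuracy itself still comes from `parse_eval_log.py` (or the `grep` fallback shown above); this helper only supplies the epoch for each job ID.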
**If extraction fails:**
- The script returns `{"status": "error", "message": "..."}`
- Log the error
- Record "ERROR" for accuracy
- Continue with other evaluations

#### Computing Balanced Accuracy and F1 (Binary Classification)

For binary classification tasks (0/1 targets), use `summary_binary.py` to compute additional metrics:

```bash
python tools/inspect/summary_binary.py {path_to_eval_file} --json
```

**JSON output format:**

```json
{
  "status": "success",
  "path": "/path/to/file.eval",
  "samples": 100,
  "accuracy": 0.85,
  "balanced_accuracy": 0.83,
  "f1": 0.82,
  "precision_1": 0.80,
  "recall_1": 0.84,
  "recall_0": 0.82,
  "confusion_matrix": {"tp": 42, "tn": 43, "fp": 7, "fn": 8, "other": 0}
}
```

**Why these metrics matter for imbalanced data:**

- **Balanced Accuracy** = (Recall_0 + Recall_1) / 2 — not inflated by the majority class
- **F1 Score** = harmonic mean of precision and recall on the positive class — not inflated by correct predictions on the majority class

**Note:** For non-binary tasks, only accuracy is reported (Bal. Acc and F1 are shown as "-").

### 5. Generate summary.md

Create `{experiment_dir}/summary.md` with the following structure:

```markdown
# Experiment Summary

**Experiment:** `{experiment_name}` | **Generated:** {timestamp} | **Status:** {X}/{Y} complete

## Run Status

| Run | Type | Fine-tuning | Evaluation |
|-----|------|-------------|------------|
| rank4_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| rank8_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| base_model | Control | N/A | COMPLETED |

## Training Results

| Run | Final Loss | Total Steps | Epochs | Duration |
|-----|------------|-------------|--------|----------|
| rank4_lr1e-5 | 0.234 | 250 | 2 | 8m 15s |
| rank8_lr1e-5 | 0.198 | 250 | 2 | 9m 02s |

**Notes:**
- Base model runs have no training loss (control)
- Duration from SLURM elapsed time (if available)

## Evaluation Results

| Run | Task | Epoch | Accuracy | Bal. Acc | F1 | Samples |
|-----|------|-------|----------|----------|------|---------|
| rank4_lr1e-5 | capitalization | 0 | 0.85 | 0.83 | 0.82 | 100 |
| rank4_lr1e-5 | capitalization | 1 | 0.88 | 0.86 | 0.85 | 100 |
| rank8_lr1e-5 | capitalization | 0 | 0.82 | 0.80 | 0.78 | 100 |
| rank8_lr1e-5 | capitalization | 1 | 0.91 | 0.89 | 0.88 | 100 |
| base_model | capitalization | - | 0.45 | 0.50 | 0.31 | 100 |

**Best performing:** rank8_lr1e-5 (epoch 1) with 89% balanced accuracy

## Incomplete Runs

| Run | Stage | Status | Notes |
|-----|-------|--------|-------|
| rank16_lr1e-5 | Fine-tuning | FAILED | Check slurm-12345.out |

## Next Steps

1. View detailed evaluation results: `inspect view --port=$(get_free_port)`
2. Export raw data: `inspect log export {run_dir}/eval/logs/*.eval --format csv`
3. Full analysis: `analyze-experiment` (when available)

---
*Generated by summarize-experiment skill*
```

### 6. Create Log

Document the process in `{experiment_dir}/logs/summarize-experiment.log`. See [logging.md](logging.md) for action types and format.

## Error Handling

### If SLURM stdout is missing

- Log a warning with action type `EXTRACT_LOSS`
- Record "N/A" for loss in the summary
- Continue with other runs

### If a .eval file cannot be parsed

- Log the error with the file path
- Record "ERROR" for accuracy in the summary
- Continue with other evaluations

### If all runs failed

- Generate a summary noting all failures
- Include failure states in the "Incomplete Runs" section
- Suggest troubleshooting steps

### If results are partial

- Generate a summary with the available data
- Clearly indicate which runs are missing in the "Incomplete Runs" section
- Still identify the best performing run from the available data

## Idempotency

Running summarize-experiment multiple times overwrites summary.md. This is intentional:

- It allows re-running after fixing failed runs
- The summary always reflects the current state

## Output Files

```
{experiment_dir}/
├── summary.md                      # Human-readable summary (new)
└── logs/
    └── summarize-experiment.log    # Process log (new)
```

## Relationship to Other Skills

- **After:** run-experiment (or manual execution)
- **Before:** analyze-experiment (when available)
- **Optional hook:** run-experiment can invoke this at completion

## Future Compatibility

When analyze-experiment is built, summarize-experiment can either:

- Remain as a quick summary option (text only, no plots)
- Be deprecated in favor of richer output
- Become a first stage that analyze-experiment builds upon