# Benchmark Skill ## Purpose Define the comparison metrics and extract baseline values from the current implementation. Record them as a structured JSON entry so downstream phases and final evaluation can read them directly. ## When to Use Phase 3 — after both current analysis and research analysis are complete. ## Input | Parameter | Type | Description | |-----------|------|-------------| | experiment_path | path | `experiments/{research_name}/` | ## Actions 1. **Define comparison metrics:** - Include ALL metrics already used in `current.ipynb`. - Add any additional metrics that are relevant for the new method. - For classification: accuracy, precision, recall, F1, AUC-ROC (as applicable). - For regression: MSE, RMSE, MAE, R² (as applicable). - Include training time if measurable. 2. **Extract baseline values:** - Read metric values from `current.ipynb` output cells. - If a metric is not computed in the notebook, record it as `null` and set `"needs_computation": true` — both notebooks must then compute it. 3. **Append a Phase 3 entry to `{experiment_path}/log.json`** under `phases`: ```json { "name": "Phase 3: Benchmark", "completed_at": "2026-04-17T10:45:00Z", "metrics": [ { "name": "accuracy", "description": "Fraction of correctly classified samples.", "higher_is_better": true, "baseline": 0.8726, "needs_computation": false }, { "name": "f1", "description": "F1 score (binary, positive class).", "higher_is_better": true, "baseline": 0.7277, "needs_computation": false }, { "name": "roc_auc", "description": "Area under the ROC curve.", "higher_is_better": true, "baseline": 0.9274, "needs_computation": false }, { "name": "training_time_seconds", "description": "Wall-clock training time.", "higher_is_better": false, "baseline": null, "needs_computation": true } ], "notes": "training_time_seconds must be added to both notebooks for a fair comparison." } ``` Do not overwrite earlier entries; append to the `phases` array. ## Output - `{experiment_path}/log.json` — updated with Phase 3 benchmark entry - Clear list (in `metrics`) of what the new implementation must compute