---
name: evaluate-model
description: "Load the latest model checkpoint, run evaluation on the test set, and generate a metrics report with confusion matrix. Use this after training to assess model performance or to re-evaluate a specific checkpoint."
user-invocable: true
context: fork
allowed-tools: Bash, Read, Grep, Write
argument-hint: "[checkpoint-path] e.g. checkpoints/best_model.pt"
---

You are running model evaluation for this project. Your goal is to load a trained model checkpoint, evaluate it on the held-out test set, compute comprehensive metrics, and generate a structured report.

## Dynamic Context

Current branch: !`git branch --show-current`
Available checkpoints: !`ls checkpoints/*.pt checkpoints/*.pth 2>/dev/null || echo "No checkpoints found"`
Test data: !`ls data/processed/test* data/features/test* 2>/dev/null || echo "No test data found"`
Latest metrics: !`ls -t reports/*.json experiments/*.json 2>/dev/null | head -3 || echo "No previous metrics found"`
Config files: !`ls configs/*.yaml configs/*.toml 2>/dev/null || echo "No configs found"`

## Checkpoint Selection

If the user provided a checkpoint path as an argument, use it: `$ARGUMENTS`

Otherwise, find the latest checkpoint:
1. Look for `checkpoints/best_model.pt` or `checkpoints/best_model.pth`
2. If not found, find the most recently modified `.pt` or `.pth` file in `checkpoints/`
3. If no checkpoints exist, report the error and stop

## Evaluation Process

### Step 1: Load and Verify Checkpoint

Verify the checkpoint file exists and can be loaded:

```bash
python3 -c "
import torch
ckpt = torch.load('$CHECKPOINT_PATH', map_location='cpu', weights_only=False)
print('Checkpoint keys:', list(ckpt.keys()))
print('Epoch:', ckpt.get('epoch', 'unknown'))
print('Best metric:', ckpt.get('best_metric', 'unknown'))
print('Config:', ckpt.get('config', 'not stored'))
"
```

Report the checkpoint metadata: epoch, stored metric, config used.

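
The selection order described under Checkpoint Selection can be sketched in Python as follows. This is a minimal illustration, not part of the project: the function name `select_checkpoint` and its parameters are hypothetical, and it assumes checkpoints live directly inside a single directory.

```python
from pathlib import Path
from typing import Optional


def select_checkpoint(ckpt_dir: str = "checkpoints",
                      user_path: Optional[str] = None) -> Path:
    """Pick a checkpoint: explicit path, then best_model, then newest file."""
    # An explicit path from the user takes priority over any discovery.
    if user_path:
        path = Path(user_path)
        if not path.is_file():
            raise FileNotFoundError(f"Checkpoint not found: {path}")
        return path

    root = Path(ckpt_dir)

    # Prefer the conventional best-model file names if either exists.
    for name in ("best_model.pt", "best_model.pth"):
        candidate = root / name
        if candidate.is_file():
            return candidate

    # Otherwise fall back to the most recently modified .pt/.pth file.
    candidates = []
    if root.is_dir():
        candidates = [p for p in root.iterdir()
                      if p.is_file() and p.suffix in {".pt", ".pth"}]
    if not candidates:
        # Mirrors "report the error and stop" from the selection rules.
        raise FileNotFoundError(f"No .pt or .pth checkpoints in {root}")
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Raising `FileNotFoundError` keeps the "no checkpoints" case explicit, so the caller can report it and stop rather than continue with an empty path.
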
### Step 2: Run Evaluation Script

Execute the evaluation:

```bash
python3 -m src.models.evaluation.evaluate \
  --checkpoint "$CHECKPOINT_PATH" \
  --data-dir data/features/ \
  --output-dir reports/ \
  --config configs/experiment.yaml
```

Alternative patterns to try if the above fails:
- `python3 src/evaluation/evaluate.py --checkpoint "$CHECKPOINT_PATH"`
- `python3 evaluate.py --checkpoint "$CHECKPOINT_PATH" --test-data data/features/test.parquet`

### Step 3: Collect Metrics

After evaluation completes, read the metrics output. Look for the metrics JSON file:

```bash
cat reports/metrics.json 2>/dev/null || cat reports/evaluation_metrics.json 2>/dev/null
```

If no JSON file was generated, parse metrics from the script's stdout.

### Step 4: Generate Confusion Matrix

If the evaluation script did not generate a confusion matrix plot, print the matrix and per-class metrics from the saved metrics instead:

```bash
python3 -c "
import json
import numpy as np
from pathlib import Path

# Load metrics that include confusion matrix data
metrics_path = Path('reports/metrics.json')
if metrics_path.exists():
    metrics = json.loads(metrics_path.read_text())
    if 'confusion_matrix' in metrics:
        cm = np.array(metrics['confusion_matrix'])
        print('Confusion Matrix:')
        print(cm)
        print()
        # Per-class metrics. Assumes rows are true labels and columns are
        # predictions (the scikit-learn convention); swap if transposed.
        for i, row in enumerate(cm):
            precision = row[i] / max(cm[:, i].sum(), 1)  # TP / predicted as i
            recall = row[i] / max(row.sum(), 1)          # TP / actually i
            print(f'Class {i}: Precision={precision:.4f}, Recall={recall:.4f}')
"
```

### Step 5: Compare with Baseline

If previous metrics exist, load and compare:
1. Find the most recent previous metrics file (excluding the one just generated)
2. Compute deltas for each metric
3. Flag any metric regressions (where current is worse than previous)
4. Highlight improvements

### Step 6: Generate Summary Report

Produce a structured evaluation report:

```markdown
## Model Evaluation Report

### Checkpoint
- Path: [checkpoint path]
- Epoch: [epoch number]
- Training config: [config file used]

### Test Set Metrics
| Metric | Value |
|--------|-------|
| Accuracy | X.XXXX |
| Precision (macro) | X.XXXX |
| Recall (macro) | X.XXXX |
| F1 (macro) | X.XXXX |
| AUC-ROC | X.XXXX |

### Confusion Matrix
[confusion matrix table or reference to plot]

### Comparison with Previous Run
| Metric | Previous | Current | Delta |
|--------|----------|---------|-------|
| ... | ... | ... | +/- ... |

### Observations
- [Key findings about model performance]
- [Any concerning patterns in errors]
- [Recommendations for improvement]
```

Write this report to `reports/evaluation_report.md`.

## Error Handling

- If the checkpoint cannot be loaded: check for a PyTorch version mismatch and report the error
- If test data is missing: report which files are expected and where to find them
- If CUDA is not available: run evaluation on CPU (slower, but it should work)
- If metrics computation fails: report the specific error and which metric caused it
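
The baseline comparison in Step 5 could be sketched roughly as below. The function name `compare_metrics` and the `tolerance` parameter are hypothetical, and the sketch assumes every numeric metric is "higher is better"; loss-like metrics would need the comparison flipped. The numbers in the demo are made up for illustration only.

```python
from typing import Any, Dict


def _is_number(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def compare_metrics(previous: Dict[str, Any],
                    current: Dict[str, Any],
                    tolerance: float = 0.0) -> Dict[str, Dict[str, Any]]:
    """Compute per-metric deltas and label each as a regression or improvement.

    Assumption: higher values are better for every compared metric.
    Non-numeric entries (for example a stored confusion matrix) and metrics
    missing from either run are skipped.
    """
    rows: Dict[str, Dict[str, Any]] = {}
    shared = sorted(k for k in previous
                    if k in current
                    and _is_number(previous[k]) and _is_number(current[k]))
    for name in shared:
        delta = current[name] - previous[name]
        if delta < -tolerance:
            status = "regression"
        elif delta > tolerance:
            status = "improvement"
        else:
            status = "unchanged"
        rows[name] = {"previous": previous[name], "current": current[name],
                      "delta": delta, "status": status}
    return rows


if __name__ == "__main__":
    # Illustrative values only, not real results.
    prev = {"accuracy": 0.90, "f1_macro": 0.80}
    curr = {"accuracy": 0.92, "f1_macro": 0.75}
    print("| Metric | Previous | Current | Delta |")
    print("|--------|----------|---------|-------|")
    for name, row in compare_metrics(prev, curr).items():
        print(f"| {name} | {row['previous']:.4f} | {row['current']:.4f} "
              f"| {row['delta']:+.4f} ({row['status']}) |")
```

A small non-zero `tolerance` may help avoid flagging run-to-run noise as a regression; the right value depends on the project.
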