---
name: run-pipeline
description: "Run the full data science pipeline: validate raw data, preprocess, engineer features, train model, and evaluate. Use this when you want to execute the end-to-end ML pipeline or re-run it after data or code changes."
user-invocable: true
context: fork
allowed-tools: Bash, Read, Grep
argument-hint: "[config-file] e.g. configs/experiment.yaml"
---

You are executing the full data science pipeline for this project. Run each stage sequentially, verifying success before proceeding to the next stage. Stop immediately if any stage fails and report the error clearly.

## Dynamic Context

Current branch: !`git branch --show-current`

Data directory contents: !`ls data/ 2>/dev/null || echo "No data/ directory found"`

Available configs: !`ls configs/*.yaml 2>/dev/null || ls configs/*.toml 2>/dev/null || echo "No config files found"`

Python environment: !`which python3 && python3 --version 2>/dev/null || echo "Python not found"`

Recent changes: !`git diff --stat HEAD~3 2>/dev/null || echo "No recent commits"`

## Configuration

If the user provided a config file as an argument, use it: `$ARGUMENTS`

Otherwise, look for the default config at `configs/experiment.yaml` or `configs/experiment.toml`. The commands below refer to the resolved config path as `$CONFIG_FILE`.

## Pipeline Stages

Execute each stage in order. After each stage, check for errors and verify outputs exist before proceeding.

### Stage 1: Environment Check

Verify the Python environment is ready:

```bash
python3 -c "import torch; import pandas; import numpy; print(f'PyTorch {torch.__version__}, pandas {pandas.__version__}, NumPy {numpy.__version__}')"
```

If imports fail, report which packages are missing and suggest `pip install -r requirements.txt`.
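
The one-line check above stops at the first failing import, so it can name only one missing package per run. A minimal sketch for listing every missing package in one pass, assuming only top-level package names are checked; the helper name `find_missing` is hypothetical and not part of this project:

```python
import importlib.util


def find_missing(packages):
    """Return the names in `packages` that cannot be located for import.

    Only top-level package names are supported here (assumption); the
    packages are located, not actually imported.
    """
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Package list taken from the Stage 1 check above.
    required = ["torch", "pandas", "numpy"]
    missing = find_missing(required)
    if missing:
        print("Missing packages: " + ", ".join(missing))
        print("Try: pip install -r requirements.txt")
    else:
        print("All required packages can be located.")
```

Because `find_spec` only locates a package, this does not catch packages that are installed but fail during import (for example, a broken CUDA build of PyTorch); the original one-liner is still the check that confirms the imports really succeed.
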

### Stage 2: Data Validation

Run data validation on the raw data:

```bash
python3 -m src.data.validate --data-dir data/raw/
```

If the validation script does not exist, look for alternative patterns:

- `python3 src/data/validate.py`
- `python3 -m pytest tests/test_data/ -v --tb=short`
- Check for pandera schemas in `src/data/` and report their status

Verify: validation passes with no critical errors. Log any warnings.

### Stage 3: Preprocessing

Run the preprocessing pipeline:

```bash
python3 -m src.data.preprocess --config $CONFIG_FILE
```

Alternative patterns:

- `python3 src/data/preprocess.py --config $CONFIG_FILE`
- `dvc repro preprocess` (if DVC pipeline is configured)

Verify: processed data files exist in `data/processed/` (check for `.parquet` or `.csv` files).

### Stage 4: Feature Engineering

Run feature engineering:

```bash
python3 -m src.features.build_features --config $CONFIG_FILE
```

Alternative patterns:

- `python3 src/features/build_features.py`
- `dvc repro features`

Verify: feature files exist in `data/features/` with expected columns.

### Stage 5: Model Training

Run model training:

```bash
python3 -m src.models.training.trainer --config $CONFIG_FILE
```

Alternative patterns:

- `python3 src/models/train.py --config $CONFIG_FILE`
- `python3 train.py --config $CONFIG_FILE`

Monitor output for:

- Loss values (should decrease over epochs)
- Validation metrics at each epoch
- Any NaN or Inf values (indicates numerical instability)
- Out-of-memory errors

Verify: model checkpoint exists in `checkpoints/` directory.

### Stage 6: Evaluation

Run model evaluation on the test set:

```bash
python3 -m src.models.evaluation.evaluate --checkpoint checkpoints/best_model.pt --config $CONFIG_FILE
```

Alternative patterns:

- `python3 src/evaluation/evaluate.py`
- `python3 evaluate.py --checkpoint checkpoints/best_model.pt`

Verify: metrics JSON file exists in `reports/` or `experiments/`.

### Stage 7: Summary

After all stages complete, or after the pipeline stops on a failure, produce a summary:

1. Report which stages succeeded and which failed
2. Print the final evaluation metrics (read from the metrics JSON)
3. List all generated artifacts (checkpoints, processed data, feature files, metrics)
4. If any stage failed, provide the error message and suggest a fix
5. Report total pipeline execution time

## Error Handling

- If a stage fails, do NOT proceed to the next stage (except for validation warnings, which are non-blocking)
- Capture stderr and stdout from each command
- For Python errors, read the traceback and identify the root cause
- For file-not-found errors, check if the expected directory structure exists
- For import errors, report the missing package
- For CUDA out-of-memory, suggest reducing batch size in the config
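
The fail-fast rules above, together with the per-stage status and timing that the summary asks for, can be sketched as a small runner. This is an illustration under stated assumptions, not the project's actual tooling: `run_stages` and `summarize` are hypothetical helpers, and the demo commands are placeholders standing in for the real stage commands.

```python
import subprocess
import sys
import time


def run_stages(stages):
    """Run (name, argv) pairs in order and stop at the first non-zero exit code."""
    results = []
    for name, argv in stages:
        start = time.perf_counter()
        # Capture stdout and stderr so a failure can be reported with its traceback.
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append({
            "stage": name,
            "ok": proc.returncode == 0,
            "seconds": time.perf_counter() - start,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        })
        if proc.returncode != 0:
            break  # fail fast: later stages are not attempted
    return results


def summarize(results, total_stages):
    """Build a short text summary: per-stage status, skipped count, total time."""
    lines = []
    for r in results:
        status = "OK" if r["ok"] else "FAILED"
        lines.append(f"{r['stage']}: {status} ({r['seconds']:.1f}s)")
    skipped = total_stages - len(results)
    if skipped:
        lines.append(f"{skipped} stage(s) skipped after failure")
    lines.append(f"total: {sum(r['seconds'] for r in results):.1f}s")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder commands (assumption); substitute the real stage commands.
    demo = [
        ("environment", [sys.executable, "-c", "print('env ok')"]),
        ("validation", [sys.executable, "-c", "import sys; sys.exit(1)"]),
        ("preprocess", [sys.executable, "-c", "print('not reached')"]),
    ]
    print(summarize(run_stages(demo), len(demo)))
```

The sketch treats any non-zero exit code as blocking. Validation warnings, which are non-blocking, would need to be reported by the validation script through its output rather than its exit code.
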