---
name: ml-experiment-tracker
description: Guides ML experiment logging, versioning, and reproducibility using tools like MLflow, Weights & Biases, and DVC for systematic model development.
license: MIT
---

# ML Experiment Tracker

This skill provides guidance for systematic machine learning experimentation with proper tracking, versioning, and reproducibility practices.

## Core Competencies

- **Experiment Tracking**: MLflow, Weights & Biases (wandb), Neptune, Comet
- **Data Versioning**: DVC, Delta Lake, LakeFS
- **Model Registry**: Version control for trained models
- **Reproducibility**: Environment, code, data, and hyperparameter tracking

## Experiment Tracking Fundamentals

### What to Track

Every experiment should log:

| Category | Items | Why |
|----------|-------|-----|
| Code | Git commit hash, branch, diff | Reproduce exact code state |
| Data | Dataset version, hash, lineage | Know which data was used |
| Environment | Python version, dependencies, hardware | Reproduce runtime |
| Hyperparameters | All config values | Understand what changed |
| Metrics | Loss, accuracy, custom metrics | Compare performance |
| Artifacts | Models, plots, predictions | Preserve outputs |

### Experiment Organization

```
project/
├── experiments/
│   ├── baseline/             # Initial experiments
│   ├── feature-engineering/  # Data improvements
│   ├── architecture/         # Model changes
│   └── hyperparameter/       # Tuning runs
├── data/
│   ├── raw/         # Original data (versioned)
│   ├── processed/   # Cleaned data
│   └── features/    # Feature store
└── models/
    ├── staging/     # Candidates
    └── production/  # Deployed models
```

## MLflow Patterns

### Basic Experiment Logging

```python
import mlflow

# Set experiment (creates it if it does not exist)
mlflow.set_experiment("my-classification-project")

with mlflow.start_run(run_name="baseline-v1"):
    # Log parameters
    epochs = 100
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("batch_size", 32)
    mlflow.log_param("epochs", epochs)

    # Training loop (model, the loaders, train_epoch, and evaluate
    # are project-specific)
    for epoch in range(epochs):
        train_loss = train_epoch(model, train_loader)
        val_loss, val_acc = evaluate(model, val_loader)

        # Log metrics with step
        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_accuracy": val_acc
        }, step=epoch)

    # Log model
    mlflow.pytorch.log_model(model, "model")

    # Log artifacts (plots, configs)
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.log_artifact("config.yaml")
```

### Model Registry Workflow

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Training   │───▶│   Staging    │───▶│  Production  │
│     Runs     │    │    Review    │    │   Deployed   │
└──────────────┘    └──────────────┘    └──────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
   Candidate           Validated           Monitored
    Models               Models              Models
```

Stages:

- **None**: Just logged, not registered
- **Staging**: Candidate for production
- **Production**: Active serving
- **Archived**: Historical reference

## Weights & Biases Patterns

### Project Structure

```python
import wandb

# Initialize with config
config = {
    "learning_rate": 0.01,
    "architecture": "ResNet50",
    "dataset": "imagenet-subset",
    "epochs": 100
}

run = wandb.init(
    project="image-classification",
    group="architecture-experiments",  # Group related runs
    tags=["baseline", "resnet"],
    config=config,
    notes="Testing ResNet50 baseline on subset"
)

# Training with automatic logging (train_and_eval is project-specific)
for epoch in range(config["epochs"]):
    metrics = train_and_eval(model, train_loader, val_loader)
    wandb.log(metrics)

# Log media
wandb.log({"predictions": wandb.Image(pred_grid)})
wandb.log({"confusion_matrix": wandb.plot.confusion_matrix(...)})

wandb.finish()
```

### Hyperparameter Sweeps

```yaml
# sweep_config.yaml
program: train.py
method: bayes  # or grid, random
metric:
  name: val_accuracy
  goal: maximize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 0.0001
    max: 0.1
  batch_size:
    values: [16, 32, 64, 128]
  optimizer:
    values: ["adam", "sgd", "adamw"]
early_terminate:
  type: hyperband
  min_iter: 10
```
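Launching the sweep is not shown above, so here is a minimal sketch under a few assumptions: the YAML is saved as `sweep_config.yaml`, `train.py` reads hyperparameters via `wandb.config`, and the project name and trial count are illustrative.

```python
import wandb
import yaml

# Load the sweep definition shown above
with open("sweep_config.yaml") as f:
    sweep_config = yaml.safe_load(f)

# Register the sweep with the W&B backend and get its ID
sweep_id = wandb.sweep(sweep_config, project="image-classification")

# An agent pulls hyperparameter combinations from the sweep and runs
# the configured `program` (train.py) once per trial
wandb.agent(sweep_id, count=20)
```

The CLI equivalent is `wandb sweep sweep_config.yaml` followed by `wandb agent <sweep-id>`; multiple agents can point at the same sweep ID to parallelize trials.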
## DVC for Data Versioning

### Setup and Usage

```bash
# Initialize DVC in a git repo
dvc init

# Track large files
dvc add data/training.csv
git add data/training.csv.dvc data/.gitignore
git commit -m "Add training data v1"

# Push to remote storage
dvc remote add -d storage s3://bucket/dvc
dvc push

# Define a pipeline stage (dvc run is deprecated in modern DVC)
dvc stage add -n preprocess \
  -d src/preprocess.py -d data/raw \
  -o data/processed \
  python src/preprocess.py

# Reproduce the pipeline
dvc repro
```

### DVC Pipeline Definition

```yaml
# dvc.yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/
    outs:
      - data/processed/
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/
    params:
      - train.epochs
      - train.learning_rate
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
```

## Reproducibility Checklist

### Code Reproducibility

- [ ] Pin git commit for each experiment
- [ ] Track uncommitted changes (git diff)
- [ ] Version control notebooks (nbstripout)
- [ ] Document manual steps

### Environment Reproducibility

- [ ] Lock dependencies (pip freeze, poetry.lock)
- [ ] Specify Python version
- [ ] Document CUDA/GPU requirements
- [ ] Use containers for full isolation

### Data Reproducibility

- [ ] Version datasets with DVC or similar
- [ ] Document data collection process
- [ ] Track preprocessing steps
- [ ] Save train/val/test split indices

### Training Reproducibility

- [ ] Set random seeds (Python, NumPy, PyTorch/TF; see the appendix sketch at the end of this document)
- [ ] Log all hyperparameters
- [ ] Save model checkpoints
- [ ] Document non-deterministic operations

## Best Practices

### Naming Conventions

```
experiment: {project}-{objective}
run:        {date}-{description}-{variant}
model:      {architecture}-{dataset}-{version}

Examples:
experiment: fraud-detection-baseline
run:        2024-01-15-xgboost-tuning-lr001
model:      xgboost-transactions-v2.3.1
```

### Comparison Dashboards

Track these metrics for model comparison:

- Primary metric (what you optimize)
- Secondary metrics (constraints)
- Resource usage (training time, memory)
- Inference performance (latency, throughput)

### Experiment Documentation

Each significant experiment should document:

1. **Hypothesis**: What change and expected outcome
2. **Method**: What was actually done
3. **Results**: Metrics and observations
4. **Conclusions**: What was learned, next steps

## References

- `references/mlflow-setup.md` - MLflow installation and configuration
- `references/wandb-patterns.md` - Advanced W&B features and sweeps
- `references/reproducibility-checklist.md` - Detailed reproducibility guide
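## Appendix: Seed-Setting Sketch

The "set random seeds" item in the training reproducibility checklist can be made concrete with a small helper. This is a minimal sketch assuming PyTorch, not part of the original skill; the helper name and cuDNN flags are illustrative, and some GPU operations remain non-deterministic regardless.

```python
import os
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed the RNGs that commonly affect a training run."""
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy
    torch.manual_seed(seed)           # PyTorch (CPU and current CUDA device)
    torch.cuda.manual_seed_all(seed)  # all CUDA devices explicitly

    # PYTHONHASHSEED only fully applies if set before the interpreter
    # starts; setting it here mainly documents intent for subprocesses.
    os.environ["PYTHONHASHSEED"] = str(seed)

    # Trade speed for determinism in cuDNN; per the checklist, still
    # document any operations that stay non-deterministic.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Log the seed value itself as a hyperparameter so a run can be replayed exactly.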