---
name: datarobot-model-training
description: Comprehensive guidance for training models in DataRobot, including project creation, AutoML configuration, feature engineering, and model selection. Use when training models, creating AutoML projects, or selecting models in DataRobot.
---

# DataRobot Model Training Skill

This skill provides guidance for the complete model training workflow in DataRobot, from project creation through model selection and validation.

## Quick Start

**Most common use case**: Create a project and train models

1. **Upload dataset**: `upload_dataset(file_path, dataset_name)` to upload training data
2. **Create project**: `create_project(dataset_id, project_name)` to create new project
3. **Start training**: `start_automl(project_id, mode)` to begin AutoML training

**Example**: "Create a new project with sales_data.csv, set 'revenue' as target, and start Quick AutoML training"

## When to use this skill

Use this skill when you need to:

- Create new DataRobot projects
- Upload training datasets
- Configure AutoML experiments
- Monitor training progress
- Select and compare models
- Understand feature engineering results
- Export trained models

## Key capabilities

### 1. Project Management

- Create new projects with appropriate settings
- Upload datasets (CSV, Parquet, database connections)
- Configure project settings (target, partitioning, time series)
- Manage multiple projects and experiments

### 2. AutoML Configuration

- Set training modes (Quick, Manual, Comprehensive)
- Configure feature engineering options
- Set time limits and resource constraints
- Choose algorithms and model types

### 3. Training Execution

- Start AutoML training runs
- Monitor training progress
- Handle training errors and warnings
- Pause/resume training if needed

### 4. Model Analysis

- Compare model performance metrics
- Review feature importance
- Analyze model insights and explanations
- Select best models for deployment

## Workflow examples

### Example 1: Create and train a new project

**User request**: "Create a new project using my sales_data.csv file, predict 'revenue' as the target, and start AutoML training."

**Agent workflow**:

1. Upload the dataset to DataRobot
2. Create a new project with the dataset
3. Set 'revenue' as the target variable
4. Configure project settings (detect partitioning, handle time series if needed)
5. Start AutoML training with appropriate mode
6. Monitor training progress
7. Report when training completes with top model metrics

### Example 2: Configure advanced training options

**User request**: "Train a model with time series settings: datetime column 'date', series ID 'store_id', forecast window 1-7 days."

**Agent workflow**:

1. Create project with time series configuration
2. Set datetime column and series ID columns
3. Configure forecast window (1-7 days)
4. Set appropriate time series validation
5. Start training with time series-aware algorithms
6. Monitor progress and report results

## Using DataRobot SDK

This skill guides you to use the DataRobot Python SDK directly.
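
Step 6 of Example 1 (monitor training progress) can be sketched as a small polling loop. This is a minimal sketch, not part of the DataRobot SDK: `wait_until_autopilot_done` is a hypothetical helper, and it assumes the status callable returns a dict with an `autopilot_done` flag, which is how `project.get_status()` is understood to behave; check this against your SDK version. Passing the status function in as an argument keeps the loop testable without a live project.

```python
import time
from typing import Callable, Dict, Optional


def wait_until_autopilot_done(
    get_status: Callable[[], Dict],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    on_update: Optional[Callable[[Dict], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Poll get_status() until it reports autopilot_done, or raise TimeoutError.

    Hypothetical helper: the 'autopilot_done' key is an assumption about the
    shape of the status dict, not a guaranteed contract.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if on_update is not None:
            on_update(status)  # e.g. print or log the current stage
        if status.get("autopilot_done"):
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"Training did not finish within {timeout_seconds} seconds"
            )
        sleep(poll_seconds)
```

With a real project this might be called as `wait_until_autopilot_done(project.get_status, on_update=print)`. For the simple blocking case, the SDK's own `project.wait_for_autopilot()` is usually enough.
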
Install the SDK if needed:

```bash
pip install datarobot
```

### Key SDK Operations

Use these DataRobot SDK methods for model training:

**Projects**:

- `dr.Project.create_from_dataset(dataset_id, project_name=...)` - Create project
- `dr.Project.get(project_id)` - Get project details
- `dr.Project.list()` - List all projects
- `project.set_target(target_column)` - Set target variable (deprecated in SDK 3.x in favor of `project.analyze_and_model`; both calls also start Autopilot)

**Training**:

- `project.analyze_and_model(target=..., mode=...)` - Set target and start AutoML training
- `project.wait_for_autopilot()` - Block until Autopilot finishes
- `project.get_status()` - Check training status (returns a dict with `autopilot_done` and `stage`)
- `project.get_models()` - List trained models
- `dr.Model.get(project_id, model_id)` - Get model details

**Model Analysis**:

- `model.metrics` - Get performance metrics (a dict keyed by metric name, then by partition such as `validation`)
- `model.get_or_request_feature_impact()` - Get feature importance, computing it first if needed

See the [Common Patterns](#common-patterns) section below for complete examples.

## Helper Scripts

This skill includes executable helper scripts that Claude can run directly:

- `scripts/create_project.py` - Create a new project from a dataset
- `scripts/start_training.py` - Start AutoML training
- `scripts/list_models.py` - List trained models with metrics

**Usage example**:

```bash
# Create project and set target
python scripts/create_project.py dataset_123 "Sales Prediction" revenue

# Start training
python scripts/start_training.py project_456 Quick

# List models
python scripts/list_models.py project_456 AUC
```

Claude can run these scripts directly or use them as reference when writing code.

## Best practices

1. **Data preparation**: Ensure data is clean and properly formatted before upload
2. **Target selection**: Choose appropriate target variable (avoid leakage)
3. **Partitioning**: Use proper partitioning for time-aware or grouped data
4. **Feature engineering**: Let AutoML handle feature engineering, but review results
5. **Model selection**: Compare multiple models, not just the top performer
6. **Validation**: Review validation strategy and ensure it matches your use case

## Common patterns

### Pattern 1: Standard classification/regression

```python
import os

import datarobot as dr

# Initialize client (credentials come from environment variables)
client = dr.Client(
    token=os.getenv("DATAROBOT_API_TOKEN"),
    endpoint=os.getenv("DATAROBOT_ENDPOINT"),
)

# Upload dataset; create_from_file takes no name argument, so rename afterwards
dataset = dr.Dataset.create_from_file(file_path="training_data.csv")
dataset.modify(name="Sales Data")

# Create project
project = dr.Project.create_from_dataset(
    dataset_id=dataset.id,
    project_name="Sales Prediction",
)

# Set target and start AutoML (Quick mode); this call launches Autopilot,
# so no separate "start" call is needed
project.analyze_and_model(
    target="revenue",
    mode=dr.AUTOPILOT_MODE.QUICK,
)

# Monitor training: block until Autopilot finishes (up to one hour)
project.wait_for_autopilot(check_interval=30, timeout=3600)

# 'revenue' is numeric, so rank by an error metric (lower is better).
# For binary classification, use "AUC" with max() instead.
metric = "RMSE"


def validation_score(model):
    """Return the validation score, or infinity when it is missing."""
    score = (model.metrics.get(metric) or {}).get("validation")
    return float("inf") if score is None else score


# Get trained models and pick the best one
models = project.get_models()
best_model = min(models, key=validation_score)
print(f"Best model: {best_model.id}, {metric}: {validation_score(best_model)}")
```

### Pattern 2: Time series forecasting

```python
import datarobot as dr

# Assumes the client is already configured, for example as in Pattern 1

# Upload dataset
dataset = dr.Dataset.create_from_file(file_path="sales_data.csv")
dataset.modify(name="Sales Forecast Data")

# Create project
project = dr.Project.create_from_dataset(
    dataset_id=dataset.id,
    project_name="Sales Forecast",
)

# Configure time series settings
time_series_spec = dr.DatetimePartitioningSpecification(
    datetime_partition_column="date",
    use_time_series=True,
    multiseries_id_columns=["store_id"],
    forecast_window_start=1,
    forecast_window_end=7,
)

# Set target and start training; this call launches Autopilot
project.analyze_and_model(
    target="sales",
    mode=dr.AUTOPILOT_MODE.COMPREHENSIVE,
    partitioning_method=time_series_spec,
)

# Wait for completion (up to two hours) and get results
project.wait_for_autopilot(timeout=7200)
models = project.get_models()
```

## Model selection criteria

When selecting models, consider:

- **Performance metrics**: Accuracy, AUC, RMSE, MAPE (depending on problem type)
- **Prediction speed**: Important for real-time deployments
- **Interpretability**: Some models are more explainable
- **Feature requirements**: Some models need specific feature types
- **Deployment constraints**: Consider model size and resource requirements

## Error handling

Common errors and solutions:

- **Dataset upload failures**: Check file format, size limits, encoding
- **Target errors**: Ensure target column exists and has appropriate values
- **Training failures**: Check data quality, feature types, missing values
- **Timeout errors**: Adjust time limits or use Quick mode for initial exploration

## SDK Setup

### Install DataRobot SDK

```bash
pip install datarobot
```

### Initialize Client

```python
import os

import datarobot as dr

# The endpoint must include the API path (/api/v2)
client = dr.Client(
    token=os.getenv("DATAROBOT_API_TOKEN"),
    endpoint=os.getenv("DATAROBOT_ENDPOINT", "https://app.datarobot.com/api/v2"),
)
```

## Resources

- [DataRobot Python SDK Documentation](https://datarobot-public-api-client.readthedocs-hosted.com/)
- [DataRobot AutoML Documentation](https://docs.datarobot.com/en/docs/modeling/index.html)
- [General Modeling Documentation – Time Series](https://docs.datarobot.com/en/docs/modeling/index.html)
- [General Modeling Documentation – Feature Engineering](https://docs.datarobot.com/en/docs/modeling/index.html)
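
The model selection criteria listed earlier start with performance metrics, and which direction counts as "better" depends on the metric. The sketch below ranks models by validation score under that assumption. `rank_models`, `validation_score`, and `LOWER_IS_BETTER` are hypothetical names, and the nested `metrics[metric]["validation"]` layout is an assumption about how the SDK's `model.metrics` dict is shaped; confirm it against your SDK version.

```python
from typing import Any, Iterable, List, Optional

# Error-style metrics where a smaller value is better (illustrative, not exhaustive).
LOWER_IS_BETTER = {"RMSE", "MAE", "MAPE", "LogLoss"}


def validation_score(model: Any, metric: str) -> Optional[float]:
    """Return the model's validation score for `metric`, or None if missing."""
    scores = (getattr(model, "metrics", None) or {}).get(metric) or {}
    return scores.get("validation")


def rank_models(models: Iterable[Any], metric: str) -> List[Any]:
    """Sort models best-first on `metric`, skipping models without a score."""
    scored = [m for m in models if validation_score(m, metric) is not None]
    higher_is_better = metric not in LOWER_IS_BETTER
    return sorted(
        scored,
        key=lambda m: validation_score(m, metric),
        reverse=higher_is_better,
    )
```

Returning the full ranking rather than a single winner makes it easier to weigh the top few candidates against the other criteria, such as prediction speed and interpretability.
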