---
name: scikit-learn-best-practices
description: Best practices for scikit-learn machine learning, model development, evaluation, and deployment in Python
---

# Scikit-learn Best Practices

Expert guidelines for scikit-learn development, focusing on machine learning workflows, model development, evaluation, and best practices.

## Code Style and Structure

- Write concise, technical responses with accurate Python examples
- Prioritize reproducibility in machine learning workflows
- Use functional programming for data pipelines
- Use object-oriented programming for custom estimators
- Prefer vectorized operations over explicit loops
- Follow PEP 8 style guidelines

## Machine Learning Workflow

### Data Preparation

- Always split data before fitting any preprocessing: train/validation/test
- Use `train_test_split()` with `random_state` for reproducibility
- Stratify splits for imbalanced classification: `stratify=y`
- Keep the test set completely separate until final evaluation

### Feature Engineering

- Scale features appropriately for distance-based algorithms
- Use `StandardScaler` for normally distributed features
- Use `MinMaxScaler` for bounded features
- Use `RobustScaler` for data with outliers
- Encode categorical features with `OneHotEncoder` or `OrdinalEncoder`; reserve `LabelEncoder` for target labels
- Handle missing values: `SimpleImputer`, `KNNImputer`

### Pipelines

- Always use `Pipeline` to chain preprocessing and modeling
- Prevents data leakage by fitting transformers only on training data
- Makes code cleaner and more reproducible
- Enables easy deployment and serialization

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Scaling and the classifier are fitted together, so the scaler
# only ever sees the data passed to pipeline.fit().
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])
```

### Column Transformers

- Use `ColumnTransformer` for different preprocessing per feature type
- Combine numeric and categorical preprocessing in a single pipeline

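As a minimal sketch of per-type preprocessing with `ColumnTransformer`, the example below imputes and scales numeric columns, imputes and one-hot encodes a categorical column, and feeds both into one pipeline. The toy data, the column names (`age`, `income`, `city`), and the choice of `LogisticRegression` are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names and values are illustrative assumptions.
X = pd.DataFrame({
    "age": [25, 32, None, 51, 46, 38],
    "income": [40_000, 52_000, 61_000, None, 83_000, 58_000],
    "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Nice"],
})
y = [0, 0, 1, 1, 1, 0]

numeric_features = ["age", "income"]
categorical_features = ["city"]

# Numeric columns: fill missing values with the median, then standardize.
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical columns: fill missing values with the mode, then one-hot encode.
# handle_unknown="ignore" avoids errors on categories unseen during fit.
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

# One pipeline object holds both preprocessing and the estimator.
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

model.fit(X, y)
predictions = model.predict(X)
```

Because the preprocessor lives inside the pipeline, calling `model.fit()` on training data alone keeps imputation and scaling statistics from leaking out of the test set.
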
## Model Selection and Tuning

### Cross-Validation

- Use cross-validation for reliable performance estimates
- `cross_val_score()` for quick evaluation
- `cross_validate()` for multiple metrics
- Use appropriate CV strategy:
  - `KFold` for regression
  - `StratifiedKFold` for classification
  - `TimeSeriesSplit` for temporal data
  - `GroupKFold` for grouped data

### Hyperparameter Tuning

- Use `GridSearchCV` for exhaustive search
- Use `RandomizedSearchCV` for large parameter spaces
- Always tune on training/validation data, never test data
- Set `n_jobs=-1` for parallel processing

## Model Evaluation

### Classification Metrics

- Use appropriate metrics for your problem:
  - `accuracy_score` for balanced classes
  - `precision_score`, `recall_score`, `f1_score` for imbalanced
  - `roc_auc_score` for ranking ability
- Use `classification_report()` for comprehensive overview
- Examine `confusion_matrix()` for error analysis

### Regression Metrics

- `mean_squared_error` (MSE) for general use
- `mean_absolute_error` (MAE) for interpretability
- `r2_score` for explained variance

### Evaluation Best Practices

- Report confidence intervals, not just point estimates
- Use multiple metrics to understand model behavior
- Compare against meaningful baselines
- Evaluate on held-out test set only once, at the end

## Handling Imbalanced Data

- Use stratified splitting and cross-validation
- Consider class weights: `class_weight='balanced'`
- Use appropriate metrics (F1, AUC-PR, not accuracy)
- Adjust decision threshold based on business needs

## Feature Selection

- Use `SelectKBest` with statistical tests
- Use `RFE` (Recursive Feature Elimination)
- Use model-based selection: `SelectFromModel`
- Examine feature importances from tree-based models

## Model Persistence

- Use `joblib` for saving and loading models
- Save entire pipelines, not just models
- Version control model artifacts
- Document model metadata

## Performance Optimization

- Use `n_jobs=-1` for parallel processing where available
- Consider `warm_start=True` for iterative training
- Use sparse matrices for high-dimensional sparse data
- Consider incremental learning with `partial_fit()` for large data

## Key Conventions

- Import from submodules: `from sklearn.ensemble import RandomForestClassifier`
- Set `random_state` for reproducibility
- Use pipelines to prevent data leakage
- Document model choices and hyperparameters
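
The conventions above can be sketched end to end as follows: a stratified hold-out split, a pipeline tuned with `GridSearchCV` over `StratifiedKFold`, a single test-set evaluation, and persistence of the whole pipeline with `joblib`. The synthetic dataset, the parameter grid values, and the file name `model.joblib` are assumptions chosen for illustration, not recommendations.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced data stands in for a real dataset (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Hold out a stratified test set before anything is fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

# Small illustrative grid; step-name prefixes route parameters to the classifier.
param_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [3, None],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1",
    n_jobs=1,  # n_jobs=-1 would use all cores; kept at 1 so the sketch runs anywhere
)
search.fit(X_train, y_train)

# Evaluate once on the held-out test set, after tuning is finished.
test_f1 = f1_score(y_test, search.predict(X_test))

# Persist the whole fitted pipeline, not just the classifier.
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(search.best_estimator_, model_path)
reloaded = joblib.load(model_path)
```

Tuning happens only inside the cross-validation folds of the training data, so the test score is not influenced by the hyperparameter search. Reloading with `joblib.load` returns the scaler and classifier together, ready to call `predict` on raw features.
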