---
name: machine-learning
description: Supervised/unsupervised learning, model selection, evaluation, and scikit-learn. Use for building classification, regression, or clustering models.
sasmp_version: "1.3.0"
bonded_agent: 04-machine-learning-ai
bond_type: PRIMARY_BOND
---

# Machine Learning with Scikit-Learn

Build, train, and evaluate ML models for classification, regression, and clustering.

## Quick Start

### Classification

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)

# Evaluate
print(classification_report(y_test, predictions))
```

### Regression

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

model = GradientBoostingRegressor(n_estimators=100)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(f"MAE: {mean_absolute_error(y_test, predictions):.2f}")
print(f"R²: {r2_score(y_test, predictions):.3f}")
```

### Clustering

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Find optimal k (elbow method)
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

# Train with the k chosen from the elbow plot
kmeans = KMeans(n_clusters=5, random_state=42)
clusters = kmeans.fit_predict(X)
```

## Model Selection Guide

**Classification:**

- **Logistic Regression**: Linear, interpretable, a strong baseline
- **Random Forest**: Non-linear, provides feature importances, robust
- **XGBoost**: Often top performance on tabular data, handles missing values natively
- **SVM**: Small datasets, kernel trick for non-linear boundaries

**Regression:**

- **Linear Regression**: Linear relationships, interpretable
- **Ridge/Lasso**: Regularization; Lasso also performs feature selection
- **Random Forest**: Non-linear, robust to outliers
- **XGBoost**: Often top performance, frequent competition winner

**Clustering:**

- **K-Means**: Fast, assumes roughly spherical clusters
- **DBSCAN**: Arbitrary cluster shapes, handles noise points
- **Hierarchical**: Produces a dendrogram, no upfront k selection

## Evaluation Metrics

**Classification:**

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix
)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='weighted')
recall = recall_score(y_true, y_pred, average='weighted')
f1 = f1_score(y_true, y_pred, average='weighted')
roc_auc = roc_auc_score(y_true, y_pred_proba, multi_class='ovr')
```

**Regression:**

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score
)

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
```

## Cross-Validation

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
print(f"CV F1: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

## Hyperparameter Tuning

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}
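
# Note: this grid has 3 * 3 * 3 = 27 parameter combinations; with cv=5,
# GridSearchCV fits 135 models in total. n_jobs=-1 runs those fits in
# parallel across all available CPU cores.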
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

# Use best model
best_model = grid_search.best_estimator_
```

## Feature Engineering

```python
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Encoding target labels
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

# Polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
```

## Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```

## Best Practices

1. Always split data before preprocessing; fit transformers on the training set only to avoid leakage
2. Use cross-validation for reliable performance estimates
3. Scale features for distance-based models (SVM, k-NN, K-Means)
4. Handle class imbalance with SMOTE or class weights (a class-weight sketch follows this list)
5. Check for overfitting by comparing train vs. test performance
6. Save models with joblib or pickle (see the example at the end)
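
### Class Weights for Imbalanced Data

A minimal sketch for best practice 4, assuming a classification target `y` with skewed class frequencies. Most scikit-learn classifiers accept `class_weight='balanced'`, which reweights classes inversely to their frequency; SMOTE, by contrast, requires the separate `imbalanced-learn` package and is not shown here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stratify=y keeps the class ratio identical in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# class_weight='balanced' weights each class inversely to its frequency,
# so errors on the minority class cost more during training
model = RandomForestClassifier(
    n_estimators=100, class_weight='balanced', random_state=42
)
model.fit(X_train, y_train)
```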
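
### Saving Models

A minimal sketch for best practice 6 using `joblib`, which serializes the large NumPy arrays inside fitted estimators more efficiently than plain pickle. The filename `model.joblib` is arbitrary; saving the whole pipeline keeps the preprocessing and the model together.

```python
import joblib

# Persist the fitted pipeline to disk
joblib.dump(pipeline, 'model.joblib')

# Later, e.g. in another process: reload and predict
loaded = joblib.load('model.joblib')
predictions = loaded.predict(X_test)
```

As with pickle, only load files from trusted sources: deserializing can execute arbitrary code.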