---
name: "scikit-learn-machine-learning"
description: "Classical ML in Python: classification, regression, clustering, dim reduction, evaluation, tuning, preprocessing pipelines. Linear models, tree ensembles, SVMs, K-Means, PCA, t-SNE. Use PyTorch/TF for deep learning; XGBoost/LightGBM for scale."
license: "BSD-3-Clause"
---

# scikit-learn

## Overview

scikit-learn is the standard Python library for classical machine learning. It provides consistent APIs for supervised learning (classification, regression), unsupervised learning (clustering, dimensionality reduction), model evaluation, and preprocessing, with seamless integration into NumPy/pandas workflows.

## When to Use

- Building classification models for labeled data (spam detection, disease diagnosis, species identification)
- Predicting continuous outcomes with regression (price prediction, dose-response modeling)
- Clustering unlabeled data into groups (patient stratification, gene expression clusters)
- Reducing dimensionality for visualization or feature engineering (PCA, t-SNE on multi-omics data)
- Evaluating and comparing model performance with cross-validation
- Tuning hyperparameters systematically (grid search, random search)
- Building reproducible ML pipelines with preprocessing and modeling steps
- For deep learning tasks (images, NLP), use `pytorch` or `transformers` instead
- For large-scale gradient boosting, use `xgboost` or `lightgbm` instead

## Prerequisites

- **Python packages**: `scikit-learn`, `numpy`, `pandas`
- **Optional**: `matplotlib`, `seaborn` for visualization
- **Data**: Tabular data as NumPy arrays or pandas DataFrames

```bash
pip install scikit-learn numpy pandas matplotlib seaborn
```

## Quick Start

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_breast_cancer

# Load dataset, split, train, evaluate in 10 lines
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))
```

## Core API

### Module 1: Data Preprocessing

Scaling, encoding, imputation, and feature engineering.

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
import numpy as np

# Scaling: zero mean, unit variance
X = np.array([[1, 2], [3, 4], [5, 6]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f"Mean: {X_scaled.mean(axis=0)}, Std: {X_scaled.std(axis=0)}")
# Mean: [0. 0.], Std: [1. 1.]

# Imputation: fill missing values
X_missing = np.array([[1, np.nan], [3, 4], [np.nan, 6]])
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X_missing)
print(f"Filled:\n{X_filled}")
```

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder

# One-hot encoding for nominal categories
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
X_cat = np.array([["red"], ["blue"], ["green"], ["red"]])
X_encoded = enc.fit_transform(X_cat)
print(f"Categories: {enc.categories_}")
print(f"Encoded shape: {X_encoded.shape}")  # (4, 3)
```

### Module 2: Supervised Learning — Classification

Classifiers for discrete target prediction.

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Compare classifiers
classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=200),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel="rbf", C=1.0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name}: accuracy = {clf.score(X_test, y_test):.3f}")
```

### Module 3: Supervised Learning — Regression

Regressors for continuous target prediction.

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}: RMSE={mean_squared_error(y_test, y_pred, squared=False):.2f}, R²={r2_score(y_test, y_pred):.3f}")
```

### Module 4: Unsupervised Learning — Clustering

Clustering algorithms for unlabeled data.

```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)

# K-Means with elbow method
for k in [2, 3, 4, 5, 6]:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    sil = silhouette_score(X, labels)
    print(f"k={k}: silhouette={sil:.3f}, inertia={km.inertia_:.1f}")
```

```python
# DBSCAN — no need to specify k
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"DBSCAN: {n_clusters} clusters, {n_noise} noise points")
```

### Module 5: Dimensionality Reduction

PCA, t-SNE, and other methods for visualization and feature reduction.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
print(f"Original shape: {X.shape}")  # (1797, 64)

# PCA — preserve 95% variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print(f"PCA: {X_pca.shape[1]} components, explained variance: {pca.explained_variance_ratio_.sum():.3f}")

# t-SNE — 2D visualization
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)
print(f"t-SNE shape: {X_tsne.shape}")  # (1797, 2)
```

### Module 6: Model Evaluation & Selection

Cross-validation, metrics, hyperparameter tuning.

```python
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Cross-validation
clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=StratifiedKFold(5), scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

```python
# Hyperparameter tuning with GridSearchCV
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5]
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring="accuracy", n_jobs=-1
)
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_:.3f}")
```

### Module 7: Pipelines

Chain preprocessing and models; prevent data leakage.

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

# Mixed-type preprocessing
numeric_features = ["age", "income"]
categorical_features = ["gender", "occupation"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), numeric_features),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]), categorical_features),
])

pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(random_state=42))
])
# pipe.fit(X_train, y_train); pipe.predict(X_test)
print("Pipeline steps:", [name for name, _ in pipe.steps])
```

## Common Workflows

### Workflow 1: End-to-End Classification

**Goal**: Complete classification workflow from data loading to evaluation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.datasets import load_breast_cancer

# Load data
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Build pipeline
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42))
])

# Cross-validate
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="f1")
print(f"CV F1: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Final evaluation
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))
```

### Workflow 2: Clustering with Visualization

**Goal**: Cluster data and visualize with dimensionality reduction.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Generate and scale data
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Cluster
km = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = km.fit_predict(X_scaled)
print(f"Silhouette: {silhouette_score(X_scaled, labels):.3f}")

# Visualize
X_2d = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="viridis", s=20, alpha=0.7)
plt.title("K-Means Clustering (PCA projection)")
plt.savefig("clustering_result.png", dpi=150, bbox_inches="tight")
print("Saved clustering_result.png")
```

### Workflow 3: Feature Selection + Model Pipeline

**Goal**: Select best features and build a tuned model.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=50, n_informative=10, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(f_classif)),
    ("svm", SVC(kernel="rbf"))
])

param_grid = {
    "selector__k": [5, 10, 20],
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", "auto"]
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best accuracy: {grid.best_score_:.3f}")
```

## Key Parameters

| Parameter | Module | Default | Range / Options | Effect |
|-----------|--------|---------|-----------------|--------|
| `n_estimators` | RandomForest, GradientBoosting | `100` | `50`-`1000` | Number of trees; higher = better but slower |
| `max_depth` | Tree-based models | `None` | `1`-`50`, `None` | Tree depth; `None` = no limit (can overfit) |
| `C` | SVM, LogisticRegression | `1.0` | `0.001`-`1000` | Regularization strength (inverse); lower = more regularization |
| `alpha` | Ridge, Lasso | `1.0` | `0.001`-`100` | Regularization strength; higher = more regularization |
| `n_clusters` | KMeans | required | `2`-`N` | Number of clusters to form |
| `eps` | DBSCAN | `0.5` | `0.01`-`10` | Neighborhood radius; smaller = more clusters |
| `n_components` | PCA | required | `1`-`N` or `0.0`-`1.0` | Components to keep; float = variance ratio |
| `perplexity` | t-SNE | `30` | `5`-`50` | Balance local/global structure |
| `cv` | GridSearchCV | `5` | `2`-`10` | Cross-validation folds |
| `scoring` | GridSearchCV, cross_val_score | varies | `accuracy`, `f1`, `roc_auc`, etc. | Evaluation metric |

## Common Recipes

### Recipe: Feature Importance Analysis

When to use: Understanding which features drive model predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
feature_names = load_iris().feature_names
for i in range(X.shape[1]):
    print(f"{feature_names[indices[i]]}: {importances[indices[i]]:.4f}")
```

### Recipe: Learning Curve Diagnosis

When to use: Diagnosing overfitting vs underfitting.

```python
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np

train_sizes, train_scores, val_scores = learning_curve(
    clf, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10), scoring="accuracy"
)
plt.plot(train_sizes, train_scores.mean(axis=1), label="Train")
plt.plot(train_sizes, val_scores.mean(axis=1), label="Validation")
plt.xlabel("Training size"); plt.ylabel("Accuracy"); plt.legend()
plt.savefig("learning_curve.png", dpi=150, bbox_inches="tight")
print("Saved learning_curve.png")
```

### Recipe: Save and Load Models

When to use: Persisting trained models for later use.

```python
import joblib

# Save
joblib.dump(pipe, "model_pipeline.joblib")
print("Model saved to model_pipeline.joblib")

# Load
loaded_pipe = joblib.load("model_pipeline.joblib")
y_pred = loaded_pipe.predict(X_test)
print(f"Loaded model predictions: {y_pred[:5]}")
```

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `ConvergenceWarning` | Model didn't converge | Increase `max_iter` (e.g., 1000) or scale features with `StandardScaler` |
| High train accuracy, low test accuracy | Overfitting | Add regularization, reduce `max_depth`, use cross-validation |
| `ValueError: unknown categories` | New categories in test data | Use `OneHotEncoder(handle_unknown='ignore')` |
| `MemoryError` with large data | Full dataset in memory | Use `SGDClassifier`/`MiniBatchKMeans` for incremental learning |
| Poor clustering results | Unscaled features or wrong k | Scale features first; use silhouette score to find optimal k |
| `NotFittedError` | Predict before fit | Call `model.fit(X_train, y_train)` first |
| Different results each run | Missing `random_state` | Set `random_state=42` in model and `train_test_split` |
| Slow GridSearchCV | Large parameter grid | Use `RandomizedSearchCV` or `HalvingGridSearchCV`; add `n_jobs=-1` |

## References

- [scikit-learn User Guide](https://scikit-learn.org/stable/user_guide.html) — official documentation
- [scikit-learn API Reference](https://scikit-learn.org/stable/api/index.html) — complete API
- [scikit-learn Examples Gallery](https://scikit-learn.org/stable/auto_examples/) — tutorials
- Pedregosa et al. (2011). Scikit-learn: Machine Learning in Python. *JMLR* 12:2825-2830.