---
name: umap-learn
description: >-
  UMAP dimensionality reduction for visualization, clustering prep, and feature
  engineering. Fast nonlinear manifold learning preserving local and global structure.
  Standard UMAP (fit/transform, sklearn-compatible), supervised/semi-supervised,
  Parametric UMAP (NN encoder/decoder, TensorFlow), DensMAP (density), AlignedUMAP
  (temporal/batch). 15+ distance metrics, custom Numba metrics, precomputed distances.
  For linear reduction use PCA; for neighborhood graphs use sklearn NearestNeighbors.
license: BSD-3-Clause
---

# UMAP-Learn

## Overview

UMAP (Uniform Manifold Approximation and Projection) is a nonlinear dimensionality reduction algorithm used for visualization and as a general-purpose preprocessing step. It is typically faster than t-SNE on large datasets, tends to preserve more global structure while keeping local neighborhoods intact, and supports supervised embedding and projection of new data points.

## When to Use

- Reducing high-dimensional data to 2D/3D for visualization
- Preprocessing for density-based clustering (HDBSCAN, DBSCAN)
- Feature engineering in ML pipelines (transform new data into learned embedding)
- Supervised/semi-supervised embedding with partial labels
- Tracking embeddings across time points or batches (AlignedUMAP)
- Density-preserving embeddings (DensMAP)
- Neural network-based embedding with custom architectures (Parametric UMAP)
- For linear dimensionality reduction use **PCA** (scikit-learn)
- For neighborhood-graph construction without embedding use **scikit-learn NearestNeighbors**

## Prerequisites

```bash
pip install umap-learn

# For Parametric UMAP (neural network variant); requires TensorFlow 2.x
# Quotes keep shells such as zsh from expanding the square brackets
pip install "umap-learn[parametric_umap]"
```

**Critical**: Standardize features before applying UMAP when they are measured on different scales; otherwise the largest-scale features dominate distance-based metrics such as Euclidean.
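To make the scaling advice concrete, here is a minimal NumPy-only sketch. The two-feature dataset and the helper `feature_share_of_sq_distance` are hypothetical and only illustrate how a large-scale feature can swamp Euclidean distances until the data is standardized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales:
# column 0 is in the thousands, column 1 is in the unit range.
X = np.column_stack([
    rng.normal(loc=5000.0, scale=1000.0, size=200),
    rng.normal(loc=0.5, scale=0.1, size=200),
])

def feature_share_of_sq_distance(X):
    """Fraction of the total squared pairwise Euclidean distance due to each column.

    Summed over all pairs, (x_i - x_j)^2 for one column is proportional to that
    column's variance, so the shares reduce to normalized variances.
    """
    per_feature = X.var(axis=0)
    return per_feature / per_feature.sum()

raw_share = feature_share_of_sq_distance(X)

# Same transform StandardScaler applies: zero mean, unit (population) variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_share = feature_share_of_sq_distance(X_std)

print("raw share per feature:         ", raw_share.round(4))
print("standardized share per feature:", scaled_share.round(4))
```

Before scaling, nearly all of the distance comes from the first column; after scaling, both columns contribute equally.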

## Quick Start

```python
import umap
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits

# Load and scale data
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit and transform
embedding = umap.UMAP(random_state=42).fit_transform(X_scaled)
print(f"Input: {X_scaled.shape}, Output: {embedding.shape}")
# Input: (1797, 64), Output: (1797, 2)
```

## Core API

### 1. Standard UMAP

Basic dimensionality reduction following scikit-learn conventions.

```python
import umap
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(data)

# Method 1: fit_transform (single step)
embedding = umap.UMAP(
    n_neighbors=15,     # local neighborhood size (2-200)
    min_dist=0.1,       # min distance between embedded points (0.0-0.99)
    n_components=2,     # output dimensions
    metric='euclidean', # distance metric
    random_state=42,    # reproducibility
).fit_transform(X_scaled)
print(f"Embedding shape: {embedding.shape}")

# Method 2: fit + access (for reuse)
reducer = umap.UMAP(random_state=42)
reducer.fit(X_scaled)
embedding = reducer.embedding_  # trained embedding
graph = reducer.graph_          # fuzzy simplicial set (sparse matrix)
```

```python
# Visualization
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title('UMAP Embedding')
plt.tight_layout()
plt.savefig('umap_embedding.png', dpi=150)
```

### 2. Supervised & Semi-Supervised UMAP

Incorporate label information to guide embedding via the `y` parameter.

```python
import umap

# Supervised — all labels known
embedding = umap.UMAP(random_state=42).fit_transform(X_scaled, y=labels)

# Semi-supervised — partial labels (mark unlabeled as -1)
semi_labels = labels.copy()
semi_labels[unlabeled_indices] = -1
embedding = umap.UMAP(random_state=42).fit_transform(X_scaled, y=semi_labels)

# Control label influence with target_weight (0.0=unsupervised, 1.0=fully supervised)
reducer = umap.UMAP(
    target_weight=0.7,               # emphasize labels
    target_metric='categorical',     # for classification; use distance metric for regression
    random_state=42
)
embedding = reducer.fit_transform(X_scaled, y=labels)
print(f"Supervised embedding: {embedding.shape}")
```

### 3. Transform New Data

Project unseen data into the trained embedding space.

```python
import umap
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit on training data
reducer = umap.UMAP(n_components=10, random_state=42)
X_train_emb = reducer.fit_transform(X_train_scaled)

# Transform test data
X_test_emb = reducer.transform(X_test_scaled)
print(f"Train: {X_train_emb.shape}, Test: {X_test_emb.shape}")

# Works in sklearn Pipelines
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('umap', umap.UMAP(n_components=10, random_state=42)),
    ('classifier', SVC())
])
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {accuracy:.3f}")
```

### 4. Parametric UMAP

Neural network-based embedding via TensorFlow/Keras. Enables efficient transform, reconstruction, and custom architectures.

```python
from umap.parametric_umap import ParametricUMAP

# Default architecture (3-layer, 100-neuron FC network)
embedder = ParametricUMAP(n_components=2, random_state=42)
embedding = embedder.fit_transform(X_scaled)
new_emb = embedder.transform(new_data)  # fast neural network inference
print(f"Parametric embedding: {embedding.shape}")
```

```python
import numpy as np
import tensorflow as tf
from umap.parametric_umap import ParametricUMAP

# Custom encoder/decoder for autoencoder mode
input_dim = X_scaled.shape[1]
encoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(input_dim,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2),
])
decoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(2,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(input_dim),
])

embedder = ParametricUMAP(
    encoder=encoder, decoder=decoder, dims=(input_dim,),
    parametric_reconstruction=True, autoencoder_loss=True,
    n_training_epochs=10, batch_size=128,
    n_neighbors=15, min_dist=0.1, random_state=42
)
embedding = embedder.fit_transform(X_scaled)
reconstructed = embedder.inverse_transform(embedding)
print(f"Reconstruction error: {np.mean((X_scaled - reconstructed)**2):.4f}")
```

### 5. DensMAP

Variant preserving local density information in the embedding.

```python
import numpy as np
import umap

reducer = umap.UMAP(
    densmap=True,          # enable DensMAP
    dens_lambda=2.0,       # density preservation weight
    dens_frac=0.3,         # fraction of epochs that include the density term
    output_dens=True,      # also return local radii
    n_neighbors=15,
    min_dist=0.1,
    random_state=42
)
# With output_dens=True, fit_transform returns a tuple, not a single array
embedding, rad_orig, rad_emb = reducer.fit_transform(X_scaled)

# The radii are also stored as reducer.rad_orig_ and reducer.rad_emb_.
# Local radius is an inverse measure of local density.
print(f"DensMAP embedding: {embedding.shape}")
print(f"Radius correlation: {np.corrcoef(rad_orig, rad_emb)[0, 1]:.3f}")
```

### 6. AlignedUMAP

Align embeddings across multiple related datasets (time points, batches).

```python
from umap import AlignedUMAP

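# Multiple related datasets
datasets = [day1_data, day2_data, day3_data]

# relations[k] maps row indices in datasets[k] to the matching rows in datasets[k + 1].
# This identity mapping assumes every dataset holds the same samples in the same order.
n_samples = day1_data.shape[0]
relations = [{i: i for i in range(n_samples)} for _ in range(len(datasets) - 1)]

mapper = AlignedUMAP(
    n_neighbors=15,
    alignment_regularisation=1e-2,  # alignment strength
    alignment_window_size=2,        # align with N adjacent datasets
    n_components=2,
    random_state=42
)
mapper.fit(datasets, relations=relations)  # relations is required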

aligned_embeddings = mapper.embeddings_  # list of aligned embedding arrays
print(f"Aligned {len(aligned_embeddings)} datasets")
for i, emb in enumerate(aligned_embeddings):
    print(f"  Dataset {i}: {emb.shape}")
```
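`AlignedUMAP.fit` expects a `relations` argument: a list with one dict per consecutive pair of datasets, mapping a row index in one dataset to the row index of the same sample in the next. When datasets only partially overlap, those dicts can be derived from sample identifiers. The sketch below is one possible approach; the helper `build_relations` and the sample IDs are hypothetical and not part of umap-learn.

```python
def build_relations(id_lists):
    """Build AlignedUMAP-style relations from per-dataset sample IDs.

    id_lists[k] holds the sample ID of every row in dataset k, in row order.
    Returns one dict per consecutive pair, mapping a row index in dataset k
    to the row index of the same sample in dataset k + 1.
    """
    relations = []
    for current_ids, next_ids in zip(id_lists[:-1], id_lists[1:]):
        # Position of each sample ID in the next dataset.
        next_pos = {sample_id: j for j, sample_id in enumerate(next_ids)}
        relations.append({
            i: next_pos[sample_id]
            for i, sample_id in enumerate(current_ids)
            if sample_id in next_pos  # samples missing from the next dataset are skipped
        })
    return relations

# Hypothetical sample IDs for three time points with partial overlap.
day1_ids = ["a", "b", "c", "d"]
day2_ids = ["b", "c", "d", "e"]
day3_ids = ["d", "e", "f"]

relations = build_relations([day1_ids, day2_ids, day3_ids])
print(relations)
# [{1: 0, 2: 1, 3: 2}, {2: 0, 3: 1}]
```

The resulting list can then be passed as `mapper.fit(datasets, relations=relations)`.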

## Key Concepts

### Parameter Tuning Guide

| Parameter | Low | Medium (default) | High | Effect |
|-----------|-----|-------------------|------|--------|
| `n_neighbors` | 2-5 | 15 | 50-200 | Local detail vs global structure |
| `min_dist` | 0.0 | 0.1 | 0.5-0.99 | Tight clusters vs spread out |
| `n_components` | 2 | 2 | 5-50 | Visualization vs ML/clustering |
| `spread` | 0.5 | 1.0 | 2.0 | Embedding scale (with min_dist) |

### Configuration by Use-Case

| Use-Case | n_neighbors | min_dist | n_components | metric |
|----------|-------------|----------|-------------|--------|
| Visualization | 15 | 0.1 | 2 | euclidean |
| Clustering (HDBSCAN) | 30 | 0.0 | 5-10 | euclidean |
| Text/document embedding | 15 | 0.1 | 2 | cosine |
| Global structure | 100 | 0.5 | 2 | euclidean |
| ML feature engineering | 15-30 | 0.1 | 10-50 | euclidean |
| Binary/set data | 15 | 0.1 | 2 | hamming/jaccard |
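The table can be kept in code as a small preset lookup. This is only a sketch: the preset names, the helper `umap_kwargs`, and the single values picked where the table gives a range are all assumptions made for illustration.

```python
# Hypothetical preset names; values follow the table above.
# Where the table gives a range, one value from that range is chosen.
UMAP_PRESETS = {
    "visualization": dict(n_neighbors=15, min_dist=0.1, n_components=2, metric="euclidean"),
    "clustering": dict(n_neighbors=30, min_dist=0.0, n_components=10, metric="euclidean"),
    "text": dict(n_neighbors=15, min_dist=0.1, n_components=2, metric="cosine"),
    "global_structure": dict(n_neighbors=100, min_dist=0.5, n_components=2, metric="euclidean"),
    "ml_features": dict(n_neighbors=30, min_dist=0.1, n_components=20, metric="euclidean"),
    "binary": dict(n_neighbors=15, min_dist=0.1, n_components=2, metric="jaccard"),
}

def umap_kwargs(use_case, n_samples, **overrides):
    """Return UMAP keyword arguments for a use case, capped to the dataset size."""
    params = dict(UMAP_PRESETS[use_case])
    params.update(overrides)
    # Keep n_neighbors below the sample count so small datasets stay valid.
    params["n_neighbors"] = max(2, min(params["n_neighbors"], n_samples - 1))
    return params

print(umap_kwargs("clustering", n_samples=1797, random_state=42))
print(umap_kwargs("global_structure", n_samples=60))
```

The returned dict can be unpacked directly, for example `umap.UMAP(**umap_kwargs("clustering", n_samples=len(X)))`.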

### Supported Metrics

Minkowski family: `euclidean`, `manhattan`, `chebyshev`, `minkowski`. Spatial: `canberra`, `braycurtis`, `haversine`. Correlation: `cosine`, `correlation`. Binary: `hamming`, `jaccard`, `dice`, `russellrao`, `rogerstanimoto`, `sokalmichener`, `sokalsneath`, `yule`. Special: `precomputed` (distance matrix), custom Numba-compiled callables.

### Standard UMAP vs Parametric UMAP

| Feature | Standard | Parametric |
|---------|----------|-----------|
| Backend | Direct optimization | TensorFlow neural network |
| Transform speed | Moderate | Fast (neural net inference) |
| Inverse transform | Approximate, expensive | Decoder network, fast |
| Custom architecture | No | Yes (CNNs, RNNs, etc.) |
| Requirements | umap-learn | umap-learn + TensorFlow 2.x |
| Best for | Quick exploration | Production pipelines, reconstruction |

## Common Workflows

### Workflow 1: UMAP + HDBSCAN Clustering Pipeline

```python
import umap
import hdbscan
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

# Step 1: Preprocess
X_scaled = StandardScaler().fit_transform(data)
print(f"Input shape: {X_scaled.shape}")

# Step 2: UMAP for clustering (NOT visualization parameters)
reducer = umap.UMAP(
    n_neighbors=30,     # more global structure for clustering
    min_dist=0.0,       # allow tight packing
    n_components=10,    # higher dims preserve density better than 2D
    metric='euclidean',
    random_state=42
)
embedding = reducer.fit_transform(X_scaled)

# Step 3: HDBSCAN clustering
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=5)
cluster_labels = clusterer.fit_predict(embedding)

n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
noise = sum(cluster_labels == -1)
print(f"Clusters: {n_clusters}, Noise: {noise}")

# Step 4: Separate 2D embedding for visualization
vis_emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(X_scaled)
plt.scatter(vis_emb[:, 0], vis_emb[:, 1], c=cluster_labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title(f'HDBSCAN Clusters (n={n_clusters})')
plt.tight_layout()
plt.savefig('umap_clusters.png', dpi=150)
```

### Workflow 2: Supervised Embedding for Classification

```python
import umap
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Supervised UMAP for feature engineering
reducer = umap.UMAP(n_components=10, random_state=42)
X_train_emb = reducer.fit_transform(X_train_s, y=y_train)
X_test_emb = reducer.transform(X_test_s)

# Downstream classifier
clf = SVC(kernel='rbf')
clf.fit(X_train_emb, y_train)
y_pred = clf.predict(X_test_emb)

print(classification_report(y_test, y_pred))
```

### Workflow 3: Exploring Embedding Space with Inverse Transform

This workflow combines Core API modules 1 and 3, applying `inverse_transform` to a standard UMAP model:

1. Fit standard UMAP on data (Core API: Standard UMAP)
2. Create a grid of points spanning the embedding space
3. Apply `reducer.inverse_transform(grid_points)` to reconstruct high-dimensional data
4. Visualize reconstructed samples to understand embedding regions

Note: inverse transform is approximate; works poorly outside the convex hull of the training embedding.

## Key Parameters

| Parameter | Module | Default | Range | Effect |
|-----------|--------|---------|-------|--------|
| `n_neighbors` | UMAP | 15 | 2-200 | Local vs global structure balance |
| `min_dist` | UMAP | 0.1 | 0.0-0.99 | Cluster tightness |
| `n_components` | UMAP | 2 | 2-100 | Output dimensionality |
| `metric` | UMAP | `'euclidean'` | See metrics list | Distance calculation method |
| `spread` | UMAP | 1.0 | >0 | Embedding scale (with min_dist) |
| `n_epochs` | UMAP | `None` (auto) | 50-500+ | Training iterations |
| `learning_rate` | UMAP | 1.0 | >0 | SGD step size |
| `init` | UMAP | `'spectral'` | spectral/random/pca | Embedding initialization |
| `random_state` | UMAP | `None` | int | Reproducibility seed |
| `target_weight` | UMAP | 0.5 | 0.0-1.0 | Label influence (supervised) |
| `densmap` | UMAP | `False` | bool | Enable DensMAP |
| `dens_lambda` | UMAP | 2.0 | >0 | DensMAP density weight |
| `low_memory` | UMAP | `True` | bool | Memory-efficient mode |
| `encoder` | ParametricUMAP | `None` | Keras model | Custom encoder network |
| `decoder` | ParametricUMAP | `None` | Keras model | Custom decoder network |
| `n_training_epochs` | ParametricUMAP | 1 | 1-100 | Neural network training epochs |
| `alignment_regularisation` | AlignedUMAP | 0.01 | >0 | Alignment strength |
| `alignment_window_size` | AlignedUMAP | 3 | 1-N | Adjacent datasets to align |

## Best Practices

1. **Always standardize features**: Use `StandardScaler` before UMAP — unscaled features with different ranges will dominate the embedding.

2. **Set `random_state` for reproducibility**: UMAP uses stochastic optimization, so results vary between runs without a fixed seed. Fixing the seed also disables multi-threaded optimization, so seeded runs are slower.

3. **Use different parameters for clustering vs visualization**: Clustering needs `n_neighbors=30, min_dist=0.0, n_components=5-10`. Visualization needs `n_neighbors=15, min_dist=0.1, n_components=2`.

4. **Anti-pattern — interpreting distances literally**: UMAP preserves topology, not precise distances. Cluster separations and point distances in the embedding are not proportional to original distances.

5. **Anti-pattern — using 2D embeddings for clustering**: 2D projections lose density information. Use 5-10 components for HDBSCAN input.

6. **Consider PCA preprocessing for very high dimensions**: For data with >1000 features, reducing to 50-100 PCA components first can speed up UMAP, usually with little loss in embedding quality.

7. **Use Parametric UMAP for production**: When you need fast transform on new data or reconstruction capabilities, Parametric UMAP's neural network provides consistent, fast inference.

## Common Recipes

### Recipe: Custom Numba Distance Metric

```python
import numpy as np
from numba import njit
import umap

@njit()
def weighted_euclidean(x, y):
    """Custom distance with feature weights."""
    result = 0.0
    for i in range(x.shape[0]):
        result += (x[i] - y[i]) ** 2 * (1.0 + i * 0.01)  # increasing weight
    return np.sqrt(result)

embedding = umap.UMAP(metric=weighted_euclidean, random_state=42).fit_transform(data)
```

### Recipe: Precomputed Distance Matrix

```python
import umap
from scipy.spatial.distance import pdist, squareform

# Compute custom distance matrix
dist_matrix = squareform(pdist(data, metric='correlation'))

# Use precomputed distances
embedding = umap.UMAP(
    metric='precomputed', random_state=42
).fit_transform(dist_matrix)
print(f"Embedding from precomputed: {embedding.shape}")
```

### Recipe: Metric Learning Pipeline

```python
import umap
from sklearn.svm import SVC

# Train supervised embedding on labeled data
mapper = umap.UMAP(n_components=10, random_state=42)
train_emb = mapper.fit_transform(X_train, y=y_train)

# Transform unlabeled test data using learned metric
test_emb = mapper.transform(X_test)

# Downstream classifier
clf = SVC().fit(train_emb, y_train)
predictions = clf.predict(test_emb)
print(f"Accuracy: {(predictions == y_test).mean():.3f}")
```

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| Disconnected/fragmented clusters | `n_neighbors` too low | Increase `n_neighbors` (try 30-50) |
| Clusters too spread out | `min_dist` too high | Decrease `min_dist` (try 0.0-0.05) |
| All points collapsed | Bad preprocessing or `min_dist` too low | Check `StandardScaler`; increase `min_dist` |
| Poor clustering results | Using visualization parameters for clustering | Set `n_neighbors=30, min_dist=0.0, n_components=5-10` |
| Transform results differ from training | Distribution shift | Ensure test data matches training distribution; use Parametric UMAP |
| Slow on large datasets (>100k) | Expensive neighbor search in high dimensions | Preprocess with PCA to 50-100 dims; note that `low_memory=True` (the default) reduces memory use, not run time |
| First run very slow | Numba JIT compilation | Expected; later calls in the same process reuse the compiled code |
| `ImportError: umap` | Name conflict with `umap` package | `pip install umap-learn` (not `pip install umap`) |
| Parametric UMAP import error | Missing TensorFlow | `pip install umap-learn[parametric_umap]` |
| Non-reproducible results | Missing `random_state` | Always set `random_state=42` (or any int) |

## Bundled Resources

### references/api_reference.md

Complete UMAP constructor parameter reference (60+ parameters organized by category: core, training, advanced structural, supervised, transform, performance, DensMAP), all methods and attributes, ParametricUMAP class with autoencoder parameters, AlignedUMAP class, utility functions (nearest_neighbors, fuzzy_simplicial_set). Core parameter tuning guidance was relocated to SKILL.md Key Concepts and Core API modules. Usage examples duplicating SKILL.md workflows omitted.

## Related Skills

- **scikit-learn-machine-learning** — ML classifiers, preprocessing, pipelines for downstream tasks
- **matplotlib-scientific-plotting** — Visualization of UMAP embeddings
- **scikit-bio** — Biological distance matrices that can feed into UMAP via `metric='precomputed'`

## References

- McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426
- Sainburg T, McInnes L, Gentner TQ. Parametric UMAP Embeddings for Representation and Semisupervised Learning. Neural Computation (2021)
- Narayan A, Berger B, Cho H. Assessing single-cell transcriptomic variability through density-preserving data visualization. Nature Biotechnology (2021) — DensMAP
- Official docs: https://umap-learn.readthedocs.io/
- GitHub: https://github.com/lmcinnes/umap