---
name: "scanpy-scrna-seq"
description: "scRNA-seq with Scanpy: QC, normalization, HVG selection, PCA, neighborhood graph, UMAP/t-SNE, Leiden clustering, markers, cell annotation, trajectory inference. Standard scRNA-seq exploration."
license: "CC-BY-4.0"
---

# Scanpy Single-Cell RNA-seq Analysis

## Overview

Scanpy is a scalable Python toolkit for analyzing single-cell RNA-seq data built on the AnnData format. This skill covers the end-to-end standard workflow: quality control, normalization, highly variable gene selection, dimensionality reduction, clustering, marker gene identification, and cell type annotation. It produces annotated datasets and publication-quality visualizations.

## When to Use

- Analyzing single-cell RNA-seq count matrices (10X Genomics, h5ad, CSV, loom)
- Performing quality control filtering on scRNA-seq datasets (mitochondrial %, gene counts)
- Running dimensionality reduction: PCA, UMAP, t-SNE
- Identifying cell clusters via Leiden community detection
- Finding differentially expressed marker genes per cluster (Wilcoxon, t-test, logistic regression)
- Annotating cell types from known marker gene panels
- Conducting trajectory inference and pseudotime analysis (PAGA, diffusion pseudotime)
- Generating publication-quality single-cell plots (dot plots, heatmaps, stacked violins)
- Comparing gene expression across experimental conditions within cell types
- Use **Seurat** (R/Bioconductor) instead for scRNA-seq analysis in an existing R workflow or when Seurat-specific assay types are required

## Prerequisites

- **Python packages**: `scanpy>=1.10`, `leidenalg`, `igraph`, `anndata`
- **Data requirements**: Gene expression count matrix (cells x genes). Common formats: 10X Cell Ranger output (`filtered_feature_bc_matrix/`), `.h5ad`, `.h5`, `.csv`
- **Environment**: Python 3.9+; 16GB+ RAM recommended for >50k cells

```bash
pip install "scanpy[leiden]" anndata
```

## Workflow

### Step 1: Setup and Data Loading

Configure scanpy settings and read the count matrix into an AnnData object. Ensure gene names are unique.

```python
import scanpy as sc
import numpy as np
import pandas as pd

# Configure settings
sc.settings.verbosity = 3  # errors=0, warnings=1, info=2, hints=3
sc.settings.set_figure_params(dpi=80, facecolor="white")

# Load data — pick the reader matching your format
adata = sc.read_10x_mtx(
    "path/to/filtered_feature_bc_matrix/",
    var_names="gene_symbols",
    cache=True,
)
# Alternatives:
# adata = sc.read_h5ad("data.h5ad")
# adata = sc.read_10x_h5("data.h5")

adata.var_names_make_unique()
print(f"Loaded: {adata.n_obs} cells x {adata.n_vars} genes")
print(f"Sparsity: {1 - adata.X.nnz / (adata.n_obs * adata.n_vars):.1%}")
```

### Step 2: Quality Control

Annotate mitochondrial/ribosomal genes, compute QC metrics, visualize distributions, and filter low-quality cells.

```python
# Annotate gene groups
adata.var["mt"] = adata.var_names.str.startswith("MT-")    # human; use "mt-" for mouse
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))

# Calculate QC metrics
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt", "ribo"], percent_top=None, log1p=False, inplace=True
)

# Visualize QC distributions
sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True,
)
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")

# Filter — adjust thresholds based on the QC plots above
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs.n_genes_by_counts < 5000, :]  # remove potential doublets
adata = adata[adata.obs.pct_counts_mt < 20, :]         # remove dying cells

print(f"After QC: {adata.n_obs} cells x {adata.n_vars} genes")
```

### Step 3: Normalization and Feature Selection

Normalize library sizes, log-transform, and identify highly variable genes (HVGs). Store raw counts for later differential expression.

```python
# Preserve raw counts in a layer
adata.layers["counts"] = adata.X.copy()

# Normalize to 10,000 counts per cell, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Save normalized data as .raw for visualization
adata.raw = adata

# Identify highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3", layer="counts")
sc.pl.highly_variable_genes(adata)

# Subset to HVGs
adata = adata[:, adata.var.highly_variable]
print(f"HVG subset: {adata.n_obs} cells x {adata.n_vars} genes")
```

### Step 4: Scaling and Regression

Regress out confounders and scale gene expression values. This prepares data for PCA.

```python
# Regress out unwanted sources of variation
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])

# Scale to unit variance, clip extreme values
sc.pp.scale(adata, max_value=10)

print(f"Scaled matrix: mean={adata.X.mean():.4f}, std={adata.X.std():.4f}")
```

### Step 5: Dimensionality Reduction

Compute PCA, inspect the variance ratio elbow plot, then build a neighborhood graph and embed with UMAP.

```python
# PCA
sc.tl.pca(adata, svd_solver="arpack", n_comps=50)
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)

# Determine n_pcs from elbow plot (typically 30-50)
n_pcs = 40

# Compute k-nearest neighbor graph
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=n_pcs)

# UMAP embedding
sc.tl.umap(adata)
sc.pl.umap(adata, color=["n_genes_by_counts", "total_counts", "pct_counts_mt"])

print(f"UMAP shape: {adata.obsm['X_umap'].shape}")
```

### Step 6: Clustering

Run Leiden clustering at multiple resolutions and select the best granularity by visual inspection.

```python
# Test multiple resolutions
for res in [0.3, 0.5, 0.8, 1.0, 1.2]:
    sc.tl.leiden(adata, resolution=res, key_added=f"leiden_res{res}", flavor="igraph", n_iterations=2)

# Compare resolutions side-by-side
sc.pl.umap(
    adata,
    color=["leiden_res0.3", "leiden_res0.5", "leiden_res0.8", "leiden_res1.0"],
    ncols=2,
    legend_loc="on data",
)

# Pick final resolution based on biological plausibility
adata.obs["leiden"] = adata.obs["leiden_res0.5"]
n_clusters = adata.obs["leiden"].nunique()
print(f"Selected resolution=0.5: {n_clusters} clusters")
```

### Step 7: Marker Gene Identification

Find differentially expressed genes per cluster using Wilcoxon rank-sum test on raw counts.

```python
# Rank genes per cluster
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon", use_raw=True)

# Visualization
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
sc.pl.rank_genes_groups_dotplot(adata, n_genes=5)
sc.pl.rank_genes_groups_heatmap(adata, n_genes=10, use_raw=True, show_gene_labels=True)

# Extract as DataFrame
markers_df = sc.get.rank_genes_groups_df(adata, group=None)
top_markers = markers_df[markers_df["pvals_adj"] < 0.05].groupby("group").head(10)
print(f"Significant markers: {len(top_markers)} genes across {top_markers['group'].nunique()} clusters")
print(top_markers[["group", "names", "logfoldchanges", "pvals_adj"]].head(15))
```

### Step 8: Cell Type Annotation and Export

Assign cell types based on canonical marker genes, visualize, and save all results.

```python
# Define canonical markers (example: human PBMC)
marker_genes = {
    "CD4 T cells":      ["CD3D", "CD3E", "IL7R"],
    "CD8 T cells":      ["CD8A", "CD8B", "GZMK"],
    "B cells":          ["MS4A1", "CD79A", "CD79B"],
    "NK cells":         ["NKG7", "GNLY", "KLRD1"],
    "CD14+ Monocytes":  ["CD14", "LYZ", "S100A9"],
    "FCGR3A+ Monocytes":["FCGR3A", "MS4A7"],
    "Dendritic cells":  ["FCER1A", "CST3"],
    "Platelets":        ["PPBP", "PF4"],
}

# Dot plot — rows = clusters, columns = markers
sc.pl.dotplot(adata, var_names=marker_genes, groupby="leiden", use_raw=True)

# Map clusters → cell types (adjust based on your dot plot)
cluster_to_celltype = {
    "0": "CD4 T cells",
    "1": "CD14+ Monocytes",
    "2": "B cells",
    "3": "CD8 T cells",
    "4": "NK cells",
    "5": "FCGR3A+ Monocytes",
    "6": "Dendritic cells",
    "7": "Platelets",
}
adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_to_celltype).fillna("Unknown")
sc.pl.umap(adata, color="cell_type", legend_loc="on data", frameon=False)

# Save results
adata.write("results/annotated_data.h5ad", compression="gzip")
adata.obs.to_csv("results/cell_metadata.csv")
markers_df.to_csv("results/marker_genes.csv", index=False)
print(f"Saved: {adata.n_obs} cells with {adata.obs['cell_type'].nunique()} cell types")
```

## Key Parameters

| Parameter | Default | Range / Options | Effect |
|-----------|---------|-----------------|--------|
| `min_genes` (filter_cells) | `200` | `100`-`500` | Minimum genes detected per cell; lower keeps more cells |
| `min_cells` (filter_genes) | `3` | `3`-`10` | Minimum cells expressing a gene; higher is stricter |
| `pct_counts_mt` threshold | `20` | `5`-`30` | Max mitochondrial %; tissue-dependent (5% for PBMCs, 20% for tumors) |
| `n_top_genes` (HVG) | `2000` | `1000`-`5000` | Number of highly variable genes; more captures subtle variation |
| `flavor` (HVG) | `"seurat_v3"` | `"seurat"`, `"seurat_v3"`, `"cell_ranger"` | HVG detection algorithm |
| `n_pcs` (neighbors) | `40` | `20`-`50` | Number of PCs for neighbor graph; check elbow plot |
| `n_neighbors` | `15` | `5`-`50` | k for k-NN graph; lower emphasizes local structure |
| `resolution` (leiden) | `0.5` | `0.1`-`2.0` | Clustering granularity; higher → more clusters |
| `method` (rank_genes) | `"wilcoxon"` | `"wilcoxon"`, `"t-test"`, `"logreg"` | DE test; Wilcoxon recommended for publication |
| `target_sum` (normalize) | `1e4` | `1e4`-`1e5` | Target library size per cell |

## Common Recipes

### Recipe: Batch Correction with ComBat

When to use: multiple samples/batches with visible batch effects on the UMAP.

```python
# Requires 'batch' column in adata.obs
sc.pp.combat(adata, key="batch")
# Re-run PCA, neighbors, UMAP, clustering after correction
sc.tl.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata, n_pcs=40)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["batch", "leiden"])
```

### Recipe: Trajectory Inference with PAGA + Diffusion Pseudotime

When to use: cells follow a developmental continuum rather than discrete clusters.

```python
# PAGA graph abstraction
sc.tl.paga(adata, groups="leiden")
sc.pl.paga(adata, color="leiden", threshold=0.03)

# Reinitialize UMAP with PAGA layout
sc.tl.umap(adata, init_pos="paga")

# Diffusion pseudotime — set root cell
adata.uns["iroot"] = np.flatnonzero(adata.obs["leiden"] == "0")[0]
sc.tl.diffmap(adata)
sc.tl.dpt(adata)
sc.pl.umap(adata, color=["dpt_pseudotime", "leiden"])
```

### Recipe: Publication-Quality Figures

When to use: generating figures for manuscripts or presentations.

```python
sc.settings.set_figure_params(dpi=300, frameon=False, figsize=(5, 5))

# UMAP with custom palette
sc.pl.umap(
    adata, color="cell_type",
    palette="Set2",
    legend_loc="on data",
    legend_fontsize=10,
    legend_fontoutline=2,
    title="",
    save="_celltype_publication.pdf",
)

# Stacked violin plot of top markers
flat_markers = [g for genes in marker_genes.values() for g in genes[:2]]
sc.pl.stacked_violin(
    adata, var_names=flat_markers, groupby="cell_type",
    use_raw=True, swap_axes=True,
    save="_markers_violin.pdf",
)
```

### Recipe: Differential Expression Between Conditions

When to use: comparing treated vs control within a specific cell type.

```python
# Subset to cell type of interest
adata_t = adata[adata.obs["cell_type"] == "CD4 T cells"].copy()

# DE between conditions
sc.tl.rank_genes_groups(adata_t, groupby="condition", groups=["treated"], reference="control", method="wilcoxon", use_raw=True)
sc.pl.rank_genes_groups_volcano(adata_t, groups=["treated"])

de_results = sc.get.rank_genes_groups_df(adata_t, group="treated")
sig_genes = de_results[de_results["pvals_adj"] < 0.05]
print(f"Significant DE genes: {len(sig_genes)}")
```

## Expected Outputs

- `results/annotated_data.h5ad` — Full AnnData with embeddings (PCA, UMAP), cluster labels, cell type annotations, and raw counts layer
- `results/cell_metadata.csv` — Per-cell table: barcode, cluster, cell type, QC metrics (n_genes, total_counts, pct_mt)
- `results/marker_genes.csv` — DE genes per cluster: gene name, log fold change, adjusted p-value
- QC figures: violin plots (n_genes, total_counts, pct_mt), scatter plots
- Embedding figures: UMAP colored by cluster, cell type, QC metrics, pseudotime
- Marker figures: dot plot, heatmap, stacked violin of canonical markers

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `ModuleNotFoundError: leidenalg` | Leiden package not installed | `pip install leidenalg igraph` |
| `MemoryError` during PCA | Dense matrix exceeds RAM | Use `sc.pp.pca(adata, chunked=True, chunk_size=20000)` |
| UMAP shows single blob | Over-filtering or too few HVGs | Relax QC thresholds; increase `n_top_genes` to 3000-5000 |
| Too many/few clusters | Resolution mismatch | Sweep `resolution` from 0.1 to 2.0 and compare UMAPs |
| `KeyError: 'MT-'` not found | Wrong species prefix | Use `mt-` (lowercase) for mouse, `MT-` for human |
| Batch effects dominate UMAP | Uncorrected technical variation | Apply ComBat recipe or use Harmony/scVI for integration |
| `adata.raw is None` error | Forgot to save raw before subsetting | Set `adata.raw = adata` after normalization, before HVG subset |
| Marker genes non-specific | Resolution too low merging cell types | Increase resolution or use `logistic_regression` method |
| Slow `regress_out` | Large cell count | Skip regress_out; use `sc.pp.scale()` alone (often sufficient) |
| `ValueError` in `rank_genes_groups` | Cluster with too few cells | Remove clusters with <10 cells before DE: `adata = adata[adata.obs.groupby('leiden').filter(lambda x: len(x)>=10).index]` |

## Bundled Resources

This skill includes reference files for deeper lookup. Read these on demand when the main SKILL.md needs more detail.

### references/api_reference.md
Quick-lookup table of all scanpy functions organized by module (sc.pp.*, sc.tl.*, sc.pl.*), with full signatures, common parameters, and AnnData structure reference. **Use when**: you need the exact function name or parameter for a specific operation.

### references/plotting_guide.md
Comprehensive visualization guide: QC plots, UMAP styling, marker gene plots (dot plot, heatmap, stacked violin), trajectory plots, multi-panel figures, publication customization, and color palette recommendations. **Use when**: creating figures for publications or presentations.

### references/standard_workflow.md
Detailed step-by-step reference for each pipeline stage with decision points, parameter rationale, and alternative approaches (MAD-based filtering, scran normalization, automated annotation). **Use when**: performing a full analysis from scratch and needing deeper guidance than the Workflow section above.

## References

- [Scanpy documentation](https://scanpy.readthedocs.io/) — Official API reference and tutorials (BSD-3)
- [Scanpy PBMC3k tutorial](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html) — Canonical walkthrough (BSD-3)
- [Galaxy Training: Clustering 3K PBMCs with Scanpy](https://training.galaxyproject.org/training-material/topics/single-cell/) — Step-by-step tutorial (CC-BY)
- [Luecken & Theis (2019)](https://doi.org/10.15252/msb.20188746) — "Current best practices in single-cell RNA-seq analysis: a tutorial", *Mol Syst Biol*
- [Heumos et al. (2023)](https://doi.org/10.1038/s41576-023-00586-w) — "Best practices for single-cell analysis across modalities", *Nat Rev Genet*
- [scverse ecosystem](https://scverse.org/) — Related tools: anndata, scvi-tools, squidpy, cellrank