---
name: "celltypist-cell-annotation"
description: "Automated scRNA-seq cell type annotation via pre-trained logistic regression. 45+ models: immune, gut, lung, brain, fetal, cancer microenvironments. Input normalized AnnData; outputs per-cell labels, majority-vote cluster labels, confidence scores. Use for fast, reference-backed annotation without manual marker inspection."
license: "MIT"
---

# CellTypist Cell Type Annotation

## Overview

CellTypist is an automated cell type classifier for single-cell RNA-seq data built on logistic regression models trained on curated reference atlases. Given a normalized AnnData object, it predicts cell type labels at the single-cell level and optionally applies majority voting within user-defined clusters to produce consensus, biologically coherent annotations. The tool ships with 45+ ready-to-use models spanning pan-immune, organ-specific, and developmental contexts, and supports training custom models from labeled data.

## When to Use

- Annotating PBMC, whole-blood, lymph node, or other immune cell datasets using a single standardized reference model
- Generating a first-pass cell type annotation before manual curation with canonical marker genes
- Annotating cluster-level cell types in published or in-house datasets using majority voting to smooth noisy per-cell predictions
- Comparing annotation results across multiple tissue-specific models to determine the most biologically relevant reference
- Training a custom CellTypist model from a labeled reference dataset for a tissue or species not covered by pre-built models
- Quantifying annotation confidence to flag low-certainty cells (confidence score < 0.5) for manual review or exclusion
- Use **scVI/scANVI** (scvi-tools-single-cell) instead when you need probabilistic label transfer with batch correction and uncertainty quantification via a variational autoencoder
- Use **popV** (popv-cell-annotation) instead when you want ensemble consensus from 10+ methods including deep
learning and KNN-based approaches

## Prerequisites

- **Python packages**: `celltypist>=1.6`, `scanpy>=1.9`, `anndata`
- **Data requirements**: AnnData with normalized, log1p-transformed counts in `adata.X` (target sum of 10,000 counts per cell). Raw counts must be normalized before calling CellTypist
- **Environment**: Python 3.8+; 8 GB RAM sufficient for most datasets; internet access required for model downloads (first run only)

```bash
pip install celltypist "scanpy[leiden]" anndata
```

## Quick Start

Minimal pipeline: annotate a preprocessed AnnData with the pan-immune model.

```python
import celltypist
import scanpy as sc

# Load a preprocessed AnnData (normalized + log1p, Leiden clusters already in adata.obs["leiden"])
adata = sc.read_h5ad("preprocessed_pbmc.h5ad")

# Run annotation with majority voting across the existing Leiden clusters
predictions = celltypist.annotate(
    adata,
    model="Immune_All_Low.pkl",
    majority_voting=True,
    over_clustering="leiden",
)
adata = predictions.to_adata()

print(adata.obs[["predicted_labels", "majority_voting", "conf_score"]].head(10))
# Prints one row per cell with its per-cell label, cluster consensus label,
# and confidence score; the actual labels and scores depend on the dataset.
```

## Workflow

### Step 1: Installation and Model Setup

Install CellTypist and download pre-trained models. Models are cached locally after the first download.
```bash
pip install celltypist "scanpy[leiden]" anndata
```

```python
import celltypist
from celltypist import models

# Download all available models (only needed once; cached locally afterwards)
models.download_models(force_update=False)

# List available models with their descriptions
models_df = models.models_description()
print(models_df[["model", "description"]].to_string())
# Prints one row per model: the .pkl filename and a short description.
# Cell type counts are not part of this table; load a model and use
# len(model.cell_types) to get them.
```

### Step 2: Data Preparation

CellTypist requires normalized, log1p-transformed counts in `adata.X`. Run normalization before annotation, and keep the raw counts in a separate layer because normalization overwrites `adata.X`.

```python
import numpy as np
import scanpy as sc

# Load raw count matrix
adata = sc.read_h5ad("raw_counts.h5ad")
# Alternatively from 10X:
# adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")
# adata.var_names_make_unique()

# Store raw counts before normalization
adata.layers["counts"] = adata.X.copy()

# Normalize to 10,000 counts per cell and log1p-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

print(f"Prepared: {adata.n_obs} cells x {adata.n_vars} genes")
# Sanity check: undoing log1p should give ~10,000 counts per cell
row_sums = np.asarray(np.expm1(adata.X[:5]).sum(axis=1)).ravel()
print(f"Back-transformed counts per cell (first 5): {row_sums.round(0)}")
```

### Step 3: Model Selection

Choose the model that best matches your tissue type and desired annotation resolution.
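Before comparing models by biology, a coarse sanity check is whether the query shares the model's gene namespace at all (for example gene symbols versus Ensembl IDs, or human versus mouse symbols). The sketch below is a minimal plain-Python illustration: `feature_overlap` is a hypothetical helper and the gene lists are toy values; in practice they would come from `model.features` and `adata.var_names`.

```python
def feature_overlap(model_genes, query_genes):
    """Fraction of a model's feature genes that are present in the query."""
    model_set = set(model_genes)
    if not model_set:
        return 0.0
    return len(model_set & set(query_genes)) / len(model_set)


# Toy, hypothetical gene lists standing in for model.features
candidates = {
    "model_a": ["CD3D", "CD4", "MS4A1", "NKG7"],
    "model_b": ["SFTPC", "AGER", "CD3D", "PECAM1"],
}
# Toy stand-in for adata.var_names
query_genes = ["CD3D", "CD4", "MS4A1", "LYZ"]

overlap = {name: feature_overlap(genes, query_genes)
           for name, genes in candidates.items()}
best = max(overlap, key=overlap.get)

for name, frac in overlap.items():
    print(f"{name}: {frac:.0%} of model genes found in query")
print(f"Highest overlap: {best}")
```

A near-zero overlap usually points to an identifier or species mismatch rather than a poor biological fit, so it is worth resolving before reading anything into the labels. The model table and label inspection for this step follow.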
```python
from celltypist import models

# Show the model table and filter it
models_df = models.models_description()

# Filter to immune models (case-insensitive match on the description)
immune_models = models_df[models_df["description"].str.contains("immune", case=False)]
print(immune_models[["model", "description"]].to_string())

# Load a specific model to inspect its cell type labels
model = models.Model.load("Immune_All_Low.pkl")
print(f"Model cell types ({len(model.cell_types)}):")
print(model.cell_types[:20])  # first 20 labels
```

**Available models (key selection guide):**

| Model | Cell Types | Best For |
|-------|-----------|---------|
| `Immune_All_Low.pkl` | 98 | Pan-immune with fine subtypes (e.g., MAIT, Tfh, cDC1) |
| `Immune_All_High.pkl` | 30 | Pan-immune major lineages (T, B, NK, monocyte, DC) |
| `Human_Lung_Atlas.pkl` | 61 | Lung: alveolar, stromal, immune, endothelial |
| `Pan_Fetal_Human.pkl` | 139 | Fetal human multi-organ development |
| `Developing_Human_Brain.pkl` | 51 | Brain development: progenitors, neurons, glia |
| `Human_Colorectal_Cancer.pkl` | 62 | Colorectal cancer cells + tumor microenvironment |

Cell type counts can differ between model versions; confirm them with `len(model.cell_types)` after loading.

### Step 4: Automated Annotation

Run `celltypist.annotate()` with `majority_voting=True` for cluster-level consensus labels alongside per-cell predictions.
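To make the cluster-level consensus concrete, here is a toy plain-Python sketch of the voting idea. It is an illustration only, not CellTypist's implementation: `majority_vote` is a hypothetical helper, the labels and cluster ids are made up, and the `"Heterogeneous"` fallback mirrors the documented `min_prop` behavior in simplified form.

```python
from collections import Counter, defaultdict


def majority_vote(labels, clusters, min_prop=0.0):
    """Return one consensus label per cell, given per-cell labels and cluster ids."""
    by_cluster = defaultdict(list)
    for label, cluster in zip(labels, clusters):
        by_cluster[cluster].append(label)

    consensus = {}
    for cluster, members in by_cluster.items():
        top_label, count = Counter(members).most_common(1)[0]
        # A cluster whose dominant label is too rare is flagged instead of named
        if count / len(members) < min_prop:
            top_label = "Heterogeneous"
        consensus[cluster] = top_label
    return [consensus[c] for c in clusters]


# Toy per-cell predictions and cluster assignments (hypothetical values)
labels = ["T", "T", "B", "B", "B", "NK"]
clusters = ["0", "0", "0", "1", "1", "1"]

voted = majority_vote(labels, clusters)
strict = majority_vote(labels, clusters, min_prop=0.7)
print(voted)
print(strict)
```

In this toy input the dominant label covers two of three cells in each cluster, so the default vote names both clusters while `min_prop=0.7` flags both. The actual CellTypist call for this step follows.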
```python
import celltypist
import scanpy as sc

# Majority voting needs cluster labels; compute Leiden clusters if missing
if "leiden" not in adata.obs:
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    sc.pp.pca(adata)
    sc.pp.neighbors(adata, n_pcs=30)
    sc.tl.leiden(adata, resolution=0.5, key_added="leiden")

# Run CellTypist annotation
predictions = celltypist.annotate(
    adata,
    model="Immune_All_Low.pkl",
    majority_voting=True,       # cluster-level consensus
    over_clustering="leiden",   # adata.obs column used for majority voting
    mode="best match",          # assign the single highest-probability label
    p_thres=0.5,                # only takes effect when mode="prob match"
)

# Inspect prediction object
print(type(predictions))  # celltypist.classifier.AnnotationResult
print(predictions.predicted_labels.head())
print(predictions.probability_matrix.shape)  # (n_cells, n_cell_types)
```

### Step 5: Results Integration

Transfer predictions back to the AnnData object and review confidence scores.

```python
# Merge predictions into adata.obs
adata = predictions.to_adata()

# Key result columns:
# adata.obs["predicted_labels"]: per-cell best-match label
# adata.obs["majority_voting"]: cluster-level consensus label
# adata.obs["conf_score"]: probability of the predicted label (0–1)
print(adata.obs[["predicted_labels", "majority_voting", "conf_score"]].head(10))

print("\nCell type distribution (majority voting):")
print(adata.obs["majority_voting"].value_counts().head(15))

# Flag low-confidence cells
low_conf = adata.obs["conf_score"] < 0.5
print(f"\nLow-confidence cells (conf_score < 0.5): {low_conf.sum()} ({low_conf.mean():.1%})")
adata.obs["high_conf"] = ~low_conf
```

### Step 6: Visualization and Validation

Plot predictions on UMAP, validate with canonical marker genes, and confirm annotation quality.
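Alongside the plots, a per-label confidence summary can point to which consensus labels deserve a closer look. The sketch below is a minimal plain-Python illustration: `median_conf_by_label` is a hypothetical helper, and the toy lists stand in for `adata.obs["majority_voting"]` and `adata.obs["conf_score"]`. The 0.5 cutoff is the same low-confidence threshold used elsewhere in this document.

```python
from collections import defaultdict
from statistics import median


def median_conf_by_label(labels, scores):
    """Median confidence score for each consensus label."""
    grouped = defaultdict(list)
    for label, score in zip(labels, scores):
        grouped[label].append(score)
    return {label: median(vals) for label, vals in grouped.items()}


# Toy values (hypothetical), one entry per cell
labels = ["T cells", "T cells", "T cells", "B cells", "B cells"]
scores = [0.95, 0.90, 0.85, 0.40, 0.30]

summary = median_conf_by_label(labels, scores)
# Labels whose median confidence falls below 0.5 are candidates for manual review
to_review = sorted(label for label, m in summary.items() if m < 0.5)

for label, m in summary.items():
    print(f"{label}: median conf_score = {m:.2f}")
print("Review:", to_review)
```

Labels flagged this way are good candidates for the marker-gene check. The plotting and marker-gene validation code follows.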
```python
import scanpy as sc
import matplotlib.pyplot as plt

# Compute UMAP if not already done (requires a neighbor graph from sc.pp.neighbors)
if "X_umap" not in adata.obsm:
    sc.tl.umap(adata)

# UMAP colored by annotation results
fig, axes = plt.subplots(1, 3, figsize=(21, 6))
sc.pl.umap(adata, color="majority_voting", legend_loc="on data",
           legend_fontsize=7, title="Majority Voting", ax=axes[0], show=False)
sc.pl.umap(adata, color="predicted_labels", legend_loc="right margin",
           legend_fontsize=7, title="Per-Cell Prediction", ax=axes[1], show=False)
sc.pl.umap(adata, color="conf_score", cmap="RdYlGn",
           title="Confidence Score", ax=axes[2], show=False)
plt.tight_layout()
plt.savefig("celltypist_annotation.png", dpi=150, bbox_inches="tight")
plt.show()
print("Saved celltypist_annotation.png")

# Validate with canonical immune markers
marker_genes = {
    "CD4+ T": ["CD3D", "CD4", "IL7R"],
    "CD8+ T": ["CD3D", "CD8A", "GZMK"],
    "B cells": ["MS4A1", "CD79A"],
    "NK cells": ["GNLY", "NKG7"],
    "CD14 Mono": ["CD14", "LYZ"],
}
sc.pl.dotplot(adata, var_names=marker_genes, groupby="majority_voting",
              use_raw=False, standard_scale="var",
              save="_celltypist_markers.png")
```

## Key Parameters

| Parameter | Default | Range / Options | Effect |
|-----------|---------|-----------------|--------|
| `model` | `None` | Model name (e.g. `Immune_All_Low.pkl`), path to a `.pkl` file, or a loaded `Model` | Selects the reference for annotation; must match tissue/species. If omitted, CellTypist falls back to `Immune_All_Low.pkl` |
| `majority_voting` | `False` | `True`, `False` | When `True`, adds a cluster-consensus label per cell, using the clustering given by `over_clustering` |
| `over_clustering` | `None` | An `adata.obs` key (e.g. `"leiden"`, `"louvain"`) or one cluster label per cell | Clustering used for majority voting; if `None`, CellTypist over-clusters the data itself |
| `p_thres` | `0.5` | `0.0`–`1.0` | Probability threshold used in `"prob match"` mode; cells with no class above it are labeled `"Unassigned"` |
| `mode` | `"best match"` | `"best match"`, `"prob match"` | `"best match"`: top label, ignoring `p_thres`; `"prob match"`: every label whose probability exceeds `p_thres` (none, one, or several) |
| `min_prop` | `0.0` | `0.0`–`1.0` | For majority voting: minimum fraction of cluster cells that must carry the dominant label; clusters below it are labeled `"Heterogeneous"` |

## Key Concepts

### Pre-Trained Model Architecture

Each CellTypist model is a one-vs-rest logistic regression classifier trained on a curated cell atlas. Key properties:

- **Input**: expression values for the model's feature genes (listed in `model.features`); feature genes missing from the query are left out of the prediction
- **Output**: per-cell probability vector over all cell type classes; highest probability is the predicted label
- **Confidence score**: the probability assigned to the winning class (0–1); high values (>0.7) generally indicate reliable predictions
- **Species/version specificity**: models are trained on specific atlases; using a human model on mouse data will produce spurious results

### Majority Voting

Majority voting adds a cluster-level refinement after per-cell prediction:

1. Each cell receives a per-cell label from the logistic regression output
2. Cells are grouped into clusters: those given by `over_clustering` (e.g., Leiden clusters), or CellTypist's own over-clustering when it is `None`
3. Within each cluster, the most frequent per-cell label becomes the `majority_voting` label for every cell in it; if that label's share falls below `min_prop`, the cluster is labeled `"Heterogeneous"` instead

The per-cell `predicted_labels` column is kept unchanged alongside the consensus. Majority voting is recommended when individual cells have noisy expression but the cluster is biologically coherent. Disable it when cells within a cluster are biologically heterogeneous (e.g., transitional states).

### Gene Space Alignment

CellTypist automatically intersects the model's feature genes with the input AnnData's gene names and predicts from the shared genes only, so model genes absent from the query contribute nothing. As a rule of thumb, annotations degrade if fewer than ~60% of model genes are present; compare `model.features` with `adata.var_names` to check.

## Common Recipes

### Recipe: Train a Custom Model

When to use: your tissue or species is not covered by an existing model, and you have a labeled reference dataset.
```python
import celltypist
import scanpy as sc

# Load labeled reference AnnData (must be normalized + log1p)
ref = sc.read_h5ad("labeled_reference.h5ad")
# ref.obs["cell_type"] must contain string cell type labels

# Train custom model
new_model = celltypist.train(
    ref,
    labels="cell_type",      # obs column with training labels
    n_jobs=4,                # parallel workers
    max_iter=200,            # logistic regression iterations
    use_SGD=False,           # full (non-SGD) logistic regression for the final fit (fine for <100k cells)
    feature_selection=True,  # needed for top_genes to take effect
    top_genes=500,           # number of most informative genes kept per class
)

# Save for reuse
new_model.write("custom_tissue_model.pkl")
print(f"Trained model: {len(new_model.cell_types)} cell types")

# Apply to a query dataset (normalized + log1p, same gene naming as the reference)
query_adata = sc.read_h5ad("query.h5ad")  # placeholder path
predictions = celltypist.annotate(query_adata, model="custom_tissue_model.pkl",
                                  majority_voting=True)
```

### Recipe: Multi-Model Comparison

When to use: uncertain which model best matches your dataset; run multiple models and compare how their labels line up.

```python
import celltypist
import pandas as pd

model_names = ["Immune_All_High.pkl", "Immune_All_Low.pkl", "Human_Lung_Atlas.pkl"]
results = {}

for model_name in model_names:
    preds = celltypist.annotate(adata, model=model_name, majority_voting=True)
    adata_tmp = preds.to_adata()
    key = model_name.replace(".pkl", "")
    results[key] = adata_tmp.obs["majority_voting"].values

comparison = pd.DataFrame(results, index=adata.obs_names)

# The two immune models use different label vocabularies (coarse vs. fine),
# so cross-tabulate them rather than testing the labels for equality
crosstab = pd.crosstab(comparison["Immune_All_High"], comparison["Immune_All_Low"])
print("Immune_All_High (rows) vs Immune_All_Low (columns):")
print(crosstab)
print(comparison.head(10))
```

### Recipe: Export Annotations for Downstream Analysis

When to use: saving annotated data with all prediction metadata for downstream differential expression or trajectory analysis.
```python
import scanpy as sc
import pandas as pd

# Save full annotated AnnData
adata.write_h5ad("annotated_celltypist.h5ad", compression="gzip")
print(f"Saved annotated_celltypist.h5ad ({adata.n_obs} cells)")

# Export cell type table (skip columns that are not present, e.g. "leiden")
wanted = ["predicted_labels", "majority_voting", "conf_score", "leiden"]
cell_table = adata.obs[[c for c in wanted if c in adata.obs.columns]].copy()
cell_table.to_csv("celltypist_annotations.csv")

# Cell type proportions per sample
if "sample" in adata.obs.columns:
    props = (adata.obs.groupby(["sample", "majority_voting"], observed=True)
             .size().unstack(fill_value=0))
    props_norm = props.div(props.sum(axis=1), axis=0)
    props_norm.to_csv("celltypist_proportions.csv")
    print(f"Cell type proportions saved (shape: {props_norm.shape})")
```

## Expected Outputs

| Output | Description |
|--------|-------------|
| `adata.obs["predicted_labels"]` | Per-cell best-match label from logistic regression |
| `adata.obs["majority_voting"]` | Cluster-consensus label (when `majority_voting=True`) |
| `adata.obs["conf_score"]` | Probability of the predicted label (0–1); `>0.5` = confident |
| `adata.obsm["X_umap"]` | UMAP embedding (if computed in preprocessing step) |
| `celltypist_annotation.png` | UMAP panels: majority voting label, per-cell label, confidence scores |
| `celltypist_annotations.csv` | Per-cell annotation table with predicted labels and confidence |

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `ValueError` or warning about an invalid expression matrix in `adata.X` | Raw counts (or data not normalized to 10,000 counts per cell) passed directly | Run `sc.pp.normalize_total(adata, target_sum=1e4)` then `sc.pp.log1p(adata)` before calling `celltypist.annotate()` |
| Many cells labeled `"Unassigned"` (with `mode="prob match"`) | `p_thres` too high or model species mismatch | Lower `p_thres` to `0.3`; verify model matches species and tissue; check `conf_score` distribution |
| `KeyError` for `over_clustering` key | Clustering column name not found in `adata.obs` | Run `sc.tl.leiden(adata, key_added="leiden")` first, then pass `over_clustering="leiden"`; or omit `over_clustering` to let CellTypist cluster the data itself |
| Implausible labels (e.g., immune labels on neurons) | Wrong model selected for tissue | Choose a tissue-specific model (e.g., `Developing_Human_Brain.pkl` for brain data); list options with `models.models_description()` |
| `MemoryError` on large datasets (>500k cells) | Full probability matrix held in RAM | Subsample to 200k cells for annotation, then transfer labels via KNN; alternatively, annotate the cells in batches with `majority_voting=False`, since per-cell predictions do not depend on the other cells |
| Low overall `conf_score` (<0.4 median) | Dataset is poorly represented by the reference model | Train a custom model from a matched reference or use `popv-cell-annotation` for ensemble voting |
| `Model not found` error on download | Network issue or wrong model name | Run `models.download_models(force_update=True)`; verify name with `models.models_description()["model"].tolist()` |

## Related Skills

- **scanpy-scrna-seq**: preprocessing pipeline (QC, normalization, clustering) that produces the AnnData input for CellTypist
- **popv-cell-annotation**: ensemble annotation using 10+ methods; use when you want consensus across methods rather than a single model
- **scvi-tools-single-cell**: scANVI for semi-supervised label transfer with deep generative models and probabilistic uncertainty
- **harmony-batch-correction**: batch correction to apply before annotation when integrating multiple samples

## References

- [CellTypist documentation](https://celltypist.readthedocs.io/): official API reference, model descriptions, and tutorials
- [GitHub: Teichlab/celltypist](https://github.com/Teichlab/celltypist): source code and issue tracker
- [Domínguez Conde et al., Science 2022](https://doi.org/10.1126/science.abl5197): "Cross-tissue immune cell analysis reveals tissue-specific features in humans", original CellTypist paper
- [CellTypist model portal](https://www.celltypist.org/models): interactive model browser with cell type hierarchies and training dataset details