--- name: "popv-cell-annotation" description: "Consensus cell type annotation: runs 10+ algorithms (KNN-Harmony/BBKNN/Scanorama/scVI, CellTypist, ONCLASS, Random Forest, SCANVI, SVM, XGBoost) on a labeled reference and transfers labels via majority voting. Outputs per-method labels, consensus, agreement score. Use when single-method annotation is insufficient or you need ensemble uncertainty for novel states." license: "BSD-3-Clause" --- # popV Multi-Method Cell Type Transfer ## Overview popV (Population Voting for single-cell annotation) annotates a query scRNA-seq dataset by running 10+ independent classification algorithms against a labeled reference atlas and aggregating results via majority voting. Each method produces its own label; the final `popv_prediction` is the consensus across all methods, and the `popv_agreement` score quantifies how many methods agree. This ensemble strategy is robust to individual method failures on unusual datasets and provides a principled uncertainty estimate: low agreement highlights novel cell states or annotation gaps. ## When to Use - Annotating a query dataset by transferring labels from a well-curated reference atlas when you want a consensus rather than a single model's judgment - Identifying novel or ambiguous cell states as cells where methods disagree (low `popv_agreement` score) - Benchmarking annotation reliability by comparing per-method labels to detect systematic disagreements - Annotating large atlas datasets (>100k cells) where batch effects between reference and query are substantial - Producing annotation for downstream analyses that require high-confidence labels (clinical data, regulatory submissions) - Use **CellTypist** (celltypist-cell-annotation) instead when speed matters and a pre-trained model matches your tissue; popV is slower because it trains multiple models on your reference - Use **scANVI** (scvi-tools-single-cell) instead when you need a single probabilistic deep generative model with formal uncertainty quantification and do not require the ensemble ## Prerequisites - **Python packages**: `popv>=0.6`, `scanpy>=1.9`, `anndata`, `scvi-tools>=1.0`, `harmonypy`, `bbknn`, `celltypist` - **Data requirements**: Two AnnData objects — a labeled reference (`adata_ref`) with cell type labels in `obs`, and an unlabeled query (`adata_query`). Both must be from the same species and have overlapping gene sets. Raw counts in `adata.X` (popV applies its own normalization internally) - **Environment**: Python 3.9+; GPU recommended for scVI/SCANVI methods (falls back to CPU); 32 GB RAM recommended for >200k reference cells ```bash pip install popv scvi-tools harmonypy bbknn celltypist ``` ## Quick Start Minimal pipeline from labeled reference and unlabeled query to annotated result: ```python import popv import scanpy as sc # Load reference (labeled) and query (unlabeled) AnnData objects adata_ref = sc.read_h5ad("reference_atlas.h5ad") # adata_ref.obs["cell_type"] exists adata_query = sc.read_h5ad("query_dataset.h5ad") # Prepare combined object with popV preprocessing adata = popv.preprocessing.Process_Query( adata_ref, adata_query, ref_labels_key="cell_type", ref_batch_key="batch", query_batch_key="batch", unknown_celltype_label="unknown", save_path_trained_models="./popv_models/", n_epochs_unsupervised=50, ) # Run all annotation methods popv.annotation.annotate_data(adata) # Inspect consensus results for query cells query_mask = adata.obs["_dataset"] == "query" print(adata[query_mask].obs[["popv_prediction", "popv_agreement"]].head(10)) ``` ## Core API ### Module 1: Reference and Query Data Setup Both AnnData objects must share a gene space and have required metadata columns. popV will subset to the intersection of genes automatically. ```python import anndata as ad import scanpy as sc import numpy as np # Reference: must have cell type labels and (optionally) batch metadata adata_ref = sc.read_h5ad("reference_atlas.h5ad") print(f"Reference: {adata_ref.n_obs} cells x {adata_ref.n_vars} genes") print(f"Cell types: {adata_ref.obs['cell_type'].nunique()} unique labels") print(f"Reference cell type counts:\n{adata_ref.obs['cell_type'].value_counts().head(10)}") # Query: no labels required; batch metadata optional adata_query = sc.read_h5ad("query_dataset.h5ad") print(f"\nQuery: {adata_query.n_obs} cells x {adata_query.n_vars} genes") # Check gene overlap (popV will handle subsetting but >70% overlap is recommended) shared_genes = adata_ref.var_names.intersection(adata_query.var_names) pct_shared = len(shared_genes) / adata_ref.n_vars print(f"\nShared genes: {len(shared_genes)} ({pct_shared:.1%} of reference genes)") if pct_shared < 0.5: print("WARNING: <50% gene overlap — annotation quality may be reduced") ``` ```python # Verify required fields before popV setup assert "cell_type" in adata_ref.obs.columns, "Reference needs cell type labels" # Add batch column if absent (popV requires it even for single-batch data) if "batch" not in adata_ref.obs.columns: adata_ref.obs["batch"] = "ref_batch" if "batch" not in adata_query.obs.columns: adata_query.obs["batch"] = "query_batch" print("Reference obs columns:", adata_ref.obs.columns.tolist()) print("Query obs columns: ", adata_query.obs.columns.tolist()) ``` ### Module 2: POPV Object Creation (Process_Query) `Process_Query` combines reference and query, normalizes counts, selects HVGs, and prepares the joint embedding needed by all annotation methods. ```python import popv # Create processed combined AnnData adata = popv.preprocessing.Process_Query( adata_ref, adata_query, ref_labels_key="cell_type", # obs column with reference labels ref_batch_key="batch", # obs column with reference batch info query_batch_key="batch", # obs column with query batch info unknown_celltype_label="unknown",# label to use for query cells before annotation save_path_trained_models="./popv_models/", # directory for scVI/SCANVI model checkpoints n_epochs_unsupervised=50, # scVI training epochs (increase to 100–200 for large datasets) n_epochs_semisupervised=20, # scANVI fine-tuning epochs use_gpu=True, # GPU for scVI/SCANVI (falls back to CPU if unavailable) hvg=4000, # number of highly variable genes to use ) print(f"Combined object: {adata.n_obs} cells x {adata.n_vars} genes") print(f"Dataset labels: {adata.obs['_dataset'].value_counts().to_dict()}") # Expected: {'ref': N_ref, 'query': N_query} ``` ### Module 3: Running the Method Ensemble `annotate_data` runs all selected methods sequentially and adds per-method label columns plus the consensus to `adata.obs`. ```python import popv # Run annotation with default set of methods popv.annotation.annotate_data( adata, methods=[ "knn_harmony", # KNN on Harmony-corrected embedding "knn_bbknn", # KNN on BBKNN cross-batch graph "knn_scvi", # KNN on scVI latent space "scanvi_popv", # Semi-supervised scANVI label transfer "celltypist_popv",# CellTypist logistic regression "rf", # Random Forest on HVG expression "xgboost", # XGBoost classifier "svm", # Support Vector Machine "onclass", # ONCLASS (ontology-guided) ], ) # Inspect per-method result columns (all end in "_popv") query_mask = adata.obs["_dataset"] == "query" popv_cols = adata.obs.filter(like="_popv").columns.tolist() print(f"Per-method columns: {popv_cols}") print(adata[query_mask].obs[popv_cols + ["popv_prediction", "popv_agreement"]].head(10)) ``` ### Module 4: Consensus Results and Agreement Scoring `popv_prediction` is the majority-vote consensus; `popv_agreement` is the fraction of methods that agreed on the winning label. ```python import pandas as pd query_mask = adata.obs["_dataset"] == "query" query_obs = adata[query_mask].obs.copy() # Consensus label distribution print("Consensus cell type distribution:") print(query_obs["popv_prediction"].value_counts().head(15)) # Agreement score statistics print(f"\npopv_agreement statistics:") print(query_obs["popv_agreement"].describe()) # agreement = 1.0 → all methods agree; agreement = 0.2 → only 2/10 methods agree # Cells with high confidence (>80% method agreement) high_conf = query_obs["popv_agreement"] >= 0.8 print(f"\nHigh-confidence cells (agreement >= 0.8): {high_conf.sum()} ({high_conf.mean():.1%})") # Cells with low confidence — candidate novel states or annotation gaps low_conf = query_obs["popv_agreement"] < 0.5 print(f"Low-confidence cells (agreement < 0.5): {low_conf.sum()} ({low_conf.mean():.1%})") ``` ### Module 5: Visualization popV provides built-in UMAP and heatmap visualization of per-method agreement and consensus labels. ```python import popv import scanpy as sc import matplotlib.pyplot as plt # Compute UMAP on the joint reference+query embedding (if not already present) if "X_umap" not in adata.obsm: sc.tl.umap(adata) # popV built-in visualization: UMAP panel showing consensus + agreement popv.visualization.predict_celltypes_umap( adata, save="popv_annotation_umap.png", ) print("Saved popv_annotation_umap.png") # Custom UMAP panels fig, axes = plt.subplots(1, 3, figsize=(21, 6)) sc.pl.umap(adata, color="popv_prediction", ax=axes[0], title="popV Consensus", legend_loc="on data", legend_fontsize=6, show=False) sc.pl.umap(adata, color="popv_agreement", ax=axes[1], cmap="RdYlGn", vmin=0, vmax=1, title="Method Agreement Score", show=False) sc.pl.umap(adata, color="_dataset", ax=axes[2], title="Reference vs Query", show=False) plt.tight_layout() plt.savefig("popv_custom_umap.png", dpi=150, bbox_inches="tight") print("Saved popv_custom_umap.png") ``` ## Key Concepts ### Method Ensemble and Majority Voting popV runs each method independently; the final prediction is determined by plurality vote across all methods. The `popv_agreement` score equals the fraction of methods that voted for the winning label (e.g., 0.7 = 7/10 methods agreed). This design has several properties: - **Robustness**: if one method fails or produces outlier labels, the consensus is unaffected if the remaining methods agree - **Uncertainty signal**: low agreement does not mean the annotation is wrong — it often flags biologically interesting cells (transitional states, rare populations) that differ from all reference cell types - **Method independence**: KNN-based methods depend on the embedding quality; tree-based methods (RF, XGBoost) work directly on expression; SVM works in feature space; CellTypist uses a separate logistic regression. Together they span multiple algorithmic families ### Method Comparison | Method | Batch Correction | Speed | Best For | |--------|-----------------|-------|---------| | `knn_harmony` | Harmony | Fast | Moderate batch effects, large datasets | | `knn_bbknn` | BBKNN | Fast | Diverse multi-tissue references | | `knn_scanorama` | Scanorama | Fast | Multiple heterogeneous batches | | `knn_scvi` | scVI VAE | Medium | Complex batch effects, probabilistic embedding | | `scanvi_popv` | scVI+labels | Slow | Semi-supervised; most accurate when reference is clean | | `celltypist_popv` | None (logistic) | Fast | Immune cells; works well without batch correction | | `rf` | None | Medium | Balanced class distributions; interpretable feature importance | | `xgboost` | None | Medium | High-confidence predictions on well-separated cell types | | `svm` | None | Medium | High-dimensional gene expression; linear boundaries | | `onclass` | None | Medium | Ontology-aware; handles unseen cell types via CL ontology | ### ONCLASS and Ontology-Aware Annotation ONCLASS uses the Cell Ontology (CL) to represent cell types as nodes in a knowledge graph and predict unseen cell types by propagating similarity through the ontology. Unlike other methods, ONCLASS can predict a cell type that was not present in the training reference if it is ontologically adjacent to known types. Enable it by including `"onclass"` in the methods list. ### Reference Quality Requirements popV annotation quality scales directly with reference quality: - **Minimum cell count per type**: 50–100 cells per label; rare types with <20 cells may be missed by KNN methods - **Balanced representation**: highly imbalanced references (one type is 80% of cells) cause tree methods to be biased toward the majority class - **Label granularity**: coarse labels (10 types) annotate reliably; fine-grained labels (100+ types) require a larger, matched reference ## Common Workflows ### Workflow 1: Standard Reference-Query Annotation **Goal**: Annotate an unlabeled query dataset using a curated reference atlas end-to-end. ```python import popv import scanpy as sc import pandas as pd # 1. Load data adata_ref = sc.read_h5ad("reference_atlas.h5ad") # has obs["cell_type"] and obs["batch"] adata_query = sc.read_h5ad("query_dataset.h5ad") # no cell type labels if "batch" not in adata_query.obs.columns: adata_query.obs["batch"] = "query" # 2. Preprocess: build joint normalized object adata = popv.preprocessing.Process_Query( adata_ref, adata_query, ref_labels_key="cell_type", ref_batch_key="batch", query_batch_key="batch", unknown_celltype_label="unknown", save_path_trained_models="./popv_models/", n_epochs_unsupervised=100, n_epochs_semisupervised=30, use_gpu=True, hvg=4000, ) print(f"Prepared: {adata.n_obs} total cells") # 3. Run ensemble annotation popv.annotation.annotate_data(adata) # 4. Extract query results query_mask = adata.obs["_dataset"] == "query" query_annotations = adata[query_mask].obs[[ "popv_prediction", "popv_agreement", "knn_harmony_popv", "scanvi_popv", "rf_popv", "xgboost_popv" ]].copy() # 5. Transfer back to original query object adata_query.obs = adata_query.obs.join( query_annotations, how="left" ) print(f"Annotated {query_mask.sum()} query cells") print(query_annotations["popv_prediction"].value_counts().head(10)) # 6. Save annotated query adata_query.write_h5ad("annotated_query.h5ad", compression="gzip") query_annotations.to_csv("popv_annotations.csv") print("Saved annotated_query.h5ad and popv_annotations.csv") ``` ### Workflow 2: Confidence Filtering and Novel Cell State Detection **Goal**: Separate high-confidence annotations from ambiguous cells; flag candidate novel or transitional states for manual review. ```python import popv import scanpy as sc import pandas as pd import matplotlib.pyplot as plt # Assume adata has been annotated (as in Workflow 1) query_mask = adata.obs["_dataset"] == "query" query_obs = adata[query_mask].obs.copy() # Tier cells by agreement score bins = [0.0, 0.5, 0.8, 1.01] labels = ["low (<0.5)", "medium (0.5–0.8)", "high (≥0.8)"] query_obs["confidence_tier"] = pd.cut( query_obs["popv_agreement"], bins=bins, labels=labels, right=False ) print("Cells per confidence tier:") print(query_obs["confidence_tier"].value_counts()) # High-confidence subset: use popv_prediction directly high_conf_mask = query_obs["popv_agreement"] >= 0.8 print(f"\nHigh-confidence annotations ({high_conf_mask.mean():.1%} of query cells):") print(query_obs[high_conf_mask]["popv_prediction"].value_counts().head(10)) # Low-confidence subset: inspect per-method disagreement low_conf = query_obs[query_obs["popv_agreement"] < 0.5] popv_method_cols = [c for c in query_obs.columns if c.endswith("_popv") and c not in ("popv_prediction", "popv_agreement")] print(f"\nLow-confidence cells sample (showing per-method labels):") print(low_conf[popv_method_cols + ["popv_prediction"]].head(10).to_string()) # Visualize agreement distribution fig, axes = plt.subplots(1, 2, figsize=(14, 5)) query_obs["popv_agreement"].hist(bins=20, ax=axes[0], color="steelblue", edgecolor="white") axes[0].axvline(0.8, color="red", linestyle="--", label="High-confidence threshold") axes[0].set_xlabel("Method Agreement Score") axes[0].set_ylabel("Cell Count") axes[0].set_title("popV Agreement Distribution") axes[0].legend() query_obs["confidence_tier"].value_counts().plot.bar(ax=axes[1], color="steelblue") axes[1].set_title("Cells by Confidence Tier") axes[1].set_xlabel("Confidence Tier") axes[1].set_ylabel("Cell Count") plt.tight_layout() plt.savefig("popv_confidence_distribution.png", dpi=150, bbox_inches="tight") print("Saved popv_confidence_distribution.png") ``` ## Key Parameters | Parameter | Module | Default | Range / Options | Effect | |-----------|--------|---------|-----------------|--------| | `ref_labels_key` | Process_Query | — | Any `obs` column | Column in `adata_ref.obs` containing training cell type labels | | `n_epochs_unsupervised` | Process_Query | `50` | `20`–`500` | scVI training epochs; increase for better embedding on large/complex datasets | | `n_epochs_semisupervised` | Process_Query | `20` | `10`–`100` | scANVI fine-tuning epochs on top of scVI | | `hvg` | Process_Query | `4000` | `2000`–`8000` | Highly variable genes used for embedding and KNN methods | | `use_gpu` | Process_Query | `True` | `True`, `False` | GPU acceleration for scVI/SCANVI; falls back to CPU automatically if no GPU | | `methods` | annotate_data | all | List of method names | Subset of methods to run; excluding slow methods (scanvi, onclass) speeds up pipeline | | `unknown_celltype_label` | Process_Query | `"unknown"` | Any string | Label assigned to query cells before annotation; used to separate reference labels from query | | `popv_agreement` | (output) | — | `0.0`–`1.0` | Fraction of methods agreeing on consensus label; `>=0.8` recommended for high confidence | ## Best Practices 1. **Check gene overlap before running**: popV performs best with >70% gene overlap between reference and query. If overlap is <50%, annotation quality degrades significantly — consider using a different reference or imputing missing genes. ```python shared = adata_ref.var_names.intersection(adata_query.var_names) print(f"Gene overlap: {len(shared) / adata_ref.n_vars:.1%}") ``` 2. **Use raw counts as input**: pass raw (un-normalized) counts in `adata.X` to `Process_Query`. popV internally applies its own normalization. Pre-normalized data can distort the scVI/SCANVI latent space. 3. **Match reference granularity to query biology**: if your query contains subtypes not in the reference, no method will correctly assign them — they will appear as low-agreement cells. Either add them to the reference or accept that the consensus will assign the nearest parent type. 4. **Exclude slow methods when speed matters**: `scanvi_popv` and `onclass` are the slowest. For a quick first-pass, run only `knn_harmony`, `knn_bbknn`, `rf`, `xgboost`, and `celltypist_popv`. ```python popv.annotation.annotate_data(adata, methods=["knn_harmony", "knn_bbknn", "rf", "xgboost", "celltypist_popv"]) ``` 5. **Save trained models for repeated queries**: `Process_Query` stores scVI/SCANVI models in `save_path_trained_models`. Reuse these when annotating additional query batches against the same reference to avoid retraining. ## Common Recipes ### Recipe: Subset to High-Confidence Annotations Only When to use: downstream analyses (DE, trajectory) require clean labels; exclude ambiguous cells. ```python import scanpy as sc # Annotate as in Workflow 1 first query_mask = adata.obs["_dataset"] == "query" adata_query_annotated = adata[query_mask].copy() # Keep only high-confidence cells high_conf = adata_query_annotated[adata_query_annotated.obs["popv_agreement"] >= 0.8].copy() print(f"High-confidence cells: {high_conf.n_obs} / {adata_query_annotated.n_obs} " f"({high_conf.n_obs/adata_query_annotated.n_obs:.1%})") print(high_conf.obs["popv_prediction"].value_counts()) # Recompute UMAP on high-confidence subset for visualization sc.pp.neighbors(high_conf, use_rep="X_scVI") # use scVI embedding stored by popV sc.tl.umap(high_conf) sc.pl.umap(high_conf, color="popv_prediction", save="_high_conf_celltypes.png") ``` ### Recipe: Per-Method Label Comparison Heatmap When to use: understanding where methods disagree to identify systematic biases or novel populations. ```python import pandas as pd import matplotlib.pyplot as plt import seaborn as sns query_mask = adata.obs["_dataset"] == "query" query_obs = adata[query_mask].obs.copy() # Collect per-method columns method_cols = [c for c in query_obs.columns if c.endswith("_popv") and c not in ("popv_prediction", "popv_agreement")] # Cross-tabulate two key methods ct = pd.crosstab( query_obs["knn_harmony_popv"], query_obs["scanvi_popv"], margins=False, ) # Normalize rows ct_norm = ct.div(ct.sum(axis=1), axis=0) plt.figure(figsize=(12, 10)) sns.heatmap(ct_norm, cmap="Blues", vmin=0, vmax=1, xticklabels=True, yticklabels=True, cbar_kws={"label": "Fraction of cells"}) plt.title("knn_harmony vs scanvi label agreement") plt.xlabel("SCANVI label") plt.ylabel("KNN-Harmony label") plt.tight_layout() plt.savefig("popv_method_agreement_heatmap.png", dpi=150) print("Saved popv_method_agreement_heatmap.png") ``` ### Recipe: Fast Annotation Without Deep Learning Methods When to use: quick annotation without GPU or when scVI/SCANVI training is prohibitively slow (>500k cells). ```python import popv # Process without training deep generative models (scVI not needed for KNN-Harmony) adata = popv.preprocessing.Process_Query( adata_ref, adata_query, ref_labels_key="cell_type", ref_batch_key="batch", query_batch_key="batch", unknown_celltype_label="unknown", save_path_trained_models="./popv_models/", n_epochs_unsupervised=0, # skip scVI training n_epochs_semisupervised=0, # skip scANVI training use_gpu=False, hvg=3000, ) # Run only fast non-DL methods popv.annotation.annotate_data( adata, methods=["knn_harmony", "knn_bbknn", "knn_scanorama", "rf", "xgboost", "svm", "celltypist_popv"], ) query_mask = adata.obs["_dataset"] == "query" print(adata[query_mask].obs[["popv_prediction", "popv_agreement"]].describe()) ``` ## Troubleshooting | Problem | Cause | Solution | |---------|-------|----------| | `KeyError: ref_labels_key not in adata_ref.obs` | Reference lacks a cell type column | Verify the column name: `print(adata_ref.obs.columns.tolist())`; update `ref_labels_key` accordingly | | Gene space mismatch error | Reference and query have very few shared genes | Check `adata_ref.var_names.intersection(adata_query.var_names)`; if <50% overlap, use a different reference or match gene panels | | CUDA out-of-memory for scVI/SCANVI | GPU VRAM insufficient for batch size | Set `use_gpu=False` or reduce `n_epochs_unsupervised`; scVI falls back to CPU automatically on most systems | | `onclass_popv` failures on small datasets | ONCLASS requires sufficient label coverage | Remove `"onclass"` from the methods list when reference has <10 cell types or <500 cells per type | | Very slow annotation (>2 hours) | scVI/SCANVI training on large reference | Subsample reference to 50k cells per type; exclude `"scanvi_popv"` and `"onclass"` from methods | | All cells receive same consensus label | Reference highly imbalanced toward one type | Balance reference by subsampling the dominant type or upsampling rare types before running popV | | `popv_agreement` is 0 for many cells | Many methods returning different labels | Inspect per-method columns; consider whether reference covers the query biology; add methods or retrain with a better reference | ## Related Skills - **celltypist-cell-annotation** — single-model annotation with pre-trained logistic regression; faster but lacks ensemble uncertainty - **scanpy-scrna-seq** — preprocessing pipeline (QC, normalization, clustering) that produces AnnData inputs for popV - **scvi-tools-single-cell** — scANVI for probabilistic label transfer with a single deep generative model; use when you prefer a formal variational framework over ensemble voting - **harmony-batch-correction** — Harmony embedding used by `knn_harmony` method internally; understand it to tune popV's KNN-based methods ## References - [GitHub: YosefLab/popV](https://github.com/YosefLab/popV) — official source code, installation instructions, and example notebooks - [popV documentation](https://popv.readthedocs.io/) — API reference and tutorials - [Ergen et al., bioRxiv 2023](https://doi.org/10.1101/2023.06.15.545239) — "Population-level integration of single-cell datasets enables multi-scale analysis across samples", original popV preprint - [ONCLASS paper — Wang et al., Nature Methods 2021](https://doi.org/10.1038/s41592-021-01080-9) — ontology-aware cell type classification underlying the ONCLASS method in popV