---
name: single-popv-annotation
title: PopV population-level cell type annotation
description: "PopV population-level cell annotation: 10 algorithms (SCVI, SCANVI, CellTypist, OnClass, RF, SVM, XGBoost, BBKNN, HARMONY, SCANORAMA), consensus voting, pretrained hub models."
---

# PopV Population-Level Cell Type Annotation

PopV (Population Voting) annotates cell types by running up to 10 classification algorithms and aggregating predictions via majority voting. Unlike single-method annotation (SCSA, MetaTiME, CellTypist alone), PopV produces a consensus prediction that is more robust to individual algorithm failures. The module also supports ontology-aware voting via the Cell Ontology (CL) for hierarchical label resolution.
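
The voting idea can be illustrated with a minimal sketch. This is plain majority voting over hypothetical per-method labels, not PopV's actual implementation; the helper name `majority_vote` and the tie-breaking rule (first label encountered wins) are assumptions made for illustration.

```python
from collections import Counter

def majority_vote(per_method_labels):
    """Return (winning_label, n_votes) for one cell.

    Ties are broken by first occurrence, which is an assumption of this
    sketch rather than documented PopV behavior.
    """
    counts = Counter(per_method_labels)
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes

# Hypothetical predictions from four methods for three cells
predictions = {
    "cell_1": ["T cell", "T cell", "B cell", "T cell"],
    "cell_2": ["B cell", "B cell", "B cell", "B cell"],
    "cell_3": ["NK cell", "T cell", "NK cell", "monocyte"],
}
consensus = {cell: majority_vote(labels) for cell, labels in predictions.items()}
```

The vote count returned alongside the label plays the same role as an agreement score: a low count flags cells where the methods disagree.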

## Defensive Validation

```python
# Before PopV: verify reference has the cell type column
ref_labels_key = 'cell_type'  # column in ref_adata.obs holding the reference labels
assert ref_labels_key in ref_adata.obs.columns, \
    f"ref_adata.obs['{ref_labels_key}'] not found. Available: {list(ref_adata.obs.columns)}"

# Verify no NaN in reference labels
assert ref_adata.obs[ref_labels_key].notna().all(), \
    f"NaN values in ref_adata.obs['{ref_labels_key}']. Use fillna() or drop these cells."

# Verify gene overlap
overlap = query_adata.var_names.intersection(ref_adata.var_names)
assert len(overlap) > 100, \
    f"Only {len(overlap)} overlapping genes between query and reference. Check var_names format (ENSEMBL vs symbol)."
```
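
When the overlap check fails, the usual cause is that one object uses ENSEMBL IDs and the other uses gene symbols. The sketch below guesses the identifier style from the names alone. The helper names, the regular expression, and the 0.9 threshold are illustrative assumptions, not part of omicverse.

```python
import re

# Matches Ensembl gene IDs such as ENSG00000141510 or ENSMUSG00000059552,
# with an optional version suffix (for example ".12").
ENSEMBL_GENE_RE = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")

def ensembl_fraction(gene_names):
    """Fraction of names that look like Ensembl gene IDs."""
    names = list(gene_names)
    if not names:
        return 0.0
    hits = sum(1 for name in names if ENSEMBL_GENE_RE.match(str(name)))
    return hits / len(names)

def id_style(gene_names, threshold=0.9):
    """Classify names as 'ensembl', 'symbol', or 'mixed' (threshold is illustrative)."""
    frac = ensembl_fraction(gene_names)
    if frac >= threshold:
        return "ensembl"
    if frac <= 1 - threshold:
        return "symbol"
    return "mixed"

# Hypothetical var_names from a query and a reference
query_style = id_style(["ENSG00000141510", "ENSG00000012048.23", "ENSG00000139618"])
ref_style = id_style(["TP53", "BRCA1", "BRCA2"])
```

In practice, pass `query_adata.var_names` and `ref_adata.var_names`; if the two styles differ, harmonize them before calling `Process_Query`.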

## Stage 1: Data Preparation

```python
import omicverse as ov

# Process_Query preprocesses and concatenates query + reference
process_obj = ov.popv.Process_Query(
    query_adata=query_adata,
    ref_adata=ref_adata,
    ref_labels_key='cell_type',           # REQUIRED: column in ref_adata.obs
    ref_batch_key='batch',                # batch column in ref_adata.obs
    query_batch_key='batch',              # batch column in query_adata.obs (optional)
    cl_obo_folder=False,                  # False to skip ontology, or folder containing the CL ontology files
    prediction_mode='retrain',            # 'retrain' | 'inference' | 'fast'
    unknown_celltype_label='unknown',     # placeholder label assigned to unlabeled query cells
    n_samples_per_label=300,              # subsample reference per cell type
    hvg=4000,                             # number of highly variable genes
    save_path_trained_models='tmp/',      # where to save models
    pretrained_scvi_path=None,            # path to pretrained scVI model (optional)
)
```
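
The `n_samples_per_label` argument caps how many reference cells are kept per cell type, so abundant types do not dominate training. The sketch below shows one way such per-label subsampling can work. The function name, the seeded random sampling, and the toy labels are assumptions for illustration; the library's own sampling may differ.

```python
import random
from collections import defaultdict

def subsample_per_label(labels, n_samples_per_label, seed=0):
    """Return sorted indices keeping at most n_samples_per_label cells per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    keep = []
    for indices in by_label.values():
        # Labels at or below the cap are kept whole; larger ones are sampled.
        if len(indices) > n_samples_per_label:
            indices = rng.sample(indices, n_samples_per_label)
        keep.extend(indices)
    return sorted(keep)

# Hypothetical reference labels: one abundant, one rare, one exactly at the cap
labels = ["T cell"] * 500 + ["B cell"] * 40 + ["NK cell"] * 300
kept = subsample_per_label(labels, n_samples_per_label=300)
```

Rare cell types (here the 40 B cells) are kept in full, which is why lowering `n_samples_per_label` mainly shrinks the abundant populations.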

**prediction_mode choices:**
- `'retrain'` — Train all models from scratch on reference+query. Most accurate, slowest.
- `'inference'` — Load previously saved models. Requires `save_path_trained_models` from prior run.
- `'fast'` — Skip integration-heavy algorithms. Uses FAST_ALGORITHMS subset.

**Preprocessing applied automatically:**
- Filters cells with < 30 total counts
- Normalization to 1e4 counts per cell (target_sum=1e4), then log1p
- PCA on reference (50 components)
- Stores raw counts in `layers['scvi_counts']`

## Stage 2: Annotation

```python
# Run the selected algorithms and compute consensus
ov.popv.annotate_data(
    process_obj.adata,
    methods='all',                        # default set (see subsets below), or list of specific algorithms
    save_path='results/popv/',            # saves predictions.csv here
    methods_kwargs=None,                  # dict of per-method overrides
)
```

### Available Algorithms (10 total)

| Algorithm | Result Key | Type | Speed |
|-----------|-----------|------|-------|
| `KNN_SCVI` | `popv_knn_on_scvi_prediction` | Deep learning + KNN | Medium |
| `SCANVI_POPV` | `popv_scanvi_prediction` | Semi-supervised DL | Medium |
| `CELLTYPIST` | `popv_celltypist_prediction` | Logistic regression | Fast |
| `ONCLASS` | `popv_onclass_prediction` | Ontology-guided | Medium |
| `Support_Vector` | `popv_svm_prediction` | SVM | Fast |
| `XGboost` | `popv_xgboost_prediction` | Gradient boosting | Fast |
| `KNN_HARMONY` | `popv_knn_harmony_prediction` | Harmony + KNN | Fast |
| `KNN_BBKNN` | `popv_knn_bbknn_prediction` | BBKNN + KNN | Fast |
| `Random_Forest` | `popv_rf_prediction` | Random forest | Fast |
| `KNN_SCANORAMA` | `popv_knn_scanorama_prediction` | Scanorama + KNN | Medium |

**Algorithm subsets:**
- `FAST_ALGORITHMS`: KNN_SCVI, SCANVI_POPV, Support_Vector, XGboost, ONCLASS, CELLTYPIST (used with `prediction_mode='fast'`)
- `CURRENT_ALGORITHMS`: All except Random_Forest and KNN_SCANORAMA, which are considered outdated
- `'all'` or `None`: Uses CURRENT_ALGORITHMS (or FAST_ALGORITHMS in fast mode)
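
The selection rules above can be summarized in a short sketch. The lists are copied from this document rather than imported from the library, and `resolve_methods` is a hypothetical helper, not an omicverse function.

```python
# Names as listed in the algorithm table; these lists mirror this document,
# not an import from the library.
ALL_ALGORITHMS = [
    "KNN_SCVI", "SCANVI_POPV", "CELLTYPIST", "ONCLASS", "Support_Vector",
    "XGboost", "KNN_HARMONY", "KNN_BBKNN", "Random_Forest", "KNN_SCANORAMA",
]
FAST_ALGORITHMS = [
    "KNN_SCVI", "SCANVI_POPV", "Support_Vector", "XGboost", "ONCLASS", "CELLTYPIST",
]
CURRENT_ALGORITHMS = [
    name for name in ALL_ALGORITHMS if name not in ("Random_Forest", "KNN_SCANORAMA")
]

def resolve_methods(methods=None, prediction_mode="retrain"):
    """Hypothetical helper: expand 'all'/None and validate explicit names."""
    if methods is None or methods == "all":
        return list(FAST_ALGORITHMS if prediction_mode == "fast" else CURRENT_ALGORITHMS)
    unknown = [m for m in methods if m not in ALL_ALGORITHMS]
    if unknown:
        raise KeyError(f"Unknown method name(s): {unknown}. Names are case-sensitive.")
    return list(methods)

default_methods = resolve_methods("all")
fast_methods = resolve_methods(None, prediction_mode="fast")
```

Under these rules `'all'` expands to 8 methods, not 10; Random_Forest and KNN_SCANORAMA run only when named explicitly.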

### Selecting Specific Methods

```python
# Run only fast classical methods
ov.popv.annotate_data(
    process_obj.adata,
    methods=['CELLTYPIST', 'Support_Vector', 'XGboost'],
)

# Override per-method parameters
ov.popv.annotate_data(
    process_obj.adata,
    methods=['KNN_SCVI', 'SCANVI_POPV'],
    methods_kwargs={
        'KNN_SCVI': {'train_kwargs': {'max_epochs': 50}},
        'SCANVI_POPV': {'train_kwargs': {'max_epochs': 50}},
    },
)
```

## Stage 3: Consensus Results & Visualization

After `annotate_data()`, these columns appear in `adata.obs`:

| Column | Description |
|--------|-------------|
| `popv_majority_vote_prediction` | Majority vote across all methods |
| `popv_majority_vote_score` | Number of agreeing methods |
| `popv_prediction` | Ontology-aggregated consensus (if CL enabled) |
| `popv_prediction_score` | Ontology consensus score |

```python
# Agreement plots: confusion matrices per method vs consensus
ov.popv.make_agreement_plots(
    process_obj.adata,
    prediction_keys=process_obj.adata.uns['prediction_keys'],
    popv_prediction_key='popv_prediction',
    save_folder='results/popv/',
    show=True,
)

# Bar plot: agreement score per cell type
ov.popv.agreement_score_bar_plot(
    process_obj.adata,
    popv_prediction_key='popv_prediction',
    save_folder='results/popv/',
)

# Bar plot: prediction score distribution
ov.popv.prediction_score_bar_plot(
    process_obj.adata,
    popv_prediction_score='popv_prediction_score',
    save_folder='results/popv/',
)

# Bar plot: cell type proportions (ref vs query)
ov.popv.celltype_ratio_bar_plot(
    process_obj.adata,
    popv_prediction='popv_prediction',
    save_folder='results/popv/',
)
```

## Stage 4: Pretrained Hub Models (Optional)

For large references (e.g., Human Cell Atlas), use pretrained models to skip training:

```python
from omicverse.popv.hub import HubModel

# Pull pretrained model from HuggingFace
model = HubModel.pull_from_huggingface_hub(
    repo_name='popv/immune_all',
    cache_dir='models/popv/',
)

# Annotate query data directly (fast mode)
result_adata = model.annotate_data(
    query_adata=query_adata,
    query_batch_key='batch',
    prediction_mode='fast',
    methods=None,  # uses model's default methods
)
```

## Critical API Reference

```python
# CORRECT: methods as list of strings matching class names
ov.popv.annotate_data(adata, methods=['KNN_SCVI', 'CELLTYPIST', 'Support_Vector'])

# WRONG: passing class objects or lowercase names
# ov.popv.annotate_data(adata, methods=[KNN_SCVI, CELLTYPIST])  # TypeError
# ov.popv.annotate_data(adata, methods=['knn_scvi'])             # KeyError

# CORRECT: ref_labels_key must exist in ref_adata.obs before Process_Query
assert 'cell_type' in ref_adata.obs.columns
process_obj = ov.popv.Process_Query(..., ref_labels_key='cell_type')  # other arguments elided

# WRONG: forgetting to set unknown_celltype_label causes NaN in voting
# process_obj = ov.popv.Process_Query(..., unknown_celltype_label=None)  # NaN errors

# CORRECT: access consensus results after annotation
final_labels = process_obj.adata.obs['popv_majority_vote_prediction']
# or ontology-refined:
final_labels = process_obj.adata.obs['popv_prediction']

# WRONG: looking for results on the original query_adata
# query_adata.obs['popv_prediction']  # KeyError: results are on process_obj.adata
```

## GPU Acceleration

```python
import omicverse.popv as popv
popv.settings.accelerator = 'gpu'   # for scVI/scANVI training
popv.settings.cuml = True           # for KNN/SVM/RF via cuML
popv.settings.n_jobs = 10           # parallel jobs for CPU methods
```

## Troubleshooting

- **`RuntimeError: CUDA out of memory` during scVI/scANVI training**: Reduce `hvg` (try 2000), decrease `n_samples_per_label` (try 100), or switch to `prediction_mode='fast'` which uses fewer epochs.
- **CellTypist model download fails**: Set `methods_kwargs={'CELLTYPIST': {'method_kwargs': {'model': '/path/to/local/model.pkl'}}}` to use a local model file.
- **Low consensus agreement (<50% cells agree)**: Some algorithms may not suit your tissue. Exclude underperforming methods: check per-method predictions and drop outliers from the `methods` list.
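- **`KeyError: 'gene_name'` (gene identifier mismatch)**: Harmonize var_names between reference and query before calling `Process_Query`. If ENSEMBL IDs are in var_names and symbols are stored in a column such as `adata.var['gene_symbols']`, use `adata.var_names = adata.var['gene_symbols']`, then call `adata.var_names_make_unique()`.
- **`ValueError: batch_key contains NaN`**: Clean batch columns before PopV. Apply the batch validation pattern from the single-preprocessing skill: `adata.obs['batch'] = adata.obs['batch'].astype(object).fillna('unknown').astype('category')`. The cast to `object` avoids the error pandas raises when filling a categorical column with a value that is not yet a category.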
- **`FileNotFoundError` in inference mode**: Ensure `save_path_trained_models` points to the same directory used during the original `retrain` run. Check that model files (.pt, .pkl, .joblib) exist.
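
For the low-consensus case above, one way to spot underperforming methods is to measure how often each method matches the consensus label. The sketch below uses plain lists and toy labels. The helper name and the 0.5 cutoff are illustrative assumptions, not part of omicverse.

```python
def agreement_with_consensus(per_method_predictions, consensus):
    """Fraction of cells where each method matches the consensus label."""
    n_cells = len(consensus)
    if n_cells == 0:
        raise ValueError("consensus is empty")
    rates = {}
    for method, labels in per_method_predictions.items():
        if len(labels) != n_cells:
            raise ValueError(f"{method}: expected {n_cells} labels, got {len(labels)}")
        matches = sum(1 for pred, ref in zip(labels, consensus) if pred == ref)
        rates[method] = matches / n_cells
    return rates

# Toy data: four cells, three methods
consensus = ["T cell", "T cell", "B cell", "NK cell"]
per_method = {
    "popv_svm_prediction": ["T cell", "T cell", "B cell", "NK cell"],
    "popv_xgboost_prediction": ["T cell", "B cell", "B cell", "NK cell"],
    "popv_celltypist_prediction": ["monocyte", "B cell", "B cell", "T cell"],
}
rates = agreement_with_consensus(per_method, consensus)
# Methods agreeing with the consensus on fewer than half the cells (cutoff is illustrative)
outliers = sorted(m for m, r in rates.items() if r < 0.5)
```

With real results, `per_method` can be built from `adata.obs[key].tolist()` for each key in `adata.uns['prediction_keys']`, and `consensus` from `adata.obs['popv_majority_vote_prediction'].tolist()`.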

## Dependencies
- Core: `omicverse`, `scanpy`, `anndata`, `numpy`, `pandas`
- Deep learning: `scvi-tools`, `torch` (for KNN_SCVI, SCANVI_POPV)
- Classical ML: `scikit-learn`, `xgboost` (for RF, SVM, XGBoost)
- Integration: `harmonypy`, `bbknn`, `scanorama` (for respective KNN methods)
- Annotation: `celltypist`, `OnClass` (optional per method)
- Ontology: `obonet`, `pronto` (for ontology-aware voting)
- Hub: `huggingface_hub` (for pretrained models)

## Examples
- "Annotate my PBMC query data against a reference atlas using PopV with all 10 algorithms and visualize the consensus."
- "Use a pretrained PopV hub model to quickly annotate my lung tissue scRNA-seq data."
- "Run PopV with only classical methods (SVM, XGBoost, CellTypist) to annotate my query cells without GPU."

## References
- Quick copy/paste commands: [`reference.md`](reference.md)