--- name: single-cell-annotation-skills-with-omicverse title: Single-cell annotation skills with omicverse description: "Cell type annotation: SCSA, MetaTiME, CellVote consensus, CellMatch, GPTAnno, weighted KNN label transfer in OmicVerse." --- # Single-cell annotation skills with omicverse ## Overview Use this skill to reproduce and adapt the single-cell annotation playbook captured in omicverse tutorials: SCSA [`t_cellanno.ipynb`](../../../omicverse_guide/docs/Tutorials-single/t_cellanno.ipynb), MetaTiME [`t_metatime.ipynb`](../../../omicverse_guide/docs/Tutorials-single/t_metatime.ipynb), CellVote [`t_cellvote.md`](../../../omicverse_guide/docs/Tutorials-single/t_cellvote.md) & [`t_cellvote_pbmc3k.ipynb`](../../../omicverse_guide/docs/Tutorials-single/t_cellvote_pbmc3k.ipynb), CellMatch [`t_cellmatch.ipynb`](../../../omicverse_guide/docs/Tutorials-single/t_cellmatch.ipynb), GPTAnno [`t_gptanno.ipynb`](../../../omicverse_guide/docs/Tutorials-single/t_gptanno.ipynb), and label transfer [`t_anno_trans.ipynb`](../../../omicverse_guide/docs/Tutorials-single/t_anno_trans.ipynb). Each section below highlights required inputs, training/inference steps, and how to read the outputs. ## Instructions 1. **SCSA automated cluster annotation** - *Data requirements*: PBMC3k raw counts from 10x Genomics (`pbmc3k_filtered_gene_bc_matrices.tar.gz`) or the processed `sample/rna.h5ad`. Download instructions are embedded in the notebook; unpack to `data/filtered_gene_bc_matrices/hg19/`. Ensure an SCSA SQLite database is available (e.g. `pySCSA_2024_v1_plus.db` from the Figshare/Drive links listed in the tutorial) and point `model_path` to its location. - *Preprocessing & model fit*: Load with `ov.io.read_10x_mtx`, run QC (`ov.pp.qc`), normalization and HVG selection (`ov.pp.preprocess`), scaling (`ov.pp.scale`), PCA (`ov.pp.pca`), neighbors, Leiden clustering, and compute rank markers (`sc.tl.rank_genes_groups`). Instantiate `scsa = ov.single.pySCSA(...)` choosing `target='cellmarker'` or `'panglaodb'`, tissue scope, and thresholds (`foldchange`, `pvalue`). - *Inference & interpretation*: Call `scsa.cell_anno(clustertype='leiden', result_key='scsa_celltype_cellmarker')` or `scsa.cell_auto_anno` to append predictions to `adata.obs`. Compare to manual marker-based labels via `ov.utils.embedding` or `sc.pl.dotplot`, inspect marker dictionaries (`ov.single.get_celltype_marker`), and query supported tissues with `scsa.get_model_tissue()`. Use the ROI/ROE helpers (`ov.utils.roe`, `ov.utils.plot_cellproportion`) to validate abundance trends. 2. **MetaTiME tumour microenvironment states** - *Data requirements*: Batched TME AnnData with an scVI latent embedding. The tutorial uses `TiME_adata_scvi.h5ad` from Figshare (`https://figshare.com/ndownloader/files/41440050`). If starting from counts, run scVI (`scvi.model.SCVI`) first to populate `adata.obsm['X_scVI']`. - *Preprocessing & model fit*: Optionally subset to non-malignant cells via `adata.obs['isTME']`. Rebuild neighbors on the latent representation (`sc.pp.neighbors(adata, use_rep="X_scVI")`) and embed with pymde (`adata.obsm['X_mde'] = ov.utils.mde(...)`). Initialise `TiME_object = ov.single.MetaTiME(adata, mode='table')` and, if finer granularity is desired, over-cluster with `TiME_object.overcluster(resolution=8, clustercol='overcluster')`. - *Inference & interpretation*: Run `TiME_object.predictTiME(save_obs_name='MetaTiME')` to assign minor states and `Major_MetaTiME`. Visualise using `TiME_object.plot` or `sc.pl.embedding`. Interpret the outputs by comparing cluster-level distributions and confirming that MetaTiME and Major_MetaTiME columns align with expected niches. 3. **CellVote consensus labelling** - *Data requirements*: A clustered AnnData (e.g. PBMC3k stored as `CELLVOTE_PBMC3K` env var or `data/pbmc3k.h5ad`) plus at least two precomputed annotation columns (simulated in the tutorial as `scsa_annotation`, `gpt_celltype`, `gbi_celltype`). Prepare per-cluster marker genes via `sc.tl.rank_genes_groups`. - *Preprocessing & model fit*: After standard preprocessing (normalize, log1p, HVGs, PCA, neighbors, Leiden) build a marker dictionary `marker_dict = top_markers_from_rgg(adata, 'leiden', topn=10)` or via `ov.single.get_celltype_marker`. Instantiate `cv = ov.single.CellVote(adata)`. - *Inference & interpretation*: Call `cv.vote(clusters_key='leiden', cluster_markers=marker_dict, celltype_keys=[...], species='human', organization='PBMC', provider='openai', model='gpt-4o-mini')`. Offline examples monkey-patch arbitration to avoid API calls; online voting requires valid credentials. Final consensus labels live in `adata.obs['CellVote_celltype']`. Compare each cluster’s majority vote with the input sources (`adata.obs[['leiden', 'scsa_annotation', ...]]`) to justify decisions. 4. **CellMatch ontology mapping** - *Data requirements*: Annotated AnnData such as `pertpy.dt.haber_2017_regions()` with `adata.obs['cell_label']`. Download Cell Ontology JSON (`cl.json`) via `ov.single.download_cl(...)` or manual links, and optionally Cell Taxonomy resources (`Cell_Taxonomy_resource.txt`). Ensure access to a SentenceTransformer model (`sentence-transformers/all-MiniLM-L6-v2`, `BAAI/bge-base-en-v1.5`, etc.), downloading to `local_model_dir` if offline. - *Preprocessing & model fit*: Create the mapper with `ov.single.CellOntologyMapper(cl_obo_file='new_ontology/cl.json', model_name='sentence-transformers/all-MiniLM-L6-v2', local_model_dir='./my_models')`. Run `mapper.map_adata(...)` to assign ontology-derived labels/IDs, optionally enabling taxonomy matching (`use_taxonomy=True` after calling `load_cell_taxonomy_resource`). - *Inference & interpretation*: Explore mapping summaries (`mapper.print_mapping_summary_taxonomy`) and inspect embeddings coloured by `cell_ontology`, `cell_ontology_cl_id`, or `enhanced_cell_ontology`. Use helper queries such as `mapper.find_similar_cells('T helper cell')`, `mapper.get_cell_info(...)`, and category browsing to validate ontology coverage. 5. **GPTAnno LLM-powered annotation** - *Data requirements*: The same PBMC3k dataset (raw matrix or `.h5ad`) and cluster assignments. Access to an LLM endpoint—configure `AGI_API_KEY` for OpenAI-compatible providers (`provider='openai'`, `'qwen'`, `'kimi'`, etc.), or supply a local model path for `ov.single.gptcelltype_local`. - *Preprocessing & model fit*: Follow the QC, normalization, HVG, scaling, PCA, neighbor, Leiden, and marker discovery steps described above (reusing outputs from the SCSA workflow). Build the marker dictionary automatically with `ov.single.get_celltype_marker(adata, clustertype='leiden', rank=True, key='rank_genes_groups', foldchange=2, topgenenumber=5)`. - *Inference & interpretation*: Invoke `ov.single.gptcelltype(...)` specifying tissue/species context and desired provider/model. Post-process responses to keep clean labels (`result[key].split(': ')[-1]...`) and write them to `adata.obs['gpt_celltype']`. Compare embeddings (`ov.pl.embedding(..., color=['leiden','gpt_celltype'])`) to verify cluster identities. If operating offline, call `ov.single.gptcelltype_local` with a downloaded instruction-tuned checkpoint. 6. **Weighted KNN annotation transfer** - *Data requirements*: Cross-modal GLUE outputs with aligned embeddings, e.g. `data/analysis_lymph/rna-emb.h5ad` (annotated RNA) and `data/analysis_lymph/atac-emb.h5ad` (query ATAC) where both contain `obsm['X_glue']`. - *Preprocessing & model fit*: Load both modalities, optionally concatenate for QC plots, and compute a shared low-dimensional embedding with `ov.utils.mde`. Train a neighbour model using `ov.utils.weighted_knn_trainer(train_adata=rna, train_adata_emb='X_glue', n_neighbors=15)`. - *Inference & interpretation*: Transfer labels via `labels, uncert = ov.utils.weighted_knn_transfer(query_adata=atac, query_adata_emb='X_glue', label_keys='major_celltype', knn_model=knn_transformer, ref_adata_obs=rna.obs)`. Store predictions in `atac.obs['transf_celltype']` and uncertainties in `atac.obs['transf_celltype_unc']`; copy to `major_celltype` if you want consistent naming. Visualise (`ov.utils.embedding`) and inspect uncertainty to flag ambiguous cells. ## Defensive Validation Patterns ```python # Before SCSA: verify rank_genes_groups has been computed assert 'rank_genes_groups' in adata.uns, \ "Marker genes required. Run sc.tl.rank_genes_groups(adata, groupby='leiden') first." # Before any annotation: verify clustering exists assert 'leiden' in adata.obs.columns or 'louvain' in adata.obs.columns, \ "Clustering required. Run ov.pp.leiden(adata) or sc.tl.leiden(adata) first." # Before CellVote: verify multiple annotation columns exist annotation_keys = ['scsa_annotation', 'gpt_celltype'] # adjust to actual keys for key in annotation_keys: assert key in adata.obs.columns, f"Annotation column '{key}' not found — run annotators first" ``` ## Critical API Reference - EXACT Function Signatures ### pySCSA - IMPORTANT: Parameter is `clustertype`, NOT `cluster` **CORRECT usage:** ```python # Step 1: Initialize pySCSA scsa = ov.single.pySCSA( adata, foldchange=1.5, pvalue=0.01, species='Human', tissue='All', target='cellmarker' # or 'panglaodb' ) # Step 2: Run annotation - NOTE: use clustertype='leiden', NOT cluster='leiden'! anno_result = scsa.cell_anno(clustertype='leiden', cluster='all') # Step 3: Add cell type labels to adata.obs scsa.cell_auto_anno(adata, clustertype='leiden', key='scsa_celltype') # Results are stored in adata.obs['scsa_celltype'] ``` **WRONG - DO NOT USE:** ```python # WRONG! 'cluster' is NOT a valid parameter for cell_auto_anno! # scsa.cell_auto_anno(adata, cluster='leiden') # ERROR! ``` ### COSG Marker Genes - Results stored in adata.uns, NOT adata.obs **CORRECT usage:** ```python # Step 1: Run COSG marker gene identification ov.single.cosg(adata, groupby='leiden', n_genes_user=50) # Step 2: Access results from adata.uns (NOT adata.obs!) marker_names = adata.uns['rank_genes_groups']['names'] # DataFrame with cluster columns marker_scores = adata.uns['rank_genes_groups']['scores'] # Step 3: Get top markers for specific cluster cluster_0_markers = adata.uns['rank_genes_groups']['names']['0'][:10].tolist() # Step 4: To create celltype column, manually map clusters to cell types cluster_to_celltype = { '0': 'T cells', '1': 'B cells', '2': 'Monocytes', } adata.obs['cosg_celltype'] = adata.obs['leiden'].map(cluster_to_celltype) ``` **WRONG - DO NOT USE:** ```python # WRONG! COSG does NOT create adata.obs columns directly! # adata.obs['cosg_celltype'] # This key does NOT exist after running COSG! # adata.uns['cosg_celltype'] # This key also does NOT exist! ``` ### Common Pitfalls to Avoid 1. **pySCSA parameter confusion**: - `clustertype` = which obs column contains cluster labels (e.g., 'leiden') - `cluster` = which specific clusters to annotate ('all' or specific cluster IDs) - These are DIFFERENT parameters! 2. **COSG result access**: - COSG is a marker gene finder, NOT a cell type annotator - Results are per-cluster gene rankings stored in `adata.uns['rank_genes_groups']` - To assign cell types, you must manually map clusters to cell types based on markers 3. **Result storage patterns in OmicVerse**: - Cell type annotations → `adata.obs['']` - Marker gene results → `adata.uns['']` (includes 'names', 'scores', 'logfoldchanges') - Differential expression → `adata.uns['rank_genes_groups']` ## Examples - "Run SCSA with both CellMarker and PanglaoDB references on PBMC3k, then benchmark against manual marker assignments before feeding the results into CellVote." - "Annotate tumour microenvironment states in the MetaTiME Figshare dataset, highlight Major_MetaTiME classes, and export the label distribution per patient." - "Download Cell Ontology resources, map `haber_2017_regions` clusters to ontology terms, and enrich ambiguous clusters using Cell Taxonomy hints." - "Propagate RNA-derived `major_celltype` labels onto GLUE-integrated ATAC cells and report clusters with high transfer uncertainty." ## References - Tutorials and notebooks: [`t_cellanno.ipynb`](../../../omicverse_guide/docs/Tutorials-single/t_cellanno.ipynb), [`t_metatime.ipynb`](../../../omicverse_guide/docs/Tutorials-single/t_metatime.ipynb), [`t_cellvote.md`](../../../omicverse_guide/docs/Tutorials-single/t_cellvote.md), [`t_cellvote_pbmc3k.ipynb`](../../../omicverse_guide/docs/Tutorials-single/t_cellvote_pbmc3k.ipynb), [`t_cellmatch.ipynb`](../../../omicverse_guide/docs/Tutorials-single/t_cellmatch.ipynb), [`t_gptanno.ipynb`](../../../omicverse_guide/docs/Tutorials-single/t_gptanno.ipynb), [`t_anno_trans.ipynb`](../../../omicverse_guide/docs/Tutorials-single/t_anno_trans.ipynb). - Sample data & assets: PBMC3k matrix from 10x Genomics, MetaTiME `TiME_adata_scvi.h5ad` (Figshare), SCSA database downloads, GLUE embeddings under `data/analysis_lymph/`, Cell Ontology `cl.json`, and Cell Taxonomy resource. - Quick copy commands: [`reference.md`](reference.md).