---
name: celltypeannotation
description: Annotates cell clusters with biological cell type labels using multiple methods: direct assignment, ScType, scCATCH, hitype, or CellTypist. This process is essential for interpreting clustering results by assigning meaningful biological identities to each cluster.
---

# CellTypeAnnotation Process Configuration

## Purpose
Annotates cell clusters with biological cell type labels using one of five methods: direct assignment, ScType, scCATCH, hitype, or CellTypist. This process is essential for interpreting clustering results because it assigns a meaningful biological identity to each cluster.

## When to Use
- **After clustering**: When you have cluster assignments but need biological cell type labels
- **Automated annotation**: When manual annotation is too time-consuming or subjective
- **Consistent nomenclature**: When you need standardized cell type names across multiple samples
- **Reference-based annotation**: When you have well-characterized reference datasets or marker databases
- **Cross-sample comparison**: When analyzing multiple samples with the same cell type definitions
- **Alternative to SeuratMap2Ref**: When you prefer database-based annotation over reference dataset mapping

## Configuration Structure

### Process Enablement
```toml
[CellTypeAnnotation]
cache = true  # Cache results for faster re-runs
```

### Input Specification
```toml
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]  # Path or reference to Seurat object
```

### Environment Variables

#### Core Parameters
```toml
[CellTypeAnnotation.envs]
# Annotation method selection
tool = "direct"  # Options: "direct", "sctype", "hitype", "sccatch", "celltypist"

# Cluster identity column (required for h5ad input, optional for Seurat objects)
ident = "seurat_clusters"  # Column name in metadata representing clusters

# Backup column name (stores original cluster labels)
backup_col = "seurat_clusters_id"  # Default: "seurat_clusters_id"

# New column name for annotated cell types
# If specified, original identity is kept; otherwise, it's replaced
newcol = ""  # Default: empty (overwrite identity)

# Merge clusters with same predicted cell types
merge = false  # Default: false; when false, duplicate labels get suffixes (.1, .2)

# Output file type
outtype = "input"  # Options: "input", "rds", "qs", "qs2", "h5ad"
```
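
To make the effect of `merge` concrete, the sketch below mimics both settings on a list of predicted labels (one per cluster). It is illustrative only: `dedupe_labels` is a hypothetical helper, and the numbering scheme (first occurrence bare, later ones `.1`, `.2`, as R's `make.unique` does) is an assumption about how the process suffixes duplicates.

```python
from collections import defaultdict

def dedupe_labels(labels, merge=False):
    """Return one label per cluster, mimicking the documented `merge` option.

    Assumption: with merge=False, repeated labels are numbered like R's
    make.unique ("T", "T.1", "T.2"); the real process may number differently.
    """
    if merge:
        # Clusters that share a label collapse into a single identity.
        return list(labels)
    seen = defaultdict(int)
    out = []
    for label in labels:
        count = seen[label]
        out.append(label if count == 0 else f"{label}.{count}")
        seen[label] += 1
    return out

predicted = ["T cells", "B cells", "T cells", "T cells"]
print(dedupe_labels(predicted, merge=False))
print(dedupe_labels(predicted, merge=True))
```

With `merge = true`, the three "T cells" clusters end up sharing one identity; with `merge = false`, each keeps a distinct, suffixed label.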

#### Direct Annotation Parameters
```toml
[CellTypeAnnotation.envs]
tool = "direct"

# Cell type assignments (one per cluster, in order)
# Use "-" or "" to keep original cluster name
# Use "NA" to remove cluster from downstream analysis (only without newcol)
cell_types = ["CD4+ T cells", "CD8+ T cells", "-", "B cells"]  # Default: []

# Additional annotations (multiple cell type columns)
# Each key is a new metadata column; each value lists cell types per cluster
[CellTypeAnnotation.envs.more_cell_types]
cell_type_broad = ["T cells", "T cells", "NK cells", "B cells"]
cell_type_detailed = ["CD4+ naive", "CD8+ effector", "NK", "B naive"]
```
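
The positional rules for `cell_types` (placeholders, `"NA"`, and lists shorter than the cluster count) can be sketched as follows. This is a minimal illustration, not the pipeline's code: `apply_direct_annotation` is a hypothetical helper, and keeping `"NA"` as a plain value when `newcol` is set is an assumption.

```python
def apply_direct_annotation(clusters, cell_types, newcol=False):
    """Map ordered cluster names to labels following the documented rules.

    Returns {cluster: label}; a label of None means the cluster is dropped.
    """
    if len(cell_types) > len(clusters):
        raise ValueError("cell_types has more entries than there are clusters")
    mapping = {}
    for i, cluster in enumerate(clusters):
        # Clusters beyond the end of the list behave like the "-" placeholder.
        label = cell_types[i] if i < len(cell_types) else "-"
        if label in ("-", ""):
            mapping[cluster] = cluster   # keep the original cluster name
        elif label == "NA" and not newcol:
            mapping[cluster] = None      # cluster removed from downstream analysis
        else:
            mapping[cluster] = label     # assumption: "NA" stays a value with newcol
    return mapping

clusters = ["0", "1", "2", "3", "4"]
cell_types = ["CD4+ T cells", "CD8+ T cells", "-", "NA"]
print(apply_direct_annotation(clusters, cell_types))
```

Here cluster 2 keeps its name, cluster 3 is dropped, and cluster 4 (not covered by the list) is left unchanged.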

#### ScType Annotation Parameters
```toml
[CellTypeAnnotation.envs]
tool = "sctype"

# Tissue type (must match tissueType column in database)
sctype_tissue = "Immune system"  # Required for sctype

# Database file path (Excel format compatible with ScType)
sctype_db = "/path/to/ScTypeDB_full.xlsx"  # Optional: uses default if not specified
```

#### hitype Annotation Parameters
```toml
[CellTypeAnnotation.envs]
tool = "hitype"

# Tissue type (must match tissueType column in database)
hitype_tissue = "Immune system"  # Required for hitype

# Database file path or built-in database name
# Built-in options: "hitypedb_short", "hitypedb_full", "hitypedb_pbmc3k"
hitype_db = "hitypedb_full"  # Default: built-in database
```

#### scCATCH Annotation Parameters
```toml
[CellTypeAnnotation.envs]
tool = "sccatch"

[CellTypeAnnotation.envs.sccatch_args]
# Species (Human or Mouse)
species = "Human"  # Required

# Tissue origin
tissue = "Blood"  # Required

# Cancer type (if cancer tissue)
cancer = "Normal"  # Default: "Normal"

# Custom marker genes (RDS file or list)
marker = ""  # Optional

# Use custom marker instead of database
if_use_custom_marker = false  # Default: false

# Additional scCATCH::findmarkergene() arguments
# See: https://rdrr.io/cran/scCATCH/man/findmarkergene.html
```

#### CellTypist Annotation Parameters
```toml
[CellTypeAnnotation.envs]
tool = "celltypist"

[CellTypeAnnotation.envs.celltypist_args]
# Model file path (download from https://celltypist.cog.sanger.ac.uk/models/models.json)
model = "Immune_All_Low.pkl"  # Required

# Python interpreter where celltypist is installed
python = "python"  # Default: "python"

# Majority voting refinement for local subclusters
majority_voting = true  # Default: true

# Over-clustering column (for majority voting)
# Set to false to disable over-clustering
over_clustering = "seurat_clusters"  # Auto: identity for Seurat, false for h5ad

# Assay for Seurat-to-AnnData conversion
assay = ""  # Auto: RNA for h5seurat, default assay for Seurat
```

## Annotation Methods

### 1. Direct Annotation
Assigns cell types manually to each cluster. Best when you have well-defined marker genes or want complete control over annotations.

**Pros**:
- Full control over annotations
- Fast and deterministic
- Works with any clustering result

**Cons**:
- Requires domain knowledge
- Time-consuming for many clusters
- Subjective

**Use cases**:
- Small number of well-separated clusters
- Known marker genes
- Reproducible annotation needed

### 2. ScType
Uses pre-defined cell type markers from ScType database. Annotates based on enrichment of known marker genes in each cluster.

**Databases**:
- ScTypeDB_short.xlsx: Compact database (~70 cell types)
- ScTypeDB_full.xlsx: Full database (~200+ cell types)
- Custom database: Provide your own Excel file

**Pros**:
- Automated annotation
- Tissue-specific filtering available
- Well-curated marker database

**Cons**:
- Limited to predefined cell types
- Requires tissue specification
- May miss rare cell types

**Reference**: https://github.com/IanevskiAleksandr/sc-type

**Use cases**:
- Immune tissue datasets
- When tissue type is well-defined
- Need for comprehensive annotation

### 3. hitype
Flexible annotation tool compatible with ScType database format. Supports both file-based and built-in databases.

**Built-in databases**:
- `hitypedb_short`: Compact marker set
- `hitypedb_full`: Comprehensive marker set
- `hitypedb_pbmc3k`: PBMC-specific markers (from 10X PBMC3k dataset)

**Pros**:
- Faster than ScType
- Multiple built-in databases
- Tissue-specific filtering

**Cons**:
- Limited to database cell types
- Requires tissue specification

**Reference**: https://github.com/pwwang/hitype

**Use cases**:
- PBMC datasets (use `hitypedb_pbmc3k`)
- General immune annotation
- When speed matters

### 4. scCATCH
Identifies cell types by matching cluster marker genes to cell type-specific marker database.

**Workflow**:
1. Finds marker genes for each cluster
2. Matches markers to cell type database
3. Assigns best matching cell type
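
Steps 2 and 3 of the workflow above can be pictured with a toy overlap score. This is only a sketch: scCATCH uses its own evidence-based scoring, and both `best_match` and the tiny marker table are hypothetical stand-ins for the real database.

```python
def best_match(cluster_markers, marker_db):
    """Return (cell_type, score) with the largest marker overlap, or (None, 0)."""
    best_type, best_score = None, 0
    for cell_type, known in marker_db.items():
        score = len(set(cluster_markers) & set(known))
        if score > best_score:
            best_type, best_score = cell_type, score
    return best_type, best_score

# Toy marker table; a real database covers many more cell types and genes.
marker_db = {
    "T cell": ["CD3D", "CD3E", "IL7R"],
    "B cell": ["MS4A1", "CD79A", "CD79B"],
    "NK cell": ["NKG7", "GNLY", "KLRD1"],
}
# Step 1 output: marker genes found for each cluster.
cluster_markers = {
    "0": ["CD3D", "IL7R", "LTB"],
    "1": ["MS4A1", "CD79A", "HLA-DRA"],
    "2": ["ACTB"],
}
for cluster, markers in cluster_markers.items():
    print(cluster, best_match(markers, marker_db))
```

A cluster whose markers match nothing in the table (cluster 2 here) gets no label, which is one way a limited database shows up in practice.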

**Parameters**:
- `species`: Human or Mouse
- `tissue`: Tissue origin (required)
- `cancer`: Cancer type (if applicable)

**Pros**:
- Automated marker identification
- Species-specific databases
- Cancer type support

**Cons**:
- Requires tissue specification
- Slower (finds markers first)
- Limited database

**Reference**: https://github.com/ZJUFanLab/scCATCH

**Use cases**:
- When you want marker discovery + annotation
- Cancer tissue datasets
- Species-specific annotation

### 5. CellTypist
Machine learning-based annotation using pre-trained models. Requires a Python environment with the `celltypist` package installed.

**Models**:
- Download from: https://celltypist.cog.sanger.ac.uk/models/models.json
- Common models: Immune_All_Low.pkl, Immune_All_High.pkl, Tissue-specific models

**Key features**:
- `majority_voting`: Refines annotations within local subclusters
- `over_clustering`: Over-cluster first, then merge by majority vote
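
The idea behind majority voting can be sketched in a few lines: per-cell predictions are grouped by over-cluster, and every cell in a group receives the group's most frequent label. This is a simplified illustration with a hypothetical `majority_vote` helper; CellTypist's actual procedure may apply additional thresholds.

```python
from collections import Counter, defaultdict

def majority_vote(per_cell_labels, over_clusters):
    """Replace each cell's label with the most common label in its over-cluster."""
    votes = defaultdict(Counter)
    for label, group in zip(per_cell_labels, over_clusters):
        votes[group][label] += 1
    # Ties resolve to the label that was seen first within the group.
    winner = {group: counts.most_common(1)[0][0] for group, counts in votes.items()}
    return [winner[group] for group in over_clusters]

labels = ["T", "T", "B", "B", "B", "T"]      # per-cell model predictions
groups = ["0", "0", "0", "1", "1", "1"]      # over-clustering assignment
print(majority_vote(labels, groups))
```

The stray "B" in group 0 and the stray "T" in group 1 are overruled, which is why majority voting tends to give cleaner cluster-level labels at the cost of extra runtime.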

**Pros**:
- State-of-the-art ML models
- Handles complex datasets well
- Majority voting improves accuracy

**Cons**:
- Requires Python environment
- Model files need download
- Longer runtime with majority voting

**Reference**: https://celltypist.org/

**Use cases**:
- Large complex datasets
- When ScType/hitype annotation is insufficient
- High-throughput annotation

## Configuration Examples

### Example 1: Minimal Configuration (No Annotation)
```toml
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
```
**Result**: Tool defaults to "direct" with empty `cell_types`. Original cluster names are preserved.

### Example 2: Direct Annotation for T Cell Subsets
```toml
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["CD4+ naive", "CD4+ memory", "CD8+ naive", "CD8+ effector", "-", "Regulatory T"]
```
**Result**: Clusters 0-3 and 5 get specified labels. Cluster 4 keeps original name (placeholder "-").

### Example 3: ScType for Immune Tissue
```toml
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Immune system"
sctype_db = "/data/databases/ScTypeDB_full.xlsx"
merge = true  # Merge clusters with same annotation
```
**Result**: Uses full ScType database for immune tissue. Merges clusters with identical annotations.

### Example 4: hitype with Built-in PBMC Database
```toml
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "hitype"
hitype_tissue = "Blood"
hitype_db = "hitypedb_pbmc3k"  # Built-in PBMC database
merge = true
```
**Result**: Fast PBMC annotation using built-in database optimized for 10X PBMC data.

### Example 5: scCATCH for Cancer Tissue
```toml
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "sccatch"

[CellTypeAnnotation.envs.sccatch_args]
species = "Human"
tissue = "Lung"
cancer = "Lung adenocarcinoma"
```
**Result**: Annotates lung adenocarcinoma dataset with cancer-specific cell types.

### Example 6: CellTypist with Majority Voting
```toml
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "celltypist"

[CellTypeAnnotation.envs.celltypist_args]
model = "/data/models/Immune_All_Low.pkl"
majority_voting = true
over_clustering = "seurat_clusters"  # Use clusters for majority voting
python = "/usr/bin/python3"  # Specify Python interpreter
```
**Result**: Uses ML model with majority voting refinement for robust annotation.

### Example 7: Multiple Annotation Methods (Keep Original)
```toml
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Immune system"
newcol = "celltype_sctype"  # Create new column, keep original
```
**Result**: Annotated cell types saved in `celltype_sctype` column. Original `seurat_clusters` unchanged.

### Example 8: Multiple Annotation Columns
```toml
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["CD4+ T", "CD8+ T", "NK", "B", "Monocyte"]

[CellTypeAnnotation.envs.more_cell_types]
celltype_broad = ["T cells", "T cells", "NK cells", "B cells", "Monocytes"]
celltype_subset = ["CD4+ naive", "CD8+ effector", "NK", "B naive", "CD14+ Mono"]
```
**Result**: `cell_types` replaces the cluster identity (no `newcol` is set), and two additional metadata columns are created: `celltype_broad` and `celltype_subset`.

### Example 9: Exclude Clusters with NA
```toml
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["CD4+ T", "CD8+ T", "NA", "B cells"]
```
**Result**: Cluster 2 is removed from downstream analysis (NA excludes cluster). **Note**: Only works without `newcol`.

### Example 10: H5AD Input with CellTypist
```toml
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["seurat_clustering.h5ad"]  # H5AD file

[CellTypeAnnotation.envs]
tool = "celltypist"
ident = "clusters"  # Required for H5AD: cluster column name

[CellTypeAnnotation.envs.celltypist_args]
model = "Immune_All_Low.pkl"
majority_voting = true
```
**Result**: Annotates H5AD file. `ident` specifies which metadata column contains clusters.

## Common Patterns

### Pattern 1: Standard T Cell Annotation Workflow
```toml
# Step 1: Cluster T cells
[SeuratClusteringOfAllCells]
[TOrBCellSelection]
[SeuratClustering]  # Clustering on T cells only

# Step 2: Annotate T cell subsets
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["Naive CD4+", "Memory CD4+", "Effector CD8+", "Tregs", "Progenitor"]
```

### Pattern 2: Automated Immune Annotation with Backup
```toml
# Use hitype for annotation, keep original clusters
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "hitype"
hitype_tissue = "Blood"
hitype_db = "hitypedb_pbmc3k"
newcol = "celltype_hitype"  # Keep original seurat_clusters
merge = true
```

### Pattern 3: Combine Multiple Annotation Methods
```toml
# First annotation: ScType
[CellTypeAnnotation]
[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Immune system"
newcol = "celltype_sctype"

# Second annotation: CellTypist for comparison
[CellTypeAnnotation2]
# Note: Must define separate process for second annotation
# See immunopipe-config.md for multi-process setup
```

### Pattern 4: Refine Annotation with CellTypist
```toml
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "celltypist"

[CellTypeAnnotation.envs.celltypist_args]
model = "Immune_All_Low.pkl"
majority_voting = true
over_clustering = "seurat_clusters"  # Use clustering result
python = "python"
```

### Pattern 5: Tissue-Specific ScType Annotation
```toml
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Brain"  # Brain-specific annotation
sctype_db = "/data/brain_markers.xlsx"  # Custom brain marker database
merge = true
```

## Dependencies

### Upstream Processes
- **Required**: `SeuratClustering` (or process that produces Seurat object with clusters)
- **Optional**: `SeuratClusteringOfAllCells` (if using T/B cell selection)
- **Optional**: `SeuratMap2Ref` (can combine multiple annotation methods)
- **Optional**: `TOrBCellSelection` (T/B-specific annotation)

### Downstream Processes
- **SeuratClusterStats**: Uses annotated cell types for visualization
- **ClusterMarkers**: Finds markers for each cell type
- **TopExpressingGenes**: Top genes per cell type
- **MarkersFinder**: Flexible marker finding by cell type
- **CellCellCommunication**: Uses cell types for ligand-receptor analysis
- **ScFGSEA**: GSEA by cell type
- **PseudoBulkDEG**: DE analysis by cell type
- **ScrnaMetabolicLandscape**: Metabolic analysis by cell type
- **ScRepCombiningExpression**: Integrates with TCR/BCR data

### External Dependencies
- **ScType**: Requires `sctype` R package
- **hitype**: Requires `hitype` R package
- **scCATCH**: Requires `scCATCH` R package
- **CellTypist**: Requires `celltypist` Python package and Python interpreter

## Validation Rules

### Tool-Specific Validation
1. **ScType**:
   - `sctype_tissue` must be specified (or empty string to use all tissues)
   - `sctype_db` must be a valid Excel file path (or empty for default)
   - Database must follow the ScType layout, with `tissueType`, `cellName`, `geneSymbolmore1`, and `geneSymbolmore2` columns

2. **hitype**:
   - `hitype_tissue` must be specified (or empty string to use all tissues)
   - `hitype_db` must be valid file path or built-in name
   - Built-in names: `hitypedb_short`, `hitypedb_full`, `hitypedb_pbmc3k`

3. **scCATCH**:
   - `species` must be "Human" or "Mouse"
   - `tissue` must be specified
   - At least 2 clusters required (scCATCH limitation)

4. **CellTypist**:
   - `model` must be a valid .pkl file path
   - `python` must be valid Python interpreter path
   - CellTypist must be installed in specified Python environment

5. **Direct**:
   - `cell_types` must not have more entries than there are clusters; a shorter list is fine (the remaining clusters keep their original names)
   - Placeholders "-" or "" keep original names
   - "NA" removes cluster (only without `newcol`)
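
A pre-flight check of the tool-specific rules above might look like the sketch below. It works on a plain dictionary shaped like the `[CellTypeAnnotation.envs]` table; `check_envs` is a hypothetical helper, not part of the pipeline, and it covers only rules that can be checked without opening any files.

```python
VALID_TOOLS = {"direct", "sctype", "hitype", "sccatch", "celltypist"}

def check_envs(envs):
    """Return a list of human-readable problems found in an envs dict."""
    problems = []
    tool = envs.get("tool", "direct")  # "direct" is the documented default
    if tool not in VALID_TOOLS:
        return [f"unknown tool: {tool!r}"]
    if tool in ("sctype", "hitype") and f"{tool}_tissue" not in envs:
        problems.append(f"{tool}_tissue is not set (use '' for all tissues)")
    if tool == "sccatch":
        args = envs.get("sccatch_args", {})
        if args.get("species") not in ("Human", "Mouse"):
            problems.append("sccatch_args.species must be 'Human' or 'Mouse'")
        if not args.get("tissue"):
            problems.append("sccatch_args.tissue is required")
    if tool == "celltypist":
        model = envs.get("celltypist_args", {}).get("model", "")
        if not model.endswith(".pkl"):
            problems.append("celltypist_args.model should point to a .pkl file")
    return problems

print(check_envs({"tool": "sccatch", "sccatch_args": {"species": "Rat"}}))
print(check_envs({"tool": "sctype", "sctype_tissue": "Immune system"}))
```

An empty list means none of the checked rules is violated; it does not guarantee that the database, model, or interpreter paths actually exist.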

### Input Validation
- Seurat object must have valid identity/clustering column
- H5AD input requires `ident` parameter (cluster column name)
- Output directory must be writable

### Output Validation
- `cluster2celltype.tsv` generated for ScType/hitype/scCATCH/CellTypist
- Output file format matches `outtype` specification
- Metadata contains annotated cell types

## Troubleshooting

### Common Issues and Solutions

#### Issue: "No tissues found in database" (ScType/hitype)
**Cause**: `sctype_tissue` or `hitype_tissue` doesn't match tissueType column in database.

**Solutions**:
1. Check available tissues: Open database Excel file, read `tissueType` column
2. Use exact match (case-sensitive)
3. Set tissue to empty string `""` to use all rows in database
4. Verify database file path is correct
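
Because the tissue match is exact and case-sensitive, a quick comparison against the database's `tissueType` values can reveal a near miss. The sketch below assumes those values have already been read from the Excel file into a list; `check_tissue` is a hypothetical helper and the tissue list is a toy example.

```python
import difflib

def check_tissue(tissue, db_tissues):
    """Return (ok, suggestions) for a tissue name against tissueType values."""
    available = sorted(set(db_tissues))
    # An empty string means "use all rows", so it always passes.
    if tissue == "" or tissue in available:
        return True, []
    # A case-only mismatch is the most common mistake; report the exact spelling.
    lowered = {t.lower(): t for t in available}
    if tissue.lower() in lowered:
        return False, [lowered[tissue.lower()]]
    # Otherwise fall back to fuzzy matching for typos.
    return False, difflib.get_close_matches(tissue, available, n=3, cutoff=0.6)

db_tissues = ["Immune system", "Brain", "Liver", "Immune system"]
print(check_tissue("immune system", db_tissues))
print(check_tissue("Brian", db_tissues))
```

Copy the suggested spelling verbatim into `sctype_tissue` or `hitype_tissue`.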

#### Issue: "Not enough clusters for scCATCH"
**Cause**: scCATCH requires at least 2 clusters.

**Solutions**:
1. Ensure clustering result has ≥2 clusters
2. Increase clustering resolution in `SeuratClustering`
3. Use alternative tool (ScType, hitype, CellTypist)

#### Issue: CellTypist Python not found
**Cause**: CellTypist requires a Python environment with the `celltypist` package installed.

**Solutions**:
1. Specify correct Python path: `celltypist_args.python = "/usr/bin/python3"`
2. Install celltypist: `pip install celltypist`
3. Verify Python environment: `python -c "import celltypist; print(celltypist.__version__)"`

#### Issue: CellTypist model file not found
**Cause**: Model path is incorrect or model not downloaded.

**Solutions**:
1. Download model from: https://celltypist.cog.sanger.ac.uk/models/models.json
2. Use absolute path for `celltypist_args.model`
3. Verify model file exists and is readable

#### Issue: "Unknown tool" error
**Cause**: Invalid `tool` value specified.

**Solutions**:
1. Check valid options: `direct`, `sctype`, `hitype`, `sccatch`, `celltypist`
2. Verify spelling is correct (case-sensitive)
3. Check tool is installed in environment

#### Issue: Annotations overwritten by multiple annotation processes
**Cause**: Multiple annotation processes write to same metadata column.

**Solutions**:
1. Use `newcol` parameter to create separate columns:
   ```toml
   [CellTypeAnnotation.envs]
   newcol = "celltype_method1"
   ```
2. Or use `backup_col` to preserve original:
   ```toml
   backup_col = "original_clusters_id"
   ```

#### Issue: Ambiguous cell type assignments
**Cause**: Clusters have similar marker expression patterns.

**Solutions**:
1. Increase clustering resolution for finer separation
2. Use `merge = false` to keep cluster-specific labels
3. Compare multiple annotation methods for consensus
4. Manual inspection of top marker genes

#### Issue: Missing cell types in results
**Cause**: Clusters removed by "NA" placeholder or filtering.

**Solutions**:
1. Check `cell_types` list for "NA" entries
2. Verify `newcol` is not set (NA removal only works without newcol)
3. Check downstream processes for filtering

#### Issue: H5AD input annotation fails
**Cause**: `ident` parameter not specified for H5AD files.

**Solutions**:
1. Specify cluster column: `ident = "clusters"` (or your cluster column name)
2. Check H5AD metadata for cluster column name
3. Or convert H5AD to RDS format first

#### Issue: Wrong number of cell types assigned
**Cause**: `cell_types` list length doesn't match cluster count.

**Solutions**:
1. Check number of clusters in Seurat object
2. Ensure `cell_types` list has correct number of entries
3. Use placeholders "-" or "" for clusters to keep original names
4. Shorter lists OK (extra clusters keep original names)
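
If you prefer an explicit entry for every cluster, a short list can be padded with the `"-"` placeholder and printed as a ready-to-paste config line. This is a small convenience sketch; `pad_cell_types` and `to_toml_line` are hypothetical helpers, and the cluster count must come from your own Seurat object.

```python
import json

def pad_cell_types(cell_types, n_clusters):
    """Pad a short cell_types list with "-" so it has one entry per cluster."""
    if len(cell_types) > n_clusters:
        raise ValueError(
            f"{len(cell_types)} cell types given for {n_clusters} clusters"
        )
    return list(cell_types) + ["-"] * (n_clusters - len(cell_types))

def to_toml_line(cell_types):
    """Render the list as a TOML assignment (a JSON string array is valid TOML)."""
    return "cell_types = " + json.dumps(cell_types, ensure_ascii=False)

padded = pad_cell_types(["CD4+ T", "CD8+ T"], n_clusters=4)
print(to_toml_line(padded))
```

A list longer than the cluster count raises an error here, mirroring the rule that extra entries are not allowed.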

### Verification Steps

After annotation, verify:

1. **Check output file**:
   ```bash
   # View cluster to cell type mapping
   cat .pipen/Immunopipe/CellTypeAnnotation/0/output/cluster2celltype.tsv
   ```

2. **Check Seurat object metadata**:
   ```R
   library(Seurat)
   obj <- readRDS(".pipen/Immunopipe/CellTypeAnnotation/0/output/annotated.rds")
   head(obj@meta.data)
   # Look for cell type column (seurat_clusters or newcol name)
   ```

3. **Validate annotation quality**:
   ```R
   # Check distribution of cell types
   table(Idents(obj))

   # Visualize UMAP with cell types
   DimPlot(obj, group.by = "celltype_hitype", label = TRUE, repel = TRUE)
   ```

4. **Compare multiple methods**:
   ```R
   # Compare ScType vs hitype annotations
   table(obj$celltype_sctype, obj$celltype_hitype)
   ```
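
Outside R, the same comparison can be made on exported metadata. The sketch below assumes the two annotation columns have been loaded as equal-length lists of per-cell labels; `crosstab` and `agreement` are hypothetical helpers and the labels are toy values.

```python
from collections import Counter

def crosstab(labels_a, labels_b):
    """Count cells for every (label_a, label_b) pair, like R's table(a, b)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label vectors must cover the same cells")
    return Counter(zip(labels_a, labels_b))

def agreement(labels_a, labels_b):
    """Fraction of cells given the identical label by both methods."""
    pairs = crosstab(labels_a, labels_b)
    total = sum(pairs.values())
    same = sum(n for (a, b), n in pairs.items() if a == b)
    return same / total if total else 0.0

sctype = ["T cells", "T cells", "B cells", "NK cells"]
hitype = ["T cells", "T cells", "B cells", "T cells"]
for (a, b), n in sorted(crosstab(sctype, hitype).items()):
    print(f"{a!r} vs {b!r}: {n}")
print("agreement:", agreement(sctype, hitype))
```

Exact-match agreement is a conservative measure: two databases may name the same population differently, so inspect the off-diagonal pairs before treating them as real disagreements.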

## Best Practices

### Method Selection
1. **Start with hitype**: Fast, good for PBMC/immune datasets
2. **Compare with ScType**: Alternative database-based method
3. **Use CellTypist for complex datasets**: ML-based and handles complex datasets well
4. **Manual refinement**: Use direct annotation for corrections

### Multi-Method Workflow
1. Run multiple annotation methods in parallel
2. Compare results for consensus
3. Manually refine discrepancies using direct annotation
4. Keep original cluster names for traceability
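
Steps 2 and 3 can be partly automated: where all methods agree on a cluster, take the shared label; where they differ, emit the `"-"` placeholder and flag the cluster for manual review. This is a sketch under the assumption that each method's result is available as a cluster-to-label dictionary; `consensus_cell_types` is a hypothetical helper.

```python
def consensus_cell_types(clusters, *method_maps):
    """Build a `cell_types` list: agreed label, or "-" where methods disagree.

    Returns (cell_types, review), where review lists clusters needing a manual call.
    """
    cell_types, review = [], []
    for cluster in clusters:
        labels = {m[cluster] for m in method_maps if cluster in m}
        if len(labels) == 1:
            cell_types.append(labels.pop())
        else:
            cell_types.append("-")   # keep the original name until reviewed
            review.append(cluster)
    return cell_types, review

sctype = {"0": "T cells", "1": "B cells", "2": "NK cells"}
hitype = {"0": "T cells", "1": "B cells", "2": "T cells"}
cell_types, review = consensus_cell_types(["0", "1", "2"], sctype, hitype)
print(cell_types)
print("review manually:", review)
```

The resulting list can be used as `cell_types` with `tool = "direct"` once the flagged clusters have been resolved by inspecting their marker genes.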

### Tissue-Specific Annotation
1. Always specify tissue when using ScType/hitype
2. Use custom databases for non-standard tissues
3. Verify database contains relevant cell types

### Reproducibility
1. Save cluster-to-celltype mapping (`cluster2celltype.tsv`)
2. Document which tool/database was used
3. Keep original cluster names using `newcol` or `backup_col`

## External References

### Tool Documentation
- **ScType**: https://github.com/IanevskiAleksandr/sc-type
- **hitype**: https://github.com/pwwang/hitype
- **scCATCH**: https://github.com/ZJUFanLab/scCATCH
- **CellTypist**: https://celltypist.org/

### Database Downloads
- **ScType databases**:
  - Full: https://github.com/IanevskiAleksandr/sc-type/blob/master/ScTypeDB_full.xlsx
  - Short: https://github.com/IanevskiAleksandr/sc-type/blob/master/ScTypeDB_short.xlsx
- **CellTypist models**: https://celltypist.cog.sanger.ac.uk/models/models.json

### Related Processes
- `SeuratClustering`: Clustering before annotation
- `SeuratMap2Ref`: Reference-based annotation (alternative)
- `ClusterMarkers`: Find markers for each cell type
- `SeuratClusterStats`: Visualize annotated clusters