---
name: cdr3clustering
description: Cluster TCR/BCR clones by CDR3 sequences using GIANA or ClusTCR (both Faiss-based). Adds `CDR3_Cluster` column to metadata for clonotype analysis.
---

# CDR3Clustering Process Configuration

## Purpose

Cluster TCR/BCR clones by CDR3 sequences using GIANA or ClusTCR (both Faiss-based). Adds a `CDR3_Cluster` column to the metadata for clonotype analysis.

## When to Use

- To identify groups of similar TCR/BCR clonotypes
- For analyzing TCR sequence convergence
- After ScRepCombiningExpression, when TCR/BCR data is integrated with RNA
- For investigating public clonotypes across samples
- Before TESSA analysis for epitope specificity

**Important**: Only runs when VDJ input is present (TCRData/BCRData columns in SampleInfo).

## Configuration Structure

### Process Enablement

```toml
[CDR3Clustering]
cache = true
```

### Input Specification

```toml
[CDR3Clustering.in]
screpfile = "path/to/combined_object.qs"
```

### Environment Variables

```toml
[CDR3Clustering.envs]
type = "auto"          # TCR, BCR, or auto
tool = "GIANA"         # GIANA or ClusTCR
python = "python"      # Path to python
within_sample = true   # Cluster per sample
args = {}              # Tool-specific arguments
chain = "both"         # TRA, TRB, IGH, IGL, IGK, both, heavy, light
```

#### GIANA Arguments (via `args`)

```toml
[CDR3Clustering.envs.args]
method = "hierarchical"  # hierarchical, kmeans
dist = "hamming"         # hamming, levenshtein
threshold = 0.15         # Distance threshold
```

#### ClusTCR Arguments (via `args`)

```toml
[CDR3Clustering.envs.args]
method = "two-step"        # mcl, faiss, two-step
n_cpus = 4                 # CPUs for MCL
faiss_cluster_size = 5000  # Supercluster size
mcl_params = [1.2, 2]      # [inflation, expansion]
```

## Configuration Examples

### Minimal Configuration

```toml
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"
```

### GIANA with Custom Distance Threshold

```toml
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"
[CDR3Clustering.envs]
tool = "GIANA"

[CDR3Clustering.envs.args]
method = "hierarchical"
dist = "hamming"
threshold = 0.15
```

### ClusTCR Two-Step (Large Datasets)

```toml
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "two-step"
faiss_cluster_size = 5000
n_cpus = 8
```

### ClusTCR MCL (Small Datasets)

```toml
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "mcl"
n_cpus = 4
```

### TRB Chain Only

```toml
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
chain = "TRB"
```

### Cross-Sample Clustering

```toml
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
within_sample = false
```

## Common Patterns

### Pattern 1: Standard TCR Beta Chain

```toml
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
type = "TCR"
tool = "GIANA"
chain = "TRB"
```

### Pattern 2: Large Dataset (>100K sequences)

```toml
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "two-step"
faiss_cluster_size = 5000
n_cpus = 8
```

### Pattern 3: Custom Threshold

```toml
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "GIANA"

[CDR3Clustering.envs.args]
threshold = 0.15  # Higher = fewer clusters, lower = more clusters
```

## Dependencies

### Upstream

- **ScRepCombiningExpression** (required): Combined scRepertoire object with TCR/BCR data

### Downstream

- **TESSA**: TCR epitope specificity prediction
- **ClonalStats**: Clonality
statistics (uses `CDR3_Cluster` metadata)

## Validation Rules

1. Tool must be `"GIANA"` or `"ClusTCR"`
2. Chain must be valid for data type (TCR: TRA/TRB, BCR: IGH/IGL/IGK)
3. GIANA requires: biopython, faiss, scikit-learn
4. ClusTCR requires: clustcr package

### Computational Considerations

- <50K sequences: ClusTCR `method = "mcl"` (highest quality)
- 50K-500K sequences: ClusTCR `method = "two-step"` (balanced)
- >500K sequences: GIANA or ClusTCR `method = "two-step"` (fastest)
- Memory: GIANA ~2-4 GB per 100K sequences, ClusTCR ~4-8 GB per 100K sequences
- Runtime: GIANA 1-5 min per 100K sequences, ClusTCR two-step 2-10 min per 100K sequences

## Troubleshooting

### Process not running

**Cause**: No VDJ data available

**Solution**: Verify that the ScRepCombiningExpression output contains TCR/BCR data

### ModuleNotFoundError

**Cause**: Missing dependencies

**Solution**:

- GIANA: `pip install biopython faiss-cpu scikit-learn`
- ClusTCR: `conda install clustcr -c svalkiers -c bioconda -c pytorch -c conda-forge`

### Too many/few clusters

**Cause**: Threshold is inappropriate for the data

**Solution**: Adjust `threshold` (higher = fewer clusters, lower = more clusters)

### Out of memory

**Cause**: Dataset too large for available RAM

**Solution**: Use `within_sample = true`, reduce `n_cpus`, or use GIANA

### Slow clustering

**Cause**: Suboptimal method for dataset size

**Solution**:

- >50K: ClusTCR `method = "two-step"` with increased `n_cpus`
- Very large (>500K): Use GIANA

## Notes on Output Format

**Metadata column**: `CDR3_Cluster`

**Cluster naming**:

- `S_1`, `S_2`: Single unique CDR3 sequence (may have multiple cells)
- `M_1`, `M_2`: Multiple unique CDR3 sequences (similar but different)

**Interpretation**:

- `S_` prefix: Cells share an identical CDR3 sequence
- `M_` prefix: Cells have similar but different CDR3 sequences
- Use `CDR3_Cluster` as a grouping factor in Seurat plots

**Performance Tips**:

- Small (<10K): GIANA defaults (quality over speed)
- Medium (10K-100K): ClusTCR two-step with `n_cpus = 4`
- Large (100K-1M): ClusTCR two-step with `n_cpus = 8` or more, or GIANA
- Very large (>1M): GIANA, or ClusTCR two-step with increased `faiss_cluster_size`
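
As a quick check on the `S_`/`M_` naming described above, the cluster labels can be tallied outside the pipeline. The following is a minimal sketch, assuming the `CDR3_Cluster` column has been exported as a plain list with one label per cell; the function name `summarize_clusters` and the sample labels are hypothetical and not part of the pipeline.

```python
from collections import Counter


def summarize_clusters(labels):
    """Count cells per cluster, split by the S_/M_ prefix convention."""
    # Skip cells without a cluster label (e.g. cells with no VDJ data).
    sizes = Counter(label for label in labels if label)
    # S_ clusters: all cells share one identical CDR3 sequence.
    single = {c: n for c, n in sizes.items() if c.startswith("S_")}
    # M_ clusters: cells carry similar but different CDR3 sequences.
    multi = {c: n for c, n in sizes.items() if c.startswith("M_")}
    return single, multi


# Hypothetical labels, one per cell, as they might appear in CDR3_Cluster.
labels = ["S_1", "S_1", "M_1", "M_1", "M_1", "S_2", None]
single, multi = summarize_clusters(labels)
print(single)  # {'S_1': 2, 'S_2': 1}
print(multi)   # {'M_1': 3}
```

Large `M_` clusters are usually the ones worth inspecting first, since they group distinct but similar CDR3 sequences.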