---
name: cdr3aaphyschem
description: Analyzes physicochemical properties of CDR3 amino acid sequences to understand biochemical characteristics of T-cell receptor repertoires. Performs regression analysis between two cell groups at different CDR3 lengths for each physicochemical feature (hydrophobicity, volume, isoelectric point, etc.).
---

# CDR3AAPhyschem Process Configuration

## Purpose

Analyzes physicochemical properties of CDR3 amino acid sequences to understand biochemical characteristics of T-cell receptor repertoires. Performs regression analysis between two cell groups at different CDR3 lengths for each physicochemical feature (hydrophobicity, volume, isoelectric point, etc.).

## When to Use

- To analyze differences in CDR3 biochemical properties between cell groups (e.g., Treg vs Tconv)
- For feature engineering in TCR machine learning models
- To identify sequence features that distinguish cell subsets
- After `ScRepCombiningExpression` (requires combined TCR + RNA data)
- When investigating T cell fate determination (regulatory vs conventional T cells)

## Configuration Structure

### Process Enablement

```toml
[CDR3AAPhyschem]
cache = true
```

### Input Specification

```toml
[CDR3AAPhyschem.in]
scrfile = ["ScRepCombiningExpression"]
```

- `scrfile`: Output from `ScRepCombiningExpression` (RDS or qs/qs2 format)
  - Must contain both TRA and TRB chains
  - Generated by `scRepertoire::combineExpression()`

### Environment Variables

```toml
[CDR3AAPhyschem.envs]
# Group comparison specification
group = "CellType"
comparison = {Treg = ["CD4 CTL", "CD4 Naive", "CD4 TCM", "CD4 TEM"], Tconv = "Tconv"}
target = "Treg"
each = "Sample"
# Chain selection
chain = "TRB"
```

**Key Parameters:**

- `group`: Column name in metadata defining groups to compare (e.g., `CellType`, `seurat_clusters`)
- `comparison`: Two-group specification for regression analysis
  - Format 1 (dict): `{Group1 = ["cell1", "cell2"], Group2 = "cell3"}` (each group maps to one or more values of the `group` column)
  - Format 2 (list): `["Group1", "Group2"]` (when the group names already exist as values in the `group` column)
- `target`: Which group to label as 1 in regression (default: first group in `comparison`)
- `each`: Column(s) to split data for separate analyses
  - Single column: `"Sample"`
  - Multiple columns: `["Sample", "Patient"]`
  - Comma-separated: `"Sample,Patient"`
  - If not provided, all cells are used together
- `chain`: TCR chain whose CDR3 sequences are analyzed (`"TRA"` or `"TRB"`)

## Configuration Examples

### Minimal Configuration

```toml
[CDR3AAPhyschem]
[CDR3AAPhyschem.in]
scrfile = ["ScRepCombiningExpression"]
```

### Standard Treg vs Tconv Analysis

```toml
[CDR3AAPhyschem]
[CDR3AAPhyschem.envs]
# Define cell type groups for comparison
group = "CellType"
comparison = {Treg = ["Treg"], Tconv = ["Tconv"]}
target = "Treg"
chain = "TRB"
```

### Multi-Sample Analysis

```toml
[CDR3AAPhyschem]
[CDR3AAPhyschem.envs]
group = "CellType"
comparison = ["Treg", "Tconv"]
target = "Treg"
# Run regression separately for each sample
each = "Sample"
chain = "TRB"
```

### Custom Group Definition

```toml
[CDR3AAPhyschem]
[CDR3AAPhyschem.envs]
group = "Cluster"
# Define clusters to compare
comparison = { HighQuality = ["c1", "c2", "c5"], LowQuality = ["c3", "c4"] }
target = "HighQuality"
chain = "TRB"
```

## Physicochemical Properties

### Available Properties

The process calculates 8 key physicochemical properties from CDR3 amino acid sequences:

| Property | Description | Biological Significance |
|----------|-------------|-------------------------|
| **length** | Total amino acid count in CDR3 | Influences binding loop size and flexibility |
| **gravy** | Grand average of hydropathy (Kyte-Doolittle scale) | Hydrophobic CDR3s associate with self-reactivity and Treg fate |
| **bulkiness** | Average bulkiness (Zimmerman scale) | Measures steric bulk of amino acids |
| **polarity** | Average polarity (Grantham scale) | Influences interactions with peptide-MHC |
| **aliphatic** | Normalized aliphatic index (Ikai scale) | Related to thermal stability |
| **charge** | Normalized net charge at physiological pH | Affects electrostatic interactions |
| **acidic** | Acidic side chain residue content (D, E proportion) | Contributes to negative charge |
| **aromatic** | Aromatic side chain content (F, W, Y proportion) | Important for π-π interactions |

### Property Calculation Methods

- **Default scales**: Standard biophysical scales from peer-reviewed literature
- **GRAVY**: Kyte & Doolittle (1982) hydropathy scale
- **Bulkiness**: Zimmerman et al. (1968) bulkiness parameters
- **Polarity**: Grantham (1974) amino acid difference index
- **Aliphatic index**: Ikai (1980) thermodynamic stability scale
- **Charge**: Normalized based on pKa values (EMBOSS database)
- **Acidic/Basic/Aromatic**: Direct residue counting proportions

### Regression Analysis

- Performed for each physicochemical property independently
- Compares properties across CDR3 length distributions
- Binary classification: target group (1) vs non-target (0)
- Output: Statistical significance of property differences

## Common Patterns

### Pattern 1: Treg vs Tconv (TRB Chain)

```toml
[CDR3AAPhyschem]
[CDR3AAPhyschem.envs]
# Literature-based: hydrophobic CDR3β promotes Treg fate
group = "CellType"
comparison = {Treg = ["Treg", "CD4+Treg"], Tconv = ["Tconv", "CD4+Tconv"]}
target = "Treg"
chain = "TRB"
each = ""  # Analyze all samples together
```

### Pattern 2: Selected Properties Only

```toml
[CDR3AAPhyschem]
[CDR3AAPhyschem.envs]
# Focus on hydrophobicity (key Treg feature) when interpreting results;
# this config sets no property filter, so all properties are still computed
group = "CellType"
comparison = ["Treg", "Tconv"]
target = "Treg"
chain = "TRB"  # One chain per run; analyze other chains separately
```

### Pattern 3: Multi-Chain Analysis

Run separate processes for different chains:

```toml
# TRB analysis
[CDR3AAPhyschem]
[CDR3AAPhyschem.envs]
chain = "TRB"
group = "CellType"
comparison = ["Treg", "Tconv"]
# Note: Create separate config for TRA analysis if needed
```

### Pattern 4: Multi-Group Comparisons

```toml
[CDR3AAPhyschem]
[CDR3AAPhyschem.envs]
group = "CellType"
comparison = { Naive = ["CD4 Naive", "CD8 Naive"], Memory = ["CD4 TEM", "CD4 TCM", "CD8 TEM", "CD8 TCM"], Effector = ["CD4 CTL", "CD8 CTL"] }
target = "Naive"
chain = "TRB"
```

## Dependencies

- **Upstream**: `ScRepCombiningExpression` (required)
- **Downstream**: Feature analysis, ML model training, publication figures
- **Required data**: Both TRA and TRB chains in combined object

## Validation Rules

- **CDR3 sequence requirements**: Must have valid amino acid sequences (no ambiguous or missing residues)
- **Chain requirement**: Data must contain specified chain (TRA or TRB)
- **Group specification**: Groups must exist in metadata
- **Minimum cells**: Sufficient cells per group for statistical regression
- **Length distribution**: CDR3 length range must be adequate for regression

## Troubleshooting

### Issue: "Missing chain in data"

**Cause**: Specified chain (TRA/TRB) not found in combined object

**Solution**:

```toml
# Change to available chain
[CDR3AAPhyschem.envs]
chain = "TRA"  # or "TRB"
```

### Issue: "Group not found in metadata"

**Cause**: `group` column or `comparison` values don't exist

**Solution**:

1. Check available metadata columns in `ScRepCombiningExpression` output
2. Verify group names match exactly (case-sensitive)

```toml
[CDR3AAPhyschem.envs]
group = "seurat_clusters"  # If CellType not available
comparison = ["0", "1"]    # Use cluster IDs
```

### Issue: "Insufficient cells for regression"

**Cause**: Too few cells in one or more groups

**Solution**:

1. Use `each` to analyze samples separately if pooled analysis fails
2. Combine similar cell types in `comparison`

```toml
[CDR3AAPhyschem.envs]
# Combine rare subtypes
comparison = {HighExpander = ["Treg", "Tconv"], LowExpander = ["Tfh"]}
```

### Issue: "No significant property differences"

**Cause**: Groups may not differ in physicochemical properties

**Solution**:

1. Check if `comparison` groups are biologically distinct
2. Consider different `group` column (e.g., gene expression clusters)
3. Verify CDR3 sequences are high-quality

## Scientific Context

### Key Publications

1. **Stadinski et al. (2016)**: "Hydrophobic CDR3 residues promote the development of self-reactive T cells" - *Nature Immunology*
2. **Lagattuta et al. (2022)**: "Repertoire analyses reveal T cell antigen receptor sequence features that influence T cell fate" - *Nature Immunology*
3. **Ostmeyer et al. (2019)**: "Biophysicochemical motifs in T-cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocyte and adjacent healthy tissue" - *Cancer Research*

### Interpretation Guidelines

- **High GRAVY**: More hydrophobic CDR3 (associated with self-reactivity, Treg)
- **High charge**: Electrostatic potential may affect binding affinity
- **High aromaticity**: Increased π-π interactions, structural stability
- **Length distribution**: Longer CDR3s may provide broader specificity

### Feature Engineering Applications

Use properties as features for:

- TCR specificity prediction models
- T cell fate classification (Treg vs Tconv)
- Antigen binding affinity estimation
- Cross-reactivity assessment

## Output Format

- **Directory**: `{{in.scrfile | stem}}.cdr3aaphyschem/`
- **Files**:
  - Regression plots per property (hydrophobicity, volume, pI)
  - Statistical tables comparing groups
  - CDR3 length distributions
  - Property correlation matrices
- **Visualizations**:
  - Property vs length scatter plots
  - Group-wise property boxplots
  - Regression curves with confidence intervals

## Advanced Usage

### Custom Property Scales

If using non-default scales (requires modifying underlying R script):

```toml
# Note: Advanced usage - may require script modification
[CDR3AAPhyschem]
[CDR3AAPhyschem.envs]
# Specify alternative hydrophobicity scale
hydro_scale = "Wimley"
pK_source = "Murray"
```

### Length-Based Stratification

```toml
[CDR3AAPhyschem]
[CDR3AAPhyschem.envs]
# Analyze by CDR3 length bins
group = "CellType"
comparison = ["Treg", "Tconv"]
# Use metadata column with length information
each = "CDR3_Length_Bin"
chain = "TRB"
```

### Publication-Ready Plots

```toml
[CDR3AAPhyschem]
[CDR3AAPhyschem.envs]
group = "CellType"
comparison = {Treg = "Treg", Tconv = "Tconv"}
target = "Treg"
chain = "TRB"
# Publication parameters
plot_theme = "nature"
fig_dpi = 300
fig_format = "pdf"
```
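
### Illustrative Sketch: Computing CDR3 Properties

To make the property definitions in the table under "Available Properties" concrete, the following Python sketch computes `length`, `gravy`, `acidic`, and `aromatic` for a single CDR3 sequence. It is not the process's own implementation (which runs in R). The function name `cdr3_properties` and the example sequence are assumptions chosen for illustration; only the Kyte-Doolittle values and the residue sets named in the table are used.

```python
# Kyte-Doolittle (1982) hydropathy value for each of the 20 amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

ACIDIC = set("DE")     # acidic side chains, as listed in the table
AROMATIC = set("FWY")  # aromatic side chains, as listed in the table


def cdr3_properties(seq: str) -> dict:
    """Hypothetical helper: return a few properties for one CDR3 sequence."""
    seq = seq.strip().upper()
    # Reject empty input and any character that is not a standard residue.
    if not seq or any(aa not in KYTE_DOOLITTLE for aa in seq):
        raise ValueError(f"invalid CDR3 amino acid sequence: {seq!r}")
    n = len(seq)
    return {
        "length": n,
        # GRAVY is the mean hydropathy over all residues.
        "gravy": sum(KYTE_DOOLITTLE[aa] for aa in seq) / n,
        # Proportions are simple residue counts divided by length.
        "acidic": sum(aa in ACIDIC for aa in seq) / n,
        "aromatic": sum(aa in AROMATIC for aa in seq) / n,
    }


# Example sequence is made up for illustration.
props = cdr3_properties("CASSLGQAYEQYF")
```

A positive `gravy` indicates a more hydrophobic loop. The remaining properties (bulkiness, polarity, aliphatic index, charge) follow the same per-residue averaging idea but need their own scales, which are omitted here.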
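
### Illustrative Sketch: Resolving `comparison`, `target`, and `each`

The accepted shapes of `comparison` (dict or list) and `each` (single column, list, or comma-separated string) described under "Key Parameters" can be easier to follow as code. The Python sketch below shows one plausible way to normalize them and derive 1/0 labels. All function names are hypothetical, and treating cells outside both groups as unlabeled (`None`) is an assumption, not documented behavior of the process.

```python
def normalize_comparison(comparison):
    """Map each group name to the list of metadata values it covers."""
    if isinstance(comparison, dict):
        # Format 1: a value may be a single string or a list of strings.
        return {
            name: [vals] if isinstance(vals, str) else list(vals)
            for name, vals in comparison.items()
        }
    # Format 2: group names are themselves values of the `group` column.
    return {name: [name] for name in comparison}


def normalize_each(each):
    """Return the list of columns used to split the data (may be empty)."""
    if not each:
        return []  # nothing given: all cells are analyzed together
    if isinstance(each, str):
        return [col.strip() for col in each.split(",") if col.strip()]
    return list(each)


def label_cells(values, comparison, target=None):
    """Label each metadata value: 1 for target, 0 for the other group."""
    groups = normalize_comparison(comparison)
    target = target or next(iter(groups))  # default: first group listed
    lookup = {v: name for name, vals in groups.items() for v in vals}
    labels = []
    for value in values:
        name = lookup.get(value)
        # Assumption: cells outside the comparison are left unlabeled.
        labels.append(None if name is None else int(name == target))
    return labels


labels = label_cells(
    ["CD4 Naive", "Tconv", "B cell"],
    {"Treg": ["CD4 CTL", "CD4 Naive"], "Tconv": "Tconv"},
)
```

With the default `target`, the first group in `comparison` receives label 1, which matches the documented default.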
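
### Illustrative Sketch: Per-Length Regression

The "Regression Analysis" section describes fitting each property against a binary target label at each CDR3 length. As a rough illustration of that stratification only, the Python sketch below groups records by length and fits an ordinary least-squares slope of the label on the property value. This is a deliberate simplification: the model the process actually fits is not specified here, and the function name `slope_by_length`, the record layout, and the minimum of three cells per length are all assumptions.

```python
from collections import defaultdict


def slope_by_length(records):
    """records: iterable of (cdr3_length, property_value, label in {0, 1}).

    Returns {length: OLS slope of label on property}, or None for a length
    where the data cannot support a fit.
    """
    by_length = defaultdict(list)
    for length, value, label in records:
        by_length[length].append((value, label))

    slopes = {}
    for length, pairs in sorted(by_length.items()):
        xs = [x for x, _ in pairs]
        ys = [y for _, y in pairs]
        n = len(pairs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        var_x = sum((x - mean_x) ** 2 for x in xs)
        # Skip lengths with too few cells, a constant property value,
        # or only one group present (assumed thresholds).
        if n < 3 or var_x == 0 or len(set(ys)) < 2:
            slopes[length] = None
            continue
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        slopes[length] = cov_xy / var_x
    return slopes


# Toy data, made up for illustration: (length, gravy, label).
slopes = slope_by_length([
    (13, 0.0, 0), (13, 1.0, 1), (13, 2.0, 1), (13, -1.0, 0),
    (14, 0.5, 1), (14, 0.7, 1),
])
```

A positive slope at a given length means higher property values go with the target group among CDR3s of that length. Lengths returning `None` correspond to the "Insufficient cells for regression" situation in Troubleshooting.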