---
name: metabolicinput
description: Pass-through process that prepares Seurat object for metabolic landscape analysis. Routes the processed Seurat object to downstream metabolic analysis processes (MetabolicExprImputation, MetabolicPathwayActivity, MetabolicFeatures, MetabolicPathwayHeterogeneity). **Note**: This process requires no direct configuration.
---

# MetabolicInput Process Configuration

## Purpose
Pass-through process that prepares Seurat object for metabolic landscape analysis. Routes the processed Seurat object to downstream metabolic analysis processes (MetabolicExprImputation, MetabolicPathwayActivity, MetabolicFeatures, MetabolicPathwayHeterogeneity).

**Note**: This process requires no direct configuration. All metabolic analysis parameters are configured at the ScrnaMetabolicLandscape group level.

## When to Use
- First step in modular metabolic analysis workflow
- When you want to perform metabolic pathway analysis on single-cell RNA-seq data
- Alternative to ScrnaMetabolicLandscape (same group, modular approach)
- After clustering is complete (SeuratClustering or related processes)
- When investigating metabolic heterogeneity across cell types or conditions

## Configuration Structure

### Process Enablement
```toml
[ScrnaMetabolicLandscape]
# This enables the entire metabolic analysis group
# MetabolicInput is automatically included as part of this group

[ScrnaMetabolicLandscape.envs]
# Configure metabolic analysis parameters here
```

### Input Specification
MetabolicInput automatically receives input from upstream processes:
- Requires: Seurat object from CombinedInput (includes RNA + optional VDJ data)
- Typically follows: `SeuratClustering`, `TESSA`, or other clustering/annotation processes

### Environment Variables (Group Level)
All metabolic analysis configuration is done at the ScrnaMetabolicLandscape group level:

```toml
[ScrnaMetabolicLandscape.envs]
# Metabolic pathway database file
gmtfile = "KEGG_2021_Human"

# Skip imputation (if data already complete)
noimpute = false

# Number of cores for parallelization
ncores = 4

# Optional: Subset data by metadata column
# subset_by = "Response"  # Remove NA values in this column

# Optional: Group data by metadata column
# group_by = "cluster"

# Optional: Add metadata columns for grouping/subsetting
# mutaters = {timepoint = "if_else(treatment == 'control', 'pre', 'post')"}
```

## Metabolic Pathway Databases

### Available Databases (via enrichit)

The `gmtfile` parameter accepts either:

1. **Built-in database names** (auto-downloaded):
   - `"KEGG_2021_Human"` - KEGG pathways (human, default)
   - `"KEGG"` - KEGG pathways (latest)
   - `"Reactome_Pathways_2024"` - Reactome pathways
   - `"Reactome"` - Reactome pathways (latest)
   - `"BioCarta_2016"` - BioCarta pathways
   - `"MSigDB_Hallmark_2020"` - MSigDB Hallmark gene sets
   - See full list: https://pwwang.github.io/enrichit/reference/FetchGMT.html

2. **Custom GMT files** (local paths or URLs):
   - Local file: `/path/to/custom.gmt`
   - URL: `https://example.com/pathways.gmt`

### Database Descriptions

- **KEGG**: Kyoto Encyclopedia of Genes and Genomes - manually curated metabolic pathways. Comprehensive coverage of metabolism, including carbohydrate, energy, lipid, nucleotide, amino acid, xenobiotics, and other pathways. Species-specific versions available.

- **Reactome**: Curated pathway database covering cellular processes, signal transduction, metabolic pathways, and more. More comprehensive than KEGG for signaling and regulatory pathways. Good for human/mouse.

- **BioCarta**: Curated pathways focusing on cell signaling, metabolic, and disease pathways. Older database but still useful for classic pathways.

- **Custom GMT**: Your own gene sets in GMT format (Gene Set Enrichment Format). Format: `name\tdescription\tgene1,gene2,gene3` (tab-separated).

### Species-Specific Considerations

- **Human data**: Use `"KEGG_2021_Human"`, `"Reactome_Pathways_2024"`, or species-specific GMT files
- **Mouse data**: Use KEGG with mouse gene IDs or download mouse-specific GMT from MSigDB
- **Other species**: Provide custom GMT file with appropriate gene identifiers matching your Seurat object
- **Gene name matching**: Ensure gene names in Seurat object match GMT file (case-sensitive, human: UPPERCASE, mouse: TitleCase)

## Configuration Examples

### Minimal Configuration (Default KEGG)
```toml
[ScrnaMetabolicLandscape]
```

### KEGG Human Pathways (Explicit)
```toml
[ScrnaMetabolicLandscape]
[ScrnaMetabolicLandscape.envs]
gmtfile = "KEGG_2021_Human"
ncores = 4
noimpute = false
```

### Reactome Pathways
```toml
[ScrnaMetabolicLandscape]
[ScrnaMetabolicLandscape.envs]
gmtfile = "Reactome_Pathways_2024"
ncores = 8
```

### Custom Metabolic Pathway GMT File
```toml
[ScrnaMetabolicLandscape]
[ScrnaMetabolicLandscape.envs]
gmtfile = "/data/pathways/custom_metabolism.gmt"
ncores = 4
```

### Subset Analysis by Response Group
```toml
[ScrnaMetabolicLandscape]
[ScrnaMetabolicLandscape.envs]
gmtfile = "KEGG_2021_Human"
subset_by = "Response"  # Analyze responders vs non-responders
group_by = "cluster"
ncores = 4
```

### Multiple Pathway Databases (Via Cases)
```toml
[ScrnaMetabolicLandscape]
[ScrnaMetabolicLandscape.envs]
ncores = 4

# Analyze with KEGG
[ScrnaMetabolicLandscape.envs.cases.KEGG]
gmtfile = "KEGG_2021_Human"
group_by = "cluster"

# Analyze with Reactome
[ScrnaMetabolicLandscape.envs.cases.Reactome]
gmtfile = "Reactome_Pathways_2024"
group_by = "cluster"
```

### Adding Custom Metadata for Grouping
```toml
[ScrnaMetabolicLandscape]
[ScrnaMetabolicLandscape.envs]
gmtfile = "KEGG_2021_Human"
ncores = 4
# Create timepoint column based on treatment
mutaters = {timepoint = "if_else(treatment == 'control', 'pre', 'post')"}
subset_by = "timepoint"
group_by = "cluster"
```

## Common Patterns

### Pattern 1: Standard Metabolic Analysis
```toml
# Basic setup with KEGG pathways
[ScrnaMetabolicLandscape]
[ScrnaMetabolicLandscape.envs]
gmtfile = "KEGG_2021_Human"
ncores = 4
```

### Pattern 2: Skip Imputation (Clean Data)
```toml
# If data is already complete, skip imputation step
[ScrnaMetabolicLandscape]
[ScrnaMetabolicLandscape.envs]
gmtfile = "KEGG_2021_Human"
noimpute = true
ncores = 4
```

### Pattern 3: Disease vs Control Comparison
```toml
# Compare metabolic pathways between conditions
[ScrnaMetabolicLandscape]
[ScrnaMetabolicLandscape.envs]
gmtfile = "KEGG_2021_Human"
subset_by = "diagnosis"  # e.g., "disease", "control"
group_by = "cluster"
ncores = 4
```

### Pattern 4: Time Series Analysis
```toml
# Analyze metabolic changes across timepoints
[ScrnaMetabolicLandscape]
[ScrnaMetabolicLandscape.envs]
gmtfile = "Reactome_Pathways_2024"
subset_by = "timepoint"  # e.g., "day0", "day7", "day14"
group_by = "cluster"
ncores = 8
```

### Pattern 5: Species-Specific Analysis
```toml
# Non-human data with custom pathways
[ScrnaMetabolicLandscape]
[ScrnaMetabolicLandscape.envs]
gmtfile = "/data/pathways/mouse_metabolism.gmt"
ncores = 4
```

## Dependencies

### Upstream Processes
- **Required**: Seurat object from `CombinedInput`
  - CombinedInput can be: `ScRepCombiningExpression` (RNA + VDJ) or `RNAInput` (RNA only)
  - RNAInput typically: `SeuratClustering`, `SeuratMap2Ref`, `CellTypeAnnotation`, or `TESSA`
- **Preceding**: Clustering must be complete before metabolic analysis

### Downstream Processes (In ScrnaMetabolicLandscape Group)
- **MetabolicExprImputation** (optional): Impute missing expression values (ALRA, scImpute, or MAGIC)
- **MetabolicPathwayActivity**: Calculate pathway activity scores per group
- **MetabolicFeatures**: Enrichment analysis of metabolic pathways per group
- **MetabolicPathwayHeterogeneity**: Calculate metabolic heterogeneity across groups

## Validation Rules

### Database Validation
- `gmtfile` must be a valid enrichit database name OR accessible GMT file path/URL
- For custom GMT files:
  - File must exist (absolute path or relative to config file)
  - Format must be GMT: `name\tdescription\tgene1,gene2,gene3`
  - Gene identifiers must match Seurat object (case-sensitive)

### Species Validation
- Gene names in Seurat object must match GMT file:
  - Human: UPPERCASE (e.g., `CD3D`, `IFNG`)
  - Mouse: TitleCase (e.g., `Cd3d`, `Ifng`)
  - Verify with: `sobj@assays$RNA@features` (Seurat R command)

### Metadata Validation
- If `subset_by` specified: column must exist in Seurat object metadata
- If `group_by` specified: column must exist in Seurat object metadata
- NA values in `subset_by` column are automatically removed

## Troubleshooting

### Common Pathway Loading Issues

#### Issue: "GMT file not found"
**Cause**: Invalid path to custom GMT file
**Solution**:
```toml
# Use absolute path
gmtfile = "/full/path/to/pathways.gmt"

# Or path relative to config file location
gmtfile = "./data/pathways.gmt"
```

#### Issue: "Gene names not found in Seurat object"
**Cause**: Gene identifier mismatch between GMT and Seurat object
**Solution**:
- Check gene format in Seurat: `sobj@assays$RNA@features[1:10,]`
- Ensure case matches: Human (UPPERCASE) vs Mouse (TitleCase)
- Consider using gene symbol conversion tools if needed

#### Issue: "Empty pathway results"
**Cause**: Too few genes matching between pathways and data
**Solution**:
- Verify species compatibility (human GMT with mouse data won't work)
- Try different database: Switch from KEGG to Reactome or vice versa
- Use custom GMT with species-specific pathways

#### Issue: "No enriched pathways found"
**Cause**: Statistical thresholds too strict or no biological differences
**Solution**:
- Relax p-value cutoff in downstream processes (e.g., `pathway_pval_cutoff`)
- Check grouping: Ensure groups have distinct biological differences
- Use more comprehensive database (Reactome often has more pathways than KEGG)

### Performance Issues

#### Issue: Metabolic analysis too slow
**Cause**: Insufficient cores for parallelization
**Solution**:
```toml
# Increase cores for metabolic analysis
[ScrnaMetabolicLandscape.envs]
ncores = 8  # Increase based on available CPU
```

#### Issue: Memory errors during imputation
**Cause**: Large dataset with imputation enabled
**Solution**:
```toml
# Skip imputation if data is complete
[ScrnaMetabolicLandscape.envs]
noimpute = true
```

### Integration Issues

#### Issue: Process not running
**Cause**: ScrnaMetabolicLandscape not enabled in config
**Solution**:
```toml
# Ensure the group is enabled
[ScrnaMetabolicLandscape]
```

#### Issue: Wrong input data
**Cause**: Clustering not complete or incorrect upstream process
**Solution**:
- Ensure `SeuratClustering` or similar process runs before metabolic analysis
- Check that Seurat object has cluster assignments: `sobj@meta.data$seurat_clusters`
- Verify no missing values in metadata columns used for grouping

## Reference

- **Original Paper**: Xiao, Z. et al. "Metabolic landscape of the tumor microenvironment at single cell resolution." Nature Communications 10, 1-12 (2019)
- **Pipeline**: https://github.com/LocasaleLab/Single-Cell-Metabolic-Landscape
- **KEGG**: https://www.genome.jp/kegg/pathway.html
- **Reactome**: https://reactome.org/
- **enrichit Databases**: https://pwwang.github.io/enrichit/reference/FetchGMT.html
- **GMT Format**: http://www.broadinstitute.org/gsea/msigdb/file_formats.jsp