--- name: go-kegg-enrichment description: "\"Performs GO (Gene Ontology) and KEGG pathway enrichment analysis on\\" --- # GO/KEGG Enrichment Analysis Automated pipeline for Gene Ontology and KEGG pathway enrichment analysis with result interpretation and visualization. ## Features - **GO Enrichment**: Biological Process (BP), Molecular Function (MF), Cellular Component (CC) - **KEGG Pathway**: Pathway enrichment with organism-specific mapping - **Multiple ID Support**: Gene symbols, Entrez IDs, Ensembl IDs, RefSeq - **Statistical Methods**: Hypergeometric test, Fisher's exact test, GSEA support - **Visualizations**: Bar plots, dot plots, enrichment maps, cnet plots - **Result Interpretation**: Automatic biological significance summary ## Supported Organisms | Common Name | Scientific Name | KEGG Code | OrgDB Package | |-------------|-----------------|-----------|---------------| | Human | Homo sapiens | hsa | org.Hs.eg.db | | Mouse | Mus musculus | mmu | org.Mm.eg.db | | Rat | Rattus norvegicus | rno | org.Rn.eg.db | | Zebrafish | Danio rerio | dre | org.Dr.eg.db | | Fly | Drosophila melanogaster | dme | org.Dm.eg.db | | Yeast | Saccharomyces cerevisiae | sce | org.Sc.sgd.db | ## Usage ### Basic Usage ```python # Run enrichment analysis with gene list python scripts/main.py --genes gene_list.txt --organism human --output results/ ``` ### Parameters | Parameter | Description | Default | Required | |-----------|-------------|---------|----------| | `--genes` | Path to gene list file (one gene per line) | - | Yes | | `--organism` | Organism code (human/mouse/rat/zebrafish/fly/yeast) | human | No | | `--id-type` | Gene ID type (symbol/entrez/ensembl/refseq) | symbol | No | | `--background` | Background gene list file | all genes | No | | `--pvalue-cutoff` | P-value cutoff for significance | 0.05 | No | | `--qvalue-cutoff` | Adjusted p-value (q-value) cutoff | 0.2 | No | | `--analysis` | Analysis type (go/kegg/all) | all | No | | `--output` | Output directory | ./enrichment_results | No | | `--format` | Output format (csv/tsv/excel/all) | all | No | ### Advanced Usage ```python # GO enrichment only with specific ontology python scripts/main.py \ --genes deg_upregulated.txt \ --organism mouse \ --analysis go \ --go-ontologies BP,MF \ --pvalue-cutoff 0.01 \ --output go_results/ # KEGG enrichment with custom background python scripts/main.py \ --genes treatment_genes.txt \ --background all_expressed_genes.txt \ --organism human \ --analysis kegg \ --qvalue-cutoff 0.05 \ --output kegg_results/ ``` ## Input Format ### Gene List File ``` TP53 BRCA1 EGFR MYC KRAS PTEN ``` ### With Expression Values (for GSEA) ``` gene,log2FoldChange TP53,2.5 BRCA1,-1.8 EGFR,3.2 ``` ## Output Files ``` output/ ├── go_enrichment/ │ ├── GO_BP_results.csv # Biological Process results │ ├── GO_MF_results.csv # Molecular Function results │ ├── GO_CC_results.csv # Cellular Component results │ ├── GO_BP_barplot.pdf # Visualization │ ├── GO_MF_dotplot.pdf │ └── GO_summary.txt # Interpretation summary ├── kegg_enrichment/ │ ├── KEGG_results.csv # Pathway results │ ├── KEGG_barplot.pdf │ ├── KEGG_dotplot.pdf │ └── KEGG_pathview/ # Pathway diagrams └── combined_report.html # Interactive report ``` ## Result Interpretation The tool automatically generates biological interpretation including: 1. **Top Enriched Terms**: Significant GO terms/pathways ranked by enrichment ratio 2. **Functional Themes**: Clustered biological themes from enriched terms 3. **Key Genes**: Core genes driving enrichment in significant terms 4. **Network Relationships**: Gene-term relationship visualization 5. **Clinical Relevance**: Disease associations (for human genes) ## Technical Difficulty: **HIGH** ⚠️ **AI自主验收状态**: 需人工检查 This skill requires: - R/Bioconductor environment with clusterProfiler - Multiple annotation databases (org.*.eg.db) - KEGG REST API access - Complex visualization dependencies ## Dependencies ### Required R Packages ```r install.packages(c("BiocManager", "ggplot2", "dplyr", "readr")) BiocManager::install(c( "clusterProfiler", "org.Hs.eg.db", "org.Mm.eg.db", "org.Rn.eg.db", "enrichplot", "pathview", "DOSE" )) ``` ### Python Dependencies ```bash pip install pandas numpy matplotlib seaborn rpy2 ``` ## Example Workflow 1. **Prepare Input**: Create gene list from DEG analysis 2. **Run Analysis**: Execute main.py with appropriate parameters 3. **Review Results**: Check generated CSV files and visualizations 4. **Interpret**: Read auto-generated summary for biological insights ## References See `references/` for: - clusterProfiler documentation - KEGG API guide - Statistical methods explanation - Visualization examples ## Limitations - Requires internet connection for KEGG database queries - Large gene lists (>5000) may require increased memory - Some pathways may not be available for all organisms - KEGG API has rate limits (max 3 requests/second) ## Risk Assessment | Risk Indicator | Assessment | Level | |----------------|------------|-------| | Code Execution | Python/R scripts executed locally | Medium | | Network Access | No external API calls | Low | | File System Access | Read input files, write output files | Medium | | Instruction Tampering | Standard prompt guidelines | Low | | Data Exposure | Output files saved to workspace | Low | ## Security Checklist - [ ] No hardcoded credentials or API keys - [ ] No unauthorized file system access (../) - [ ] Output does not expose sensitive information - [ ] Prompt injection protections in place - [ ] Input file paths validated (no ../ traversal) - [ ] Output directory restricted to workspace - [ ] Script execution in sandboxed environment - [ ] Error messages sanitized (no stack traces exposed) - [ ] Dependencies audited ## Prerequisites ```bash # Python dependencies pip install -r requirements.txt ``` ## Evaluation Criteria ### Success Metrics - [ ] Successfully executes main functionality - [ ] Output meets quality standards - [ ] Handles edge cases gracefully - [ ] Performance is acceptable ### Test Cases 1. **Basic Functionality**: Standard input → Expected output 2. **Edge Case**: Invalid input → Graceful error handling 3. **Performance**: Large dataset → Acceptable processing time ## Lifecycle Status - **Current Stage**: Draft - **Next Review Date**: 2026-03-06 - **Known Issues**: None - **Planned Improvements**: - Performance optimization - Additional feature support