---
name: functional-enrichment
description: Perform GO and KEGG functional enrichment using HOMER from genomic regions (BED/narrowPeak/broadPeak) or gene lists, and produce R-based barplot/dotplot visualizations. Use this skill when you want to perform GO and KEGG functional enrichment using HOMER from genomic regions or just want to link genomic region to genes.

---

# Functional Enrichment (HOMER + R)

## Overview

- **Validate input**: Accept BED/peak files with genomic coordinates or gene lists; check format and genome assembly.
- **Map regions to genes**: Convert regions to a unique gene set using HOMER `annotatePeaks.pl`.
- **Run GO enrichment**: Use HOMER `findGO.pl` (or `annotatePeaks.pl -go`) for BP/MF/CC.
- **Run KEGG enrichment**: Use HOMER `findGO.pl -kegg` (or `annotatePeaks.pl -kegg`).
- **Collect outputs**: Save tidy tables for downstream plotting and a compact summary of top terms.
- **Visualize in R**: Create barplots and dotplots (GO/KEGG) with `ggplot2` from standardized outputs.
- **QC & troubleshooting**: Provide checks for genome mismatch, chromosome naming, and low-signal inputs.

## Inputs & Outputs

### Inputs (choose one):
#### Option 1: Input is a genomic region file (BED/narrowPeak/broadPeak)
Genomic region formats supported:
- **BED files**: Standard genomic interval format
- **narrowPeak**: narrow peak format
- **broadPeak**: broad peak format

#### Option 2: Input is a gene list (txt)
- `gene_list.txt` with one official gene symbol per line (no header). And an optional `gene_list_background.txt` with one official gene symbol per line (no header).

### Outputs (directory layout):
```bash
${sample}_functional_enrichment/
    results/
      ${sample}.anno_genomic_features.txt
      ${sample}.anno_genomic_features_stats.txt
      biological_process.txt
      cellular_component.txt  
      molecular_function.txt  

      kegg.txt                
      biocyc.txt              
      chromosome.txt  
      cosmic.txt
      interactions.txt  
      interpro.txt
      gene3d.txt
      pathwayInteractionDB.txt
      pfam.txt
      prints.txt    
      prosite.txt   
      reactome.txt
      smpdb.txt
      wikipathways.txt

      gwas.txt          
      lipidmaps.txt           
      msigdb.txt                
      smart.txt

    tables/
      ${sample}.gene_list.txt
      go_bp.tsv
      go_mf.tsv
      go_cc.tsv
      kegg.tsv
    logs/
      ${sample}.anno_genomic_features.log # if genome region file is provided
      findGO.log
```


## Decision Tree


### Step 0 — Gather Required Information from the User

Before calling any tool, **ask the user**:

1. Sample name (`sample`): used as prefix and for the output directory `${sample}_functional_enrichment`.
2. Genome assembly (`genome`): e.g. `hg38`, `mm10`, `danRer11`.  
   - **Never** guess or auto-detect.

---

### Step 1: Initialize Project

1. Make director for this project:

Call:
- `mcp__project-init-tools__project_init`

with:
- `sample`: the user-provided sample name
- `task`: de_novo_motif_discovery

The tool will:
- Create `${sample}_functional_enrichment` directory.
- Get the full path of the `${sample}_functional_enrichment` directory, which will be used as `${proj_dir}`.

---


### Step 2: Prepare genome file for homer

Call:
- `mcp__homer-tools__check_genome_installation`

With:
- `genome`: the user-provided genome assembly, e.g. `hg38`, `mm10`, `danRer11`

The tool will:
- Check if the genome is installed in HOMER.
- If not, install the genome.


---


### Step 3 (Optional): Standardize chromosome names for BED files

This step is optional. Only perform this step if the input file is a BED file. If the input file is a gene list, skip this step.

From `1` format to `chr1` format
From `MT` format to `chrM` format

Call:
- `mcp__file-format-tools__standardize_bed_chrom_names`

with:
- `input_bed`: the user-provided BED file
- `output_bed`: the path to save the standardized BED file

The tool will:
- Standardize the chromosome names in the BED file.
- Return the path of the standardized BED file.


---

### Step 4 (Optional): Convert gene ID to gene symbol

This step is optional. Only perform this step if the input file is a gene list file. If the input file is a BED file, skip this step.

Call:
- `mcp__mygene-tools__convert_gene_ids_mygene`

With:
- `input_ids_file`: the user-provided gene list file. May end with `.txt`.
- `scopes`: the source ID type for mygene (e.g., 'ensembl.gene', 'symbol', 'entrezgene', 'uniprot', or a comma-separated list).
- `fields`: the comma-separated target fields to retrieve from mygene (e.g., 'symbol,ensembl.gene,uniprot,entrezgene').
- `species`: the species for mygene (e.g., 'human', 'mouse', 'zebrafish', or NCBI taxon ID like '9606').
- `out_file`: the path to save the converted gene list file. In this skill, it is the full path of the `${sample}_functional_enrichment` directory returned by `mcp__project-init-tools__project_init`
- `batch_size`: the batch size for mygene.querymany (default 1000).

The tool will:
- Convert the gene ID to gene symbol.
- Return the path of the converted gene list file.

---


### Step 5: GO enrichment analysis

#### Option 1: from genomic regions file

Only if the input file is a BED file. If the input file is a gene list, call tools in Option 2.

1. annotate the genomic regions using Homer's `annotatePeaks.pl` with `-go` option. If user also provides a background genome region file, like a control peak file, also call this tool for the background genome region file. Use a different `${sample}` as the sample name for the background sample.

Call:
`mcp__homer-tools__annotate_genomic_features`

With:
- `sample`: the user-provided sample name
- `proj_dir`: directory to save the genomic feature annotation results. In this skill, it is the full path of the `${sample}_functional_enrichment` directory returned by `mcp__project-init-tools__project_init`
- `regions_bed`: the user-provided regions file in BED format. May end with `.bed`, `.narrowPeak`, `.broadPeak`, etc.
- `genome`: the user-provided genome assembly, e.g. `hg38`, `mm10`, `danRer11`
- `ann`: "custom homer annotation file (created by assignGenomeAnnotation.pl), (default: None).
- `size_given`: keep original region sizes (default: True)
- `cpg`: include CpG information (default: False)
- `go`: `True` to perform GO enrichment analysis.

The tool will:
- Annotate the genomic regions using Homer's `annotatePeaks.pl`.
- Return the path of the annotated regions file under `${proj_dir}/results/` directory, and the path to the log file under `${proj_dir}/logs/` directory.
    - `${proj_dir}/results/${sample}.anno_genomic_features.txt`
    - `${proj_dir}/results/${sample}.anno_genomic_features_stats.txt`
    - `${proj_dir}/logs/${sample}.anno_genomic_features.log`

---

2. (optional) extract the genes from the annotated regions file if neccessary for future analysis or the target gene list is requested by user. If not requested, skip this step.

Call:
`mcp__file-format-tools__extract_gene_list`

With:
- `sample`: the user-provided sample name
- `proj_dir`: directory to save the genomic feature annotation results. In this skill, it is the full path of the `${sample}_functional_enrichment` directory returned by `mcp__project-init-tools__project_init`

The tool will:
- Extract the genes from the annotated regions file.
- Return the path of the gene list file under `${proj_dir}/tables/` directory.
    - `${proj_dir}/tables/${sample}.gene_list.txt`


---

#### Option 2: from gene list file

Only if the input file is a gene list file. If the input file is a BED file, call tools in Option 1.

Call:
`mcp__homer-tools__gene_function_enrichment`

With:
- `sample`: the user-provided sample name
- `proj_dir`: directory to save the GO & KEGG enrichment results. In this skill, it is the full path of the `${sample}_functional_enrichment` directory returned by `mcp__project-init-tools__project_init`
- `gene_list_file`: the user-provided gene list file. May end with `.txt`.
- `organism`: the user-provided organism name, e.g. `human`, `mouse`, `zebrafish`, etc.
- `background_gene_list_file`: the user-provided background gene list file. May end with `.txt`. If not provided, set this parameter to `None`.

The tool will:
- Find the GO enrichment for the gene list.
- Return the path of the GO & KEGG enrichment results under `${proj_dir}/results/` directory.
    - `${proj_dir}/results/biological_process.txt`
    - `${proj_dir}/results/kegg.txt`
    - ... other GO and KEGG enrichment results files.
- Return the path of the log file under `${proj_dir}/logs/` directory.
    - `${proj_dir}/logs/${sample}.find_go_and_kegg_enrichment.log`


---


> **Alternative direct from BED**  
> `annotatePeaks.pl peaks.bed hg38 -go results/{run}/tables/go_dir -genomeOntology`  
> `annotatePeaks.pl peaks.bed hg38 -kegg results/{run}/tables/kegg_dir`


## Notes & Best Practices

- **Genome & naming**: Ensure the HOMER genome key matches the species; chromosome naming must be consistent (`chr1` vs `1`).
- **BED format**: Tab-delimited, ≥3 columns, 0-based coordinates, no header.
- **Multiple testing**: Prefer FDR (BH) if provided; otherwise fallback to P-value.
- **Background set**: `-bg` helps reduce bias; choose a reasonable universe (e.g., all expressed or all accessible regions → genes).
- **Direct-from-BED**: `annotatePeaks.pl -go/-kegg` is convenient; the gene-list route yields uniform TSVs for plotting.

## Troubleshooting

- **Many NAs after annotation**: Check genome version, chromosome naming, BED formatting, and headers.
- **Empty/weak enrichment**: Ensure sufficient genes (suggest ≥50), verify species of symbols, tune thresholds or background.
- **Column name drift**: HOMER versions may differ; adjust R column mappings if needed.