---
name: tooluniverse-gene-enrichment
description: Gene-set enrichment analysis — GO (Biological Process, Molecular Function, Cellular Component), KEGG, Reactome pathway enrichment via clusterProfiler, gseapy, ORA, GSEA. Use for interpreting DEG lists, screen hit lists, or any gene-list-to-pathways query. Includes simplify-cutoff handling and union-vs-total denominator conventions for percent-DE questions.
disable-model-invocation: true
---

## COMPUTE, DON'T DESCRIBE

When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.

# Gene Enrichment and Pathway Analysis

## RULE ZERO — Check for pre-computed results FIRST

Before following any instruction below, scan the data folder for:

- `*_executed.ipynb` → read with `tu run read_executed_notebook '{"data_folder":"","search":""}'` and cite its cell outputs as the authoritative answer
- Pre-computed enrichment files (CSV/TSV named `*enrich*`, `*go*`, `*kegg*`, `*reactome*`, `*ego*`, `*_simplified.csv`) → read directly
- Canonical analysis scripts (`analysis.R`, `run_*.py`, `find_*.R`, `*.Rmd`) → execute as-is and read the output

Only follow this skill's re-analysis recipe below if **none** of the above exist. Re-running enrichment from raw DEG lists produces different numbers than the published answer due to subtle filter differences upstream, and is much slower.

---

## PRIMARY SCRIPTS — use these FIRST

Three deterministic CLI scripts cover the bulk of enrichment questions. Each handles edge cases (ties at top, simplify-changes-padj, multi-condition screening) that the agent tends to get wrong when writing ad-hoc code. **Always write outputs to `/tmp/...` — never into the data folder.**

### 1. `scripts/gseapy_enrichment_runner.py` — gseapy enrichr / prerank

**When to use**: the question references `gseapy`, `enrichr`, "Enrichr library", or any GO BP/MF/CC, KEGG, Reactome, WikiPathways, MSigDB enrichment via the gseapy package.

```bash
python skills/tooluniverse-gene-enrichment/scripts/gseapy_enrichment_runner.py \
  --gene-list /tmp/sig_symbols.txt \
  --library GO_Biological_Process_2021,Reactome_2022 \
  --organism Human \
  --top 5 \
  --candidate "negative regulation of epithelial cell proliferation" \
  --workdir /tmp/gseapy_run
```

What it reports (parseable lines):

- `# TOP_BY_ADJ_PVALUE: ` — what `df.sort_values('Adjusted P-value').iloc[0]` returns (this is what published notebooks usually print)
- `# TIES_AT_TOP: n=K` — number of terms tied at the lowest Adjusted P-value
- `# TOP_TIE_BROKEN: ` — deterministic tie-break (adj_p, raw_p, overlap desc, alphabetic)
- `# TOPN_BY_ADJ_PVALUE:` — full top N listing
- `# CANDIDATE_RANK '': rank=R adj_p=...` — for any `--candidate` substring you pass
- `# SUBSTRING_COUNT_TOPN '': K` — for `--count-substring` queries (e.g., "how many top-20 terms contain 'Oxidative'")

Pass `--mode prerank --ranked-list /tmp/lfc.tsv` for GSEA preranked.

### 2. `scripts/enrichgo_runner.py` — clusterProfiler::enrichGO + simplify

**When to use**: the question references `enrichGO`, `clusterProfiler`, `simplify`, `simplify(cutoff=0.7)`, or the data folder contains an `analysis.R` / `find_*.R` that uses these. This is the canonical R workflow — gseapy does NOT reproduce it faithfully because `simplify` changes the multiple-testing denominator and thus the p.adjust values for surviving terms.
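The data-folder check described above can be sketched as follows. This is a minimal illustration, not part of the skill's scripts: the helper name `r_scripts_using_enrichgo`, the constants, and the regex are all hypothetical, and only the two filename patterns named above (`analysis.R`, `find_*.R`) are assumed.

```python
import re
from pathlib import Path

# Hypothetical helper for illustration; not shipped with this skill.
R_PATTERNS = ("analysis.R", "find_*.R")
# Matches calls such as enrichGO(...) or clusterProfiler::simplify(...).
CALL_RE = re.compile(r"\b(enrichGO|simplify)\s*\(")


def r_scripts_using_enrichgo(data_folder):
    """Return {script name: sorted list of matched calls} for R scripts
    in data_folder that call enrichGO and/or simplify."""
    hits = {}
    folder = Path(data_folder)
    for pattern in R_PATTERNS:
        for path in sorted(folder.glob(pattern)):
            text = path.read_text(errors="ignore")
            calls = sorted(set(CALL_RE.findall(text)))
            if calls:
                hits[path.name] = calls
    return hits
```

A non-empty result suggests the enrichGO runner is the right choice. If `simplify` appears among the calls, that is also a hint to read the simplified frame by default.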
```bash
python skills/tooluniverse-gene-enrichment/scripts/enrichgo_runner.py \
  --gene-list /tmp/sig_ensembl.txt \
  --background /tmp/bg_ensembl.txt \
  --keytype ENSEMBL \
  --ontology BP \
  --simplify-cutoff 0.7 \
  --candidate "regulation of T cell activation" \
  --candidate "potassium ion transmembrane transport" \
  --workdir /tmp/enrichgo_run
```

What it reports:

- `# TOP10_RAW:` — top 10 from `as.data.frame(ego)` (BEFORE simplify; raw p.adjust)
- `# TOP10_SIMPLIFIED:` — top 10 from `as.data.frame(simplify(ego, cutoff=0.7))` (AFTER simplify; p.adjust differs)
- `# CANDIDATE '': raw_rank=R raw_padj=... simp_rank=R simp_padj=...` — both pre- and post-simplify ranks for each candidate. `simp_rank=NA (collapsed by simplify)` means the term was redundant with a more-significant parent/sibling and was dropped.

When a question says "in the simplified results" or "after simplify", read **simp_padj**. When it just says "the most enriched" without mentioning simplify, default to the simplified frame anyway IF the canonical `analysis.R` calls `simplify`.

Requires R packages `clusterProfiler`, `org.Hs.eg.db` (or `org.Mm.eg.db` for mouse). Install via `Rscript skills/evals/install_r_packages.R` if missing.

### 3. `scripts/condition_enrichment_screen.py` — per-condition enrichment

**When to use**: the question asks "what fraction/percentage of conditions/screens/timepoints/groups had significant enrichment of ", or you have an N-by-many gene table and need per-condition enrichment.
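To make the target quantity concrete, here is a minimal sketch of the fraction such questions ask for. It is not the script's implementation. The function name `summarize_conditions` is hypothetical, keyword matching is assumed to be a case-insensitive substring test, and the denominator is assumed to be every condition passed in (including those with no significant terms).

```python
def summarize_conditions(sig_terms_by_condition, keywords):
    """Summarize per-condition enrichment.

    sig_terms_by_condition: dict mapping condition name -> list of term
        names that passed the significance cutoff for that condition.
    keywords: substrings defining the category of interest.
    Assumption: the denominator is the total number of conditions given.
    """
    n = len(sig_terms_by_condition)
    kws = [k.lower() for k in keywords]

    # Conditions with at least one significant term of any kind.
    n_any = sum(1 for terms in sig_terms_by_condition.values() if terms)

    # Conditions with at least one significant term matching a keyword
    # (assumed case-insensitive substring match).
    n_kw = sum(
        1
        for terms in sig_terms_by_condition.values()
        if any(k in t.lower() for t in terms for k in kws)
    )

    def pct(count):
        return 100.0 * count / n if n else 0.0

    return {
        "n_conditions": n,
        "n_with_any_sig": n_any,
        "pct_with_any_sig": pct(n_any),
        "n_with_keyword_sig": n_kw,
        "pct_with_keyword_sig": pct(n_kw),
    }


# Made-up example data, for illustration only.
example = {
    "acute": ["Interferon Gamma Response", "Cell Cycle"],
    "round1": [],
    "round2": ["Cytokine Signaling"],
    "round3": ["DNA Repair"],
}
print(summarize_conditions(example, ["immune", "cytokine", "interferon"]))
```

Whether excluded or empty conditions count toward the denominator changes the percentage, so state the convention used when reporting the answer.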
```bash
# Per-condition gene-list files:
python skills/tooluniverse-gene-enrichment/scripts/condition_enrichment_screen.py \
  --condition-genes acute=/tmp/acute_sig.txt \
  --condition-genes round1=/tmp/r1_sig.txt \
  --condition-genes round2=/tmp/r2_sig.txt \
  --condition-genes round3=/tmp/r3_sig.txt \
  --library /path/to/local_pathways.gmt \
  --background /tmp/expressed.txt \
  --keyword immune --keyword cytokine --keyword interferon \
  --workdir /tmp/cond_screen
```

Or pass a single 2-col TSV (`conditiongene`) via `--conditions-tsv`.

What it reports:

- Per condition: `n_genes`, `sig_terms` (Adj P < cutoff), `sig_terms_keyword` (sig terms whose Term contains any --keyword)
- `# n_with_any_sig=N pct_with_any_sig=N%` — the fraction with any significant term
- `# n_with_keyword_sig=N pct_with_keyword_sig=N%` — the fraction whose sig terms include a category keyword

Notes:

- The `--library` can be either an Enrichr library name (online) or a path to a local `.gmt` file. **Prefer the local GMT if the data folder ships one** (avoids rate-limits and exactly reproduces published results).
- Use `--exclude-condition