---
title: Functional Enrichment Analysis
author: "First/last name (first.last@ucr.edu)"
date: "Last update: `r format(Sys.time(), '%d %B, %Y')`"
output:
html_document:
toc: true
toc_float:
collapsed: true
smooth_scroll: true
toc_depth: 3
fig_caption: yes
code_folding: show
number_sections: true
fontsize: 14pt
bibliography: bibtex.bib
weight: 12
type: docs
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```
```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE}
suppressPackageStartupMessages({
library(ggplot2)
library(fgsea)
})
```
Source code downloads:
[.Rmd](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/rfea/rfea.Rmd)
[.R](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/rfea/rfea.R)
## Introduction
The following introduces gene and protein annotation systems that are widely
used for functional enrichment analysis (FEA). These include among many other
annotation systems: Gene Ontology (GO), Disease Ontology (DO) and pathway
annotations, such as KEGG and Reactome. Examples of widely used statistical
enrichment methods are introduced as well. These statistical FEA methods assess
whether functional annotation terms are over-represented in a query gene set.
In case of so called over-represention analysis (ORA) methods, such as Fisher's
exact and hypergeometric distribution tests, the query is usually a list of
unranked gene identifiers [@Falcon2007-eb]. In contrast to this, Gene Set
Enrichment Analysis (GSEA) algorithms use as query a score ranked list (_e.g._
all genes profiled by an assay) and assess whether annotation categories are
more highly enriched among the highest ranking genes compared to random
rankings [@Subramanian2005-kx; @Sergushichev2016-ms; @Duan2020-wz]. The sets in
both the query and the annotation databases can be composed of genes, proteins,
compounds or other factors. For simplicity, the term gene sets is used
throughtout this text.
## Functional Annotations Systems
This section introduces a small selection of functional annotation systems, largely
provided by Bioconductor packages. This includes code to inspect how the annotations
are organized and how to access them.
### Required packages
The code sections in this tutorial make use of the following R/Bioc packages: `c("GOstats", "GO.db", "org.At.tair.db", "KEGG.db", "reactome.db", "systemPipeR", "biomaRt", "fgsea", "data.table", "ggplot2")`. If they are not available on a system, then they need to be installed with [`BiocManager::install()`](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/rbasics/rbasics/#installation-of-r-rstudio-and-r-packages) first.
## Gene Ontology DB
`GO.db` is a data package that stores the GO term information from the GO
consortium in an SQLite database. Several accessor functions are provided to
query the database. Organism specific gene to GO annotations are provied by
organism data packages and/or Bioconductor's
[`AnntationHub`](https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html).
The following provide sample code for using `GO.db` as well as a organism
database example.
```{r godb, eval=FALSE, warning=FALSE, message=FALSE}
## Load GOstats library
library(GOstats); library(GO.db)
## Print complete GO term information for "GO:0003700"
GOTERM$"GO:0003700"
## Print parent and children terms for a GO ID
GOMFPARENTS$"GO:0003700"; GOMFCHILDREN$"GO:0003700"
## Print complete lineages of parents and children for a GO ID
GOMFANCESTOR$"GO:0003700"; GOMFOFFSPRING$"GO:0003700"
## Print number of GO terms in each of the 3 ontologies
zz <- eapply(GOTERM, function(x) x@Ontology); table(unlist(zz))
## Gene to GO mappings for an organism (here Arabidopsis)
library(org.At.tair.db) # For human use org.Hs.eg.db
xx <- as.list(org.At.tairGO2ALLTAIRS)
```
## Pathway DBs
### KEGG
#### `KEGG.db`
The following `load_keggList` function returns the pathway annotations from the `KEGG.db` package for a species selected
under the `org` argument (_e.g._ hsa, ath, dme, mmu, ...). The resulting `list` object can be used
for ORA or GSEA methods, _e.g._ by `fgsea`.
```{r keggdb, eval=FALSE, warning=FALSE, message=FALSE}
## Define function to create KEGG pathway list db
load_keggList <- function(org="ath") {
suppressMessages(suppressWarnings(library(KEGG.db)))
kegg_gene_list <- as.list(KEGGPATHID2EXTID) # All organisms in kegg
kegg_gene_list <- kegg_gene_list[grepl(org, names(kegg_gene_list))] # Only human
kegg_name_list <- unlist(as.list(KEGGPATHID2NAME)) # All organisms in kegg
kegg_name_list <- kegg_name_list[gsub(paste0("^", org), "", names(kegg_gene_list))]
names(kegg_gene_list) <- paste0(names(kegg_gene_list), " (", names(kegg_name_list), ") - ", kegg_name_list)
return(kegg_gene_list)
}
## Usage:
keggdb <- load_keggList(org="ath") # org can be: hsa, ath, dme, mmu, ...
```
Additional packages for KEGG pathways:
* [pathview](https://bioconductor.org/packages/release/bioc/vignettes/pathview/inst/doc/pathview.pdf): plotting pathways with quantitative information embedded
* [KEGGREST](https://bioconductor.org/packages/release/bioc/html/KEGGREST.html): access via KEGG REST API
* Many additional packages can be found under Bioc's KEGG View page [here](https://bioconductor.org/packages/release/BiocViews.html#___KEGG)
### Reactome
#### `reactome.db`
The following `load_reacList` function returns the pathway annotations from the `reactome.db`
package for a species selected under the `org` argument (_e.g._ R-HSA, R-MMU, R-DME, R-CEL, ...).
The resulting `list` object can be used for various ORA or GSEA methods, _e.g._ by `fgsea`.
```{r reactomedb, eval=FALSE, warning=FALSE, message=FALSE}
## Define function to create Reactome pathway list db
load_reacList <- function(org="R-HSA") {
library(reactome.db)
reac_gene_list <- as.list(reactomePATHID2EXTID) # All organisms in reactome
reac_gene_list <- reac_gene_list[grepl(org, names(reac_gene_list))] # Only human
reac_name_list <- unlist(as.list(reactomePATHID2NAME)) # All organisms in reactome
reac_name_list <- reac_name_list[names(reac_gene_list)]
names(reac_gene_list) <- paste0(names(reac_gene_list), " (", names(reac_name_list), ") - ", gsub("^.*: ", "", reac_name_list))
return(reac_gene_list)
}
## Usage:
reacdb <- load_reacList(org="R-HSA")
```
A very useful query interface for Reactome is the `ReactomeContentService4R` package.
Its vignette provides many useful examples, see [here](https://bioconductor.org/packages/devel/bioc/vignettes/ReactomeContentService4R/inst/doc/ReactomeContentService4R.html).
A sample plot from `ReactomeContentService4R` is shown below. For metabolite (set) enrichment analysis (MEA/MSEA) users might also be interested in the
[MetaboAnalystR](https://github.com/xia-lab/MetaboAnalystR) package that interfaces with the [MataboAnalyst](https://www.metaboanalyst.ca/home.xhtml) web service.
![](../results/reactome_bigpicture.jpeg)
Figure 1: Fireworks plot depicting genome-wide view of reactome pathways.
## Functional Enrichment Analysis Methods
### Over-representation analysis (ORA)
#### `GOstats` Package
The `GOstats` package allows testing for both over and under representation of GO terms using
either the standard Hypergeometric test or a conditional Hypergeometric test that uses the
relationships among the GO terms for conditioning [@Falcon2007-eb].
```{r gostats, eval=FALSE, warning=FALSE, message=FALSE}
## Load required packages
library(GOstats); library(GO.db); library(org.At.tair.db)
## Define universe and test sample set
geneUniverse <- keys(org.At.tairGENENAME)
geneSample <- c("AT2G46210", "AT2G19880", "AT2G38910", "AT5G25140", "AT2G44525")
## Generate params object
params <- new("GOHyperGParams", geneIds = geneSample,
universeGeneIds = geneUniverse,
annotation="org.At.tair", ontology = "MF", pvalueCutoff = 0.5,
conditional = FALSE, testDirection = "over")
## Run enrichment test
hgOver <- hyperGTest(params)
## Viewing of results
summary(hgOver)[1:4,]
htmlReport(hgOver, file = "MyhyperGresult.html") # html file will be written to current working directory
```
#### `GOHyperGAll` and `GOCluster_Report`
The following introduceds a `GOCluster_Report` convenience function from the
`systemPipeR` package. The first part shows how to generate the proper `catdb`
lookup data structure for any organism supported by BioMart [@H_Backman2016-xk].
This more time consuming step needs to be performed only once.
```{r gohypergall_catdb, eval=FALSE, warning=FALSE, message=FALSE}
## Create a custom genome-to-GO lookup table for enrichment testing
library(systemPipeR); library(biomaRt)
listMarts() # To choose BioMart database
listMarts(host = "plants.ensembl.org")
## Obtain annotations from BioMart
listMarts() # To choose BioMart database
m <- useMart("plants_mart", host = "plants.ensembl.org")
listDatasets(m)
m <- useMart("plants_mart", dataset = "athaliana_eg_gene", host = "plants.ensembl.org")
listAttributes(m) # Choose data types you want to download
go <- getBM(attributes = c("go_id", "tair_locus", "namespace_1003"), mart = m)
go <- go[go[, 3] != "", ]; go[, 3] <- as.character(go[, 3])
go[go[, 3] == "molecular_function", 3] <- "F"; go[go[, 3] == "biological_process", 3] <- "P"; go[go[, 3] == "cellular_component", 3] <- "C"
go[1:4, ]
dir.create("./GO")
write.table(go, "GO/GOannotationsBiomart_mod.txt", quote = FALSE, row.names = FALSE, col.names = FALSE, sep = "\t")
catdb <- makeCATdb(myfile = "GO/GOannotationsBiomart_mod.txt", lib = NULL, org = "", colno = c(1, 2, 3), idconv = NULL)
save(catdb, file="GO/catdb.RData")
```
For the actual enrichment analysis one can load the `catdb` object from the
corresponding file, and then perform batch GO term analysis where the results
include all terms meeting a user-provided P-value cutoff as well as GO Slim
terms.
```{r gohypergall, eval=FALSE, warning=FALSE, message=FALSE}
## Next time catDB can be loaded from file
load("GO/catdb.RData")
## Perform enrichment test on single gene set
geneids <- unique(as.character(catmap(catdb)$D_MF[,"GeneID"]))
gene_set_list <- sapply(c("Set1", "Set2", "Set3"), function(x) sample(geneids, 100), simplify=FALSE)
GOHyperGAll(catdb=catdb, gocat="MF", sample=gene_set_list[[1]], Nannot=2)[1:20,]
## Batch analysis of many gene sets for all and slim terms
goall <- GOCluster_Report(catdb=catdb, setlist=gene_set_list, method="all", id_type="gene", CLSZ=2, cutoff=0.01, gocats=c("MF", "BP", "CC"), recordSpecGO = NULL)
## GO Slim analysis by subsetting enrichment results accordingly
m <- useMart("plants_mart", dataset = "athaliana_eg_gene", host = "plants.ensembl.org")
goslimvec <- as.character(getBM(attributes = c("goslim_goa_accession"), mart = m)[, 1])
goslim <- GOCluster_Report(catdb=catdb, setlist=gene_set_list, method="slim",id_type="gene", myslimv=goslimvec, CLSZ=2, cutoff=0.01, gocats = c("MF", "BP", "CC"), recordSpecGO = NULL)
## Plot 'GOBatchResult' as bar plot
goBarplot(goslim, gocat="MF")
```
![](../results/goslim.png)
Figure 2: Batch ORA result of GO slim terms using 3 test gene sets.
### Set enrichment analysis (SEA)
#### `fgsea` Package
The `fgsea` function performs gene set enrichment analysis (GSEA) on a score ranked
gene list [@Sergushichev2016-ms]. Compared to other GESA implementations, `fgsea` is very fast. Its P-value
estimation is based on an adaptive multi-level split Monte-Carlo scheme. In addition
to its speed, it is very flexible in adopting custom annotation systems since it
stores the gene-to-category annotations in a simple `list` object that is easy to create. The
following uses the `keegdb` and `reacdb` lists created above as annotation systems.
```{r fgsea, eval=FALSE, warning=FALSE, message=FALSE}
## Load packages and create sample ranked gene list
library(fgsea); library(data.table); library(ggplot2); library(org.At.tair.db)
set.seed(42)
## fgsea with KEGG (Arabidopsis)
geneids <- mappedkeys(org.At.tairCHR)
exampleRanks <- sort(setNames(sample(seq(-100,100, by=0.001), length(geneids)), geneids))
fgseaResKegg <- fgsea(pathways=keggdb, stats=exampleRanks, minSize=15, maxSize=500)
head(fgseaResKegg[order(pval), ])
plotEnrichment(keggdb[["ath00052 (00052) - Galactose metabolism"]], exampleRanks) + labs(title="Galactose metabolism")
## fgsea with Reactome (Human)
geneids <- unique(as.character(unlist(reacdb)))
exampleRanks <- sort(setNames(sample(seq(-100,100, by=0.001), length(geneids)), geneids))
fgseaResReac <- fgsea(pathways=reacdb, stats=exampleRanks, minSize=15, maxSize=500)
head(fgseaResReac[order(pval), ])
plotEnrichment(reacdb[["R-HSA-3247509 (R-HSA-3247509) - Chromatin modifying enzymes"]], exampleRanks) + labs(title="Chromatin modifying enzymes")
```
The `plotEnrichment` can be used to create enrichment plots. Additional examples are available
in the vignette of the `fgsea` package [here](https://bioconductor.org/packages/release/bioc/vignettes/fgsea/inst/doc/fgsea-tutorial.html).
![](../results/enrplot.png)
Figure 3: Enrichment plot for selected pathway.
## Useful Links
### Visualization of Enrichment Results
- [clusterProfiler](https://bit.ly/3NZqIZ6)
- [pathfindeR](https://bit.ly/42I6szt)
### Other
- ...
## Version Information
```{r sessionInfo}
sessionInfo()
```
## References