--- title: "Statistics with R" subtitle: "Gene set comparison and enrichment analysis" author: "Jacques van Helden & Olivier Sand" date: "`r Sys.Date()`" output: slidy_presentation: highlight: default incremental: no smart: no slide_level: 1 self_contained: no fig_caption: no fig_height: 5 fig_width: 5 keep_md: yes smaller: yes theme: cerulean toc: yes widescreen: yes revealjs::revealjs_presentation: theme: night transition: none self_contained: true slide_level: 1 css: ../slides.css pdf_document: fig_caption: yes highlight: zenburn toc: yes toc_depth: 3 ioslides_presentation: highlight: zenburn incremental: no html_document: fig_caption: yes highlight: zenburn self_contained: no theme: cerulean toc: yes toc_depth: 3 toc_float: yes beamer_presentation: theme: Hannover #Montpellier colortheme: beaver fonttheme: professionalfonts #structurebold highlight: pygments #default fig_caption: no fig_height: 4 fig_width: 5 incremental: no keep_tex: no slide_level: 1 toc: yes font-import: http://fonts.googleapis.com/css?family=Risque font-family: Garamond transition: linear editor_options: chunk_output_type: console --- ```{r settings, include=FALSE, echo=FALSE, eval=TRUE} ## Install required packages packages <- c("knitr", "kableExtra", "stringr", "limma", "gprofiler2") for (pkg in packages) { if (!require(pkg, character.only = TRUE)) { install.packages(pkg) require(pkg, character.only = TRUE) } } options(width = 300) # options(encoding = 'UTF-8') knitr::opts_chunk$set( fig.width = 5, fig.height = 5, fig.path = 'images/enrichment_', fig.align = "center", size = "tiny", echo = TRUE, eval = TRUE, warning = FALSE, message = FALSE, results = TRUE, comment = "") options(scipen = 12) ## Max number of digits for non-scientific notation ``` # Two main approches - Gene set comparison or over-representation analysis (ORA) - Input: a set of functionally related genes - Reference: a database of annotated gene functions (GO, pathways, TF targets…) - Approach: evaluate the significance of the intersection (over-represented?) - Stat: hypergeometric test - Gene Set Enrichment analysis - Input: a sorted list of genes - Reference: a database of annotated gene functions (GO, pathways, TF targets…) - Approach: evaluate the significance of the rank of the genes belonging to a reference class in the ordered list. - Stat: enrichment scores (alternative) # Gene set comparison (over-representation of the intersection) - A given organism has 6000 genes, 40 of which are involved in methionine metabolism. - A set of 10 genes were reported as co-expressed in an RNAseq experiment. Among them, 6 are related to methionine metabolism. - How significant is this observation ? More precisely, what would be the probability to observe such a correspondence by chance alone ? ```{r venn_diagram, include=FALSE, results=TRUE, echo=FALSE, eval=TRUE, out.width="50%"} #### Venn diagram #### # Draw a venn diagram indicating the number of genes in a # co-expression cluster that are involved in methionine metobolism VennDiagram::venn.diagram( list(Genome = 1:6000, Methionine_metabolism = 1:40, coexpression_cluster = 35:44), height = 400, width = 700, main.cex = 0.35, cex=0.3, cat.cex = 0.25, lwd = c(0.5,0.5,0.5), cat.pos = c(-10,-45,10), filename = "images/venn_yeast_methionine.png", main = "Genes in co-expressed cluster vs methionine metabolism", imagetype = "png", col = c("darkblue", "darkgreen", "darkred"), cat.col = c("darkblue", "darkgreen", "darkred")) #include_graphics(path = "images/venn_yeast_methionine.png") ``` ![Venn diagram](images/venn_yeast_methionine.png) # The hypergeometric test | Symbol | Meaning | |:--------|:---------------------------| | $g = 6000$ | number of genes | | $m = 40$ | genes involved in methionine metabolism | | $n = 5960$ | genes not involved in methionine metabolism | | $k = 10$ | number of genes in the cluster | | $x = 6$ | number of methionine genes in the cluster | - We calculate the number of possibilities for the following selections | Symbol | Meaning | Formula | |:------|:-----------------|:--------------------------| | $C_1$ | choose 10 distinct genes among 6000 | $C_1 = C_{m+n}^{k} = \frac{6000!}{10!5990!} = 1.65e^{31}$ | | $C_2$ | choose 6 distinct genes among the 40 involved in methionine | $C_2 = C_{m}^{x} = \frac{40!}{6!34!} = 3.8e^{6}$ | | $C_3$ | choose 4 genes among the 5960 which are not involved in methionine |$C_3 = C_{n}^{k-x} = \frac{5960!}{4!5956!} = 5.2e^{13}$ | | $C_4$ | choose 6 methionine and 4 non-methionine genes | $C_4 = C2 \cdot C3 = C_{m}^{x}C_{n}^{k-x} = 2.0e^{20}$ | # The hypergeometric test - probabilities - Probability to have exactly 6 methionine genes within a selection of 10 $$P(X=6) = \frac{C4}{C1} = \frac{C_{m}^{x}C_{n}^{k-x}}{C_{m+n}^{k}} = \frac{C_{40}^{6}C_{5960}^{4}}{C_{6000}^{10}} = 1.219e^{-11}$$ - Probability to have at least 6 methionine genes within a selection of 10 $$P(X \ge 6) = \sum_{i=x}^{k}\frac{C_{m}^{i}C_{n}^{k-i}}{C_{m+n}^{k}} = 1.222e^{-11}$$ # ORA - precautions to take - Define your universe (background) - set with all the genes susceptible to be part of your analysis - Not so simple - all genes in genomic annotations ? - all genes with at least one annotation in the ontology you used ? - all coding genes ? - genes on a biochip ? - genes / proteins detected by an experimental approach (RNAseq, proteomics) - genes reachable by your approach (ex : miRNA targets, Godard et al., 2015) - Multiple-test corrections - correction choice (adjusted P-values : Bonferroni correction, Benjamini-Hochberg FDR...) - inter-test dependancies corrections (gSCS in gProfiler) # Practical - Annotating a group of differentially expressed genes (DEG) - tool: g:GOSt from gProfiler - documentation: - Goal: - detect functions (biological process, pathways, regulation...) associated with the set of DEG - interpret the resuls - What about a negative control ? ```{r out.width="50%", echo=FALSE} include_graphics("images/gProfiler_site.png") ``` # Practical - DEG - Organism: *Saccharomyces cerevisiae* - Data source: - The aim of this study is to compare NGS-derived yeast transcriptome profiling (RNA-seq) of wild-type and bdf1-Y187F-Y354F mutant strains after sporulation induction (time points: 0h 4h and 8h) - design: *S. cerevisiae* wild-type and bdf1-Y187F-Y354F mutant strains were collected 0h, 4h and 8h after sporulation induction in triplicates. mRNA were purified, prepared and sequenced using Illumina HiSeq 2000 sequencer - WT vs mutant at 0h: [bdf1_Y187F_Y354F_mutant_0__vs__Wild_type_0_DESeq2_positive_geneIDs.txt](https://raw.githubusercontent.com/DU-Bii/module-3-Stat-R/master/stat-R_2019/seance_5/results/bdf1_Y187F_Y354F_mutant_0__vs__Wild_type_0_DESeq2_positive_geneIDs.txt) - WT vs mutant at 4h: [bdf1_Y187F_Y354F_mutant_4__vs__Wild_type_4_DESeq2_positive_geneIDs.txt](https://raw.githubusercontent.com/DU-Bii/module-3-Stat-R/master/stat-R_2019/seance_5/results/bdf1_Y187F_Y354F_mutant_4__vs__Wild_type_4_DESeq2_positive_geneIDs.txt) # Practical - Results - t0: ```{r out.width="80%", echo=FALSE} include_graphics("images/gProfiler_results_t0.png") ``` # Practical - Results - t4: ```{r out.width="80%", echo=FALSE} include_graphics("images/gProfiler_results_t4.png") ``` # Practical - detailled results t0 ```{r out.width="50%", echo=FALSE} include_graphics("images/gProfiler_detailled_results_t0_all.png") ``` # Practical - detailled results t4 ```{r out.width="80%", echo=FALSE} include_graphics("images/gProfiler_detailled_results_t4_all.png") ``` # gProfiler in R - Package: https://cran.r-project.org/web/packages/gprofiler2/index.html - Manual: https://cran.r-project.org/web/packages/gprofiler2/gprofiler2.pdf - Vignettes: https://cran.r-project.org/web/packages/gprofiler2/vignettes/gprofiler2.html ```{r outdirs, include=FALSE, echo=FALSE, eval=TRUE} #### Output directories #### outdirs <- list() outdirs$main <- getwd() ## Data directory, where the data will be downloaded and uncompressed outdirs$data <- file.path(outdirs$main, "data") #dir.create(outdirs$data, recursive = TRUE, showWarnings = FALSE) ## Main result directory outdirs$results <- file.path(outdirs$main, "results") dir.create(outdirs$data, recursive = TRUE, showWarnings = FALSE) ``` ```{r download_deg, include=FALSE, echo=FALSE, eval=TRUE} #### Download DEG data #### t4_geneIDsFile <- "bdf1_Y187F_Y354F_mutant_4__vs__Wild_type_4_DESeq2_positive_geneIDs.txt" t4URL <- file.path("https://raw.githubusercontent.com/DU-Bii/module-3-Stat-R/master/stat-R_2019/seance_5/results", t4_geneIDsFile) local_t4Data <- file.path(outdirs$data, t4_geneIDsFile) if (file.exists(local_t4Data)) { message("t4 data already downloaded:\n\t", local_t4Data) } else { message("Downloading t4 data from github: ", t4URL) download.file(url = t4URL, destfile = local_t4Data) } #### Download background data #### bkgdFile <- "bdf1_Y187F_Y354F_mutant_4__vs__Wild_type_4_DESeq2_all_genes.tsv" bkgdURL <- file.path("https://raw.githubusercontent.com/DU-Bii/module-3-Stat-R/master/stat-R_2019/seance_5/results", bkgdFile) local_bkgdData <- file.path(outdirs$data, bkgdFile) if (file.exists(local_bkgdData)) { message("background data already downloaded:\n\t", local_bkgdData) } else { message("Downloading background data from github: ", bkgdURL) download.file(url = bkgdURL, destfile = local_bkgdData) } ``` ```{r data_vectors, include=FALSE, echo=FALSE, eval=TRUE} #### Data in vectors #### t4DEG <- scan(file = local_t4Data, # puts file content in a vector what = "character" ) bkgd_df <- read.table(local_bkgdData, header = TRUE ) bkgd <- as.vector(bkgd_df$ID) # puts column ID of dataframe in a vector ``` ```{r gost, include=FALSE, echo=FALSE, eval=TRUE} #### GOSt #### gostres <- gost( t4DEG, organism = "scerevisiae", ordered_query = FALSE, multi_query = FALSE, significant = TRUE, exclude_iea = FALSE, measure_underrepresentation = FALSE, evcodes = FALSE, user_threshold = 0.05, correction_method = "fdr", domain_scope = "annotated", custom_bg = NULL, numeric_ns = "", sources = NULL, #sources = c("GO:BP", "KEGG", "REAC", "WP"), as_short_link = FALSE ) ``` ```{r gost_custom, include=FALSE, echo=FALSE, eval=TRUE} #### GOSt with custom background #### gostres2 <- gost( t4DEG, organism = "scerevisiae", ordered_query = FALSE, multi_query = FALSE, significant = TRUE, exclude_iea = FALSE, measure_underrepresentation = FALSE, evcodes = FALSE, user_threshold = 0.05, correction_method = "fdr", domain_scope = "custom_annotated", custom_bg = bkgd, numeric_ns = "", sources = NULL, as_short_link = FALSE ) ``` ```{r gostplot, include=FALSE, echo=FALSE, eval=TRUE, fig.width=8, fig.height=6, out.width="80%", fig.cap = "blab bla"} #### Plot the results #### png(filename = "images/enrichment_gostplot.png") gostplot(gostres, capped = FALSE, interactive = TRUE) dev.off() ## Include the saved file in the knitr-generated report #include_graphics(path = "images/enrichment_gostplot.png", auto_pdf = TRUE) png(filename = "images/enrichment_gostplot_2.png") gostplot(gostres2, capped = FALSE, interactive = TRUE) dev.off() ## Include the saved file in the knitr-generated report #include_graphics(path = "images/enrichment_gostplot_2.png", auto_pdf = TRUE) ``` # Gene Set Enrichment Analysis - GSEA - Broad Institute - since 2006 - determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states - - MSigDB (The Molecular Signatures Database) : collection of annotated gene sets (https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) - R package : ![GSEA](images/GSEA-homegraphic.gif) # GSEA principle - All genes are sorted according to some criterion (e.g. differential expression p-value, correlation of expression with other variables, …). - Each graph compares the ranked gene list with one reference class (e.g. one biological process). - Black bars denote genes belonging to the reference class. - The green curve estimates, at each level i, the degree of over-representation of the reference genes in the i top-ranking genes. ![GSEA screenshot](images/GSEA_screenshot_1.png)