--- title: "Discovering a motif discovering appproach" author: "Jacques van Helden" date: '`r Sys.Date()`' output: html_document: code_folding: hide self_contained: no fig_caption: yes highlight: zenburn theme: cerulean toc: yes toc_depth: 3 toc_float: yes revealjs::revealjs_presentation: theme: night transition: none self_contained: true css: ../../slides.css slidy_presentation: smart: no slide_level: 2 self_contained: yes fig_caption: yes fig_height: 6 fig_width: 7 highlight: tango incremental: no keep_md: yes smaller: yes theme: cerulean toc: yes widescreen: yes pdf_document: fig_caption: yes highlight: zenburn toc: yes toc_depth: 3 ioslides_presentation: slide_level: 2 self_contained: no colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango smaller: yes toc: yes widescreen: yes beamer_presentation: colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango incremental: no keep_tex: no slide_level: 2 theme: Montpellier toc: yes font-import: http://fonts.googleapis.com/css?family=Risque subtitle: LCG_BEII 2019 font-family: Garamond transition: linear editor_options: chunk_output_type: console --- ```{r include=FALSE, echo=FALSE, eval=TRUE} library(knitr) library(kableExtra) # library(formattable) options(width = 300) # options(encoding = 'UTF-8') knitr::opts_chunk$set( fig.width = 7, fig.height = 5, fig.path = 'figures/R_intro_', fig.align = "center", size = "tiny", echo = TRUE, eval = TRUE, warning = FALSE, message = FALSE, results = TRUE, comment = "") options(scipen = 12) ## Max number of digits for non-scientific notation # knitr::asis_output("\\footnotesize") ``` ## Goal of the exercise The goal of this exercise is to get an intuition of a motif discovery approach relying on the detection of over-represented oligonucleotides. Our approach will be pragmatic. We retrieved the upstream non-coding sequences of the genes involved in methionine biosynthesis and sulfur assimilation, and counted the occurrences of each hexanucleotide. We also computed - the relative frequencies (occurrences of each oligo / sum of all oligo occurrences) in the sequence of interest (the promoters of methionine-associated genes) - the relative frequencies of ach hexanucleotide in the whole set of yeast promoters. We would like to know if some 6nt are over-reprsented in promoters of methionine-associated genes relative to the occurrences that would be expected from a random selection of yeast promoters. ## Create a workspace for this practical - In your home directory, create a work directory for this practical (for example `~/LCG_BEII/practical_motif_discovery/`). ```{r} rmdDir <- getwd() # Path of this Rmd fiile workdir <- "~/LCG_BEII/practical_motif_discovery" dir.create(workdir, showWarnings = FALSE, recursive = TRUE) setwd(workdir) ``` ## Loading the data table 1. Download the oligonucleotide count table. [Scerevisiae_MET-genes_oligos-6nt-2str-noov_occ-freq.tsv](http://jvanheld.github.io/LCG_BEII/practicals/motif_discovery/data/Scerevisiae_MET-genes_oligos-6nt-2str-noov_occ-freq.tsv) ```{r download_oligo_counts} oligo.url <- "http://jvanheld.github.io/LCG_BEII/practicals/motif_discovery/data/Scerevisiae_MET-genes_oligos-6nt-2str-noov_occ-freq.tsv" oligo.file <- basename(oligo.url) ## Suppress the URL path and keep only the file name for local storage download.file(oligo.url, destfile = oligo.file) ``` 2. In **R**, open a new script or R markdown file. 3. Load the data table, print the 5 top rows and the 5 bottom rows. ```{r load_oligos} oligo.table <- read.delim(oligo.file, header = 1, row.names = 1) # View(oligo.table) head(oligo.table, n = 5) tail(oligo.table, n = 5) ``` ## Exploring observed and expected counts 4. Draw an histogram of the observed occurrences and evaluate the spread of counts. ```{r obs_hist, out.width="80%", fig.width=7, fig.height=4} x <- oligo.table$occ range(x) max.x <- max(x) hist(x, breaks = 0:max.x, col = "palegreen", xlab = "Observed occurrences", ylab = "Nb of oligos", las = 1, main = "Distribution of oligonucelotide occurrences") ``` 5. Draw a scatter plot comparing the observed and expected occurrences for each hexanucleotide. ```{r obs_vs_exp, out.width="80%", fig.width=7, fig.height=4.5, fig.cap="**Scatter plot of observed versus expected occurrences.** The black diagonal corresponds to the null hypothesis, the brown line denotes an arbitrary threshold on fold-change > 2. "} exp.occ <- oligo.table$exp_occ plot(exp.occ, x, col = "grey", las = 1, xlab = "Expected occurrences", ylab = "Observed occurrences", main = "Observed vs expected occurrences") grid() abline(a = 0, b = 1, col = "black") abline(h = 0, col = "black") abline(v = 0, col = "black") abline(a = 0, b = 2, col = "brown") ``` 6. Compute the ratio between observed and expected occurrences, and draw the histogram of the ratio values, as well as a scatter plot with this ratio (Y) as a function of the expected occurrences (X). ```{r ratio, out.width="80%", fig.width=7, fig.height=4.5, fig.cap="**Observed/expected ratio. ** Top: histogram of ratio values. Bottom: ratio versus expected occurrences. The black line to the null hypothesis, the brown line denotes an arbitrary threshold on fold-change > 2. "} ratio <- (x/exp.occ) plot(exp.occ, ratio, col = "grey", las = 1, xlab = "Expected occurrences", ylab = "(obs/exp) ratio", main = "(obs/exp) ratio") grid() abline(h = 1, col = "black") abline(h = 2, col = "brown") ``` 6. Compute the log-ratio of observed / expected occurrences, and draw a scatter plot with this log-ratio (Y) as a function of the expected occurrences (X). ```{r lr, out.width="80%", fig.width=7, fig.height=4.5, fig.cap="**Scatter plot of observed versus expected occurrences.** The black diagonal corresponds to the null hypothesis, the brown line denotes an arbitrary threshold on fold-change > 2. "} lr <- log(x/exp.occ) oligo.table$lr <- lr plot(exp.occ, lr, col = "grey", las = 1, xlab = "Expected occurrences", ylab = "log(obs/exp)", main = "Log-ratio") grid() abline(h = 0, col = "black") abline(h = log(2), col = "brown") ``` $$lr = log(x/)$$ 7. Compute the log-likelihood ratio ($llr$), defined below, and draw a scatter plot with this $llr$ as a function of the expected occurrences. $$llr = f \cdot log(x/)$$ ```{r llr, out.width="80%", fig.width=7, fig.height=4.5, fig.cap="**Scatter plot of log-likelihood ratio (llr) versus expected occurrences.** The black line corresponds to the null hypothesis, the brown line denotes an arbitrary threshold on fold-change > 2. "} p <- oligo.table$exp_freq llr <- p * log(x/exp.occ) oligo.table$llr <- llr plot(exp.occ, llr, col = "grey", las = 1, xlab = "Expected occurrences", ylab = "llr", main = "Log-likelihood ratio") grid() abline(h = 0, col = "black") # abline(h = log(2), col = "brown") ``` ## Computing over-representation significance 8. Draw a binomial distribution with parameters $n = 8000$, $p = 0.0001$. ```{r} n <- 8000 p <- 0.001 x <- 15# Number of successes X <- 0:40 ## values to display plot(X, dbinom(x = X, size = n, prob = p), type = "h", col = "grey", ylab = "P(X = x)", las = 1, xlab = "X (nb of successes)") arrows(x, 0.04, x, 0.02, lwd = 2, length = 0.1, angle = 30, col = "red") tail <- x:40 lines(tail, dbinom(x = tail, size = n, prob = p), type = "h", col = "red") arrows(x, 0.04, x, 0.02, lwd = 2, length = 0.1, angle = 30, col = "red") pval <- pbinom(q = x - 1, size = n, prob = p, lower.tail = FALSE) legend("topright", legend = paste("pval =", signif(digits = 3, pval))) ``` 8. Use the binomial distribution to compute the P-value of the observed occurrences. $$P = T(X \ge x)$$ ```{r pval_computation} x <- oligo.table$occ ## Nuumber of successes n <- sum(x) ## Number of trials p <- oligo.table$exp_freq ## Success probability p <- p / sum(p) # A correction for the fact that we discarded self-overlapping occurrences # sum(p) nbTests <- length(x) # Number of tests ## Compute a P-value for each individual oligonucleotide oligo.table$pval <- pbinom(q = x - 1, size = n, prob = p, lower.tail = FALSE) ``` 9. Draw an histogram with the P-values of all hexanucleotides, with 20 bins. ```{r pval_histogram, out.width="80%", fig.width=7, fig.height=4.5, fig.cap="**Histogram of nominal p-values** for all the hexanucleotides grouped by pairs of reverse complements. "} hist(oligo.table$pval, breaks = seq(0, 1, 0.05), col = "beige", main = "P-value histogram") ``` 10. Draw a scatter plot with the P-value (Y) as a function of the log-ratio (X). ```{r plot_pval_vs_lr} plot(oligo.table$lr) # hist(lr, breaks = 50) plot(lr, oligo.table$pval, col = "grey", panel.first = grid()) abline(v = 0) # hist(lr, breaks = 50) plot(lr, oligo.table$pval, col = "grey", log = "y", las = 1, ylim = c(1e-20, 1), panel.first = grid()) abline(v = 0) ``` 11. Compute the E-value, and the significance. $$E = P \cdot N$$ $$sig = -log_{10}(E)$$ ```{r evalue_sig_computation} oligo.table$eval <- oligo.table$pval * nbTests oligo.table$sig <- -log10(oligo.table$eval) ``` 12. Draw a **Volcano plot**, with the significance as a function of the log-ratio. ```{r volcano_plot} plot(oligo.table$lr, oligo.table$sig, col = "grey", xlab = "Log-ratio", ylab = "sig = -log10(E)", panel.first = grid()) abline(h = 0) alpha <- 0.05 abline(h = -log10(alpha), col = "red") ``` 13. Compute the P-value using the Poisson distribution as approximation of the binomial. Are we in suitable conditions for this approximation ? Draw a plot comparing the P-values obtained by the binomial and Poisson distributions. ```{r poisson_pval} lambda <- oligo.table$exp_occ * sum(oligo.table$occ) / sum(oligo.table$exp_occ) oligo.table$pval_Poisson <- ppois( q = oligo.table$occ - 1, lambda = lambda, lower.tail = FALSE) plot(oligo.table$pval, oligo.table$pval_Poisson, log = "xy") abline(a = 0, b = 1) range(oligo.table$pval / oligo.table$pval_Poisson) ```