--- title: "Practical: analysis of ChIP-seq peaks" author: "Jacques van Helden" date: '`r Sys.Date()`' output: html_document: code_folding: hide self_contained: no fig_caption: yes highlight: tango theme: cerulean toc: yes toc_depth: 3 toc_float: yes ioslides_presentation: slide_level: 2 self_contained: no colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango smaller: yes toc: yes widescreen: yes revealjs::revealjs_presentation: theme: night transition: none self_contained: true css: ../../slides.css slidy_presentation: smart: no slide_level: 2 self_contained: yes fig_caption: yes fig_height: 6 fig_width: 7 highlight: tango incremental: no keep_md: yes smaller: yes theme: cerulean toc: yes widescreen: yes pdf_document: fig_caption: yes highlight: zenburn toc: yes toc_depth: 3 beamer_presentation: colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango incremental: no keep_tex: no slide_level: 2 theme: Montpellier toc: yes font-import: http://fonts.googleapis.com/css?family=Risque subtitle: LCG_BEII 2019 font-family: Garamond transition: linear --- ```{r include=FALSE, echo=FALSE, eval=TRUE} library(knitr) library(kableExtra) # library(formattable) options(width = 300) # options(encoding = 'UTF-8') knitr::opts_chunk$set( fig.width = 7, fig.height = 5, fig.path = 'figures/R_intro_', fig.align = "center", size = "tiny", echo = TRUE, eval = TRUE, warning = FALSE, message = FALSE, results = FALSE, comment = "") options(scipen = 12) ## Max number of digits for non-scientific notation # knitr::asis_output("\\footnotesize") ``` ## Introduction In this practical, we will combine various bioinformatics resources to extract information from ChIP-seq experiments, and evaluate the quality of the peaks. ## Resources | Resource Name | URL | |-------------------------|--------------------------------------| | ReMap | A database of ChIP-seq peaks () | | Jaspar | A database of eukaryote TF binding motifs, mainly built from CHIP-seq peaks () | | Hocomoco | A database of Human TF binding motifs, mainly built from CHIP-seq peaks () | | RSAT| Regulatory Sequence Analysis Tools | | RSAT Metazoa | | | RSAT Teaching | | | RSAT server at IFB | | | UCSC NGS file formats | | ## Creating a workspace for this practical - Create a new folder to store the results of this practical (e.g. `~/LCG_BEII/chip-seq_practical/`) - Set this folder as your working direct ## Getting peaks from ReMap ReMap () is a database of TF binding regions based on an extensive re-analysis of published ChIP-seq data for human transcription factors. ory (`setwd()`) - Open a connection to [ReMap](https://remap.univ-amu.fr/) - Select *Homo sapiens* - Explore the interface - Choose a transcription factor and a tissue / cell type of interest. - For a same factor, you wil generally find several datasets resulting from different studies, in different tissues (biotypes). - For your exercise, I recommend to select a dataset with a reasonable number of peaks - For this tutorial I will use Sox2 in [glioma](https://remap.univ-amu.fr/biotype_page/glioma:9606), where ReMap peaks were identified from the original [GEO dataset GSE67282](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67282) - Click on the **Download icon** tab, besides the number of peaks. This downloads a compressed (gzipped) bed file (tab-separated columns describing genomic features). - Check that the number of peaks in the downloaded file corresponds to the indication on the ReMap page for your factor in the tissue / cell type and in the study identifier (GSE or other) you selected. - Check the *name* column in the bed file to make sure that the peaks all belong to the same tissue / cell type and study identifier. - Compute - the number of peaks - the sum of peak lengths - the mean peak length - the median peak length **Beware:** the bed format relies on a somewhat confusing convention to specify the coordinates of genomic features. - Coordinates are 0-based (the first position of a sequence is referred to as 0). - The feature start is (as expected) the first position within the feature, but the "end" is the first position **after the feature**. - Consequently, the feature length can be computed as $l = end - start$, whereas with other formats it should be $l = end - start = 1$. **Tips:** - The peak statistics can be computed with a spreadsheet (Excel, LibreOffice Calc), R, or the Unix command line. - Click the "code" button to display a bash line to compute peak statistics (number, sum of lengths and mean length) from a `bed`file. ```{bash peak_stats_bash, eval=FALSE, echo=TRUE} RESULT_DIR=~/LCG_BEII/chip-seq_practical/SOX2_glioma_GSE67282 PEAKS=${RESULT_DIR}/GSE67282.SOX2.glioma_stem.bed.gz gunzip -c ${PEAKS} \ | awk '{len=$3 - $2; sum += len; n++; mean = sum / n ; print "n="n"\tsum="sum"\tmean="mean}' \ | tail -n 1 ``` - Click the "code" button to display a chunk of R code to compute peak statistics. ```{r pak_stats_r, eval=TRUE, echo=TRUE} resultDir <- "~/LCG_BEII/chip-seq_practical/SOX2_glioma_GSE67282" peakBase <- "GSE67282.SOX2.glioma_stem.bed.gz" peakFile <- file.path(resultDir, peakBase) #peakFile <- "data/ReMap_GSE67282.SOX2.ESC_peaks.bed" peaks <- read.delim(peakFile, header = FALSE) peaks <- peaks[, 1:4] # Keep the fields used for this analysis names(peaks) <- c("chr", "start", "end", "name") peaks$len <- peaks$end - peaks$start head(peaks) # print the first peaks peak_stats <- data.frame( sum = sum(peaks$len), mean = mean(peaks$len), n = nrow(peaks)) library(knitr) kable(peak_stats) ``` In our example (Sox2 peaks in embryonic stem cells from the GEO series GSE67282) we obtain: - Number of peaks: $n = `r prettyNum(peak_stats$n, big.mark = ",")`$ - Sum of peak lengths: $sum = `r prettyNum(peak_stats$sum, big.mark = ",")`$ - Mean peak lengths: $mean = `r round(digits=1, peak_stats$mean)`$ ## Getting motifs from reference databases - Open a connection to [Jaspar](http://jaspar.genereg.net/) - Explore the database functionalities - Find the transcription factor binding motifs for your TF of interest. - Download the corresponding matrix in your working directory. - Note: JASPAR proposes several matrix formats. All these formats are supported yb RSAT, but for convenience you can choose the TRANSFAC format. - Save this transfac-formatted motif in a text file named `[YOUR_FACTOR_NAME]_reference-motifs.tf`. - ***Optional***: Do the same with the [Hocomoco](http://hocomoco11.autosome.ru/) database. - Note the classification of transcription factors on the home page. You can browse it to observe TF families, defined based on their binding domains. - Find the transcripiton factor of interest and save its position-counts matrix (PCM) in your work directory. - With the RSAT tool *convert-matrix*, convert Hocomoco motif of your factor to transfac format, and paste it below the Jaspar motif (make sure that you paste the whole record, including the `//` line at the end, which indicates the separation between two motifs in the same file). - **Tip:** the format Hocomoco is not displayed yet in the list of input formats for *convert-matrix*, but it is actually identical to cluster-buster. ## Discovering motifs with RSAT peak-motifs - Open a connection to [**RSAT**](http://rsat.france-bioinformatique.fr/rsat/) - Use the tool **[fetch-sequences](http://rsat.france-bioinformatique.fr/rsat/fetch-sequences_form.php)** to get the sequences of your peaks. - **BEWARE:** the peaks of all ChIP-seq datasets were computed by Ballester's team based on the **hg38** assembly of the Human genome. You should thus use this version of the genome. - Save a copy of the fasta file in your local folder (it might serve for further analyses). - At the bottom of the *fetch-sequence* result page, click the **peak$motifs button** to send the sequences as input to **peak-motifs**. - Set the following parameters for **peak-mortifs** - Open the section *Compare discovered motifs with databases or custom reference motifs*. - As *motif databases*, select both *Jaspar core non-redundant vertebrates* and *Hocomoco Human TFs*. - Enter as *reference motif* the transfac-formatted file with the JASPAR and Hocomoco motifs for your TF of interest (the file `[YOUR_FACTOR_NAME]_reference-motifs.tf` you built in the previous section). - Enter your email address - Leave all other parameters unchanged an run the analysis. - Interpret the results: - Did you obtain significant results? How significant ? - If so, were they characterized by over-representation (*oligo-analysis*), positional bias (*position-analysis*) or both? - Did some of the discovered motifs match the reference motifs for your TF in Jaspar and/or Hocomoco? - Did you discover other motifs? ## Negative control - Use *RSAT random genome fragments** to pick up random regions having the same sizes as your peaks - Run the same analyses as above with peak-motifs - Compare the results of this negative control with those obtained above with your peaks, and interpret them. ## Motif encrihment analysis > Note : for the 2020 session we had no time to see the theory about **matrix-quality** and motif enrichment. This section can thus be skipped for the report. 1. On the RSAT server, open the tool **matrix-quality** 2. Submit your peak sequences 3. Enter the JASPAR and Hocomoco reference motifs 4. Run the analyses ## Motif discovery with MEME Analyse the same peak sequences with [MEME-ChIP](http://meme-suite.org/tools/meme-chip), and convert the discovered motifs in transfac format with the RSAT tool *convert-matrix*. Also analyse with MEME the random selections of genmic regions (negative control). ## Matrix clustering The tool *matrix-clustering* can be used for different purposes. Here we will use it to - compare the motifs discovered respectively by MEME and RSAT peak-motifs in the peak-sequences - merge the similar motifs in order to obtain a non-redundant collection of motifs discoverd by the different tools in the peaks - compare this non-redundant collection of discovered motifs with a motif database in order to identify the associated TFs. 1. Run the RSAT tool *matrix-clustering* to merge and compare the two collections of motifs discovered by *RSAT peak-motifs* and *MEME-ChIP*, respectively. 2. From the *matrix-clustering* result, retrieve the **root motifs**, which will provide you with a non-redundant collection of the motifs resulting from RSAT peak-motifs and MEME-ChIP. 3. Use the RSAT tool *compare-matrices* to compare these non-redundant discovered motifs with JASPAR nonredundant vertebrate. 4. Use *RSAT matrix-scan* to measure the rate of coverage of the peaks (percentage of peaks with at least one match) - for each one of the non-redundant discovered motifs - for each reference motifs ## Interpret the results - Do the motifs discovered in the peaks include the reference motifs (those correspinding to the immunoprecipitated factor)? - Are these motifs reported by all the motif discovery approaches or only by some of them? - Do you discover motifs associated with other transcription factors? If so, are these factors relevant to the ChIP-seq data set under study (co-factors of the immunoprecipitated factor, factors involved in this cell/tissue type, ...)? - Which algorithms reported significant motifs in the random selections of genomic regions? How significant are they compared to those reported in the actual peaks? How do you interpret these motifs? ## Instruction for the report 1. Volume: 4 - 5 pages of text (without counting the figures and the supp. material) 2. Structure: we ask you to return a report structured as a small research article (introduction, material and methods, results and discussion, conclusions an perspective). 3. **Report template.** Please use the template provided here. Read carefully the instructions. - Rmd format: [Rmd](https://raw.githubusercontent.com/jvanheld/LCG_BEII/gh-pages/practicals/chip-seq_analysis/LASTNAME_Firstname_chip-seq_analysis_report.Rmd). - Word format: [docx](https://raw.githubusercontent.com/jvanheld/LCG_BEII/gh-pages/practicals/chip-seq_analysis/LASTNAME_Firstname_chip-seq_analysis_report.docx). Please submit both the original file (Rmd) and the report compiled in Word format (in RStudio, use the command **Knit to Word**), in order to enable me to annotate it in the margins.