---
title: "Practical: analysis of ChIP-seq peaks"
author: "Jacques van Helden"
date: '`r Sys.Date()`'
output:
  html_document:
    code_folding: hide
    self_contained: no
    fig_caption: yes
    highlight: tango
    theme: cerulean
    toc: yes
    toc_depth: 3
    toc_float: yes
  ioslides_presentation:
    slide_level: 2
    self_contained: no
    colortheme: dolphin
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    smaller: yes
    toc: yes
    widescreen: yes
  revealjs::revealjs_presentation:
    theme: night
    transition: none
    self_contained: true
    css: ../../slides.css
  slidy_presentation:
    smart: no
    slide_level: 2
    self_contained: yes
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    highlight: tango
    incremental: no
    keep_md: yes
    smaller: yes
    theme: cerulean
    toc: yes
    widescreen: yes
  pdf_document:
    fig_caption: yes
    highlight: zenburn
    toc: yes
    toc_depth: 3
  beamer_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    incremental: no
    keep_tex: no
    slide_level: 2
    theme: Montpellier
    toc: yes
font-import: http://fonts.googleapis.com/css?family=Risque
subtitle: LCG_BEII 2019
font-family: Garamond
transition: linear
---

```{r include=FALSE, echo=FALSE, eval=TRUE}
library(knitr)
library(kableExtra)
# library(formattable)

options(width = 300)
# options(encoding = 'UTF-8')
knitr::opts_chunk$set(
  fig.width = 7, fig.height = 5, 
  fig.path = 'figures/R_intro_',
  fig.align = "center", 
  size = "tiny", 
  echo = TRUE, eval = TRUE, 
  warning = FALSE, message = FALSE, 
  results = FALSE, comment = "")

options(scipen = 12) ## Max number of digits for non-scientific notation
# knitr::asis_output("\\footnotesize")

```


## Introduction

In this practical, we will combine various bioinformatics resources to extract information from ChIP-seq experiments, and evaluate the quality of the peaks.

## Resources


| Resource Name | URL |
|-------------------------|--------------------------------------|
| ReMap | A database of ChIP-seq peaks (<https://remap.univ-amu.fr/>) |
| Jaspar | A database of eukaryote TF binding motifs, mainly built from CHIP-seq peaks (<http://jaspar.genereg.net/>) |
| Hocomoco | A database of Human TF binding motifs, mainly built from CHIP-seq peaks (<http://hocomoco11.autosome.ru/>) |
| RSAT| Regulatory Sequence Analysis Tools <http://rsat.eu/> |
| RSAT Metazoa | <http://metazoa.rsat.eu/> |
| RSAT Teaching | <http://teaching.rsat.eu/> |
| RSAT server at IFB | <http://rsat.france-bioinformatique.fr/rsat/> |
| UCSC NGS file formats | <https://genome.ucsc.edu/FAQ/FAQformat.html> | 

## Creating a workspace for this practical

- Create a new folder to store the results of this practical (e.g.  `~/LCG_BEII/chip-seq_practical/`)
- Set this folder as your working direct

## Getting peaks from ReMap

ReMap (<https://remap.univ-amu.fr/>) is a database of TF binding regions based on an extensive re-analysis of published ChIP-seq data for human transcription factors. ory (`setwd()`)

- Open a connection to [ReMap](https://remap.univ-amu.fr/)

- Select *Homo sapiens*

- Explore the interface

- Choose a transcription factor and a tissue / cell type of interest. 

    - For a same factor, you wil generally find several datasets resulting from different studies, in different tissues (biotypes). 
    - For your exercise, I recommend to select a dataset with a reasonable number of peaks
    - For this tutorial I will use Sox2 in 	[glioma](https://remap.univ-amu.fr/biotype_page/glioma:9606), where ReMap peaks were identified from the original [GEO dataset	GSE67282](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67282)

- Click on the **Download icon** tab, besides the number of peaks. This downloads a compressed (gzipped) bed file (tab-separated columns describing genomic features). 

- Check that the number of peaks in the downloaded file corresponds to the indication on the ReMap page for your factor in the tissue / cell type and in the study identifier (GSE or other) you selected. 

- Check the *name* column in the bed file to make sure that the peaks all belong to the same tissue / cell type and study identifier. 

- Compute 

    - the number of peaks
    - the sum of peak lengths
    - the mean peak length
    - the median peak length

**Beware:** the bed format relies on a somewhat confusing convention to specify the coordinates of genomic features.

- Coordinates are 0-based (the first position of a sequence is referred to as 0).
- The feature start is (as expected) the first position within the feature, but the "end" is the first position **after the feature**.
- Consequently, the feature length can be computed as $l = end - start$, whereas with other formats it should be $l = end - start = 1$. 

**Tips:**  

- The peak statistics can be computed with a spreadsheet (Excel, LibreOffice Calc), R, or the Unix command line. 

- Click the "code" button to display a bash line to compute peak statistics (number, sum of lengths and mean length) from a `bed`file.

```{bash peak_stats_bash, eval=FALSE, echo=TRUE}
RESULT_DIR=~/LCG_BEII/chip-seq_practical/SOX2_glioma_GSE67282
PEAKS=${RESULT_DIR}/GSE67282.SOX2.glioma_stem.bed.gz
gunzip -c ${PEAKS} \
   | awk '{len=$3 - $2; sum += len; n++; mean = sum / n ; print "n="n"\tsum="sum"\tmean="mean}' \
   | tail -n 1
```

- Click the "code" button to display a chunk of R code to compute peak statistics.

```{r pak_stats_r, eval=TRUE, echo=TRUE}
resultDir <- "~/LCG_BEII/chip-seq_practical/SOX2_glioma_GSE67282"
peakBase <- "GSE67282.SOX2.glioma_stem.bed.gz"
peakFile <- file.path(resultDir, peakBase)
#peakFile <- "data/ReMap_GSE67282.SOX2.ESC_peaks.bed"
peaks <- read.delim(peakFile, header = FALSE)
peaks <- peaks[, 1:4] # Keep the fields used for this analysis
names(peaks) <- c("chr", "start", "end", "name")

peaks$len <- peaks$end - peaks$start
head(peaks) # print the first peaks

peak_stats <- data.frame(
  sum = sum(peaks$len),
  mean = mean(peaks$len),
  n = nrow(peaks))

library(knitr)
kable(peak_stats)

```

In our example (Sox2 peaks in embryonic stem cells from the GEO series GSE67282) we obtain: 

- Number of peaks: $n = `r prettyNum(peak_stats$n, big.mark = ",")`$ 
- Sum of peak lengths: $sum = `r prettyNum(peak_stats$sum, big.mark = ",")`$ 
- Mean peak lengths: $mean = `r round(digits=1, peak_stats$mean)`$ 

## Getting motifs from reference databases

- Open a connection to [Jaspar](http://jaspar.genereg.net/)

    - Explore the database functionalities

    - Find the transcription factor binding motifs for your TF of interest. 
    
    - Download the corresponding matrix in your working directory. 
    
        - Note: JASPAR proposes several matrix formats. All these formats are supported yb RSAT, but for convenience you can choose the TRANSFAC format.

    - Save this transfac-formatted motif in a text file named `[YOUR_FACTOR_NAME]_reference-motifs.tf`.


- ***Optional***: Do the same with the [Hocomoco](http://hocomoco11.autosome.ru/) database. 

    - Note the classification of transcription factors on the home page. You can browse it to observe TF families, defined based on their binding domains. 

    - Find the transcripiton factor of interest and save its position-counts matrix (PCM) in your work directory. 

    - With the RSAT tool *convert-matrix*, convert  Hocomoco motif of your factor to transfac format, and paste it below the Jaspar motif (make sure that you paste the whole record, including the `//` line at the end, which indicates the separation between two motifs in the same file). 
    
    - **Tip:** the format Hocomoco is not displayed yet in the list of input formats for *convert-matrix*, but it is actually identical to cluster-buster.

## Discovering motifs with RSAT peak-motifs


- Open a connection to [**RSAT**](http://rsat.france-bioinformatique.fr/rsat/)

- Use the tool **[fetch-sequences](http://rsat.france-bioinformatique.fr/rsat/fetch-sequences_form.php)** to get the sequences of your peaks. 

    - **BEWARE:** the peaks of all ChIP-seq datasets were computed by Ballester's team based on the **hg38** assembly of the Human genome. You should thus use this version of the genome. 

- Save a copy of the fasta file in your local folder (it might serve for further analyses). 

- At the bottom of the *fetch-sequence* result page, click the **peak$motifs button** to send the sequences as input to **peak-motifs**. 

- Set the following parameters for **peak-mortifs**

    - Open the section *Compare discovered motifs with databases or custom reference motifs*. 
    - As *motif databases*, select both *Jaspar core non-redundant vertebrates* and *Hocomoco Human TFs*.
    - Enter as *reference motif* the transfac-formatted file with the JASPAR and Hocomoco motifs for your TF of interest (the file `[YOUR_FACTOR_NAME]_reference-motifs.tf` you built in the previous section). 
    - Enter your email address
    - Leave all other parameters unchanged an run the analysis. 

- Interpret the results: 

    - Did you obtain significant results? How significant ?
    - If so, were they characterized by over-representation (*oligo-analysis*), positional bias (*position-analysis*) or both?
    - Did some of the discovered motifs match the reference motifs for your TF in Jaspar and/or Hocomoco?
    - Did you discover other motifs?


## Negative control

- Use *RSAT random genome fragments** to pick up random regions having the same sizes as your peaks

- Run the same analyses as above with peak-motifs

- Compare the results of this negative control with those obtained above with your peaks, and interpret them. 


## Motif encrihment analysis

> Note : for the 2020 session we had no time to see the theory about **matrix-quality** and motif enrichment. This section can thus be skipped for the report. 

1. On the RSAT server, open the tool **matrix-quality**

2. Submit your peak sequences

3. Enter the JASPAR and Hocomoco reference motifs

4. Run the analyses

## Motif discovery with MEME

Analyse the same peak sequences with [MEME-ChIP](http://meme-suite.org/tools/meme-chip), and convert the discovered motifs in transfac format with the RSAT tool *convert-matrix*.

Also analyse with MEME the random selections of genmic regions (negative control). 

## Matrix clustering

The tool *matrix-clustering* can be used for different purposes. Here we will use it to 
- compare the motifs discovered respectively by MEME and RSAT peak-motifs in the peak-sequences
- merge the similar motifs in order to obtain a non-redundant collection of motifs discoverd by the different tools in the peaks
- compare this non-redundant collection of discovered motifs with a motif database in order to identify the associated TFs.


1. Run the RSAT tool *matrix-clustering* to merge and compare the two collections of motifs discovered by *RSAT peak-motifs* and *MEME-ChIP*, respectively. 

2. From the *matrix-clustering* result, retrieve the **root motifs**, which will provide you with a non-redundant collection of the motifs resulting from RSAT peak-motifs and MEME-ChIP. 

3. Use the RSAT tool *compare-matrices* to compare these non-redundant discovered motifs with JASPAR nonredundant vertebrate. 

4. Use *RSAT matrix-scan* to measure the rate of coverage of the peaks (percentage of peaks with at least one match)

    - for each one of the non-redundant discovered motifs
    - for each reference motifs
    

## Interpret the results

- Do the motifs discovered in the peaks include the reference motifs (those correspinding to the immunoprecipitated factor)?
    
- Are these motifs reported by all the motif discovery approaches or only by some of them?
    
- Do you discover motifs associated with other transcription factors? If so, are these factors relevant to the ChIP-seq data set under study (co-factors of the immunoprecipitated factor, factors involved in this cell/tissue type, ...)?
    
- Which algorithms reported significant motifs in the random selections of genomic regions? How significant are they compared to those reported in the actual peaks? How do you interpret these motifs? 


## Instruction for the report

1. Volume: 4 - 5 pages of text (without counting the figures and the supp. material)

2. Structure: we ask you to return a report structured as a small research article (introduction, material and methods, results and discussion, conclusions an perspective). 

3. **Report template.** Please use the template provided here. Read carefully the instructions. 

    - Rmd format: [Rmd](https://raw.githubusercontent.com/jvanheld/LCG_BEII/gh-pages/practicals/chip-seq_analysis/LASTNAME_Firstname_chip-seq_analysis_report.Rmd). 

    - Word format: [docx](https://raw.githubusercontent.com/jvanheld/LCG_BEII/gh-pages/practicals/chip-seq_analysis/LASTNAME_Firstname_chip-seq_analysis_report.docx). 


Please submit both the original file (Rmd) and the report compiled in Word format (in RStudio, use the command **Knit to Word**), in order to enable me to annotate it in the margins.