---
title: "Scanning non-coding sequences with a TFBM"
author: "Jacques van Helden"
date: '`r Sys.Date()`'
output:
  html_document:
    self_contained: no
    fig_caption: yes
    highlight: zenburn
    theme: cerulean
    toc: yes
    toc_depth: 3
    toc_float: yes
  ioslides_presentation:
    slide_level: 2
    self_contained: no
    colortheme: dolphin
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    smaller: yes
    toc: yes
    widescreen: yes
  revealjs::revealjs_presentation:
    theme: night
    transition: none
    self_contained: true
    css: ../slides.css
  slidy_presentation:
    smart: no
    slide_level: 2
    self_contained: yes
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    highlight: tango
    incremental: no
    keep_md: yes
    smaller: yes
    theme: cerulean
    toc: yes
    widescreen: yes
  pdf_document:
    fig_caption: yes
    highlight: zenburn
    toc: yes
    toc_depth: 3
  beamer_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    incremental: no
    keep_tex: no
    slide_level: 2
    theme: Montpellier
    toc: yes
font-import: http://fonts.googleapis.com/css?family=Risque
subtitle: LCG_BEII 2025
font-family: Garamond
transition: linear
---

```{r include=FALSE, echo=FALSE, eval=TRUE}
library(knitr)
library(kableExtra)
# library(formattable)

options(width = 300)
# options(encoding = 'UTF-8')
knitr::opts_chunk$set(
  fig.width = 7, fig.height = 5, 
  fig.path = 'figures/R_intro_',
  fig.align = "center", 
  size = "tiny", 
  echo = TRUE, eval = TRUE, 
  warning = FALSE, message = FALSE, 
  results = FALSE, comment = "")

options(scipen = 12) ## Max number of digits for non-scientific notation
# knitr::asis_output("\\footnotesize")

```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

The goal of this practical is to evaluate the respective performances of two modes of representations for transcription factor binding motifs (**TFBMs**) to predict transcription factor binding sites (**TFBS**).

As reference genome we will use the strain *Escherichia_coli_GCF_000005845.2_ASM584v2*. 

**Tip:** in the RSAT Web site, you can type `ASM584` to quickly select this strain. This is much easier than typing the whole name. 

## Resources

| Acronym | Description | URL |
|--------|--------------------------|---------------------|
| RegulonDB | Database of transcriptional regulation in *Escherichia coli* | <https://regulondb.ccg.unam.mx/> |
| RSAT | Regulatory Sequence Analysis Tools | <http://rsat.eu> |
| RSAT Prokaryotes | RSAT server specialised for Prokaryote genomes | <http://prokaryotes.rsat.eu> | 
| Classification statistics | Wikipedia page with the key formula to compute performance statistics from a confusion matrix | <https://en.wikipedia.org/wiki/Confusion_matrix> |
| Sensitivity and Specificity | Wikipedia page defining these key performance indicators | <https://en.wikipedia.org/wiki/Sensitivity_and_specificity> |

## Collective table for the 2023 practical

Students will store their results in a shared spreadsheet, which will be used to compare their results and get a broader landscape from the comparison of the results obtained with different transcription factors.  

- Shared folder (in edit mode): <https://drive.google.com/drive/folders/139B72oJOoI5fDWn3FfxhK7u-K3TXhPEV>

- **Shared result table** (one row per student) : <https://docs.google.com/spreadsheets/d/1US2dyzgGsh0MWTIq8E5Eh3EpE5pKRN_8/edit#gid=64497152>

    Add a line with your name and email. You will progressively fill in the other columns during the pratical. 

## Motif scanning exercise

In your computer, create a folder to store the results of this practical, for example : 

`$HOME/LCG_BEII_practicals/` 

(you can change the path and name according to your own organisation of folders). 

## Choosing a TF on RegulonDB

Each student will choose one transcription factor of interest having a reasonable number of binding sites (between 10 and 25). 

For this tutorial, I will use *AraC*.

- Open a connection to RegulonDB  <http://regulondb.ccg.unam.mx/>

- Click on the link [regulon list](http://regulondb.ccg.unam.mx/search?term=Regulon&organism=ECK12&type=All#). This opens a table with all the regulons. 

- Click on the top of the column "Total of Regulated Genes" to sort factors by decreasing number of target genes.

- Choose your TF of interest and open its record. **Check that there is a matrix for this TF**. If not, come back to the table and choose another TF. 

- Fill up the details of the collective exploration table.

- Save a text file with the target gene names (one per row).

- Save another text file with the names of the operon leader genes (one gene per row). These will serve as reference to compute the rate of recovery of the target genes with the different motif representations (consensus or matrix, resp.).

- Save a fasta file with the sequences of the known binding sites for your TF (tip: click on the bug "+" button in the header of the binding site section)

- Save in a text file the matrix (PSSM) associated to your factor, in tab format. 

## Computing the degenerate consensus from the reference matrix

- Connect RSAT server: <http://rsat.eu/>

- Choose the [Prokaryote server](http://prokaryotes.rsat.eu/)

- Use **convert-matrix** to compute frequencies, weights, parameters and display a logo of your matrix. Paste this logo in the shared result table (see link above)

- In the result, get the degenerated consensus and save it to a separate text file. 

## Getting all upstream ("promoter") sequences of *E.coli*

- Open the tool **[retrieve-seq](http://prokaryotes.rsat.eu/retrieve-seq_form.cgi)**

- Select organism *Escherichia_coli_GCF_000005845.2_ASM584v2* (top : type simply K12 in the organism query box)

- Set all parameters to get the non-coding sequences located upstream of all genes with a maximal distance of 400 bp from the gene start (i.e. relative coordinates from -1 to -400). 

- Copy the URL of the result file and save it in a text file (we will use it several times below)


## Coverage of the annotated binding sites by the reference motif

The question addressed here is to evaulate the relative performances of the consensus and PSSM to predict TFBSs and identify the target genes of our transcription factor of interest. 

Note that the TFBS are expected to be found in the promoters of operon leader genes, and not in the upstream sequences of the other target genes (the genes following the leader in an operon). Thus, the reference gene set for this analysis are the leader genes of the operons regulated by our TF of interest.  

- Use **[dna-pattern](http://prokaryotes.rsat.eu/dna-pattern.cgi)** to scan the annotated binding sites with the IUPAC consensus from RegulonDB (e.g. `cayaGcmrrwwartCCaTaw` for AraC).

    - Tip: uncheck the option *sequence limits* to avoid getting 4200 lines with promoter sizes.

- Use **[matrix-scan]()** to scan the annotated binding sites with the RegulonDB matrix


### Interpretation 

- Compute the coverage rate of the two respective modes of representation (consensus versus matrix). 

- Compare the values and comment the results. 


## Binding site prediction in all promoters

- Use the same tools (dna-pattern and matrix-scan) to predict binding sites in all the promoters of E.coli. 

- For **matrix-scan**, run the analysis with a threshold of p-value of either 0.001 or 0.0001. 

### Interpretation

- Compare the number of matches obtained in these respective searches. 

- With the respective p-values used for the scanning, how many matches would you expect by chance ?

- Comment the results. 

## Negative control 1: scan artificial sequences with your motif

As a negative control, we will run exactly the same analysis in sequences where there should be no specific site for our transcription factor. For this, we will generate 

- RSAT *random-sequences*, generate random sequences 

    - use the same background model as you used to scan the promoter sequences in the above steps

    - use the promoter sequences retrieved above as model in order to obtain random sequences of the same sizes.  

- Run the same analysis as above (all promoters) 

### Interpretation 

- How many sites were predicted?
- How many sites would be expected by chance according to your parameters?
- Do the two numbers correspond?

## Negative control 2: permute the columns of the matrix

A second type of negative control consists in scanning the actual biological sequences (all the promoters of *E.coli*, as analysed above), but to permute the columns of the PSSM. Such a permutation between the columns preserves statistical properties of the matrix (number of occurrences, information content of each column, total information content) but looses the biological information (the specific sequences recognized by the transcription factor). 

- Use the tool **permute-matrix** in order to generate 10 randomized copies of the motif

- Send these randomiazed matrices to **convert-matrix** and check their logo. 

- Run the same analyses as above with the randomized matrix

- Compare the number of sites obtained between the RegulonDB matrix and the randomized matrix derived from it.

## Scientific report

### Structure of the report

Each student should return a scientific report containing 

1.  A summary of the principal results and their interpretation. This section should be synthetic, and cannot exceed 2 pages. 

2. A full list of the complete RSAT commands used for the successive steps of the analysis

3. As many figures and tables as useful

    - Figures and tables are not counted in the two pages, but numbered in order to enable cross-references from the text to these elements;

    - Each figure or table should have a title (in bold) and a legend sufficiently detailed to enable the reader understanding their contents. 

### Report submission date

Reports should be deposited **before February 26, 2023** on the Moodle server of the LCG.

<iframe src="https://free.timeanddate.com/countdown/i8qb9ez7/n4377/cf12/cm0/cu4/ct0/cs0/cac000/cr0/ss0/cac000/cpc000/pcfff/tcfff/fs100/szw320/szh135/tatTime%20left%20to%20deliver/tac000/tptTime%20since%20Event%20started%20in/tpc000/mac000/mpc000/iso2023-02-26T00:00:00" allowtransparency="true" frameborder="0" width="170" height="68"></iframe>


### Evaluation criteria

- Completion of the result in the shared result table on GDrive. 

- Precision of the results in the report (e.g. number of annotated sites, genes, operons in RegulonDB, number of putative sites and genes detected by matrix-scan, number of overlapping genes, ...)

- Correctness of the values in the confusion matrix and derived statistics

- Interpretation of the confusion table and [derived statistics](https://en.wikipedia.org/wiki/Sensitivity_and_specificity ): 

    - True positive (TP)
    - True negative (TN)
    - False positive (FP)
    - False negative (FN)
    - Sensitivity TP / (TP + FN)
    - Positive predictive value TP/ (TP + FP)
    - ...

- Significance evaluation of the putative sites detected by matrix-scan (Take into account weight and p-value)

- Interpretation of negative control

<div class="alert alert-block alert-warning">
<b>Attention:</b> The biological interpretation of the findings is crucial for your score.<br>**Plagiarism will not be tolerated**
</div>