--- title: "Solutions - Scanning non-coding sequences with a TFBM" author: "Jacques van Helden" date: '`r Sys.Date()`' output: html_document: self_contained: no fig_caption: yes highlight: zenburn theme: cerulean toc: yes toc_depth: 3 toc_float: yes code_folding: show ioslides_presentation: slide_level: 2 self_contained: no colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango smaller: yes toc: yes widescreen: yes revealjs::revealjs_presentation: theme: night transition: none self_contained: true css: ../slides.css slidy_presentation: smart: no slide_level: 2 self_contained: yes fig_caption: yes fig_height: 6 fig_width: 7 highlight: tango incremental: no keep_md: yes smaller: yes theme: cerulean toc: yes widescreen: yes pdf_document: fig_caption: yes highlight: zenburn toc: yes toc_depth: 3 beamer_presentation: colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango incremental: no keep_tex: no slide_level: 2 theme: Montpellier toc: yes font-import: http://fonts.googleapis.com/css?family=Risque subtitle: LCG_BEII 2020 font-family: Garamond transition: linear --- ```{r include=FALSE, echo=FALSE, eval=TRUE} library(knitr) library(kableExtra) # library(formattable) options(width = 300) £, # options(encoding = 'UTF-8') knitr::opts_chunk$set( fig.width = 7, fig.height = 5, fig.path = 'figures/R_intro_', fig.align = "center", size = "tiny", echo = TRUE, eval = FALSE, warning = FALSE, message = FALSE, results = FALSE, comment = "") options(scipen = 12) ## Max number of digits for non-scientific notation # knitr::asis_output("\\footnotesize") ``` ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Solutions to the exercises In this file we provide the solutions of the practical **Scanning non-coding sequences with a TFBM** with command-line use of the *RSAT* software suite. ## Reference genome ## Collective table for the 2020 practical Students will store their results in a shared spreadsheet, which will be used to compare their results and get a broader landscape from the comparison of the results obtained with different transcription factors. - Folder: - Motif scanning exercise: In your computer, create a folder to store the results of this practical, for example: `$HOME/LCG_BEII_practicals/` (you can change the path and name according to your own organisation of folders). ## Choosing a TF on RegulonDB For this exercise, I chose the transcription factor AraC. I define this in an environment variable. I also define and create a specific directory for the results related to this transcription factor. ```{bash} ## Define the reference organism export ORG=Escherichia_coli_GCF_000005845.2_ASM584v2 ## Choose a transcription factor (TF) of interes export TF=AraC export RESULT_DIR=results/${TF} mkdir -p ${RESULT_DIR} ``` I use the REST Web services to automatically gather the annotations from RegulonDB. REST Web services enable to invoke remotely a resource (database, software tool) by composing an URL with an **entry point** (which specifies the type of query) and a set of parameters separated by `&`. For example, the list of genes regulated by AraC can be gound at the following URL. They can then be stored in a file with a web aspirator, such as `curl`or `wget`. ```{bash} ## Get the annotated binding sites from RegulonDB curl 'http://regulondb.ccg.unam.mx/webresources/tools/getTFBS?tfObject=AraC&extended=0' \ | grep -v '^#' \ > results/AraC/AraC_RegulonDB_sites_ext0.tsv ## Get the annotated position-specific scoring matrix from RegulonDB curl 'http://regulondb.ccg.unam.mx/webresources/tools/getPSSM?tfObject=AraC' \ | grep -v '^#' \ > results/AraC/AraC_RegulonDB_PSSM.tab ## Get the annotated target genes from RegulonDB curl 'http://regulondb.ccg.unam.mx/webresources/regulon/getRegulatedGenes?tfObject=AraC' \ | grep -v '^#' \ > results/AraC/AraC_RegulonDB_genes.tab ``` ## Computing the degenerate consensus from the reference matrix The degenerate consensus can be computed with `convert-matrix` with the appropriate parameters. Since it is printed as a comment (rows starging with `;`) we can extract its actual value with grep and cut. ```{bash} convert-matrix -v -i results/AraC/AraC_RegulonDB_PSSM.tab \ -from tab -to tab -return consensus \ | grep -Pe '^; consensus\t' \ | cut -f 2 \ > results/AraC/AraC_RegulonDB_consensus.tab ``` ## Getting all upstream ("promoter") sequences of *E.coli* ```{bash} ## Define an environment variable with the file containing all upstream sequences export ALLUP=results/AraC/Escherichia_coli_GCF_000005845.2_ASM584v2_all_upstream-noorf.fasta ## Retrieve all upstream sequences retrieve-seq -org Escherichia_coli_GCF_000005845.2_ASM584v2 \ -from -1 -to -400 -noorf -all -label name \ -o ${ALLUP} ## Check the result (type "q" to quit the "less" command) less ${ALLUP} ``` ## Coverage of the annotated binding sites by the reference motif ### Use *dna-pattern* to scan the annotated binding sites with a consensus ```{bash} ## Scan annotated TFBSs with degenerate consensus dna-pattern -v 1 \ -i ${ALLUP} \ -pl results/AraC/AraC_RegulonDB_consensus.tab \ -o results/AraC/TFBS_matches_with_deg-consensus_AraC.ft ## Check the result less results/AraC/TFBS_matches_with_deg-consensus_AraC.ft ``` ### Choosing a background model for matrix-scan To scan sequences with a matrix, we need to specify a background model. We can either compute it from the input sequences themselves (option `-bginput`) or specify a predefined background model file (option `-bg_file`). Pre-computed background models are available in RSAT for each organism, and with different parameters: - oligonucleotides or dyads, - k-mer length, - frequencies counted on a single or on both strand, - accept or not self-overlaps for periodic patterns. ### Use *matrix-scan* to scan the annotated binding sites with a PSSM ```{bash} ## Get the list of recovered target genes ## We sort with option -u (unique) because some genes may have several predicted bingind sites grep -v '^;' results/AraC/allup_matches_with_deg-consensus_AraC.ft \ | cut -f 4 | sort -u \ > results/AraC/TFBS_matches_with_deg-consensus_AraC_genes.txt cat results/AraC/TFBS_matches_with_deg-consensus_AraC_genes.txt ``` ### Compare the coverage rate of the two motifs ## Binding site prediction in all promoters - Use the same tools (dna-pattern and matrix-scan) to predict binding sites in all the promoters of E.coli. - For **matrix-scan**, run the analysis with a threshold of p-value of either 0.001 or 0.0001. - Compare the number of matches obtained in these respective searches. - With the respective p-values used for the scanning, how many matches would you expect by chance ? ## Negative control 1: scan artificial sequences with your motif - RSAT random sequences ## Negative control 2: permute the columns of the matrix - Use the tool **permute-matrix** in order to generate 10 randomized copies of the motif - Send these randomiazed matrices to **convert-matrix** and check their logo. - Run the same analyses as above with the randomized matrix - Compare the number of sites obtained between the RegulonDB matrix and the randomized matrix derived from it.