---
title: "Solutions - Scanning non-coding sequences with a TFBM"
author: "Jacques van Helden"
date: '`r Sys.Date()`'
subtitle: LCG_BEII 2020
---

## Solutions to the exercises

This file provides the solutions to the practical **Scanning non-coding sequences with a TFBM**, using the *RSAT* software suite on the command line.

## Reference genome

## Collective table for the 2020 practical

Students will store their results in a shared spreadsheet, which will be used to compare them and to get a broader picture from the results obtained with different transcription factors.

- Folder:
- Motif scanning exercise:

On your computer, create a folder to store the results of this practical, for example `$HOME/LCG_BEII_practicals/` (you can change the path and name according to your own organisation of folders).
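A minimal sketch of this step is shown below. The variable name `PRACTICAL_DIR` is only an illustration and is not used elsewhere in this practical. Since the commands of the following sections write to relative paths such as `results/AraC/`, it seems convenient to run them from this folder.

```bash
## Hypothetical location: adapt it to your own organisation of folders
export PRACTICAL_DIR="${HOME}/LCG_BEII_practicals"

## Create the folder (option -p avoids an error if it already exists)
mkdir -p "${PRACTICAL_DIR}"

## Move to this folder, so that relative paths are created inside it
cd "${PRACTICAL_DIR}"
pwd
```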

## Choosing a TF on RegulonDB

For this exercise, I chose the transcription factor AraC. I store its name in an environment variable, and I create a dedicated directory for the results related to this transcription factor.

```{bash}
## Define the reference organism
export ORG=Escherichia_coli_GCF_000005845.2_ASM584v2

## Choose a transcription factor (TF) of interest
export TF=AraC
export RESULT_DIR=results/${TF}
mkdir -p ${RESULT_DIR}
```

I use the RegulonDB REST Web services to gather the annotations automatically. REST Web services make it possible to query a remote resource (database, software tool) by composing a URL with an **entry point** (which specifies the type of query) and a set of parameters separated by `&`. For example, the list of genes regulated by AraC is returned by the `getRegulatedGenes` entry point with the parameter `tfObject=AraC`. The results can then be stored in a file with a command-line download tool such as `curl` or `wget`.

```{bash}
## Get the annotated binding sites from RegulonDB
curl 'http://regulondb.ccg.unam.mx/webresources/tools/getTFBS?tfObject=AraC&extended=0' \
  | grep -v '^#' \
  > results/AraC/AraC_RegulonDB_sites_ext0.tsv

## Get the annotated position-specific scoring matrix from RegulonDB
curl 'http://regulondb.ccg.unam.mx/webresources/tools/getPSSM?tfObject=AraC' \
  | grep -v '^#' \
  > results/AraC/AraC_RegulonDB_PSSM.tab

## Get the annotated target genes from RegulonDB
curl 'http://regulondb.ccg.unam.mx/webresources/regulon/getRegulatedGenes?tfObject=AraC' \
  | grep -v '^#' \
  > results/AraC/AraC_RegulonDB_genes.tab
```

## Computing the degenerate consensus from the reference matrix

The degenerate consensus can be computed with `convert-matrix` with the appropriate parameters. Since it is printed as a comment (rows starting with `;`), we can extract its actual value with `grep` and `cut`.

```{bash}
convert-matrix -v -i results/AraC/AraC_RegulonDB_PSSM.tab \
  -from tab -to tab -return consensus \
  | grep -Pe '^; consensus\t' \
  | cut -f 2 \
  > results/AraC/AraC_RegulonDB_consensus.tab
```

## Getting all upstream ("promoter") sequences of *E. coli*

```{bash}
## Define an environment variable with the file containing all upstream sequences
export ALLUP=results/AraC/Escherichia_coli_GCF_000005845.2_ASM584v2_all_upstream-noorf.fasta

## Retrieve all upstream sequences
retrieve-seq -org Escherichia_coli_GCF_000005845.2_ASM584v2 \
  -from -1 -to -400 -noorf -all -label name \
  -o ${ALLUP}

## Check the result (type "q" to quit the "less" command)
less ${ALLUP}
```

## Coverage of the annotated binding sites by the reference motif

### Use *dna-pattern* to scan the annotated binding sites with a consensus

```{bash}
## Scan annotated TFBSs with degenerate consensus
dna-pattern -v 1 \
  -i ${ALLUP} \
  -pl results/AraC/AraC_RegulonDB_consensus.tab \
  -o results/AraC/TFBS_matches_with_deg-consensus_AraC.ft

## Check the result
less results/AraC/TFBS_matches_with_deg-consensus_AraC.ft
```

### Choosing a background model for matrix-scan

To scan sequences with a matrix, we need to specify a background model. We can either compute it from the input sequences themselves (option `-bginput`) or specify a predefined background model file (option `-bg_file`).

Pre-computed background models are available in RSAT for each organism, with different parameters:

- oligonucleotides or dyads,
- k-mer length,
- frequencies counted on a single strand or on both strands,
- self-overlaps accepted or not for periodic patterns.

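To make the notion of background model concrete, here is a minimal sketch that estimates the simplest possible model (independent residue frequencies, i.e. a Markov model of order 0, counted on a single strand) from a set of input sequences. It is only an illustration written with `awk` on a hypothetical toy FASTA file. It is not the way RSAT computes its models, which can rely on longer oligonucleotides.

```bash
## Toy input sequences (hypothetical, for illustration only)
TOY_FASTA=$(mktemp)
cat > "${TOY_FASTA}" <<'EOF'
>seq1
ACGTAC
>seq2
GGTTAA
EOF

## Count each residue over all sequence lines (header lines start with ">"),
## then print: residue, occurrences, relative frequency
BG_ORDER0=$(awk '
  !/^>/ {
    seq = toupper($0)
    for (i = 1; i <= length(seq); i++) { count[substr(seq, i, 1)]++; total++ }
  }
  END {
    for (b in count) printf "%s\t%d\t%.3f\n", b, count[b], count[b] / total
  }' "${TOY_FASTA}" | sort)

echo "${BG_ORDER0}"
rm -f "${TOY_FASTA}"
```

With these toy sequences, A is the most frequent residue (4 occurrences out of 12). A motif score computed against such a model reflects how much a site departs from the composition of the scanned sequences.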
### Use *matrix-scan* to scan the annotated binding sites with a PSSM

```{bash}
## Get the list of recovered target genes
## We sort with option -u (unique) because some genes may have several predicted binding sites
grep -v '^;' results/AraC/TFBS_matches_with_deg-consensus_AraC.ft \
  | cut -f 4 | sort -u \
  > results/AraC/TFBS_matches_with_deg-consensus_AraC_genes.txt

cat results/AraC/TFBS_matches_with_deg-consensus_AraC_genes.txt
```

### Compare the coverage rate of the two motifs

## Binding site prediction in all promoters

- Use the same tools (**dna-pattern** and **matrix-scan**) to predict binding sites in all the promoters of *E. coli*.
- For **matrix-scan**, run the analysis with a p-value threshold of either 0.001 or 0.0001.
- Compare the number of matches obtained in these respective searches.
- With the respective p-values used for the scanning, how many matches would you expect by chance?

## Negative control 1: scan artificial sequences with your motif

- RSAT random sequences

## Negative control 2: permute the columns of the matrix

- Use the tool **permute-matrix** to generate 10 randomized copies of the motif.
- Send these randomized matrices to **convert-matrix** and check their logos.
- Run the same analyses as above with the randomized matrices.
- Compare the number of sites obtained with the RegulonDB matrix and with the randomized matrices derived from it.
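
The negative controls give an empirical estimate of the number of false positives. As a complement, the number of matches expected by chance can be approximated as the p-value threshold multiplied by the number of scanned positions. The sketch below works this out on a toy FASTA file. The sequences and the motif width (`MOTIF_WIDTH`) are hypothetical, and the calculation assumes that both strands are scanned and that a sequence of length L offers L - w + 1 positions per strand for a motif of width w.

```bash
## Toy sequences (hypothetical, for illustration only)
TOY_FASTA=$(mktemp)
cat > "${TOY_FASTA}" <<'EOF'
>geneA
ACGTACGTACGTACGTACGT
>geneB
TTGACAGCTAGCTAGC
EOF

MOTIF_WIDTH=6   # hypothetical motif width (number of columns of the matrix)
PVAL=0.001      # p-value threshold used for the scan

## Sum the scannable positions over all sequences: 2 strands x (L - w + 1)
POSITIONS=$(awk -v w="${MOTIF_WIDTH}" '
  /^>/ { if (len >= w) n += 2 * (len - w + 1); len = 0; next }
  { len += length($0) }
  END { if (len >= w) n += 2 * (len - w + 1); print n + 0 }
' "${TOY_FASTA}")

## Expected number of matches by chance = p-value x number of positions
EXPECTED=$(awk -v n="${POSITIONS}" -v p="${PVAL}" 'BEGIN { printf "%.3f", n * p }')

echo "positions=${POSITIONS} expected=${EXPECTED}"
rm -f "${TOY_FASTA}"
```

With these toy values the script reports 52 positions and 0.052 expected matches. For the real data, replace the toy file with the file of upstream sequences and `MOTIF_WIDTH` with the number of columns of the AraC matrix, then compare the result with the number of matches actually returned at each threshold.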