--- title: "Using BLAST on the command line" author: "Jacques van Helden" date: 'Created: 8/10/2018; last update: `r Sys.Date()`' output: html_document: code_folding: hide self_contained: false fig_caption: yes highlight: zenburn theme: cerulean toc: yes toc_depth: 3 toc_float: yes md_document: variant: markdown_github pdf_document: fig_caption: yes highlight: zenburn toc: yes toc_depth: 2 word_document: toc: yes toc_depth: 2 #bibliography: ./geneset_functional_analysis.bib --- ```{r setup, include=FALSE, echo=FALSE, eval=TRUE} library(knitr) options(width = 300) knitr::opts_chunk$set( fig.width = 7, fig.height = 5, fig.align = "center", fig.path = "figures/blast-protein-families_", size = "tiny", # echo = FALSE, eval = TRUE, warning = FALSE, message = FALSE, results = TRUE, comment = "") # knitr::asis_output("\\footnotesize") # dir.slides <- "~/IFB/NNCR/using_IFB_NNCR/slides/" # setwd(dir.slides) ``` ******************************** # Goals of this tutorial In this tutorial, we will use the local sequence similarity search tool (BLAST) to searching sequence similarities for a protein of interest (an histone) in the proteome of a given organism (the yeast Saccharomyces cerevisiae). This approach is very simple, fast and widely used (BLAST is the most highly bioinformatics tool worldwide), but suffers from some limitations. In particular, the resulting list of protein sequences is sensitive to the particular protein chosse as query. In a next practical, we will rely on a more powerful method based on Hidden Markov Models (HMM) in order to build a profile for a protein family, and use this profile to scan the proteome and detect all the members of this family. Technically, this tutorial will also lead us to submit jobs to a scheduler, monitor the output and error logs, and organise the results for further use. We will also perform some exercise to evaluate the impact of some parameters on the results (in particular, the threshold on the e-value). **************************************************************** # Operations We will successively perform the following operations. 1. Select the full set of proteins (proteome) identified by translating the putatively coding regions of a given organism (e.g. the fruit fly *Drosophila melanogaster*), download its sequences to our computer, and upload these sequences to the High-Performance Computing server (the IFB core cluster). 2. Download a sequence of interest (e.g. Histone H3) on the HPC server. 3. Use the similarity search tool `blast` to find homologs of the protein of interest in the selected genome. **************************************************************** # Prerequisite This practical assumes that you are connected to a cluster of the IFB NNCR. This enables you to benefit from a software environment with pre-installed software tools for the different fields of bioinformatics. You should have recieved a login before starting the course. If this is not the case, see the information on account request on the IFB Web site (). Before starting this tutorial, you should have followed the following introductory sessions - Slides: [IFB High Performance Computing](https://ifb-elixirfr.gitlab.io/cluster/trainings/slurm/ebai2019.html) - [IFB cluster doc](https://ifb-elixirfr.gitlab.io/cluster/doc/) - [Quick start](https://ifb-elixirfr.gitlab.io/cluster/doc/quick-start/) - [Slurm user guide](https://ifb-elixirfr.gitlab.io/cluster/doc/slurm_user_guide/) # Warning about command-line use of the cluster Please remember: **the sequence analysis tasks below should by no way run on the login node of the computer**. Each job has to be sent to the `slurm` job scheduler using `srun` (submit a single command) or `sbacth` (submit a bash script), as explained in the tutorials of the prerequisite section. ******************************** # Bioinformatics resources used for this course | Resource | Description | URL | |-------|----------------------------------|-------------------| | **Ensembl** | Genome database | [www.ensembl.org](http://www.ensembl.org/) | | **UniprotKB** | Knowledge base of protein sequence and functional information | [www.uniprot.org](https://www.uniprot.org/) | | **PFAM** | Database of protein families | [pfam.xfam.org](https://pfam.xfam.org/) | ******************************** # Collecting the sequences ## Loading the required modules on the IFB core cluster The IFB core cluster comes with a modular software environment, which enables users to deploy the specific tools required for their analyses. In this section, we will open a connection to the IFB cluster and load the modules required for the practicals below. Task 1: use the command `module avail`to explore the lists of available modules and identify the two modules used for this course. - **blast**: Basic Local Alignment Search Tool - **emboss**: a software suite implemented by the European bioinformatics network EMBNet The solution can be seen by unfolding the code below. ```{bash module_avail, eval=FALSE} ## Open a connection to the IFB core cluster ssh @core.cluster.france-bioinformatique.fr ## List the modules of your environment module list ## Hmm ... no module is currently loaded ## Get the list of available modules ## Use the option -t to get one module per row module avail -t ## Identify the blast module module avail -t | grep blast ## Identify the emboss module module avail -t | grep emboss ``` Task 2: Load the required modules. ```{bash module_load, eval=FALSE} ## Load required modules module load blast module load emboss ## List loaded modules module list ``` ## Creating a workspace for this tutorial In your home directory on the IFB core cluster, create a folder named `cmb-comp-bio_ptracticals`, and a sub-folder named `proteome-blast_results`. ```{bash mkdir, eval=FALSE} ## Define the result directory for this practical and export it in an environment variable. export RESULT_DIR=~/cmb-comp-bio_ptracticals/proteome-blast_results ## Note: the -p command enables to create full path at once mkdir -p ${RESULT_DIR} cd ${RESULT_DIR} ``` ## Collecting the proteome of interest from Uniprot We will use Uniprot Web interface to select a protein of interest, that will serve as query to search for homologous proteins in the selected proteome. For this tutorial, we will use the fruit fly *Drosophila melanogaster* as organism of interest - Open a connection to [uniprot.org](http://uniprot.org) - Use the advanced serarches to formulate structured queries - Select the Organism. Start typing *Drosophila melanogaster* and wait for a few seconds. Among the possible completions displayed, select the following one. *Drosophila melanogaster (Fruit fly) [7227]* Note: the number [7227](https://www.ncbi.nlm.nih.gov/taxonomy/?term=7227) is the identifier of this species in the NCBI taxonomic database. - Click **Search**. This will select all the proteins of the selected organism. The result table only displays a summary information for 25 entries. - On the top of the result page, click **Download** and download all proteins in fasta format on your computer. Note: by default the sequence file is compressed with the `gzip` algorithm, you should thus obtain a binary file with the extension `.fasta.gz`. - Locate the downloaded file on your computer, and rename if with a simpler name, for example `Drosophila_melanogaster_proteome.fasta.gz`^[The file downloaded from Uniprot has a sophisticated name with parentheses and brackets, which will be hard to handle in unix commands: `uniprot-organism__Drosophila+melanogaster+(Fruit+fly)+[7227]_.fasta.gz`]. - Use the terminal to check the begining of this file ## Transferring a sequence file to your account on IFB core cluster ### Practical 1. Open a new terminal on your computer, and use the Unix command `scp` to copy the proteome sequence file in the result directory of the core cluster. Note that an alternative is to use a user-friendly program (e.g. Filezilla) to transfer this file. For more information read the section on [data transfer](https://ifb-elixirfr.gitlab.io/cluster/doc/data/) in the IFB cluster documentation. ```{bash scp, eval=FALSE} ## Go to your download folder. Note: this can vary depending on your operating system cd ~/Downloads ## Check that the fasta file is there ls -l Drosophila_melanogaster_proteome.fasta.gz ## Check the content of the file (use zless to directly see the content of the compressed file) zless Drosophila_melanogaster_proteome.fasta.gz ## REPLACE by your own username scp Drosophila_melanogaster_proteome.fasta.gz \ @core.cluster.france-bioinformatique.fr:cmb-comp-bio_ptracticals/proteome-blast_results/ ``` 2. Come back to your `ssh` session on the IFB core cluster and check that the seque ce file is in the right directory, and contains the compressed proteome sequence file (extension `.fasta.gz`). ```{bash check_fetch, eval=FALSE} cd $RESULT_DIR ## Make sure you are in the result directory ls -ltr ## List the files sorted by date (most recent last) ``` 3. Measure the size (in megabase) of the compressed sequence file (command: `du`), then uncompress it (command: `gunzip`) and measure the size of the uncompressed file. ```{bash gunzip_proteome, eval=FALSE} ## Measure the size of the compressed proteome file du -sm Drosophila_melanogaster_proteome.fasta.gz ## Uncompress the proteome file gunzip Drosophila_melanogaster_proteome.fasta.gz ## Measure the size of the uncompressed proteome file du -sm Drosophila_melanogaster_proteome.fasta ``` ## Getting a query sequence - Click on the Advanced search again. Your previous query is still displayed. Click the big **+** symbol to add a criterion, select **Protein name** and type *Histone H3*. - Click on the link to the protine named *Histone H3* (with ID [P02299](https://www.uniprot.org/uniprot/P02299)). You will see the full record of annotations for this protein. - Select **Format** $\rightarrow$ **Fasta**. This displays the sequence of Histone 3 in fasta format. We will use this sequence as query for BLAST. - Copy the link of this sequence () in a file. - Come back to the IFB core cluster terminal, make sure you are in the results directory, and use the Unix command `wget` to download the sequence. ```{bash wget, eval=FALSE} cd $RESULT_DIR wget --no-clobber https://www.uniprot.org/uniprot/P02299.fasta ``` Note: we use the option --no-clobber. To know why, read the wget manual after typing the command `man wget`. In Unix man pages, you can find a given word (e.g. clobber) by typing `/` followed by the search string. **************************************************************** # Searching sequence similarities with BLAST ## Formatting the blast database for the reference genome Before being able to run sequence similarity searches, BLAST needs to read the sequence database (in our case, the proteome) in order to build an index of the k-mers (oligopeptides). This is done with the command `makeblastdb`. ### Practical 1. Display the `makeblastdb` options (`-h`) manual (`-help`). Well, the the manual is quite heavy, so we give you the equired options: `-in` and `-dbtype`. Read the usage of these options in the manual, built the command and run it. 2. Build a command (**DO NOT RUN IT YET**) to index the proteome of interest. 3. Send this command to the job scheduler via `srun`. 4. Check the result files. ```{bash help_makeblastdb, eval=FALSE, class.source = 'fold-hide'} ## Make sure the blast module is loaded module load blast ## Check the path of the program makeblastdb (it should be in your conda folder) which makeblastdb ## Get the usage of makeblastdb (formal specification of the command syntax) makeblastdb -h ## Get the help for makeblastdb (explanation of the options) makeblastdb -help ## Build a BLAST database with all the protein sequences of D.melanogaster srun makeblastdb -dbtype prot -in Drosophila_melanogaster_proteome.fasta ## Check the new files that were created with this commands. ## For this we list the files in reverse (-r) temporal (-t) order. ls -1tr ``` The program `makeblastdb` created three files besides the input fasta file. ``` Drosophila_melanogaster_proteome.fasta.pin Drosophila_melanogaster_proteome.fasta.phr Drosophila_melanogaster_proteome.fasta.psq ``` These files contain an index of all the oligopeptites found in all the sequences of the fasta file. This "k-mer" index will be used by BLAST to rapidly find all the sequences in the database that match a query sequence, and use these short correspondences to start an alignment. ## Read *blastp* manual Since we want to search a protein database with protein query sequences, we will use the `blastp` command. ```{bash blast_manual, eval=FALSE} ## Command syntax blastp -h ## Help blastp -help ``` OK, the first contact is a bit frightening, because blastp has many options. This is because it allows you to run queries in different modes, with different parameters, and to export the results in different formats. In this tutorial we will show you the most usual ways to use the tool, and you will then be able to refine your search by combing back to the manual and understanding the use of additional options. ## Similarity search for one sequence against a database 1. Build a command (**DO NOT RUN IT YET**) that searches similarities for one or several query sequences provided in a fasta file (e.g. `P02299.fasta`) in the proteome of interest (`Drosophila_melanogaster_proteome.fasta`). By default, `blastp` prints the result on the screen, but we prefer to redirect it to a file (e.g. `P02299_blastp_result.txt`) in order to keep a trace of the result and inspect it later. 2. Send this command to the job scheduler with `srun`. ```{bash blastp_run, eval=FALSE} ## Search similarities for histone 3 in D.melanogaster proteome srun blastp \ -db Drosophila_melanogaster_proteome.fasta \ -query P02299.fasta \ > P02299_blastp_result.txt ## Check the result less P02299_blastp_result.txt ``` ### Questions 1. How many hits were found in total? 2. How many of these have an E-value higher than 1? 3. What was the threshold on the E-value for this blastp search? 4. With this threshold, how many hits would we expeect if we use a random sequence as query? 5. How many alignments would you consider significant in this result? 6. How many of these would you qualify of homologs? 7. For each of these putative homologs, is it a paralog, ortholog or "other log"? 8. Do you see a relationship between the E-value and some properties of the alignments? ## Setting a conservative cutoff on the e-value In `blast` manual, find the option that enables you to set a threshold (cutoff) on the e-value. Redo the blast search with a threshold of 1e-5 on the e-value. ```{bash blastp_evalue, eval=FALSE} ## Search similarities for histone 3 in D.melanogaster proteome srun blastp \ -db Drosophila_melanogaster_proteome.fasta \ -query P02299.fasta -evalue 1e-5 \ > P02299_blastp_result_eval1e-5.txt ## Check the result less P02299_blastp_result_eval1e-5.txt ``` ## Getting a tabular synthesis of blast results By default, BLAST displays the detailed results of a similaty search, starting with a summary of the matches, followed by all the alignemnts. It can be convenient to generate a synthetic result indicating only the relevant statistics for each alignment. This can be done by tuning BLAST options. 1. Read blastp manual and find a way to tune the formatting options in order to get a tabular output with comments. 2. Refine the query by selecting the following fields in the tabilar option: - ID of the "subject" sequence (the one found in the proteome of interest) - length of the query sequence - length of the subject sequence - raw score of the alignment - e-value ### Solution ```{bash blastp_tabular, eval=FALSE} ## Search similarities for histone 3 in D.melanogaster proteome. ## Return the result as a synthetic table with the matching statistics for each hit. srun blastp \ -db Drosophila_melanogaster_proteome.fasta \ -query P02299.fasta \ -outfmt 7 \ > P02299_blastp_summary.tsv ## Check the result more P02299_blastp_summary.tsv ## Return table with selected fields srun blastp \ -db Drosophila_melanogaster_proteome.fasta \ -query P02299.fasta \ -outfmt "6 sseqid qlen slen length score evalue" \ > P02299_blastp_summary.tsv ## Check the result more P02299_blastp_summary.tsv ``` # Negative control: blast against a shuffled sequence In this section, we will perform a quick empirical test, by submitting a random sequence to blastp. For this, we will shuffle the aminoacids of the original query sequence. ## Shuffling sequences with EMBOSS shuffleseq 1. Load the emboss module 2. Read the manual of the `shuffleseq` tool. 3. Send a command to the job scheduler to generate a shuffled version of the query sequence. ```{bash eval=FALSE} ## Activate the module with the emboss software suite module load emboss ## Check that shuffleseq is available which shuffleseq ## Get the suffle manual shuffleseq -h ## Shuffle the apartokinase 3 sequence srun shuffleseq -sequence P02299.fasta \ -outseq shuffled_P02299.fasta ## Check the result cat shuffled_P02299.fasta ## Compare the shuffled seq with the original one cat P02299.fasta ``` Here is an example of result (I just simplified the header of the shuffle output fasta sequence for the sake of readability). Of course, your result should differ since each shuffling produces a randomized sequence. ``` >shuffled_H3_DROME LMKPEQMLIAKKKDTKRVLASVRGIHRIARKVARTERLQKVLFRKTIAFALAGDICGQPT ESAPRTADLAQAVIFRHGTRPTRYTAPLSMQGKSYNLFGRQEDEQRASARASELTAIPKL KVYERGAKAKQTRRLR ``` ## Exercise 1. Shuffle the histone 3 sequence with the tool `shuffleseq` (found in the conda environment `emboss`). 2. Run a `blastp` search with this shuffled sequence against the selected proteome. ### Solution ```{bash eval=FALSE} ## Run blast with the shuffled lysC sequence against all D.melanogaster proteins srun blastp \ -db Drosophila_melanogaster_proteome.fasta \ -query shuffled_P02299.fasta \ > shuffled_P02299_blastp_result.txt ## Check the result less shuffled_P02299_blastp_result.txt ## Re-shuffle the apartokinase 3 sequence srun shuffleseq -sequence P02299.fasta \ -outseq shuffled_P02299.fasta ## Run the same query with a tabular output blastp -outfmt 7 \ -db Drosophila_melanogaster_proteome.fasta \ -query shuffled_P02299.fasta \ > shuffled_P02299_blastp_summary.tsv ## count the number of false positives wc -l shuffled_P02299_blastp_summary.tsv ## Check the result less shuffled_P02299_blastp_summary.tsv ``` ### Questions 1. How many alignments would you expect by chance with the default blastp parameters? 2. How many alignemnts did you get with the shuffled sequence? 3. How many of these had an E-value smaller than 1? 4. How many of these had an E-value smaller than 0.01? 5. How many of these would you qualify of homologs? ************************************************************ # Homework **************************************************************** # Supplementary information (not part of the practical) ## Installing the conda environment in your own computer **Attention**, this is a complement to the practical, which should not be run if you are working on the IFB core cluster. This section explains you how to install on your environment all the tools required for this tutorial. ```{bash conda_env_create, eval=FALSE} ## ATTENTION: DO NOT RUN THIS COMMAND IF YOU ARE ON THE NNCR CLUSTER, ## because this would occupy a big space to reinstall BLAST in your ## own account, although it is already available for all users. ## Check that conda is available on your computer which conda ## Define an environment variable with the conda bin directory export CONDA=`which conda` export CONDA_BIN=`dirname $CONDA` ## Create a conda environment named blast/2.7.1 and install the corresponding blast version conda create -n blast blast==2.7.1 ## Create a conda environment named emboss/6.6.0 and install the corresponding emboss version conda create -n emboss emboss==6.6.0 ## List the conda environments available on the system $CONDA_BIN/conda env list ## FInd the blast environment $CONDA_BIN/conda env list | grep -i blast ## Activate the blast environment. conda activate blast ## Check the location of the blastp command which blastp ## Deactivate the current environment. conda deactivate ## Note the change in the shell prompt ## Re-activate the blast environment and activate the emboss environment conda activate blast conda activate emboss ## Note the change in the shell prompt ## Find the path of the shuffleseq command (part of the EMBOSS package) which shuffleseq ``` **************************************************************** # References