---
title: "Using BLAST on the command line"
author: "Jacques van Helden"
date: 'Created: 8/10/2018; last update: `r Sys.Date()`'
output:
  html_document:
    code_folding: hide
    self_contained: false
    fig_caption: yes
    highlight: zenburn
    theme: cerulean
    toc: yes
    toc_depth: 3
    toc_float: yes
  md_document:
    variant: markdown_github
  pdf_document:
    fig_caption: yes
    highlight: zenburn
    toc: yes
    toc_depth: 2
  word_document:
    toc: yes
    toc_depth: 2
#bibliography: ./geneset_functional_analysis.bib
---


```{r setup, include=FALSE, echo=FALSE, eval=TRUE}
library(knitr)

options(width = 300)
knitr::opts_chunk$set(
  fig.width = 7, 
  fig.height = 5, 
  fig.align = "center", 
  fig.path = "figures/blast-protein-families_",
  size = "tiny", 
  # echo = FALSE, 
  eval = TRUE, 
  warning = FALSE, 
  message = FALSE, 
  results = TRUE, comment = "")
# knitr::asis_output("\\footnotesize")

# dir.slides <- "~/IFB/NNCR/using_IFB_NNCR/slides/"
# setwd(dir.slides)

```

********************************
# Goals of this tutorial

In this tutorial, we will use the local sequence similarity search tool (BLAST) to  searching sequence similarities for a protein of interest (an histone) in the proteome of a given organism (the yeast Saccharomyces cerevisiae).

This approach is very simple, fast and widely used (BLAST is the most highly bioinformatics tool worldwide), but suffers from some limitations. In particular, the resulting list of protein sequences is sensitive to the particular protein chosse as query. 

In a next practical, we will rely on a more powerful method based on Hidden Markov Models (HMM) in order to build a profile for a protein family, and use this profile to scan the proteome and detect all the members of this family.

Technically, this tutorial will also lead us to submit jobs to a scheduler, monitor the output and error logs, and organise the results for further use. 

We will also perform some exercise to evaluate the impact of some parameters on the results (in particular, the threshold on the e-value).


****************************************************************
# Operations

We will successively perform the following operations.

1. Select the full set of proteins (proteome) identified by translating the putatively coding regions of a given organism (e.g. the fruit fly *Drosophila melanogaster*), download its sequences to our computer, and upload these sequences to the High-Performance Computing server (the IFB core cluster). 

2. Download a sequence of interest (e.g. Histone H3) on the HPC server. 

3. Use the similarity search tool `blast` to find homologs of the protein of interest in the selected genome. 


****************************************************************
# Prerequisite

This practical assumes that you are connected to a cluster of the IFB NNCR. This enables you to benefit from a software environment with pre-installed software tools for the different fields of bioinformatics.

You should have recieved a login before starting the course. If this is not the case, see the information on account request on the IFB Web site
(<https://www.france-bioinformatique.fr/en/cluster>).

Before starting this tutorial, you should have followed the following introductory sessions

- Slides: [IFB High Performance Computing](https://ifb-elixirfr.gitlab.io/cluster/trainings/slurm/ebai2019.html)

- [IFB cluster doc](https://ifb-elixirfr.gitlab.io/cluster/doc/)
- [Quick start](https://ifb-elixirfr.gitlab.io/cluster/doc/quick-start/) 
- [Slurm user guide](https://ifb-elixirfr.gitlab.io/cluster/doc/slurm_user_guide/)


# Warning about command-line use of the cluster

Please remember: **the sequence analysis tasks below should by no way run on the login node of the computer**. Each job has to be sent to the `slurm` job scheduler using `srun` (submit a single command) or `sbacth` (submit a bash script), as explained in the tutorials of the prerequisite section. 


********************************
# Bioinformatics resources used for this course

| Resource | Description | URL | 
|-------|----------------------------------|-------------------|
| **Ensembl** | Genome database | [www.ensembl.org](http://www.ensembl.org/) |
| **UniprotKB** | Knowledge base of protein sequence and functional information | [www.uniprot.org](https://www.uniprot.org/) | 
| **PFAM** | Database of protein families | [pfam.xfam.org](https://pfam.xfam.org/) |

********************************
# Collecting the sequences


## Loading the required modules on the IFB core cluster

The IFB core cluster comes with a modular software environment, which enables users to deploy the specific tools required for their analyses. 

In this section, we will open a connection to the IFB cluster and load the modules required for the practicals below. 

Task 1: use the command `module avail`to explore the lists of available modules and identify the two modules used for this course.

- **blast**: Basic Local Alignment Search Tool
- **emboss**: a software suite implemented by the European bioinformatics network EMBNet

The solution can be seen by unfolding the code below.

```{bash module_avail, eval=FALSE}
## Open a connection to the IFB core cluster
ssh <username>@core.cluster.france-bioinformatique.fr

## List the modules of your environment
module list
## Hmm ... no module is currently loaded

## Get the list of available modules
## Use the option -t to get one module per row
module avail -t

## Identify the blast module
module avail -t | grep blast

## Identify the emboss module
module avail -t | grep emboss

```

Task 2: Load the required modules.

```{bash module_load, eval=FALSE}
## Load required modules
module load blast
module load emboss

## List loaded modules
module list

```


## Creating a workspace for this tutorial

In your home directory on the IFB core cluster, create a folder named `cmb-comp-bio_ptracticals`, and a sub-folder named `proteome-blast_results`. 

```{bash mkdir, eval=FALSE}
## Define the result directory for this practical and export it in an environment variable.
export RESULT_DIR=~/cmb-comp-bio_ptracticals/proteome-blast_results

## Note: the -p command enables to create full path at once
mkdir -p ${RESULT_DIR}
cd ${RESULT_DIR}

```


## Collecting the proteome of interest from Uniprot

We will use Uniprot Web interface to select a protein of interest, that will serve as query to search for homologous proteins in the selected proteome. 

For this tutorial, we will use the fruit fly *Drosophila melanogaster* as organism of interest

- Open a connection to [uniprot.org](http://uniprot.org)

- Use the advanced serarches to formulate structured queries

- Select the Organism. Start typing *Drosophila melanogaster* and wait for a few seconds. Among the possible completions displayed, select the following one.

    *Drosophila melanogaster (Fruit fly) [7227]*
    
    Note: the number [7227](https://www.ncbi.nlm.nih.gov/taxonomy/?term=7227) is the identifier of this species in the NCBI taxonomic database.

- Click **Search**. This will select all the proteins of the selected organism. The result table only displays a summary information for 25 entries. 

- On the top of the result page, click **Download** and download all proteins in fasta format on your computer. 

    Note: by default the sequence file is compressed with  the `gzip` algorithm, you should thus obtain a binary file with the extension `.fasta.gz`.

- Locate the downloaded file on your computer, and rename if with a simpler name, for example `Drosophila_melanogaster_proteome.fasta.gz`^[The file downloaded from Uniprot has a sophisticated name with parentheses and brackets, which will be hard to handle in unix commands: `uniprot-organism__Drosophila+melanogaster+(Fruit+fly)+[7227]_.fasta.gz`].

- Use the terminal to check the begining of this file

## Transferring a sequence file to your account on IFB core cluster

### Practical

1. Open a new terminal on your computer, and use the Unix command `scp` to copy the proteome sequence file in the result directory of the core cluster.

Note that an alternative is to use a user-friendly program (e.g. Filezilla) to transfer this file. For more information read the section on [data transfer](https://ifb-elixirfr.gitlab.io/cluster/doc/data/) in the IFB cluster documentation. 


```{bash scp, eval=FALSE}
## Go to your download folder. Note: this can vary depending on your operating system
cd ~/Downloads

## Check that the fasta file is there
ls -l Drosophila_melanogaster_proteome.fasta.gz

## Check the content of the file (use zless to directly see the content of the compressed file)
zless Drosophila_melanogaster_proteome.fasta.gz

## REPLACE <username> by your own username
scp Drosophila_melanogaster_proteome.fasta.gz \
  <username>@core.cluster.france-bioinformatique.fr:cmb-comp-bio_ptracticals/proteome-blast_results/

```

2. Come back to your `ssh` session on the IFB core cluster and check that the seque ce file is in the right directory, and contains the compressed proteome sequence file (extension `.fasta.gz`).  

```{bash check_fetch, eval=FALSE}
cd $RESULT_DIR ## Make sure you are in the result directory
ls -ltr   ## List the files sorted by date (most recent last)
```

3. Measure the size (in megabase) of the compressed sequence file (command: `du`), then uncompress it (command: `gunzip`) and measure the size of the uncompressed file. 


```{bash gunzip_proteome, eval=FALSE}
## Measure the size of the compressed proteome file
du -sm Drosophila_melanogaster_proteome.fasta.gz

## Uncompress the proteome file
gunzip Drosophila_melanogaster_proteome.fasta.gz

## Measure the size of the uncompressed proteome file
du -sm Drosophila_melanogaster_proteome.fasta
```


## Getting a query sequence

- Click on the Advanced search again. Your previous query is still displayed. Click the big **+** symbol to add a criterion, select **Protein name** and type *Histone H3*. 

- Click on the link to the protine named *Histone H3* (with ID [P02299](https://www.uniprot.org/uniprot/P02299)).  You will see the full record of annotations for this protein. 

- Select **Format** $\rightarrow$ **Fasta**. This displays the sequence of Histone 3 in fasta format. We will use this sequence as query for BLAST. 

- Copy the link of this sequence (<https://www.uniprot.org/uniprot/P02299.fasta>) in a file.

- Come back to the IFB core cluster terminal, make sure you are in the results directory, and use the Unix command `wget` to download the sequence. 

```{bash wget, eval=FALSE}
cd $RESULT_DIR

wget --no-clobber https://www.uniprot.org/uniprot/P02299.fasta

```

Note: we use the option --no-clobber. To know why, read the wget manual after typing the command `man wget`. In Unix man pages, you can find a given word (e.g. clobber) by typing `/` followed by the search string.


****************************************************************

# Searching sequence similarities with BLAST


## Formatting the blast database for the reference genome

Before being able to run sequence similarity searches, BLAST needs to read the sequence database (in our case, the proteome) in order to build an index of the k-mers (oligopeptides). This is done with the command `makeblastdb`. 

### Practical

1. Display the `makeblastdb` options (`-h`) manual (`-help`). Well, the the manual is quite heavy, so we give you the equired options: `-in` and `-dbtype`. Read the usage of these options in the manual, built the command and run it. 

2. Build a command  (**DO NOT RUN IT YET**) to index the proteome of interest. 

3. Send this command to the job scheduler via `srun`.

4. Check the result files. 


```{bash help_makeblastdb, eval=FALSE, class.source = 'fold-hide'}
## Make sure the blast module is loaded
module load blast

## Check the path of the program makeblastdb (it should be in your conda folder)
which makeblastdb

## Get the usage of makeblastdb (formal specification of the command syntax)
makeblastdb -h

## Get the help for makeblastdb (explanation of the options)
makeblastdb -help

## Build a BLAST database with all the protein sequences of D.melanogaster
srun makeblastdb -dbtype prot -in Drosophila_melanogaster_proteome.fasta

## Check the new files that were created with this commands. 
## For this we list the files in reverse (-r) temporal (-t) order. 
ls -1tr

```


The program `makeblastdb` created three files besides the input fasta file. 

```
Drosophila_melanogaster_proteome.fasta.pin
Drosophila_melanogaster_proteome.fasta.phr
Drosophila_melanogaster_proteome.fasta.psq
```

These files contain an index of all the oligopeptites found in all the  sequences of the fasta file. This "k-mer" index will be used by BLAST to rapidly find all the sequences in the database that match a query sequence, and use these short correspondences to start an alignment. 


## Read *blastp* manual

Since we want to search a protein database with protein query sequences, we will use the `blastp` command. 

```{bash blast_manual, eval=FALSE}
## Command  syntax
blastp -h

## Help
blastp -help

```

OK, the first contact is a bit frightening, because blastp has many options. This is because it allows you to run queries in different modes, with different parameters, and to export the results in different formats. 

In this tutorial we will show you the most usual ways to use the tool, and you will then be able to refine your search by combing back to the manual and understanding the use of  additional options. 


## Similarity search for one sequence against a database

1. Build a command (**DO NOT RUN IT YET**) that searches similarities for one or several query sequences provided in a fasta file (e.g. `P02299.fasta`) in the proteome of interest (`Drosophila_melanogaster_proteome.fasta`). By default, `blastp` prints the result on the screen, but we prefer to redirect it to a file (e.g. `P02299_blastp_result.txt`) in order to keep a trace of the result and inspect it later.

2. Send this command to the job scheduler with `srun`. 

```{bash blastp_run, eval=FALSE}
## Search similarities for histone 3 in D.melanogaster proteome
srun blastp \
  -db Drosophila_melanogaster_proteome.fasta  \
  -query P02299.fasta \
  > P02299_blastp_result.txt
  
## Check the result
less P02299_blastp_result.txt
```

### Questions

1. How many hits were found in total?
2. How many of these have an E-value higher than 1?
3. What was the threshold on the E-value for this blastp search?
4. With this threshold, how many hits would we expeect if we use a random sequence as query?
5. How many alignments would you consider significant in this result?
6. How many of these would you qualify of homologs?
7. For each of these putative homologs, is it a paralog, ortholog or "other log"?
8. Do you see a relationship between the E-value and some properties of the alignments?


## Setting a conservative cutoff on the e-value

In  `blast` manual, find the option that enables you to set  a threshold (cutoff) on the e-value. Redo the blast  search with a threshold of 1e-5 on the e-value. 

```{bash blastp_evalue, eval=FALSE}
## Search similarities for histone 3 in D.melanogaster proteome
srun blastp \
  -db Drosophila_melanogaster_proteome.fasta  \
  -query P02299.fasta -evalue 1e-5 \
  > P02299_blastp_result_eval1e-5.txt
  
## Check the result
less P02299_blastp_result_eval1e-5.txt
```


## Getting a tabular synthesis of blast results

By default, BLAST displays the detailed results of a similaty search, starting with a summary of the matches, followed by all the alignemnts. It can be convenient to generate a synthetic result indicating only the relevant statistics for each alignment. This can be done by tuning  BLAST options. 

1. Read blastp manual and find a way to tune the formatting options in order to get a tabular output with comments. 

2. Refine the query by selecting the following fields in the tabilar option: 

    - ID of the "subject" sequence (the one found in the proteome of interest)
    - length of the query sequence
    - length of the subject sequence
    - raw score of the alignment
    - e-value

### Solution

```{bash blastp_tabular, eval=FALSE}
## Search similarities for histone 3 in D.melanogaster proteome.
## Return the result as a synthetic table with the matching statistics for each hit. 
srun blastp  \
  -db Drosophila_melanogaster_proteome.fasta  \
  -query P02299.fasta \
  -outfmt 7 \
  > P02299_blastp_summary.tsv

## Check the result
more P02299_blastp_summary.tsv

## Return table with selected fields
srun blastp \
  -db Drosophila_melanogaster_proteome.fasta  \
  -query P02299.fasta \
  -outfmt "6 sseqid qlen slen length score  evalue" \
  > P02299_blastp_summary.tsv

## Check the result
more P02299_blastp_summary.tsv
```


# Negative control: blast against a shuffled sequence

In this section, we will perform a quick empirical test, by submitting a random sequence to blastp. For this, we will shuffle the aminoacids of the original query sequence. 

## Shuffling sequences with EMBOSS shuffleseq

1. Load the emboss module
2. Read the manual of the `shuffleseq` tool.
3. Send a command to the job scheduler to generate a shuffled version of the query sequence. 

```{bash eval=FALSE}
## Activate the module with the emboss software suite
module load emboss

## Check that shuffleseq is available
which shuffleseq

## Get the suffle manual 
shuffleseq -h

## Shuffle the apartokinase 3 sequence
srun shuffleseq -sequence P02299.fasta \
   -outseq shuffled_P02299.fasta

## Check the result
cat shuffled_P02299.fasta

## Compare the shuffled seq with the original one
cat P02299.fasta
```

Here is an example of result (I just simplified the header of the shuffle output fasta sequence for the sake of readability). 

Of course, your result should differ since each shuffling produces a randomized sequence. 

```
>shuffled_H3_DROME
LMKPEQMLIAKKKDTKRVLASVRGIHRIARKVARTERLQKVLFRKTIAFALAGDICGQPT
ESAPRTADLAQAVIFRHGTRPTRYTAPLSMQGKSYNLFGRQEDEQRASARASELTAIPKL
KVYERGAKAKQTRRLR
```


## Exercise

1. Shuffle the histone 3 sequence with the tool `shuffleseq` (found in the conda environment `emboss`). 

2. Run a `blastp` search with this shuffled sequence against the selected proteome. 

### Solution

```{bash eval=FALSE}

## Run blast with the shuffled lysC sequence against all D.melanogaster proteins
srun blastp \
  -db Drosophila_melanogaster_proteome.fasta  \
  -query shuffled_P02299.fasta \
  > shuffled_P02299_blastp_result.txt

## Check the result
less shuffled_P02299_blastp_result.txt

## Re-shuffle the apartokinase 3 sequence
srun shuffleseq -sequence P02299.fasta \
   -outseq shuffled_P02299.fasta

## Run the same query with a tabular output
blastp -outfmt 7 \
  -db Drosophila_melanogaster_proteome.fasta  \
  -query shuffled_P02299.fasta \
  > shuffled_P02299_blastp_summary.tsv

## count the number of false positives
wc -l shuffled_P02299_blastp_summary.tsv

## Check the result
less shuffled_P02299_blastp_summary.tsv

```


### Questions

1. How many alignments would you expect by chance with the default blastp parameters?
2. How many alignemnts did you get with the shuffled sequence?
3. How many of these had an E-value smaller than 1?
4. How many of these had an E-value smaller than 0.01?
5. How many of these would you qualify of homologs?


************************************************************
# Homework


****************************************************************
# Supplementary information (not part of the practical)

## Installing the conda environment in your own computer

**Attention**, this is a complement to the practical, which should not be run if you are working on the IFB core cluster. This section explains you how to install on your environment all the tools required for this tutorial. 


```{bash conda_env_create, eval=FALSE}
## ATTENTION: DO NOT RUN THIS COMMAND IF YOU ARE ON THE NNCR CLUSTER,
## because this would occupy a big space to reinstall BLAST in your 
## own account, although it is already available for all users. 

## Check that conda is available on your computer
which conda

## Define an environment variable with the conda bin directory
export CONDA=`which conda`
export CONDA_BIN=`dirname $CONDA`

## Create a conda environment named blast/2.7.1 and install the corresponding blast version
conda create -n blast blast==2.7.1

## Create a conda environment named emboss/6.6.0 and install the corresponding emboss version
conda create -n emboss emboss==6.6.0

## List the conda environments available on the system
$CONDA_BIN/conda env list

## FInd the blast environment
$CONDA_BIN/conda env list | grep -i blast

## Activate the blast environment. 
conda activate blast

## Check the location of the blastp command
which blastp

## Deactivate the current environment. 
conda deactivate
## Note the change in the shell prompt

## Re-activate the blast environment and activate the emboss environment
conda activate blast
conda activate emboss
## Note the change in the shell prompt

## Find the path of the shuffleseq command (part of the EMBOSS package)
which shuffleseq

```


****************************************************************
# References