---
title: "HW4: Pairwise Alignment Solutions"
author: "Author: Your Name"
date: "Last update: 2 May, 2023"
output:
html_document:
toc: true
toc_float:
collapsed: true
smooth_scroll: true
toc_depth: 3
fig_caption: yes
code_folding: show
number_sections: false
fontsize: 14pt
bibliography: bibtex.bib
type: docs
weight: 304
---
Source code downloads:
[ [.Rmd](https://raw.githubusercontent.com/tgirke/GEN242/main/content/en/assignments/Homework/HW04/hw4_solution/HW4_key.Rmd) ]
[ [pairwiseAlign_Fct.R](https://raw.githubusercontent.com/tgirke/GEN242/main/content/en/assignments/Homework/HW04/hw4_solution/pairwiseAlign_Fct.R) ]
# Rendering Instructions
To render this R Markdown document, one needs to download the following files to the same directory.
+ [HW4_key.Rmd](https://raw.githubusercontent.com/tgirke/GEN242/main/content/en/assignments/Homework/HW04/hw4_solution/HW4_key.Rmd): Rmd source file for this document
+ [pairwiseAlign_Fct.R](https://raw.githubusercontent.com/tgirke/GEN242/main/content/en/assignments/Homework/HW04/hw4_solution/pairwiseAlign_Fct.R): R script defining pairwise alignment functions
+ [bibtex.bib](https://raw.githubusercontent.com/tgirke/GEN242/main/content/en/assignments/Homework/HW04/hw4_solution/bibtex.bib): references cited in the text in BibTeX format
Next, one can render the report to HTML, PDF and other formats following the instructions below. Both the HTML and PDF versions are linked. Note, the rendering needs to be performed from the same directory where the downloaded `*.Rmd`, `*.R` and `*.bib` files are located.
+ [HTML](https://girke.bioinformatics.ucr.edu/GEN242/assignments/Homework/HW04/hw4_solution/HW4_key.html): this report in HTML format
+ [PDF](https://girke.bioinformatics.ucr.edu/GEN242/assignments/Homework/HW04/hw4_solution/HW4_key.pdf): corresponding PDF version
The HTML report can be rendered with `rmarkdown::render()` as follows.
```{r render_this_markdown, eval=FALSE, message=FALSE}
rmarkdown::render('HW4_key.Rmd') # From R
Rscript -e "rmarkdown::render('HW4_key.Rmd')" # From command-line
```
To render a PDF file instead of HTML, one can instruct the rendering function to do so like this: `rmarkdown::render('HW4_key.Rmd', c('pdf_document')`. To render to several formats with
a single command, one can concatenate the formatting values with `c('html_document', 'pdf_document')`.
# A. Choice of Sequence Type
Task 1: Which sequence type - amino acid or nucleotide - is more appropriate to search databases
for remotely related sequences? Provide at least three reasons for your decision.
Answer: When coding sequences are expected to have weak similarities then one
should use protein sequences rather than DNA sequences for database searching,
because of (1) their higher information content (20 versus 4 letter alphabet),
as well as (2) the better scoring and (3) functional classification systems available
for amino acids.
# B. Dynamic Programming for Pairwise Alignments
Task 2: Create manually (or write an R script for it) one global and one local
alignment for the following two protein sequences using the Needleman-Wusch and
Smith-Waterman algorithms, respectively [@Smith1981-ax; @Needleman1970-md].
```sh
O15528: PFGFGKRSCMGRRLA
P98187: FIPFSAGPRNCIGQK
```
## Source functions
All alignment functions used in the following sections are defined in the downloaded R script file
that is named `pairwiseAlign_Fct.R`. These functions are loaded with the `source()` command below.
```{r source_pairalign_fct, eval=TRUE, message=FALSE}
source("pairwiseAlign_Fct.R")
```
## Input sequences
Define within R or import them (here former).
```{r input_seq1, eval=TRUE}
S1 <- "PFGFGKRSCMGRRLA"
S2 <- "FIPFSAGPRNCIGQK"
```
Additional test sequences
```{r input_seq2, eval=FALSE}
# S1 <- "HEAGAWGHEE"
# S2 <- "PAWHEAE"
```
## Global alignment
The alignment type choice is passed on to all following functions.
```{r align_type, eval=TRUE, message=FALSE}
align_type <- "global"
# align_type <- "local"
```
### Dynamic programming matrices
```{r gen_dynProgMatrix, eval=TRUE, message=FALSE}
dynMA <- dynProgMatrix(S1, S2, align_method=align_type, gap_penalty=8, substitutionMA="BLOSUM50")
```
The matrices are stored in a list and returned below. The path is indicated by three numbers in the `glob_ma_path` matrix.
Their meaning is:
+ 1: diagonal
+ 2: vertical (up)
+ 3: horizontal (left)
```{r print_dynProgMatrix, eval=TRUE, message=FALSE}
dynMA
```
### Compute alignment
The following `alignList` stores all relevant results in a list, including dynamic
programming matrices, as well as the coordinates (named `path_coor`) to highlight path in
dynamic progamming matrix (see below).
```{r gen_alignmentTraceback, eval=TRUE, message=FALSE}
alignList <- alignmentTraceback(ma=dynMA[[1]], ma_path=dynMA[[2]], align_method=align_type)
names(alignList)
# alignList$ma # dyn ma with scores
# alignList$ma_path # dyn ma with path
# alignList$path_coor # coordinates for path to auto highlight path in HTML/PDF table
```
### Return results
#### Traceback in matrix
The following prints the fully populated dynamic programming matrix where the traceback
path is highlighted in color.
```{r gen_alignmentTraceback_Matrix, eval=TRUE, message=FALSE}
printColMa(alignList)
```
#### Alignment and score
```{r gen_alignment, eval=TRUE, message=FALSE}
printAlign(x=alignList)
```
## Local alignment
The alignment type choice is passed on to all following functions.
```{r align_type_loc, eval=TRUE, message=FALSE}
# align_type <- "global"
align_type <- "local"
```
### Dynamic programming matrices
```{r gen_dynProgMatrix_loc, eval=TRUE, message=FALSE}
dynMA <- dynProgMatrix(S1, S2, align_method=align_type, gap_penalty=8, substitutionMA="BLOSUM50")
```
The matrices are stored in a list and returned below.
```{r print_dynProgMatrix_loc, eval=TRUE, message=FALSE}
dynMA
```
### Compute alignment
Note: `alignList` stores all relevant results in a list, including dynamic
programming matrices, as well as the coordinates (named `path_coor`) to highlight the path in
the dynamic progamming matrix. This way one can easily generate a single dynamic programming matrix
with the traceback path highlighted by colors or arrows in an HTML or PDF document (see below).
```{r gen_alignmentTraceback_loc, eval=TRUE, message=FALSE}
alignList <- alignmentTraceback(ma=dynMA[[1]], ma_path=dynMA[[2]], align_method=align_type)
names(alignList)
# alignList$ma # dyn ma with scores
# alignList$ma_path # dyn ma with path
# alignList$path_coor # coordinates for path to auto highlight path in HTML/PDF table
```
### Return results
#### Traceback in matrix
The following prints the fully populated dynamic programming matrix where the traceback
path is highlighted in color.
```{r gen_alignmentTraceback_Matrix_loc, eval=TRUE, message=FALSE}
printColMa(alignList)
```
#### Alignment and score
```{r gen_alignment_loc, eval=TRUE, message=FALSE}
printAlign(x=alignList)
```
# C. Different Substitution Matrices
Task 1: Load the Biostrings package in R, import the following two cytochrome
P450 sequences O15528 and P98187 from NCBI (save as myseq.fasta), and create a
global alignment with the pairwiseAlignment function from Biostrings as
follows.
```{r alignment_scores_biostrings, eval=FALSE, message=FALSE}
library(Biostrings)
myseq <- readAAStringSet("myseq.fasta", "fasta")
(p <- pairwiseAlignment(myseq[[1]], myseq[[2]], type="global", substitutionMatrix="BLOSUM50"))
writePairwiseAlignments(p)
```
Your answers should address the following items:
Record the scores for the scoring matrices BLOSUM50, BLOSUM62 and BLOSUM80.
How and why do the scores differ for the three scoring matrices?
Answer 1: The scores for the three BLOSUM substitutions matrices are:
+ BLOSUM50: 227
+ BLOSUM62: 54
+ BLOSUM80: -52
Answer 2: Since the two sequences are relatively dissimilar (as determined by alignment view from
`writePairwiseAlignments(p)`) it is expected that the BLOSUM matrices trained on less similar
sequences (_e.g._ BLOSUM50) result in higher scores than those trained on more similar
sequences (_e.g._ BLOSUM80).
# Session Info
```{r sessionInfo}
sessionInfo()
```
# References