--- title: "HW4: Pairwise Alignment Solutions" author: "Author: Your Name" date: "Last update: 2 May, 2023" output: html_document: toc: true toc_float: collapsed: true smooth_scroll: true toc_depth: 3 fig_caption: yes code_folding: show number_sections: false fontsize: 14pt bibliography: bibtex.bib type: docs weight: 304 ---
Source code downloads:     [ [.Rmd](https://raw.githubusercontent.com/tgirke/GEN242/main/content/en/assignments/Homework/HW04/hw4_solution/HW4_key.Rmd) ]     [ [pairwiseAlign_Fct.R](https://raw.githubusercontent.com/tgirke/GEN242/main/content/en/assignments/Homework/HW04/hw4_solution/pairwiseAlign_Fct.R) ]
# Rendering Instructions To render this R Markdown document, one needs to download the following files to the same directory. + [HW4_key.Rmd](https://raw.githubusercontent.com/tgirke/GEN242/main/content/en/assignments/Homework/HW04/hw4_solution/HW4_key.Rmd): Rmd source file for this document + [pairwiseAlign_Fct.R](https://raw.githubusercontent.com/tgirke/GEN242/main/content/en/assignments/Homework/HW04/hw4_solution/pairwiseAlign_Fct.R): R script defining pairwise alignment functions + [bibtex.bib](https://raw.githubusercontent.com/tgirke/GEN242/main/content/en/assignments/Homework/HW04/hw4_solution/bibtex.bib): references cited in the text in BibTeX format Next, one can render the report to HTML, PDF and other formats following the instructions below. Both the HTML and PDF versions are linked. Note, the rendering needs to be performed from the same directory where the downloaded `*.Rmd`, `*.R` and `*.bib` files are located. + [HTML](https://girke.bioinformatics.ucr.edu/GEN242/assignments/Homework/HW04/hw4_solution/HW4_key.html): this report in HTML format + [PDF](https://girke.bioinformatics.ucr.edu/GEN242/assignments/Homework/HW04/hw4_solution/HW4_key.pdf): corresponding PDF version The HTML report can be rendered with `rmarkdown::render()` as follows. ```{r render_this_markdown, eval=FALSE, message=FALSE} rmarkdown::render('HW4_key.Rmd') # From R Rscript -e "rmarkdown::render('HW4_key.Rmd')" # From command-line ``` To render a PDF file instead of HTML, one can instruct the rendering function to do so like this: `rmarkdown::render('HW4_key.Rmd', c('pdf_document')`. To render to several formats with a single command, one can concatenate the formatting values with `c('html_document', 'pdf_document')`. # A. Choice of Sequence Type Task 1: Which sequence type - amino acid or nucleotide - is more appropriate to search databases for remotely related sequences? Provide at least three reasons for your decision. Answer: When coding sequences are expected to have weak similarities then one should use protein sequences rather than DNA sequences for database searching, because of (1) their higher information content (20 versus 4 letter alphabet), as well as (2) the better scoring and (3) functional classification systems available for amino acids. # B. Dynamic Programming for Pairwise Alignments Task 2: Create manually (or write an R script for it) one global and one local alignment for the following two protein sequences using the Needleman-Wusch and Smith-Waterman algorithms, respectively [@Smith1981-ax; @Needleman1970-md]. ```sh O15528: PFGFGKRSCMGRRLA P98187: FIPFSAGPRNCIGQK ``` ## Source functions All alignment functions used in the following sections are defined in the downloaded R script file that is named `pairwiseAlign_Fct.R`. These functions are loaded with the `source()` command below. ```{r source_pairalign_fct, eval=TRUE, message=FALSE} source("pairwiseAlign_Fct.R") ``` ## Input sequences Define within R or import them (here former). ```{r input_seq1, eval=TRUE} S1 <- "PFGFGKRSCMGRRLA" S2 <- "FIPFSAGPRNCIGQK" ``` Additional test sequences ```{r input_seq2, eval=FALSE} # S1 <- "HEAGAWGHEE" # S2 <- "PAWHEAE" ``` ## Global alignment The alignment type choice is passed on to all following functions. ```{r align_type, eval=TRUE, message=FALSE} align_type <- "global" # align_type <- "local" ``` ### Dynamic programming matrices ```{r gen_dynProgMatrix, eval=TRUE, message=FALSE} dynMA <- dynProgMatrix(S1, S2, align_method=align_type, gap_penalty=8, substitutionMA="BLOSUM50") ``` The matrices are stored in a list and returned below. The path is indicated by three numbers in the `glob_ma_path` matrix. Their meaning is: + 1: diagonal + 2: vertical (up) + 3: horizontal (left) ```{r print_dynProgMatrix, eval=TRUE, message=FALSE} dynMA ``` ### Compute alignment The following `alignList` stores all relevant results in a list, including dynamic programming matrices, as well as the coordinates (named `path_coor`) to highlight path in dynamic progamming matrix (see below). ```{r gen_alignmentTraceback, eval=TRUE, message=FALSE} alignList <- alignmentTraceback(ma=dynMA[[1]], ma_path=dynMA[[2]], align_method=align_type) names(alignList) # alignList$ma # dyn ma with scores # alignList$ma_path # dyn ma with path # alignList$path_coor # coordinates for path to auto highlight path in HTML/PDF table ``` ### Return results #### Traceback in matrix The following prints the fully populated dynamic programming matrix where the traceback path is highlighted in color. ```{r gen_alignmentTraceback_Matrix, eval=TRUE, message=FALSE} printColMa(alignList) ``` #### Alignment and score ```{r gen_alignment, eval=TRUE, message=FALSE} printAlign(x=alignList) ``` ## Local alignment The alignment type choice is passed on to all following functions. ```{r align_type_loc, eval=TRUE, message=FALSE} # align_type <- "global" align_type <- "local" ``` ### Dynamic programming matrices ```{r gen_dynProgMatrix_loc, eval=TRUE, message=FALSE} dynMA <- dynProgMatrix(S1, S2, align_method=align_type, gap_penalty=8, substitutionMA="BLOSUM50") ``` The matrices are stored in a list and returned below. ```{r print_dynProgMatrix_loc, eval=TRUE, message=FALSE} dynMA ``` ### Compute alignment Note: `alignList` stores all relevant results in a list, including dynamic programming matrices, as well as the coordinates (named `path_coor`) to highlight the path in the dynamic progamming matrix. This way one can easily generate a single dynamic programming matrix with the traceback path highlighted by colors or arrows in an HTML or PDF document (see below). ```{r gen_alignmentTraceback_loc, eval=TRUE, message=FALSE} alignList <- alignmentTraceback(ma=dynMA[[1]], ma_path=dynMA[[2]], align_method=align_type) names(alignList) # alignList$ma # dyn ma with scores # alignList$ma_path # dyn ma with path # alignList$path_coor # coordinates for path to auto highlight path in HTML/PDF table ``` ### Return results #### Traceback in matrix The following prints the fully populated dynamic programming matrix where the traceback path is highlighted in color. ```{r gen_alignmentTraceback_Matrix_loc, eval=TRUE, message=FALSE} printColMa(alignList) ``` #### Alignment and score ```{r gen_alignment_loc, eval=TRUE, message=FALSE} printAlign(x=alignList) ``` # C. Different Substitution Matrices Task 1: Load the Biostrings package in R, import the following two cytochrome P450 sequences O15528 and P98187 from NCBI (save as myseq.fasta), and create a global alignment with the pairwiseAlignment function from Biostrings as follows. ```{r alignment_scores_biostrings, eval=FALSE, message=FALSE} library(Biostrings) myseq <- readAAStringSet("myseq.fasta", "fasta") (p <- pairwiseAlignment(myseq[[1]], myseq[[2]], type="global", substitutionMatrix="BLOSUM50")) writePairwiseAlignments(p) ``` Your answers should address the following items: Record the scores for the scoring matrices BLOSUM50, BLOSUM62 and BLOSUM80. How and why do the scores differ for the three scoring matrices? Answer 1: The scores for the three BLOSUM substitutions matrices are: + BLOSUM50: 227 + BLOSUM62: 54 + BLOSUM80: -52 Answer 2: Since the two sequences are relatively dissimilar (as determined by alignment view from `writePairwiseAlignments(p)`) it is expected that the BLOSUM matrices trained on less similar sequences (_e.g._ BLOSUM50) result in higher scores than those trained on more similar sequences (_e.g._ BLOSUM80). # Session Info ```{r sessionInfo} sessionInfo() ``` # References