{ "cells": [ { "cell_type": "markdown", "id": "9275eaea", "metadata": {}, "source": [ "Today we will be learning to use the library `numpy` for working with data arrays/matrices and how to use dynamic programming to perform sequence alignment with the Needleman-Wunsch algorithm." ] }, { "cell_type": "code", "execution_count": null, "id": "7f3a7035", "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "id": "e3db3546", "metadata": {}, "source": [ "#### Set sequences to align\n", "\n", "In the homework assignment, you'll reading sequences from a FASTA file. For the live-coding, we'll define them explicitly in the script." ] }, { "cell_type": "code", "execution_count": null, "id": "22ceed48", "metadata": {}, "outputs": [], "source": [ "sequence1 = 'TGTTACGG'\n", "sequence2 = 'GGTTGACTA'" ] }, { "cell_type": "markdown", "id": "74fe501f", "metadata": {}, "source": [ "#### Define the match, mismatch, and gap scores\n", "\n", "Here we will define what the scoring scheme is for the alignment. In a real script, we would use `sys.argv` to read arguments in from the command line, stored in a list. Remember that the first thing in the list is always the name of the script." ] }, { "cell_type": "code", "execution_count": null, "id": "5e76938e", "metadata": {}, "outputs": [], "source": [ "match_score = 100\n", "mismatch_score = -100\n", "gap_penalty = -50" ] }, { "cell_type": "markdown", "id": "368da8d3", "metadata": {}, "source": [ "#### Playing around with arrays\n", "\n", "Numpy is a Python package built around matrices (arrays). You can think of an array as a (potentially) multi-dimensional list. In fact, a 2D numpy array is essentially a list of lists. We can create a numpy array using the `np.array()` function." ] }, { "cell_type": "code", "execution_count": null, "id": "5b41956d", "metadata": {}, "outputs": [], "source": [ "list_of_lists = [[1,2,3],\n", "\t\t\t\t [4,5,6],\n", "\t\t\t\t [7,8,9]]\n", "\n", "my_2D_array = np.array(list_of_lists, dtype=int)\n", "\n", "print(my_2D_array)" ] }, { "cell_type": "markdown", "id": "cad355d7", "metadata": {}, "source": [ "If we want to know what the dimensions of our array are, we can check the `.shape` attribute" ] }, { "cell_type": "code", "execution_count": null, "id": "ec7330dc", "metadata": {}, "outputs": [], "source": [ "print(my_2D_array.shape) # a tuple of (nrows, ncols)" ] }, { "cell_type": "markdown", "id": "48ddae25", "metadata": {}, "source": [ "Just like we can index lists, we can also index numpy arrays. When we index a 2D numpy array, we first index the rows, then the columns." ] }, { "cell_type": "code", "execution_count": null, "id": "d667267b", "metadata": {}, "outputs": [], "source": [ "my_list = [3,1,4,1,5]\n", "\n", "print(my_list[2]) # print element at index 2 in `my_list`\n", "print(my_list[0:3]) # print first three elements of `my_list`\n", "\n", "print(my_2D_array[1,2]) # Print `6`\n", "print(my_2D_array[1,:]) # Print the second (i.e. index 1) row\n", "print(my_2D_array[:,1]) # Print the second (i.e. index 1) column\n", "print(my_2D_array[:2,:2]) # Print the upper left 2x2 block" ] }, { "cell_type": "markdown", "id": "0fa2df68", "metadata": {}, "source": [ "Like with lists, numpy arrays also support item assignment." ] }, { "cell_type": "code", "execution_count": null, "id": "1a1f491e", "metadata": {}, "outputs": [], "source": [ "my_list[4] = 6 # Change the `5` to a `6`\n", "print(my_list)\n", "\n", "my_2D_array[1,1] = 314 # Change the `5` to a `314`\n", "print(my_2D_array) " ] }, { "cell_type": "markdown", "id": "5284bf68", "metadata": {}, "source": [ "We can also loop through numpy arrays. When we loop through a 2D array, we loop through the rows" ] }, { "cell_type": "code", "execution_count": null, "id": "3b485a9b", "metadata": {}, "outputs": [], "source": [ "for row in my_2D_array:\n", "\tprint(row)" ] }, { "cell_type": "markdown", "id": "9494f3eb", "metadata": {}, "source": [ "Because each row is just a 1D numpy array, we can also loop through the row itself" ] }, { "cell_type": "code", "execution_count": null, "id": "205d4474", "metadata": {}, "outputs": [], "source": [ "for row in my_2D_array:\n", "\tfor val in row:\n", "\t\tprint(val)" ] }, { "cell_type": "markdown", "id": "57dfe434", "metadata": {}, "source": [ "We can also use the range() function to loop through each value in the array" ] }, { "cell_type": "code", "execution_count": null, "id": "42c0e527", "metadata": {}, "outputs": [], "source": [ "for row_index in range(my_2D_array.shape[0]): # The number of rows is the first value in .shape\n", "\tfor col_index in range(my_2D_array.shape[1]): # The number of columns is the second value in .shape\n", "\t\tprint(my_2D_array[row_index, col_index])" ] }, { "cell_type": "markdown", "id": "7f92408e", "metadata": {}, "source": [ "#### Initialize the F-matrix\n", "\n", "The first thing we need to do is create an empty F-matrix. The number of rows should be equal to the length of sequence1 plus one (to allow for leading gaps). Similarly, the number of columns should be equal to the length of sequence2 plus one." ] }, { "cell_type": "code", "execution_count": null, "id": "f741cbf4", "metadata": {}, "outputs": [], "source": [ "F_matrix = np.zeros((len(sequence1)+1, len(sequence2)+1), dtype=int)" ] }, { "cell_type": "markdown", "id": "6cb80b16", "metadata": {}, "source": [ "Now we need to fill in the values in the first row and first column, based on the gap penalty. Let's fill in the first column." ] }, { "cell_type": "code", "execution_count": null, "id": "0f471310", "metadata": {}, "outputs": [], "source": [ "for i in range(len(sequence1)+1):\n", "\tF_matrix[i,0] = i*gap_penalty" ] }, { "cell_type": "markdown", "id": "7e7b38ab", "metadata": {}, "source": [ "Now fill in the first row" ] }, { "cell_type": "code", "execution_count": null, "id": "6b23bbbd", "metadata": {}, "outputs": [], "source": [ "for j in range(len(sequence2)+1):\n", "\tF_matrix[0,j] = j*gap_penalty" ] }, { "cell_type": "markdown", "id": "ce3df600", "metadata": {}, "source": [ "#### Populate the F-matrix\n", "\n", "Now that we've filled in the first row and column, we need to go row-by-row, and within each row go column-by-column, calculating the scores for the three possible alignments and storing the maximum score" ] }, { "cell_type": "code", "execution_count": null, "id": "8a41bcf2", "metadata": {}, "outputs": [], "source": [ "for i in range(1, len(sequence1)+1): # loop through rows\n", "\tfor j in range(1, len(sequence2)+1): # loop through columns\n", "\t\tif sequence1[i-1] == sequence2[j-1]: # if sequence1 and sequence2 match at positions i and j, respectively...\n", "\t\t\td = F_matrix[i-1, j-1] + match_score\n", "\t\telse: # if sequence1 and sequence2 don't match at those positions...\n", "\t\t\td = F_matrix[i-1, j-1] + mismatch_score\n", "\t\th = F_matrix[i,j-1] + gap_penalty\n", "\t\tv = F_matrix[i-1,j] + gap_penalty\n", "\n", "\t\tF_matrix[i,j] = max(d,h,v)" ] }, { "cell_type": "markdown", "id": "6f92c137", "metadata": {}, "source": [ "#### Print the F-matrix" ] }, { "cell_type": "code", "execution_count": null, "id": "6e9e0f0c", "metadata": {}, "outputs": [], "source": [ "print(F_matrix)" ] }, { "cell_type": "markdown", "id": "51242634", "metadata": {}, "source": [ "#### A primer on while loops\n", "\n", "While loops are a useful tool in Python (and most other languages) that allows you to continue doing a process until some condition stops being met. If we want, we can have them mimic a for loop.\n", "\n", "Let's make a for loop that prints out the integers from 0 to 10 (non-inclusive)" ] }, { "cell_type": "code", "execution_count": null, "id": "f3ec2432", "metadata": {}, "outputs": [], "source": [ "for i in range(0, 10, 1):\n", "\tprint(i)" ] }, { "cell_type": "markdown", "id": "ae3d6888", "metadata": {}, "source": [ "We can do the exact same thing with a while loop" ] }, { "cell_type": "code", "execution_count": null, "id": "9e53903a", "metadata": {}, "outputs": [], "source": [ "i = 0\n", "while i < 10:\n", "\tprint(i)\n", "\ti += 1 # increment i" ] }, { "cell_type": "markdown", "id": "468bf933", "metadata": {}, "source": [ "While loops can also let us easily go backwards" ] }, { "cell_type": "code", "execution_count": null, "id": "aa8a81de", "metadata": {}, "outputs": [], "source": [ "i = 9\n", "while i >= 0:\n", "\tprint(i)\n", "\ti -= 1 # decrement i" ] }, { "cell_type": "markdown", "id": "58e3293d", "metadata": {}, "source": [ "We can also set multiple conditions for a while loop" ] }, { "cell_type": "code", "execution_count": null, "id": "338871dd", "metadata": {}, "outputs": [], "source": [ "i = 5\n", "j = 5\n", "while i >= 0 and j >= 0:\n", " print(i, j)\n", " if i > j:\n", " i -= 1\n", " else:\n", " j -= 1" ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 5 }