{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# COMP 364 - Computer Tools for Life Sciences\n",
    "## Homework #1\n",
    "Instructors: Christopher J.F. Cameron and Carlos G. Oliver\n",
    "\n",
    "**Due:  Friday, September 29th 2017 at 11:59:59 pm**\n",
    "\n",
    "In this assignment, you will learn how to implement a rudimentry profile or position weight matrix (PWM) from DNA sequences and then scan a separate DNA sequence for possible motif hits. This kind of approach is used frequently to learn what kinds of DNA sequences are likely to be attached to certain proteins.\n",
    "\n",
    "## Computational hints\n",
    "\n",
    "Python programming visualizer: http://www.pythontutor.com/visualize.html#mode=edit\n",
    "\n",
    "All programming concepts are drawn from lectures up to and including 'Putting it together: Nested loops and conditionals'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Biology motivation/context\n",
    "\n",
    "From last week's lectures, we now know that there is a transfer of genetic information that occurs within cells from DNA-->RNA-->protein (called the central dogma of biology). In reality, this process is much more complicated as described in the figure below:\n",
    "\n",
    "<img src=\"images/central_dogma.png\" width=\"300\">\n",
    "<image source: https://www.quora.com/What-is-the-central-dogma-of-molecular-biology-Is-it-true (September 16th, 2017)>\n",
    "\n",
    "If you are unfamiliar with the central dogma, we would recommend you review these documents:\n",
    "\n",
    "https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology\n",
    "\n",
    "https://www.khanacademy.org/test-prep/mcat/biomolecules/amino-acids-and-proteins1/v/central-dogma-of-molecular-biology-2\n",
    "\n",
    "To help control the act of transcription (the transference of genetic information from DNA to RNA), cells employ a set of proteins (called transcription factors) that are known to bind near genes and regulate their expression (e.g., the rate at which they are transcribed into RNA). We are able to observe the activity of transcription factors by using a technology called chromatin immunoprecipitation (or ChIP). This technology uses a targeted antibody to select for DNA sequences that are bound by transcription factors (see figure below).\n",
    "\n",
    "<img src=\"images/chip_tech.png\" width=\"300\">\n",
    "<image adapted from: https://www.researchgate.net/figure/222384404_fig1_Fig-1-Schematic-representation-of-the-chromatin-immunoprecipitation-procedure-The (September 16th, 2017)>\n",
    "\n",
    "The minute (or low level) details about ChIP technology are outside the scope of this course, but you must understand that the result of a ChIP protocol is a collection of DNA sequences that were bound by a specific protein of interest (i.e., a transcription factor).\n",
    "\n",
    "If you would like to learn more about ChIP technology, please see the following links:\n",
    "\n",
    "https://en.wikipedia.org/wiki/ChIP-sequencing\n",
    "\n",
    "http://bitesizebio.com/13541/an-introduction-to-chip-seq/\n",
    "\n",
    "Now the problem we have to address is that we have a collection of DNA sequences that were bound by a protein of interest, but we don't know the specific location where the protein attached to DNA. This is called the motif discovery problem, where:\n",
    "\n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Given a collection of DNA sequences that were bound by a protein of interest, identify the motif (or binding site) of the protein.\n",
    "\n",
    "Some of you may already be familiar with motifs and have seen examples visualized in a sequence logo:\n",
    "\n",
    "<img src=\"images/motif.png\" width=\"350\">\n",
    "<image source: https://openi.nlm.nih.gov/detailedresult.php?img=PMC3239296_1471-213X-11-59-3&req=4 (September 16th, 2017)>\n",
    "\n",
    "A sequence logo represents of a stack of letters (DNA nucleotides) at each position of the motif (or binding site). The relative sizes of the nucleotides indicate their frequency of representation in the collection of DNA sequences obtained from the ChIP protocol. While the total height of the letters depicts the information content of the position, in bits (a unit that you all should be familiar with now).\n",
    "\n",
    "In the upper-left panel of the figure above, the binding motif of E-box is described as being 11 nucleotides long. We see that positions 6, 7, 8, and 11 have to be nucleotides C, A, G, and G, respectively. While other positions can be multiple nucleotides or have no information about them.\n",
    "\n",
    "A profile or position weight matrix (PWM) is another way to represent the binding site of a protein that was observed to be bound to DNA. In this assignment, we will implement a PWM using only loops, lists/tuples, conditional statements, and minor arithmetic.\n",
    "\n",
    "Let's start by remembering how to declare an empty list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# definition of an empty list\n",
    "empty_list = [] # or list()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's do something a little more useful (and biologically more relevant). Define a list object, called 'nucleotides', that contains single character representations of the four possible DNA nucleotides ('A', 'C', 'G', 'T')."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define list of DNA nucleotides\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With a list of nucleotides defined, we can now implement some concepts that were introduced in class. \n",
    "\n",
    "First, let's create a loop (either for or while) that iterates over each item in the 'nucleotides' list and prints the item's value & index. \n",
    "\n",
    "a) Using the built-in enumerate function ** (2.5 marks) **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# create a loop that iterates over each item in the list and prints out the item's index & value using enumerate()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "b) Using the built-in index method ** (2.5 marks) **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# create a loop that iterates over each item in the list and prints out the item's index & value using index()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's pretend we have just performed a ChIP protocol and observed the following six DNA sequences to be bound by our protein of interest (protein 'X'). Real ChIP-seq DNA sequences can be much longer, but for practicality, we'll limit them to jus six nucleotides."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# list of observed sequences where transcription factor binds\n",
    "ChIP_seqs = [ \"ACGTGA\", \"ACGGGA\", \"ACGCCC\", \"ACAGAT\", \"GATAGA\", \"ACGTTA\" ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now want to determine the frequency at which each nucleotide is bound at a given position in the motif. Remember, each sequence represents a single observation of the protein's binding motif/site. How many positions are there in the above motif? (complete the code below) ** (2.5 marks) **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "motif_size = \n",
    "print(\"motif size:\",motif_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For each position in the motif, we need to count the frequency of occurence for each possible nucleotides. We're going to start with just the first column and then you will implement a pair of nested loops to complete the matrix. ** (10 marks) **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# declare an empty list for the position weight matrix (PWM) of observed sequences\n",
    "PWM = []\n",
    "\n",
    "# define a list that contains the frequency counts for each nucleotide in the first column\n",
    "nucleotide_counts = [0, 0, 0, 0] # [A, C, G, T] - note: the order is the same as our nucleotides list object above\n",
    "\n",
    "# iterate over each DNA sequence in the list of observed DNA sequences\n",
    "for DNA_seq in ChIP_seqs:\n",
    "    # insert your code here\n",
    "\n",
    "# print out the nucleotide counts\n",
    "print(\"nucleotide counts for position one:\",nucleotide_counts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will expand our code and apply it to all positions of the motif (hint: you will need a loop). ** (10 marks) **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you have implemented the above code correctly, you should have a two-dimensional list (or a list of lists) containing frequency counts for each nucleotide at a given position in the motif. We would like to represent this in a useful table format such as:\n",
    "\n",
    "Position | A | C | G | T |\n",
    ":---: | :---: | :---: | :---: | :---:\n",
    "1 | $A_{1}$ | $C_{1}$ | $G_{1}$ | $T_{1}$ \n",
    "2 | $A_{2}$ | $C_{2}$ | $G_{2}$ | $T_{2}$ \n",
    "3 | $A_{3}$ | $C_{3}$ | $G_{3}$ | $T_{3}$ \n",
    "4 | $A_{4}$ | $C_{4}$ | $G_{4}$ | $T_{4}$ \n",
    "5 | $A_{5}$ | $C_{5}$ | $G_{5}$ | $T_{5}$ \n",
    "6 | $A_{6}$ | $C_{6}$ | $G_{6}$ | $T_{6}$ \n",
    "\n",
    "Where $A_{1}$ is the frequency count of A at position 1, $A_{2}$ is the frequency count of A at position 2, and so on.\n",
    "\n",
    "This table is called the absolute frequency matrix or position frequency matrix (PFM).\n",
    "\n",
    "To do this, we will write a loop that iterates over the two-dimensional list and prints each position's information. Note: Your table should be informative and easy to read (but it doesn't have to be perfect - i.e., don't waste your time making it look too fancy). ** (10 marks) **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What should the nucleotide frequency counts in each row of your table sum to? **(2.5 marks)**\n",
    "\n",
    "\n",
    "\n",
    "### Sequence logo creation (5 marks)\n",
    "\n",
    "Go to: http://ccg.vital-it.ch/pwmtools/pwmscan.php#\n",
    "\n",
    "Follow the instructions on the webpage and create a sequence logo of your PFM. All that should be required of you is the following:\n",
    "\n",
    "* set custom weight matrix format\n",
    "* paste your PFM into the text field\n",
    "* click submit\n",
    "\n",
    "Once the page loads you should see a sequence logo of your PFM on the right-side of the screen. Right-click and save the image in the COMP364 HW1's subfolder called 'images'.\n",
    "\n",
    "Create a Markdown box below this text or insert beneath this text the sequence logo using the following command:\n",
    "\n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`<`img src=\"images/your_file_name.png\" width=\"250\"`>`\n",
    "\n",
    "Replace 'your_file_name.png' with your sequence logo's filename."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Position probability matrices (PPM)\n",
    "\n",
    "Typically, PWMs are calculated as the log-likelihood ratio matrix, where $A_{1}$ represents the log-ratio of the relative frequency to the background frequency for nucleotide 'A' at position 1.\n",
    "\n",
    "First, we'll start by calculating the relative frequencies and creating the position probability matrix (PPM). To do this we will convert the absolute frequency counts to probabilities.\n",
    "\n",
    "How do we convert absolute frequencies to probabilites?\n",
    "\n",
    "Let's define some mathematical notation:\n",
    "\n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$M$ will be the PPM\n",
    "\n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$M_{i,j}$ is the probability of nucleotide $i$ occuring at position $j$ of the PPM\n",
    "\n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$N$ is the number of DNA sequences observed to be bound by the protein of interest\n",
    "\n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$X_{k,j}$ is the observed nucleotide at position $j$ of observed DNA sequence $k$ \n",
    "\n",
    "By using this notation, we can define $M_{i,j}$ as a function of $X_{k,j}$ such that:\n",
    "\n",
    "$$M_{i,j}=\\frac{1}{N}\\sum_{k=1}^{N}I\\left(X_{k,j}=i\\right)$$\n",
    "\n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Where $I()$ returns 1 if $X_{k,j}=i$ and 0 otherwise\n",
    "\n",
    "For anyone unfamiliar with summation notation, sigma ($\\sum$) is the standard for writing long sums. For example, let's sum all of the digits present in the decimal system.\n",
    "\n",
    "$$0+1+2+3+4+5+6+7+8+9=45$$\n",
    "\n",
    "This can be represented as:\n",
    "\n",
    "$$\\sum_{i=0}^{9}i=0+1+2+3+4+5+6+7+8+9=45$$\n",
    "\n",
    "'$\\sum_{i=0}^{9}i$' denotes that we sum all $i$ starting at $i=0$, incrememnt by 1, until we reach $i=9$.\n",
    "\n",
    "If we wanted to represent this as a for loop in Python, it would be:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "editable": false
   },
   "outputs": [],
   "source": [
    "total = 0 # why are we not naming the variable 'sum'?\n",
    "for i in range(10):\n",
    "    total += i\n",
    "print(total)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In practice (and as you will find), implementing the conversion from absolute frequency to a probability is relatively easy. Complete the code below to create a PPM (can be accomplished with as little as two lines of code). ** (10 marks) **\n",
    "\n",
    "Tip - use round() to limit the number of digits after the decimal point when printing the table. See the following link for more information: https://docs.python.org/3.6/library/functions.html#round"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# insert PPM code here\n",
    "\n",
    "# insert your table printing code below to ensure that your PPM is correct\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What should each row of your table sum to now? **(2.5 marks)**\n",
    "\n",
    "\n",
    "\n",
    "### Likelihoods\n",
    "\n",
    "Now that we have a PPM calculated, we can define the likelihood of observing a specific DNA sequences within the motif. Or in other words, if we give you a DNA sequence of six nucleotides, what is the probability that the protein of interest will bind it?\n",
    "\n",
    "For example, we observed the DNA sequence 'GTAGTA'. The likelihood, $L$, of of this sequence being bound by our protein of interest is:\n",
    "\n",
    "$$ L(seq\\ |\\ M) = M_{G,1} \\times M_{T,2} \\times M_{A,3} \\times M_{G,4} \\times M_{T,5} \\times M_{A,6} $$\n",
    "\n",
    "What is the likelihood of observing the following sequences? **(1 mark each for correct answers) ** \n",
    "\n",
    "a) CGTAGT = \n",
    "\n",
    "b) ACGTAA = \n",
    "\n",
    "c) AAAAAA = \n",
    "\n",
    "(hint: fill in the code below to detemine the likelihood of observing the above sequences) ** (7 marks for correct implementation) **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define sequence to calculate likelihood of\n",
    "sequence = \"\"\n",
    "\n",
    "# define initial likelihood\n",
    "likelihood = 1.0\n",
    "\n",
    "# insert likelihood calculation code here\n",
    "    \n",
    "print(\"likelihood of '\"+sequence+\"':\",likelihood)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a useful program to calculate likelihoods that you have created, but it's not very practical. Add user functionality to the code: **(10 marks)**\n",
    "\n",
    "* continually prompt the user for a string containing a DNA sequence\n",
    "* ensure that the provided string is valid to have the likelihood calculated (print a useful error message if it does not)\n",
    "* then calculate the likelihood"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Position weight matrices (PWM)\n",
    "\n",
    "Finally, we will now convert our PPM to a PWM by calculating the log2-ratio between the observed probabilities (found in the PPM) and the background probability. For simplicity, we will assume each nucleotide occurs with an equal probability or:\n",
    "\n",
    "$$p_{i}=\\frac{1}{|i|}$$\n",
    "\n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Where $|i|$ is the size of our nucleotide alphabet.\n",
    "\n",
    "What should the background frequncy be for any given nucleotide? **(2.5 marks)**\n",
    "\n",
    "\n",
    "\n",
    "Now we will calculate the PWM by calculating the log2-ratio between the PPM's probabilities and the background probability, where:\n",
    "\n",
    "$$ PWM_{i,j} = log_{2}(\\frac{M_{i,j}}{p_{i}})$$\n",
    "\n",
    "\n",
    "Again this is relatively easy and can be done in a few lines of code. **( 10 marks) **\n",
    "\n",
    "To help with the log2 transformation, we will import the Python math module (more on modules in the future) and use the math.log2() function. This function accepts a numerical object as input and returns the base-2 logarithm applied to the input. See the following link for more information: https://docs.python.org/3.4/library/math.html\n",
    "\n",
    "You may encounter problems when taking the log2() of some ratios (i.e., log2(0) is negative infinity and zero divided by any number is undefined). This can be avoided by either using conditional statements or adding +1 to each ratio (feel free to choose either)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "# insert PWM code here\n",
    "\n",
    "# insert your table printing code below to ensure that your PWM is correct"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Motif scanning\n",
    "\n",
    "Another use for PWMs is to search for occurrences of the motif in a given DNA sequence. To do this we use what is called a sliding window approach.\n",
    "\n",
    "For example, we observe a DNA sequence of size 10 using a window of size 3.\n",
    "\n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DNA sequence = ACGTACTATT\n",
    "\n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Windows of size 3 = [ACG], [CGT], [GTA], [TAC], [CTA], [TAT], [ATT]\n",
    "\n",
    "In this context, the first instance of the sliding window will observe nucleotides 0-2, the second instance will then observe nucleotides 1-3, and so on. The window continues to slide along the DNA sequence until it reaches the end.\n",
    "\n",
    "For each window, we will calculate a score (similar to the likelihood) that we define as the following:\n",
    "\n",
    "$$score_{s} = \\sum_{j=1}^{N}PWM_{s,j}$$\n",
    "\n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Where $s$ is the observed nucleotide in the DNA sequence located at position $j$ of the PWM.\n",
    " \n",
    "Given the defined list object below containing a random DNA sequence, identify all posible locations (above a given score threshold) where the protein may bind and sore hits based on the log2-ratio. Provide the total number of hits as well as the DNA index for the start and score of each window. ** (15 marks) **\n",
    "\n",
    "For example:\n",
    "\n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;total number of hits: ???\n",
    "\n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DNA index: ??\t(score)\n",
    "\n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DNA index: ??\t(score)\n",
    "\n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DNA index: ??\t(score)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# DNA sequence to be scanned\n",
    "DNA_sequence = \"GTGTCGCGACAGCCCAGCATGATATCCTGAGGCGCTACGCCGACGGATGTCCCACGGAATTGCCACGGTAGCCGAGCGCTACGCGGGCGACACGAACTTATGTATGGAGCGGGCCGTCGAAAGGTCGTACCCTTGCAGTTAGCACGTAGCCCGGCCCCACTAGCACAGCAGTGCCTCGGGCGGCATCCTCATTATTAAGTTTTCTCTACAGCCAAACGACCAGGTGCGCTTCCGCGGAGCGCGGTGGAGACTCGCCCACCCGGCAGCTCTGTGACGGGGACTAAAGGGGCGATGATAATCGCGAGTGCCGCGTTATGGTGGTGTCGGGACAGAGCGGCCTTGCGGCCAGTCGTATCCCTTCTCGAGCTCCGTCCGGTTAAGCGTGACACCCCCAGCGTACCCGCAAACCGCGATGGCTGTGCTTGGGGTCGATCGCACGTAGGACGGTCCCCAACGTGAGACGCCGGGGCACCAGTTCTCACGCCCAAAGCATAAACGACGGGCAG\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define empty list to store motif hits\n",
    "hits = []\n",
    "hit_threshold = 5.0 # example value, try different values and see the effect\n",
    "\n",
    "# insert sliding window approach here\n",
    "#window_size = ???\n",
    "        \n",
    "# print out hits to stdout\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**BONUS (+5 marks)**: sort motif hits in descending order based on their score"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Edit Metadata",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}