{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Introduction to Bioinformatics**<br>\n",
    "A masters course by Blaž Zupan and Tomaž Curk<br>\n",
    "University of Ljubljana (C) 2016-2018\n",
    "\n",
    "Disclaimer: this work is a first draft of our working notes. The images were obtained from various web sites, but mainly from the wikipedia entries with explanations of different biological entities. Material is intended for our bioinformatics class and is not meant for distribution.\n",
    "\n",
    "## Lecture Notes Part 2\n",
    "# The First Look at the Genome"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Central Dogma of Molecular Biology"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The central dogma of molecular biology defines the flow of genetic information within biological systems. In simple terms, the central dogma says that \"DNA makes RNA makes protein.\" A bit longer: DNA encodes the information about the sequence of amino acids in proteins. To construct a protein from amino acids, the information – protein coding sequence – from DNA is first transcribed to messenger RNA and then translated to protein. Central dogma also claims that this is the direction how the information flows. That is, the information stored as a sequence of amino acids in proteins never gets translated back to the DNA."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Constitution of the DNA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A DNA is a long molecule composed of nucleotides, a sugar called deoxyribose, and a phosphate group. There are four different nucleotides: cytosine (C), guanine (G), adenine (A), and thymine (T). A DNA consists of two strands coiled around each other to form a double helix. The two strands are joined to each other by weak hydrogen bonds, where adenine can bound to thymine and cytosine to guanine.\n",
    "\n",
    "Each strand is composed of alternating sugar and phosphate group, forming a sugar-phosphate backbone using a strong covalent bond. Nucleotides are attached to the backbone, one nucleotide per one sugar group. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](figs/01-dna-in-close.png \"Components and bonds in the DNA.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because of complementarity of the two strains, one can use one strain to infer the composition of another strain. Both strands carry the same information.\n",
    "\n",
    "The difference in strength of the bonds turns DNA into a sort of a zipper. It does not take much energy to unzip the DNA, that is, open the double helix, while the bonds that hold the backbone together are strong and protect the sequence against any damages."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Directionality"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It turns out that two strains of the DNA run in a different direction. To start thinking about the direction, we have to examine a sugar group from a sugar-phosphate backbone. To make life more comfortable, and to put some order in the chemical formulas, chemists have decided to number the carbon atoms in the sugar and label them with \"prime\" notation, that is, using a number and a prime sign, like 1' (one-prime) and 3' (three prime). Here is a depiction of the atom numbering for a sugar group:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](figs/02-sugar.png \"Sugar group and atom numbering.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sugar is attached to the backbone with a 5'-end (five prime end), where we find a phosphate group, and with a 3'-end on the other side. Based on this nomenclature, the DNA has a 5' and a 3' end, according to the carbon atom of the sugar that ends the chain. Notice that if one strand runs from 5' to 3', the other, complementary one, runs in the opposite direction, from 3' to 5'."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Genome Size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The complete DNA sequence defines what we call a genome. The genome is, therefore, the total genetic information contained within the cell. That includes the DNA in the nucleus and DNA in any of the organelles. Huh, this is new: turns out that some of the organelles also include DNA. In animals, such organelle is mitochondria, and in plants, this is the chloroplast. Despite this, when we will refer to a genome for eukaryotes, we will usually mean the DNA in the nucleus, and we will refer to the genetic material in the mitochondria as a mitochondrial genome. DNA in the nucleus is most often not a single molecule, but rather broken into pieces and organized within the chromosomes. Human has 23 pairs of chromosomes. But do not worry about chromosomes at this point (and at least not for a few next lectures).\n",
    "\n",
    "So what is the total length of the DNA sequence? It depends on an organism. Prokaryote (bacteria) have the shortest genome. We measure the length of the DNA in the base pairs (bp), which is a unit consisting of two nucleobases bound to each other by hydrogen bonds. Simply stated, one base pair, one nucleotide on each strand. The total number of nucleotides in one of the strands is the size (length) of the genome. Bacteria have genomes ranging from 0.5 to 13 Mbp in length. The unit Mbp means megabase pairs, or $10^6$ base pairs. We most often use notation Mb for mega base pairs. Genomes of eukaryotes are large and range from 8Mb to 670Gb. Viruses have a much smaller genome of the size from 5 to 50kb. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](figs/02-genome-size.png \"Sizes of the genomes.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And us, human? Our genome contains about 3,200 Mb, or 3,2 Gb. For comparison, there are 76,944 words in the first Harry Potter book (Harry Potter and the Philosopher's Stone). An average length of an English word is just over 5. The estimated number of letters in HP1 is therefore 384,720. At the letter level (representing nucleotides with a single letter), we would need to use equivalent of 8,300 Philosopher's Stones books to write a human genome. My version of HP1 is about 2cm thick, which would make a genome stack – one book on the top of another – for 166 meters. If human DNA were stretched out, it would form a very thin thread about 2 meters long."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A Small Start"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us start small. With a virus. [Lambda phage](https://en.wikipedia.org/wiki/Lambda_phage) is a bacterial virus that infects and destroys bacteria Escherichia coli. This strange names for species use binomial nomenclature, or a Latin name, which is composed of genus and species. Like us, we are Homo sapiens. But, back to Lambda phage. One of the sources for its genome information is [GenBank](https://www.ncbi.nlm.nih.gov/genbank/), a sequence database maintained by The National Center for Biotechnology Information (NCBI). GenBank has a [page with information on Lambda phage genome](https://www.ncbi.nlm.nih.gov/nuccore/NC_001416.1). If you wonder what different fields on this page mean, check out [Sample GenBank Record page](https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) and click on any of the terms. The page gives information on people and groups that have sequenced and published the genome, on the structural composition of the genome, and finally, about the actual DNA sequence. It is in this latter that we are interested. While all this information is available through the web page, we are more curious about programmatic access to this information. Let us fetch the sequence and check on its length:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio import Entrez, SeqIO\n",
    "\n",
    "Entrez.email = \"blaz.zupan@fri.uni-lj.si\"\n",
    "with Entrez.efetch(db=\"nucleotide\", id=\"nc_001416\", rettype=\"gb\") as handle:\n",
    "    data = SeqIO.read(handle, \"genbank\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SeqRecord(seq=Seq('GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCG...ACG', IUPACAmbiguousDNA()), id='NC_001416.1', name='NC_001416', description='Enterobacteria phage lambda, complete genome', dbxrefs=['BioProject:PRJNA485481'])"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "48502"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The genome of Lambda phage has 48,502 base pairs. We have obtained only one strand; the other one is merely a reverse complement. Genome starts with three guanine nucleotides. Which nucleotides are the most prevalent in Lambda phage genome?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({'G': 12820, 'C': 11362, 'A': 12334, 'T': 11986})"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from collections import Counter\n",
    "Counter(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW8AAAD8CAYAAAC4uSVNAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAADFdJREFUeJzt3V2sZWddx/Hf36kUC2ba0qbUgXDaxJcUq9BOVCIoIAq2RCAxpo0XIJgmQIwviTqEG7yyvFwgSiyN0VQCFOTNhKpEwURNTOsZFdoihZYWZQSBJo4vvQDq48VZU/ZMz5mzh+519vxnPp/kZNZee83q86x1znf2WXvv7hpjBIBevm3dAwDg1Ik3QEPiDdCQeAM0JN4ADYk3QEPiDdCQeAM0JN4ADZ0z144vuuiisbGxMdfuAc5Ihw8f/uoY4+Ldtpst3hsbG9nc3Jxr9wBnpKr6/DLbuWwC0JB4AzQk3gANiTdAQ+IN0JB4AzQk3gANiTdAQ7O9SefOI0ezcei2uXYP8Jg9cOO16x7Ct8wjb4CGxBugIfEGaEi8ARoSb4CGxBugIfEGaEi8ARoSb4CGxBugIfEGaEi8ARoSb4CGxBugIfEGaEi8ARpa6sMYqupJST423XxykoeTfGW6/UNjjK/NMDYAdrBUvMcYDyZ5RpJU1RuS/M8Y4y0zjguAk3DZBKAh8QZoaKXxrqobqmqzqjYffujoKncNwIKVxnuMcfMY4+AY4+C+8/avctcALHDZBKAh8QZoaKmXCi4aY7xhhnEAcAo88gZoSLwBGhJvgIbEG6Ah8QZoSLwBGhJvgIbEG6Ah8QZoSLwBGhJvgIbEG6Ah8QZoSLwBGhJvgIbEG6ChU/4whmVdeWB/Nm+8dq7dA5zVPPIGaEi8ARoSb4CGxBugIfEGaEi8ARoSb4CGxBugIfEGaGi2d1jeeeRoNg7dNtfuAfbEA6fpO8U98gZoSLwBGhJvgIbEG6Ah8QZoSLwBGhJvgIbEG6Ah8QZoSLwBGhJvgIbEG6Ah8QZoSLwBGhJvgIbEG6ChpeNdVS+tqlFV3zfngADY3ak88r4+yd9NfwKwRkvFu6qemOTZSV6V5LpZRwTArpZ95P2SJH8xxvhMkger6uoZxwTALpaN9/VJbp2Wb80Ol06q6oaq2qyqzYcfOrqK8QGwjV0/Pb6qLkzy/CRXVtVIsi/JqKpfH2OMxW3HGDcnuTlJzr30u8ejdgbASizzyPtnk7xzjPG0McbGGOOpSe5P8px5hwbATpaJ9/VJPnTCug/Eq04A1mbXyyZjjOdts+5t8wwHgGV4hyVAQ+IN0JB4AzQk3gANiTdAQ+IN0JB4AzQk3gANiTdAQ+IN0JB4AzQk3gANiTdAQ+IN0JB4AzQk3gAN7fphDN+qKw/sz+aN1861e4CzmkfeAA2JN0BD4g3QkHgDNCTeAA2JN0BD4g3QkHgDNCTeAA3N9g7LO48czcah2+baPUCS5IGz9J3cHnkDNCTeAA2JN0BD4g3QkHgDNCTeAA2JN0BD4g3QkHgDNCTeAA2JN0BD4g3QkHgDNCTeAA2JN0BD4g3Q0NLxrqonV9WtVXVfVR2uqj+rqu+Zc3AAbG+pT9KpqkryoSS3jDGum9b9YJJLknxmvuEBsJ1lPwbteUm+Psa46diKMcYn5hkSALtZ9rLJ9yc5POdAAFjeSp+wrKobqmqzqjYffujoKncNwIJl4313kqt322iMcfMY4+AY4+C+8/Y/tpEBsKNl4/3xJOdW1Q3HVlTVD1TVc+YZFgAns1S8xxgjycuSvGB6qeDdSX47yZfmHBwA21v21SYZY/x7kp+bcSwALMk7LAEaEm+AhsQboCHxBmhIvAEaEm+AhsQboCHxBmhIvAEaEm+AhsQboCHxBmhIvAEaEm+AhsQboCHxBmho6Q9jOFVXHtifzRuvnWv3AGc1j7wBGhJvgIbEG6Ah8QZoSLwBGhJvgIbEG6Ah8QZoSLwBGprtHZZ3HjmajUO3zbV7gNPSA3v0znKPvAEaEm+AhsQboCHxBmhIvAEaEm+AhsQboCHxBmhIvAEaEm+AhsQboCHxBmhIvAEaEm+AhsQboCHxBmhoqXhX1SVV9e6q+lxVHa6qv6+ql809OAC2t2u8q6qSfDjJ34wxLh9jXJ3kuiRPmXtwAGxvmY9Be36Sr40xbjq2Yozx+SS/O9uoADipZS6bPD3JP849EACWd8pPWFbV26vqE1X1D9vcd0NVbVbV5sMPHV3NCAF4lGXifXeSq47dGGO8NslPJLn4xA3HGDePMQ6OMQ7uO2//6kYJwHGWiffHkzy+ql69sO68mcYDwBJ2jfcYYyR5aZIfr6r7q+qOJLck+c25BwfA9pZ5tUnGGF/M1ssDATgNeIclQEPiDdCQeAM0JN4ADYk3QEPiDdCQeAM0JN4ADYk3QEPiDdCQeAM0JN4ADYk3QEPiDdCQeAM0JN4ADS31YQzfiisP7M/mjdfOtXuAs5pH3gANiTdAQ+IN0JB4AzQk3gANiTdAQ+IN0JB4AzQk3gAN1Rhjnh1X/XeSe2bZ+d66KMlX1z2Ix+hMmENiHqebM2Eep+McnjbGuHi3jWZ7e3ySe8YYB2fc/56oqs3u8zgT5pCYx+nmTJhH5zm4bALQkHgDNDRnvG+ecd976UyYx5kwh8Q8TjdnwjzazmG2JywBmI/LJgANrTzeVfWiqrqnqu6tqkOr3v9jVVVPraq/rqpPVdXdVfXL0/oLq+ovq+qz058XTOurqt42zeeTVXXVwr5ePm3/2ap6+Rrmsq+q/qmqPjLdvqyqbp/G+t6qety0/tzp9r3T/RsL+3jdtP6eqnrhGuZwflW9v6o+XVX/UlXPanoufnX6frqrqt5TVY/vcD6q6g+r6stVddfCupUd/6q6uqrunP7O26qq9nAeb56+rz5ZVR+qqvMX7tv2OO/Ur53O5VqNMVb2lWRfkvuSXJ7kcUk+keSKVf43VjDGS5NcNS1/Z5LPJLkiyZuSHJrWH0ryxmn5miR/nqSS/EiS26f1Fyb53PTnBdPyBXs8l19L8u4kH5luvy/JddPyTUlePS2/JslN0/J1Sd47LV8xnaNzk1w2nbt9ezyHW5L84rT8uCTndzsXSQ4kuT/Jdyych1d0OB9JfizJVUnuWli3suOf5I5p25r+7k/v4Tx+Ksk50/IbF+ax7XHOSfq107lc59eqD+Czknx04fbrkrxu3ZPcZcx/muQns/WGokundZdm63XqSfKOJNcvbH/PdP/1Sd6xsP647fZg3E9J8rEkz0/ykemH46sL36yPnIskH03yrGn5nGm7OvH8LG63R3PYn63o1Qnru52LA0n+bYrXOdP5eGGX85Fk44ToreT4T/d9emH9cdvNPY8T7ntZkndNy9se5+zQr5P9bK3za9WXTY59Ex/zhWndaWn6dfWZSW5PcskY44vTXV9Kcsm0vNOc1j3Xtyb5jST/N91+UpL/HGN8Y5vxPDLW6f6j0/brnsNlSb6S5I+myz9/UFVPSLNzMcY4kuQtSf41yRezdXwPp9/5OGZVx//AtHzi+nV4ZbYe+SenPo+T/WytzVn7hGVVPTHJB5L8yhjjvxbvG1v/vJ62L8Opqhcn+fIY4/C6x/IYnZOtX3V/f4zxzCT/m61f0x9xup+LJJmuCb8kW/8YfVeSJyR50VoHtSIdjv9uqur1Sb6R5F3rHssqrTreR5I8deH2U6Z1p5Wq+vZshftdY4wPTqv/o6oune6/NMmXp/U7zWmdc/3RJD9TVQ8kuTVbl05+J8n5VXXsf3mwOJ5Hxjrdvz/Jg1n/+fpCki+MMW6fbr8/WzHvdC6S5AVJ7h9jfGWM8fUkH8zWOep2Po5Z1fE/Mi2fuH7PVNUrkrw4yc9P/xAlpz6PB7PzuVyfFV9zOidbT1Zclm9e8H/6uq8NnTDGSvLHSd56wvo35/gnad40LV+b45+kuWNaf2G2rtdeMH3dn+TCNcznufnmE5Z/kuOfVHnNtPzaHP8E2fum5afn+CduPpe9f8Lyb5N877T8huk8tDoXSX44yd1JzpvGdkuSX+pyPvLoa94rO/559BOW1+zhPF6U5FNJLj5hu22Pc07Sr53O5Tq/5jiA12TrFRz3JXn9uie4zfiena1fAz+Z5J+nr2uydV3rY0k+m+SvFr75Ksnbp/ncmeTgwr5emeTe6esX1jSf5+ab8b58+mG5d/pmO3da//jp9r3T/Zcv/P3XT3O7JzO9EmCX8T8jyeZ0Pj48/fC3OxdJfivJp5PcleSdUxhO+/OR5D3Zuk7/9Wz9JvSqVR7/JAenY3Jfkt/LCU9OzzyPe7N1DfvYz/lNux3n7NCvnc7lOr+8wxKgobP2CUuAzsQboCHxBmhIvAEaEm+AhsQboCHxBmhIvAEa+n/C8YRvEm4rVwAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "\n",
    "counts = Counter(data).items()\n",
    "pos = np.arange(4)+.5\n",
    "plt.barh(pos, [c for _, c in counts], align=\"center\")\n",
    "plt.yticks(pos, [n for n, _ in counts]);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[12820, 11362, 12334, 11986]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[c for _, c in counts]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us try with another species, this time, with bacteria. [Mycobacterium tuberculosis](https://en.wikipedia.org/wiki/Mycobacterium_tuberculosis) is a pathogenic bacteria, that is, bacteria that can cause disease. Mycobacterium tuberculosis infects lungs. To find the ID, that is, the locus information we need for fetching the genome, we again go to GenBank and search with the name of the bacteria. We find that several strains of bacteria have been sequenced and stored in GenBank. Two of them are [H37Rv](https://www.ncbi.nlm.nih.gov/nuccore/AL123456.3) and  [H37RvSiena](https://www.ncbi.nlm.nih.gov/nuccore/CP007027.1). We take the first one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4411532"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with Entrez.efetch(db=\"nucleotide\", id=\"AL123456.3\", rettype=\"gb\") as handle:\n",
    "    data = SeqIO.read(handle, \"genbank\")\n",
    "len(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This genome is way bigger than that of the virus and takes a while to load. If we got serious on this (and would not sketch the code in iPython), we would download the genome and save it locally in the standard [FASTA](https://en.wikipedia.org/wiki/FASTA_format) format to work with it locally:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"data/m_tuberculosis.fasta\", \"w\") as f:\n",
    "    SeqIO.write([data], f, \"fasta\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fine, we can work with this sequence locally. It takes only a second to load. Let us do some nucleotide frequence analysis as well:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({'T': 758368, 'G': 1444614, 'A': 758552, 'C': 1449998})"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = SeqIO.read(\"data/m_tuberculosis.fasta\", \"fasta\")\n",
    "Counter(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW8AAAD8CAYAAAC4uSVNAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAADPhJREFUeJzt3V+MXOdZx/Hvg00c3CInqaPE2FE3QQWUYNomFhDRQhtKW+KKNBJCtrhoaZGltkL8E+AQCYUr3D8XJVDhWAgUqpS0tE0rNUAFKRIgoYR1i+ME6tSOHYhJaRIJ8ycXTc3Dxbxrjzfz5+x6jnce5/uRVj7znrPv++x7Zn579pwznshMJEm1fNtaFyBJWjnDW5IKMrwlqSDDW5IKMrwlqSDDW5IKMrwlqSDDW5IKMrwlqaD1fXW8efPmXFhY6Kt7SbooHTx48LnMvHLadr2F98LCAouLi311L0kXpYh4qst2njaRpIIMb0kqyPCWpIIMb0kqyPCWpIIMb0kqyPCWpIIMb0kqqLc36Rw+eYqFvQ/21b0kzaUT+3ZekHE88pakggxvSSrI8JakggxvSSrI8JakggxvSSrI8JakggxvSSrI8JakggxvSSrI8JakggxvSSrI8JakggxvSSrI8JakggxvSSqoc3hHxNURcX9EHIuIgxHx5xHxPX0WJ0kardMn6UREAA8A92bmrtb2WuAq4In+ypMkjdL1Y9DeDLyYmfuXGjLzUD8lSZKm6Xra5PuBg30WIknqbqYXLCNiT0QsRsTi6RdOzbJrSdKQruH9OHDTtI0y80Bm7sjMHes2bjq/yiRJY3UN7y8BGyJiz1JDRPxARLyxn7IkSZN0Cu/MTOB24C3tVsHHgd8Bvt5ncZKk0brebUJm/jvwMz3WIknqyHdYSlJBhrckFWR4S1JBhrckFWR4S1JBhrckFWR4S1JBhrckFWR4S1JBhrckFWR4S1JBhrckFWR4S1JBhrckFWR4S1JBhrckFdT5wxhWavvWTSzu29lX95L0suaRtyQVZHhLUkGGtyQVZHhLUkGGtyQVZHhLUkGGtyQVZHhLUkGGtyQV1Ns7LA+fPMXC3gf76l5atRO+81cXAY+8Jakgw1uSCjK8Jakgw1uSCjK8Jakgw1uSCjK8Jakgw1uSCjK8Jakgw1uSCjK8Jakgw1uSCjK8Jakgw1uSCjK8Jakgw1uSCuoc3hHxzojIiPi+PguSJE23kiPv3cDft38lSWuoU3hHxCuBNwDvBXb1WpEkaaquR963AX+ZmU8Az0fETT3WJEmaomt47wbub8v3M+bUSUTsiYjFiFg8/cKpWdQnSRph6qfHR8QVwC3A9ohIYB2QEfFrmZnD22bmAeAAwIYtr8mXdCZJmokuR94/DXw8M1+dmQuZeQ1wHHhjv6VJksbpEt67gQeWtX0G7zqRpDUz9bRJZr55RNvd/ZQjSerCd1hKUkGGtyQVZHhLUkGGtyQVZHhLUkGGtyQVZHhLUkGGtyQVZHhLUkGGtyQVZHhLUkGGtyQVZHhLUkGGtyQVZHhLUkGGtyQVNPXDGFZr+9ZNLO7b2Vf3kvSy5pG3JBVkeEtSQYa3JBVkeEtSQYa3JBVkeEtSQYa3JBVkeEtSQYa3JBXU2zssD588xcLeB/vqXpLmzokL+K5yj7wlqSDDW5IKMrwlqSDDW5IKMrwlqSDDW5IKMrwlqSDDW5IKMrwlqSDDW5IKMrwlqSDDW5IKMrwlqSDDW5IKMrwlqSDDW5IK6hTeEXFVRHwiIp6MiIMR8Q8RcXvfxUmSRpsa3hERwOeAv83M6zLzJmAXsK3v4iRJo3X5GLRbgG9m5v6lhsx8Cvi93qqSJE3U5bTJDcCX+y5EktTdii9YRsTHIuJQRPzjiHV7ImIxIhZPv3BqNhVKkl6iS3g/Dty49CAzPwD8OHDl8g0z80Bm7sjMHes2bppdlZKkc3QJ7y8Bl0bE+4baNvZUjySpg6nhnZkJvBP4sYg4HhGPAPcCv9F3cZKk0brcbUJmPsPg9kBJ0hzwHZaSVJDhLUkFGd6SVJDhLUkFGd6SVJDhLUkFGd6SVJDhLUkFGd6SVJDhLUkFGd6SVJDhLUkFGd6SVJDhLUkFGd6SVJDhLUkFdfowhtXYvnUTi/t29tW9JL2seeQtSQUZ3pJUkOEtSQUZ3pJUkOEtSQUZ3pJUkOEtSQUZ3pJUkOEtSQX19g7LwydPsbD3wb66l1bthO/81UXAI29JKsjwlqSCDG9JKsjwlqSCDG9JKsjwlqSCDG9JKsjwlqSCDG9JKsjwlqSCDG9JKsjwlqSCDG9JKsjwlqSCDG9JKsjwlqSCOn0YQ0S8CnioPbwaOA082x7/YGZ+s4faJEljdArvzHweeB1ARNwF/E9mfqTHuiRJE3jaRJIKMrwlqaCZhndE7ImIxYhYPP3CqVl2LUkaMtPwzswDmbkjM3es27hpll1LkoZ42kSSCjK8JamgTrcKDsvMu3qoQ5K0Ah55S1JBhrckFWR4S1JBhrckFWR4S1JBhrckFWR4S1JBhrckFWR4S1JBhrckFWR4S1JBhrckFWR4S1JBhrckFWR4S1JBhrckFbTiD2PoavvWTSzu29lX95L0suaRtyQVZHhLUkGGtyQVZHhLUkGGtyQVZHhLUkGGtyQVZHhLUkGGtyQVFJnZT8cR/w0c6aXzfmwGnlvrIlbAevtlvf2y3vFenZlXTtuot7fHA0cyc0eP/c9URCxab3+st1/W2695rNfTJpJUkOEtSQX1Gd4Heuy7D9bbL+vtl/X2a+7q7e2CpSSpP542kaSKMnOmX8DbGdwieBTYO+v+R4x3DfA3wD8DjwO/2NqvAP4K+Fr79/LWHsDdrb5HgRuH+npX2/5rwLuG2m8CDrfvuZuzf7GMHKNj3euArwBfaI+vBR5uY3wSuKS1b2iPj7b1C0N93NHajwBvm7YPxo3RodbLgE8DXwX+Bbh5nucX+OX2XHgM+FPg0nmaX+CPgG8Ajw21rdl8ThpjQr0fbs+HR4EHgMtmPW+r2Tfj6h1a96tAApvnZX5XnX2z6GRZIB0DrgMuAQ4B189yjBFjblmaDOA7gSeA64EPLT1BgL3AB9vyrcBftAn9YeDhoYl/sv17eVtemvxH2rbRvvcnW/vIMTrW/SvAJzgb3p8CdrXl/cD72vL7gf1teRfwybZ8fZvfDe3Jf6zN/9h9MG6MDrXeC/x8W76EQZjP5fwCW4HjwHcM/czvnqf5BX4UuJFzw3DN5nPcGFPqfSuwvi1/cKivmc3bSvfNpHpb+zXAF4GnOBveaz6/q86+GQfpzcAXhx7fAdwxyzE61PB54CcY/Ebe0tq2MLjvHOAeYPfQ9kfa+t3APUPt97S2LcBXh9rPbDdujA41bgMeAm4BvtB26nNDL4Yz89iebDe35fVtu1g+t0vbjdsHk8aYUusmBmEYy9rncn4ZhPe/tRfd+ja/b5u3+QUWODcM12w+x40xqd5l624H7hv1mj+feVvpvplWL4O/Hl8LnOBseM/F/K7ma9bnvJdeOEuebm0XREQsAK9n8CfWVZn5TFv1deCqtjyuxkntT49oZ8IY03wU+HXg/9rjVwH/mZnfGjHGmbra+lNt+5X+HJPGmORa4FngjyPiKxHxhxHxCuZ0fjPzJPAR4F+BZxjM10Hmd36XrOV8nu/r9j0MjixXU+8sn/tjRcRtwMnMPLRsVYX5HemiuWAZEa8EPgP8Umb+1/C6HPy6yz7H7zpGRLwD+EZmHuyznhlaz+BP0D/IzNcD/8vgT8Iz5mx+LwduY/BL57uAVzA411rGPM3nNBFxJ/At4L7zLqonEbER+E3gty7UmBdiH846vE8yOK+0ZFtr61VEfDuD4L4vMz/bmv8jIra09VsYXMCYVOOk9m0j2ieNMcmPAD8VESeA+xmcOvld4LKIWPrvCobHOFNXW78JeH4VP8fzE8aY5Gng6cx8uD3+NIMwn9f5fQtwPDOfzcwXgc8ymPN5nd8lazmfq3rdRsS7gXcAP9vCajX1Tpq3le6bcb6bwS/zQ+11tw34ckRcvYp6L9j8TnW+512WnVNaz+DE/rWcvShxwyzHGDFmAH8CfHRZ+4c59+LBh9ryTs69ePBIa7+Cwbndy9vXceCKtm75BYpbJ42xgtrfxNkLln/GuRdt3t+WP8C5F20+1ZZv4NyLNk8yuCg0dh+MG6NDnX8HfG9bvqv93HM5v8APMbjTZGPr717gF+ZtfnnpOe81m89xY0yp9+0M7vC6ctl2M5u3le6bSfUuW3eCs+e852J+V5V9s+hk2cTcyuCOj2PAnbPuf8R4b2Dw58mjwD+1r1sZnBt7iMFtO389NPEBfKzVdxjYMdTXexjcznMU+Lmh9h0Mbjs7Bvw+Z28NGjnGCmp/E2fD+7r2pDjanswbWvul7fHRtv66oe+/s9V0hHbFe9I+GDdGhzpfByy2Of5cezLP7fwCv83gNrbHgI8zeJHPzfwyuH3xGeBFBn/ZvHct53PSGBPqPcrgPO7Sa27/rOdtNftmXL3L1p/g3FsF13R+V/vlOywlqaCL5oKlJL2cGN6SVJDhLUkFGd6SVJDhLUkFGd6SVJDhLUkFGd6SVND/Ax1WrYoY4J+0AAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "counts = Counter(data).items()\n",
    "pos = np.arange(4)+.5\n",
    "plt.barh(pos, [c for _, c in counts], align=\"center\")\n",
    "plt.yticks(pos, [n for n, _ in counts]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The differences in counts are now quite substantial. It is also interesting that the counts of A and T almost match, and so do the counts for G and C. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Time for some Formalism"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To describe a DNA sequence we will use lists of symbols from an alphabet $\\mathcal{N}=\\{{\\rm A},{\\rm C},{\\rm T},{\\rm G}\\}$. We will represent sequences with vectors and denote them with symbols like $\\vec{x}$, $\\vec{s}$ and $\\vec{t}$. Most often, we will be lazy and omit the arrow and use symbols like $x$, $s$ and $t$ for sequences.\n",
    "\n",
    "Consider now a DNA sequence $x$. This is a finite string from an alphabet $\\mathcal{N}$, such that $x=x_1 x_2 x_3 \\ldots x_n$, where $x_i$ is an element of a string such that $x_i\\in\\mathcal{N}$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multinomial Model of a Sequence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The simplest (and the least useful) model of a sequence is a multinomial model. This model simply contains the probabilities with which each of the nucleotides appears in the sequence. The model assumes that nucleotides are independently and identically distributed. In statistics this assumption is abbreviated as [i.i.d.](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables). Independently here means that the probability of encountering a symbol in a sequence does not depend on the sequence position or any of the neighbors in the sequence. Intuition can tell us that this should not be the case in the DNA, but yet multinomial model can still provide us a base to reason about the sequence and especially to reason how the sequences violate the i.i.d. property.\n",
    "\n",
    "The multinomial model of a DNA sequence will be defined as $\\vec{p}=(p_{\\rm A},p_{\\rm C},p_{\\rm T},p_{\\rm G})$, where $p_z=p(x_i=z)$ and $\\sum_{z\\in\\{\\rm A,C,T,G\\}}p_z=1$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Considering a multinomial model, the probability of a sequence is\n",
    "\n",
    "$$P(x)=\\prod_{i=1}^n p(x_i)$$ \n",
    "\n",
    "Do you think this probability will be large or small, or perhaps very very small? Why?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Estimation of Parameters of Multinomial Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's very simple. We will estimate the probabilities that define the multinomial model through relative frequencies:\n",
    "\n",
    "$$p_z={N_z\\over n}$$\n",
    "\n",
    "where $z\\in\\mathcal{N}$. Turns out that most often we will observe that $p_{\\rm A}\\approx p_{\\rm T}$ and $p_{\\rm C}\\approx p_{\\rm G}$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us construct a multinomial model for the DNA sequence of Mycobacterium tuberculosis, that is, for the last sequence we have read in the Python code above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "c = Counter(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "T: 0.172\n",
      "G: 0.327\n",
      "A: 0.172\n",
      "C: 0.329\n"
     ]
    }
   ],
   "source": [
    "ps = {z: count/len(data) for z, count in c.items()}\n",
    "print(\"\\n\".join(\"%s: %5.3f\" % (z, p) for z, p in ps.items()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finding Unusual DNA Words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More then probabilities of individual nucleotides we could be interested in probabilities of short subsequences of length $k$. Here, we will consider only dinucleotides, that is subsequences of length $k=2$. Say, we would like to observe if a subsequence CG appears more often then it would be expected by chance. By chance, using our multinomial model, the probability that for some random nucleotide we would get a combination of C and G (that is, a subsequence CG) is equal to $p_C\\times p_G$. That was easy. So what is then the probability of CG which we can estimate from the actual genome? We observe all the dinucleotides (there are $n-1$ of them) and count how many of them are equal to CG. That is, we estimate this probability as $N_{\\rm CG}/(n-1)$. We can now report on odds ratio, that is the ration between observed probability and expected probability of the dinucleotide, where the expected probability is coming from the multinomial model that was constructed under the assumption of i.i.d. That is, under the assumption that there is no order in the genome. Often, instead of the odds ratio, we would report on log odds ratio, as this is a symmetrical measure of surprise centered at 0.\n",
    "\n",
    "Time to implement this in code. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def tuple_walk(s, k=2):\n",
    "    \"\"\"Generate subsequences of length k.\"\"\"\n",
    "    for i in range(len(s) - (k-1)):\n",
    "        yield s[i:i+k]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's check if tuple walk actually works."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['AC', 'CC', 'CT', 'TA', 'AG', 'GG', 'GC', 'CT']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(tuple_walk(\"ACCTAGGCT\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great! Now we first need to get the multinomial model, and then we will estimate the probabilities for all the dinucleotides."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'T': 0.1719058141253424,\n",
       " 'G': 0.32746311258764527,\n",
       " 'A': 0.17194752299201274,\n",
       " 'C': 0.32868355029499957}"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "probs = {z: c/len(data) for z, c in Counter(data.seq).items()}\n",
    "probs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'TT': 0.0312555890460704,\n",
       " 'TG': 0.06286933039799561,\n",
       " 'GA': 0.06115133272326546,\n",
       " 'AC': 0.05886709171940535,\n",
       " 'CC': 0.09406326284457708,\n",
       " 'CG': 0.12734921277896494,\n",
       " 'AT': 0.037000306696246724,\n",
       " 'GG': 0.09234254502575183,\n",
       " 'GT': 0.05908560996171171,\n",
       " 'TC': 0.06086979780942262,\n",
       " 'CA': 0.06270702846698799,\n",
       " 'AG': 0.04490209861383724,\n",
       " 'GC': 0.11488347242714604,\n",
       " 'CT': 0.044564120710021075,\n",
       " 'AA': 0.031178064939360053,\n",
       " 'TA': 0.01691113583923586}"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prob_pairs = {z: c/(len(data)-1) for z, c in Counter(tuple_walk(str(data.seq))).items()}\n",
    "prob_pairs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let us compute the odds ratio for CG:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.1831938666933133"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "odds = prob_pairs[\"CG\"] / (probs[\"C\"] * probs[\"G\"])\n",
    "odds"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like CG dinucleotide appears more frequently than expected. The odds of 1.18 is quite close to 1.0. What would be the probability that we could get such a result by chance? Would you know how to apply your knowledge of statistics to answer this question?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Some Biological Backround"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is a good reason why we could be interested in CG dinucleotides. We are interested in areas – or islands – of DNA sequences where CG dinucleotides, popularly referred to as CpG, appear in abundance. Namely, mostly in higher-level organisms the part of the DNA around CpG gets methylated. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](figs/02-methylation.png \"Methylation.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[DNA Methylation](https://en.wikipedia.org/wiki/DNA_methylation) can prevent DNA to be transcribed. Wikipedia states that the rate of cytosine CpG DNA methylation differs strongly between species: 14% of cytosines are methylated in Arabidopsis thaliana, 4% in Mus musculus, 2.3% in Escherichia coli, 0.03% in Drosophila, and virtually none (< 0.0002%) in yeast. For mammals, this rate could be even higher, as between 60% and 90% of all CpGs are methylated! Methylation is a way of regulation of gene expression, where gene expression refers to the degree information encoded by a gene (ok, we have yet to define this term, next lesson will be on this) is used in the synthesis of corresponding protein. Methylation has been recently extensively studied within a field called [epigenetics](https://en.wikipedia.org/wiki/Epigenetics)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](figs/02-epigenetics.png \"Mechanisms of epigenetics.\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}