{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Introduction to Bioinformatics** \n",
"A masters course by Blaž Zupan and Tomaž Curk \n",
"University of Ljubljana (C) 2016-2018\n",
"\n",
"Disclaimer: this work is a first draft of our working notes. The images were obtained from various web sites, but mainly from the wikipedia entries with explanations of different biological entities. Material is intended for our bioinformatics class and is not meant for distribution.\n",
"\n",
"## Lecture Notes Part 2\n",
"# The First Look at the Genome"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Central Dogma of Molecular Biology"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The central dogma of molecular biology defines the flow of genetic information within biological systems. In simple terms, the central dogma says that \"DNA makes RNA makes protein.\" A bit longer: DNA encodes the information about the sequence of amino acids in proteins. To construct a protein from amino acids, the information – protein coding sequence – from DNA is first transcribed to messenger RNA and then translated to protein. Central dogma also claims that this is the direction how the information flows. That is, the information stored as a sequence of amino acids in proteins never gets translated back to the DNA."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Constitution of the DNA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A DNA is a long molecule composed of nucleotides, a sugar called deoxyribose, and a phosphate group. There are four different nucleotides: cytosine (C), guanine (G), adenine (A), and thymine (T). A DNA consists of two strands coiled around each other to form a double helix. The two strands are joined to each other by weak hydrogen bonds, where adenine can bound to thymine and cytosine to guanine.\n",
"\n",
"Each strand is composed of alternating sugar and phosphate group, forming a sugar-phosphate backbone using a strong covalent bond. Nucleotides are attached to the backbone, one nucleotide per one sugar group. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](figs/01-dna-in-close.png \"Components and bonds in the DNA.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because of complementarity of the two strains, one can use one strain to infer the composition of another strain. Both strands carry the same information.\n",
"\n",
"The difference in strength of the bonds turns DNA into a sort of a zipper. It does not take much energy to unzip the DNA, that is, open the double helix, while the bonds that hold the backbone together are strong and protect the sequence against any damages."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Directionality"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It turns out that two strains of the DNA run in a different direction. To start thinking about the direction, we have to examine a sugar group from a sugar-phosphate backbone. To make life more comfortable, and to put some order in the chemical formulas, chemists have decided to number the carbon atoms in the sugar and label them with \"prime\" notation, that is, using a number and a prime sign, like 1' (one-prime) and 3' (three prime). Here is a depiction of the atom numbering for a sugar group:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](figs/02-sugar.png \"Sugar group and atom numbering.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sugar is attached to the backbone with a 5'-end (five prime end), where we find a phosphate group, and with a 3'-end on the other side. Based on this nomenclature, the DNA has a 5' and a 3' end, according to the carbon atom of the sugar that ends the chain. Notice that if one strand runs from 5' to 3', the other, complementary one, runs in the opposite direction, from 3' to 5'."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Genome Size"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The complete DNA sequence defines what we call a genome. The genome is, therefore, the total genetic information contained within the cell. That includes the DNA in the nucleus and DNA in any of the organelles. Huh, this is new: turns out that some of the organelles also include DNA. In animals, such organelle is mitochondria, and in plants, this is the chloroplast. Despite this, when we will refer to a genome for eukaryotes, we will usually mean the DNA in the nucleus, and we will refer to the genetic material in the mitochondria as a mitochondrial genome. DNA in the nucleus is most often not a single molecule, but rather broken into pieces and organized within the chromosomes. Human has 23 pairs of chromosomes. But do not worry about chromosomes at this point (and at least not for a few next lectures).\n",
"\n",
"So what is the total length of the DNA sequence? It depends on an organism. Prokaryote (bacteria) have the shortest genome. We measure the length of the DNA in the base pairs (bp), which is a unit consisting of two nucleobases bound to each other by hydrogen bonds. Simply stated, one base pair, one nucleotide on each strand. The total number of nucleotides in one of the strands is the size (length) of the genome. Bacteria have genomes ranging from 0.5 to 13 Mbp in length. The unit Mbp means megabase pairs, or $10^6$ base pairs. We most often use notation Mb for mega base pairs. Genomes of eukaryotes are large and range from 8Mb to 670Gb. Viruses have a much smaller genome of the size from 5 to 50kb. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](figs/02-genome-size.png \"Sizes of the genomes.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And us, human? Our genome contains about 3,200 Mb, or 3,2 Gb. For comparison, there are 76,944 words in the first Harry Potter book (Harry Potter and the Philosopher's Stone). An average length of an English word is just over 5. The estimated number of letters in HP1 is therefore 384,720. At the letter level (representing nucleotides with a single letter), we would need to use equivalent of 8,300 Philosopher's Stones books to write a human genome. My version of HP1 is about 2cm thick, which would make a genome stack – one book on the top of another – for 166 meters. If human DNA were stretched out, it would form a very thin thread about 2 meters long."
]
},
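{
"cell_type": "markdown",
"metadata": {},
"source": [
"The book comparison can be retraced as a short worked calculation. This is only a sketch that reuses the rounded figures quoted in the text (3.2 Gb, 76,944 words, 5 letters per word, 2 cm per book):\n",
"\n",
"$$\\begin{aligned}\n",
"\\text{letters per book} &\\approx 76{,}944 \\times 5 = 384{,}720 \\\\\n",
"\\text{books per genome} &\\approx \\frac{3.2 \\times 10^{9}}{384{,}720} \\approx 8{,}300 \\\\\n",
"\\text{stack height} &\\approx 8{,}300 \\times 2\\ \\text{cm} = 16{,}600\\ \\text{cm} = 166\\ \\text{m}\n",
"\\end{aligned}$$\n",
"\n",
"The stretched-out length can be estimated in the same spirit. Two values here are assumptions not stated in these notes: a spacing of about 0.34 nm between neighboring base pairs, and two copies of the genome per cell:\n",
"\n",
"$$2 \\times 3.2 \\times 10^{9} \\times 0.34\\ \\text{nm} \\approx 2.2\\ \\text{m}$$"
]
},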
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A Small Start"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us start small. With a virus. [Lambda phage](https://en.wikipedia.org/wiki/Lambda_phage) is a bacterial virus that infects and destroys bacteria Escherichia coli. This strange names for species use binomial nomenclature, or a Latin name, which is composed of genus and species. Like us, we are Homo sapiens. But, back to Lambda phage. One of the sources for its genome information is [GenBank](https://www.ncbi.nlm.nih.gov/genbank/), a sequence database maintained by The National Center for Biotechnology Information (NCBI). GenBank has a [page with information on Lambda phage genome](https://www.ncbi.nlm.nih.gov/nuccore/NC_001416.1). If you wonder what different fields on this page mean, check out [Sample GenBank Record page](https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) and click on any of the terms. The page gives information on people and groups that have sequenced and published the genome, on the structural composition of the genome, and finally, about the actual DNA sequence. It is in this latter that we are interested. While all this information is available through the web page, we are more curious about programmatic access to this information. Let us fetch the sequence and check on its length:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from Bio import Entrez, SeqIO\n",
"\n",
"Entrez.email = \"blaz.zupan@fri.uni-lj.si\"\n",
"with Entrez.efetch(db=\"nucleotide\", id=\"nc_001416\", rettype=\"gb\") as handle:\n",
" data = SeqIO.read(handle, \"genbank\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SeqRecord(seq=Seq('GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCG...ACG', IUPACAmbiguousDNA()), id='NC_001416.1', name='NC_001416', description='Enterobacteria phage lambda, complete genome', dbxrefs=['BioProject:PRJNA485481'])"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"48502"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The genome of Lambda phage has 48,502 base pairs. We have obtained only one strand; the other one is merely a reverse complement. Genome starts with three guanine nucleotides. Which nucleotides are the most prevalent in Lambda phage genome?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'G': 12820, 'C': 11362, 'A': 12334, 'T': 11986})"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from collections import Counter\n",
"Counter(data)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAW8AAAD8CAYAAAC4uSVNAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAADFdJREFUeJzt3V2sZWddx/Hf36kUC2ba0qbUgXDaxJcUq9BOVCIoIAq2RCAxpo0XIJgmQIwviTqEG7yyvFwgSiyN0VQCFOTNhKpEwURNTOsZFdoihZYWZQSBJo4vvQDq48VZU/ZMz5mzh+519vxnPp/kZNZee83q86x1znf2WXvv7hpjBIBevm3dAwDg1Ik3QEPiDdCQeAM0JN4ADYk3QEPiDdCQeAM0JN4ADZ0z144vuuiisbGxMdfuAc5Ihw8f/uoY4+Ldtpst3hsbG9nc3Jxr9wBnpKr6/DLbuWwC0JB4AzQk3gANiTdAQ+IN0JB4AzQk3gANiTdAQ7O9SefOI0ezcei2uXYP8Jg9cOO16x7Ct8wjb4CGxBugIfEGaEi8ARoSb4CGxBugIfEGaEi8ARoSb4CGxBugIfEGaEi8ARoSb4CGxBugIfEGaEi8ARpa6sMYqupJST423XxykoeTfGW6/UNjjK/NMDYAdrBUvMcYDyZ5RpJU1RuS/M8Y4y0zjguAk3DZBKAh8QZoaKXxrqobqmqzqjYffujoKncNwIKVxnuMcfMY4+AY4+C+8/avctcALHDZBKAh8QZoaKmXCi4aY7xhhnEAcAo88gZoSLwBGhJvgIbEG6Ah8QZoSLwBGhJvgIbEG6Ah8QZoSLwBGhJvgIbEG6Ah8QZoSLwBGhJvgIbEG6ChU/4whmVdeWB/Nm+8dq7dA5zVPPIGaEi8ARoSb4CGxBugIfEGaEi8ARoSb4CGxBugIfEGaGi2d1jeeeRoNg7dNtfuAfbEA6fpO8U98gZoSLwBGhJvgIbEG6Ah8QZoSLwBGhJvgIbEG6Ah8QZoSLwBGhJvgIbEG6Ah8QZoSLwBGhJvgIbEG6ChpeNdVS+tqlFV3zfngADY3ak88r4+yd9NfwKwRkvFu6qemOTZSV6V5LpZRwTArpZ95P2SJH8xxvhMkger6uoZxwTALpaN9/VJbp2Wb80Ol06q6oaq2qyqzYcfOrqK8QGwjV0/Pb6qLkzy/CRXVtVIsi/JqKpfH2OMxW3HGDcnuTlJzr30u8ejdgbASizzyPtnk7xzjPG0McbGGOOpSe5P8px5hwbATpaJ9/VJPnTCug/Eq04A1mbXyyZjjOdts+5t8wwHgGV4hyVAQ+IN0JB4AzQk3gANiTdAQ+IN0JB4AzQk3gANiTdAQ+IN0JB4AzQk3gANiTdAQ+IN0JB4AzQk3gAN7fphDN+qKw/sz+aN1861e4CzmkfeAA2JN0BD4g3QkHgDNCTeAA2JN0BD4g3QkHgDNCTeAA3N9g7LO48czcah2+baPUCS5IGz9J3cHnkDNCTeAA2JN0BD4g3QkHgDNCTeAA2JN0BD4g3QkHgDNCTeAA2JN0BD4g3QkHgDNCTeAA2JN0BD4g3Q0NLxrqonV9WtVXVfVR2uqj+rqu+Zc3AAbG+pT9KpqkryoSS3jDGum9b9YJJLknxmvuEBsJ1lPwbteUm+Psa46diKMcYn5hkSALtZ9rLJ9yc5POdAAFjeSp+wrKobqmqzqjYffujoKncNwIJl4313kqt322iMcfMY4+AY4+C+8/Y/tpEBsKNl4/3xJOdW1Q3HVlTVD1TVc+YZFgAns1S8xxgjycuSvGB6qeDdSX47yZfmHBwA21v21SYZY/x7kp+bcSwALMk7LAEaEm+AhsQboCHxBmhIvAEaEm+AhsQboCHxBmhIvAEaEm+AhsQboCHxBmhIvAEaEm+AhsQboCHxBmho6Q9jOFVXHtifzRuvnWv3AGc1j7wBGhJvgIbEG6Ah8QZoSLwBGhJvgIbEG6
Ah8QZoSLwBGprtHZZ3HjmajUO3zbV7gNPSA3v0znKPvAEaEm+AhsQboCHxBmhIvAEaEm+AhsQboCHxBmhIvAEaEm+AhsQboCHxBmhIvAEaEm+AhsQboCHxBmhoqXhX1SVV9e6q+lxVHa6qv6+ql809OAC2t2u8q6qSfDjJ34wxLh9jXJ3kuiRPmXtwAGxvmY9Be36Sr40xbjq2Yozx+SS/O9uoADipZS6bPD3JP849EACWd8pPWFbV26vqE1X1D9vcd0NVbVbV5sMPHV3NCAF4lGXifXeSq47dGGO8NslPJLn4xA3HGDePMQ6OMQ7uO2//6kYJwHGWiffHkzy+ql69sO68mcYDwBJ2jfcYYyR5aZIfr6r7q+qOJLck+c25BwfA9pZ5tUnGGF/M1ssDATgNeIclQEPiDdCQeAM0JN4ADYk3QEPiDdCQeAM0JN4ADYk3QEPiDdCQeAM0JN4ADYk3QEPiDdCQeAM0JN4ADS31YQzfiisP7M/mjdfOtXuAs5pH3gANiTdAQ+IN0JB4AzQk3gANiTdAQ+IN0JB4AzQk3gAN1Rhjnh1X/XeSe2bZ+d66KMlX1z2Ix+hMmENiHqebM2Eep+McnjbGuHi3jWZ7e3ySe8YYB2fc/56oqs3u8zgT5pCYx+nmTJhH5zm4bALQkHgDNDRnvG+ecd976UyYx5kwh8Q8TjdnwjzazmG2JywBmI/LJgANrTzeVfWiqrqnqu6tqkOr3v9jVVVPraq/rqpPVdXdVfXL0/oLq+ovq+qz058XTOurqt42zeeTVXXVwr5ePm3/2ap6+Rrmsq+q/qmqPjLdvqyqbp/G+t6qety0/tzp9r3T/RsL+3jdtP6eqnrhGuZwflW9v6o+XVX/UlXPanoufnX6frqrqt5TVY/vcD6q6g+r6stVddfCupUd/6q6uqrunP7O26qq9nAeb56+rz5ZVR+qqvMX7tv2OO/Ur53O5VqNMVb2lWRfkvuSXJ7kcUk+keSKVf43VjDGS5NcNS1/Z5LPJLkiyZuSHJrWH0ryxmn5miR/nqSS/EiS26f1Fyb53PTnBdPyBXs8l19L8u4kH5luvy/JddPyTUlePS2/JslN0/J1Sd47LV8xnaNzk1w2nbt9ezyHW5L84rT8uCTndzsXSQ4kuT/Jdyych1d0OB9JfizJVUnuWli3suOf5I5p25r+7k/v4Tx+Ksk50/IbF+ax7XHOSfq107lc59eqD+Czknx04fbrkrxu3ZPcZcx/muQns/WGokundZdm63XqSfKOJNcvbH/PdP/1Sd6xsP647fZg3E9J8rEkz0/ykemH46sL36yPnIskH03yrGn5nGm7OvH8LG63R3PYn63o1Qnru52LA0n+bYrXOdP5eGGX85Fk44ToreT4T/d9emH9cdvNPY8T7ntZkndNy9se5+zQr5P9bK3za9WXTY59Ex/zhWndaWn6dfWZSW5PcskY44vTXV9Kcsm0vNOc1j3Xtyb5jST/N91+UpL/HGN8Y5vxPDLW6f6j0/brnsNlSb6S5I+myz9/UFVPSLNzMcY4kuQtSf41yRezdXwPp9/5OGZVx//AtHzi+nV4ZbYe+SenPo+T/WytzVn7hGVVPTHJB5L8yhjjvxbvG1v/vJ62L8Opqhcn+fIY4/C6x/IYnZOtX3V/f4zxzCT/m61f0x9xup+LJJmuCb8kW/8YfVeSJyR50VoHtSIdjv9uqur1Sb6R5F3rHssqrTreR5I8deH2U6Z1p5Wq+vZshftdY4wPTqv/o6oune6/NMmXp/U7zWmdc/3RJD9TVQ8kuTVbl05+J8n5VXXsf3mwOJ5Hxjrdvz/Jg1n/+fpCki+MMW6fbr8/WzHvdC6S5AVJ7h9jfGWM8fUkH8zWOep2Po5Z1fE/Mi2fuH7PVNUrkrw4yc9P/xAlpz6PB7PzuVyfFV9zOidbT1Zclm9e8H/6uq8NnTDGSvLHSd56wvo35/gnad40LV+b45+kuWNaf2G2rtdeMH3dn+TCNcznuf
nmE5Z/kuOfVHnNtPzaHP8E2fum5afn+CduPpe9f8Lyb5N877T8huk8tDoXSX44yd1JzpvGdkuSX+pyPvLoa94rO/559BOW1+zhPF6U5FNJLj5hu22Pc07Sr53O5Tq/5jiA12TrFRz3JXn9uie4zfiena1fAz+Z5J+nr2uydV3rY0k+m+SvFr75Ksnbp/ncmeTgwr5emeTe6esX1jSf5+ab8b58+mG5d/pmO3da//jp9r3T/Zcv/P3XT3O7JzO9EmCX8T8jyeZ0Pj48/fC3OxdJfivJp5PcleSdUxhO+/OR5D3Zuk7/9Wz9JvSqVR7/JAenY3Jfkt/LCU9OzzyPe7N1DfvYz/lNux3n7NCvnc7lOr+8wxKgobP2CUuAzsQboCHxBmhIvAEaEm+AhsQboCHxBmhIvAEa+n/C8YRvEm4rVwAAAABJRU5ErkJggg==\n",
"text/plain": [
"