{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**Introduction to Bioinformatics**
\n", "A masters course by Blaž Zupan and Tomaž Curk
\n", "University of Ljubljana (C) 2016-2018\n", "\n", "Disclaimer: this work is a first draft of our working notes. The images were obtained from various web sites, but mainly from the wikipedia entries with explanations of different biological entities. Material is intended for our bioinformatics class and is not meant for distribution.\n", "\n", "## Lecture Notes Part 5\n", "\n", "# Sequence Alignment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Molecular Roots of Evolution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We, living creatures on earth, evolved. In fact, we are still\n", "evolving. The heritable characteristics of our population change over\n", "successive generations. When cells divide, the DNA of a cell is\n", "replicated: two identical replicates of the DNA are made from one\n", "original DNA molecule. Well, not totally identical. When replicating\n", "human DNA, for instance, there are at least a few errors left after\n", "the replications. Like, on average, there are about seven nucleotides\n", "that do not match those from the original DNA. The cell includes\n", "machinery to detect and repair such mistakes, but even that machinery\n", "sometimes overlooks the mistakes and this could be passed to the\n", "organisms of the next generation. These replication mistakes are called\n", "mutations. Mutations can be harmless (changes in \"junk\" part\n", "of the DNA) or harmful (the organism with a mutated DNA does not\n", "survive), but some could be beneficial and make the organism better\n", "fit than the old generation. Organisms with better fitness would\n", "proliferate throughout the population. This process, called evolution,\n", "happens all the time, for all organisms on Earth.\n", "\n", "Our description of molecular background of evolution is very\n", "simplistic. But will suffice for now. In the previous lecture we\n", "learned about genes, units of heredity and segments of the DNA that code\n", "for proteins. Particularly interesting are mutations that change the\n", "coding sequence of the gene, as the resulting protein may\n", "have a different sequence of amino acids. It is \"may\", as different\n", "sequence of nucleotides may result in the same amino acid chain (think\n", "about the number of different codons for each amino acid and\n", "redundancy in the DNA-protein coding tables). \n", "\n", "Two different species of organisms with a common ancestor may\n", "therefore have some differences in the genome sequence, and therefore\n", "may have (slightly) different genes that encode similar proteins. If\n", "these genes are coming from the same ancestor gene, two\n", "gene variants are called *orthologous genes*. The difference between\n", "genes arises due to nucleotide substitution, insertion of the \n", "sequence or deletion of the DNA sequence. The later two types of\n", "changes in the DNA are called indels. Another kind of mistake in the DNA\n", "replication appears where a part of the DNA is duplicated, that is,\n", "mistakenly appears several times in the new genome. As there could be\n", "now several genes encoding the same protein, all but one of these\n", "genes can be changed with no harm. Multiple copies of the same genes\n", "that are then, within a single species, changed accross the evolution,\n", "give raise to gene families, a set of similar genes that are all stem from the\n", "same predecessor. Such genes are called *parlogous genes*. Orthologs\n", "and paralogs are jointly referred as homologous genes.\n", "\n", "There is a great value of finding homologous genes within the same\n", "species or accross the species. To find them, we need to define the\n", "distance between sequences and means to express these distances\n", "numerically. Sequences of DNA that we would like to compare need first\n", "to be aligned to expose the similarities and differences. This\n", "procedure is referred to as *sequence alignment*. Sequence alignment\n", "and scoring of similarity of sequences give rise to many different\n", "procedures of bioinformatics, and help us in tasks such as:\n", "\n", "* gene function prediction: knowing the function of a gene in one\n", " species, and finding a similar gene in the other species, we can now\n", " make an inference,\n", "* gene finding: knowing genome structure in one organism can help us\n", " annotate the gene of another organism\n", "* phylogeny: based on comparison of gene sequences of organisms we can\n", " infer which organisms had common predecessors, and we reconstruct\n", " the tree of life,\n", "* sequence assembly: knowing the sequence of one genome we can\n", " assemble the sequence of another gene from small fragments of the\n", " DNA that have been sequence using modern high-throughput sequencing\n", " methods.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## So What Is Sequence Alignment?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider two sequences $s$ and $t$:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "s, t = (\"ATACGTA\", \"TATGATA\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and consider their possible alignment:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "A1 = (\"ATACGTA-\", \"-TATGATA\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us pretty-print the alignment:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def pp_alignment(s, t):\n", " print(s)\n", " print(\"\".join(\"|\" if si==ti else \" \" for si, ti in zip(s, t)))\n", " print(t)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ATACGTA-\n", " || | \n", "-TATGATA\n" ] } ], "source": [ "pp_alignment(*A1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The aligned sequences have the same length. We actually talk about the\n", "length of alignment. In our case, $|A_1(s,t)|=c=7$. The symbol \"-\"\n", "denotes an indel, a skip in the sequence due to deleted nucleotide in\n", "one sequence or added nucleotide in the other one.\n", "\n", "Consider a different alignment:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ATACG-TA\n", " || | ||\n", "-TATGATA\n" ] } ], "source": [ "A2 = (\"ATACG-TA\", \"-TATGATA\")\n", "pp_alignment(*A2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and another one" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-AT-ACGTA\n", " || | ||\n", "TATGA--TA\n" ] } ], "source": [ "A3 = (\"-AT-ACGTA\", \"TATGA--TA\")\n", "pp_alignment(*A3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which alignment is better? We need some means to quantitatively assess the alignments and compute the alignment score." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Scoring Function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let $x$ and $y$ be aligned sequences, that is, $x$ a version of $s$\n", "after the alignment, and $y$ a version of $t$ after the\n", "alignment. Let $x_i$ be an $i$-th symbol of aligned sequence $x$, and $y_i$ be an\n", "$i$-th symbol of aligned sequence $y$. We define a scoring function\n", "for position $i$ to be $\\sigma(x_i,y_i)$. The alignment score is then\n", "defined as:\n", "\n", "$$M(A(s,t))=\\sum_{i=1}^c \\sigma(x_i, y_i)$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All we need now is a concrete scoring function $\\sigma$. Let us define one, and score our alignments. We will first use a very simple scoring function, which returns -1 if the symbols are different, and 1 if the symbols at some position of the alignment are the same." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def simple_sigma(a, b):\n", " return -1 if a != b else 1" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def M(A, sigma):\n", " return sum(sigma(xi, yi) for xi, yi in zip(*A))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ATACGTA-\n", " || | \n", "-TATGATA\n", "Score: -2\n", "\n", "ATACG-TA\n", " || | ||\n", "-TATGATA\n", "Score: 2\n", "\n", "-AT-ACGTA\n", " || | ||\n", "TATGA--TA\n", "Score: 1\n", "\n" ] } ], "source": [ "alignments = [A1, A2, A3]\n", "for A in alignments:\n", " pp_alignment(*A)\n", " print(\"Score: %d\\n\" % M(A, simple_sigma))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best alignment, according to out scoring function `simple_sigma` is alignment $A_2$. What if change the scoring function? Say, assuming that indels really hurt, because they can change the whole reading frame, we should punish them more than we punish a mismatch of the nucleotides. Let us try is out:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def punished_indels(a, b):\n", " return -2 if (a == \"-\" or b == \"-\") else (-1 if a != b else 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, a check if this works." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(-1, -2, 2)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "punished_indels(\"A\", \"C\"), punished_indels(\"-\", \"T\"), punished_indels(\"G\", \"G\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fine, looks like it works. Now we use the scoring function `punished_indels` to score the alignments:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ATACGTA-\n", " || | \n", "-TATGATA\n", "Score: -1\n", "\n", "ATACG-TA\n", " || | ||\n", "-TATGATA\n", "Score: 5\n", "\n", "-AT-ACGTA\n", " || | ||\n", "TATGA--TA\n", "Score: 2\n", "\n" ] } ], "source": [ "for A in alignments:\n", " pp_alignment(*A)\n", " print(\"Score: %d\\n\" % M(A, punished_indels))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, alignment $A_2$ looks best, but now the margin is bigger." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Where Are We Now?\n", "\n", "We can score the alignments. Given the scoring functions. The scoring\n", "functions can be more complicated, but for now, the ones we have\n", "define will do fine. One problem though. How do we get an alignment?\n", "We would actually like to find an alignment $A^*$ that, given a scoring\n", "function, maximizes the alignment score $M$, such that there is no\n", "other alignment $A$ where $M(A^*)