{
 "metadata": {
  "name": "",
  "signature": "sha256:11c6bb8a3947ad99feb3e36491915012fd98c3b557a2f0c42f40ba6661607dd5"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**NOTE**: This chapter is very basic at this stage, and needs work on text and content. It is slated for expansion. In the meantime, if you're looking for extra information on this topic I recommend [Chapter 27 of *Evolution*](http://evolution-textbook.org/content/free/contents/ch27.html) (free), and [*Inferring Phylogeny* by Felsenstein](http://www.amazon.com/Inferring-Phylogenies-Joseph-Felsenstein/dp/0878931775/ref=sr_1_1?s=books&ie=UTF8&qid=1397401191&sr=1-1&keywords=inferring+phylogenies). You can add suggestions for content, note issues, and follow progress on it updates, in [issue #25](https://github.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/issues/25)."
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Phylogenetic reconstruction"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "What: \n",
      "\n",
      "The process of inferring the evolutionary relationships between organisms or groups of organisms, and usually representing that information as a tree. \n",
      "\n",
      "Why:\n",
      "\n",
      "1. Make functional inferences about genes. \n",
      "2. Understand relationships between organisms (and therefore possible similarities between them). \n",
      "3. Compare the composition of communities of organisms (but we'll come back to this).\n",
      "\n",
      "How:\n",
      "\n",
      "By comparing traits of extant organisms. In our case, traits are columns in a multiple sequence alignment. Many algorithms and tools exist for achieving this, and they vary widely in runtime and quality of results. We're going to begin by learning about one of the oldest and simplest methods for doing this: *Unweighted Pair Group Method with Arithmetic Mean* or UPGMA. (Don't be scared by the name - it's actually fairly simple.) \n",
      "\n",
      "UPGMA is a heirarchical clustering algorithm. It is widely used, though it's application in phylogenetics is usually restricted to building preliminary trees to \"guide\" the process of multiple sequence alignment, as it makes some assumptions that don't work well for inferring relationships between organsims. We're going to start with it here however for a few reasons. First, the underlying math is very basic, so we don't need to assume anything about your background. Second, there are some other applications of UPGMA that we'll explore later, including grouping samples based on their species compositions. In general, my strategy with teaching this material is to start by giving you a basic introdution into how a process works so you can visualize and do it. From there, we can get more complex."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Some terminology"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here's the goal (well, *a* goal, but this is the basic idea):\n",
      "\n",
      "<img src=\"files/images/basic-rooted-tree1.jpg\" width=600>\n",
      "\n",
      "Each *leaf* (or *tip*, or *terminal node*) in this tree represents a sequence, and the length of the horizonal branches between them indicate their dissimilarity to one another. This is a *rooted tree*, which means that it includes an assumption about the last common ancestor of all sequences represented in the tree. \n",
      "\n",
      "Note that the vertical lines in this tree are used for layout purposes only - they do not represent dissimilarity between sequences.\n",
      "\n",
      "An **unrooted trees**, like the following, doesn't include an assumption about the last common ancestor of all sequences:\n",
      "\n",
      "<img src=\"files/images/basic-unrooted-tree1.jpg\" width=600>\n",
      "\n",
      "\n",
      "**Terminal nodes, tips or leaves** extant organisms, frequently called operational taxonomic units or OTUs. OTUs are families of related organisms. \n",
      "\n",
      "**Internal nodes** hypothetical ancestors - we postulate their existence but often don\u2019t have direct evidence.\n",
      "\n",
      "**Clade** a node and all nodes \"below\" it (i.e., toward the tips)\n",
      "\n",
      "**Root** the internal node defining the clade which contains all nodes in the tree\n",
      "\n",
      "**Branches** representative of the distance between the nodes.\n",
      "\n",
      "<img src=\"files/images/tree-schematic1.png\">\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Monophyletic group** the last common ancestor was a member of the group (e.g., multicellular organisms)\n",
      "\n",
      "<img src=\"files/images/tree-monophyly.png\">\n",
      "\n",
      "**Polyphyletic group** the last common ancestor was not a member of the group (e.g., flying animals)\n",
      "\n",
      "<img src=\"files/images/tree-polyphyly.png\">"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We assume that as time progresses, sequences will diverge such that more similar sequences have diverged more recently. The problem of phylogenetic reconstruction however is that we only have the tips. We don't have sequences for the internal nodes, and so we use modern sequences to develop a hypothesis about the evolutionary history of a sequence (and hopefully of the organisms who encode those sequences in their genomes).\n",
      "\n",
      "<img src=\"files/images/sequence-evo-tree.png\">"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "How many (rooted) trees are there for `n` sequences? This topic is discussed in detail in Chapter 3 of [Inferring Phylogenies](http://www.amazon.com/Inferring-Phylogenies-Joseph-Felsenstein/dp/0878931775/ref=sr_1_1?s=books&ie=UTF8&qid=1393288952&sr=1-1&keywords=inferring+phylogenies), the definitive text on this topic, but the basic answer is **a lot**.\n",
      "\n",
      "Because of the massive number of possible trees for any reasonable number of sequences, we can't just search all trees to figure out which one best matches our data. Instead, we must take a heurtistic approach to tree building. Some design features of heurtistic methods that are used in practice are that they:\n",
      "\n",
      "1. Look at a subset of the possible trees, and don\u2019t guarantee to find the best tree. \n",
      "2. Scale to trees for many OTUs (how well they scale depends on the method, and there is a lot of variability)\n",
      "3. Often provide a single tree, so do not include information on how likely other tree topologies are (we\u2019ll talk about methods, such as bootstrapping, to address this). \n"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Distances and distance matrices"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Computing a UPGMA tree for a group of sequences relies on first computing *distances* between each pair of those sequences. A *distance*, in this sense, is a technical term. It's a measure of dissimilarity between two items, `x` and `y`, which meets a few criteria:\n",
      "\n",
      "1. `d(x,y) >= 0` (non-negativity)\n",
      "2. `d(x,y) = 0` [`iff`](http://en.wikipedia.org/wiki/If_and_only_if) `x = y` (identity of indiscernibles)\n",
      "3. `d(x,y) = d(y,x)` (symmetry) \n",
      "4. `d(x,z) <= d(x,y) + d(y,z)` (triangle inequality)\n",
      "\n",
      "Let's start with something more similar that sequences. Let's compute the distances between points in a Cartesian plane, to explore what each of these mean.  \n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from __future__ import print_function\n",
      "\n",
      "from IPython.core import page\n",
      "import matplotlib.pyplot as plt\n",
      "\n",
      "page.page = print\n",
      "\n",
      "x_vals = [1, 4, 2]\n",
      "y_vals = [3, 3, 6]\n",
      "coordinates = zip(x_vals, y_vals)\n",
      "\n",
      "fig, ax = plt.subplots()\n",
      "ax.scatter(x_vals, y_vals)\n",
      "ax.set_xlim(0)\n",
      "ax.set_ylim(0)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 1,
       "text": [
        "(0, 6.5)"
       ]
      },
      {
       "metadata": {},
       "output_type": "display_data",
       "png": "iVBORw0KGgoAAAANSUhEUgAAAW0AAAD7CAYAAAChScXIAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAADTZJREFUeJzt3W2IZGeZxvHrmumEjMY1SJZENyMDCwsKkkRCDBr1rGJP\nCG52w/hBQQ0BdRHEoFGjIkz5RQkyxDcUQ4yMrmQ/JMxi1mglSg6JiOPbzOZlxtWIgcSXCEajcSI4\n5vZD1Yxtd1ed53TPOafv7v8PmlTVeerUxTOnrj79nK6OI0IAgBy2DR0AAFCO0gaARChtAEiE0gaA\nRChtAEiE0gaARBbWuwPb/M4gAKxBRLjtc07JmXZEbLivvXv3Dp6BTGTairnIVPa1ViyPAEAilDYA\nJLJpS7uqqqEjrECmMmQqtxFzkalbXs/aijS5ELnefQDAVmNbMdSFSABAPyhtAEiE0gaARChtAEiE\n0gaARChtAEiksbRtn2X7VttHbR+xfUkfwQAAK5X8wahPSLojIl5ne0HSMzvOBACYYe6Ztu1nS3p5\nRNwsSRFxPCKe6CUZtpzxeKzFxT1aXNyj8Xg8dBxgQ5r7iUjbF0j6nKQjks6X9ANJ10TEsSVj+EQk\n1m08HuvKK6/SU09dL0naseM6HTiwX7t37x44GdCNrj4RuSDpxZI+ExEvlvRHSe9fQz5grn37bpwW\n9lWSJuW9b9+NQ8cCNpymNe1HJT0aEd+b3r9Vq5T2aDQ6ebuqqk31x1kA4FSo61p1Xa97P41/MMr2\nPZLeEhE/tj2StCMirluyneURrBvLI9hq1ro8UlLa50u6SdLpkn4q6eqlFyMpbZwq4/H45JLItde+\njcLGptZZaRe8MKUNAC3xp1kBYAugtAEgEUobABKhtAEgEUobABKhtAEgEUobABKhtAEgEUobABKh\ntAEgEUobABKhtAEgEUobABKhtAEgEUobABKhtAEgEUobABKhtAEgEUobABKhtAEgEUobABKhtAEg\nEUobABKhtAEgEUobABKhtAEgkYWSQbYflvR7SX+R9OeIuLjLUACA1RWVtqSQVEXE412GAQDM12Z5\nxJ2lAAAUKS3tkPQN29+3/dYuAwEAZitdHnlZRPzS9j9Kusv2jyLi3hMbR6PRyYFVVamqqlMaEgCy\nq+tadV2vez+OiHZPsPdKejIi9k3vR9t9AMBWZ1sR0XrZuXF5xPYzbD9revuZkhYl3d8+IgBgvUqW\nR86RdMD2ifFfjog7O00FAFhV6+WRFTtgeQQAWutseQQAsHFQ2gCQCKUNAIlQ2gCQCKUNAIlQ2gCQ\nCKUNAIlQ2gCQCKUNAIlQ2gCQCKUNAIlQ2gCQCKUNAIlQ2gCQCKUNAIlQ2gCQCKUNAIlQ2gCQCKUN\nAIlQ2gCQCKUNAIlQ2gCQCKUNAIlQ2gCQCKUNAIkUlbbt7bYP2b6960AAgNlKz7SvkXREUnSYBQDQ\noLG0bZ8n6XJJN0ly54kAADOVnGnfIOm9kp7uOAsAoMHCvI22Xyvp1xFxyHY1a9xoNDp5u6oqVdXM\noQCwJdV1rbqu170fR8xeprb9EUlvknRc0hmS/kHSbRHx5iVjYt4+AAAr2VZEtF5ynlvay17glZLe\nExH/tuxxShsAWlprabf9PW3aGQAGVHymPXMHnGkDQGt9nWkDAAZEaQNAIpQ2ACRCaQNAIpQ2ACRC\naQNAIpQ2ACRCaQNAIpQ2ACRCaQNAIpQ2ACRCaQNAIpQ2ACRCaQNAIpQ2ACRCaQNAIpQ2ACRCaQNA\nIpQ2ACRCaQNAIpQ2ACRCaQNAIpQ2ACRCaQNAIpQ2ACTSWNq2z7B90PZh20dsf7SPYACAlRaaBkTE\nn2z/a0Qcs70g6Vu2L42Ib/WQDwCwRNHySEQcm948XdJ2SY93lmgTGo/HWlzco8XFPRqPx0PHAbaM\nzfjec0Q0D7K3SfqhpH+W9NmIeN+SbVGyj61qPB7ryiuv0lNPXS9J2rHjOh04sF+7d+8eOBmwuW30\n955tRYRbP69N4dp+tqSxpPdHRD19jNKeY3Fxj+666wpJV00f2a/XvOYruvPO24aMBWx6G/29t9bS\nblzTXioinrD9VUkXSapPPD4ajU6OqapKVVW1zQEAm1pd16rret37aTzTtn22pOMR8TvbOzQ50/5w\nRHxzup0z7Tk2+o9owGa10d97nS2P2H6RpP2aXLTcJulLEfGxJdsp7Qbj8Vj79t0oSbr22rdtmIMG\n2Ow28nuvlzXtGS9MaQNAS2stbT4RCQCJUNoAkAilDQCJUNoAkAilDQCJUNoAkAilDQCJUNoAkAil\nDQCJUNoAkAilDQCJUNoAkAilDQCJUNoAkAilDQCJUNoAkAilDQCJUNoAkAilDQCJUNoAkAilDQCJ\nUNoAkAilDQCJUNoAkAilDQCJUNoAkEhjadveaftu2w/afsD2O/sIBgBYyRExf4B9rqRzI+Kw7TMl\n/UDSf0TE0en2aNoHAODv2VZEuO3zGs+0I+JXEXF4evtJSUclPa99RADAerVa07a9S9KFkg52EQYA\nMN9C6cDp0sitkq6ZnnGfNBqNTt6uqkpVVZ2ieACwOdR1rbqu172fxjVtSbJ9mqT/lfS1iPj4sm2s\naQNAS2td0y65EGlJ+yX9JiLetcp2ShsAWuqytC+VdI+k+ySdGPyBiPj6dDulDQAtdVbaBS9MaQNA\nS539yh8AYOOgtAEgEUobABKhtAEgEUobABKhtAEgEUobABKhtAEgEUobABKhtAEgEUobABKhtAEg\nEUobABKhtAEgEUobABKhtAEgEUobABKhtAEgEUobABKhtAEgEUobABKhtAEgEUobABKhtAEgEUob\nABJpLG3bN9t+zPb9fQQCAMxWcqb9BUmXdR0EANCssbQj4l5Jv+0hCwCgAWvaAJDIwqnYyWg0Onm7\nqipVVXUqdgsAm0Zd16rret37cUQ0D7J3Sbo9Il60yrYo2QcA4G9sKyLc9nksjwBAIiW/8neLpG9L\n+hfbj9i+uvtYAIDVFC2PzN0ByyMA0BrLIwCwBVDaAJAIpQ0AiVDaAJAIpQ0AiVDaAJAIpQ0AiVDa\nAJAIpQ0AiVDaAJAIpQ0AiVDaAJAIpQ0AiVDaAJAIpQ0AiVDaAJAIpQ0AiVDaAJAIpQ0AiVDaAJAI\npQ0AiVDaAJAIpQ0AiVDaAJAIpQ0AiTSWtu3LbP/I9k9sX9dHKADA6uaWtu3tkj4t6TJJL5T0Btsv\n6CPYetV1PXSEFchUhkzlNmIuMnWr6Uz7YkkPRcTDEfFnSf8t6d+7j7V+G/EfiUxlyFRuI+YiU7ea\nSvufJD2y5P6j08cAAANoKu3oJQUAoIgjZvey7UskjSLisun9D0h6OiKuXzKGYgeANYgIt31OU2kv\nSPp/Sa+W9AtJ35X0hog4utaQAIC1W5i3MSKO236HpLGk7ZI+T2EDwHDmnmkDADaW4k9ElnzIxvYn\np9v/z/aFpy7m2jLZrmw/YfvQ9OtDPWS62fZjtu+fM6bveZqbaaB52mn7btsP2n7A9jtnjOttrkoy\n9T1Xts+wfdD2YdtHbH90xri+j6nGXEMcV9PX3T59vdtnbO91rpoytZ6niGj80mRp5CFJuySdJumw\npBcsG3O5pDumt18i6Tsl+17rV2GmStJXusyxSq6XS7pQ0v0ztvc6T4WZhpincyVdML19pibXToY+\npkoyDTFXz5j+d0HSdyRdOvQxVZir97mavu67JX15tdcecK7mZWo1T6Vn2iUfsrlC0n5JioiDks6y\nfU7h/tei9IM/ra/OrkdE3Cvpt3OG9D1PJZmk/ufpVxFxeHr7SUlHJT1v2bBe56owk9T/XB2b3jxd\nk5OVx5cN6f2YKswl9TxXts/TpJhvmvHavc9VQSbNeXyF0tIu+ZDNamPOKw2yBiWZQtJLpz8G3WH7\nhR3mKdX3PJUYdJ5s79LkJ4GDyzYNNldzMvU+V7a32T4s6TFJd0fEkWVDBpmnglxDHFc3SHqvpKdn\nbB9irpoytZqn0tIuvVq5/LtFl1c5S/b9Q0k7I+J8SZ+S9D8d5mmjz3kqMdg82T5T0q2Srpme3a4Y\nsux+53PVkKn3uYqIpyPiAk3K5RW2q1WG9T5PBbl6nSvbr5X064g4pPlnrr3NVWGmVvNUWto/l7Rz\nyf2dmnyHmjfmvOljXWnMFBF/OPEjXER8TdJptp/TYaYSfc9To6HmyfZpkm6T9F8RsdqB2vtcNWUa\n8piKiCckfVXSRcs2DXpMzco1wFy9VNIVtn8m6RZJr7L9xWVj+p6rxkyt56lwEX1B0k81ueh3upov\nRF6i7i8alWQ6R3/7tcaLJT3cZaYlr7tLZRciO5+nwky9z5MmZx1flHTDnDF9H1MlmXqdK0lnSzpr\nenuHpHskvXroY6ow1yDvv+nrvVLS7UMfU4WZWs3T3A/XnBAzPmRj+z+n2z8XEXfYvtz2Q5L+KOnq\nkn2vVUkmSa+T9HbbxyUdk/T6LjNJku1bNPnHOdv2I5L2avLbLYPMU0kmDTBPkl4m6Y2S7rN9aPrY\nByU9/0SuAeaqMZP6n6vnStpve5smPxl/KSK+OeR7rzSXhjmulpp8Jx5+ruZmUst54sM1AJAI/7sx\nAEiE0gaARChtAEiE0gaARChtAEiE0gaARChtAEiE0gaARP4KoAqQtGhRiq4AAAAASUVORK5CYII=\n",
       "text": [
        "<matplotlib.figure.Figure at 0x110aeb350>"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from scipy.spatial.distance import euclidean\n",
      "%psource euclidean"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\u001b[0;32mdef\u001b[0m \u001b[0meuclidean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mu\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mv\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m\u001b[0m    \u001b[0;34m\"\"\"\u001b[0m\n",
        "\u001b[0;34m    Computes the Euclidean distance between two 1-D arrays.\u001b[0m\n",
        "\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m    The Euclidean distance between 1-D arrays `u` and `v`, is defined as\u001b[0m\n",
        "\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m    .. math::\u001b[0m\n",
        "\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m       {||u-v||}_2\u001b[0m\n",
        "\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m    Parameters\u001b[0m\n",
        "\u001b[0;34m    ----------\u001b[0m\n",
        "\u001b[0;34m    u : (N,) array_like\u001b[0m\n",
        "\u001b[0;34m        Input array.\u001b[0m\n",
        "\u001b[0;34m    v : (N,) array_like\u001b[0m\n",
        "\u001b[0;34m        Input array.\u001b[0m\n",
        "\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m    Returns\u001b[0m\n",
        "\u001b[0;34m    -------\u001b[0m\n",
        "\u001b[0;34m    euclidean : double\u001b[0m\n",
        "\u001b[0;34m        The Euclidean distance between vectors `u` and `v`.\u001b[0m\n",
        "\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m    \"\"\"\u001b[0m\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m\u001b[0m    \u001b[0mu\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_validate_vector\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mu\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m\u001b[0m    \u001b[0mv\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_validate_vector\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mv\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m\u001b[0m    \u001b[0mdist\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnorm\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mu\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mv\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m\u001b[0m    \u001b[0;32mreturn\u001b[0m \u001b[0mdist\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Does Euclidean distance appear to meet the above four requirements of a distance metric?\n",
      "\n",
      "\n",
      "We can now compute the distances between all of these pairs of coordinates, and we can store these in what's known as a *distance matrix*. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dm = []\n",
      "\n",
      "for c1 in coordinates:\n",
      "    row = []\n",
      "    for c2 in coordinates:\n",
      "        row.append(euclidean(c1, c2))\n",
      "    dm.append(row)\n",
      "\n",
      "print(dm)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[[0.0, 3.0, 3.1622776601683795], [3.0, 0.0, 3.6055512754639891], [3.1622776601683795, 3.6055512754639891, 0.0]]\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The [scikit-bio](https://github.com/biocore/scikit-bio) project defines defines a ``DistanceMatrix`` object that we can use to work with this."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from skbio.stats.distance import DistanceMatrix\n",
      "\n",
      "dm = DistanceMatrix(dm, map(str,range(3)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One feature that this gets us is nicer printing."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(dm)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "3x3 distance matrix\n",
        "IDs:\n",
        "'0', '1', '2'\n",
        "Data:\n",
        "[[ 0.          3.          3.16227766]\n",
        " [ 3.          0.          3.60555128]\n",
        " [ 3.16227766  3.60555128  0.        ]]\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[scikit-bio](https://github.com/biocore/scikit-bio) also provides a convenience wrapper that runs the SciPy *pairwise distance* measures for us and returns a ``DistanceMatrix`` object. We'll use that from here."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from skbio.diversity.beta import pw_distances\n",
      "help(pw_distances)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Help on function pw_distances in module skbio.diversity.beta._base:\n",
        "\n",
        "pw_distances(counts, ids=None, metric='braycurtis')\n",
        "    Compute distances between all pairs of columns in a counts matrix\n",
        "    \n",
        "    Parameters\n",
        "    ----------\n",
        "    counts : 2D array_like of ints or floats\n",
        "        Matrix containing count/abundance data where each row contains counts\n",
        "        of observations in a given sample.\n",
        "    ids : iterable of strs, optional\n",
        "        Identifiers for each sample in ``counts``.\n",
        "    metric : str, optional\n",
        "        The name of the pairwise distance function to use when generating\n",
        "        pairwise distances. See the scipy ``pdist`` docs, linked under *See\n",
        "        Also*, for available metrics.\n",
        "    \n",
        "    Returns\n",
        "    -------\n",
        "    skbio.DistanceMatrix\n",
        "        Distances between all pairs of samples (i.e., rows). The number of\n",
        "        row and columns will be equal to the number of rows in ``counts``.\n",
        "    \n",
        "    Raises\n",
        "    ------\n",
        "    ValueError\n",
        "        If ``len(ids) != len(counts)``.\n",
        "    \n",
        "    See Also\n",
        "    --------\n",
        "    scipy.spatial.distance.pdist\n",
        "    pw_distances_from_table\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dm = pw_distances(coordinates, metric=\"euclidean\")\n",
      "print(dm)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "3x3 distance matrix\n",
        "IDs:\n",
        "'0', '1', '2'\n",
        "Data:\n",
        "[[ 0.          3.          3.16227766]\n",
        " [ 3.          0.          3.60555128]\n",
        " [ 3.16227766  3.60555128  0.        ]]\n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The conditions of a distance matrix listed above lead to a few specific features: the distance matrix is symmetric (if you flip the upper triangle over the diagonal, the values are the same as those in the lower triangle), the distance matrix is *hollow* (the diagonal is all zeros), and there are no negative values. "
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Computing the distance between pairs of sequences."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Most often, distances between pairs of sequences are derived from a multiple sequence alignment. These differ from the pairwise alignments that we\u2019ve looked at thus far, but use the same underlying algorithms (and we'll be coming back to this in the next chapeter).\n",
      "\n",
      "Let's load up some aligned sequences, and compute a distance matrix. For now, we'll compute distances between the sequences using the ``hamming`` function that we worked with in the pairwise alignment chapter."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from skbio import Alignment, DNA\n",
      "aln = Alignment([DNA('ACCGTGAAGCCAATAC', 's1'), \n",
      "                 DNA('A-CGTGCAACCATTAC', 's2'),\n",
      "                 DNA('AGCGTGCAGCCAATAC', 's3'),\n",
      "                 DNA('AGGGTGCCGC-AATAC', 's4'),\n",
      "                 DNA('AGGGTGCCAC-AATAC', 's5')])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%psource aln.distances"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "    \u001b[0;32mdef\u001b[0m \u001b[0mdistances\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdistance_fn\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m\u001b[0m        \u001b[0;34m\"\"\"Compute distances between all pairs of sequences\u001b[0m\n",
        "\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m        Parameters\u001b[0m\n",
        "\u001b[0;34m        ----------\u001b[0m\n",
        "\u001b[0;34m        distance_fn : function, optional\u001b[0m\n",
        "\u001b[0;34m            Function for computing the distance between a pair of sequences.\u001b[0m\n",
        "\u001b[0;34m            This must take two sequences as input (as\u001b[0m\n",
        "\u001b[0;34m            `skbio.sequence.BiologicalSequence` objects) and return a\u001b[0m\n",
        "\u001b[0;34m            single integer or float value. Defaults to\u001b[0m\n",
        "\u001b[0;34m            `scipy.spatial.distance.hamming`.\u001b[0m\n",
        "\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m        Returns\u001b[0m\n",
        "\u001b[0;34m        -------\u001b[0m\n",
        "\u001b[0;34m        skbio.DistanceMatrix\u001b[0m\n",
        "\u001b[0;34m            Matrix containing the distances between all pairs of sequences.\u001b[0m\n",
        "\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m        Raises\u001b[0m\n",
        "\u001b[0;34m        ------\u001b[0m\n",
        "\u001b[0;34m        skbio.util.exception.BiologicalSequenceError\u001b[0m\n",
        "\u001b[0;34m            If ``len(self) != len(other)`` and ``distance_fn`` ==\u001b[0m\n",
        "\u001b[0;34m            ``scipy.spatial.distance.hamming``.\u001b[0m\n",
        "\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m        See Also\u001b[0m\n",
        "\u001b[0;34m        --------\u001b[0m\n",
        "\u001b[0;34m        skbio.DistanceMatrix\u001b[0m\n",
        "\u001b[0;34m        scipy.spatial.distance.hamming\u001b[0m\n",
        "\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m        Examples\u001b[0m\n",
        "\u001b[0;34m        --------\u001b[0m\n",
        "\u001b[0;34m        >>> from skbio.alignment import Alignment\u001b[0m\n",
        "\u001b[0;34m        >>> from skbio.sequence import DNA\u001b[0m\n",
        "\u001b[0;34m        >>> seqs = [DNA(\"A-CCGGG\", id=\"s1\"),\u001b[0m\n",
        "\u001b[0;34m        ...         DNA(\"ATCC--G\", id=\"s2\"),\u001b[0m\n",
        "\u001b[0;34m        ...         DNA(\"ATCCGGA\", id=\"s3\")]\u001b[0m\n",
        "\u001b[0;34m        >>> a1 = Alignment(seqs)\u001b[0m\n",
        "\u001b[0;34m        >>> print(a1.distances())\u001b[0m\n",
        "\u001b[0;34m        3x3 distance matrix\u001b[0m\n",
        "\u001b[0;34m        IDs:\u001b[0m\n",
        "\u001b[0;34m        's1', 's2', 's3'\u001b[0m\n",
        "\u001b[0;34m        Data:\u001b[0m\n",
        "\u001b[0;34m        [[ 0.          0.42857143  0.28571429]\u001b[0m\n",
        "\u001b[0;34m         [ 0.42857143  0.          0.42857143]\u001b[0m\n",
        "\u001b[0;34m         [ 0.28571429  0.42857143  0.        ]]\u001b[0m\n",
        "\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m        \"\"\"\u001b[0m\u001b[0;34m\u001b[0m\n",
        "\u001b[0;34m\u001b[0m        \u001b[0;32mreturn\u001b[0m \u001b[0msuper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mAlignment\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdistances\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdistance_fn\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "master_dm = aln.distances()\n",
      "print(master_dm)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "5x5 distance matrix\n",
        "IDs:\n",
        "'s1', 's2', 's3', 's4', 's5'\n",
        "Data:\n",
        "[[ 0.      0.25    0.125   0.3125  0.375 ]\n",
        " [ 0.25    0.      0.1875  0.375   0.3125]\n",
        " [ 0.125   0.1875  0.      0.1875  0.25  ]\n",
        " [ 0.3125  0.375   0.1875  0.      0.0625]\n",
        " [ 0.375   0.3125  0.25    0.0625  0.    ]]\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Once we have these distances, we can cluster the sequences based on their similiaries/dissimilarities. This is the first process that we'll explore for tree building.\n",
      "\n",
      "**NOTE:** The example below assumes that each value in this distance matrix is multiplied by the sequence length, so we'll do that here and work work with the resulting distance matrix."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "master_dm = DistanceMatrix(master_dm.data*16, master_dm.ids)\n",
      "print(master_dm)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "5x5 distance matrix\n",
        "IDs:\n",
        "'s1', 's2', 's3', 's4', 's5'\n",
        "Data:\n",
        "[[ 0.  4.  2.  5.  6.]\n",
        " [ 4.  0.  3.  6.  5.]\n",
        " [ 2.  3.  0.  3.  4.]\n",
        " [ 5.  6.  3.  0.  1.]\n",
        " [ 6.  5.  4.  1.  0.]]\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Hierarchical clustering with UPGMA"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Unweighted Pair-Group Method with Arithmetic mean\n",
      "\n",
      "Unweighted: all tip-to-tip distances contribute equally\n",
      "Pair-group: all branch points lead to exactly two clades\n",
      "Arithmetic mean: distances to each clade are the mean of distances to all members of that clade\n",
      "\n",
      "Steps\n",
      "-----\n",
      "\n",
      "1. Identify the smallest distance in the matrix and define a clade containing only those members. Draw that clade, and set the *total* branch length to the distance between the tips.\n",
      "2. Create a new distance matrix with an entry representing the clade created in step 1. \n",
      "3. Calculate the distance matrix entries for the clade as the mean distance from each of the tips of the new clade to all other tips in the distance matrix.\n",
      "4. If there is only one distance (below or above the diagonal) in the distance matrix, use it to connect the remaing clades, and stop. Otherwise repeat step 1. \n",
      "\n",
      "Let's start, working from the above distance matrix."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Iteration 1\n",
      "------------\n",
      "\n",
      "Step 1.1: The smallest distance in the above matrix is `1.00`, between `s4` and `s5`. So, we'll draw that clade and set each branch length to `0.5`. \n",
      "\n",
      "Step 1.2: Next, we'll create a new, smaller distance matrix where the sequences `s4` and `s5` are now represented by a single clade, `(s4, s5)`. \n",
      "\n",
      "<img src=\"files/images/upgma-tree-iter1.png\">"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "iter1_ids = ['s1', 's2', 's3', '(s4, s5)']\n",
      "iter1_dm = [[0.0,   4.0,  2.0, None], \n",
      "            [4.0,   0.0,  3.0, None], \n",
      "            [2.0,   3.0,  0.0, None], \n",
      "            [None, None, None, None]]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Step 1.3: We'll now fill in the values from the new clade to each of the existing sequences (or clades). The distance will be the mean between a pre-existing clades, and each of the sequences in the new clade. For example, the distance between `s1` and `(s4, s5)` is the mean of the distance between `s1` and `s4` and `s1` and `s5`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "\n",
      "np.mean([master_dm[0][3], master_dm[0][4]])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 13,
       "text": [
        "5.5"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Step 1.3 (continued): Similarly, the distance between `s2` and `(s4, s5)` is the mean of the distance between `s2` and `s4` and `s2` and `s5`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.mean([master_dm[1][3], master_dm[1][4]])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 14,
       "text": [
        "5.5"
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Step 1.3 (continued): And finally, the distance between `s3` and `(s4, s5)` is the mean of the distance between `s3` and `s4` and the distance between `s3` and `s5`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.mean([master_dm[2][3], master_dm[2][4]])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 15,
       "text": [
        "3.5"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Step 1.3 (continued): If we fill these values in (note that they will be the same above and below the diagonal) the post-iteration 1 distance matrix looks like the following:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "iter1_dm = [[0.0, 4.0, 2.0, 5.5], \n",
      "      [4.0, 0.0, 3.0, 5.5], \n",
      "      [2.0, 3.0, 0.0, 3.5], \n",
      "      [5.5, 5.5, 3.5, 0.0]]\n",
      "\n",
      "iter1_dm = DistanceMatrix(iter1_dm, iter1_ids)\n",
      "print(iter1_dm)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "4x4 distance matrix\n",
        "IDs:\n",
        "'s1', 's2', 's3', '(s4, s5)'\n",
        "Data:\n",
        "[[ 0.   4.   2.   5.5]\n",
        " [ 4.   0.   3.   5.5]\n",
        " [ 2.   3.   0.   3.5]\n",
        " [ 5.5  5.5  3.5  0. ]]\n"
       ]
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Step 1.4: There is still more than one value below the diagonal, so we start a new iteration."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Iteration 2\n",
      "------------\n",
      "\n",
      "Step 2.1: The smallest distance in the above matrix is `2.00`, between `s1` and `s3`. So, we'll draw that clade and set each branch length to `1.0`. \n",
      "\n",
      "Step 2.2: Next, we'll create a new, smaller distance matrix where the sequences `s1` and `s3` are now represented by a single clade, `(s1, s3)`. \n",
      "\n",
      "<img src=\"files/images/upgma-tree-iter2.png\">"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "iter2_ids = ['(s1, s3)', 's2', '(s4, s5)']\n",
      "\n",
      "iter2_dm = [[None, None, None], \n",
      "      [None,  0.0, 5.5], \n",
      "      [None,  5.5, 0.0]]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 17
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Step 2.3: We'll now fill in the values from the new clade to each of the existing sequences (or clades). The distance will be the mean between a pre-existing clades, and each of the sequences in the new clade. For example, the distance between `s2` and `(s1, s3)` is the mean of the distance between `s2` and `s1` and `s2` and `s3`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.mean([master_dm[1][0], master_dm[1][2]])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 18,
       "text": [
        "3.5"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Step 2.3 (continued): Next, we need to find the distance between `(s1, s3)` and `(s4, s5)`. This is computed as the mean of the distance between `s1` and `s4`, the distance between `s1` and `s5`, the distance between `s3` and `s4`, and the distance between `s3` and `s5`. Note that we are going back to our master distance matrix here for these distances, **not** our iteration 1 distance matrix."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.mean([master_dm[0][3], master_dm[0][4], master_dm[2][3], master_dm[2][4]])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 19,
       "text": [
        "4.5"
       ]
      }
     ],
     "prompt_number": 19
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Step 2.3 (continued): We can now fill in all of the distances in our iteration 2 distance matrix."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "iter2_dm = [[0.0, 3.5, 4.5], \n",
      "            [3.5, 0.0, 5.5], \n",
      "            [4.5, 5.5, 0.0]]\n",
      "\n",
      "iter2_dm = DistanceMatrix(iter2_dm, iter2_ids)\n",
      "print(iter2_dm)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "3x3 distance matrix\n",
        "IDs:\n",
        "'(s1, s3)', 's2', '(s4, s5)'\n",
        "Data:\n",
        "[[ 0.   3.5  4.5]\n",
        " [ 3.5  0.   5.5]\n",
        " [ 4.5  5.5  0. ]]\n"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Step 2.4: There is still more than one value below the diagonal, so we start a new iteration."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Iteration 3\n",
      "------------\n",
      "\n",
      "Step 3.1: The smallest distance in the above matrix is `3.50`, between `(s1, s3)` and `s2`. So, we'll draw that clade and set each branch length to `1.75`. \n",
      "\n",
      "Step 3.2: Next, we'll create a new, smaller distance matrix where the clade `(s1, s3)` and the sequence `s2` are now represented by a single clade, `((s1, s3), s2)`. \n",
      "\n",
      "<img src=\"files/images/upgma-tree-iter3.png\">"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "iter3_ids = ['((s1, s3), s2)', '(s4, s5)']\n",
      "\n",
      "iter3_dm = [[None, None], \n",
      "            [None,  0.0]]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 21
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Step 3.3: We'll now fill in the values from the new clade to each of the existing sequences (or clades). This is computed as the mean of the distance between `s1` and `s4`, the distance between `s1` and `s5`, the distance between `s3` and `s4`, the distance between `s3` and `s5`, the distance between `s2` and `s4`, and the distance between `s2` and `s5`. Again, note that we are going back to our master distance matrix here for these distances, **not** our iteration 1 or iteration 2 distance matrix."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.mean([master_dm[0][3], master_dm[0][4], master_dm[2][3], master_dm[2][4], master_dm[1][3], master_dm[1][4]])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 22,
       "text": [
        "4.833333333333333"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Step 3.3 (continued): We can now fill in all of the distances in our iteration 3 distance matrix. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "iter3_dm = [[0.0, 4.8], \n",
      "            [4.8, 0.0]]\n",
      "\n",
      "iter3_dm = DistanceMatrix(iter3_dm, iter3_ids)\n",
      "print(iter3_dm)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2x2 distance matrix\n",
        "IDs:\n",
        "'((s1, s3), s2)', '(s4, s5)'\n",
        "Data:\n",
        "[[ 0.   4.8]\n",
        " [ 4.8  0. ]]\n"
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Step 3.4: At this stage, there is only one distance below the diagonal in our distance matrix. So, we can use that distance to draw the final branch in our tree, setting the total branch length to 4.8.\n",
      "\n",
      "<img src=\"files/images/upgma-tree-final.png\">"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[SciPy](http://www.scipy.org/) contains support for running UPGMA and generating *dendrograms* (or basic tree visualizations). We can apply this to our distance matrix as follows. You can explore other options for hierarchical clustering in SciPy [here](http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html) (see the *routines for agglomerative clustering*).  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# scipy.cluster.hierarchy.average is an implementation of UPGMA\n",
      "from scipy.cluster.hierarchy import average, dendrogram\n",
      "lm = average(master_dm.condensed_form())\n",
      "d = dendrogram(lm, labels=master_dm.ids, orientation='right', \n",
      "               link_color_func=lambda x: 'black')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "display_data",
       "png": "iVBORw0KGgoAAAANSUhEUgAAAWsAAAD7CAYAAACsV7WPAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAC1BJREFUeJzt3X+s3fVdx/HniwIROrD3RiMVMDMGsn+qI5v+IYjdEFK8\nE3XZ5pYsSwz0DzfEhCDEJYaYkoyM/TDGv1T8sayZNM6pg2QjKyNZswzH0toqmxmb6GwENZx0kMWg\n6ds/zmlyYb3n3kvPud/75jwfyU3Pvd9vP/3k2/Z5v/2cc/pJVSFJ2t7OG3oCkqT1GWtJasBYS1ID\nxlqSGjDWktSAsZakBs6fdjCJr+uTpFehqjLL8da9s66ql33ce++93/e1RfzwOngdvA5eh7U+5sFl\nEElqwFhLUgObjvXevXvnMI1+vA5jXocxr8OY12F+Mm19JUnNa/1Fkl6rklBb/QSjJGl4xlqSGjDW\nktSAsZakBoy1JDVgrCWpAWMtSQ0Ya0lqwFhLUgPGWpIaMNaS1ICxlqQGjLUkDSDJg0mOJTme5DNJ\nfnDq+f6ve5I0Wxv5X/eSXFJVL0wefxQYVdV9a53vnbUkzVmSnUkemdxJn0jyrlWhDnAR8N/Txpi6\nYa4kaSb2ASeragUgyaWTH/8MuBl4Grhj2gAug0gzsLy8zGg0Gnoa2kZWL4MkuQp4FHgIeLiqjqw6\ndh7wh8BzVfV7a41nrKUZmKxRDj0NbRNnW7NOsgtYAfYDh6vqwKpj1wN3V9Xb1hrTZRBJmrMkuxk/\ngXgwySngtiQ/UVXfmqxZ3wIcnTaGsZak+dsDPJDkNPAScDvwF2fWroEngQ9MG8BlEGkGXAbRam6Y\nK0kLylhLUgPGWpIaMNaS1ICxlqQGjLUkNWCsJakBYy1JDRhrSWrAWEtSA8Zakhow1pLUgLGWpAaM\ntSQ1YKwlqQFjLUkNGGtJasBtvRpzR21pcbitV2NuJbV9+Huh1dzWS5IWlLGWpAaMtSQ1YKwlqQFj\nLUkNGGtJasBYS1IDxlqSGjDWktSAsZakBoy1JDVgrCWpAWMtSQ0Ya0lqwFhLUgPGWpIaMNaSNIAk\ntyd5OsnpJMvrnW+sJWkYR4AbgH/dyMnuwShJc5ZkJ3AIuBzYARyoqkOTYxsaw1hL0vztA05W1QpA\nkks3O4DLIJI0f8eBG5Pcn+S6qvruZgfwzlqagaWlpQ3/c1aLp6q+meQaYAW4L8nhqjqwmTGMtTQD\nzz///NBT0Dbyym/cSXYDo6o6mOQUcOvk62dOXPc7vcsgkjR/e4AnkhwFfhc4kOQ3gX9j/KTj8SR/\nNG2AVNXaB5OadlzDSoK/P9L2M/m7OdN1Me+sJakBYy1JDRhrSWrAWEtSA8Zakhow1pLUgLGWpAbm\n+g7G5eVlRqPRPH8JSVoIc31TjG/amC+vr7Q9+aYYSVpQxlqSGjDWktSAsZakBoy1JDVgrCWpAWMt\nSQ0Ya0lqwFhLUgPGWpIaMNaS1ICxlqQGjLUkNWCsJakBYy1JDRhrSWrAWEtSA8Zakhow1pLUgLGW\npAaMtSQ1YKwlqQFjLUkNGGtJasBYS1IDxlqSGjDWktSAsZakASQ5mOQbSU4keTDJ+dPON9aSNIxP\nVtUbqmoPcBFw27STp5ZcknTukuwEDgGXAzuAA1V1aNUpXwWumDaGsZak+dsHnKyqFYAkl545kOQC\n4L3AHdMGSFWtfTCpacfXk4Rz+fmabnl5mdFoNPQ0JJ1FVeXM4yRXAY8CDwEPV9WRVcf+GHihqu6c\nNp6xlqQZm7Qvr/jaLmAF2A8crqoDSe4Ffqqq3r7emC6DSNKcJdkNjKrqYJJTwK1JbgNuAm7Y0Bje\nWUvSbL3yzjrJTcADwGngJeD9wFeAZ4AXJ6d9uqruW3NMYy1Js3W2ZZBz5eusJakBYy1JDRhrSWrA\nWEtSA8Zakhow1pLUgLGWpAaMtSQ1YKwlqQFjLUkNGGtJasBYS1IDxlqSGjDWktSAsZakBoy1JDVg\nrCWpAWMtSQ0Ya0lqwFhLUgPGWpIaMNaS1ICxlqQGjLUkNWCsJakBYy1JDRhrSWrAWEtSA8Zakhow\n1pLUgLGWpAaMtSQ1YKwlqQFjLUkNGGtJasBYS9IAkvx5km8nOTr5+Mlp55+/VROTJL1MAXdV1V9v\n5GRjLUlzlmQncAi4HNgBHDhzaKNjuAwiSfO3DzhZVW+sqj3A5yZf/1CSf0jysSQXThvAWEvS/B0H\nbkxyf5Lrquq7wO9U1dXATwPLwD3TBkhVrX0wqWnH15OEc/n5knpZXl5mNBoNPY1toapetsSRZBew\nAuwHDlfVgVXHfp7x+vUvrTWea9aSZmY0GnmDxvhG9RWf7wZGVXUwySng1iSXVdWzGZ/8q8CJaWMa\na0mavz3AA0lOAy8B7wcOJvlhxk8yHgU+OG0Al0EkzYx/58cm12HDr/TYCJ9glKQGjLUkNWCsJakB\nYy1JDRhrSWrAWEtSA8Zakhow1pLUgLGWpAaMtSQ1YKwlqQFjLUkNGGtJasBYS1IDxlqSGjDWktSA\nsZakBoy1JDUw1z0Yl5aWvm/jSEnS5s11D0ZJi8U9GMfcg1GSFpSxlqQGjLUkNWCsJakBYy1JDRhr\nSWrAWEtSA8Zakhow1pLUgLGWpAaMtSQ1YKwlqQFjLUkNGGtJasBYS1IDxlqSGjDWkjSgJH+Q5IX1\nzjPWkjSQJG8GdgHrbq9jrCVpzpLsTPJIkmNJTiR5Z5IdwIeBu4F1twCb64a5kiQA9gEnq2oFIMml\nwO3A31bVsxvZWNxYS5qZpaUlNhKeBXQc+EiS+4GHgW8D7wD2ZoMXzN3NJWnGzra7eZJdwAqwH3gM\n+A3gfyaHfwz4VlVdvdaY3llL0pwl2Q2MqupgklPArVW1e9XxF6aFGoy1JG2FPcADSU4DLzG+q15t\n3SUMl0EkacbOtgxyrnzpniQ1YKwlqQFjLUkNGGtJasBYS1IDxlqSGjDWktSAsZakBoy1JDVgrCWp\nAWMtSQ0Ya0lqwFhLUgPGWpIa2HSsH3/88TlMox+vw5jXYczrMOZ1mB9j/Sp5Hca8DmNehzGvw/y4\nDCJJDRhrSWpg3W29tnAukvSaMettvabGWpK0PbgMIkkNGGtJamBTsU7yTJLjSY4m+ft5TWq7S7Jj\ncg0+O/RchpLkB5I8keRYkqeSfGjoOQ0hyZVJvpjkn5L8Y5I7hp7TEJL8aZLnkpwYei5DS7IvyTeS\nfDPJPTMbdzNr1kn+BXhTVT0/qwl0lORO4E3AJVV1y9DzGUqSi6vqe0nOB44Ad1XVkaHntZWSXAZc\nVlXHkrwO+BrwK1X19YGntqWS/BzwIvCJqtoz9HyGkmQH8M/ALwAnga8C75nFn4dXswwy02c4u0ly\nBfCLwJ+w4Neiqr43eXghsANYuG/iVfVsVR2bPH4R+Drwo8POautV1ZeA0dDz2AZ+Bni6qp6pqv8F\n/hL45VkMvNlYF/CFJE8m2T+LCTT0ceC3gdNDT2RoSc5Lcgx4DvhiVT019JyGlOT1wDXAE8PORAO6\nHPjOqs//ffK1c7bZWF9bVdcANwMfmPzTZ2EkeRvwn1V1lAW/qwaoqtNV9UbgCuD6JHsHntJgJksg\nfwX81uQOW4tpbq+F3lSsq+o/Jj/+F/AZxrf8i+RngVsma/efAt6a5BMDz2lwVXUKeAR489BzGUKS\nC4BPA5+sqr8Zej4a1EngylWfX8n47vqcbTjWSS5Ocsnk8U7gJmChnvmtqg9W1ZVV9ePAu4HHqup9\nQ89rCEl+KMmuyeOLgBuBo8POauslCfAg8FRV/f7Q89HgngSuSvL6JBcCvwb83SwG3syd9Y8AX5qs\nUT4BPFxVj85iEo0t8ts/dwOPrfrz8NmqOjzwnIZwLfBe4C2Tl3MeTbJv6ElttSSfAr4MXJ3kO0l+\nfeg5DaGq/g+4Hfg88BTw0KxeGeTbzSWpAd/BKEkNGGtJasBYS1IDxlqSGjDWktSAsZakBoy1JDVg\nrCWpgf8HV5/Srf0xPQEAAAAASUVORK5CYII=\n",
       "text": [
        "<matplotlib.figure.Figure at 0x113485050>"
       ]
      }
     ],
     "prompt_number": 24
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Acknowledgements"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The material in this section was compiled while consulting the following sources:\n",
      "\n",
      "1. The Phylogenetic Handbook (Lemey, Salemi, Vandamme)\n",
      "2. Inferring Phylogeny (Felsenstein)\n",
      "3. [Richard Edwards\u2019s teaching website](http://www.southampton.ac.uk/~re1u06/teaching/upgma/)\n"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}