{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# [3.1]() Studying Microbial Diversity [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Table of Contents**\n", "0. [Getting started: the feature table](#1)\n", "0. [Terminology](#2)\n", "0. [Measuring alpha diversity](#3)\n", " 0. [Observed species (or Observed OTUs)](#3.1)\n", " 0. [A limitation of OTU counting](#3.1.1)\n", " 0. [Phylogenetic Diversity (PD)](#3.2)\n", " 0. [Even sampling](#3.3)\n", "0. [Measuring beta diversity](#4)\n", " 0. [Distance metrics](#4.1)\n", " 0. [Bray-Curtis](#4.1.1)\n", " 0. [Unweighted UniFrac](#4.1.2)\n", " 0. [Even sampling](#4.1.3)\n", " 0. [Interpreting distance matrices](#4.2)\n", " 0. [Distribution plots and comparisons](#4.2.1)\n", " 0. [Hierarchical clustering](#4.2.2)\n", " 0. [Ordination](#4.3)\n", " 0. [Polar ordination](#4.3.1)\n", " 0. [Determining the most important axes in polar ordination](#4.3.2)\n", " 0. [Interpreting ordination plots](#4.3.3)\n", "0. [Tools for using ordination in practice: scikit-bio, pandas, and matplotlib](#5)\n", "0. [PCoA versus PCA: what's the difference?](#6)\n", "0. [Are two different analysis approaches giving me the same result?](#7)\n", " 0. [Procrustes analysis](#7.1)\n", "0. [Where to go from here](#8)\n", "0. [Acknowledgements](#9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Unicellular organisms, often referred to as “microbes”, represent the vast majority of the diversity of life on Earth. Microbes perform an amazing array of biological functions and rarely live or act alone, but rather exist in complex communities composed of many interacting species. We now know that the traditional approach for studying microbial communities (or microbiomes), which relied on microbial culture (in other words, being able to grow the microbes in the lab), is insufficient because we don’t know the conditions required for the growth of most microbes. Recent advances that have linked microbiomes to processes ranging from global (for example, the cycling of biologically essential nutrients such as carbon and nitrogen) to personal (for example, human disease, including obesity and cancer) have thus relied on “culture independent” techniques. Identification now relies on sequencing fragments of microbial genomes, and using those fragments as “molecular fingerprints” that allow researchers to profile which microbes are present in an environment. Currently, the bottleneck in microbiome analysis is not DNA sequencing, but rather interpreting the large quantities of DNA sequence data that are generated: often on the order of tens to hundreds of gigabytes. This chapter will integrate many of the topics we've covered in previous chapters to introduce how we study communities of microorganisms using their DNA sequences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [3.1.1](#1) Getting started: the feature table [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "From a bioinformatics perspective, studying biological diversity is centered around a few key pieces of information:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* A table of the frequencies of certain biological features (e.g., species or OTUs) on a per sample basis.\n", "* *Sample metadata* describing exactly what each of the samples is, as well as any relevant technical information.\n", "* *Feature metadata* describing each of the features. 
This can be taxonomic information, for example, but we'll come back to this when we discuss features in more detail (this will be completed as part of [#105](https://github.com/caporaso-lab/An-Introduction-To-Applied-Bioinformatics/issues/105)).\n", "* Optionally, information on the relationships between the biological features, typically in the form of a phylogenetic tree where tips in the tree correspond to OTUs in the table." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "None of these are trivial to generate (defining OTUs was described in the [OTU clustering chapter](../algorithms/5-sequence-mapping-and-clustering.ipynb), building trees in the [Phylogenetic reconstruction chapter](../algorithms/3-phylogeny-reconstruction.ipynb), and there is a lot of active work on standardized ways to describe samples in the form of metadata, for example [Yilmaz et al (2011)](http://www.nature.com/nbt/journal/v29/n5/full/nbt.1823.html) and the [isa-tab](http://isa-tools.org/) project. For this discussion we're going to largely ignore the complexities of generating each of these, so we can focus on how we study diversity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sample by feature frequency table is central to investigations of biological diversity. The Genomics Standards Consortium has recognized the [Biological Observation Matrix](http://www.biom-format.org) ([McDonald et al. (2011) *Gigascience*](http://www.gigasciencejournal.com/content/1/1/7)), or `biom-format` software and file format definition as a community standard for representing those tables. For now, we'll be using pandas to store these tables as the core ``biom.Table`` object is in the process of being ported to ``scikit-bio`` (to follow progress on this, see scikit-bio [issue #848](https://github.com/biocore/scikit-bio/issues/848)). Even though we're not currently using BIOM to represent these tables, we'll refer to these through-out this chapter as *BIOM tables* for consistency with other projects." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The basic data that goes into a BIOM table is the list of sample ids, the list of feature (e.g., OTU) ids, and the frequency matrix, which describes how many times each OTU was observed in each sample. 
We can build and display a BIOM table as follows:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Populating the interactive namespace from numpy and matplotlib\n", " A B C\n", "OTU1 1 0 0\n", "OTU2 3 2 0\n", "OTU3 0 0 6\n", "OTU4 1 4 2\n", "OTU5 0 4 1" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd\n", "\n", "sample_ids = ['A', 'B', 'C']\n", "feature_ids = ['OTU1', 'OTU2', 'OTU3', 'OTU4', 'OTU5']\n", "data = np.array([[1, 0, 0],\n", " [3, 2, 0],\n", " [0, 0, 6],\n", " [1, 4, 2],\n", " [0, 4, 1]])\n", "\n", "table1 = pd.DataFrame(data, index=feature_ids, columns=sample_ids)\n", "table1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want the feature frequency vector for sample `A` from the above table, we use the pandas API to get that as follows:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "OTU1 1\n", "OTU2 3\n", "OTU3 0\n", "OTU4 1\n", "OTU5 0\n", "Name: A, dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table1['A']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TODO**: Trees in Newick format; sample metadata in TSV format, and loaded into a pandas DataFrame." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we start looking at what we can do with this data once we have it, let's discuss some terminology." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [3.1.2](#2) Terminology [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "There are literally hundreds of metrics of biological diversity. Here is some terminology that is useful for classifying these metrics." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Alpha versus beta diversity**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " * $\\alpha$ (i.e., within sample) diversity: Who is there? How many are there?\n", " * $\\beta$ (i.e., between sample) diversity: How similar are pairs of samples?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Quantitative versus qualitative metrics**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " * qualitative metrics only account for whether an organism is present or absent\n", " * quantitative metrics account for abundance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Phylogenetic versus non-phylogenetic metrics**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " * non-phylogenetic metrics treat all OTUs as being equally related\n", " * phylogenetic metrics incorporate evolutionary relationships between the OTUs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next sections we'll look at some metrics that cross these different categories. As new metrics are introduced, try to classify each of them into one class for each of the above three categories." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [3.1.3](#3) Measuring alpha diversity [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Table of Contents**\n", "0. [Observed species (or Observed OTUs)](#3.1)\n", " 0. [A limitation of OTU counting](#3.1.1)\n", "0. [Phylogenetic Diversity (PD)](#3.2)\n", "0. 
[Even sampling](#3.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The first type of metric that we'll look at will be alpha diversity, and we'll specifically focus on *richness* here. Richness refers to how many different *types* of organisms are present in a sample: for example, if we're interested in species richness of plants in the Sonoran Desert and the Costa Rican rainforest, we could go to each, count the number of different species of plants that we observe, and have a basic measure of species richness in each environment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An alternative type of alpha diversity measure would be *evenness*, and would tell us how even or uneven the distribution of species abundances are in a given environment. If, for example, the most abundant plant in the Sonoran desert was roughly as common as the least abundant plant (not the case!), we would say that the evenness of plant species was high. On the other hand, if the most abundant plant was thousands of times more common than the least common plant (probably closer to the truth), then we'd say that the evenness of plant species was low. We won't discuss evenness more here, but you can find coverage of this topic (as well as many of the others presented here) in [Measuring Biological Diversity](http://www.amazon.com/Measuring-Biological-Diversity-Anne-Magurran/dp/0632056339)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at two metrics of alpha diversity: observed species, and phylogenetic diversity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [3.1.3.1](#3.1) Observed species (or Observed OTUs) [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Table of Contents**\n", "0. [A limitation of OTU counting](#3.1.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Observed species, or Observed OTUs as it's more accurately described, is about as simple of a metric as can be used to quantify alpha diversity. With this metric, we simply count the OTUs that are observed in a given sample. Note that this is a qualitative metric: we treat each OTU as being observed or not observed - we don't care how many times it was observed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's define a new table for this analysis:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " A B C\n", "B1 1 1 5\n", "B2 1 2 0\n", "B3 3 1 0\n", "B4 0 2 0\n", "B5 0 0 0\n", "A1 0 0 3\n", "E2 0 0 1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_ids = ['A', 'B', 'C']\n", "feature_ids = ['B1','B2','B3','B4','B5','A1','E2']\n", "data = np.array([[1, 1, 5],\n", " [1, 2, 0],\n", " [3, 1, 0],\n", " [0, 2, 0],\n", " [0, 0, 0],\n", " [0, 0, 3],\n", " [0, 0, 1]])\n", "\n", "table2 = pd.DataFrame(data, index=feature_ids, columns=sample_ids)\n", "table2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our sample $A$ has an observed OTU frequency value of 3, sample $B$ has an observed OTU frequency of 4, and sample $C$ has an observed OTU frequency of 3. Note that this is different than the total counts for each column (which would be 5, 6, and 9 respectively). Based on the observed OTUs metric, we could consider samples $A$ and $C$ to have even OTU richness, and sample $B$ to have 33% higher OTU richness." 
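 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to double-check those numbers, pandas can compute both summaries directly. The following is just a quick sketch using the ``table2`` defined above: it counts, for each sample, how many features have a frequency greater than zero, and then prints the total frequency per sample for comparison. The ``observed_otus`` function defined below wraps the same idea up for a single sample." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# number of features observed (frequency > 0) in each sample: 3, 4, and 3\n", "print((table2 > 0).sum())\n", "# total frequency per sample, for comparison: 5, 6, and 9\n", "print(table2.sum())"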
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could compute this in python as follows:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def observed_otus(table, sample_id):\n", " return sum([e > 0 for e in table[sample_id]])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(observed_otus(table2, 'A'))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(observed_otus(table2, 'B'))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(observed_otus(table2, 'C'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### [3.1.3.1.1](#3.1.1) A limitation of OTU counting [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Imagine that we have the same table, but some additional information about the OTUs in the table. Specifically, we've computed the following phylogenetic tree. And, for the sake of illustration, imagine that we've also assigned taxonomy to each of the OTUs and found that our samples contain representatives from the archaea, bacteria, and eukaryotes (their labels begin with `A`, `B`, and `E`, respectively)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's define a phylogenetic tree using the Newick format (which is described [here](http://evolution.genetics.washington.edu/phylip/newicktree.html), and more formally defined [here](http://evolution.genetics.washington.edu/phylip/newick_doc.html)). We'll then load that up using [scikit-bio](http://scikit-bio.org)'s [TreeNode](http://scikit-bio.org/generated/skbio.core.tree.TreeNode.html#skbio.core.tree.TreeNode) object, and visualize it with [ete3](http://etetoolkit.org)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import ete3\n", "ts = ete3.TreeStyle()\n", "ts.show_leaf_name = True\n", "ts.scale = 250\n", "ts.branch_vertical_margin = 15\n", "ts.show_branch_length = True" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from io import StringIO\n", "newick_tree = StringIO('((B1:0.2,B2:0.3):0.3,((B3:0.5,B4:0.3):0.2,B5:0.9):0.3,'\n", " '((A1:0.2,A2:0.3):0.3,'\n", " ' (E1:0.3,E2:0.4):0.7):0.55);')\n", "\n", "from skbio.tree import TreeNode\n", "\n", "tree = TreeNode.read(newick_tree)\n", "tree = tree.root_at_midpoint()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t = ete3.Tree.from_skbio(tree, map_attributes=[\"value\"])\n", "t.render(\"%%inline\", tree_style=ts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pairing this with the table we defined above (displayed again in the cell below), given what you now know about these OTUs, which would you consider the most diverse? Are you happy with the $\\alpha$ diversity conclusion that you obtained when computing the number of observed OTUs in each sample?" 
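 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To help think that question through, here is a small sketch that tallies which domains are represented among the observed OTUs in each sample. It relies only on the first letter of each feature id (``A``, ``B``, or ``E``), as described above - this is for illustration, not how taxonomy would be handled in a real analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# map the first letter of each feature id to a domain, as described above\n", "domains = {'A': 'Archaea', 'B': 'Bacteria', 'E': 'Eukaryota'}\n", "for sample_id in table2.columns:\n", " observed = [fid for fid in table2.index if table2[sample_id][fid] > 0]\n", " print(sample_id, sorted({domains[fid[0]] for fid in observed}))"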
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " A B C\n", "B1 1 1 5\n", "B2 1 2 0\n", "B3 3 1 0\n", "B4 0 2 0\n", "B5 0 0 0\n", "A1 0 0 3\n", "E2 0 0 1" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [3.1.3.2](#3.2) Phylogenetic Diversity (PD) [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Phylogenetic Diversity (PD) is a metric that was developed by Dan Faith in the early 1990s (find the original paper [here](http://www.sciencedirect.com/science/article/pii/0006320792912013)). Like many of the measures that are used in microbial community ecology, it wasn't initially designed for studying microbial communities, but rather communities of \"macro-organisms\" (macrobes?). Some of these metrics, including PD, do translate well to microbial community analysis, while some don't translate as well. (For an illustration of the effect of sequencing error on PD, where it is handled well, versus its effect on the Chao1 metric, where it is handled less well, see Figure 1 of [Reeder and Knight (2010)](http://www.nature.com/nmeth/journal/v7/n9/full/nmeth0910-668b.html))." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PD is relatively simple to calculate. It is computed simply as the sum of the branch length in a phylogenetic tree that is \"covered\" or represented in a given sample. Let's look at an example to see how this works." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'll now define a couple of functions that we'll use to compute PD." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def get_observed_nodes(tree, table, sample_id, verbose=False):\n", " observed_otus = [obs_id for obs_id in table.index\n", " if table[sample_id][obs_id] > 0]\n", " observed_nodes = set()\n", " # iterate over the observed OTUs\n", " for otu in observed_otus:\n", " t = tree.find(otu)\n", " observed_nodes.add(t)\n", " if verbose:\n", " print(t.name, t.length, end=' ')\n", " for internal_node in t.ancestors():\n", " if internal_node.length is None:\n", " # we've hit the root\n", " if verbose:\n", " print('')\n", " else:\n", " if verbose and internal_node not in observed_nodes:\n", " print(internal_node.length, end=' ')\n", " observed_nodes.add(internal_node)\n", " return observed_nodes\n", "\n", "def phylogenetic_diversity(tree, table, sample_id, verbose=False):\n", " observed_nodes = get_observed_nodes(tree, table, sample_id, verbose=verbose)\n", " result = sum(o.length for o in observed_nodes)\n", " return result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then apply those to compute the PD of our three samples. For each computation, we're also printing out the branch lengths of the branches that are observed *for the first time* when looking at a given OTU. When computing PD, we include the length of each branch only one time." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "B1 0.2 0.3 0.2250000000000001\n", "B2 0.3\n", "B3 0.5 0.2 0.3\n", "2.0250000000000004" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd_A = phylogenetic_diversity(tree, table2, 'A', verbose=True)\n", "print(pd_A)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "B1 0.2 0.3 0.2250000000000001\n", "B2 0.3\n", "B3 0.5 0.2 0.3\n", "B4 0.3\n", "2.325" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd_B = phylogenetic_diversity(tree, table2, 'B', verbose=True)\n", "print(pd_B)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "B1 0.2 0.3 0.2250000000000001\n", "A1 0.2 0.3 0.32499999999999996\n", "E2 0.4 0.7\n", "2.65" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd_C = phylogenetic_diversity(tree, table2, 'C', verbose=True)\n", "print(pd_C)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does this result compare to what we observed above with the Observed OTUs metric? Based on your knowledge of biology, which do you think is a better representation of the relative diversities of these samples?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [3.1.3.3](#3.3) Even sampling [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Imagine again that we're going out to count plants in the Sonoran Desert and the Costa Rican rainforest. We're interested in getting an idea of the plant richness in each environment. In the Sonoran Desert, we survey a square kilometer area, and count 150 species of plants. In the rainforest, we survey a square meter, and count 15 species of plants. So, clearly the plant species richness in the Sonoran Desert is higher, right? What's wrong with this comparison?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem is that we've expended a lot more sampling effort in the desert than we did in the rainforest, so it shouldn't be surprising that we observed more species there. If we expended the same effort in the rainforest, we'd probably observe a lot more than 15 or 150 plant species, and we'd have a more sound comparison." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In sequencing-based studies of microorganism richness, the analog of sampling area is sequencing depth. If we collect 100 sequences from one sample, and 10,000 sequences from another sample, we can't directly compare the number of observed OTUs or the phylogenetic diversity of these because we expended a lot more sampling effort on the sample with 10,000 sequences than on the sample with 100 sequences. The way this is typically handled is by randomly subsampling sequences from the sample with more sequences until the sequencing depth is equal to that in the sample with fewer sequences. If we randomly select 100 sequences at random from the sample with 10,000 sequences, and compute the alpha diversity based on that random subsample, we'll have a better idea of the relative alpha diversities of the two samples." 
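 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a minimal sketch of that idea: a function that randomly subsamples (or *rarefies*) one sample's counts to a specified depth, applied to ``table1`` from earlier in this chapter. This is only for illustration - in practice you would rely on an existing, well-tested implementation - and because the subsample is random, repeated runs will generally give slightly different answers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def subsample_counts(counts, depth):\n", " # represent the sample as one entry per observed sequence (the feature index,\n", " # repeated once per count), then draw a random subsample without replacement\n", " sequences = np.repeat(np.arange(len(counts)), counts)\n", " subsample = np.random.choice(sequences, size=depth, replace=False)\n", " # tally how many of the drawn sequences came from each feature\n", " return np.bincount(subsample, minlength=len(counts))\n", "\n", "# rarefy sample B of table1 (10 sequences) to the depth of sample A (5 sequences)\n", "pd.Series(subsample_counts(table1['B'].values, 5), index=table1.index, name='B')"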
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " A B C\n", "OTU1 50 4 0\n", "OTU2 35 200 0\n", "OTU3 100 2 1\n", "OTU4 15 400 1\n", "OTU5 0 40 1" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_ids = ['A', 'B', 'C']\n", "feature_ids = ['OTU1', 'OTU2', 'OTU3', 'OTU4', 'OTU5']\n", "data = np.array([[50, 4, 0],\n", " [35, 200, 0],\n", " [100, 2, 1],\n", " [15, 400, 1],\n", " [0, 40, 1]])\n", "\n", "bad_table = pd.DataFrame(data, index=feature_ids, columns=sample_ids)\n", "bad_table" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(observed_otus(bad_table, 'A'))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(observed_otus(bad_table, 'B'))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(observed_otus(bad_table, 'C'))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 200\n", "B 646\n", "C 3\n", "dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(bad_table.sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TODO**: Add alpha rarefaction discussion." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [3.1.4](#4) Measuring beta diversity [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Table of Contents**\n", "0. [Distance metrics](#4.1)\n", " 0. [Bray-Curtis](#4.1.1)\n", " 0. [Unweighted UniFrac](#4.1.2)\n", " 0. [Even sampling](#4.1.3)\n", "0. [Interpreting distance matrices](#4.2)\n", " 0. [Distribution plots and comparisons](#4.2.1)\n", " 0. [Hierarchical clustering](#4.2.2)\n", "0. [Ordination](#4.3)\n", " 0. [Polar ordination](#4.3.1)\n", " 0. [Determining the most important axes in polar ordination](#4.3.2)\n", " 0. [Interpreting ordination plots](#4.3.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "$\\beta$-diversity (canonically pronounced *beta diversity*) refers to **between sample diversity**, and is typically used to answer questions of the form: is sample $A$ more similar in composition to sample $B$ or sample $C$? In this section we'll explore two (of tens or hundreds) of metrics for computing pairwise dissimilarity of samples to estimate $\\beta$ diversity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [3.1.4.1](#4.1) Distance metrics [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Table of Contents**\n", "0. [Bray-Curtis](#4.1.1)\n", "0. [Unweighted UniFrac](#4.1.2)\n", "0. [Even sampling](#4.1.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### [3.1.4.1.1](#4.1.1) Bray-Curtis [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The first metric that we'll look at is a quantitative non-phylogenetic $\\beta$ diversity metric called Bray-Curtis. 
The Bray-Curtis dissimilarity between a pair of samples, $j$ and $k$, is defined as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$BC_{jk} = \\frac{ \\sum_{i} | X_{ij} - X_{ik}|} {\\sum_{i} (X_{ij} + X_{ik})}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$i$ : feature (e.g., OTUs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$X_{ij}$ : frequency of feature $i$ in sample $j$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$X_{ik}$ : frequency of feature $i$ in sample $k$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This could be implemented in python as follows:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def bray_curtis_distance(table, sample1_id, sample2_id):\n", " numerator = 0\n", " denominator = 0\n", " sample1_counts = table[sample1_id]\n", " sample2_counts = table[sample2_id]\n", " for sample1_count, sample2_count in zip(sample1_counts, sample2_counts):\n", " numerator += abs(sample1_count - sample2_count)\n", " denominator += sample1_count + sample2_count\n", " return numerator / denominator" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " A B C\n", "OTU1 1 0 0\n", "OTU2 3 2 0\n", "OTU3 0 0 6\n", "OTU4 1 4 2\n", "OTU5 0 4 1" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now apply this to some pairs of samples:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(bray_curtis_distance(table1, 'A', 'B'))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.857142857143" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(bray_curtis_distance(table1, 'A', 'C'))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.684210526316" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(bray_curtis_distance(table1, 'B', 'C'))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(bray_curtis_distance(table1, 'A', 'A'))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.684210526316" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(bray_curtis_distance(table1, 'C', 'B'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ultimately, we likely want to apply this to all pairs of samples to get a distance matrix containing all pairwise distances. Let's define a function for that, and then compute all pairwise Bray-Curtis distances between samples `A`, `B` and `C`." 
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "from skbio.stats.distance import DistanceMatrix\n", "from numpy import zeros\n", "\n", "def table_to_distances(table, pairwise_distance_fn):\n", " sample_ids = table.columns\n", " num_samples = len(sample_ids)\n", " data = zeros((num_samples, num_samples))\n", " for i, sample1_id in enumerate(sample_ids):\n", " for j, sample2_id in enumerate(sample_ids[:i]):\n", " data[i,j] = data[j,i] = pairwise_distance_fn(table, sample1_id, sample2_id)\n", " return DistanceMatrix(data, sample_ids)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3x3 distance matrix\n", "IDs:\n", "'A', 'B', 'C'\n", "Data:\n", "[[ 0. 0.6 0.85714286]\n", " [ 0.6 0. 0.68421053]\n", " [ 0.85714286 0.68421053 0. ]]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bc_dm = table_to_distances(table1, bray_curtis_distance)\n", "print(bc_dm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### [3.1.4.1.2](#4.1.2) Unweighted UniFrac [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Just as phylogenetic alpha diversity metrics can be more informative than non-phylogenetic alpha diversity metrics, phylogenetic beta diversity metrics offer advantages over non-phylogenetic metrics such as Bray-Curtis. The most widely applied phylogenetic beta diversity metric as of this writing is unweighted UniFrac. UniFrac was initially presented in [Lozupone and Knight, 2005, Applied and Environmental Microbiology](http://aem.asm.org/content/71/12/8228.abstract), and has been widely applied in microbial ecology since (and the illustration of UniFrac computation presented below is derived from a similar example originally developed by Lozupone and Knight)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The unweighted UniFrac distance between a pair of samples `A` and `B` is defined as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$U_{AB} = \\frac{unique}{observed}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$unique$ : the unique branch length, or branch length that only leads to OTU(s) observed in sample $A$ or sample $B$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$observed$ : the total branch length observed in either sample $A$ or sample $B$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate how UniFrac distances are computed, before we get into actually computing them, let's look at a few examples. In these examples, imagine that we're determining the pairwise UniFrac distance between two samples: a red sample, and a blue sample. If a red box appears next to an OTU, that indicates that it's observed in the red sample; if a blue box appears next to the OTU, that indicates that it's observed in the blue sample; if a red and blue box appears next to the OTU, that indicates that the OTU is present in both samples; and if no box is presented next to the OTU, that indicates that it's present in neither sample." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To compute the UniFrac distance between a pair of samples, we need to know the sum of the branch length that was observed in either sample (the *observed* branch length), and the sum of the branch length that was observed only in a single sample (the *unique* branch length). In these examples, we color all of the *observed* branch length. Branch length that is unique to the red sample is red, branch length that is unique to the blue sample is blue, and branch length that is observed in both samples is purple. Unobserved branch length is black (as is the vertical branches, as those don't contribute to branch length - they are purely for visual presentation)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the tree on the right, all of the OTUs that are observed in either sample are observed in both samples. As a result, all of the observed branch length is purple. The unique branch length in this case is zero, so **we have a UniFrac distance of 0 between the red and blue samples**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the other end of the spectrum, in the second tree, all of the OTUs in the tree are observed either in the red sample, or in the blue sample. All of the observed branch length in the tree is either red or blue, meaning that if you follow a branch out to the tips, you will observe only red or blue samples. In this case the unique branch length is equal to the observed branch length, so **we have a UniFrac distance of 1 between the red and blue samples**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, most of the time we're somewhere in the middle. In this tree, some of our branch length is unique, and some is not. For example, OTU 1 is only observed in our red sample, so the terminal branch leading to OTU 1 is red (i.e., unique to the red sample). OTU 2 is only observed in our blue sample, so the terminal branch leading to OTU 2 is blue (i.e., unique to the blue sample). However, the internal branch leading to the node connecting OTU 1 and OTU 2 leads to OTUs observed in both the red and blue samples (i.e., OTU 1 and OTU 2), so is purple (i.e, observed branch length, but not unique branch length). In this case, **we have an intermediate UniFrac distance between the red and blue samples, maybe somewhere around 0.5**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
*(Figure: a tree in which some of the observed branch length is unique to the red or blue sample, and some is shared between the two samples.)*\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now compute the Unweighted UniFrac distances between some samples. Imagine we have the following tree, paired with our table below (printed below, for quick reference)." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " A B C\n", "OTU1 1 0 0\n", "OTU2 3 2 0\n", "OTU3 0 0 6\n", "OTU4 1 4 2\n", "OTU5 0 4 1" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's compute the unweighted UniFrac distance between samples $A$ and $B$. The *unweighted* in *unweighted UniFrac* means that this is a qualitative diversity metric, meaning that we don't care about the abundances of the OTUs, only whether they are present in a given sample ($frequency > 0$) or not present ($frequency = 0$)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start at the top right branch in the tree, and for each branch, determine if the branch is observed, and if so, if it is also unique. If it is observed then you add its length to your observed branch length. If it is observed and unique, then you also add its length to your unique branch length." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For samples $A$ and $B$, I get the following (in the tree on the right, red branches are those observed in $A$, blue branches are those observed in $B$, and purple are observed in both):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$unique_{ab} = 0.5 + 0.75 = 1.25$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$observed_{ab} = 0.5 + 0.5 + 0.5 + 1.0 + 1.25 + 0.75 + 0.75 = 5.25$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$uu_{ab} = \\frac{unique_{ab}}{observed_{ab}} = \\frac{1.25}{5.25} = 0.238$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an exercise, now compute the UniFrac distances between samples $B$ and $C$, and samples $A$ and $C$, using the above table and tree. When I do this, I get the following distance matrix." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3x3 distance matrix\n", "IDs:\n", "'A', 'B', 'C'\n", "Data:\n", "[[ 0. 0.24 0.52]\n", " [ 0.24 0. 0.35]\n", " [ 0.52 0.35 0. ]]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ids = ['A', 'B', 'C']\n", "d = [[0.00, 0.24, 0.52],\n", " [0.24, 0.00, 0.35],\n", " [0.52, 0.35, 0.00]]\n", "print(DistanceMatrix(d, ids))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " **TODO**: Interface change so this code can be used with ``table_to_distances``." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.23809523809523808\n", "0.52\n", "0.34782608695652173" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## This is untested!! 
I'm not certain that it's exactly right, just a quick test.\n", "\n", "newick_tree1 = StringIO('(((((OTU1:0.5,OTU2:0.5):0.5,OTU3:1.0):1.0),(OTU4:0.75,OTU5:0.75):1.25))root;')\n", "tree1 = TreeNode.read(newick_tree1)\n", "\n", "def unweighted_unifrac(tree, table, sample_id1, sample_id2, verbose=False):\n", " observed_nodes1 = get_observed_nodes(tree, table, sample_id1, verbose=verbose)\n", " observed_nodes2 = get_observed_nodes(tree, table, sample_id2, verbose=verbose)\n", " observed_branch_length = sum(o.length for o in observed_nodes1 | observed_nodes2)\n", " shared_branch_length = sum(o.length for o in observed_nodes1 & observed_nodes2)\n", " unique_branch_length = observed_branch_length - shared_branch_length\n", " unweighted_unifrac = unique_branch_length / observed_branch_length\n", " return unweighted_unifrac\n", "\n", "print(unweighted_unifrac(tree1, table1, 'A', 'B'))\n", "print(unweighted_unifrac(tree1, table1, 'A', 'C'))\n", "print(unweighted_unifrac(tree1, table1, 'B', 'C'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### [3.1.4.1.3](#4.1.3) Even sampling [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "**TODO**: Add discussion on necessity of even sampling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [3.1.4.2](#4.2) Interpreting distance matrices [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Table of Contents**\n", "0. [Distribution plots and comparisons](#4.2.1)\n", "0. [Hierarchical clustering](#4.2.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In the previous section we computed distance matrices that contained the pairwise distances between a few samples. You can look at those distance matrices and get a pretty good feeling for what the patterns are. For example, what are the most similar samples? What are the most dissimilar samples?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if instead of three samples though, we had more. Here's a screenshot from a distance matrix containing data on 105 samples (this is just the first few rows and columns):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do you have a good feeling for the patterns here? What are the most similar samples? What are the most dissimilar samples?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Chances are, you can't just squint at that table and understand what's going on (but if you can, I'm hiring!). The problem is exacerbated by the fact that in modern microbial ecology studies we may have thousands or tens of thousands of samples, not \"just\" hundreds as in the table above. We need tools to help us take these raw distances and convert them into something that we can interpret. In this section we'll look at some techniques, one of which we've covered previously, that will help us interpret large distance matrices." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One excellent paper that includes a comparison of several different strategies for interpreting beta diversity results is [Costello *et al.* Science (2009) Bacterial Community Variation in Human Body Habitats Across Space and Time](https://www.sciencemag.org/content/326/5960/1694.full). In this study, the authors collected microbiome samples from 7 human subjects at about 25 sites on their bodies, at four different points in time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure 1 shows several different approaches for comparing the resulting UniFrac distance matrix (this image is linked from the *Science* journal website - copyright belongs to *Science*):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's generate a small distance matrix representing just a few of these body sites, and figure out how we'd generate and interpret each of these visualizations. The values in the distance matrix below are a subset of the unweighted UniFrac distance matrix representing two samples each from three body sites from the Costello *et al.* (2009) study." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " body site individual\n", "A gut subject 1\n", "B gut subject 2\n", "C tongue subject 1\n", "D tongue subject 2\n", "E skin subject 1\n", "F skin subject 2" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_ids = ['A', 'B', 'C', 'D', 'E', 'F']\n", "_columns = ['body site', 'individual']\n", "_md = [['gut', 'subject 1'],\n", " ['gut', 'subject 2'],\n", " ['tongue', 'subject 1'],\n", " ['tongue', 'subject 2'],\n", " ['skin', 'subject 1'],\n", " ['skin', 'subject 2']]\n", "\n", "human_microbiome_sample_md = pd.DataFrame(_md, index=sample_ids, columns=_columns)\n", "human_microbiome_sample_md" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6x6 distance matrix\n", "IDs:\n", "'A', 'B', 'C', 'D', 'E', 'F'\n", "Data:\n", "[[ 0. 0.35 0.83 0.83 0.9 0.9 ]\n", " [ 0.35 0. 0.86 0.85 0.92 0.91]\n", " [ 0.83 0.86 0. 0.25 0.88 0.87]\n", " [ 0.83 0.85 0.25 0. 0.88 0.88]\n", " [ 0.9 0.92 0.88 0.88 0. 0.5 ]\n", " [ 0.9 0.91 0.87 0.88 0.5 0. ]]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm_data = np.array([[0.00, 0.35, 0.83, 0.83, 0.90, 0.90],\n", " [0.35, 0.00, 0.86, 0.85, 0.92, 0.91],\n", " [0.83, 0.86, 0.00, 0.25, 0.88, 0.87],\n", " [0.83, 0.85, 0.25, 0.00, 0.88, 0.88],\n", " [0.90, 0.92, 0.88, 0.88, 0.00, 0.50],\n", " [0.90, 0.91, 0.87, 0.88, 0.50, 0.00]])\n", "\n", "human_microbiome_dm = DistanceMatrix(dm_data, sample_ids)\n", "print(human_microbiome_dm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### [3.1.4.2.1](#4.2.1) Distribution plots and comparisons [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "First, let's look at the analysis presented in panels E and F. Instead of generating bar plots here, we'll generate box plots as these are more informative (i.e., they provide a more detailed summary of the distribution being investigated). One important thing to notice here is the central role that the sample metadata plays in the visualization. 
If we just had our sample ids (i.e., letters ``A`` through ``F``) we wouldn't be able to group distances into *within* and *between* sample type categories, and we therefore couldn't perform the comparisons we're interested in." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "def within_between_category_distributions(dm, md, md_category):\n", " within_category_distances = []\n", " between_category_distances = []\n", " for i, sample_id1 in enumerate(dm.ids):\n", " sample_md1 = md[md_category][sample_id1]\n", " for sample_id2 in dm.ids[:i]:\n", " sample_md2 = md[md_category][sample_id2]\n", " if sample_md1 == sample_md2:\n", " within_category_distances.append(dm[sample_id1, sample_id2])\n", " else:\n", " between_category_distances.append(dm[sample_id1, sample_id2])\n", " return within_category_distances, between_category_distances" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.34999999999999998, 0.25, 0.5]\n", "[0.82999999999999996, 0.85999999999999999, 0.82999999999999996, 0.84999999999999998, 0.90000000000000002, 0.92000000000000004, 0.88, 0.88, 0.90000000000000002, 0.91000000000000003, 0.87, 0.88]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "within_category_distances, between_category_distances = within_between_category_distributions(human_microbiome_dm, human_microbiome_sample_md, \"body site\")\n", "print(within_category_distances)\n", "print(between_category_distances)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import seaborn as sns\n", "ax = sns.boxplot(data=[within_category_distances, between_category_distances])\n", "ax.set_xticklabels(['same body habitat', 'different body habitat'])\n", "ax.set_ylabel('Unweighted UniFrac Distance')\n", "_ = ax.set_ylim(0.0, 1.0)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "method name ANOSIM\n", "test statistic name R\n", "sample size 6\n", "number of groups 3\n", "test statistic 1\n", "p-value 0.065\n", "number of permutations 999\n", "Name: ANOSIM results, dtype: object" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from skbio.stats.distance import anosim\n", "anosim(human_microbiome_dm, human_microbiome_sample_md, 'body site')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we run through these same steps, but base our analysis on a different metadata category where we don't expect to see any significant clustering, you can see that we no longer get a significant result." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.82999999999999996, 0.84999999999999998, 0.90000000000000002, 0.88, 0.91000000000000003, 0.88]\n", "[0.34999999999999998, 0.85999999999999999, 0.82999999999999996, 0.25, 0.92000000000000004, 0.88, 0.90000000000000002, 0.87, 0.5]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "within_category_distances, between_category_distances = within_between_category_distributions(human_microbiome_dm, human_microbiome_sample_md, \"individual\")\n", "print(within_category_distances)\n", "print(between_category_distances)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ax = sns.boxplot(data=[within_category_distances, between_category_distances])\n", "ax.set_xticklabels(['same person', 'different person'])\n", "ax.set_ylabel('Unweighted UniFrac Distance')\n", "_ = ax.set_ylim(0.0, 1.0)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "method name ANOSIM\n", "test statistic name R\n", "sample size 6\n", "number of groups 2\n", "test statistic -0.333333\n", "p-value 0.869\n", "number of permutations 999\n", "Name: ANOSIM results, dtype: object" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anosim(human_microbiome_dm, human_microbiome_sample_md, 'individual')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why do you think the distribution of distances between people has a greater range than the distribution of distances within people in this particular example?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we used ANOSIM testing whether our with and between category groups differ. This test is specifically designed for distance matrices, and it accounts for the fact that the values are not independent of one another. For example, if one of our samples was very different from all of the others, all of the distances associated with that sample would be large. It's very important to choose the appropriate statistical test to use. One free resource for helping you do that is [*The Guide to Statistical Analysis in Microbial Ecology (GUSTAME)*](http://mb3is.megx.net/gustame). If you're getting started in microbial ecology, I recommend spending some time studying GUSTAME." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### [3.1.4.2.2](#4.2.2) Hierarchical clustering [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Next, let's look at a hierarchical clustering analysis, similar to that presented in panel G above. Here I'm applying the UPGMA functionality implemented in [SciPy](http://www.scipy.org/scipylib/index.html) to generate a tree which we visualize with a dendrogram. However the tips in this tree don't represent sequences or OTUs, like they did when we [covered UPGMA for phylogenetic reconstruction](../2/4.ipynb#6.5), but instead they represent samples, and samples with a smaller branch length between them are more similar in composition than samples with a longer branch length between them. (Remember that only horizontal branch length is counted - vertical branch length is just to aid in the organization of the dendrogram.)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy.cluster.hierarchy import average, dendrogram\n", "lm = average(human_microbiome_dm.condensed_form())\n", "d = dendrogram(lm, labels=human_microbiome_dm.ids, orientation='right',\n", " link_color_func=lambda x: 'black')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, we can see how the data really only becomes interpretable in the context of metadata:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels = [human_microbiome_sample_md['body site'][sid] for sid in sample_ids]\n", "d = dendrogram(lm, labels=labels, orientation='right',\n", " link_color_func=lambda x: 'black')" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels = [human_microbiome_sample_md['individual'][sid] for sid in sample_ids]\n", "d = dendrogram(lm, labels=labels, orientation='right',\n", " link_color_func=lambda x: 'black')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [3.1.4.3](#4.3) Ordination [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Table of Contents**\n", "0. [Polar ordination](#4.3.1)\n", "0. [Determining the most important axes in polar ordination](#4.3.2)\n", "0. [Interpreting ordination plots](#4.3.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Finally, let's look at ordination, similar to that presented in panels A-D. The basic idea behind ordination is dimensionality reduction: we want to take high-dimensionality data (a distance matrix) and represent that in a few (usually two or three) dimensions. As humans, we're very bad at interpreting high dimensionality data directly: with ordination, we can take an $n$-dimensional data set (e.g., a distance matrix of shape $n \\times n$, representing the distances between $n$ biological samples) and reduce that to a 2-dimensional scatter plot similar to that presented in panels A-D above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ordination is a technique that is widely applied in ecology and in bioinformatics, but the math behind some of the methods such as *Principal Coordinates Analysis* is fairly complex, and as a result I've found that these methods are a black box for a lot of people. Possibly the most simple ordination technique is one called Polar Ordination. Polar Ordination is not widely applied because it has some inconvenient features, but I find that it is useful for introducing the idea behind ordination. Here we'll work through a simple implementation of ordination to illustrate the process, which will help us to interpret ordination plots. In practice, you will use existing software, such as [scikit-bio](http://scikit-bio.org)'s [ordination module](http://scikit-bio.org/maths.stats.ordination.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An excellent site for learning more about ordination is [Michael W. Palmer's Ordination Methods page](http://ordination.okstate.edu/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### [3.1.4.3.1](#4.3.1) Polar ordination [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "First, let's print our distance matrix again so we have it nearby." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6x6 distance matrix\n", "IDs:\n", "'A', 'B', 'C', 'D', 'E', 'F'\n", "Data:\n", "[[ 0. 0.35 0.83 0.83 0.9 0.9 ]\n", " [ 0.35 0. 0.86 0.85 0.92 0.91]\n", " [ 0.83 0.86 0. 0.25 0.88 0.87]\n", " [ 0.83 0.85 0.25 0. 0.88 0.88]\n", " [ 0.9 0.92 0.88 0.88 0. 0.5 ]\n", " [ 0.9 0.91 0.87 0.88 0.5 0. ]]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(human_microbiome_dm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Polar ordination works in a few steps:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Step 1.** Identify the largest distance in the distance matrix." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Step 2.** Define a line, with the two samples contributing to that distance defining the endpoints." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Step 3.** Compute the location of each other sample on that axis as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$a = \\frac{D^2 + D1^2 - D2^2}{2 \\times D}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$D$ is distance between the endpoints" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$D1$ is distance between the current sample and endpoint 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$D2$ is distance between sample and endpoint 2." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Step 4.** Find the next largest distance that could be used to define an *uncorrelated axis*. (This step can be labor-intensive to do by hand - usually you would compute all of the axes, along with correlation scores. I'll pick one for the demo, and we'll wrap up by looking at all of the axes.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is what steps 2 and 3 look like in Python:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "def compute_axis_values(dm, endpoint1, endpoint2):\n", " d = dm[endpoint1, endpoint2]\n", " result = {endpoint1: 0, endpoint2: d}\n", " non_endpoints = set(dm.ids) - set([endpoint1, endpoint2])\n", " for e in non_endpoints:\n", " d1 = dm[endpoint1, e]\n", " d2 = dm[endpoint2, e]\n", " result[e] = (d**2 + d1**2 - d2**2) / (2 * d)\n", " return d, [result[e] for e in dm.ids]" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 0.0863586956522\n", "B 0\n", "C 0.441086956522\n", "D 0.431793478261\n", "E 0.92\n", "F 0.774184782609" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d, a1_values = compute_axis_values(human_microbiome_dm, 'B', 'E')\n", "for sid, a1_value in zip(human_microbiome_dm.ids, a1_values):\n", " print(sid, a1_value)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 0.371193181818\n", "B 0.369602272727\n", "C 0.0355113636364\n", "D 0\n", "E 0.88\n", "F 0.737954545455" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d, a2_values = compute_axis_values(human_microbiome_dm, 'D', 'E')\n", "for sid, a2_value in zip(human_microbiome_dm.ids, a2_values):\n", " print(sid, a2_value)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pylab import scatter\n", "ord_plot = scatter(a1_values, a2_values, s=40)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And again, let's look at how including metadata helps us to interpret our results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we'll color the points by the body habitat that they're derived from:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "colors = {'tongue': 'red', 'gut':'yellow', 'skin':'blue'}\n", "c = [colors[human_microbiome_sample_md['body site'][e]] for e in human_microbiome_dm.ids]\n", "ord_plot = scatter(a1_values, a2_values, s=40, c=c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And next we'll color the samples by the person that they're derived from. Notice that this plot and the one above are identical except for coloring. Think about how the colors (and therefore the sample metadata) help you to interpret these plots." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "person_colors = {'subject 1': 'red', 'subject 2':'yellow'}\n", "person_c = [person_colors[human_microbiome_sample_md['individual'][e]] for e in human_microbiome_dm.ids]\n", "ord_plot = scatter(a1_values, a2_values, s=40, c=person_c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### [3.1.4.3.2](#4.3.2) Determining the most important axes in polar ordination [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Generally, you would compute the polar ordination axes for all possible axes. You could then order the axes by which represent the largest differences in sample composition, and the lowest correlation with previous axes. This might look like the following:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "axis 0: \t0.920\t1.000\tE\tB\n", "axis 1: \t0.910\t0.943\tF\tB\n", "axis 2: \t0.900\t0.928\tE\tA\n", "axis 3: \t0.900\t0.886\tF\tA\n", "axis 4: \t0.880\t0.543\tE\tD\n", "axis 5: \t0.880\t0.429\tF\tD\n", "axis 6: \t0.880\t0.429\tE\tC\n", "axis 7: \t0.870\t0.371\tF\tC\n", "axis 8: \t0.860\t0.543\tC\tB\n", "axis 9: \t0.850\t0.486\tD\tB\n", "axis 10: \t0.830\t0.429\tC\tA\n", "axis 11: \t0.830\t0.406\tD\tA\n", "axis 12: \t0.500\t0.232\tF\tE\n", "axis 13: \t0.350\t0.143\tB\tA\n", "axis 14: \t0.250\t0.493\tD\tC" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy.stats import spearmanr\n", "\n", "data = []\n", "for i, sample_id1 in enumerate(human_microbiome_dm.ids):\n", " for sample_id2 in human_microbiome_dm.ids[:i]:\n", " d, axis_values = compute_axis_values(human_microbiome_dm, sample_id1, sample_id2)\n", " r, p = spearmanr(a1_values, axis_values)\n", " data.append((d, abs(r), sample_id1, sample_id2, axis_values))\n", "\n", "data.sort()\n", "data.reverse()\n", "for i, e in enumerate(data):\n", " print(\"axis %d:\" % i, end=' ')\n", " print(\"\\t%1.3f\\t%1.3f\\t%s\\t%s\" % e[:4])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So why do we care about axes being uncorrelated? And why do we care about explaining a lot of the variation? Let's look at a few of these plots and see how they compare to the plots above, where we compared axes 1 and 4." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ord_plot = scatter(data[0][4], data[1][4], s=40, c=c)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ord_plot = scatter(data[0][4], data[13][4], s=40, c=c)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ord_plot = scatter(data[0][4], data[14][4], s=40, c=c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### [3.1.4.3.3](#4.3.3) Interpreting ordination plots [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "There are a few points that are important to keep in mind when interpreting ordination plots. Review each one of these in the context of polar ordination to figure out the reason for each." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Directionality of the axes is not important (e.g., up/down/left/right)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One thing that you may have notices as you computed the polar ordination above is that the method is *not symmetric*: in other words, the axis values for axis $EB$ are different than for axis $BE$. In practice though, we derive the same conclusions regardless of how we compute that axis: in this example, that samples cluster by body site." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "d, a1_values = compute_axis_values(human_microbiome_dm, 'E', 'B')\n", "d, a2_values = compute_axis_values(human_microbiome_dm, 'E', 'D')\n", "d, alt_a1_values = compute_axis_values(human_microbiome_dm, 'B', 'E')" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ord_plot = scatter(a1_values, a2_values, s=40, c=c)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ord_plot = scatter(alt_a1_values, a2_values, s=40, c=c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some other important features:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Numerical scale of the axis is generally not useful\n", "* The order of axes is generally important (first axis explains the most variation, second axis explains the second most variation, ...)\n", "* Most techniques result in uncorrelated axes.\n", "* Additional axes can be generated (third, fourth, ...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [3.1.5](#5) Tools for using ordination in practice: scikit-bio, pandas, and matplotlib [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "As I mentioned above, polar ordination isn't widely used in practice, but the features that it illustrates are common to ordination methods. One of the most widely used ordination methods used to study biological diversity is Principal Coordinates Analysis or PCoA, which is implemented in [scikit-bio](http://scikit-bio.org/)'s [``ordination`` module](http://scikit-bio.org/maths.stats.ordination.html) (among many other packages)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we're going to make use of three python third-party modules to apply PCoA and visualize the results 3D scatter plots. The data we'll use here is the full unweighted UniFrac distance matrix from a study of soil microbial communities across North and South America (originally published in [Lauber *et al.* (2009)](http://www.ncbi.nlm.nih.gov/pubmed/19502440)). We're going to use [pandas](http://pandas.pydata.org/) to manage the metadata, [scikit-bio](http://scikit-bio.org/) to manage the distance matrix and compute PCoA, and [matplotlib](http://matplotlib.org/) to visualize the results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we'll load sample metadata into a [pandas DataFrame](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html). These are really useful for loading and working with the type of tabular information that you'd typically store in a spreadsheet or database table. (Note that one thing I'm doing in the following cell is tricking pandas into thinking that it's getting a file as input, even though I have the information represented as tab-separated lines in a multiline string. [python's StringIO](https://docs.python.org/2/library/stringio.html) is very useful for this, and it's especially convenient in your unit tests... which you're writing for all of your code, right?) Here we'll load the tab-separated text, and then print it." 
] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " pH ENVO biome Latitude\n", "CF3.141691 3.56 ENVO:Temperate broadleaf and mixed forest biome 42.116667\n", "PE5.141692 3.57 ENVO:Tropical humid forests -12.633333\n", "BF2.141708 3.61 ENVO:Temperate broadleaf and mixed forest biome 41.583333\n", "CF2.141679 3.63 ENVO:Temperate broadleaf and mixed forest biome 41.933333\n", "CF1.141675 3.92 ENVO:Temperate broadleaf and mixed forest biome 42.158333\n", "HF2.141686 3.98 ENVO:Temperate broadleaf and mixed forest biome 42.500000\n", "BF1.141647 4.05 ENVO:Temperate broadleaf and mixed forest biome 41.583333\n", "PE4.141683 4.10 ENVO:Tropical humid forests -13.083333\n", "PE2.141725 4.11 ENVO:Tropical humid forests -13.083333\n", "PE1.141715 4.12 ENVO:Tropical humid forests -13.083333\n", "PE6.141700 4.12 ENVO:Tropical humid forests -12.650000\n", "TL3.141709 4.23 ENVO:shrubland 68.633333\n", "HF1.141663 4.25 ENVO:Temperate broadleaf and mixed forest biome 42.500000\n", "PE3.141731 4.25 ENVO:Tropical humid forests -13.083333\n", "BB1.141690 4.30 ENVO:Tropical humid forests 44.870000\n", "MP2.141695 4.38 ENVO:Temperate broadleaf and mixed forest biome 49.466667\n", "MP1.141661 4.56 ENVO:Temperate grasslands 49.466667\n", "TL1.141653 4.58 ENVO:grassland 68.633333\n", "BB2.141659 4.60 ENVO:Temperate broadleaf and mixed forest biome 44.866667\n", "LQ3.141712 4.67 ENVO:Tropical humid forests 18.300000\n", "CL3.141664 4.89 ENVO:Temperate broadleaf and mixed forest biome 34.616667\n", "LQ1.141701 4.89 ENVO:Tropical humid forests 18.300000\n", "HI4.141735 4.92 ENVO:Tropical and subtropical grasslands, sava... 20.083333\n", "SN1.141681 4.95 ENVO:Temperate broadleaf and mixed forest biome 36.450000\n", "LQ2.141729 5.03 ENVO:Tropical humid forests 18.300000\n", "CL4.141667 5.03 ENVO:Temperate grasslands 34.616667\n", "DF3.141696 5.05 ENVO:Temperate broadleaf and mixed forest biome 35.966667\n", "BZ1.141724 5.12 ENVO:forest 64.800000\n", "SP2.141678 5.13 ENVO:Temperate grasslands 36.616667\n", "IE1.141648 5.27 ENVO:Temperate grasslands 41.800000\n", "... ... ... 
...\n", "DF2.141726 6.84 ENVO:Temperate broadleaf and mixed forest biome 35.966667\n", "GB1.141665 6.84 ENVO:grassland 39.333333\n", "SR1.141680 6.84 ENVO:shrubland 34.700000\n", "SA1.141670 6.90 ENVO:forest 35.366667\n", "SR3.141674 6.95 ENVO:shrubland 34.683333\n", "KP4.141733 7.10 ENVO:shrubland 39.100000\n", "GB3.141652 7.18 ENVO:forest 39.316667\n", "CA1.141704 7.27 ENVO:forest 36.050000\n", "BP1.141702 7.53 ENVO:grassland 43.750000\n", "GB2.141732 7.57 ENVO:forest 39.316667\n", "MT1.141719 7.57 ENVO:forest 46.800000\n", "JT1.141699 7.60 ENVO:shrubland 33.966667\n", "MD2.141689 7.65 ENVO:shrubland 34.900000\n", "SF1.141728 7.71 ENVO:shrubland 35.383333\n", "CM1.141723 7.85 ENVO:Temperate grasslands 33.300000\n", "MD3.141707 7.90 ENVO:shrubland 34.900000\n", "KP3.141658 7.92 ENVO:shrubland 39.100000\n", "SB1.141730 7.92 ENVO:shrubland 34.466667\n", "RT1.141654 7.92 ENVO:shrubland 31.466667\n", "SR2.141673 8.00 ENVO:shrubland 34.683333\n", "CR1.141682 8.00 ENVO:Temperate grasslands 33.933333\n", "CA2.141685 8.02 ENVO:shrubland 36.050000\n", "RT2.141710 8.07 ENVO:Temperate grasslands 31.466667\n", "MD5.141688 8.07 ENVO:shrubland 35.200000\n", "SA2.141687 8.10 ENVO:shrubland 35.366667\n", "GB5.141668 8.22 ENVO:shrubland 39.350000\n", "SV1.141649 8.31 ENVO:shrubland 34.333333\n", "SF2.141677 8.38 ENVO:shrubland 35.383333\n", "SV2.141666 8.44 ENVO:grassland 34.333333\n", "MD4.141660 8.86 ENVO:shrubland 35.200000\n", "\n", "[89 rows x 3 columns]" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from iab.data import lauber_soil_sample_md\n", "lauber_soil_sample_md" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As one simple example of the many things that pandas can do, we can look up a value, such as the pH of sample ``MT2.141698``, as follows. If you're interested in learning more about pandas, [*Python for Data Analysis*](http://shop.oreilly.com/product/0636920023784.do) is a very good resource." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6.6600000000000001" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lauber_soil_sample_md['pH']['MT2.141698']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we'll load our distance matrix. This is similar to the ``human_microbiome_dm_data`` one that we loaded above, just a little bigger. After loading, we can visualize the resulting ``DistanceMatrix`` object for a summary." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from iab.data import lauber_soil_unweighted_unifrac_dm\n", "_ = lauber_soil_unweighted_unifrac_dm.plot(cmap='Greens')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does this visualization help you to interpret the results? Probably not. Generally we'll need to apply some approaches that will help us with interpretation. Let's use ordination here. We'll run Principal Coordinates Analysis on our ``DistanceMatrix`` object. This gives us a matrix of coordinate values for each sample, which we can then plot. We can use ``scikit-bio``'s implementation of PCoA as follows:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "from skbio.stats.ordination import pcoa\n", "\n", "lauber_soil_unweighted_unifrac_pc = pcoa(lauber_soil_unweighted_unifrac_dm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What does the following ordination plot tell you about the relationship between the similarity of microbial communities taken from similar and dissimilar latitudes?" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "_ = lauber_soil_unweighted_unifrac_pc.plot(lauber_soil_sample_md, 'Latitude', cmap='Greens', title=\"Samples colored by Latitude\", axis_labels=('PC1', 'PC2', 'PC3'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the answer to the above question is that there doesn't seem to be much association, you're on the right track. We can quantify this, for example, by testing for correlation between pH and value on PC 1." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "rho: 0.158\n", "p-value: 1.4e-01" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy.stats import spearmanr\n", "spearman_rho, spearman_p = spearmanr(lauber_soil_unweighted_unifrac_pc.samples['PC1'],\n", " lauber_soil_sample_md['Latitude'][lauber_soil_unweighted_unifrac_pc.samples.index])\n", "print('rho: %1.3f' % spearman_rho)\n", "print('p-value: %1.1e' % spearman_p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next plot, we'll color the points by the pH of the soil sample they represent. What does this plot suggest about the relationship between the similarity of microbial communities taken from similar and dissimilar pH?" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "_ = lauber_soil_unweighted_unifrac_pc.plot(lauber_soil_sample_md, 'pH', cmap='Greens', title=\"Samples colored by pH\", axis_labels=('PC1', 'PC2', 'PC3'))" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "rho: -0.958\n", "p-value: 1.9e-48" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy.stats import spearmanr\n", "spearman_rho, spearman_p = spearmanr(lauber_soil_unweighted_unifrac_pc.samples['PC1'],\n", " lauber_soil_sample_md['pH'][lauber_soil_unweighted_unifrac_pc.samples.index])\n", "print('rho: %1.3f' % spearman_rho)\n", "print('p-value: %1.1e' % spearman_p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Taken together, these plots and statistics suggest that soil microbial community composition is much more closely associated with pH than it is with latitude: the key result that was presented in [Lauber *et al.* (2009)](http://www.ncbi.nlm.nih.gov/pubmed/19502440)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [3.1.6](#6) PCoA versus PCA: what's the difference? [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "You may have also heard of a method related to PCoA, called Principal Components Analysis or PCA. There is, however, an important key difference between these methods. PCoA, which is what we've been working with, performs ordination with a distance matrix as input. PCA on the other hand performs ordination with sample by feature frequency data, such as the OTU tables that we've been working with, as input. It achieves this by computing Euclidean distance (see [here](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html#scipy.spatial.distance.euclidean)) between the samples and then running PCoA. So, if your distance metric is Euclidean, PCA and PCoA are the same. In practice however, we want to be able to use distance metrics that work better for studying biological diversity, such as Bray-Curtis or UniFrac. Therefore we typically compute distances with whatever metric we want, and then run PCoA." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [3.1.7](#7) Are two different analysis approaches giving me the same result? [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Table of Contents**\n", "0. [Procrustes analysis](#7.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "A question that comes up frequently, often in method comparison, is whether two different approaches for analyzing some data giving the consistent results. This could come up, for example, if you were comparing DNA sequence data from the same samples generated on the 454 Titanium platform with data generated on the Illumina MiSeq platform to see if you would derive the same biological conclusions based on either platform. This was done, for example, in [Additional Figure 1](http://genomebiology.com/2011/12/5/R50/additional) of [*Moving Pictures of the Human Microbiome*](http://genomebiology.com/content/12/5/R50). Similarly, you might wonder if two different OTU clustering methods or beta diversity metrics would lead you to the same biological conclusion. Let's look at one way that you might address this question." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine you ran three different beta diversity metrics on your BIOM table: unweighted UniFrac, Bray-Curtis, and weighted UniFrac (the quantitative analog of unweighted UniFrac), and then generated the following PCoA plots." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "_ = lauber_soil_unweighted_unifrac_pc.plot(lauber_soil_sample_md, 'pH', cmap='Greens',\n", " title=\"Unweighted UniFrac, samples colored by pH\",\n", " axis_labels=('PC1', 'PC2', 'PC3'))" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from iab.data import lauber_soil_bray_curtis_dm\n", "\n", "lauber_soil_bray_curtis_pcoa = pcoa(lauber_soil_bray_curtis_dm)\n", "\n", "_ = lauber_soil_bray_curtis_pcoa.plot(lauber_soil_sample_md, 'pH', cmap='Greens',\n", " title=\"Bray-Curtis, samples colored by pH\",\n", " axis_labels=('PC1', 'PC2', 'PC3'))" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "/Users/gregcaporaso/miniconda3/envs/iab/lib/python3.5/site-packages/skbio/stats/ordination/_principal_coordinate_analysis.py:102: RuntimeWarning: The result contains negative eigenvalues. Please compare their magnitude with the magnitude of some of the largest positive eigenvalues. If the negative ones are smaller, it's probably safe to ignore them, but if they are large in magnitude, the results won't be useful. See the Notes section for more details. The smallest eigenvalue is -0.010291669756329357 and the largest is 3.8374200744108204.\n", " RuntimeWarning\n", "
" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from iab.data import lauber_soil_weighted_unifrac_dm\n", "\n", "lauber_soil_weighted_unifrac_pcoa = pcoa(lauber_soil_weighted_unifrac_dm)\n", "\n", "_ = lauber_soil_weighted_unifrac_pcoa.plot(lauber_soil_sample_md, 'pH', cmap='Greens',\n", " title=\"Weighted UniFrac, samples colored by pH\",\n", " axis_labels=('PC1', 'PC2', 'PC3'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specifically, what we want to ask when comparing these results is **given a pair of ordination plots, is their shape (in two or three dimensions) the same?** The reason we care is that we want to know, **given a pair of ordination plots, would we derive the same biological conclusions regardless of which plot we look at?**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use a [Mantel test](http://scikit-bio.org/docs/latest/generated/generated/skbio.stats.distance.mantel.html) for this, which is a way of testing for correlation between distance matrices." ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Mantel r: 0.906\n", "p-value: 1.0e-03\n", "Number of samples compared: 88" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from skbio.stats.distance import mantel\n", "\n", "r, p, n = mantel(lauber_soil_unweighted_unifrac_dm, lauber_soil_weighted_unifrac_dm, method='spearman', strict=False)\n", "print(\"Mantel r: %1.3f\" % r)\n", "print(\"p-value: %1.1e\" % p)\n", "print(\"Number of samples compared: %d\" % n)" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Mantel r: 0.930\n", "p-value: 1.0e-03\n", "Number of samples compared: 88" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r, p, n = mantel(lauber_soil_unweighted_unifrac_dm, lauber_soil_bray_curtis_dm, method='spearman', strict=False)\n", "print(\"Mantel r: %1.3f\" % r)\n", "print(\"p-value: %1.1e\" % p)\n", "print(\"Number of samples compared: %d\" % n)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Mantel r: 0.850\n", "p-value: 1.0e-03\n", "Number of samples compared: 88" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r, p, n = mantel(lauber_soil_weighted_unifrac_dm, lauber_soil_bray_curtis_dm, method='spearman', strict=False)\n", "print(\"Mantel r: %1.3f\" % r)\n", "print(\"p-value: %1.1e\" % p)\n", "print(\"Number of samples compared: %d\" % n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The way that we'd interpret these results is that, although the plots above look somewhat different from one another, the underlying data (the distances between samples) are highly correlated across the different diversity metrics. As a result, we'd conclude that with any of these three diversity metrics we'd come to the conclusion that samples that are more similar in pH are more similar in their microbial community composition." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could apply this same approach, for example, if we had clustered sequences into OTUs with two different approaches. 
If we used *de novo* OTU picking and open-reference OTU picking, say, we could compute UniFrac distance matrices based on each resulting BIOM table, and then compare those distance matrices with a Mantel test. This approach was applied in [Rideout *et al.* (2014)](https://peerj.com/articles/545/) to determine which OTU clustering methods would result in different biological conclusions being drawn from a data set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [3.1.7.1](#7.1) Procrustes analysis [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "A related approach, though one I think is less useful because it compares the PCoA coordinates directly (and therefore a summary of the distance data, rather than the distance data itself), is Procrustes analysis (you can read about the origin of the name [here](http://en.wikipedia.org/wiki/Procrustes)). Procrustes analysis takes two coordinate matrices as input and effectively tries to find the best superimposition of one on top of the other. The transformations that are applied are as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Translation (the mean of all points is set to 0 on each dimension)\n", "* Scaling (the points are scaled so that the root mean square distance of all points from the origin is 1)\n", "* Rotation (one set of points is chosen as the reference, and the other is rotated to minimize the sum of squared distances (SSD) between the corresponding points)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output is a pair of *transformed coordinate matrices*, and an $M^{2}$ statistic, which represents how dissimilar the coordinate matrices are to each other (so a small $M^{2}$ means that the coordinate matrices, and the plots, are more similar). [Procrustes analysis is implemented in scikit-bio](http://scikit-bio.org/generated/skbio.maths.stats.spatial.procrustes.html)."
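] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a minimal sketch of what this can look like in practice. For illustration I'm using SciPy's ``scipy.spatial.procrustes`` (rather than the scikit-bio function linked above) and comparing only the first two PCoA axes of the unweighted UniFrac and Bray-Curtis ordinations computed earlier; which axes to include and which implementation to use are choices you'd make for your own analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.spatial import procrustes\n", "\n", "# align the two coordinate matrices on their shared sample ids, keeping the first two axes\n", "shared_ids = lauber_soil_unweighted_unifrac_pc.samples.index.intersection(\n", "    lauber_soil_bray_curtis_pcoa.samples.index)\n", "coords1 = lauber_soil_unweighted_unifrac_pc.samples.loc[shared_ids, ['PC1', 'PC2']].values\n", "coords2 = lauber_soil_bray_curtis_pcoa.samples.loc[shared_ids, ['PC1', 'PC2']].values\n", "\n", "# procrustes returns the two transformed coordinate matrices and the M^2 value\n", "mtx1, mtx2, m_squared = procrustes(coords1, coords2)\n", "print('M^2: %1.3f' % m_squared)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [3.1.8](#8) Where to go from here [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "If you're interested in learning more about the topics presented in this chapter, I recommend [Measuring Biological Diversity](http://www.amazon.com/Measuring-Biological-Diversity-Anne-Magurran/dp/0632056339) by Anne E. Magurran, and the [QIIME 2 documentation](https://docs.qiime2.org). The [QIIME 2 software package](https://qiime2.org) is designed for performing the types of analyses described in this chapter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [3.1.9](#9) Acknowledgements [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Much of the content in this section is based on knowledge that I gained through years of working with the [QIIME](http://qiime.org/) and [QIIME 2](https://qiime2.org) user and developer communities. Thanks everyone, I'm looking forward to many more years of productive, fun and exciting work together!" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 4 }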