{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# [3.1]() Studying Microbial Diversity [edit]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Table of Contents**\n",
"0. [Getting started: the feature table](#1)\n",
"0. [Terminology](#2)\n",
"0. [Measuring alpha diversity](#3)\n",
" 0. [Observed species (or Observed OTUs)](#3.1)\n",
" 0. [A limitation of OTU counting](#3.1.1)\n",
" 0. [Phylogenetic Diversity (PD)](#3.2)\n",
" 0. [Even sampling](#3.3)\n",
"0. [Measuring beta diversity](#4)\n",
" 0. [Distance metrics](#4.1)\n",
" 0. [Bray-Curtis](#4.1.1)\n",
" 0. [Unweighted UniFrac](#4.1.2)\n",
" 0. [Even sampling](#4.1.3)\n",
" 0. [Interpreting distance matrices](#4.2)\n",
" 0. [Distribution plots and comparisons](#4.2.1)\n",
" 0. [Hierarchical clustering](#4.2.2)\n",
" 0. [Ordination](#4.3)\n",
" 0. [Polar ordination](#4.3.1)\n",
" 0. [Determining the most important axes in polar ordination](#4.3.2)\n",
" 0. [Interpreting ordination plots](#4.3.3)\n",
"0. [Tools for using ordination in practice: scikit-bio, pandas, and matplotlib](#5)\n",
"0. [PCoA versus PCA: what's the difference?](#6)\n",
"0. [Are two different analysis approaches giving me the same result?](#7)\n",
" 0. [Procrustes analysis](#7.1)\n",
"0. [Where to go from here](#8)\n",
"0. [Acknowledgements](#9)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Unicellular organisms, often referred to as “microbes”, represent the vast majority of the diversity of life on Earth. Microbes perform an amazing array of biological functions and rarely live or act alone, but rather exist in complex communities composed of many interacting species. We now know that the traditional approach for studying microbial communities (or microbiomes), which relied on microbial culture (in other words, being able to grow the microbes in the lab), is insufficient because we don’t know the conditions required for the growth of most microbes. Recent advances that have linked microbiomes to processes ranging from global (for example, the cycling of biologically essential nutrients such as carbon and nitrogen) to personal (for example, human disease, including obesity and cancer) have thus relied on “culture independent” techniques. Identification now relies on sequencing fragments of microbial genomes, and using those fragments as “molecular fingerprints” that allow researchers to profile which microbes are present in an environment. Currently, the bottleneck in microbiome analysis is not DNA sequencing, but rather interpreting the large quantities of DNA sequence data that are generated: often on the order of tens to hundreds of gigabytes. This chapter will integrate many of the topics we've covered in previous chapters to introduce how we study communities of microorganisms using their DNA sequences."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [3.1.1](#1) Getting started: the feature table[edit]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"From a bioinformatics perspective, studying biological diversity is centered around a few key pieces of information:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* A table of the frequencies of certain biological features (e.g., species or OTUs) on a per sample basis.\n",
"* *Sample metadata* describing exactly what each of the samples is, as well as any relevant technical information.\n",
"* *Feature metadata* describing each of the features. This can be taxonomic information, for example, but we'll come back to this when we discuss features in more detail (this will be completed as part of [#105](https://github.com/caporaso-lab/An-Introduction-To-Applied-Bioinformatics/issues/105)).\n",
"* Optionally, information on the relationships between the biological features, typically in the form of a phylogenetic tree where tips in the tree correspond to OTUs in the table."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"None of these are trivial to generate (defining OTUs was described in the [OTU clustering chapter](../algorithms/5-sequence-mapping-and-clustering.ipynb), building trees in the [Phylogenetic reconstruction chapter](../algorithms/3-phylogeny-reconstruction.ipynb), and there is a lot of active work on standardized ways to describe samples in the form of metadata, for example [Yilmaz et al (2011)](http://www.nature.com/nbt/journal/v29/n5/full/nbt.1823.html) and the [isa-tab](http://isa-tools.org/) project. For this discussion we're going to largely ignore the complexities of generating each of these, so we can focus on how we study diversity."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sample by feature frequency table is central to investigations of biological diversity. The Genomics Standards Consortium has recognized the [Biological Observation Matrix](http://www.biom-format.org) ([McDonald et al. (2011) *Gigascience*](http://www.gigasciencejournal.com/content/1/1/7)), or `biom-format` software and file format definition as a community standard for representing those tables. For now, we'll be using pandas to store these tables as the core ``biom.Table`` object is in the process of being ported to ``scikit-bio`` (to follow progress on this, see scikit-bio [issue #848](https://github.com/biocore/scikit-bio/issues/848)). Even though we're not currently using BIOM to represent these tables, we'll refer to these through-out this chapter as *BIOM tables* for consistency with other projects."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The basic data that goes into a BIOM table is the list of sample ids, the list of feature (e.g., OTU) ids, and the frequency matrix, which describes how many times each OTU was observed in each sample. We can build and display a BIOM table as follows:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Populating the interactive namespace from numpy and matplotlib\n",
" A B C\n",
"OTU1 1 0 0\n",
"OTU2 3 2 0\n",
"OTU3 0 0 6\n",
"OTU4 1 4 2\n",
"OTU5 0 4 1"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"sample_ids = ['A', 'B', 'C']\n",
"feature_ids = ['OTU1', 'OTU2', 'OTU3', 'OTU4', 'OTU5']\n",
"data = np.array([[1, 0, 0],\n",
" [3, 2, 0],\n",
" [0, 0, 6],\n",
" [1, 4, 2],\n",
" [0, 4, 1]])\n",
"\n",
"table1 = pd.DataFrame(data, index=feature_ids, columns=sample_ids)\n",
"table1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want the feature frequency vector for sample `A` from the above table, we use the pandas API to get that as follows:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"OTU1 1\n",
"OTU2 3\n",
"OTU3 0\n",
"OTU4 1\n",
"OTU5 0\n",
"Name: A, dtype: int64"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table1['A']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TODO**: Trees in Newick format; sample metadata in TSV format, and loaded into a pandas DataFrame."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we start looking at what we can do with this data once we have it, let's discuss some terminology."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [3.1.2](#2) Terminology[edit]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"There are literally hundreds of metrics of biological diversity. Here is some terminology that is useful for classifying these metrics."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Alpha versus beta diversity**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" * $\\alpha$ (i.e., within sample) diversity: Who is there? How many are there?\n",
" * $\\beta$ (i.e., between sample) diversity: How similar are pairs of samples?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Quantitative versus qualitative metrics**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" * qualitative metrics only account for whether an organism is present or absent\n",
" * quantitative metrics account for abundance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Phylogenetic versus non-phylogenetic metrics**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" * non-phylogenetic metrics treat all OTUs as being equally related\n",
" * phylogenetic metrics incorporate evolutionary relationships between the OTUs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the next sections we'll look at some metrics that cross these different categories. As new metrics are introduced, try to classify each of them into one class for each of the above three categories."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [3.1.3](#3) Measuring alpha diversity[edit]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Table of Contents**\n",
"0. [Observed species (or Observed OTUs)](#3.1)\n",
" 0. [A limitation of OTU counting](#3.1.1)\n",
"0. [Phylogenetic Diversity (PD)](#3.2)\n",
"0. [Even sampling](#3.3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"The first type of metric that we'll look at will be alpha diversity, and we'll specifically focus on *richness* here. Richness refers to how many different *types* of organisms are present in a sample: for example, if we're interested in species richness of plants in the Sonoran Desert and the Costa Rican rainforest, we could go to each, count the number of different species of plants that we observe, and have a basic measure of species richness in each environment."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An alternative type of alpha diversity measure would be *evenness*, and would tell us how even or uneven the distribution of species abundances are in a given environment. If, for example, the most abundant plant in the Sonoran desert was roughly as common as the least abundant plant (not the case!), we would say that the evenness of plant species was high. On the other hand, if the most abundant plant was thousands of times more common than the least common plant (probably closer to the truth), then we'd say that the evenness of plant species was low. We won't discuss evenness more here, but you can find coverage of this topic (as well as many of the others presented here) in [Measuring Biological Diversity](http://www.amazon.com/Measuring-Biological-Diversity-Anne-Magurran/dp/0632056339)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at two metrics of alpha diversity: observed species, and phylogenetic diversity."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### [3.1.3.1](#3.1) Observed species (or Observed OTUs)[edit]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Table of Contents**\n",
"0. [A limitation of OTU counting](#3.1.1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Observed species, or Observed OTUs as it's more accurately described, is about as simple of a metric as can be used to quantify alpha diversity. With this metric, we simply count the OTUs that are observed in a given sample. Note that this is a qualitative metric: we treat each OTU as being observed or not observed - we don't care how many times it was observed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's define a new table for this analysis:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" A B C\n",
"B1 1 1 5\n",
"B2 1 2 0\n",
"B3 3 1 0\n",
"B4 0 2 0\n",
"B5 0 0 0\n",
"A1 0 0 3\n",
"E2 0 0 1"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_ids = ['A', 'B', 'C']\n",
"feature_ids = ['B1','B2','B3','B4','B5','A1','E2']\n",
"data = np.array([[1, 1, 5],\n",
" [1, 2, 0],\n",
" [3, 1, 0],\n",
" [0, 2, 0],\n",
" [0, 0, 0],\n",
" [0, 0, 3],\n",
" [0, 0, 1]])\n",
"\n",
"table2 = pd.DataFrame(data, index=feature_ids, columns=sample_ids)\n",
"table2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our sample $A$ has an observed OTU frequency value of 3, sample $B$ has an observed OTU frequency of 4, and sample $C$ has an observed OTU frequency of 3. Note that this is different than the total counts for each column (which would be 5, 6, and 9 respectively). Based on the observed OTUs metric, we could consider samples $A$ and $C$ to have even OTU richness, and sample $B$ to have 33% higher OTU richness."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could compute this in python as follows:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def observed_otus(table, sample_id):\n",
" return sum([e > 0 for e in table[sample_id]])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(observed_otus(table2, 'A'))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(observed_otus(table2, 'B'))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(observed_otus(table2, 'C'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### [3.1.3.1.1](#3.1.1) A limitation of OTU counting[edit]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Imagine that we have the same table, but some additional information about the OTUs in the table. Specifically, we've computed the following phylogenetic tree. And, for the sake of illustration, imagine that we've also assigned taxonomy to each of the OTUs and found that our samples contain representatives from the archaea, bacteria, and eukaryotes (their labels begin with `A`, `B`, and `E`, respectively)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's define a phylogenetic tree using the Newick format (which is described [here](http://evolution.genetics.washington.edu/phylip/newicktree.html), and more formally defined [here](http://evolution.genetics.washington.edu/phylip/newick_doc.html)). We'll then load that up using [scikit-bio](http://scikit-bio.org)'s [TreeNode](http://scikit-bio.org/generated/skbio.core.tree.TreeNode.html#skbio.core.tree.TreeNode) object, and visualize it with [ete3](http://etetoolkit.org)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"import ete3\n",
"ts = ete3.TreeStyle()\n",
"ts.show_leaf_name = True\n",
"ts.scale = 250\n",
"ts.branch_vertical_margin = 15\n",
"ts.show_branch_length = True"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"from io import StringIO\n",
"newick_tree = StringIO('((B1:0.2,B2:0.3):0.3,((B3:0.5,B4:0.3):0.2,B5:0.9):0.3,'\n",
" '((A1:0.2,A2:0.3):0.3,'\n",
" ' (E1:0.3,E2:0.4):0.7):0.55);')\n",
"\n",
"from skbio.tree import TreeNode\n",
"\n",
"tree = TreeNode.read(newick_tree)\n",
"tree = tree.root_at_midpoint()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"t = ete3.Tree.from_skbio(tree, map_attributes=[\"value\"])\n",
"t.render(\"%%inline\", tree_style=ts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pairing this with the table we defined above (displayed again in the cell below), given what you now know about these OTUs, which would you consider the most diverse? Are you happy with the $\\alpha$ diversity conclusion that you obtained when computing the number of observed OTUs in each sample?"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" A B C\n",
"B1 1 1 5\n",
"B2 1 2 0\n",
"B3 3 1 0\n",
"B4 0 2 0\n",
"B5 0 0 0\n",
"A1 0 0 3\n",
"E2 0 0 1"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### [3.1.3.2](#3.2) Phylogenetic Diversity (PD)[edit]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Phylogenetic Diversity (PD) is a metric that was developed by Dan Faith in the early 1990s (find the original paper [here](http://www.sciencedirect.com/science/article/pii/0006320792912013)). Like many of the measures that are used in microbial community ecology, it wasn't initially designed for studying microbial communities, but rather communities of \"macro-organisms\" (macrobes?). Some of these metrics, including PD, do translate well to microbial community analysis, while some don't translate as well. (For an illustration of the effect of sequencing error on PD, where it is handled well, versus its effect on the Chao1 metric, where it is handled less well, see Figure 1 of [Reeder and Knight (2010)](http://www.nature.com/nmeth/journal/v7/n9/full/nmeth0910-668b.html))."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PD is relatively simple to calculate. It is computed simply as the sum of the branch length in a phylogenetic tree that is \"covered\" or represented in a given sample. Let's look at an example to see how this works."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I'll now define a couple of functions that we'll use to compute PD."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def get_observed_nodes(tree, table, sample_id, verbose=False):\n",
" observed_otus = [obs_id for obs_id in table.index\n",
" if table[sample_id][obs_id] > 0]\n",
" observed_nodes = set()\n",
" # iterate over the observed OTUs\n",
" for otu in observed_otus:\n",
" t = tree.find(otu)\n",
" observed_nodes.add(t)\n",
" if verbose:\n",
" print(t.name, t.length, end=' ')\n",
" for internal_node in t.ancestors():\n",
" if internal_node.length is None:\n",
" # we've hit the root\n",
" if verbose:\n",
" print('')\n",
" else:\n",
" if verbose and internal_node not in observed_nodes:\n",
" print(internal_node.length, end=' ')\n",
" observed_nodes.add(internal_node)\n",
" return observed_nodes\n",
"\n",
"def phylogenetic_diversity(tree, table, sample_id, verbose=False):\n",
" observed_nodes = get_observed_nodes(tree, table, sample_id, verbose=verbose)\n",
" result = sum(o.length for o in observed_nodes)\n",
" return result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And then apply those to compute the PD of our three samples. For each computation, we're also printing out the branch lengths of the branches that are observed *for the first time* when looking at a given OTU. When computing PD, we include the length of each branch only one time."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"B1 0.2 0.3 0.2250000000000001\n",
"B2 0.3\n",
"B3 0.5 0.2 0.3\n",
"2.0250000000000004"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_A = phylogenetic_diversity(tree, table2, 'A', verbose=True)\n",
"print(pd_A)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"B1 0.2 0.3 0.2250000000000001\n",
"B2 0.3\n",
"B3 0.5 0.2 0.3\n",
"B4 0.3\n",
"2.325"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_B = phylogenetic_diversity(tree, table2, 'B', verbose=True)\n",
"print(pd_B)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"B1 0.2 0.3 0.2250000000000001\n",
"A1 0.2 0.3 0.32499999999999996\n",
"E2 0.4 0.7\n",
"2.65"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_C = phylogenetic_diversity(tree, table2, 'C', verbose=True)\n",
"print(pd_C)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How does this result compare to what we observed above with the Observed OTUs metric? Based on your knowledge of biology, which do you think is a better representation of the relative diversities of these samples?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### [3.1.3.3](#3.3) Even sampling[edit]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Imagine again that we're going out to count plants in the Sonoran Desert and the Costa Rican rainforest. We're interested in getting an idea of the plant richness in each environment. In the Sonoran Desert, we survey a square kilometer area, and count 150 species of plants. In the rainforest, we survey a square meter, and count 15 species of plants. So, clearly the plant species richness in the Sonoran Desert is higher, right? What's wrong with this comparison?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The problem is that we've expended a lot more sampling effort in the desert than we did in the rainforest, so it shouldn't be surprising that we observed more species there. If we expended the same effort in the rainforest, we'd probably observe a lot more than 15 or 150 plant species, and we'd have a more sound comparison."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In sequencing-based studies of microorganism richness, the analog of sampling area is sequencing depth. If we collect 100 sequences from one sample, and 10,000 sequences from another sample, we can't directly compare the number of observed OTUs or the phylogenetic diversity of these because we expended a lot more sampling effort on the sample with 10,000 sequences than on the sample with 100 sequences. The way this is typically handled is by randomly subsampling sequences from the sample with more sequences until the sequencing depth is equal to that in the sample with fewer sequences. If we randomly select 100 sequences at random from the sample with 10,000 sequences, and compute the alpha diversity based on that random subsample, we'll have a better idea of the relative alpha diversities of the two samples."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" A B C\n",
"OTU1 50 4 0\n",
"OTU2 35 200 0\n",
"OTU3 100 2 1\n",
"OTU4 15 400 1\n",
"OTU5 0 40 1"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_ids = ['A', 'B', 'C']\n",
"feature_ids = ['OTU1', 'OTU2', 'OTU3', 'OTU4', 'OTU5']\n",
"data = np.array([[50, 4, 0],\n",
" [35, 200, 0],\n",
" [100, 2, 1],\n",
" [15, 400, 1],\n",
" [0, 40, 1]])\n",
"\n",
"bad_table = pd.DataFrame(data, index=feature_ids, columns=sample_ids)\n",
"bad_table"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(observed_otus(bad_table, 'A'))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(observed_otus(bad_table, 'B'))"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(observed_otus(bad_table, 'C'))"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"A 200\n",
"B 646\n",
"C 3\n",
"dtype: int64"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(bad_table.sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TODO**: Add alpha rarefaction discussion."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [3.1.4](#4) Measuring beta diversity[edit]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Table of Contents**\n",
"0. [Distance metrics](#4.1)\n",
" 0. [Bray-Curtis](#4.1.1)\n",
" 0. [Unweighted UniFrac](#4.1.2)\n",
" 0. [Even sampling](#4.1.3)\n",
"0. [Interpreting distance matrices](#4.2)\n",
" 0. [Distribution plots and comparisons](#4.2.1)\n",
" 0. [Hierarchical clustering](#4.2.2)\n",
"0. [Ordination](#4.3)\n",
" 0. [Polar ordination](#4.3.1)\n",
" 0. [Determining the most important axes in polar ordination](#4.3.2)\n",
" 0. [Interpreting ordination plots](#4.3.3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"$\\beta$-diversity (canonically pronounced *beta diversity*) refers to **between sample diversity**, and is typically used to answer questions of the form: is sample $A$ more similar in composition to sample $B$ or sample $C$? In this section we'll explore two (of tens or hundreds) of metrics for computing pairwise dissimilarity of samples to estimate $\\beta$ diversity."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### [3.1.4.1](#4.1) Distance metrics[edit]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Table of Contents**\n",
"0. [Bray-Curtis](#4.1.1)\n",
"0. [Unweighted UniFrac](#4.1.2)\n",
"0. [Even sampling](#4.1.3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### [3.1.4.1.1](#4.1.1) Bray-Curtis[edit]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"The first metric that we'll look at is a quantitative non-phylogenetic $\\beta$ diversity metric called Bray-Curtis. The Bray-Curtis dissimilarity between a pair of samples, $j$ and $k$, is defined as follows:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$BC_{jk} = \\frac{ \\sum_{i} | X_{ij} - X_{ik}|} {\\sum_{i} (X_{ij} + X_{ik})}$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$i$ : feature (e.g., OTUs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$X_{ij}$ : frequency of feature $i$ in sample $j$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$X_{ik}$ : frequency of feature $i$ in sample $k$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This could be implemented in python as follows:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"def bray_curtis_distance(table, sample1_id, sample2_id):\n",
" numerator = 0\n",
" denominator = 0\n",
" sample1_counts = table[sample1_id]\n",
" sample2_counts = table[sample2_id]\n",
" for sample1_count, sample2_count in zip(sample1_counts, sample2_counts):\n",
" numerator += abs(sample1_count - sample2_count)\n",
" denominator += sample1_count + sample2_count\n",
" return numerator / denominator"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" A B C\n",
"OTU1 1 0 0\n",
"OTU2 3 2 0\n",
"OTU3 0 0 6\n",
"OTU4 1 4 2\n",
"OTU5 0 4 1"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now apply this to some pairs of samples:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.6"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(bray_curtis_distance(table1, 'A', 'B'))"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.857142857143"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(bray_curtis_distance(table1, 'A', 'C'))"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.684210526316"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(bray_curtis_distance(table1, 'B', 'C'))"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(bray_curtis_distance(table1, 'A', 'A'))"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.684210526316"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(bray_curtis_distance(table1, 'C', 'B'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ultimately, we likely want to apply this to all pairs of samples to get a distance matrix containing all pairwise distances. Let's define a function for that, and then compute all pairwise Bray-Curtis distances between samples `A`, `B` and `C`."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"from skbio.stats.distance import DistanceMatrix\n",
"from numpy import zeros\n",
"\n",
"def table_to_distances(table, pairwise_distance_fn):\n",
" sample_ids = table.columns\n",
" num_samples = len(sample_ids)\n",
" data = zeros((num_samples, num_samples))\n",
" for i, sample1_id in enumerate(sample_ids):\n",
" for j, sample2_id in enumerate(sample_ids[:i]):\n",
" data[i,j] = data[j,i] = pairwise_distance_fn(table, sample1_id, sample2_id)\n",
" return DistanceMatrix(data, sample_ids)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3x3 distance matrix\n",
"IDs:\n",
"'A', 'B', 'C'\n",
"Data:\n",
"[[ 0. 0.6 0.85714286]\n",
" [ 0.6 0. 0.68421053]\n",
" [ 0.85714286 0.68421053 0. ]]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bc_dm = table_to_distances(table1, bray_curtis_distance)\n",
"print(bc_dm)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### [3.1.4.1.2](#4.1.2) Unweighted UniFrac[edit]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Just as phylogenetic alpha diversity metrics can be more informative than non-phylogenetic alpha diversity metrics, phylogenetic beta diversity metrics offer advantages over non-phylogenetic metrics such as Bray-Curtis. The most widely applied phylogenetic beta diversity metric as of this writing is unweighted UniFrac. UniFrac was initially presented in [Lozupone and Knight, 2005, Applied and Environmental Microbiology](http://aem.asm.org/content/71/12/8228.abstract), and has been widely applied in microbial ecology since (and the illustration of UniFrac computation presented below is derived from a similar example originally developed by Lozupone and Knight)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The unweighted UniFrac distance between a pair of samples `A` and `B` is defined as follows:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$U_{AB} = \\frac{unique}{observed}$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"where:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$unique$ : the unique branch length, or branch length that only leads to OTU(s) observed in sample $A$ or sample $B$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$observed$ : the total branch length observed in either sample $A$ or sample $B$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate how UniFrac distances are computed, before we get into actually computing them, let's look at a few examples. In these examples, imagine that we're determining the pairwise UniFrac distance between two samples: a red sample, and a blue sample. If a red box appears next to an OTU, that indicates that it's observed in the red sample; if a blue box appears next to the OTU, that indicates that it's observed in the blue sample; if a red and blue box appears next to the OTU, that indicates that the OTU is present in both samples; and if no box is presented next to the OTU, that indicates that it's present in neither sample."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To compute the UniFrac distance between a pair of samples, we need to know the sum of the branch length that was observed in either sample (the *observed* branch length), and the sum of the branch length that was observed only in a single sample (the *unique* branch length). In these examples, we color all of the *observed* branch length. Branch length that is unique to the red sample is red, branch length that is unique to the blue sample is blue, and branch length that is observed in both samples is purple. Unobserved branch length is black (as is the vertical branches, as those don't contribute to branch length - they are purely for visual presentation)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the tree on the right, all of the OTUs that are observed in either sample are observed in both samples. As a result, all of the observed branch length is purple. The unique branch length in this case is zero, so **we have a UniFrac distance of 0 between the red and blue samples**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On the other end of the spectrum, in the second tree, all of the OTUs in the tree are observed either in the red sample, or in the blue sample. All of the observed branch length in the tree is either red or blue, meaning that if you follow a branch out to the tips, you will observe only red or blue samples. In this case the unique branch length is equal to the observed branch length, so **we have a UniFrac distance of 1 between the red and blue samples**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, most of the time we're somewhere in the middle. In this tree, some of our branch length is unique, and some is not. For example, OTU 1 is only observed in our red sample, so the terminal branch leading to OTU 1 is red (i.e., unique to the red sample). OTU 2 is only observed in our blue sample, so the terminal branch leading to OTU 2 is blue (i.e., unique to the blue sample). However, the internal branch leading to the node connecting OTU 1 and OTU 2 leads to OTUs observed in both the red and blue samples (i.e., OTU 1 and OTU 2), so is purple (i.e, observed branch length, but not unique branch length). In this case, **we have an intermediate UniFrac distance between the red and blue samples, maybe somewhere around 0.5**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now compute the Unweighted UniFrac distances between some samples. Imagine we have the following tree, paired with our table below (printed below, for quick reference)."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" A B C\n",
"OTU1 1 0 0\n",
"OTU2 3 2 0\n",
"OTU3 0 0 6\n",
"OTU4 1 4 2\n",
"OTU5 0 4 1"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's compute the unweighted UniFrac distance between samples $A$ and $B$. The *unweighted* in *unweighted UniFrac* means that this is a qualitative diversity metric, meaning that we don't care about the abundances of the OTUs, only whether they are present in a given sample ($frequency > 0$) or not present ($frequency = 0$)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Start at the top right branch in the tree, and for each branch, determine if the branch is observed, and if so, if it is also unique. If it is observed then you add its length to your observed branch length. If it is observed and unique, then you also add its length to your unique branch length."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For samples $A$ and $B$, I get the following (in the tree on the right, red branches are those observed in $A$, blue branches are those observed in $B$, and purple are observed in both):"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$unique_{ab} = 0.5 + 0.75 = 1.25$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$observed_{ab} = 0.5 + 0.5 + 0.5 + 1.0 + 1.25 + 0.75 + 0.75 = 5.25$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$uu_{ab} = \\frac{unique_{ab}}{observed_{ab}} = \\frac{1.25}{5.25} = 0.238$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an exercise, now compute the UniFrac distances between samples $B$ and $C$, and samples $A$ and $C$, using the above table and tree. When I do this, I get the following distance matrix."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3x3 distance matrix\n",
"IDs:\n",
"'A', 'B', 'C'\n",
"Data:\n",
"[[ 0. 0.24 0.52]\n",
" [ 0.24 0. 0.35]\n",
" [ 0.52 0.35 0. ]]"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ids = ['A', 'B', 'C']\n",
"d = [[0.00, 0.24, 0.52],\n",
" [0.24, 0.00, 0.35],\n",
" [0.52, 0.35, 0.00]]\n",
"print(DistanceMatrix(d, ids))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" **TODO**: Interface change so this code can be used with ``table_to_distances``."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.23809523809523808\n",
"0.52\n",
"0.34782608695652173"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## This is untested!! I'm not certain that it's exactly right, just a quick test.\n",
"\n",
"newick_tree1 = StringIO('(((((OTU1:0.5,OTU2:0.5):0.5,OTU3:1.0):1.0),(OTU4:0.75,OTU5:0.75):1.25))root;')\n",
"tree1 = TreeNode.read(newick_tree1)\n",
"\n",
"def unweighted_unifrac(tree, table, sample_id1, sample_id2, verbose=False):\n",
" observed_nodes1 = get_observed_nodes(tree, table, sample_id1, verbose=verbose)\n",
" observed_nodes2 = get_observed_nodes(tree, table, sample_id2, verbose=verbose)\n",
" observed_branch_length = sum(o.length for o in observed_nodes1 | observed_nodes2)\n",
" shared_branch_length = sum(o.length for o in observed_nodes1 & observed_nodes2)\n",
" unique_branch_length = observed_branch_length - shared_branch_length\n",
" unweighted_unifrac = unique_branch_length / observed_branch_length\n",
" return unweighted_unifrac\n",
"\n",
"print(unweighted_unifrac(tree1, table1, 'A', 'B'))\n",
"print(unweighted_unifrac(tree1, table1, 'A', 'C'))\n",
"print(unweighted_unifrac(tree1, table1, 'B', 'C'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### [3.1.4.1.3](#4.1.3) Even sampling[edit]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**TODO**: Add discussion on necessity of even sampling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### [3.1.4.2](#4.2) Interpreting distance matrices[edit]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Table of Contents**\n",
"0. [Distribution plots and comparisons](#4.2.1)\n",
"0. [Hierarchical clustering](#4.2.2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"In the previous section we computed distance matrices that contained the pairwise distances between a few samples. You can look at those distance matrices and get a pretty good feeling for what the patterns are. For example, what are the most similar samples? What are the most dissimilar samples?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if instead of three samples though, we had more. Here's a screenshot from a distance matrix containing data on 105 samples (this is just the first few rows and columns):"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do you have a good feeling for the patterns here? What are the most similar samples? What are the most dissimilar samples?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Chances are, you can't just squint at that table and understand what's going on (but if you can, I'm hiring!). The problem is exacerbated by the fact that in modern microbial ecology studies we may have thousands or tens of thousands of samples, not \"just\" hundreds as in the table above. We need tools to help us take these raw distances and convert them into something that we can interpret. In this section we'll look at some techniques, one of which we've covered previously, that will help us interpret large distance matrices."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One excellent paper that includes a comparison of several different strategies for interpreting beta diversity results is [Costello *et al.* Science (2009) Bacterial Community Variation in Human Body Habitats Across Space and Time](https://www.sciencemag.org/content/326/5960/1694.full). In this study, the authors collected microbiome samples from 7 human subjects at about 25 sites on their bodies, at four different points in time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Figure 1 shows several different approaches for comparing the resulting UniFrac distance matrix (this image is linked from the *Science* journal website - copyright belongs to *Science*):"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's generate a small distance matrix representing just a few of these body sites, and figure out how we'd generate and interpret each of these visualizations. The values in the distance matrix below are a subset of the unweighted UniFrac distance matrix representing two samples each from three body sites from the Costello *et al.* (2009) study."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" body site individual\n",
"A gut subject 1\n",
"B gut subject 2\n",
"C tongue subject 1\n",
"D tongue subject 2\n",
"E skin subject 1\n",
"F skin subject 2"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_ids = ['A', 'B', 'C', 'D', 'E', 'F']\n",
"_columns = ['body site', 'individual']\n",
"_md = [['gut', 'subject 1'],\n",
" ['gut', 'subject 2'],\n",
" ['tongue', 'subject 1'],\n",
" ['tongue', 'subject 2'],\n",
" ['skin', 'subject 1'],\n",
" ['skin', 'subject 2']]\n",
"\n",
"human_microbiome_sample_md = pd.DataFrame(_md, index=sample_ids, columns=_columns)\n",
"human_microbiome_sample_md"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6x6 distance matrix\n",
"IDs:\n",
"'A', 'B', 'C', 'D', 'E', 'F'\n",
"Data:\n",
"[[ 0. 0.35 0.83 0.83 0.9 0.9 ]\n",
" [ 0.35 0. 0.86 0.85 0.92 0.91]\n",
" [ 0.83 0.86 0. 0.25 0.88 0.87]\n",
" [ 0.83 0.85 0.25 0. 0.88 0.88]\n",
" [ 0.9 0.92 0.88 0.88 0. 0.5 ]\n",
" [ 0.9 0.91 0.87 0.88 0.5 0. ]]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm_data = np.array([[0.00, 0.35, 0.83, 0.83, 0.90, 0.90],\n",
" [0.35, 0.00, 0.86, 0.85, 0.92, 0.91],\n",
" [0.83, 0.86, 0.00, 0.25, 0.88, 0.87],\n",
" [0.83, 0.85, 0.25, 0.00, 0.88, 0.88],\n",
" [0.90, 0.92, 0.88, 0.88, 0.00, 0.50],\n",
" [0.90, 0.91, 0.87, 0.88, 0.50, 0.00]])\n",
"\n",
"human_microbiome_dm = DistanceMatrix(dm_data, sample_ids)\n",
"print(human_microbiome_dm)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### [3.1.4.2.1](#4.2.1) Distribution plots and comparisons[edit]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"First, let's look at the analysis presented in panels E and F. Instead of generating bar plots here, we'll generate box plots as these are more informative (i.e., they provide a more detailed summary of the distribution being investigated). One important thing to notice here is the central role that the sample metadata plays in the visualization. If we just had our sample ids (i.e., letters ``A`` through ``F``) we wouldn't be able to group distances into *within* and *between* sample type categories, and we therefore couldn't perform the comparisons we're interested in."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"def within_between_category_distributions(dm, md, md_category):\n",
" within_category_distances = []\n",
" between_category_distances = []\n",
" for i, sample_id1 in enumerate(dm.ids):\n",
" sample_md1 = md[md_category][sample_id1]\n",
" for sample_id2 in dm.ids[:i]:\n",
" sample_md2 = md[md_category][sample_id2]\n",
" if sample_md1 == sample_md2:\n",
" within_category_distances.append(dm[sample_id1, sample_id2])\n",
" else:\n",
" between_category_distances.append(dm[sample_id1, sample_id2])\n",
" return within_category_distances, between_category_distances"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.34999999999999998, 0.25, 0.5]\n",
"[0.82999999999999996, 0.85999999999999999, 0.82999999999999996, 0.84999999999999998, 0.90000000000000002, 0.92000000000000004, 0.88, 0.88, 0.90000000000000002, 0.91000000000000003, 0.87, 0.88]"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"within_category_distances, between_category_distances = within_between_category_distributions(human_microbiome_dm, human_microbiome_sample_md, \"body site\")\n",
"print(within_category_distances)\n",
"print(between_category_distances)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"