{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"header\"></a>\n",
    "**License**: BSD<br>\n",
    "**Copyright**: Copyright American Gut Project, 2015<br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "\tdiv.cell{\n",
       "\t\tmax-width: 1000px;\n",
       "\t\tmargin-left: auto;\n",
       "\t\tmargin-right: auto;\n",
       "\t}\n",
       "\n",
       "\tdiv.text_cell_render{\n",
       "\t\tline-height : 200%; /*Increases the spacing between the text lines*/\n",
       "\t\twidth : 750 px; /*Sets the width to approximately 700 pixels... about 80 characters */\n",
       "\t\tfont-family: \"Charis SIL\", serif; /*Default font: Charis SIL;*/\n",
       "\t\tfont-size: 11.5pt;\n",
       "\t}\n",
       "\n",
       "\tdiv.text_cell_render h1{\n",
       "\t\tfont-size: 24pt;\n",
       "\t\tfont-family: \"Helvetica Neue\", \"Helvetica\", \"Arial\", sans-serif;\n",
       "\t\tfont-weight: bold;\n",
       "\t}\n",
       "\n",
       "\tdiv.text_cell_render h2{\n",
       "\t\tfont-size: 18pt;\n",
       "\t\tfont-family: \"Helvetica Neue\", \"Helvetica\", \"Arial\", sans-serif;\n",
       "\t\tfont-style: italic;\n",
       "\t}\n",
       "\n",
       "\tdiv.text_cell_render h3{\n",
       "\t\tfont-size: 16pt;\n",
       "\t\tfont-family: \"Helvetica Neue\", \"Helvetica\", \"Arial\", sans-serif;\n",
       "\t\tcolor: #808080;\n",
       "\t\ttext-decoration: underline;\t\n",
       "\t}\n",
       "\n",
       "\tdiv.text_cell_render h4{\n",
       "\t\tfont-size: 15pt;\n",
       "\t\tfont-family: \"Helvetica Neue\", \"Helvetica\", \"Arial\", sans-serif;\n",
       "\t\tcolor: #808080;\n",
       "\t\tfont-style: italic;\t\n",
       "\t}\n",
       "\n",
       "\t#table{\n",
       "\t\tborder: 0px;\n",
       "\t}\n",
       "\n",
       "\t.CodeMirror{\n",
       "    \tfont-family: Consolas, monospace;\n",
       "    }\n",
       "    \n",
       "    .render_html ol {list-style: decimal; margin: 1em 2em;}\n",
       "\n",
       "</style>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# This cell allows us to render the notebook in the way we wish no matter where the notebook is rendered.\n",
    "from IPython.core.display import HTML\n",
    "css_file = 'ag.css'\n",
    "HTML(open(css_file, \"r\").read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"top\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#Table of Contents\n",
    "* [Preprocessing and File Generation](#Preprocessing-and-File-Generation)\n",
    "\t* [Data Sets Generated by this Notebook](#Data-Sets-Generated-by-this-Notebook)\n",
    "\t* [Files and File Types](#Files-and-File-Types)\n",
    "\t\t* [Metadata](#Metadata)\n",
    "\t\t* [OTU Tables](#OTU-Tables)\n",
    "\t\t* [Distance Matrix](#Distance-Matrix)\n",
    "* [Notebook Requirements](#Notebook-Requirements)\n",
    "* [Function Imports](#Function-Imports)\n",
    "* [Analysis Parameters](#Analysis-Parameters)\n",
    "\t* [File Saving Parameters](#File-Saving-Parameters)\n",
    "\t* [Metadata and text file handling parameters](#Metadata-and-text-file-handling-parameters)\n",
    "\t* [Rarefaction parameters](#Rarefaction-parameters)\n",
    "\t* [Split Parameters](#Split-Parameters)\n",
    "\t* [Alpha Diversity Parameters](#Alpha-Diversity-Parameters)\n",
    "\t* [Beta Diversity Parameters](#Beta-Diversity-Parameters)\n",
    "\t* [Data Set Selection](#Data-Set-Selection)\n",
    "* [File paths and Directories](#File-paths-and-Directories)\n",
    "\t* [Base Directory](#Base-Directory)\n",
    "\t* [Reference Directories and Files](#Reference-Directories-and-Files)\n",
    "\t* [Working Directories and Files](#Working-Directories-and-Files)\n",
    "\t\t* [Download Directories and Files](#Download-Directories-and-Files)\n",
    "\t\t* [Rarefaction Directory and Files](#Rarefaction-Directory-and-Files)\n",
    "\t\t* [Alpha Diversity Directories and Files](#Alpha-Diversity-Directories-and-Files)\n",
    "\t\t* [Split Directories and Files](#Split-Directories-and-Files)\n",
    "\t* [Output Directories and Files](#Output-Directories-and-Files)\n",
    "\t\t* [Data Directory](#Data-Directory)\n",
    "\t\t* [Body Site Directories](#Body-Site-Directories)\n",
    "\t\t* [Data Set Directories](#Data-Set-Directories)\n",
    "\t\t* [Data File Names](#Data-File-Names)\n",
    "\t\t* [File Exensions](#File-Exensions)\n",
    "\t\t* [File Blanks](#File-Blanks)\n",
    "* [Data Download](#Data-Download)\n",
    "* [Mapping File Clean up](#Mapping-File-Clean-up)\n",
    "\t* [Age](#Age)\n",
    "\t* [Alcohol Consumption](#Alcohol-Consumption)\n",
    "\t* [Body Mass Index](#Body-Mass-Index)\n",
    "\t* [Collection Season](#Collection-Season)\n",
    "\t* [Collection Location](#Collection-Location)\n",
    "\t* [Sleep Duration](#Sleep-Duration)\n",
    "* [Identification of a Healthy Subset of Adults](#Identification-of-a-Healthy-Subset-of-Adults)\n",
    "* [Whole Table Rarefaction](#Whole-Table-Rarefaction)\n",
    "* [Whole Table Alpha Diversity](#Whole-Table-Alpha-Diversity)\n",
    "* [Whole Table Beta Diversity](#Whole-Table-Beta-Diversity)\n",
    "* [Split the table by bodysite](#Split-the-table-by-bodysite)\n",
    "* [Select a Single Sample for Each Participant](#Select-a-Single-Sample-for-Each-Participant)\n",
    "* [Filter the Table to the Healthy Subset](#Filter-the-Table-to-the-Healthy-Subset)\n",
    "* [References](#References)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preprocessing and File Generation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal of this notebook is to take the OTU table and mapping files generated by the American Gut  Primary Processing Pipeline Notebook and manipulate them to produce a uniform set of rarefied and filtered tables which can be used in downstream analyses, such as our Power Notebook, or using  <a href=\"http://qiime.org\">QIIME</a>, <a href=\"http://biocore.github.io/emperor/\">EMPeror</a>, or <a href=\"http://picrust.github.io/picrust/\">PICRUSt</a> [<a href=\"#20383131\">1 - 3</a>]. This processing is centralized since it can be computationally expensive, given the size of the tables involved, and because it removes some error associated with the random number generation used in some steps.\n",
    "\n",
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Sets Generated by this Notebook"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook will generate American Gut datasets for three body sites. This process will include a dataset focused on all the samples collected for that location, and a set of samples which represent a single sample from each donor at that body site. These are calculated on a per body site basis, so an individual who donated two fecal samples and an oral sample will have one sample in the single fecal sample set, and one sample in the single oral set.\n",
    "\n",
    "Additionally, we decided to create a healthy subset for fecal samples. The definition is provided [below](#Identification-of-a-Healthy-Subset-of-Adults). \n",
    "\n",
    "The final directory structure will follow the following organization. Parent Directories are bolded.\n",
    "\n",
    "* **sample_data/**\n",
    "    * all/\n",
    "    * **fecal/**\n",
    "        * all_participants_all_samples/\n",
    "        * all_participants_one_sample/\n",
    "        * sub_participants_all_samples/\n",
    "        * sub_participants_one_sample/\n",
    "    * **oral/**\n",
    "        * all_participants_all_samples/\n",
    "        * all_participants_one_sample/\n",
    "    * **skin/**\n",
    "        * all_participants_all_samples/\n",
    "        * all_participants_one_sample/\n",
    "        \n",
    "You can choose to download the complete directory (rather than running this notebook) by setting the [**`data_download`**](#File-Saving-Parameters) variable.\n",
    "\n",
    "\n",
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Files and File Types"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our folders will contain three types of files: two metadata files, two Biom-format OTU table files and two distance matrices:\n",
    "* AGP_100nt_fecal.txt<sup>*</sup> (metadata)\n",
    "* AGP_100nt_fecal.biom<sup>*</sup> (otu table)\n",
    "* AGP_100nt_fecal_even10k.txt<sup>*</sup> (metadata)\n",
    "* AGP_100nt_fecal_even10k.biom<sup>*</sup> (otu table)\n",
    "* unweighted_unifrac_AGP_100nt_fecal_even10k.txt<sup>*</sup> (distance matrix)\n",
    "* weighted_unifrac_AGP_100nt_fecal_even10k.txt<sup>*</sup> (distance matrix)\n",
    "\n",
    "<sup>*</sup>_oral and _skin may be substituted for _fecal in file names.\n",
    "\n",
    "In addition, a text file called either single_samples.txt or subset_samples.txt may be generated. This is the list of samples in the data set. The file is primarily auxiliary, and is used by this notebook to filter Biom and distance matrix files into single sample data sets.\n",
    "\n",
    "Let’s talk a bit more about what information is contained in each file type."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A **metadata file**, sometimes called a **mapping file**, provides information about our samples which cannot be determined from 16S marker gene sequences alone. Metadata is as important as the bacterial sequence information for determining information about the microbiome. Metadata tells us information about the sample, such as the body site where it was collected, the participant’s age or whether or not they have certain diseases. Since these can make a very large difference in the microbiome, it’s important to have this information!\n",
    "\n",
    "American Gut metadata is collected through the participant survey. The survey allows participants to skip any question they do not wish to answer, meaning that some samples are missing fields. The python library used to handle metadata in these notebooks will specially encode these fields. Within IPython analysis notebooks, missing data will be represented as a Numpy `NaN`. Printed notebooks are set return empty spaces for missing values, although this can be changed by altering the [write_na](#Metadata-and-text-file-handling-parameters) variable. For certain QIIME scripts, leaving these fields blank will allow the script to ignore samples missing metadata.\n",
    "\n",
    "The American Gut metadata is also de-identified. This means the metadata does not contain information which could be used to identify a participant, like their name, email address, or the kit ID. Instead, each participant is assigned a code. This allows us to identify multiple samples from the same individual. The samples are identified by the barcode number, which appears on the sample tube. This number connects the survey metadata to the sample data in the OTU table and distance matrix.\n",
    "\n",
    "This notebook will add [Alpha Diversity](#Whole-Table-Alpha-Diversity) to the mapping file for rarefied data. Files which contain the alpha diversity will always have a rarefaction depth notation in the file name. The convention here is to include the word even, and then the rarefaction depth in the file name (i.e. `AGP_100nt_even10k_fecal.txt`)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### OTU Tables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An **OTU table** describes the bacterial content in a group of samples. An OTU, or operational taxonomic unit, is a cluster of sequences which are identical to some degree of similarity. We use these as stand-ins for bacterial groups. More information about OTUs, the level of similarity used, and the methods we use to generate OTUs can be found in the [Primary Processing Pipeline Notebook](http://nbviewer.ipython.org/github/biocore/American-Gut/blob/master/ipynb/module2_v1.0.ipynb). \n",
    "The OTU table allows us to link the sequencing results from our 16S data to the sample ID in a usable way, and gives an easier platform for comparison across samples. \n",
    "\n",
    "Our OTU tables are saved as a special file format, the [Biom format](http://www.biom-format.org) [<a href=\"#23587224\">4</a>]. Unlike the other file types we will generate here, Biom files cannot be viewed using a normal spreadsheet program on your computer. The benefit of Biom format is that it allows us to save large amounts of data in a smaller amount of space. Biom data may be encoded as a JSON string, or using the same HDF5 compression NASA does. Because of the size of the American Gut data, we recommend the HDF5 format."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Distance Matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, Distance Matrices describe the relationship between any two samples based on their community structure, measuring an ecological concept called [Beta Diversity](#Whole-Table-Beta-Diversity). In this notebook, we will measure the UniFrac Distance between all of our samples. This takes into account the evolutionary relationship between bacteria when communities are compared [<a href=\"#16332807\">5</a>, <a href=\"#20827291\">6</a>]. \n",
    "Each cell in the distance matrix tells the distance between the sample given by the row and the sample given by the column. Distance matrices are symmetrical, which means that we can draw a line across the diagonal of the distance matrix, and the distances on either side of this line will be equal. The distances along that line should come from the same sample, and will have a distance of 0. We can use our distance matrix information directly, or use multidimensional scaling techniques like <a href=\"http://occamstypewriter.org/boboh/2012/01/17/pca_and_pcoa_explained/\">Principal Coordinates Analysis (PCoA)</a> or make a <a href=\"http://en.wikipedia.org/wiki/UPGMA\">UPGMA tree</a>."
   ]
  },
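  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two properties described above, symmetry and a zero diagonal, can be checked on a small example. The cell below is a minimal sketch and is not part of the American Gut pipeline: the counts are made up, the variable names are hypothetical, and plain Euclidean distance between relative abundances stands in for UniFrac, which would also require a phylogenetic tree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Hypothetical counts: 3 samples (rows) by 4 OTUs (columns).\n",
    "toy_counts = np.array([[10, 0, 5, 5],\n",
    "                       [8, 2, 5, 5],\n",
    "                       [0, 10, 0, 10]], dtype=float)\n",
    "\n",
    "# Converts the counts to relative abundances within each sample.\n",
    "toy_rel = toy_counts / toy_counts.sum(axis=1, keepdims=True)\n",
    "\n",
    "# Pairwise Euclidean distance between samples. This is an assumption made\n",
    "# for illustration; the notebook itself uses UniFrac distance.\n",
    "toy_diff = toy_rel[:, np.newaxis, :] - toy_rel[np.newaxis, :, :]\n",
    "toy_dm = np.sqrt((toy_diff ** 2).sum(axis=2))\n",
    "\n",
    "# A distance matrix equals its transpose and has zeros on the diagonal.\n",
    "is_symmetric = np.allclose(toy_dm, toy_dm.T)\n",
    "zero_diagonal = np.allclose(np.diag(toy_dm), 0)\n",
    "(is_symmetric, zero_diagonal)"
   ]
  },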
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table style=\"width:90%; border-collapse: collapse; padding: 15px\">\n",
    "    <thead style=\"border-top: 2px solid black; \n",
    "                   border-bottom: 1px solid black; \n",
    "                   border-left: hidden; \n",
    "                   border-right: hidden; \n",
    "                   text-align: center;\n",
    "                   line-height:120%\">\n",
    "    <tr style=\"vertical-align: center\">\n",
    "        <td style=\"border-left: none; border-right: none; padding: 15px; width: 15%; text-align: center\"></td>\n",
    "        <th style=\"border-left: none; border-right: none; padding: 15px; width: 25%; text-align: center\">Metadata</th>\n",
    "        <th style=\"border-left: none; border-right: none; padding: 15px; width: 25%; text-align: center\">OTU Table</th>\n",
    "        <th style=\"border-left: none; border-right: none; padding: 15px; width: 25%; text-align: center\">Distance Matrix</th>\n",
    "    </tr>\n",
    "    </thead>\n",
    "    <tr style=\"border-top: 1px solid black; \n",
    "               border-bottom:hidden; \n",
    "               border-left:hidden; \n",
    "               border-right:hidden;\n",
    "               vertical-align:top;\n",
    "               line-height:120%\">\n",
    "        <th style=\"border-left:none; \n",
    "                   border-right:none; \n",
    "                   vertical-align:top; \n",
    "                   padding:15px\">\n",
    "            File Type\n",
    "        </th>\n",
    "        <td style=\"border-left:none;\n",
    "                   border-right:none;\n",
    "                   padding:15px\">\n",
    "            tab-delimted text file<br>(ends in .txt)\n",
    "        </td>\n",
    "        <td style=\"border-left:none;\n",
    "                   border-right:none;\n",
    "                   padding:15px\">\n",
    "            <a href=\"http://www.biom-format.org\">Biom-format file</a>\n",
    "            [<a href=\"#23587224\">4</a>]<br>(ends in .biom)\n",
    "        </td>\n",
    "        <td style=\"border-left:none;\n",
    "                   border-right:none; \n",
    "                   padding:15px\">\n",
    "            tab-delimted text file<br>(ends in .txt)\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr style=\"vertical-align:top; \n",
    "               border-style:hidden;\n",
    "               line-height:120%\">\n",
    "        <th style=\"border-style:hidden;\n",
    "                   padding:15px\">\n",
    "            Other viewing option\n",
    "        </th>\n",
    "        <td style=\"border-style:hidden;\n",
    "                   padding:15px\">\n",
    "            text editor (i.e. Emacs, TextEdit)<br>Spreadsheet program (i.e. Excel)\n",
    "        </td>\n",
    "        <td>\n",
    "            Using the Biom-format python package<br> \n",
    "            Using the \n",
    "            <a href=\"http://cran.r-project.org/web/packages/biom/vignettes/biom-demo.html\"\n",
    "            >R BIOM package</a>\n",
    "        </td>\n",
    "        <td style=\"border-style:hidden; \n",
    "                   padding:15px\">\n",
    "            text editor (i.e. Emacs, TextEdit)<br>\n",
    "            Spreadsheet program (i.e. Excel)\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr style=\"line-height:120%\">\n",
    "        <th style=\"border-style:hidden; padding:15px\">\n",
    "            Rows\n",
    "        </th>\n",
    "        <td style=\"border-style:hidden; padding:15px\">\n",
    "            Samples are in rows.\n",
    "        </td>\n",
    "        <td style=\"border-style:hidden; padding:15px\">\n",
    "            OTU sequence clusters are the rows in the table.\n",
    "        </td>\n",
    "        <td style=\"border-style:hidden; padding:15px\">\n",
    "            Rows are samples.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr style=\"border-top:hidden; \n",
    "               border-bottom:2px solid black;\n",
    "               border-right:hidden;\n",
    "               border-left:hidden;\n",
    "               line-height:120%\">\n",
    "        <th style=\"border-left:none; border-right:none; padding:15px\">\n",
    "            Columns\n",
    "        </th>\n",
    "        <td style=\"border-left:none; border-right:none; padding:15px\">\n",
    "            Columns are metadata categories like AGE, BMI or disease status.\n",
    "        </td>\n",
    "        <td style=\"border-left:none; border-right:none; padding:15px\">\n",
    "            Columns are samples\n",
    "        </td>\n",
    "        <td style=\"border-left:none; border-right:none; padding:15px\">\n",
    "            Columns are samples.\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>\n",
    "\n",
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Notebook Requirements"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Python 2.7](https://www.python.org/download/releases/2.7/)\n",
    "* [QIIME 1.9](http://www.qiime.org)\n",
    "* [h5py](http://www.h5py.org) and [hdf5](http://www.hdfgroup.org/HDF5/). These are required to read the American Gut tables.\n",
    "* [Jinja2](http://jinja.pocoo.org/docs/dev/), [pyzmq](https://learning-0mq-with-pyzmq.readthedocs.org/en/latest/),  [tornado](http://www.tornadoweb.org/en/stable/) and [jsonschema](http://json-schema.org) <br/>These are required to open a local IPython notebook instance. They are not installed automatically when you install IPython as a dependency for QIIME.\n",
    "* [IPython 3.0](http://ipython.org)\n",
    "* [Statsmodels 0.6.0](http://statsmodels.sourceforge.net)\n",
    "* [American Gut Python Library](https://github.com/biocore/American-Gut)\n",
    "\n",
    "$\\LaTeX$ is also recommended for running this suite of analysis notebooks, although it is not required for this notebook. [LiveTex](http://www.tug.org/texlive/) offers one installation solution.\n",
    "\n",
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Function Imports"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will start our analysis by importing the functions we need from python libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import shutil\n",
    "import copy\n",
    "import datetime\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import skbio\n",
    "import biom\n",
    "\n",
    "import americangut.diversity_analysis as div\n",
    "import americangut.geography_lib as geo\n",
    "\n",
    "from qiime_default_reference import get_reference_tree"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Analysis Parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is also important to define certain aspects of how we'll handle files and do our analysis. It can be easier to set all these at the same time, so the systems are consistent every time we repeat the process, rather than repeat them multiple places. This way, we only have to change the parameter once."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## File Saving Parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the course of this analysis, a series of files can be generated. The File Saving Parameters determine whether new files are saved.\n",
    "<table style=\"width:90%;\n",
    "              border-style:hidden;\n",
    "              borders-collapse:collapse;\n",
    "              line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>download_data</strong><br />(boolean)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            This will download a directory of precomputed data tables, when <code style=\"color:ForestGreen\">True</code>. <strong><code>download_data</code></strong> will supersede <strong><code>overwrite</code></strong>. (That is, if <strong><code>download_data</code></strong> is <code style=\"color:ForestGreen\">True</code>, <strong><code>overwrite</code></strong> must be <code style=\"color:ForestGreen\">False</code>.)\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>overwrite</strong><br />(boolean)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            <p>When <strong><code>overwrite</code></strong> is \n",
    "            <code style=\"color:ForestGreen\">True</code>, new files will be \n",
    "            generated and saved during data processing. It is recommended \n",
    "            that overwrite be set to \n",
    "            <code style=\"color:ForestGreen\">False</code>, \n",
    "            in which case new files will \n",
    "            only be generated when the file does not exist. This \n",
    "            substantially decreases analysis time.</p>\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data_download = False\n",
    "overwrite = False\n",
    "\n",
    "# Checks the data download-overwrite relationship is valid\n",
    "if data_download:\n",
    "    overwrite = True"
   ]
  },
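  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way a flag like **`overwrite`** can gate file generation is sketched below. `should_generate` is a hypothetical helper written for illustration, not a function from the American Gut library, and the temporary file stands in for a real output table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import shutil\n",
    "import tempfile\n",
    "\n",
    "\n",
    "def should_generate(filepath, overwrite=False):\n",
    "    # Hypothetical helper: a file is (re)generated when overwriting is\n",
    "    # requested, or when the file does not exist yet.\n",
    "    return overwrite or not os.path.exists(filepath)\n",
    "\n",
    "\n",
    "# A temporary file stands in for a real output table.\n",
    "demo_dir = tempfile.mkdtemp()\n",
    "demo_fp = os.path.join(demo_dir, 'demo_table.txt')\n",
    "\n",
    "# The file is missing, so it would be generated.\n",
    "missing_check = should_generate(demo_fp)\n",
    "\n",
    "with open(demo_fp, 'w') as demo_file:\n",
    "    demo_file.write('placeholder\\n')\n",
    "\n",
    "# The file now exists, so it is only regenerated when overwrite is True.\n",
    "existing_check = should_generate(demo_fp)\n",
    "forced_check = should_generate(demo_fp, overwrite=True)\n",
    "\n",
    "# Removes the temporary directory.\n",
    "shutil.rmtree(demo_dir)\n",
    "(missing_check, existing_check, forced_check)"
   ]
  },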
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Metadata and text file handling parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "  We'll start by defining how we'll handle certain files, especially metadata files. We will use the [Pandas](http://pandas.pydata.org) library to handle most of our text files. This library provides some spreadsheet like functionality.\n",
    "\n",
    "<table style=\"width: 90%; border-style: hidden; borders-collapse: collapse; line-height:120%\">\n",
    "    <tr style=\"\">\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>overwrite</strong><br />(boolean)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden\n",
    "                   padding:10px\">\n",
    "           When <strong><code>overwrite</code></strong> is <code><font color=\"#228B22\">True</font></code>, new files will be generated and saved during data processing. <br>It is recommended that overwrite be set to <code><font color=\"#228B22\">False</font></code>, in which case new files will only be generated when the file does not exist. This substantially decreases analysis time.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>txt_delim</strong><br />(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            <strong><code>txt_delim</code></strong> specifies the way columns are separated in the files. QIIME typically consumes and produces tab-delimited (<code><font color=\"FireBrick\">\"\\t\"</font></code>) text files (.txt) for metadata and results generation.\n",
    "        </td>\n",
    "    </tr>\n",
    "    \n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "        <strong>map_index</strong><br />(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "        The name of the column containing the sample names. In QIIME, this column is called <code><font color=\"FireBrick\">#SampleID</font></code>.\n",
    "        </td>\n",
    "    <tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "        <strong>map_nas</strong><br />(list of strings)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "        It is possible a mapping file may be missing values, since American Gut participants are free to skip any question. The pandas package is able to omit these missing samples from analysis. In raw American Gut files, missing values are typically denoted as <code><font color=\"FireBrick\">“NA”</font></code>, <code><font color=\"FireBrick\">“no_data”</font></code>, <code><font color=\"FireBrick\">“unknown”</font></code>, and empty spaces (<code><font color=\"FireBrick\">“”</font></code>).\n",
    "        </td>\n",
    "    </tr>\n",
    "        <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "                   <strong>write_na</strong><br />(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "        The value to denote missing values when text files are written from Pandas data frames. Using an empty space, (<code><font color=\"FireBrick\">“”</font></code>) will allow certain QIIME scripts, like <a href=\"http://qiime.org/scripts/group_significance.html\">group_significance.py</a>, to ignore the missing values.\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "txt_delim = '\\t'\n",
    "map_index = '#SampleID'\n",
    "map_nas = ['NA', 'no_data', 'unknown', '']\n",
    "write_na = ''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Rarefaction parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We rarefy our data to set an even depth and allow head-to-head comparison of alpha and beta diversity. Rarefaction begins by removing samples from the table which do not have the minimum number of counts. Sequences are then drawn randomly out of a weighted pool until we reach the rarefaction depth.\n",
    "\n",
    "<table style=\"width: 90%; border-style: hidden; borders-collapse: collapse; line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>rarefaction_depth</strong><br>(int)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The <strong><code>rarefaction_depth</code></strong> specifies the number of sequence per samples to be used for analysis. A depth of 10,000 sequences/sample was selected  because it balances a better picture of diversity with retaining samples. \n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>num_rarefactions</strong><br>(int)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The number of times we draw new rarefaction tables. This controls for bias due to single rarefaction instances. We selected 10 rarefactions to achieve a balance between computational efficiency and appropriate depth.\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "rarefaction_depth = 10000\n",
    "num_rarefactions = 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Split Parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will split our OTU tables by body site, since the collection site on the human body plays a large role in community compositions in healthy adults [<a href=\"#22699609\">7</a>].\n",
    "\n",
    "<table style=\"width: 90%; border-style: hidden; borders-collapse: collapse; line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>split_field</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The metadata category which contains the body site information we will use to split the OTU table and mapping file.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>split_prefix</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "        Under the standards used to format the American Gut metadata, a constant prefix is used to denote body site. This is used for string formatting and file naming.\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "split_field = 'BODY_HABITAT'\n",
    "split_prefix = 'UBERON:'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Alpha Diversity Parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alpha diversity looks at the variety of species within a sample. In this notebook, we calculate alpha diversity using a variety of metrics, and append these to the mapping file. A more complete discussion of alpha diversity can be found [below](#Whole-Table-Alpha-Diversity).\n",
    "\n",
    "<table style=\"width: 90%; border-style: hidden; borders-collapse: collapse; line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>alpha_metrics</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            There are multiple alpha diversity metrics which can be used.\n",
    "            We will calculate four alpha diversity metrics here: PD Whole Tree \n",
    "            [<a href=\"#15831718\">8</a>], Observed Species, Chao1 \n",
    "            [<a href=\"#chao\">9</a>], and Shannon [<a href=\"#shannon\">10</a>] \n",
    "            diversity. Among these metrics, PD whole tree diversity is unique in\n",
    "            that it considers the evolutionary relationship between OTUs in a \n",
    "            sample by calculating the branch length on the phylogenetic tree \n",
    "            covered by a sample. A list of available metrics can be found on the \n",
    "            <a href=\"http://scikit-bio.org/docs/latest/generated/skbio.diversity.alpha.html#module-skbio.diversity.alpha\">scikit-bio website</a>. Metric names should be connected \n",
    "            with a comma (and no spaces).\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "alpha_metrics = 'PD_whole_tree,observed_species,chao1,shannon'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>\n",
    "<a id=\"params_beta\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Beta Diversity Parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Beta Diversity compares the ecology community across multiple sites. A further discussion of beta diversity can be found [below](#Whole-Table-Beta-Diversity).  We will use QIIME to calculate beta diversity.\n",
    "\n",
    "<table style=\"width: 90%; border-style: hidden; borders-collapse: collapse; line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>beta_metrics</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            As with alpha diversity, there are multiple ways we can \n",
    "            calculate our beta diversity metrics. Here, we use weighted and \n",
    "            unweighted UniFrac distances [<a href=\"#16332807\">5</a>]. UniFrac \n",
    "            distance determines the amount of the phylogenetic tree which does \n",
    "            not overlap between two samples. Weighted UniFrac takes into account \n",
    "            the abundance of taxa within a sample, while unweighted UniFrac \n",
    "            distance only considers the presence or absence of a particular \n",
    "            community member. Additional options are available in the documentation for \n",
    "            <a href=\"http://qiime.org/scripts/beta_diversity.html\">beta_diversity.py</a>.\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "beta_metrics = \"unweighted_unifrac,weighted_unifrac\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>\n",
    "\n",
    "<a id=\"datasets\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Set Selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can select the datasets which we’ll generate using this notebook. The default for each body site is to generate a <a href=\"#Data-Sets-Generated-by-this-Notebook\">data set</a> (OTU table, mapping file and distance matrices) for all participants and all samples.  We may want to limit our samples to a <a href=\"#Select-a-Single-Sample-for-Each-Participant\">single sample per individual</a>. Or, we could choose only to work with a subset of the data (see <a href=\"#Identification-of-a-Healthy-Subset-of-Adults\">Identification of a Healthy Subset of Adults</a>).\n",
    "\n",
    "<table style=\"width: 90%; border-style: hidden; borders-collapse: collapse; line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\"\n",
    "                   >\n",
    "            <strong>habitat_bodysite</strong><br>(list of strings)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            A list of the body site names used by our American Gut metadata standards. Some of these are inconvenient for file naming, and so we will rename some of these fields using the corresponding <strong><code>all_bodysites</code></strong> name. For example, the standard name for a mouth sample is to label it as an <code><font color=\"FireBrick\">“oral cavity”</font></code>, however, spaces in file paths make life difficult, so this is mapped to <code><font color=\"FireBrick\">“oral”</font></code> in our <strong><code>all_bodysites</code></strong> list.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\"\n",
    "                   >\n",
    "            <strong>all_bodysites</strong><br>(list of strings)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            A list of all the possible body sites which will be used to generate the datasets here. The order of body sites must correspond to the order of body sites in <strong><code>habitat_bodysites</code></strong>.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\"\n",
    "                   >\n",
    "            <strong>sub_part_sites</strong><br>(set)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            We may also want to generate data sets which limits our sample set to \n",
    "exclude samples which are already known to significantly affect the microbiome. The subset currently being selected focuses mainly on fecal samples.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\"\n",
    "                   >\n",
    "            <strong>one_samp_sites</strong><br>(set)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            For some types of analysis, there is an assumption that samples are \n",
    "            independent, which in this context includes the requirement that \n",
    "            there are not multiple samples per individual. To limit analysis to \n",
    "            a single sample from each individual, we can select body site where \n",
    "            we want to filter for a single sample per individual. We recommend \n",
    "            doing this for all body sites.\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Lists all bodysites to be analyzed\n",
    "habitat_sites = ['feces', 'oral cavity', 'skin']\n",
    "all_bodysites = ['fecal', 'oral', 'skin']\n",
    "\n",
    "# Handles healthy subset OTU tables\n",
    "sub_part_sites = {'fecal'}\n",
    "\n",
    "# Handles single sample OTU tables\n",
    "one_samp_sites = {'fecal', 'oral', 'skin'}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# File paths and Directories"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To help organize the results of this notebook, we’ll start by setting up a series of files where results can be saved. This will provide a common file structure (`base_dir`, `working_dir`, etc.) for the results. We’re going to set up three primary directories here, and then nest additional directories inside.\n",
    "\n",
    "As we set up directories, we’ll make use the of the `check_dir` function. This will create the directories we identify if they do not exist."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Base Directory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need a general location to do all our analysis; this is the `base_dir`. All our other directories will exist within the base_dir, and allow us to work.\n",
    "\n",
    "<table style=\"width: 90%; border-style: hidden; borders-collapse: collapse; line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\"\n",
    "                   >\n",
    "            <strong>base_dir</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The file path for the directory where any files associated with the \n",
    "            analysis should be saved. It is suggested this be a directory called \n",
    "            <code><font color=\"FireBrick\">\"agp_analysis\"</font></code>, located \n",
    "            in the same directory as the IPython notebooks.\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "base_dir = os.path.join(os.path.abspath('.'), 'agp_analysis')\n",
    "div.check_dir(base_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>\n",
    "<a id=\"dir_ref\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reference Directories and Files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some of the steps in our diversity calculations will require a phylogenetic tree. This contains information about the evolutionary relationship between OTUs which can be leveraged in calculating PD Whole Tree Diversity and UniFrac Distance.\n",
    "While there are multiple ways to pick OTUs, this table was generated using a reference-based technique. Therefore, we can simply download the phylogenetic tree file for the reference set. Our reference for this dataset was the Greengenes version 13_8 at 97% similarity [<a href=\"#22134646\">11</a>]. \n",
    "Please refer to the [Primary Processing Pipeline Notebook](http://nbviewer.ipython.org/github/biocore/American-Gut/blob/master/ipynb/module2_v1.0.ipynb) for more information about how OTUs are picked.\n",
    "\n",
    "<table style=\"width: 90%; border-style: hidden; borders-collapse: collapse; line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\"\n",
    "                   >\n",
    "        <strong>tree_fp</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            This gives the  location of the correct tree file inside your Greengenes 13_8 \n",
    "            directory. The default tree from your QIIME installation is called, here.\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "tree_fp = get_reference_tree()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Working Directories and Files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The working directories will be used to save files we generate while cleaning the data. These may include items like downloaded OTU tables, and rarefaction instances.\n",
    "\n",
    "<table style=\"width:90%; \n",
    "              border-style:hidden;\n",
    "              borders-collapse:collapse;\n",
    "              line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>working_dir</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The file path for a directory where intermediate files (i.e. \n",
    "            rarefaction instances) generated during the run of this notebook can \n",
    "            be stored. It is recommended this be located within the base \n",
    "            analysis directory.\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Sets up a directory to save intermediate files and downloads\n",
    "working_dir = os.path.join(base_dir, 'intermediate_files')\n",
    "div.check_dir(working_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download Directories and Files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our first analysis step is to locate the American Gut OTU tables generated by the [Primary Processing Pipeline Notebook](http://nbviewer.ipython.org/github/biocore/American-Gut/blob/master/ipynb/module2_v1.0.ipynb) The tables may be generated locally, may be located in local GitHub Repository or they may be downloaded directly from [GitHub](https://github.com/biocore/American-Gut/tree/master/data). \n",
    "\n",
    "<table style=\"width: 90%; \n",
    "              border-style:hidden; \n",
    "              borders-collapse:collapse;\n",
    "              line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>download_dir</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The file path where downloaded files should be saved. The \n",
    "            <strong><code>download_dir</code></strong> may be located within the \n",
    "            <strong><code>working_dir</code></strong>, it’s also likely the \n",
    "            <strong><code>download_dir</code></strong> may be located outside \n",
    "            the <strong><code>base_dir</code></strong>.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>download_otu_fp</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The uncompressed OTU table from GitHub is located at this file path. \n",
    "            This should be a .biom file, with no compression.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>download_map_fp</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The location of the American Gut mapping file, downloaded from \n",
    "            GitHub.\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Creates a directory for unprocessed file downloads\n",
    "download_dir = os.path.join(working_dir, 'downloads')\n",
    "div.check_dir(download_dir)\n",
    "\n",
    "# Sets the filepaths for downloaded files\n",
    "download_otu_fp = os.path.join(download_dir, 'AG_100nt.biom')\n",
    "download_map_fp = os.path.join(download_dir, 'AG_100nt.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Rarefaction Directory and Files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we perform rarefaction on our data, we will generate several similarly named files. To help keep these organized, we will create a directory for each set of rarefaction files.\n",
    "\n",
    "<table style=\"width:90%; \n",
    "              border-style:hidden; \n",
    "              borders-collapse:collapse; \n",
    "              line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>rare_dir</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The file path to the directory where we should save all of our \n",
    "            rarefaction files. This should be located in the <strong>\n",
    "            <code>working_dir</code></strong>.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>rare_pattern</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            This describes the way rarefied OTU tables will be saved in our \n",
    "            output directories. The \n",
    "            <code><font color=\"FireBrick\">“%i”</font></code> \n",
    "            will allow us to substitute any integers. In this case, we will \n",
    "            specify the rarefaction depth, and the rarefaction instance for each \n",
    "            table. \n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Creates a parent directory for rarefaction instances\n",
    "rare_dir = os.path.join(working_dir, 'rarefaction')\n",
    "div.check_dir(rare_dir)\n",
    "\n",
    "# Sets a pattern for the filenames of the rarefaction files\n",
    "rare_pattern = 'rarefaction_%(rare_depth)i_%(rare_instance)i.biom'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Alpha Diversity Directories and Files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we perform the alpha diversity on our data, we will use similarly named rarefaction tables and generate several similarly named alpha diversity files. To help keep these organized, we will create a directory for each set of alpha diversity files.\n",
    "\n",
    "<table style=\"width:90%; \n",
    "              border-style:hidden; \n",
    "              borders-collapse:collapse; \n",
    "              line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>alpha_dir</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The file path to the directory where we should save all of our alpha \n",
    "            diversity files. This should be located in the \n",
    "            <strong><code>working_dir</code></strong>.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>alpha_pattern</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            This describes the way alpha diversity files will be saved in our \n",
    "output directories. The <code><font color=\"FireBrick\">“%i”</font></code> will allow us to substitute any integer. In this case, we will specify the rarefaction depth, and the rarefaction instance for each table. The rarefaction instances are numbered sequentially, from 0 to the number of rarefactions.\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Creates a parent directory for the alpha diversity files\n",
    "alpha_dir = os.path.join(working_dir, 'alpha')\n",
    "div.check_dir(alpha_dir)\n",
    "\n",
    "# Sets a pattern for the filenames of alpha diversity tables\n",
    "alpha_pattern = 'alpha_rarefaction_%(rare_depth)i_%(rare_instance)i.txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Split Directories and Files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At some point in our analysis, it will become necessary to split our OTU table by body site. The directory we create next will specify the location, while the file patterns will allow us to iterate or select one of many possible files using a simple string substitution.\n",
    "\n",
    "<table style=\"width:90%; \n",
    "              border-style:hidden; \n",
    "              borders-collapse:collapse; \n",
    "              line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>split_raw_dir</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The file path where we should put the unrarefied OTU table \n",
    "            after splitting by body site. This should be located in the \n",
    "            <strong><code>working_dir</code></strong>.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>split_rare_dir</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The file path where we should put the rarefied OTU table after splitting by body site. This should be located in the \n",
    "            <strong><code>working_dir</code></strong>.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>split_fn</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The files generated by OTU splitting will follow this naming \n",
    "            convention. The blanks, <a href=\"#Rarefaction-parameters\">rare_suffix</a>, \n",
    "            <a href=\"#Split-Parameters\">split_field</a>, \n",
    "            <a href=\"#Split-Parameters\">split_prefix</a>, \n",
    "            split_group and extension are used to specify the level of \n",
    "            rarefaction, the field used for splitting the data, the group in \n",
    "            that split, and the type of file generated. Here, we expect the \n",
    "            split_group to be a body site.\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Creates a directory for splitting the OTU table by bodysite\n",
    "split_raw_dir = os.path.join(working_dir, 'split_by_bodysite_raw')\n",
    "div.check_dir(split_raw_dir)\n",
    "split_rare_dir = os.path.join(working_dir, 'split_by_bodysite_rare')\n",
    "div.check_dir(split_rare_dir)\n",
    "\n",
    "# Sets a pattern for filenames in the split directory\n",
    "split_fn = 'AGP_100nt%(rare_suffix)s__%(split_field)s_%(split_prefix)s%(split_group)s__.%(extension)s'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Output Directories and Files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we need to set up the directories and filenames where we will save our results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Directory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We’ll start by creating an output directory where our results should be located.\n",
    "\n",
    "<table style=\"width: 90%; border-style: hidden; borders-collapse: collapse; line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>data_dir</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The file path for a directory where the results of this notebook \n",
    "            (OTU tables, mapping files, and distance matrices) should be saved. \n",
    "            This should be a directory in \n",
    "            <strong><code>base_dir</code></strong>.\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Sets up a directory where all results should be saved\n",
    "data_dir = os.path.join(base_dir, 'sample_data')\n",
    "if not data_download:\n",
    "    div.check_dir(data_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Body Site Directories"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we'll create directories specific to the body sites for which we have samples. The file path for a directory where each of the result sets from this notebook should be stored. The folders are described <a href=\"#Data-Sets-Generated-by-this-Notebook\">above</a>.<br>\n",
    "\n",
    "<table style=\"width: 90%; border-style: hidden; borders-collapse: collapse; line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>all_dir</strong>;<br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The all data directory will contain the full American Gut results.<br>\n",
    "       </td>\n",
    "   </tr>\n",
    "   <tr>\n",
    "       <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <em>bodysite directories</em><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The <em>bodysite directories</em> can be identified by adjusting the <a href=\"#Data-Set-Selection\"><strong><code>all_bodysite</code></strong></a> variable.<br>\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Sets up an all sample directory\n",
    "all_dir = os.path.join(data_dir, 'all')\n",
    "if not data_download:\n",
    "    div.check_dir(all_dir)\n",
    "\n",
    "# Creates body-site specific directories\n",
    "if not data_download:\n",
    "    for site in all_bodysites:\n",
    "        div.check_dir(os.path.join(data_dir, site))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Set Directories"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will set up file names for the output directories where we’ll save our final files. \n",
    "\n",
    "The <code><font color=\"FireBrick\">“%s”</font></code> prefix will allow us to insert any file path for the output directory. We expect the final file paths used with these directories to be located in the <strong><code>data_dir</code></strong>.\n",
    " \n",
    "<table style=\"width:90%; \n",
    "              border-style:hidden;\n",
    "              borders-collapse:collapse;\n",
    "              line-height:120%\">\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>asab_pattern</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            A file pattern for the directory where we’ll save the data from all \n",
    "            samples from all participants. \n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>assb_pattern</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            A file pattern for the directory where data from a single sample for each individual at each body site is stored. Note that it's possible to have multiple samples for the same individual, as long as the individual contributed samples at multiple body sites. However, the two samples from the same individual will be represented in different tables.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>ssab_pattern</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            A file pattern for the directory where we’ll save the data from all \n",
    "            samples from a subset of participants.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>sssb_pattern</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            A file pattern for the directory where we’ll save the data from a \n",
    "            single samples from each participant in the healthy subset.\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Sets up the file path pattern for the possible sets of sample at each body\n",
    "# site\n",
    "asab_pattern = os.path.join(data_dir, '%(site)s/all_participants_all_samples')\n",
    "assb_pattern = os.path.join(data_dir, '%(site)s/all_participants_one_sample')\n",
    "ssab_pattern = os.path.join(data_dir, '%(site)s/sub_participants_all_samples')\n",
    "sssb_pattern = os.path.join(data_dir, '%(site)s/sub_participants_one_sample')\n",
    "\n",
    "# Checks the file paths\n",
    "if not data_download:\n",
    "    for site in all_bodysites:\n",
    "        site_blank = {'site': site}\n",
    "        div.check_dir(asab_pattern % site_blank)\n",
    "        if site in one_samp_sites:\n",
    "            div.check_dir(assb_pattern % site_blank)\n",
    "        if site in sub_part_sites:\n",
    "            div.check_dir(ssab_pattern % site_blank)\n",
    "        if site in sub_part_sites and site in one_samp_sites:\n",
    "            div.check_dir(sssb_pattern % site_blank)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data File Names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we will set up file names for the file names we’ll put in the output directories. We can combine these with our directories to get our final file paths.\n",
    "\n",
    "The `AGP_100nt` (the data comes from the American Gut Project and sequences were trimmed to 100 nucleotides length). We can describe the [rarefaction depth](#Rarefaction-parameters) and [body site](#Data-Set-Selection) in the blanks.\n",
    "\n",
    "<table style=\"width:90%;\n",
    "              border-style:hidden;\n",
    "              borders-collapse:collapse;\n",
    "              line-height:120%\"\n",
    "              >\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>otu_fn</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            A pattern for the filename for output OTU table files.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>map_fn</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            A pattern for the filename for output mapping files.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>dm_fn</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            A pattern for the file name used by the distance matrix files generated by the <a href=\"#params_beta\"><strong><code>beta_metrics</code></strong></a>\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "otu_fn = 'AGP_100nt%(rare_depth)s%(spacer)s%(site)s.biom'\n",
    "map_fn = 'AGP_100nt%(rare_depth)s%(spacer)s%(site)s.txt'\n",
    "dm_fn = ['%(metric)s_AGP_100nt%(rare_depth)s%(spacer)s%(site)s.txt' \n",
    "         % {'metric': m, 'rare_depth': '%(rare_depth)s', \n",
    "            'spacer': '%(spacer)s', 'site': '%(site)s'}\n",
    "         for m in beta_metrics.split(',')]\n",
    "\n",
    "sin_fn = 'single_samples.txt'\n",
    "sub_fn = 'subset_samples.txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### File Exensions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are a set of blanks which are filled in for each file name. Some of these blanks will follow consistent patterns, which we can set before the files are used.\n",
    "\n",
    "<table style=\"width:90%; \n",
    "              border-style:hidden;\n",
    "              borders-collapse:collapse;\n",
    "              line-height:120%\">\n",
    "    <tr>\n",
    "         <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>map_extension</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The file extension for <a href=\"#Metadata\">mapping files</a>. \n",
    "            Mapping files are typically tab-delimited text files.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "         <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>otu_extension</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            The file extension for <a href=\"#OTU-Tables\">OTU table files</a>. \n",
    "            OTU tables are typically Biom-formatted files.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "         <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>rare_suffix</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            This is added to file names to denote that rarefaction has occurred. \n",
    "            Typically, this should be \n",
    "            <code><font color=\"FireBrick\">“even”</font></code> with the \n",
    "            rarefaction depth.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>raw_suffix</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            This is added to files in which rarefaction has not been performed. \n",
    "            Usually, this will be an empty string. This is required to maintain \n",
    "            appropriate string formatting.\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "         <td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "            <strong>site_pad</strong>;<br><strong>all_</strong><br>(string)\n",
    "        </td>\n",
    "        <td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            These are spacers used to keep name formats clean and correct.\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "map_extension = 'txt'\n",
    "otu_extension = 'biom'\n",
    "\n",
    "site_pad = '_'\n",
    "all_ = ''\n",
    "\n",
    "rare_suffix = '_even10k'\n",
    "raw_suffix = ''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### File Blanks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can set up the values to fill in the file blanks, using the body site names and the file paths we’ve provided. We’ll generate some of these using substitutions.\n",
    "\n",
    "In cases where body site must be specified, this is handled in two ways. When dealing with splitting data by body site, the body site will be given by the <code><font color=\"FireBrick\">split_group</font></code>, and we'll use <code>nan</code> as a placeholder. \n",
    "\n",
    "After the tables are split, we generate body site-specific information as a list.\n",
    "\n",
    "<table style=\"width:90%;\n",
    "\t\t\t  border-style:hidden;\n",
    "\t\t\t  borders-collapse:collapse;\n",
    "\t\t\t  line-height:120%\">\n",
    "\t<tr>\n",
    "\t\t<td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "\t\t\t<strong>last_rare</strong><br />(dict)\n",
    "\t\t</td>\n",
    "\t\t<td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "\t\t\tFills in the blanks for a rarefaction table or alpha diversity file. \n",
    "            This is used to check if all rarefaction of alpha diversity files \n",
    "            have been generated for an analysis.\n",
    "\t\t</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t\t<td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "\t\t\t<strong>all_raw_blanks</strong><br />(dict)\n",
    "\t\t</td>\n",
    "\t\t<td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            Fills in the blanks for the unrarefied, all-sample files.\n",
    "\t\t</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t\t<td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "\t\t\t<strong>all_rare_blanks</strong><br />(dict)\n",
    "\t\t</td>\n",
    "\t\t<td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            Fills in the blanks for the rarefied, all-sample files.\n",
    "\t\t</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t\t<td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "\t\t\t<strong>otu_raw_split_blanks</strong><br />(dict)\n",
    "\t\t</td>\n",
    "\t\t<td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            Fills in the blanks for the unrarefied split file patterns to \n",
    "            identify the split OTU table. \n",
    "\t\t</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t\t<td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "\t\t\t<strong>map_raw_split_blanks</strong><br />(dict)\n",
    "\t\t</td>\n",
    "\t\t<td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            Fills in the blanks for the unrarefied split file patterns to \n",
    "            identify the split metadata file. The <code>nan</code> value for \n",
    "            <code><font color=\"FireBrick\">split_group</font></code> is a place \n",
    "            holder.\n",
    "\t\t</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t\t<td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "\t\t\t<strong>otu_rare_split_blanks</strong><br />(dict)\n",
    "\t\t</td>\n",
    "\t\t<td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            Fills in the blanks for the rarefied split file patterns to identify \n",
    "            the split OTU table.\n",
    "\t\t</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t\t<td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "\t\t\t<strong>map_rare_split_blanks</strong><br />(dict)\n",
    "\t\t</td>\n",
    "\t\t<td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            Fills in the blanks for the rarefied split file patterns to identify \n",
    "            the split metadata file.\n",
    "\t\t</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t\t<td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "\t\t\t<strong>raw_sample_blanks</strong><br />(dict)\n",
    "\t\t</td>\n",
    "\t\t<td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            Fills in the blanks for the rarefied files at each body site.\n",
    "\t\t</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t\t<td style=\"width: 30%;\n",
    "                   text-align:left; \n",
    "                   vertical-align:top;\n",
    "                   background-color:#D0D0D0;\n",
    "                   border-right:hidden; \n",
    "                   border-bottom: 10px solid white;\n",
    "                   padding:10px\">\n",
    "\t\t\t<strong>rare_sample_blanks</strong><br />(dict)\n",
    "\t\t</td>\n",
    "\t\t<td style=\"width: 60%\n",
    "                   text-align: left;\n",
    "                   vertical-align: top;\n",
    "                   border-left:hidden;\n",
    "                   border-top:hidden;\n",
    "                   border-bottom:hidden;\n",
    "                   padding:10px;\n",
    "                   \">\n",
    "            Fills in the blanks for the rarefied files at each body site.\n",
    "\t\t</td>\n",
    "\t</tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "last_rare = {'rare_depth': rarefaction_depth,\n",
    "             'rare_instance': num_rarefactions - 1}\n",
    "\n",
    "all_raw_blanks = {'rare_depth': raw_suffix,\n",
    "                  'spacer': all_,\n",
    "                  'site': all_}\n",
    "all_rare_blanks = {'rare_depth': rare_suffix,\n",
    "                   'spacer': all_,\n",
    "                   'site': all_}\n",
    "\n",
    "otu_raw_split_blanks = {'rare_suffix': raw_suffix,\n",
    "                        'split_field': split_field,\n",
    "                        'split_prefix': split_prefix,\n",
    "                        'split_group': np.nan,\n",
    "                        'extension': otu_extension}\n",
    "\n",
    "map_raw_split_blanks = {'rare_suffix': raw_suffix,\n",
    "                        'split_field': split_field,\n",
    "                        'split_prefix': split_prefix,\n",
    "                        'split_group': np.nan,\n",
    "                        'extension': map_extension}\n",
    "\n",
    "otu_rare_split_blanks = {'rare_suffix': rare_suffix,\n",
    "                         'split_field': split_field,\n",
    "                         'split_prefix': split_prefix,\n",
    "                         'split_group': np.nan,\n",
    "                         'extension': otu_extension}\n",
    "\n",
    "map_rare_split_blanks = {'rare_suffix': rare_suffix,\n",
    "                         'split_field': split_field,\n",
    "                         'split_prefix': split_prefix,\n",
    "                         'split_group': np.nan,\n",
    "                         'extension': map_extension}\n",
    "raw_sample_blanks = {'site': np.nan,\n",
    "                     'rare_depth': raw_suffix,\n",
    "                     'spacer': site_pad}\n",
    "\n",
    "rare_sample_blanks = {'site': np.nan,\n",
    "                      'rare_depth': rare_suffix,\n",
    "                      'spacer': site_pad}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Download"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now start our analysis by downloading the American Gut mapping file and OTU tables if they’re not already located in the <code><strong>download_dir</strong></code>, they will be downloaded to this location. If the files exist, new versions will be downloaded only if [**`overwrite`**](#File-Saving-Parameters) is set to <code><font color=\"228B22\">True</font></code>. \n",
    "\n",
    "*Note that this step requires an internet connection.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "if data_download:\n",
    "    !curl -OL ftp://ftp.microbio.me/pub/American-Gut-precomputed/r1-15/sample_data.tgz\n",
    "    !tar -xzf sample_data.tgz\n",
    "    shutil.move('./sample_data', base_dir)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Gets the biom file\n",
    "if not data_download and (not os.path.exists(download_otu_fp) or overwrite):\n",
    "    # Downloads the compressed biom file\n",
    "    !curl -OL https://github.com/biocore/American-Gut/blob/master/data/AG/AG_100nt.biom?raw=true\n",
    "    # Moves the biom file to its final location\n",
    "    shutil.move(os.path.join(os.path.abspath('.'), 'AG_100nt.biom?raw=true'), download_otu_fp)\n",
    "\n",
    "# Gets the mapping file\n",
    "if not data_download and (not os.path.exists(download_map_fp) or overwrite):\n",
    "    # Downloads the mapping files\n",
    "    !curl -OL https://github.com/biocore/American-Gut/blob/master/data/AG/AG_100nt.txt?raw=true\n",
    "    # Moves the file to the download file path\n",
    "    shutil.move(os.path.join('.', 'AG_100nt.txt?raw=true'), download_map_fp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mapping File Clean up"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will start by adjusting the metadata. This will correct errors and provide a uniform format for derived columns we may wish to use later or in downstream analyses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Loads the mapping file\n",
    "raw_map = pd.read_csv(download_map_fp,\n",
    "                      sep=txt_delim, \n",
    "                      na_values=map_nas,\n",
    "                      index_col=False,\n",
    "                      dtype={map_index: str},\n",
    "                      low_memory=False)\n",
    "raw_map.index = raw_map[map_index]\n",
    "del raw_map[map_index]\n",
    "\n",
    "# Loads the OTU table\n",
    "raw_otu = biom.load_table(download_otu_fp)\n",
    "\n",
    "# Filters the raw map to remove any samples that are not present in the biom table\n",
    "raw_map = raw_map.loc[raw_otu.ids()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Age"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are also a set of columns which are not included in the map, but may be useful for downstream analyses. These include age binned by decade (`AGE_CAT`). While there are QIIME analyses which can handle continuous metadata, binning can help reduce some of the noise.\n",
    "Here, we bin age by decade, with the exception of people under the age of 20. The gut develops in the first two years of life, and the guts of young children are significantly different than older children or adults [<a href=\"#20668239\">12</a>, <a href=\"#22699611\">13</a>]. We will also combine individuals over the age of 70 into their own category, due to the low sample counts of people over 80 as of round 14 (*n* < 20)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Bins age by decade (with the exception of young children)\n",
    "def categorize_age(x):\n",
    "    if np.isnan(x):\n",
    "        return x\n",
    "    elif x < 3:\n",
    "        return \"baby\"\n",
    "    elif x < 13:\n",
    "        return \"child\"\n",
    "    elif x < 20:\n",
    "        return \"teen\"\n",
    "    elif x < 30:\n",
    "        return \"20s\"\n",
    "    elif x < 40:\n",
    "        return \"30s\"\n",
    "    elif x < 50:\n",
    "        return \"40s\"\n",
    "    elif x < 60:\n",
    "        return \"50s\"\n",
    "    elif x < 70:\n",
    "        return \"60s\"\n",
    "    else:\n",
    "        return \"70+\"\n",
    "raw_map['AGE_CAT'] = raw_map.AGE.apply(categorize_age)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Alcohol Consumption"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition to considering the frequency with which people use alcohol (Never, Rarely, Occasionally, Regularly, or Daily), it may be helpful to simply look for an effect associated with any alcohol consumption."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def categorize_etoh(x):\n",
    "    if x == 'Never':\n",
    "        return \"No\"\n",
    "    elif isinstance(x, str):\n",
    "        return \"Yes\"\n",
    "    elif np.isnan(x):\n",
    "        return x\n",
    "    \n",
    "raw_map['ALCOHOL_CONSUMPTION'] = raw_map.ALCOHOL_FREQUENCY.apply(categorize_etoh)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Body Mass Index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Body Mass Index (BMI) can be stratified into [categories](http://en.wikipedia.org/wiki/Body_mass_index#Categories) which give an approximate idea of body shape. It is worth noting that these stratifications do not hold well for growing children, where the BMI qualification is based on age and gender."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Categorizes the BMI into groups\n",
    "def categorize_bmi(x):\n",
    "    if np.isnan(x):\n",
    "        return x\n",
    "    elif x < 18.5:\n",
    "        return \"Underweight\"\n",
    "    elif x < 25:\n",
    "        return \"Normal\"\n",
    "    elif x < 30:\n",
    "        return \"Overweight\"\n",
    "    else:\n",
    "        return \"Obese\"\n",
    "raw_map['BMI_CAT'] = raw_map.BMI.apply(categorize_bmi)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Collection Season"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "American Gut samples have been collected since December of 2012. To look for patterns associated with the time of year samples were collected, we bin this date information into month, and season.\n",
    "\n",
    "We currently define our seasons according to the calendar in the Northern Hemisphere, because as of round fifteen, 99% of our samples were collected north of the equator. Additionally, rather than defining our seasons by the solar calendar, we have elected to use the first day of the month the solstice or equinox occurs in as the start of our season. So, while Winter technically begins on December 20th or 21st, according to the solar calendar, we consider December 1st as the first day of our Winter.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def convert_date(x):\n",
    "    \"\"\"Converts strings to a date object\"\"\"\n",
    "    if isinstance(x, str) and \"/\" in x:\n",
    "        try:\n",
    "            return pd.tseries.tools.to_datetime(x)\n",
    "        except:\n",
    "            return np.nan\n",
    "    else:\n",
    "        return x\n",
    "\n",
    "raw_map.COLLECTION_DATE = raw_map.COLLECTION_DATE.apply(convert_date)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Categorizes data by collection month and collection season\n",
    "month_map = {-1: [np.nan, np.nan],\n",
    "             np.nan: [np.nan, np.nan],\n",
    "             1: ['January', 'Winter'],\n",
    "             2: ['February', 'Winter'],\n",
    "             3: ['March', 'Spring'],\n",
    "             4: ['April', 'Spring'],\n",
    "             5: ['May', 'Spring'],\n",
    "             6: ['June', 'Summer'],\n",
    "             7: ['July', 'Summer'],\n",
    "             8: ['August', 'Summer'],\n",
    "             9: ['September', 'Fall'],\n",
    "             10: ['October', 'Fall'],\n",
    "             11: ['November', 'Fall'],\n",
    "             12: ['December', 'Winter']}\n",
    "\n",
    "def map_month(x):\n",
    "    try:\n",
    "        return month_map[x.month][0]\n",
    "    except:\n",
    "        return np.nan\n",
    "\n",
    "def map_season(x):\n",
    "    try:\n",
    "        return month_map[x.month][1]\n",
    "    except:\n",
    "        return np.nan\n",
    "\n",
    "# Maps the data as a month\n",
    "raw_map['COLLECTION_MONTH'] = \\\n",
    "    raw_map.COLLECTION_DATE.apply(map_month)\n",
    "\n",
    "# Maps the data as a season\n",
    "raw_map['COLLECTION_SEASON'] = \\\n",
    "    raw_map.COLLECTION_DATE.apply(map_season)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Collection Location"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The American Gut Project includes some geographical information about where samples were collected. While the data may be leveraged as-is, it can also be helpful to clean up the data. \n",
    "\n",
    "We'll start by checking for uniform country mapping. This will allow us to combine samples from countries or groups of countries with multiple descriptive names, such as the Great Britain and the United Kingdom."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def map_countries(x):\n",
    "    return geo.country_map.get(x, x)\n",
    "\n",
    "raw_map.COUNTRY = raw_map.COUNTRY.apply(map_countries)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In Rounds 1-15 participants come predominantly from the US, UK, Belgium and Canada. Since the area occupied by Belgium and the UK are smaller than the size of many states in the contiguous US (including some of the most represented states in the American Gut), we have elected to only consider the **STATE** field for American and Canadian samples.\n",
    "\n",
    "This does not alter the information provided by the zip (postal) code or Longitude/Latitude information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Removes state information for any state not in the US\n",
    "# (This may change as additional countries are added.)\n",
    "countries = raw_map.groupby('COUNTRY').count().STATE.index.values\n",
    "for country in countries:\n",
    "    if country not in {'GAZ:United States of America', 'GAZ:Canada'}:\n",
    "        raw_map.loc[raw_map.COUNTRY == country, 'STATE'] = np.nan\n",
    "\n",
    "# Handles regional mapping, cleaning up states so that only American and\n",
    "# Canadian states are included \n",
    "def check_state(x):\n",
    "    if isinstance(x, str) and x in geo.us_state_map:\n",
    "        return geo.us_state_map[x.upper()]\n",
    "    elif  isinstance(x, str) and x in geo.canadian_map_english:\n",
    "        return geo.canadian_map_english[x.upper()]\n",
    "    else:\n",
    "        return np.nan\n",
    "raw_map['STATE'] = raw_map.STATE.apply(check_state)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We may also choose to use predefined regions to further bin our location data, and allow us to look for social or economic trends. To this end, we can apply regions defined by the [US Census Bureau](https://www.census.gov/geo/reference/gtc/gtc_census_divreg.html) and Economic Regions defined by the [US Bureau of Economic Analysis](http://www.bea.gov), which is part of the Department of Commerce."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Bins data by census region\n",
    "def census_f(x):\n",
    "    if  isinstance(x, str) and x in geo.regions_by_state:\n",
    "        return geo.regions_by_state[x]['Census_1']\n",
    "    else:\n",
    "        return np.nan\n",
    "raw_map['CENSUS_REGION'] = raw_map.STATE.apply(census_f)\n",
    "\n",
    "\n",
    "# Bins data by economic region\n",
    "def economic_f(x):\n",
    "    if isinstance(x, str) and  x in geo.regions_by_state:\n",
    "        return geo.regions_by_state[x]['Economic']\n",
    "    else:\n",
    "        return np.nan\n",
    "raw_map['ECONOMIC_REGION'] = raw_map.STATE.apply(economic_f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>\n",
    "\n",
    "<a id=\"map_sleep\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sleep Duration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As of round 15, there are 36 participants who report sleeping less than five hours a night. To all for a larger sample size, we will pool these with the individuals who report sleeping between five and six hours a night, to create a group who report sleeping less than six hours a night."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "raw_map.loc[raw_map.SLEEP_DURATION == 'Less than 5 hours', 'SLEEP_DURATION'] = 'Less than 6 hours'\n",
    "raw_map.loc[raw_map.SLEEP_DURATION == '5-6 hours', 'SLEEP_DURATION'] = 'Less than 6 hours'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Identification of a Healthy Subset of Adults"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Certain health states are known to influence the microbiome in extreme ways. For some analyses we will do later, it may be useful to limit the noise associated with these conditions to allow us to look for new patterns. We have identified five metadata categories which we will use to limit our “healthy” subset.\n",
    "\n",
    "First, we limit based on age for several reasons. We chose to omit anyone under the age of twenty. The microbiome of very young children is not yet stable, and differs greatly from that of adults  [<a href=\"#20668239\">12</a>, <a href=\"#22699611\">13</a>]. Additionally, BMI limits are not easily assigned in people who are still growing. Without stratifying by gender, we assumed that growth will be complete in most people by the age of 20, and set our limit there. The limit at seventy was based on the number of individuals over that age, and on differences in the microbiome seen in older individuals [<a href=\"#20571116\">14</a>, <a href=\"#22797518\">15</a>].\n",
    "\n",
    "We also used Body Mass Index as an exclusion criteria, considering only people in the “normal” and “overweight” categories. (BMI 18.5 - 30). It has been suggested that obesity changes the gut microbiome, although the effect is not consistent across all studies [<a href=\"#25307765\">16</a>]. Additionally, we noticed that there were also alterations in our sample of underweight individuals.\n",
    "\n",
    "Recent antibiotic decreases alpha diversity and affects the microbiome [<a href=\"#20847294\">17</a>]. We chose to define “recent” as any time within the last year. We also excluded anyone who reported having Inflammatory Bowel Disease [<a href=\"#25307765\">16</a>], Type I Diabetes [<a href=\"#23274889\">18-21</a>], or Type II Diabetes [<a href=\"#20140211\">22</a>], since all three conditions are known to affect the microbiome."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Creates the subset if its not already in the mapping file\n",
    "if 'SUBSET' not in raw_map.columns:\n",
    "    subset_f = {'AGE': lambda x: 19 < x < 70 and not np.isnan(x),\n",
    "                'DIABETES': lambda x: x == 'I do not have diabetes',\n",
    "                'IBD': lambda x: x == 'I do not have IBD',\n",
    "                'ANTIBIOTIC_SELECT': lambda x: x == 'Not in the last year',\n",
    "                'BMI': lambda x: 18.5 <= x < 30 and not np.isnan(x)}\n",
    "\n",
    "    # Determines which samples meet the requirements of the categories\n",
    "    new_bin = {}\n",
    "    for cat, f in subset_f.iteritems():\n",
    "        new_bin[cat] = raw_map[cat].apply(f)\n",
    "\n",
    "    # Builds up the new binary dataframe\n",
    "    bin_frame = pd.DataFrame(new_bin)\n",
    "\n",
    "    # Adds a column to the current dataframe to look at the subset\n",
    "    bin_series = pd.DataFrame(new_bin).all(1)\n",
    "    bin_series.name = 'SUBSET'\n",
    "\n",
    "    raw_map = raw_map.join(bin_series)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Whole Table Rarefaction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will start by rarefying the whole body table using the rarefaction parameters we set earlier.  Rarefaction is a technique which filters out samples below a certain sequencing depth. Sequences are picked from a weighted average from the remaining samples to that all samples have an even depth. To control for bias which might occur with a single, random subsampling of the data, we use multiple rounds of rarefaction to more accurately estimate the alpha diversity.\n",
    "\n",
    "Rarefaction is important to make intra sample diversity (alpha diversity) comparisons possible. Below is a panel from Figure 1 of Human Gut Microbiome and Risk of Colorectal Cancer[<a href=\"#24316595\">23</a>]. The figure compares Shannon Diversity between individuals with colorectal cancer (*n*=47, red circles) and healthy controls (*n*=94, empty triangles) over several rarefaction depths, or sequence counts per sample.\n",
    "\n",
    "![Cancer Rarefaction curve](images/ahn2013jncicolorectalf1.jpg?raw=true)\n",
    "\n",
    "The figure also illustrates the importance of even sampling depth. If a control sample with 500 sequences per sample were compared with a cancer sample at a depth of 2500 sequences per sample, the cancer sample would appear more diverse. Comparisons at the same depth reveal the true pattern in the data: cancer samples are less diverse than controls.\n",
    "\n",
    "To perform multiple rarefactions, we will use the QIIME script, [multiple_rarefactions_even_depth.py](http://qiime.org/scripts/multiple_rarefactions_even_depth.html)."
   ]
  },
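  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the idea (not the QIIME implementation), the sketch below rarefies the OTU counts of a single sample with NumPy. The function name `rarefy_counts`, the toy counts, and the choice to draw sequences without replacement are assumptions made for this example.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def rarefy_counts(counts, depth, rng):\n",
    "    # counts: sequence counts per OTU for one sample (hypothetical input)\n",
    "    counts = np.asarray(counts, dtype=int)\n",
    "    # Samples below the rarefaction depth are filtered out\n",
    "    if counts.sum() < depth:\n",
    "        return None\n",
    "    # Expands the counts into one OTU label per observed sequence\n",
    "    labels = np.repeat(np.arange(counts.size), counts)\n",
    "    # Randomly draws `depth` sequences without replacement\n",
    "    picked = rng.choice(labels, size=depth, replace=False)\n",
    "    # Collapses the drawn sequences back into per-OTU counts\n",
    "    return np.bincount(picked, minlength=counts.size)\n",
    "\n",
    "\n",
    "rng = np.random.RandomState(0)\n",
    "deep = rarefy_counts([50, 30, 20, 0], depth=10, rng=rng)\n",
    "shallow = rarefy_counts([3, 2, 0, 0], depth=10, rng=rng)\n",
    "```\n",
    "\n",
    "In this sketch, `deep` keeps exactly 10 sequences, while `shallow` is returned as `None` because it has fewer than 10 sequences and would be dropped from the table."
   ]
  },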
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "if not data_download and (not os.path.exists(os.path.join(rare_dir, rare_pattern) %last_rare) or overwrite):\n",
    "    !multiple_rarefactions_even_depth.py -i $download_otu_fp -o $rare_dir -n $num_rarefactions -d $rarefaction_depth --lineages_included"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Whole Table Alpha Diversity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use our rarefaction tables to calculate the alpha diversity associated with each rarefaction.  Alpha diversity is a measure of intra sample diversity. Imagine that we could put up an imaginary wall around a 100ft x 100ft x 10 ft box in Yellowstone National Park, trapping all the vertebrate animals in that area for a short period of time. Imagine that we then made a very careful list (or took photographs) of the area so that we could document all the life we found in the area. We could count all the different types of animals we found in that area. This would be one measure of alpha diversity. \n",
    "\n",
    "Instead of just considering each type of animal to be equally similar, we wanted to include an evolutionary relationship between the animals. So, if our area contained a mouse, a squirrel and a rabbit, we might say these animals are more similar (and therefore less diverse) than if we found a mouse, a squirrel, and a sagebrush lizard in the same area. So, even though we’ve found three species in each case, the third species being a reptile would make it more diverse than the third species being a rodent. \n",
    "\n",
    "A diversity metric which accounts for shared evolutionary history between species is called a phylogenetic metric. This often uses a phylogenetic tree to provide information about that shared history. PD Whole Tree Diversity is a commonly used phylogenetic alpha diversity metric in microbiome research [<a href=\"#15831718\">8</a>]. A taxonomic metric assumes all species are equally different. Common taxonomic metrics for alpha diversity used  in microbiome research include Observed Species Diversity and Chao1 Diversity [<a href=\"#Chao\">9</a>, <a href=\"#shannon\">10</a>].\n",
    "\n",
    "Depending on what information we’re looking for, we might want to include information about the number of each animal belonging to the species we see. We might also want to consider the number of each different species we find in the area, weighting our diversity. So, if in our little area of Yellowstone, 90% of the animals we see are mice, while 5% are rabbits and 5% are trout, we would consider this less diverse than if 40% of the animals were mice, 30% were rabbits and 30% were trout. A metric which takes into account the counts of each species is a quantitative metric, while a qualitative metric looks only at the presence or absence of a species.\n",
    "\n",
    "While alpha diversity is calculated completely independently for each sample, the comparison of alpha diversity may provide clues about environmental changes. For example, pollution or an algal bloom may be associated with lower alpha diversity, and indicate a potential change in the health of the ecosystem.\n",
    "We’ll start our work with alpha diversity by calculating the diversity for our rarefied American Gut tables using the four metrics we selected in the alpha diversity parameters: the phylogenetic PD whole Tree Diversity, and the taxonomic metrics, Observed Species Diversity, Chao1 Diversity and Shannon Diversity. All the diversity metrics we are using here are qualitative metrics."
   ]
  },
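  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the difference between qualitative and quantitative metrics concrete, here is a minimal sketch using the mouse, rabbit and trout percentages from the example above. The helper names `observed_species` and `shannon` are hypothetical, and the base 2 logarithm is an assumption of this sketch; in the notebook itself the metrics are calculated by QIIME.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def observed_species(counts):\n",
    "    # Qualitative: counts the species that are present at all\n",
    "    return int(np.count_nonzero(counts))\n",
    "\n",
    "\n",
    "def shannon(counts):\n",
    "    # Quantitative: weights each species by its relative abundance\n",
    "    counts = np.asarray(counts, dtype=float)\n",
    "    p = counts[counts > 0] / counts.sum()\n",
    "    return float(-(p * np.log2(p)).sum())\n",
    "\n",
    "\n",
    "# Mice, rabbits and trout in the two hypothetical communities\n",
    "uneven = [90, 5, 5]\n",
    "even = [40, 30, 30]\n",
    "\n",
    "richness = (observed_species(uneven), observed_species(even))\n",
    "evenness_gap = shannon(even) - shannon(uneven)\n",
    "```\n",
    "\n",
    "Both communities contain three species, so the qualitative metric cannot tell them apart, while the Shannon value is larger for the more even community."
   ]
  },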
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "if not data_download and (not os.path.exists(os.path.join(alpha_dir, alpha_pattern) % last_rare) or overwrite):\n",
    "    !alpha_diversity.py -i $rare_dir -o $alpha_dir -m $alpha_metrics -t $tree_fp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The alpha diversity results from QIIME are loaded into the notebook. To identify the best rarefaction instance, which we'll use as our OTU table moving forward, we try to identify the rarefaction instance which has alpha diversity closest to the mean alpha diversity represented in the table. We define \"closest\" using the normalized Euclidian distance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "if not data_download:\n",
    "    # Preallocates an output object for alpha diversity\n",
    "    alpha_rounds = {m: {} for m in alpha_metrics.split(',')}\n",
    "    div_metric = alpha_metrics.split(',')[0]\n",
    "\n",
    "    # Loops through the rarefaction instances\n",
    "    for ri in range(num_rarefactions):\n",
    "        a_file_blanks = {'rare_depth': rarefaction_depth,\n",
    "                         'rare_instance': ri}\n",
    "\n",
    "        # Sets the alpha diversity file path\n",
    "        alpha_fp = os.path.join(alpha_dir, alpha_pattern) % a_file_blanks\n",
    "\n",
    "        # Loads the alpha diversity table\n",
    "        alpha = pd.read_csv(alpha_fp,\n",
    "                            sep=txt_delim,\n",
    "                            index_col=False)\n",
    "        alpha.index = alpha['Unnamed: 0']\n",
    "        del alpha['Unnamed: 0']\n",
    "\n",
    "        # Extracts the alpha diversity metrics\n",
    "        for col in alpha_rounds:\n",
    "            alpha_rounds[col]['%i' %ri] = alpha[col]\n",
    "            alpha_rounds[col]['%i' %ri].name = '%i' % ri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "if not data_download:\n",
    "    # Compiles the alpha diversity results into a single table\n",
    "    alpha_df = pd.DataFrame({'%s_mean' % metric: pd.DataFrame(alpha_rounds[metric]).mean(1)\n",
    "                             for metric in alpha_metrics.split(',')})\n",
    "\n",
    "    # Adds the alpha diversity results to the rarefied table\n",
    "    rare_map = raw_map.copy()\n",
    "    rare_map = rare_map.join(alpha_df)\n",
    "    rare_check = np.isnan(rare_map['%s_mean' % div_metric]) == False\n",
    "    rare_map = rare_map.loc[rare_check]\n",
    "\n",
    "    # Draws the data associated with each of the alpha diversity rounds\n",
    "    all_rounds = pd.DataFrame(alpha_rounds[div_metric])\n",
    "\n",
    "    # Lines up the data so the indices match (as a precaution)\n",
    "    all_rounds = all_rounds.sort_index()\n",
    "    alpha_df = alpha_df.sort_index()\n",
    "\n",
    "    # Calculates the distance between each round and the mean\n",
    "    mean_rounds = ([alpha_df['%s_mean' % div_metric].values] * \n",
    "                   np.ones((num_rarefactions, 1))).transpose()\n",
    "    diff = np.sqrt(np.square(all_rounds.values - np.square(mean_rounds))) / mean_rounds\n",
    "\n",
    "    # Determines the minimum distance between the round and the mean\n",
    "    round_labels = np.arange(0, 10)\n",
    "    round_avg = diff.mean(0)\n",
    "    best_rarefaction = round_labels[round_avg == min(round_avg)][0]\n",
    "    best_blanks = {'rare_depth': rarefaction_depth,\n",
    "                   'rare_instance': best_rarefaction}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We’ll save the whole body tables in their own directory, and the modified mapping files. We’ll also copy the raw OTU table and the rarefaction instance closest to the mean alpha diversity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Saves the unrarefied mapping file\n",
    "if not data_download:\n",
    "    raw_map.to_csv(os.path.join(all_dir, map_fn) % all_raw_blanks,\n",
    "                   sep=txt_delim,\n",
    "                   na_rep=write_na,\n",
    "                   index_label=map_index)\n",
    "\n",
    "    # Saves the rarefied mapping file\n",
    "    rare_map.to_csv(os.path.join(all_dir, map_fn) % all_rare_blanks,\n",
    "                   sep=txt_delim,\n",
    "                   na_rep=write_na,\n",
    "                   index_label=map_index)\n",
    "\n",
    "    # Copies the raw OTU table\n",
    "    shutil.copy2(download_otu_fp, \n",
    "                 os.path.join(all_dir, otu_fn) % all_raw_blanks)\n",
    "\n",
    "    # Copies the rarefied OTU table\n",
    "    shutil.copy2(os.path.join(rare_dir, rare_pattern) % best_blanks,\n",
    "                 os.path.join(all_dir, otu_fn) % all_rare_blanks)\n",
    "\n",
    "raw_map = pd.read_csv(os.path.join(all_dir, map_fn) % all_raw_blanks,\n",
    "                      sep=txt_delim,\n",
    "                      na_values=map_nas,\n",
    "                      index_col=False,\n",
    "                      low_memory=False,\n",
    "                      dtype={map_index: str})\n",
    "raw_map.index = raw_map[map_index]\n",
    "del raw_map[map_index]\n",
    "\n",
    "rare_map = pd.read_csv(os.path.join(all_dir, map_fn) % all_rare_blanks,\n",
    "                       sep=txt_delim,\n",
    "                       na_values=map_nas,\n",
    "                       index_col=False,\n",
    "                       low_memory=False,\n",
    "                       dtype={map_index: str})\n",
    "rare_map.index = rare_map[map_index]\n",
    "del rare_map[map_index]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Whole Table Beta Diversity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Beta Diversity allows us to make comparisons between samples and environments. Let’s go back to our 100ft x 100ft x 10ft cube in Yellowstone where we catalogued all the vertebrates. Let’s imagine that we’ve set up the same type of cube in New York City’s Central Park and cataloged all the vertebrates in that area as well.\n",
    "\n",
    "We could compare the two communities by seeing how many species are shared between the two, or, by making some measure that approximates the species. We might expect some overlap: depending on where we selected our regions, it would be unsurprising to encounter Chipmunks in both Central Park and Yellowstone National Park. However, there should also be some differences. Unless our Central Park location includes the zoo, it’s unlikely we’d find a Buffalo in New York City!\n",
    "\n",
    "If we use a taxonomic metric, based only on the species we find in the two locations, we might get very little overlap. While we might expect to find a squirrel in both Central Park and Yellowstone, the animals might be members of different genera! New York is home to the [Eastern Grey Squirrel](http://en.wikipedia.org/wiki/Eastern_gray_squirrel), *Sciurus carolinensis*, while we might find the [American Red Squirrel](http://en.wikipedia.org/wiki/American_red_squirrel), *Tamiasciurus hudsonicus*, in Yellowstone. [<a href=\"#yellowstone\">24</a>, <a href=\"#park\">25</a>]. In this case, a phylogenetic metric, which can account for some similarity between the two species of squirrels, may serve us much better.\n",
    "\n",
    "When we compare microbial communities for beta diversity, we frequently select a phylogenetic metric called UniFrac distance [<a href=\"#16332807\">5</a>]. This metric uses a phylogenetic tree, and determines what fraction of the tree is not shared between two communities. \n",
    "![UniFrac distance trees](http://unifrac.colorado.edu/static/images/fastunifrac/unifrac_significance/unifrac_test.jpg)\n",
    "\n",
    "If we consider only the presence and absence of each OTU in the samples, we have a qualitative metric, unweighted UniFrac distance. Unweighted UniFrac distance may take on values between 0 (everything the same) and 1 (everything different). Weighted UniFrac distance takes into account the abundance of the OTUs, and can take on values greater than 1.\n",
    "\n",
    "The UniFrac distance for each pairwise sample is arranged into a <a href=\"#ftype_dist\">Distance Matrix</a>. We can visualize the distance matrix by many techniques, like making PCoA plots in Emperor, or UPGMA trees like the one shown in the figure below [<a href=\"#24280061\">2</a>].\n",
    "\n",
    "![Unifrac to distance matrix](http://unifrac.colorado.edu/static/images/fastunifrac/cluster_samples/unifrac_clustering.jpg)\n",
    "\n",
    "Since UniFrac distance is calculated for each sample pair in the table, this is one of the most computationally expensive steps we will perform. However, once the UniFrac distance has been calculated for all of our samples, we can simply filter the table to focus on the samples we want. We can leverage the QIIME script, [beta_diveristy.py](http://qiime.org/scripts/beta_diversity.html) to perform our analysis."
   ]
  },
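  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The unweighted calculation can be sketched on a toy tree. This is only an illustration under stated assumptions, not the QIIME implementation: the tree is written out by hand as a list of `(branch length, tips below the branch)` pairs, and the function name `unweighted_unifrac` and the OTU names are hypothetical.\n",
    "\n",
    "```python\n",
    "def unweighted_unifrac(branches, otus_a, otus_b):\n",
    "    # branches: list of (length, set of tip OTUs below that branch)\n",
    "    unique = 0.0\n",
    "    observed = 0.0\n",
    "    for length, tips in branches:\n",
    "        # A branch belongs to a sample if it leads to any OTU in it\n",
    "        in_a = bool(tips & otus_a)\n",
    "        in_b = bool(tips & otus_b)\n",
    "        if in_a or in_b:\n",
    "            observed += length\n",
    "        if in_a != in_b:\n",
    "            unique += length\n",
    "    # Fraction of the observed branch length found in only one sample\n",
    "    return unique / observed\n",
    "\n",
    "\n",
    "# Hypothetical tree: ((otu1:1, otu2:1):1, (otu3:1, otu4:1):1)\n",
    "toy_branches = [(1.0, {'otu1'}), (1.0, {'otu2'}), (1.0, {'otu1', 'otu2'}),\n",
    "                (1.0, {'otu3'}), (1.0, {'otu4'}), (1.0, {'otu3', 'otu4'})]\n",
    "\n",
    "same = unweighted_unifrac(toy_branches, {'otu1', 'otu3'}, {'otu1', 'otu3'})\n",
    "disjoint = unweighted_unifrac(toy_branches, {'otu1', 'otu2'}, {'otu3', 'otu4'})\n",
    "partial = unweighted_unifrac(toy_branches, {'otu1', 'otu2'}, {'otu2', 'otu3'})\n",
    "```\n",
    "\n",
    "Identical samples share every branch and samples on opposite sides of the tree share none, so `same` and `disjoint` fall at the two ends of the range, with `partial` in between. The sketch assumes at least one of the two samples contains an OTU from the tree."
   ]
  },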
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Sets up the filepaths for the all sample rarified table\n",
    "all_otu_rare_fp = os.path.join(all_dir, otu_fn) % all_rare_blanks\n",
    "\n",
    "check_dm_fp = np.array([os.path.exists(os.path.join(all_dir, fn_) \n",
    "                                       % all_rare_blanks) for fn_ in dm_fn])\n",
    "\n",
    "# Calculates the beta diversity\n",
    "if not data_download and (not check_dm_fp.all() or overwrite):\n",
    "    !beta_diversity.py -i $all_otu_rare_fp -m $beta_metrics -t $tree_fp -o $all_dir"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Body Site Split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we’ve generated alpha and beta diversity results for all the body sites, we can start filtering the results. Body site has one of the largest impacts on the microbiome in adult humans [<a href=\"#22699609\">7</a>]. As a result, many analyses will focus on a single body site, often fecal samples.\n",
    "\n",
    "We’ll use the QIIME script, [split_otu_table.py](http://qiime.org/scripts/split_otu_table.html) to split our rarefied and unrarefied OTU tables by body site. We’ll put the output files in intermediate directories, and then move them to the appropriate locations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Sets the raw location file names\n",
    "all_raw_otu_fp = os.path.join(all_dir, otu_fn) % all_raw_blanks\n",
    "all_raw_map_fp = os.path.join(all_dir, map_fn) % all_raw_blanks\n",
    "\n",
    "# Sets the rarefied location file names\n",
    "all_rare_otu_fp = os.path.join(all_dir, otu_fn) % all_rare_blanks\n",
    "all_rare_map_fp = os.path.join(all_dir, map_fn) % all_rare_blanks\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Checks that the raw and rarified bodysite split tables exist\n",
    "raw_split_check = np.array([])\n",
    "rare_split_check = np.array([])\n",
    "\n",
    "for site in all_bodysites:\n",
    "    # Checks the unrarefied splits exists\n",
    "    otu_raw_split_blanks['split_group'] = site\n",
    "    map_raw_split_blanks['split_group'] = site\n",
    "    raw_otu_exist = os.path.exists(os.path.join(split_raw_dir, split_fn) \n",
    "                                   % otu_raw_split_blanks)\n",
    "    raw_map_exist = os.path.exists(os.path.join(split_raw_dir, split_fn) \n",
    "                                   % map_raw_split_blanks)\n",
    "    raw_split_check = np.hstack((raw_split_check, raw_otu_exist, raw_map_exist))\n",
    "    \n",
    "    # Checks the rarefied splits exist\n",
    "    otu_rare_split_blanks['split_group'] = site\n",
    "    map_rare_split_blanks['split_group'] = site\n",
    "    rare_otu_exist = os.path.exists(os.path.join(split_rare_dir, split_fn) \n",
    "                                   % otu_rare_split_blanks)\n",
    "    rare_map_exist = os.path.exists(os.path.join(split_rare_dir, split_fn) \n",
    "                                   % map_rare_split_blanks)\n",
    "    rare_split_check = np.hstack((rare_split_check, rare_otu_exist, rare_map_exist))\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Splits the otu table and mapping file by bodysite\n",
    "if not data_download and (not raw_split_check.any() or overwrite):\n",
    "    !split_otu_table.py -i $all_raw_otu_fp -m $all_raw_map_fp -f BODY_HABITAT -o $split_raw_dir \n",
    "\n",
    "# Splits the otu table and mapping file by bodysite\n",
    "if not data_download and (not rare_split_check.any() or overwrite):\n",
    "    !split_otu_table.py -i $all_rare_otu_fp -m $all_rare_map_fp -f BODY_HABITAT -o $split_rare_dir "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We’ll move and rename our split files to their final location."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Copies the files to their correct final folder\n",
    "if not data_download:\n",
    "    for idx, h_site in enumerate(habitat_sites):\n",
    "        otu_raw_split_blanks['split_group'] = h_site\n",
    "        map_raw_split_blanks['split_group'] = h_site\n",
    "        otu_rare_split_blanks['split_group'] = h_site\n",
    "        map_rare_split_blanks['split_group'] = h_site\n",
    "        raw_sample_blanks['site'] = all_bodysites[idx]\n",
    "        rare_sample_blanks['site'] = all_bodysites[idx]\n",
    "\n",
    "        # Copies the unrarefied mapping file\n",
    "        shutil.copy2(os.path.join(split_raw_dir, split_fn) \n",
    "                     % map_raw_split_blanks,\n",
    "                     os.path.join(asab_pattern, map_fn) \n",
    "                     % raw_sample_blanks)\n",
    "\n",
    "        # Copies the unrarefied OTU table\n",
    "        shutil.copy2(os.path.join(split_raw_dir, split_fn) \n",
    "                     % otu_raw_split_blanks,\n",
    "                     os.path.join(asab_pattern, otu_fn) \n",
    "                     % raw_sample_blanks)\n",
    "\n",
    "        # Copies the rarefied mapping file\n",
    "        shutil.copy2(os.path.join(split_rare_dir, split_fn) \n",
    "                     % map_rare_split_blanks,\n",
    "                     os.path.join(asab_pattern, map_fn) \n",
    "                     % rare_sample_blanks)\n",
    "\n",
    "        # Copies the rarefied OTU table\n",
    "        shutil.copy2(os.path.join(split_rare_dir, split_fn) \n",
    "                     % otu_rare_split_blanks,\n",
    "                     os.path.join(asab_pattern, otu_fn)\n",
    "                     % rare_sample_blanks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get our distance matrices for each OTU table, we’ll use the QIIME script, [filter_distance_matrix.py](http://qiime.org/scripts/filter_distance_matrix.html). We will use the mapping file for each body site to filter the distance matrices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "if not data_download:\n",
    "    for idx, a_site in enumerate(all_bodysites):\n",
    "        \n",
    "        rare_sample_blanks['site'] = a_site\n",
    "\n",
    "        # Gets the rarefied mapping file for the site\n",
    "        map_in = os.path.join(asab_pattern, map_fn)  % rare_sample_blanks\n",
    "\n",
    "        for fn_ in dm_fn:\n",
    "            dm_in = os.path.join(all_dir, fn_) % all_rare_blanks\n",
    "            dm_out = os.path.join(asab_pattern, fn_) % rare_sample_blanks\n",
    "\n",
    "            if not os.path.exists(dm_out) or overwrite:\n",
    "                !filter_distance_matrix.py -i $dm_in -o $dm_out --sample_id_fp $map_in"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Select a Single Sample for Each Participant"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For some analyses we will choose to perform, it can be useful to work with a single sample for each participant at each body site. Many statistical tests assume sample independence. The microbiome among healthy adults is relatively stable across multiple samples within an individual; there is a higher correlation between your personal samples collected across several days than there is between your sample and another person’s sample collected at the same time [<a href=\"#22699609\">7</a>].\n",
    "\n",
    "We’re going to start defining our single sample data sets by writing a function which will allow us to randomly select a sample from each individual. This will take a pandas data frame as an input. We’ll group the data so we can look at each individual (given by the `HOST_SUBJECT_ID`), and then randomly select one sample id per individual."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def identify_single_samples(map_):\n",
    "    \"\"\"Selects a single sample for each participant\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    map_ : pandas DataFrame\n",
    "        A mapping file for our set of samples. A single body site should be\n",
    "        used with human samples.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    single_ids : ndarray\n",
    "        A list of ids which represent a single sample per individual\n",
    "\n",
    "    \"\"\"\n",
    "    # Identifies a single sample per individual\n",
    "    single_ids = np.hstack([np.random.choice(np.array(ids, dtype=str), 1)\n",
    "                            for indv, ids in\n",
    "                            map_.groupby('HOST_SUBJECT_ID').groups.iteritems()])\n",
    "    return single_ids"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We’ll apply our filtering function at each body site. To do this, we'll define another function which will allow the filtering used functions we can define, like **`identify_single_samples`**.\n",
    "\n",
    "The function we're writing here, **`filter_dataset`**, will first identify the samples to be filtered using the function we pass in. It will use that list of samples to filter the rarefied and unrarefied (raw) mapping files. We will leverage the QIIME scripts, [`biom subset-table`](http://qiime.org/tutorials/working_with_biom_tables.html) and [filter_distance_matrix.py](http://qiime.org/scripts/filter_distance_matrix.html) to filter our OTU table and distance matrices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def filter_dataset(filter_fun, site, dir_in, dir_out, ids_fp):\n",
    "    \"\"\"Filters data set to create a subset\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    filter_fun : function\n",
    "        A function which takes a pandas map and returns a list of sample \n",
    "        ids.\n",
    "    site : str\n",
    "        The body site for which the samples are being generated.\n",
    "    dir_in : str\n",
    "        The directory in which the input analysis files are located. The\n",
    "        directory and files are assumed to exist.\n",
    "    dir_out : str\n",
    "        The directory where the filtered files should be put. The directory\n",
    "        must exist.\n",
    "    ids_fp : str\n",
    "        The filepath where the list of ids in the subset is located.\n",
    "    \n",
    "    Returns\n",
    "    -------\n",
    "    There are no explicit python returns. Rarefied and unrarefied OTU tables,\n",
    "    and their corresponding mapping files (the rarefied file includes alpha\n",
    "    diversity) as well as distance matrices for all distance metrics used\n",
    "    are saved in the `dir_out`.\n",
    "\n",
    "    \"\"\"\n",
    "    rare_sample_blanks['site'] = site\n",
    "    raw_sample_blanks['site'] = site\n",
    "\n",
    "    # Sets up the file names for the original files\n",
    "    rare_map_in_fp = os.path.join(dir_in, map_fn) % rare_sample_blanks\n",
    "    rare_otu_in_fp = os.path.join(dir_in, otu_fn) % rare_sample_blanks\n",
    "\n",
    "    raw_map_in_fp = os.path.join(dir_in, map_fn) % raw_sample_blanks\n",
    "    raw_otu_in_fp = os.path.join(dir_in, otu_fn) % raw_sample_blanks\n",
    "\n",
    "    rare_map_out_fp = os.path.join(dir_out, map_fn) % rare_sample_blanks\n",
    "    rare_otu_out_fp = os.path.join(dir_out, otu_fn) % rare_sample_blanks\n",
    "\n",
    "    raw_map_out_fp = os.path.join(dir_out, map_fn) % raw_sample_blanks\n",
    "    raw_otu_out_fp = os.path.join(dir_out, otu_fn) % raw_sample_blanks\n",
    "\n",
    "    # Checks if the single sample id filepath exists\n",
    "    if not os.path.exists(ids_fp) or not os.path.exists(rare_map_out_fp) or overwrite:\n",
    "        # Reads in the rarefied table\n",
    "        rare_map_in = pd.read_csv(rare_map_in_fp,\n",
    "                                  sep=txt_delim,\n",
    "                                  na_values=map_nas,\n",
    "                                  index_col=False,\n",
    "                                  dtype={map_index: str})\n",
    "        rare_map_in.index = rare_map_in[map_index]\n",
    "        del rare_map_in[map_index]\n",
    "\n",
    "        # Identifies the sample ids\n",
    "        filt_ids = filter_fun(rare_map_in)\n",
    "        \n",
    "        rare_map_out = rare_map_in.loc[filt_ids]\n",
    "\n",
    "        # Saves the single sample filepath\n",
    "        ids_file = file(ids_fp, 'w')\n",
    "        ids_file.write('\\n'.join(list(filt_ids)))\n",
    "        ids_file.close()\n",
    "        \n",
    "        # Saves the rarefied mapping file\n",
    "        rare_map_out.to_csv(rare_map_out_fp,\n",
    "                            sep=txt_delim,\n",
    "                            na_rep=write_na,\n",
    "                            index_label=map_index)\n",
    "    else:\n",
    "        ids_file = open(ids_fp, 'r')\n",
    "        filt_ids = ids_file.read().split('\\n')\n",
    "        ids_file.close()\n",
    "        \n",
    "    if not os.path.exists(raw_map_out_fp) or overwrite:\n",
    "        raw_map_in = pd.read_csv(rare_map_in_fp,\n",
    "                                 sep=txt_delim,\n",
    "                                 na_values=map_nas,\n",
    "                                 index_col=False,\n",
    "                                 dtype={map_index: str})\n",
    "        raw_map_in.index = raw_map_in[map_index]\n",
    "        del raw_map_in[map_index]\n",
    "        raw_map_out = raw_map_in.loc[filt_ids]\n",
    "        raw_map_out.to_csv(raw_map_out_fp,\n",
    "                           sep=txt_delim,\n",
    "                           na_rep=write_na,\n",
    "                           index_label=map_index)\n",
    "\n",
    "    # Filters the OTU table down to single samples\n",
    "    if not os.path.exists(rare_otu_out_fp) or overwrite:\n",
    "        !biom subset-table -i $rare_otu_in_fp -o $rare_otu_out_fp -a sample -s $ids_fp\n",
    "\n",
    "    if not os.path.exists(raw_otu_out_fp) or overwrite:\n",
    "        !biom subset-table -i $raw_otu_in_fp -o $raw_otu_out_fp -a sample -s $ids_fp\n",
    "\n",
    "    # Filters the distance matrices\n",
    "    for dm_ in dm_fn:\n",
    "        dm_in_fp = os.path.join(dir_in, dm_) % rare_sample_blanks\n",
    "        dm_out_fp = os.path.join(dir_out, dm_) % rare_sample_blanks\n",
    "        if not os.path.exists(dm_out_fp) or overwrite:\n",
    "            !filter_distance_matrix.py -i $dm_in_fp -o $dm_out_fp --sample_id_fp $ids_fp    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's apply our two functions to identify a single sample per individual at each site."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "for idx, a_site in enumerate(all_bodysites):\n",
    "    # Skips any site where the healthy subset criteria should not be applied\n",
    "    if a_site not in one_samp_sites:\n",
    "        continue\n",
    "    filter_dataset(filter_fun=identify_single_samples, site=a_site, dir_in=asab_pattern, dir_out=assb_pattern,\n",
    "                   ids_fp=os.path.join(assb_pattern, sin_fn) % {'site':a_site})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Filter the Table to the Healthy Subset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we may wish to have a healthy subset of individuals for certain analyses. The criteria we’ve used to define the healthy subset are described [above](#Identification-of-a-Healthy-Subset-of-Adults).\n",
    "\n",
    "We're going to define a quick function that makes use of `SUBSET`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def identify_subset(map_):\n",
    "    return map_.loc[map_.SUBSET == True].index.values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we’ll use essentially the same pipeline we leveraged for filtering the single samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "for idx, a_site in enumerate(all_bodysites):\n",
    "    # Skips any site where the healthy subset criteria should not be applied\n",
    "    if a_site not in sub_part_sites:\n",
    "        continue\n",
    "    filter_dataset(identify_subset, a_site, asab_pattern, ssab_pattern,\n",
    "                   os.path.join(ssab_pattern, sub_fn) % {'site':a_site})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "for idx, a_site in enumerate(all_bodysites):\n",
    "    # Skips any site where the healthy subset criteria should not be applied\n",
    "    if a_site not in sub_part_sites or a_site not in one_samp_sites:\n",
    "        continue\n",
    "    \n",
    "    filter_dataset(identify_subset, a_site, assb_pattern, sssb_pattern,\n",
    "                   os.path.join(sssb_pattern, sub_fn) % {'site':a_site})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have generated body site specific, rarefied OTU tables, mapping files with alpha diversity and UniFrac distance matrices for our American Gut Data, as well as creating focused datasets. We can choose to create further-filtered tables, or we can take the outputs of this notebook and use it for downstream analysis.\n",
    "\n",
    "<a href=\"#top\">Return to the top</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. <a id=\"20383131\"></a>Caporaso, J.G.; Kuczynski, J.; Strombaugh, J.; Bittinger, K.; Bushman, F.D.; Costello, E.K.; Fierer, N.; Peña, A.G., Goodrich, J.K.; Gordon, J.I.; Huttley, G.A.; Kelley, S.T.; Knights, D.; Koenig, J.E.; Ley, R.E.; Lozupone, C.A.; McDonald, D.; Muegge, B.D.; Pirrung, M.; Reeder, J.; Sevinsky, J.R.; Turnbaugh, P.J.; Walters, W.A.; Widmann, J.; Yatsunenko, T.; Zaneveld, J. and Knight, R. (2010) “[QIIME allows analysis of high-throughput community sequence data](http://www.ncbi.nlm.nih.gov/pubmed/20383131).” *Nature Methods*. **7**: 335 - 336.\n",
    "\n",
    "2. <a id=\"24280061\"></a>V&aacute;zquez-Baeza, Y.; Pirrung, M.; Gonzalez, A.; and Knight, R. (2013). “[EMPeror: a tool for visualizing high-throughput microbial community data](http://www.ncbi.nlm.nih.gov/pubmed/24280061).” *Gigascience*. **2**: 16.\n",
    "\n",
    "3. <a id=\"23975157\"></a>Langille, M.G.; Zaneveld, J.; Caporaso, J.G.; McDonald, D.; Knights, D.; Reyes, J.A.; Clemente, J.C.; Burkepile, D.E.; Vega Thurber, R.L.; Knight, R.; Beiko, R.G.; and Huttenhower, C. (2013). “[Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences](http://www.ncbi.nlm.nih.gov/pubmed/23975157).” *Nat Biotechnol*. **31**: 814-821.\n",
    "\n",
    "4. <a id=\"23587224\"></a>McDonald, D.; Clemente, J.C.; Kuczynski, J.; Rideout, J.R.; Stombaugh, J.; Wendel, D.; Wilke, A.; Huse, S.; Hufnagle, J.; Meyer, F.; Knight, R.; and Caporaso, J.G. (2012). [The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome](http://www.ncbi.nlm.nih.gov/pubmed/23587224).”  *Gigascience*. **1**:7.\n",
    "\n",
    "5. <a id=\"16332807\"></a>Lozupone, C.; and Knight, R. (2005). “[UniFrac: a new phylogenetic method for comparing microbial communities](http://www.ncbi.nlm.nih.gov/pubmed/16332807).” *Appl Enviro Microbiol.* **71**: 8228-8235.\n",
    "\n",
    "6. <a id=\"20827291\"></a>Lozupone, C.; LLadser, M.E.; Knights, D.; Stombaugh, J.; and Knight, R. (2011). “[UniFrac: an effective distance metric for microbial community composition](http://www.ncbi.nlm.nih.gov/pubmed/20827291).” *ISME J* **5**: 169-172.\n",
    "\n",
    "7. <a id=\"22699609\"></a>The Human Microbiome Consortium. (2012) “[Structure, Function and diversity of the healthy human microbiome.](http://www.ncbi.nlm.nih.gov/pubmed/22699609)” *Nature*. **486**: 207-214.\n",
    "\n",
    "8. <a id=\"15831718\"></a>Eckburg, P.B.; Bik, E.M.; Bernstein C.N.; Purdom, E.; Dethlefson, L.; Sargent, M.; Gill, S.R.; Nelson, K.E.; Relman, D.A. (2005) “[Diversity of the human intestinal microbial flora.](http://www.ncbi.nlm.nih.gov/pubmed/15831718)” *Science*. **308**: 1635-1638.\n",
    "\n",
    "9. <a id=\"chao\"></a>Chao, A. (1984) “[Nonparametric estimation of the number of classes in a population](http://viceroy.eeb.uconn.edu/estimateS/EstimateSPages/EstSUsersGuide/References/Chao1984.pdf).” *Scandinavian J  Stats*. **11**: 265-270.\n",
    "\n",
    "10. <a id=\"shannon\"></a>Seaby, R.M.H. and Henderson, P.A. (2006). “Species Diversity and Richness 4.” http://www.pisces-conservation.com/sdrhelp/index.html.\n",
    "\n",
    "11. <a id=\"22134646\"></a>McDonald, D.; Price, N.M.; Goodrich, J.; Nawrocki, E.P.; DeSantis, T.Z.; Probst, A.; Andersen, G.L.; Knight, R. and Hugenholtz, P. (2012). “[An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea.](http://www.ncbi.nlm.nih.gov/pubmed/22134646)” *ISME J*. **6**:610 - 618.\n",
    "\n",
    "12. <a id=\"20668239\"></a>Koenig, J.E.; Spor, A.; Scalfone, N.; Fricker, A.D.; Stombaugh, J.; Knight, R.; Angenent, L.T.; and Ley, R.E. (2011). “[Succession of microbial consortia in the developing infant gut microbiome](http://www.ncbi.nlm.nih.gov/pubmed/20668239).” *PNAS*. **108 Suppl 1**: 4578 - 4585.\n",
    "\n",
    "13. <a id=\"22699611\"></a>Yatsunenko, T.; Rey, F.E.; Manary, M.J.; Trehan, I.; Dominguez-Bello, M.G.; Contreras, M.; Magris, M.; Hidalgo, G.; Baldassano, R.N.; Anokhin, A.P.; Heath, A.C.; Warner, B.; Rdder, J.; Kuczynski, J.; Caporaso, J.G.; Lozupone, C.A.; Lauber, C.; Clemente, J.C.; Knights, D.; Knight, R. and Gordon, J.I. (2012) “[Human Gut microbiome viewed across age and geography](http://www.ncbi.nlm.nih.gov/pubmed/22699611).” *Nature*. **486**: 222-227.\n",
    "\n",
    "14. <a id=\"20571116\"></a>Claesson, M.J.; Cusacks, S.; O’Sullivan, O.; Greene-Diniz, R.; de Weerd, H.; Flannery, E.; Marchesi, J.R.; Falush, D.; Dinan, T.; Fitzgerald, G.; Stanton, C.; van Sinderen, D.; O’Connor, M.; Harnedy, N.; O’Connor, K.; Henry, C.; O’Mahony, D.; Fitzgerald, A.P.; Shananhan, F.; Twomey, C.; Hill, C.; Ross, R.P.; and O’Toole, P.W. (2011). “[Composition, variability and temporal stability of the intestinal microbiota of the elderly](http://www.ncbi.nlm.nih.gov/pubmed/20571116).” *PNAS*. **108 Suppl 1**: 4586 - 4591.\n",
    "\n",
    "15. <a id=\"22797518\"></a>Claesson, M.J.; Jeffery, I.B.; Conde, S.; Power, S.E.; O’Connor, E.M.; Cusack, S.; Harris, H.M.; Coakley, M.; Lakshminarayanan, B.; O’Sullivan, O.; Fitzgerald, G.F; Deane, J.; O’Connor, M.; Harnedy, N.; O’Connor, K.; O’Mahony, D.; van Sinderen, D.; Wallace, M.; Brennan, L.; Stanton, C.; Marchesi, J.R.; Fitzgerald, A.P.; Shanahan, F.; Hill, C.; Ross, R.P.; and O’Toole, P.W. (2012). “[Gut microbiota composition correlates with diet and health in the elderly](http://www.ncbi.nlm.nih.gov/pubmed/22797518).” *Nature*. **488**: 178-184.\n",
    "\n",
    "16. <a id=\"25307765\"></a>Walters, W.A.; Zu, Z.; and Knight, R. (2014) “[Meta-analysis of human gut microbes associated with obesity and IBD](http://www.ncbi.nlm.nih.gov/pubmed/25307765).” *FEBS Letters*. **588**: 4223-4233.\n",
    "\n",
    "17. <a id=\"20847294\"></a> Dethlefsen, L. and Relman, D.A. (2011) “[Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation](http://www.ncbi.nlm.nih.gov/pubmed/20847294).” *PNAS*. **108 Suppl 1**: 4554-4561.\n",
    "\n",
    "18. <a id=\"23274889\"></a> de Goffau, M.C.; Luopajärvi, K.; Knip, M.; Ilonen, J.; Ruohtula, T.; Härkönen, T.; Orivuori, L.; Hakala, S.; Welling, G.W.; Harmensen, H.J.; and Vaarala, O. (2013). “[Fecal Microbiota composition differs between children with B-cell autoimmunity and those without](http://www.ncbi.nlm.nih.gov/pubmed/23274889).” *Diabetes*. **62**: 1238-1244.\n",
    "\n",
    "19. <a id=\"20613793\"></a> Giongo, A.; Gano, K.A.; Crabb, D.B.; Mukherjee, N.; Novelo, L.L.; Casella, G.; Drew, J.C.; Ilonen, J.; Knip, M.; Hyöty, H; Veijola, R.; Simell, T.; Simell, O.; Neu, J.; Wasserfall, C.H.; Schatz, D.; Atkinson, M.A.; and Triplett, E.W. (2011). “[Toward defining the autoimmune microbiome for type 1 diabetes](http://www.ncbi.nlm.nih.gov/pubmed/20613793).” *ISME J*. **5**: 82-91.\n",
    "\n",
    "20. <a id=\"24448554\"></a> Mejía-León, M.E.; Petrosino, J.F.; Ajami, N.J.; Domínguez-Bello, M.G.; and de la Barca, A.M. (2014). “[Fecal microbiota imbalance in Mexican children with type 1 diabetes](http://www.ncbi.nlm.nih.gov/pubmed/24448554).” *Science Reports*. **4**: 3814.\n",
    "\n",
    "21. <a id=\"23433344\"></a> Murrim M.; Leiva, I.; Gomez-Zumaquero, J.M.; Tinahones, F.J.; Cardona, F.; Soriguer, F.; and Queipo-Ortuño, M.I. (2013). “[Gut microbiota in children with type 1 diabetes differs from that in healthy children: a case-control study](http://www.ncbi.nlm.nih.gov/pubmed/23433344).” *BMC Med*. **11**:46.\n",
    "\n",
    "22. <a id=\"20140211\"></a>Larsen, N.; Vogensen, F.K.; van den Berg, F.W.; Nielsen, D.S.; Andreasen, A.S.; Pedersen, B.K.; Al-Soud, W.A.; Sørensen, S.J.; Hansen, H.L. and Jakobsen, M. (2010). “[Gut Microbiota in human adults with type 2 diabetes differs from non diabetic adults](http://www.ncbi.nlm.nih.gov/pubmed/20140211).” *PLoS One*. **5**: e9085.\n",
    "\n",
    "23. <a id=\"24316595\"></a>Ahn, J.; Sinha, R.; Pei, Z.; Cominanni, C.; Wu, J.; Shi, J.; Goedert, J.J.; Hayes, R.B.; and Yang, L. (2013). \"[Human gut microbiome and risk for colorectal cancer](http://www.ncbi.nlm.nih.gov/pubmed/24316595).\" *J Natl Cancer Inst.* **105**: 1907-1911.\n",
    "\n",
    "24. <a id=\"yellowstone\"></a>National Park Service. (2015). “[Mammal Checklist](http://www.nps.gov/yell/learn/nature/mammalscheck.htm).” *Yellowstone National Park*.\n",
    "\n",
    "25. <a id=\"park\"></a> Milieris, V. (2011) “[Biodiversity in Central Park](http://macaulay.cuny.edu/eportfolios/themanhattanproject/does-central-park-work/biodiversity-in-central-park-virginia-milieris/)”. *Exploring Central Park*. CUNY.\n",
    "\n",
    "<a href=\"#top\">Return to the top</a>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}