\n", " | Metadata | \n", "OTU Table | \n", "Distance Matrix | \n", "
---|---|---|---|
\n", " File Type\n", " | \n", "\n",
" tab-delimted text file (ends in .txt)\n", " | \n",
" \n",
" Biom-format file\n",
" [4] (ends in .biom)\n", " | \n",
" \n",
" tab-delimted text file (ends in .txt)\n", " | \n",
"
\n",
" Using the Biom-format python package \n", " Using the \n", " R BIOM package\n", " | \n",
" \n",
" |||
\n",
" download_data (boolean)\n", " | \n",
" \n",
" This will download a directory of precomputed data tables, when True . download_data will supersede overwrite . (That is, if download_data is True , overwrite must be False .)\n",
" | \n",
"
\n",
" overwrite (boolean)\n", " | \n",
" \n",
" When | \n",
"
\n",
" overwrite (boolean)\n", " | \n",
" \n",
" When overwrite is True , new files will be generated and saved during data processing. It is recommended that overwrite be set to False , in which case new files will only be generated when the file does not exist. This substantially decreases analysis time.\n",
" | \n",
"
\n",
" txt_delim (string)\n", " | \n",
" \n",
" txt_delim specifies the way columns are separated in the files. QIIME typically consumes and produces tab-delimited (\"\\t\" ) text files (.txt) for metadata and results generation.\n",
" | \n",
"
\n",
" map_index (string)\n", " | \n",
" \n",
" The name of the column containing the sample names. In QIIME, this column is called #SampleID .\n",
" | \n",
"
\n",
" map_nas (list of strings)\n", " | \n",
" \n",
" It is possible a mapping file may be missing values, since American Gut participants are free to skip any question. The pandas package is able to omit these missing samples from analysis. In raw American Gut files, missing values are typically denoted as “NA” , “no_data” , “unknown” , and empty spaces (“” ).\n",
" | \n",
"
\n",
" write_na (string)\n", " | \n",
" \n",
" The value to denote missing values when text files are written from Pandas data frames. Using an empty space, (“” ) will allow certain QIIME scripts, like group_significance.py, to ignore the missing values.\n",
" | \n",
"
\n",
" rarefaction_depth (int)\n", " | \n",
" \n",
" The rarefaction_depth specifies the number of sequence per samples to be used for analysis. A depth of 10,000 sequences/sample was selected because it balances a better picture of diversity with retaining samples. \n",
" | \n",
"
\n",
" num_rarefactions (int)\n", " | \n",
" \n", " The number of times we draw new rarefaction tables. This controls for bias due to single rarefaction instances. We selected 10 rarefactions to achieve a balance between computational efficiency and appropriate depth.\n", " | \n", "
\n",
" split_field (string)\n", " | \n",
" \n", " The metadata category which contains the body site information we will use to split the OTU table and mapping file.\n", " | \n", "
\n",
" split_prefix (string)\n", " | \n",
" \n", " Under the standards used to format the American Gut metadata, a constant prefix is used to denote body site. This is used for string formatting and file naming.\n", " | \n", "
\n",
" alpha_metrics (string)\n", " | \n",
" \n", " There are multiple alpha diversity metrics which can be used.\n", " We will calculate four alpha diversity metrics here: PD Whole Tree \n", " [8], Observed Species, Chao1 \n", " [9], and Shannon [10] \n", " diversity. Among these metrics, PD whole tree diversity is unique in\n", " that it considers the evolutionary relationship between OTUs in a \n", " sample by calculating the branch length on the phylogenetic tree \n", " covered by a sample. A list of available metrics can be found on the \n", " scikit-bio website. Metric names should be connected \n", " with a comma (and no spaces).\n", " | \n", "
\n",
" beta_metrics (string)\n", " | \n",
" \n", " As with alpha diversity, there are multiple ways we can \n", " calculate our beta diversity metrics. Here, we use weighted and \n", " unweighted UniFrac distances [5]. UniFrac \n", " distance determines the amount of the phylogenetic tree which does \n", " not overlap between two samples. Weighted UniFrac takes into account \n", " the abundance of taxa within a sample, while unweighted UniFrac \n", " distance only considers the presence or absence of a particular \n", " community member. Additional options are available in the documentation for \n", " beta_diversity.py.\n", " | \n", "
\n",
" habitat_bodysite (list of strings)\n", " | \n",
" \n",
" A list of the body site names used by our American Gut metadata standards. Some of these are inconvenient for file naming, and so we will rename some of these fields using the corresponding all_bodysites name. For example, the standard name for a mouth sample is to label it as an “oral cavity” , however, spaces in file paths make life difficult, so this is mapped to “oral” in our all_bodysites list.\n",
" | \n",
"
\n",
" all_bodysites (list of strings)\n", " | \n",
" \n",
" A list of all the possible body sites which will be used to generate the datasets here. The order of body sites must correspond to the order of body sites in habitat_bodysites .\n",
" | \n",
"
\n",
" sub_part_sites (set)\n", " | \n",
" \n", " We may also want to generate data sets which limits our sample set to \n", "exclude samples which are already known to significantly affect the microbiome. The subset currently being selected focuses mainly on fecal samples.\n", " | \n", "
\n",
" one_samp_sites (set)\n", " | \n",
" \n", " For some types of analysis, there is an assumption that samples are \n", " independent, which in this context includes the requirement that \n", " there are not multiple samples per individual. To limit analysis to \n", " a single sample from each individual, we can select body site where \n", " we want to filter for a single sample per individual. We recommend \n", " doing this for all body sites.\n", " | \n", "
\n",
" base_dir (string)\n", " | \n",
" \n",
" The file path for the directory where any files associated with the \n",
" analysis should be saved. It is suggested this be a directory called \n",
" \"agp_analysis\" , located \n",
" in the same directory as the IPython notebooks.\n",
" | \n",
"
\n",
" tree_fp (string)\n", " | \n",
" \n", " This gives the location of the correct tree file inside your Greengenes 13_8 \n", " directory. The default tree from your QIIME installation is called, here.\n", " | \n", "
\n",
" working_dir (string)\n", " | \n",
" \n", " The file path for a directory where intermediate files (i.e. \n", " rarefaction instances) generated during the run of this notebook can \n", " be stored. It is recommended this be located within the base \n", " analysis directory.\n", " | \n", "
\n",
" download_dir (string)\n", " | \n",
" \n",
" The file path where downloaded files should be saved. The \n",
" download_dir may be located within the \n",
" working_dir , it’s also likely the \n",
" download_dir may be located outside \n",
" the base_dir .\n",
" | \n",
"
\n",
" download_otu_fp (string)\n", " | \n",
" \n", " The uncompressed OTU table from GitHub is located at this file path. \n", " This should be a .biom file, with no compression.\n", " | \n", "
\n",
" download_map_fp (string)\n", " | \n",
" \n", " The location of the American Gut mapping file, downloaded from \n", " GitHub.\n", " | \n", "
\n",
" rare_dir (string)\n", " | \n",
" \n",
" The file path to the directory where we should save all of our \n",
" rarefaction files. This should be located in the \n",
" working_dir .\n",
" | \n",
"
\n",
" rare_pattern (string)\n", " | \n",
" \n",
" This describes the way rarefied OTU tables will be saved in our \n",
" output directories. The \n",
" “%i” \n",
" will allow us to substitute any integers. In this case, we will \n",
" specify the rarefaction depth, and the rarefaction instance for each \n",
" table. \n",
" | \n",
"
\n",
" alpha_dir (string)\n", " | \n",
" \n",
" The file path to the directory where we should save all of our alpha \n",
" diversity files. This should be located in the \n",
" working_dir .\n",
" | \n",
"
\n",
" alpha_pattern (string)\n", " | \n",
" \n",
" This describes the way alpha diversity files will be saved in our \n",
"output directories. The “%i” will allow us to substitute any integer. In this case, we will specify the rarefaction depth, and the rarefaction instance for each table. The rarefaction instances are numbered sequentially, from 0 to the number of rarefactions.\n",
" | \n",
"
\n",
" split_raw_dir (string)\n", " | \n",
" \n",
" The file path where we should put the unrarefied OTU table \n",
" after splitting by body site. This should be located in the \n",
" working_dir .\n",
" | \n",
"
\n",
" split_rare_dir (string)\n", " | \n",
" \n",
" The file path where we should put the rarefied OTU table after splitting by body site. This should be located in the \n",
" working_dir .\n",
" | \n",
"
\n",
" split_fn (string)\n", " | \n",
" \n", " The files generated by OTU splitting will follow this naming \n", " convention. The blanks, rare_suffix, \n", " split_field, \n", " split_prefix, \n", " split_group and extension are used to specify the level of \n", " rarefaction, the field used for splitting the data, the group in \n", " that split, and the type of file generated. Here, we expect the \n", " split_group to be a body site.\n", " | \n", "
\n",
" data_dir (string)\n", " | \n",
" \n",
" The file path for a directory where the results of this notebook \n",
" (OTU tables, mapping files, and distance matrices) should be saved. \n",
" This should be a directory in \n",
" base_dir .\n",
" | \n",
"
\n",
" all_dir; (string)\n", " | \n",
" \n",
" The all data directory will contain the full American Gut results. \n", " | \n",
"
\n",
" bodysite directories (string)\n", " | \n",
" \n",
" The bodysite directories can be identified by adjusting the all_bodysite variable.\n", " | \n",
"
“%s”
prefix will allow us to insert any file path for the output directory. We expect the final file paths used with these directories to be located in the data_dir
.\n",
" \n",
"\n",
" asab_pattern (string)\n", " | \n",
" \n", " A file pattern for the directory where we’ll save the data from all \n", " samples from all participants. \n", " | \n", "
\n",
" assb_pattern (string)\n", " | \n",
" \n", " A file pattern for the directory where data from a single sample for each individual at each body site is stored. Note that it's possible to have multiple samples for the same individual, as long as the individual contributed samples at multiple body sites. However, the two samples from the same individual will be represented in different tables.\n", " | \n", "
\n",
" ssab_pattern (string)\n", " | \n",
" \n", " A file pattern for the directory where we’ll save the data from all \n", " samples from a subset of participants.\n", " | \n", "
\n",
" sssb_pattern (string)\n", " | \n",
" \n", " A file pattern for the directory where we’ll save the data from a \n", " single samples from each participant in the healthy subset.\n", " | \n", "
\n",
" otu_fn (string)\n", " | \n",
" \n", " A pattern for the filename for output OTU table files.\n", " | \n", "
\n",
" map_fn (string)\n", " | \n",
" \n", " A pattern for the filename for output mapping files.\n", " | \n", "
\n",
" dm_fn (string)\n", " | \n",
" \n",
" A pattern for the file name used by the distance matrix files generated by the beta_metrics \n",
" | \n",
"
\n",
" map_extension (string)\n", " | \n",
" \n", " The file extension for mapping files. \n", " Mapping files are typically tab-delimited text files.\n", " | \n", "
\n",
" otu_extension (string)\n", " | \n",
" \n", " The file extension for OTU table files. \n", " OTU tables are typically Biom-formatted files.\n", " | \n", "
\n",
" rare_suffix (string)\n", " | \n",
" \n",
" This is added to file names to denote that rarefaction has occurred. \n",
" Typically, this should be \n",
" “even” with the \n",
" rarefaction depth.\n",
" | \n",
"
\n",
" raw_suffix (string)\n", " | \n",
" \n", " This is added to files in which rarefaction has not been performed. \n", " Usually, this will be an empty string. This is required to maintain \n", " appropriate string formatting.\n", " | \n", "
\n",
" site_pad; all_ (string)\n", " | \n",
" \n", " These are spacers used to keep name formats clean and correct.\n", " | \n", "
split_group
, and we'll use nan
as a placeholder. \n",
"\n",
"After the tables are split, we generate body site-specific information as a list.\n",
"\n",
"\n",
"\t\t\tlast_rare (dict)\n", "\t\t | \n",
"\t\t\n", "\t\t\tFills in the blanks for a rarefaction table or alpha diversity file. \n", " This is used to check if all rarefaction of alpha diversity files \n", " have been generated for an analysis.\n", "\t\t | \n", "\t
\n",
"\t\t\tall_raw_blanks (dict)\n", "\t\t | \n",
"\t\t\n", " Fills in the blanks for the unrarefied, all-sample files.\n", "\t\t | \n", "\t
\n",
"\t\t\tall_rare_blanks (dict)\n", "\t\t | \n",
"\t\t\n", " Fills in the blanks for the rarefied, all-sample files.\n", "\t\t | \n", "\t
\n",
"\t\t\totu_raw_split_blanks (dict)\n", "\t\t | \n",
"\t\t\n", " Fills in the blanks for the unrarefied split file patterns to \n", " identify the split OTU table. \n", "\t\t | \n", "\t
\n",
"\t\t\tmap_raw_split_blanks (dict)\n", "\t\t | \n",
"\t\t\n",
" Fills in the blanks for the unrarefied split file patterns to \n",
" identify the split metadata file. The nan value for \n",
" split_group is a place \n",
" holder.\n",
"\t\t | \n",
"\t
\n",
"\t\t\totu_rare_split_blanks (dict)\n", "\t\t | \n",
"\t\t\n", " Fills in the blanks for the rarefied split file patterns to identify \n", " the split OTU table.\n", "\t\t | \n", "\t
\n",
"\t\t\tmap_rare_split_blanks (dict)\n", "\t\t | \n",
"\t\t\n", " Fills in the blanks for the rarefied split file patterns to identify \n", " the split metadata file.\n", "\t\t | \n", "\t
\n",
"\t\t\traw_sample_blanks (dict)\n", "\t\t | \n",
"\t\t\n", " Fills in the blanks for the rarefied files at each body site.\n", "\t\t | \n", "\t
\n",
"\t\t\trare_sample_blanks (dict)\n", "\t\t | \n",
"\t\t\n", " Fills in the blanks for the rarefied files at each body site.\n", "\t\t | \n", "\t
download_dir
, they will be downloaded to this location. If the files exist, new versions will be downloaded only if [**`overwrite`**](#File-Saving-Parameters) is set to True
. \n",
"\n",
"*Note that this step requires an internet connection.*"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"if data_download:\n",
" !curl -OL ftp://ftp.microbio.me/pub/American-Gut-precomputed/r1-15/sample_data.tgz\n",
" !tar -xzf sample_data.tgz\n",
" shutil.move('./sample_data', base_dir)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Gets the biom file\n",
"if not data_download and (not os.path.exists(download_otu_fp) or overwrite):\n",
" # Downloads the compressed biom file\n",
" !curl -OL https://github.com/biocore/American-Gut/blob/master/data/AG/AG_100nt.biom?raw=true\n",
" # Moves the biom file to its final location\n",
" shutil.move(os.path.join(os.path.abspath('.'), 'AG_100nt.biom?raw=true'), download_otu_fp)\n",
"\n",
"# Gets the mapping file\n",
"if not data_download and (not os.path.exists(download_map_fp) or overwrite):\n",
" # Downloads the mapping files\n",
" !curl -OL https://github.com/biocore/American-Gut/blob/master/data/AG/AG_100nt.txt?raw=true\n",
" # Moves the file to the download file path\n",
" shutil.move(os.path.join('.', 'AG_100nt.txt?raw=true'), download_map_fp)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return to the top"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Mapping File Clean up"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will start by adjusting the metadata. This will correct errors and provide a uniform format for derived columns we may wish to use later or in downstream analyses."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Loads the mapping file\n",
"raw_map = pd.read_csv(download_map_fp,\n",
" sep=txt_delim, \n",
" na_values=map_nas,\n",
" index_col=False,\n",
" dtype={map_index: str},\n",
" low_memory=False)\n",
"raw_map.index = raw_map[map_index]\n",
"del raw_map[map_index]\n",
"\n",
"# Loads the OTU table\n",
"raw_otu = biom.load_table(download_otu_fp)\n",
"\n",
"# Filters the raw map to remove any samples that are not present in the biom table\n",
"raw_map = raw_map.loc[raw_otu.ids()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return to the top"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are also a set of columns which are not included in the map, but may be useful for downstream analyses. These include age binned by decade (`AGE_CAT`). While there are QIIME analyses which can handle continuous metadata, binning can help reduce some of the noise.\n",
"Here, we bin age by decade, with the exception of people under the age of 20. The gut develops in the first two years of life, and the guts of young children are significantly different than older children or adults [12, 13]. We will also combine individuals over the age of 70 into their own category, due to the low sample counts of people over 80 as of round 14 (*n* < 20)."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Bins age by decade (with the exception of young children)\n",
"def categorize_age(x):\n",
" if np.isnan(x):\n",
" return x\n",
" elif x < 3:\n",
" return \"baby\"\n",
" elif x < 13:\n",
" return \"child\"\n",
" elif x < 20:\n",
" return \"teen\"\n",
" elif x < 30:\n",
" return \"20s\"\n",
" elif x < 40:\n",
" return \"30s\"\n",
" elif x < 50:\n",
" return \"40s\"\n",
" elif x < 60:\n",
" return \"50s\"\n",
" elif x < 70:\n",
" return \"60s\"\n",
" else:\n",
" return \"70+\"\n",
"raw_map['AGE_CAT'] = raw_map.AGE.apply(categorize_age)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return to the top"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Alcohol Consumption"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to considering the frequency with which people use alcohol (Never, Rarely, Occasionally, Regularly, or Daily), it may be helpful to simply look for an effect associated with any alcohol consumption."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def categorize_etoh(x):\n",
" if x == 'Never':\n",
" return \"No\"\n",
" elif isinstance(x, str):\n",
" return \"Yes\"\n",
" elif np.isnan(x):\n",
" return x\n",
" \n",
"raw_map['ALCOHOL_CONSUMPTION'] = raw_map.ALCOHOL_FREQUENCY.apply(categorize_etoh)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return to the top"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Body Mass Index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Body Mass Index (BMI) can be stratified into [categories](http://en.wikipedia.org/wiki/Body_mass_index#Categories) which give an approximate idea of body shape. It is worth noting that these stratifications do not hold well for growing children, where the BMI qualification is based on age and gender."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Categorizes the BMI into groups\n",
"def categorize_bmi(x):\n",
" if np.isnan(x):\n",
" return x\n",
" elif x < 18.5:\n",
" return \"Underweight\"\n",
" elif x < 25:\n",
" return \"Normal\"\n",
" elif x < 30:\n",
" return \"Overweight\"\n",
" else:\n",
" return \"Obese\"\n",
"raw_map['BMI_CAT'] = raw_map.BMI.apply(categorize_bmi)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return to the top"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Collection Season"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"American Gut samples have been collected since December of 2012. To look for patterns associated with the time of year samples were collected, we bin this date information into month, and season.\n",
"\n",
"We currently define our seasons according to the calendar in the Northern Hemisphere, because as of round fifteen, 99% of our samples were collected north of the equator. Additionally, rather than defining our seasons by the solar calendar, we have elected to use the first day of the month the solstice or equinox occurs in as the start of our season. So, while Winter technically begins on December 20th or 21st, according to the solar calendar, we consider December 1st as the first day of our Winter.\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def convert_date(x):\n",
" \"\"\"Converts strings to a date object\"\"\"\n",
" if isinstance(x, str) and \"/\" in x:\n",
" try:\n",
" return pd.tseries.tools.to_datetime(x)\n",
" except:\n",
" return np.nan\n",
" else:\n",
" return x\n",
"\n",
"raw_map.COLLECTION_DATE = raw_map.COLLECTION_DATE.apply(convert_date)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Categorizes data by collection month and collection season\n",
"month_map = {-1: [np.nan, np.nan],\n",
" np.nan: [np.nan, np.nan],\n",
" 1: ['January', 'Winter'],\n",
" 2: ['February', 'Winter'],\n",
" 3: ['March', 'Spring'],\n",
" 4: ['April', 'Spring'],\n",
" 5: ['May', 'Spring'],\n",
" 6: ['June', 'Summer'],\n",
" 7: ['July', 'Summer'],\n",
" 8: ['August', 'Summer'],\n",
" 9: ['September', 'Fall'],\n",
" 10: ['October', 'Fall'],\n",
" 11: ['November', 'Fall'],\n",
" 12: ['December', 'Winter']}\n",
"\n",
"def map_month(x):\n",
" try:\n",
" return month_map[x.month][0]\n",
" except:\n",
" return np.nan\n",
"\n",
"def map_season(x):\n",
" try:\n",
" return month_map[x.month][1]\n",
" except:\n",
" return np.nan\n",
"\n",
"# Maps the data as a month\n",
"raw_map['COLLECTION_MONTH'] = \\\n",
" raw_map.COLLECTION_DATE.apply(map_month)\n",
"\n",
"# Maps the data as a season\n",
"raw_map['COLLECTION_SEASON'] = \\\n",
" raw_map.COLLECTION_DATE.apply(map_season)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return to the top"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Collection Location"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The American Gut Project includes some geographical information about where samples were collected. While the data may be leveraged as-is, it can also be helpful to clean up the data. \n",
"\n",
"We'll start by checking for uniform country mapping. This will allow us to combine samples from countries or groups of countries with multiple descriptive names, such as the Great Britain and the United Kingdom."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def map_countries(x):\n",
" return geo.country_map.get(x, x)\n",
"\n",
"raw_map.COUNTRY = raw_map.COUNTRY.apply(map_countries)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In Rounds 1-15 participants come predominantly from the US, UK, Belgium and Canada. Since the area occupied by Belgium and the UK are smaller than the size of many states in the contiguous US (including some of the most represented states in the American Gut), we have elected to only consider the **STATE** field for American and Canadian samples.\n",
"\n",
"This does not alter the information provided by the zip (postal) code or Longitude/Latitude information."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Removes state information for any state not in the US\n",
"# (This may change as additional countries are added.)\n",
"countries = raw_map.groupby('COUNTRY').count().STATE.index.values\n",
"for country in countries:\n",
" if country not in {'GAZ:United States of America', 'GAZ:Canada'}:\n",
" raw_map.loc[raw_map.COUNTRY == country, 'STATE'] = np.nan\n",
"\n",
"# Handles regional mapping, cleaning up states so that only American and\n",
"# Canadian states are included \n",
"def check_state(x):\n",
" if isinstance(x, str) and x in geo.us_state_map:\n",
" return geo.us_state_map[x.upper()]\n",
" elif isinstance(x, str) and x in geo.canadian_map_english:\n",
" return geo.canadian_map_english[x.upper()]\n",
" else:\n",
" return np.nan\n",
"raw_map['STATE'] = raw_map.STATE.apply(check_state)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We may also choose to use predefined regions to further bin our location data, and allow us to look for social or economic trends. To this end, we can apply regions defined by the [US Census Bureau](https://www.census.gov/geo/reference/gtc/gtc_census_divreg.html) and Economic Regions defined by the [US Bureau of Economic Analysis](http://www.bea.gov), which is part of the Department of Commerce."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Bins data by census region\n",
"def census_f(x):\n",
" if isinstance(x, str) and x in geo.regions_by_state:\n",
" return geo.regions_by_state[x]['Census_1']\n",
" else:\n",
" return np.nan\n",
"raw_map['CENSUS_REGION'] = raw_map.STATE.apply(census_f)\n",
"\n",
"\n",
"# Bins data by economic region\n",
"def economic_f(x):\n",
" if isinstance(x, str) and x in geo.regions_by_state:\n",
" return geo.regions_by_state[x]['Economic']\n",
" else:\n",
" return np.nan\n",
"raw_map['ECONOMIC_REGION'] = raw_map.STATE.apply(economic_f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return to the top\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sleep Duration"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As of round 15, there are 36 participants who report sleeping less than five hours a night. To all for a larger sample size, we will pool these with the individuals who report sleeping between five and six hours a night, to create a group who report sleeping less than six hours a night."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"raw_map.loc[raw_map.SLEEP_DURATION == 'Less than 5 hours', 'SLEEP_DURATION'] = 'Less than 6 hours'\n",
"raw_map.loc[raw_map.SLEEP_DURATION == '5-6 hours', 'SLEEP_DURATION'] = 'Less than 6 hours'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return to the top"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Identification of a Healthy Subset of Adults"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Certain health states are known to influence the microbiome in extreme ways. For some analyses we will do later, it may be useful to limit the noise associated with these conditions to allow us to look for new patterns. We have identified five metadata categories which we will use to limit our “healthy” subset.\n",
"\n",
"First, we limit based on age for several reasons. We chose to omit anyone under the age of twenty. The microbiome of very young children is not yet stable, and differs greatly from that of adults [12, 13]. Additionally, BMI limits are not easily assigned in people who are still growing. Without stratifying by gender, we assumed that growth will be complete in most people by the age of 20, and set our limit there. The limit at seventy was based on the number of individuals over that age, and on differences in the microbiome seen in older individuals [14, 15].\n",
"\n",
"We also used Body Mass Index as an exclusion criteria, considering only people in the “normal” and “overweight” categories. (BMI 18.5 - 30). It has been suggested that obesity changes the gut microbiome, although the effect is not consistent across all studies [16]. Additionally, we noticed that there were also alterations in our sample of underweight individuals.\n",
"\n",
"Recent antibiotic decreases alpha diversity and affects the microbiome [17]. We chose to define “recent” as any time within the last year. We also excluded anyone who reported having Inflammatory Bowel Disease [16], Type I Diabetes [18-21], or Type II Diabetes [22], since all three conditions are known to affect the microbiome."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Creates the subset if its not already in the mapping file\n",
"if 'SUBSET' not in raw_map.columns:\n",
" subset_f = {'AGE': lambda x: 19 < x < 70 and not np.isnan(x),\n",
" 'DIABETES': lambda x: x == 'I do not have diabetes',\n",
" 'IBD': lambda x: x == 'I do not have IBD',\n",
" 'ANTIBIOTIC_SELECT': lambda x: x == 'Not in the last year',\n",
" 'BMI': lambda x: 18.5 <= x < 30 and not np.isnan(x)}\n",
"\n",
" # Determines which samples meet the requirements of the categories\n",
" new_bin = {}\n",
" for cat, f in subset_f.iteritems():\n",
" new_bin[cat] = raw_map[cat].apply(f)\n",
"\n",
" # Builds up the new binary dataframe\n",
" bin_frame = pd.DataFrame(new_bin)\n",
"\n",
" # Adds a column to the current dataframe to look at the subset\n",
" bin_series = pd.DataFrame(new_bin).all(1)\n",
" bin_series.name = 'SUBSET'\n",
"\n",
" raw_map = raw_map.join(bin_series)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return to the top"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Whole Table Rarefaction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will start by rarefying the whole body table using the rarefaction parameters we set earlier. Rarefaction is a technique which filters out samples below a certain sequencing depth. Sequences are picked from a weighted average from the remaining samples to that all samples have an even depth. To control for bias which might occur with a single, random subsampling of the data, we use multiple rounds of rarefaction to more accurately estimate the alpha diversity.\n",
"\n",
"Rarefaction is important to make intra sample diversity (alpha diversity) comparisons possible. Below is a panel from Figure 1 of Human Gut Microbiome and Risk of Colorectal Cancer[23]. The figure compares Shannon Diversity between individuals with colorectal cancer (*n*=47, red circles) and healthy controls (*n*=94, empty triangles) over several rarefaction depths, or sequence counts per sample.\n",
"\n",
"![Cancer Rarefaction curve](images/ahn2013jncicolorectalf1.jpg?raw=true)\n",
"\n",
"The figure also illustrates the importance of even sampling depth. If a control sample with 500 sequences per sample were compared with a cancer sample at a depth of 2500 sequences per sample, the cancer sample would appear more diverse. Comparisons at the same depth reveal the true pattern in the data: cancer samples are less diverse than controls.\n",
"\n",
"To perform multiple rarefactions, we will use the QIIME script, [multiple_rarefactions_even_depth.py](http://qiime.org/scripts/multiple_rarefactions_even_depth.html)."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"if not data_download and (not os.path.exists(os.path.join(rare_dir, rare_pattern) %last_rare) or overwrite):\n",
" !multiple_rarefactions_even_depth.py -i $download_otu_fp -o $rare_dir -n $num_rarefactions -d $rarefaction_depth --lineages_included"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return to the top"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Whole Table Alpha Diversity"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use our rarefaction tables to calculate the alpha diversity associated with each rarefaction. Alpha diversity is a measure of intra sample diversity. Imagine that we could put up an imaginary wall around a 100ft x 100ft x 10 ft box in Yellowstone National Park, trapping all the vertebrate animals in that area for a short period of time. Imagine that we then made a very careful list (or took photographs) of the area so that we could document all the life we found in the area. We could count all the different types of animals we found in that area. This would be one measure of alpha diversity. \n",
"\n",
"Instead of just considering each type of animal to be equally similar, we wanted to include an evolutionary relationship between the animals. So, if our area contained a mouse, a squirrel and a rabbit, we might say these animals are more similar (and therefore less diverse) than if we found a mouse, a squirrel, and a sagebrush lizard in the same area. So, even though we’ve found three species in each case, the third species being a reptile would make it more diverse than the third species being a rodent. \n",
"\n",
"A diversity metric which accounts for shared evolutionary history between species is called a phylogenetic metric. This often uses a phylogenetic tree to provide information about that shared history. PD Whole Tree Diversity is a commonly used phylogenetic alpha diversity metric in microbiome research [8]. A taxonomic metric assumes all species are equally different. Common taxonomic metrics for alpha diversity used in microbiome research include Observed Species Diversity and Chao1 Diversity [9, 10].\n",
"\n",
"Depending on what information we’re looking for, we might want to include information about the number of each animal belonging to the species we see. We might also want to consider the number of each different species we find in the area, weighting our diversity. So, if in our little area of Yellowstone, 90% of the animals we see are mice, while 5% are rabbits and 5% are trout, we would consider this less diverse than if 40% of the animals were mice, 30% were rabbits and 30% were trout. A metric which takes into account the counts of each species is a quantitative metric, while a qualitative metric looks only at the presence or absence of a species.\n",
"\n",
"While alpha diversity is calculated completely independently for each sample, the comparison of alpha diversity may provide clues about environmental changes. For example, pollution or an algal bloom may be associated with lower alpha diversity, and indicate a potential change in the health of the ecosystem.\n",
"We’ll start our work with alpha diversity by calculating the diversity for our rarefied American Gut tables using the four metrics we selected in the alpha diversity parameters: the phylogenetic PD whole Tree Diversity, and the taxonomic metrics, Observed Species Diversity, Chao1 Diversity and Shannon Diversity. All the diversity metrics we are using here are qualitative metrics."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"if not data_download and (not os.path.exists(os.path.join(alpha_dir, alpha_pattern) % last_rare) or overwrite):\n",
" !alpha_diversity.py -i $rare_dir -o $alpha_dir -m $alpha_metrics -t $tree_fp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The alpha diversity results from QIIME are loaded into the notebook. To identify the best rarefaction instance, which we'll use as our OTU table moving forward, we try to identify the rarefaction instance which has alpha diversity closest to the mean alpha diversity represented in the table. We define \"closest\" using the normalized Euclidian distance."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"if not data_download:\n",
" # Preallocates an output object for alpha diversity\n",
" alpha_rounds = {m: {} for m in alpha_metrics.split(',')}\n",
" div_metric = alpha_metrics.split(',')[0]\n",
"\n",
" # Loops through the rarefaction instances\n",
" for ri in range(num_rarefactions):\n",
" a_file_blanks = {'rare_depth': rarefaction_depth,\n",
" 'rare_instance': ri}\n",
"\n",
" # Sets the alpha diversity file path\n",
" alpha_fp = os.path.join(alpha_dir, alpha_pattern) % a_file_blanks\n",
"\n",
" # Loads the alpha diversity table\n",
" alpha = pd.read_csv(alpha_fp,\n",
" sep=txt_delim,\n",
" index_col=False)\n",
" alpha.index = alpha['Unnamed: 0']\n",
" del alpha['Unnamed: 0']\n",
"\n",
" # Extracts the alpha diversity metrics\n",
" for col in alpha_rounds:\n",
" alpha_rounds[col]['%i' %ri] = alpha[col]\n",
" alpha_rounds[col]['%i' %ri].name = '%i' % ri"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"if not data_download:\n",
" # Compiles the alpha diversity results into a single table\n",
" alpha_df = pd.DataFrame({'%s_mean' % metric: pd.DataFrame(alpha_rounds[metric]).mean(1)\n",
" for metric in alpha_metrics.split(',')})\n",
"\n",
" # Adds the alpha diversity results to the rarefied table\n",
" rare_map = raw_map.copy()\n",
" rare_map = rare_map.join(alpha_df)\n",
" rare_check = np.isnan(rare_map['%s_mean' % div_metric]) == False\n",
" rare_map = rare_map.loc[rare_check]\n",
"\n",
" # Draws the data associated with each of the alpha diversity rounds\n",
" all_rounds = pd.DataFrame(alpha_rounds[div_metric])\n",
"\n",
" # Lines up the data so the indices match (as a precaution)\n",
" all_rounds = all_rounds.sort_index()\n",
" alpha_df = alpha_df.sort_index()\n",
"\n",
" # Calculates the distance between each round and the mean\n",
" mean_rounds = ([alpha_df['%s_mean' % div_metric].values] * \n",
" np.ones((num_rarefactions, 1))).transpose()\n",
" diff = np.sqrt(np.square(all_rounds.values - np.square(mean_rounds))) / mean_rounds\n",
"\n",
" # Determines the minimum distance between the round and the mean\n",
" round_labels = np.arange(0, 10)\n",
" round_avg = diff.mean(0)\n",
" best_rarefaction = round_labels[round_avg == min(round_avg)][0]\n",
" best_blanks = {'rare_depth': rarefaction_depth,\n",
" 'rare_instance': best_rarefaction}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We’ll save the whole body tables in their own directory, and the modified mapping files. We’ll also copy the raw OTU table and the rarefaction instance closest to the mean alpha diversity."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Saves the unrarefied mapping file\n",
"if not data_download:\n",
" raw_map.to_csv(os.path.join(all_dir, map_fn) % all_raw_blanks,\n",
" sep=txt_delim,\n",
" na_rep=write_na,\n",
" index_label=map_index)\n",
"\n",
" # Saves the rarefied mapping file\n",
" rare_map.to_csv(os.path.join(all_dir, map_fn) % all_rare_blanks,\n",
" sep=txt_delim,\n",
" na_rep=write_na,\n",
" index_label=map_index)\n",
"\n",
" # Copies the raw OTU table\n",
" shutil.copy2(download_otu_fp, \n",
" os.path.join(all_dir, otu_fn) % all_raw_blanks)\n",
"\n",
" # Copies the rarefied OTU table\n",
" shutil.copy2(os.path.join(rare_dir, rare_pattern) % best_blanks,\n",
" os.path.join(all_dir, otu_fn) % all_rare_blanks)\n",
"\n",
"raw_map = pd.read_csv(os.path.join(all_dir, map_fn) % all_raw_blanks,\n",
" sep=txt_delim,\n",
" na_values=map_nas,\n",
" index_col=False,\n",
" low_memory=False,\n",
" dtype={map_index: str})\n",
"raw_map.index = raw_map[map_index]\n",
"del raw_map[map_index]\n",
"\n",
"rare_map = pd.read_csv(os.path.join(all_dir, map_fn) % all_rare_blanks,\n",
" sep=txt_delim,\n",
" na_values=map_nas,\n",
" index_col=False,\n",
" low_memory=False,\n",
" dtype={map_index: str})\n",
"rare_map.index = rare_map[map_index]\n",
"del rare_map[map_index]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return to the top"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Whole Table Beta Diversity"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beta Diversity allows us to make comparisons between samples and environments. Let’s go back to our 100ft x 100ft x 10ft cube in Yellowstone where we catalogued all the vertebrates. Let’s imagine that we’ve set up the same type of cube in New York City’s Central Park and cataloged all the vertebrates in that area as well.\n",
"\n",
"We could compare the two communities by seeing how many species are shared between the two, or, by making some measure that approximates the species. We might expect some overlap: depending on where we selected our regions, it would be unsurprising to encounter Chipmunks in both Central Park and Yellowstone National Park. However, there should also be some differences. Unless our Central Park location includes the zoo, it’s unlikely we’d find a Buffalo in New York City!\n",
"\n",
"If we use a taxonomic metric, based only on the species we find in the two locations, we might get very little overlap. While we might expect to find a squirrel in both Central Park and Yellowstone, the animals might be members of different genera! New York is home to the [Eastern Grey Squirrel](http://en.wikipedia.org/wiki/Eastern_gray_squirrel), *Sciurus carolinensis*, while we might find the [American Red Squirrel](http://en.wikipedia.org/wiki/American_red_squirrel), *Tamiasciurus hudsonicus*, in Yellowstone. [24, 25]. In this case, a phylogenetic metric, which can account for some similarity between the two species of squirrels, may serve us much better.\n",
"\n",
"When we compare microbial communities for beta diversity, we frequently select a phylogenetic metric called UniFrac distance [5]. This metric uses a phylogenetic tree, and determines what fraction of the tree is not shared between two communities. \n",
"![UniFrac distance trees](http://unifrac.colorado.edu/static/images/fastunifrac/unifrac_significance/unifrac_test.jpg)\n",
"\n",
"If we consider only the presence and absence of each OTU in the samples, we have a qualitative metric, unweighted UniFrac distance. Unweighted UniFrac distance may take on values between 0 (everything the same) and 1 (everything different). Weighted UniFrac distance takes into account the abundance of the OTUs, and can take on values greater than 1.\n",
"\n",
"The UniFrac distance for each pairwise sample is arranged into a Distance Matrix. We can visualize the distance matrix by many techniques, like making PCoA plots in Emperor, or UPGMA trees like the one shown in the figure below [2].\n",
"\n",
"![Unifrac to distance matrix](http://unifrac.colorado.edu/static/images/fastunifrac/cluster_samples/unifrac_clustering.jpg)\n",
"\n",
"Since UniFrac distance is calculated for each sample pair in the table, this is one of the most computationally expensive steps we will perform. However, once the UniFrac distance has been calculated for all of our samples, we can simply filter the table to focus on the samples we want. We can leverage the QIIME script, [beta_diveristy.py](http://qiime.org/scripts/beta_diversity.html) to perform our analysis."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sets up the filepaths for the all sample rarified table\n",
"all_otu_rare_fp = os.path.join(all_dir, otu_fn) % all_rare_blanks\n",
"\n",
"check_dm_fp = np.array([os.path.exists(os.path.join(all_dir, fn_) \n",
" % all_rare_blanks) for fn_ in dm_fn])\n",
"\n",
"# Calculates the beta diversity\n",
"if not data_download and (not check_dm_fp.all() or overwrite):\n",
" !beta_diversity.py -i $all_otu_rare_fp -m $beta_metrics -t $tree_fp -o $all_dir"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return to the top"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Body Site Split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we’ve generated alpha and beta diversity results for all the body sites, we can start filtering the results. Body site has one of the largest impacts on the microbiome in adult humans [7]. As a result, many analyses will focus on a single body site, often fecal samples.\n",
"\n",
"We’ll use the QIIME script, [split_otu_table.py](http://qiime.org/scripts/split_otu_table.html) to split our rarefied and unrarefied OTU tables by body site. We’ll put the output files in intermediate directories, and then move them to the appropriate locations."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sets the raw location file names\n",
"all_raw_otu_fp = os.path.join(all_dir, otu_fn) % all_raw_blanks\n",
"all_raw_map_fp = os.path.join(all_dir, map_fn) % all_raw_blanks\n",
"\n",
"# Sets the rarefied location file names\n",
"all_rare_otu_fp = os.path.join(all_dir, otu_fn) % all_rare_blanks\n",
"all_rare_map_fp = os.path.join(all_dir, map_fn) % all_rare_blanks\n"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Checks that the raw and rarified bodysite split tables exist\n",
"raw_split_check = np.array([])\n",
"rare_split_check = np.array([])\n",
"\n",
"for site in all_bodysites:\n",
" # Checks the unrarefied splits exists\n",
" otu_raw_split_blanks['split_group'] = site\n",
" map_raw_split_blanks['split_group'] = site\n",
" raw_otu_exist = os.path.exists(os.path.join(split_raw_dir, split_fn) \n",
" % otu_raw_split_blanks)\n",
" raw_map_exist = os.path.exists(os.path.join(split_raw_dir, split_fn) \n",
" % map_raw_split_blanks)\n",
" raw_split_check = np.hstack((raw_split_check, raw_otu_exist, raw_map_exist))\n",
" \n",
" # Checks the rarefied splits exist\n",
" otu_rare_split_blanks['split_group'] = site\n",
" map_rare_split_blanks['split_group'] = site\n",
" rare_otu_exist = os.path.exists(os.path.join(split_rare_dir, split_fn) \n",
" % otu_rare_split_blanks)\n",
" rare_map_exist = os.path.exists(os.path.join(split_rare_dir, split_fn) \n",
" % map_rare_split_blanks)\n",
" rare_split_check = np.hstack((rare_split_check, rare_otu_exist, rare_map_exist))\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Splits the otu table and mapping file by bodysite\n",
"if not data_download and (not raw_split_check.any() or overwrite):\n",
" !split_otu_table.py -i $all_raw_otu_fp -m $all_raw_map_fp -f BODY_HABITAT -o $split_raw_dir \n",
"\n",
"# Splits the otu table and mapping file by bodysite\n",
"if not data_download and (not rare_split_check.any() or overwrite):\n",
" !split_otu_table.py -i $all_rare_otu_fp -m $all_rare_map_fp -f BODY_HABITAT -o $split_rare_dir "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We’ll move and rename our split files to their final location."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Copies the files to their correct final folder\n",
"if not data_download:\n",
" for idx, h_site in enumerate(habitat_sites):\n",
" otu_raw_split_blanks['split_group'] = h_site\n",
" map_raw_split_blanks['split_group'] = h_site\n",
" otu_rare_split_blanks['split_group'] = h_site\n",
" map_rare_split_blanks['split_group'] = h_site\n",
" raw_sample_blanks['site'] = all_bodysites[idx]\n",
" rare_sample_blanks['site'] = all_bodysites[idx]\n",
"\n",
" # Copies the unrarefied mapping file\n",
" shutil.copy2(os.path.join(split_raw_dir, split_fn) \n",
" % map_raw_split_blanks,\n",
" os.path.join(asab_pattern, map_fn) \n",
" % raw_sample_blanks)\n",
"\n",
" # Copies the unrarefied OTU table\n",
" shutil.copy2(os.path.join(split_raw_dir, split_fn) \n",
" % otu_raw_split_blanks,\n",
" os.path.join(asab_pattern, otu_fn) \n",
" % raw_sample_blanks)\n",
"\n",
" # Copies the rarefied mapping file\n",
" shutil.copy2(os.path.join(split_rare_dir, split_fn) \n",
" % map_rare_split_blanks,\n",
" os.path.join(asab_pattern, map_fn) \n",
" % rare_sample_blanks)\n",
"\n",
" # Copies the rarefied OTU table\n",
" shutil.copy2(os.path.join(split_rare_dir, split_fn) \n",
" % otu_rare_split_blanks,\n",
" os.path.join(asab_pattern, otu_fn)\n",
" % rare_sample_blanks)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get our distance matrices for each OTU table, we’ll use the QIIME script, [filter_distance_matrix.py](http://qiime.org/scripts/filter_distance_matrix.html). We will use the mapping file for each body site to filter the distance matrices."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"if not data_download:\n",
" for idx, a_site in enumerate(all_bodysites):\n",
" \n",
" rare_sample_blanks['site'] = a_site\n",
"\n",
" # Gets the rarefied mapping file for the site\n",
" map_in = os.path.join(asab_pattern, map_fn) % rare_sample_blanks\n",
"\n",
" for fn_ in dm_fn:\n",
" dm_in = os.path.join(all_dir, fn_) % all_rare_blanks\n",
" dm_out = os.path.join(asab_pattern, fn_) % rare_sample_blanks\n",
"\n",
" if not os.path.exists(dm_out) or overwrite:\n",
" !filter_distance_matrix.py -i $dm_in -o $dm_out --sample_id_fp $map_in"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return to the top"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Select a Single Sample for Each Participant"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For some analyses we will choose to perform, it can be useful to work with a single sample for each participant at each body site. Many statistical tests assume sample independence. The microbiome among healthy adults is relatively stable across multiple samples within an individual; there is a higher correlation between your personal samples collected across several days than there is between your sample and another person’s sample collected at the same time [7].\n",
"\n",
"We’re going to start defining our single sample data sets by writing a function which will allow us to randomly select a sample from each individual. This will take a pandas data frame as an input. We’ll group the data so we can look at each individual (given by the `HOST_SUBJECT_ID`), and then randomly select one sample id per individual."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def identify_single_samples(map_):\n",
" \"\"\"Selects a single sample for each participant\n",
"\n",
" Parameters\n",
" ----------\n",
" map_ : pandas DataFrame\n",
" A mapping file for our set of samples. A single body site should be\n",
" used with human samples.\n",
"\n",
" Returns\n",
" -------\n",
" single_ids : ndarray\n",
" A list of ids which represent a single sample per individual\n",
"\n",
" \"\"\"\n",
" # Identifies a single sample per individual\n",
" single_ids = np.hstack([np.random.choice(np.array(ids, dtype=str), 1)\n",
" for indv, ids in\n",
" map_.groupby('HOST_SUBJECT_ID').groups.iteritems()])\n",
" return single_ids"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We’ll apply our filtering function at each body site. To do this, we'll define another function which will allow the filtering used functions we can define, like **`identify_single_samples`**.\n",
"\n",
"The function we're writing here, **`filter_dataset`**, will first identify the samples to be filtered using the function we pass in. It will use that list of samples to filter the rarefied and unrarefied (raw) mapping files. We will leverage the QIIME scripts, [`biom subset-table`](http://qiime.org/tutorials/working_with_biom_tables.html) and [filter_distance_matrix.py](http://qiime.org/scripts/filter_distance_matrix.html) to filter our OTU table and distance matrices."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def filter_dataset(filter_fun, site, dir_in, dir_out, ids_fp):\n",
" \"\"\"Filters data set to create a subset\n",
" \n",
" Parameters\n",
" ----------\n",
" filter_fun : function\n",
" A function which takes a pandas map and returns a list of sample \n",
" ids.\n",
" site : str\n",
" The body site for which the samples are being generated.\n",
" dir_in : str\n",
" The directory in which the input analysis files are located. The\n",
" directory and files are assumed to exist.\n",
" dir_out : str\n",
" The directory where the filtered files should be put. The directory\n",
" must exist.\n",
" ids_fp : str\n",
" The filepath where the list of ids in the subset is located.\n",
" \n",
" Returns\n",
" -------\n",
" There are no explicit python returns. Rarefied and unrarefied OTU tables,\n",
" and their corresponding mapping files (the rarefied file includes alpha\n",
" diversity) as well as distance matrices for all distance metrics used\n",
" are saved in the `dir_out`.\n",
"\n",
" \"\"\"\n",
" rare_sample_blanks['site'] = site\n",
" raw_sample_blanks['site'] = site\n",
"\n",
" # Sets up the file names for the original files\n",
" rare_map_in_fp = os.path.join(dir_in, map_fn) % rare_sample_blanks\n",
" rare_otu_in_fp = os.path.join(dir_in, otu_fn) % rare_sample_blanks\n",
"\n",
" raw_map_in_fp = os.path.join(dir_in, map_fn) % raw_sample_blanks\n",
" raw_otu_in_fp = os.path.join(dir_in, otu_fn) % raw_sample_blanks\n",
"\n",
" rare_map_out_fp = os.path.join(dir_out, map_fn) % rare_sample_blanks\n",
" rare_otu_out_fp = os.path.join(dir_out, otu_fn) % rare_sample_blanks\n",
"\n",
" raw_map_out_fp = os.path.join(dir_out, map_fn) % raw_sample_blanks\n",
" raw_otu_out_fp = os.path.join(dir_out, otu_fn) % raw_sample_blanks\n",
"\n",
" # Checks if the single sample id filepath exists\n",
" if not os.path.exists(ids_fp) or not os.path.exists(rare_map_out_fp) or overwrite:\n",
" # Reads in the rarefied table\n",
" rare_map_in = pd.read_csv(rare_map_in_fp,\n",
" sep=txt_delim,\n",
" na_values=map_nas,\n",
" index_col=False,\n",
" dtype={map_index: str})\n",
" rare_map_in.index = rare_map_in[map_index]\n",
" del rare_map_in[map_index]\n",
"\n",
" # Identifies the sample ids\n",
" filt_ids = filter_fun(rare_map_in)\n",
" \n",
" rare_map_out = rare_map_in.loc[filt_ids]\n",
"\n",
" # Saves the single sample filepath\n",
" ids_file = file(ids_fp, 'w')\n",
" ids_file.write('\\n'.join(list(filt_ids)))\n",
" ids_file.close()\n",
" \n",
" # Saves the rarefied mapping file\n",
" rare_map_out.to_csv(rare_map_out_fp,\n",
" sep=txt_delim,\n",
" na_rep=write_na,\n",
" index_label=map_index)\n",
" else:\n",
" ids_file = open(ids_fp, 'r')\n",
" filt_ids = ids_file.read().split('\\n')\n",
" ids_file.close()\n",
" \n",
" if not os.path.exists(raw_map_out_fp) or overwrite:\n",
" raw_map_in = pd.read_csv(rare_map_in_fp,\n",
" sep=txt_delim,\n",
" na_values=map_nas,\n",
" index_col=False,\n",
" dtype={map_index: str})\n",
" raw_map_in.index = raw_map_in[map_index]\n",
" del raw_map_in[map_index]\n",
" raw_map_out = raw_map_in.loc[filt_ids]\n",
" raw_map_out.to_csv(raw_map_out_fp,\n",
" sep=txt_delim,\n",
" na_rep=write_na,\n",
" index_label=map_index)\n",
"\n",
" # Filters the OTU table down to single samples\n",
" if not os.path.exists(rare_otu_out_fp) or overwrite:\n",
" !biom subset-table -i $rare_otu_in_fp -o $rare_otu_out_fp -a sample -s $ids_fp\n",
"\n",
" if not os.path.exists(raw_otu_out_fp) or overwrite:\n",
" !biom subset-table -i $raw_otu_in_fp -o $raw_otu_out_fp -a sample -s $ids_fp\n",
"\n",
" # Filters the distance matrices\n",
" for dm_ in dm_fn:\n",
" dm_in_fp = os.path.join(dir_in, dm_) % rare_sample_blanks\n",
" dm_out_fp = os.path.join(dir_out, dm_) % rare_sample_blanks\n",
" if not os.path.exists(dm_out_fp) or overwrite:\n",
" !filter_distance_matrix.py -i $dm_in_fp -o $dm_out_fp --sample_id_fp $ids_fp "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's apply our two functions to identify a single sample per individual at each site."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for idx, a_site in enumerate(all_bodysites):\n",
" # Skips any site where the healthy subset criteria should not be applied\n",
" if a_site not in one_samp_sites:\n",
" continue\n",
" filter_dataset(filter_fun=identify_single_samples, site=a_site, dir_in=asab_pattern, dir_out=assb_pattern,\n",
" ids_fp=os.path.join(assb_pattern, sin_fn) % {'site':a_site})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return to the top"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Filter the Table to the Healthy Subset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we may wish to have a healthy subset of individuals for certain analyses. The criteria we’ve used to define the healthy subset are described [above](#Identification-of-a-Healthy-Subset-of-Adults).\n",
"\n",
"We're going to define a quick function that makes use of `SUBSET`."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def identify_subset(map_):\n",
" return map_.loc[map_.SUBSET == True].index.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we’ll use essentially the same pipeline we leveraged for filtering the single samples."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for idx, a_site in enumerate(all_bodysites):\n",
" # Skips any site where the healthy subset criteria should not be applied\n",
" if a_site not in sub_part_sites:\n",
" continue\n",
" filter_dataset(identify_subset, a_site, asab_pattern, ssab_pattern,\n",
" os.path.join(ssab_pattern, sub_fn) % {'site':a_site})"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"for idx, a_site in enumerate(all_bodysites):\n",
" # Skips any site where the healthy subset criteria should not be applied\n",
" if a_site not in sub_part_sites or a_site not in one_samp_sites:\n",
" continue\n",
" \n",
" filter_dataset(identify_subset, a_site, assb_pattern, sssb_pattern,\n",
" os.path.join(sssb_pattern, sub_fn) % {'site':a_site})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have generated body site specific, rarefied OTU tables, mapping files with alpha diversity and UniFrac distance matrices for our American Gut Data, as well as creating focused datasets. We can choose to create further-filtered tables, or we can take the outputs of this notebook and use it for downstream analysis.\n",
"\n",
"Return to the top"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Caporaso, J.G.; Kuczynski, J.; Strombaugh, J.; Bittinger, K.; Bushman, F.D.; Costello, E.K.; Fierer, N.; Peña, A.G., Goodrich, J.K.; Gordon, J.I.; Huttley, G.A.; Kelley, S.T.; Knights, D.; Koenig, J.E.; Ley, R.E.; Lozupone, C.A.; McDonald, D.; Muegge, B.D.; Pirrung, M.; Reeder, J.; Sevinsky, J.R.; Turnbaugh, P.J.; Walters, W.A.; Widmann, J.; Yatsunenko, T.; Zaneveld, J. and Knight, R. (2010) “[QIIME allows analysis of high-throughput community sequence data](http://www.ncbi.nlm.nih.gov/pubmed/20383131).” *Nature Methods*. **7**: 335 - 336.\n",
"\n",
"2. Vázquez-Baeza, Y.; Pirrung, M.; Gonzalez, A.; and Knight, R. (2013). “[EMPeror: a tool for visualizing high-throughput microbial community data](http://www.ncbi.nlm.nih.gov/pubmed/24280061).” *Gigascience*. **2**: 16.\n",
"\n",
"3. Langille, M.G.; Zaneveld, J.; Caporaso, J.G.; McDonald, D.; Knights, D.; Reyes, J.A.; Clemente, J.C.; Burkepile, D.E.; Vega Thurber, R.L.; Knight, R.; Beiko, R.G.; and Huttenhower, C. (2013). “[Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences](http://www.ncbi.nlm.nih.gov/pubmed/23975157).” *Nat Biotechnol*. **31**: 814-821.\n",
"\n",
"4. McDonald, D.; Clemente, J.C.; Kuczynski, J.; Rideout, J.R.; Stombaugh, J.; Wendel, D.; Wilke, A.; Huse, S.; Hufnagle, J.; Meyer, F.; Knight, R.; and Caporaso, J.G. (2012). [The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome](http://www.ncbi.nlm.nih.gov/pubmed/23587224).” *Gigascience*. **1**:7.\n",
"\n",
"5. Lozupone, C.; and Knight, R. (2005). “[UniFrac: a new phylogenetic method for comparing microbial communities](http://www.ncbi.nlm.nih.gov/pubmed/16332807).” *Appl Enviro Microbiol.* **71**: 8228-8235.\n",
"\n",
"6. Lozupone, C.; LLadser, M.E.; Knights, D.; Stombaugh, J.; and Knight, R. (2011). “[UniFrac: an effective distance metric for microbial community composition](http://www.ncbi.nlm.nih.gov/pubmed/20827291).” *ISME J* **5**: 169-172.\n",
"\n",
"7. The Human Microbiome Consortium. (2012) “[Structure, Function and diversity of the healthy human microbiome.](http://www.ncbi.nlm.nih.gov/pubmed/22699609)” *Nature*. **486**: 207-214.\n",
"\n",
"8. Eckburg, P.B.; Bik, E.M.; Bernstein C.N.; Purdom, E.; Dethlefson, L.; Sargent, M.; Gill, S.R.; Nelson, K.E.; Relman, D.A. (2005) “[Diversity of the human intestinal microbial flora.](http://www.ncbi.nlm.nih.gov/pubmed/15831718)” *Science*. **308**: 1635-1638.\n",
"\n",
"9. Chao, A. (1984) “[Nonparametric estimation of the number of classes in a population](http://viceroy.eeb.uconn.edu/estimateS/EstimateSPages/EstSUsersGuide/References/Chao1984.pdf).” *Scandinavian J Stats*. **11**: 265-270.\n",
"\n",
"10. Seaby, R.M.H. and Henderson, P.A. (2006). “Species Diversity and Richness 4.” http://www.pisces-conservation.com/sdrhelp/index.html.\n",
"\n",
"11. McDonald, D.; Price, N.M.; Goodrich, J.; Nawrocki, E.P.; DeSantis, T.Z.; Probst, A.; Andersen, G.L.; Knight, R. and Hugenholtz, P. (2012). “[An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea.](http://www.ncbi.nlm.nih.gov/pubmed/22134646)” *ISME J*. **6**:610 - 618.\n",
"\n",
"12. Koenig, J.E.; Spor, A.; Scalfone, N.; Fricker, A.D.; Stombaugh, J.; Knight, R.; Angenent, L.T.; and Ley, R.E. (2011). “[Succession of microbial consortia in the developing infant gut microbiome](http://www.ncbi.nlm.nih.gov/pubmed/20668239).” *PNAS*. **108 Suppl 1**: 4578 - 4585.\n",
"\n",
"13. Yatsunenko, T.; Rey, F.E.; Manary, M.J.; Trehan, I.; Dominguez-Bello, M.G.; Contreras, M.; Magris, M.; Hidalgo, G.; Baldassano, R.N.; Anokhin, A.P.; Heath, A.C.; Warner, B.; Rdder, J.; Kuczynski, J.; Caporaso, J.G.; Lozupone, C.A.; Lauber, C.; Clemente, J.C.; Knights, D.; Knight, R. and Gordon, J.I. (2012) “[Human Gut microbiome viewed across age and geography](http://www.ncbi.nlm.nih.gov/pubmed/22699611).” *Nature*. **486**: 222-227.\n",
"\n",
"14. Claesson, M.J.; Cusacks, S.; O’Sullivan, O.; Greene-Diniz, R.; de Weerd, H.; Flannery, E.; Marchesi, J.R.; Falush, D.; Dinan, T.; Fitzgerald, G.; Stanton, C.; van Sinderen, D.; O’Connor, M.; Harnedy, N.; O’Connor, K.; Henry, C.; O’Mahony, D.; Fitzgerald, A.P.; Shananhan, F.; Twomey, C.; Hill, C.; Ross, R.P.; and O’Toole, P.W. (2011). “[Composition, variability and temporal stability of the intestinal microbiota of the elderly](http://www.ncbi.nlm.nih.gov/pubmed/20571116).” *PNAS*. **108 Suppl 1**: 4586 - 4591.\n",
"\n",
"15. Claesson, M.J.; Jeffery, I.B.; Conde, S.; Power, S.E.; O’Connor, E.M.; Cusack, S.; Harris, H.M.; Coakley, M.; Lakshminarayanan, B.; O’Sullivan, O.; Fitzgerald, G.F; Deane, J.; O’Connor, M.; Harnedy, N.; O’Connor, K.; O’Mahony, D.; van Sinderen, D.; Wallace, M.; Brennan, L.; Stanton, C.; Marchesi, J.R.; Fitzgerald, A.P.; Shanahan, F.; Hill, C.; Ross, R.P.; and O’Toole, P.W. (2012). “[Gut microbiota composition correlates with diet and health in the elderly](http://www.ncbi.nlm.nih.gov/pubmed/22797518).” *Nature*. **488**: 178-184.\n",
"\n",
"16. Walters, W.A.; Zu, Z.; and Knight, R. (2014) “[Meta-analysis of human gut microbes associated with obesity and IBD](http://www.ncbi.nlm.nih.gov/pubmed/25307765).” *FEBS Letters*. **588**: 4223-4233.\n",
"\n",
"17. Dethlefsen, L. and Relman, D.A. (2011) “[Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation](http://www.ncbi.nlm.nih.gov/pubmed/20847294).” *PNAS*. **108 Suppl 1**: 4554-4561.\n",
"\n",
"18. de Goffau, M.C.; Luopajärvi, K.; Knip, M.; Ilonen, J.; Ruohtula, T.; Härkönen, T.; Orivuori, L.; Hakala, S.; Welling, G.W.; Harmensen, H.J.; and Vaarala, O. (2013). “[Fecal Microbiota composition differs between children with B-cell autoimmunity and those without](http://www.ncbi.nlm.nih.gov/pubmed/23274889).” *Diabetes*. **62**: 1238-1244.\n",
"\n",
"19. Giongo, A.; Gano, K.A.; Crabb, D.B.; Mukherjee, N.; Novelo, L.L.; Casella, G.; Drew, J.C.; Ilonen, J.; Knip, M.; Hyöty, H; Veijola, R.; Simell, T.; Simell, O.; Neu, J.; Wasserfall, C.H.; Schatz, D.; Atkinson, M.A.; and Triplett, E.W. (2011). “[Toward defining the autoimmune microbiome for type 1 diabetes](http://www.ncbi.nlm.nih.gov/pubmed/20613793).” *ISME J*. **5**: 82-91.\n",
"\n",
"20. Mejía-León, M.E.; Petrosino, J.F.; Ajami, N.J.; Domínguez-Bello, M.G.; and de la Barca, A.M. (2014). “[Fecal microbiota imbalance in Mexican children with type 1 diabetes](http://www.ncbi.nlm.nih.gov/pubmed/24448554).” *Science Reports*. **4**: 3814.\n",
"\n",
"21. Murrim M.; Leiva, I.; Gomez-Zumaquero, J.M.; Tinahones, F.J.; Cardona, F.; Soriguer, F.; and Queipo-Ortuño, M.I. (2013). “[Gut microbiota in children with type 1 diabetes differs from that in healthy children: a case-control study](http://www.ncbi.nlm.nih.gov/pubmed/23433344).” *BMC Med*. **11**:46.\n",
"\n",
"22. Larsen, N.; Vogensen, F.K.; van den Berg, F.W.; Nielsen, D.S.; Andreasen, A.S.; Pedersen, B.K.; Al-Soud, W.A.; Sørensen, S.J.; Hansen, H.L. and Jakobsen, M. (2010). “[Gut Microbiota in human adults with type 2 diabetes differs from non diabetic adults](http://www.ncbi.nlm.nih.gov/pubmed/20140211).” *PLoS One*. **5**: e9085.\n",
"\n",
"23. Ahn, J.; Sinha, R.; Pei, Z.; Cominanni, C.; Wu, J.; Shi, J.; Goedert, J.J.; Hayes, R.B.; and Yang, L. (2013). \"[Human gut microbiome and risk for colorectal cancer](http://www.ncbi.nlm.nih.gov/pubmed/24316595).\" *J Natl Cancer Inst.* **105**: 1907-1911.\n",
"\n",
"24. National Park Service. (2015). “[Mammal Checklist](http://www.nps.gov/yell/learn/nature/mammalscheck.htm).” *Yellowstone National Park*.\n",
"\n",
"25. Milieris, V. (2011) “[Biodiversity in Central Park](http://macaulay.cuny.edu/eportfolios/themanhattanproject/does-central-park-work/biodiversity-in-central-park-virginia-milieris/)”. *Exploring Central Park*. CUNY.\n",
"\n",
"Return to the top"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}