{ "metadata": { "name": "", "signature": "sha256:7e681d76e71038b4e63f5a70906c5a9879d09fc4807949394088a60704f18a59" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "The purpose of this notebook is to perform a complete analysis of the microbial community data for samples in the American Gut dataset. The notebook will produce as output one summary PDF for each fecal sample in American Gut. The notebook integrates the microbiome data from the American Gut Project, the [Human Microbiome Project](http://www.ncbi.nlm.nih.gov/pubmed/22699609), the [Global Gut Project](http://www.ncbi.nlm.nih.gov/pubmed/22699611), and the unpublished [Personal Genome Project](http://personalgenomes.org/). More information about the American Gut Project can be found [here](http://www.ncbi.nlm.nih.gov/pubmed/22699611).\n", "\n", "Essentially, this notebook provides an open-source recipe for all the results processing, so that independent scientists (as well as scientists involved in the project) can check every step, reproduce the results themselves, and make suggestions for improvements. If you have the appropriate computing infrastructure you can also run everything yourself; in future, we plan to make a version available on EC2 so anyone can use it without setting up their own cluster, but some additional software development will be required and we wanted to get the current version in the hands of the community sooner.\n", "\n", "The American Gut Project is part of the [Earth Microbiome Project](http://www.earthmicrobiome.org/) and follows the same protocols and data release standards put forth by the EMP. The project is run primarily through the [Knight Lab](http://www.earthmicrobiome.org/) in the [BioFrontiers Institute](http://biofrontiers.colorado.edu/) at the [University of Colorado at Boulder](http://colorado.edu).\n", "\n", "Over [50 people](https://github.com/biocore/American-Gut/blob/master/CREDITS.md) have volunteered their time to support this project. These volunteers have played roles from sending thousands of samples to thousands of participants all over the world, to developing complex web and database infrastructures, to developing new analysis techniques and procedures, to responding to the questions from our participants. 
This project would not be possible without the unified effort of so many dedicated people.\n", "\n", "This Notebook assumes the following resources are available:\n", "\n", "* a PBS/Torque-based compute cluster in which to submit jobs\n", "* [QIIME](http://qiime.org)\n", "* A custom [Emperor](http://www.gigasciencejournal.com/content/2/1/16/) [branch](https://github.com/eldeveloper/emperor/tree/faces_plus_lines) is available in the path\n", "* TexLive 2013\n", "* [scikit-bio](https://github.com/biocore/scikit-bio)\n", "* The American Gut [repository](https://github.com/biocore/American-Gut/)\n", "* [IPython](http://ipython.org)\n", "* [brewer2mpl](https://pypi.python.org/pypi/brewer2mpl/1.4)\n", "* [qiime-charts](https://github.com/ebolyen/qiime-charts)\n", "\n", "Specifically, for each sample, we will produce the following figures:\n", "\n", "* Principal coordinates analysis (PCoA) plot showing where each sample lies in the context of the American Gut Project, the [Human Microbiome Project](http://www.ncbi.nlm.nih.gov/pubmed/22699609), the [Global Gut](http://www.ncbi.nlm.nih.gov/pubmed/22699611) and the [Personal Genome Project](http://personalgenomes.org/) microbiome datasets.\n", "* Principal coordinates plot depicting the sample in the context of the American Gut Project and Global Gut Project, highlighting the relationship between age and the microbiome.\n", "* Principal coordinates plot comparing the sample to the full American Gut Project data set.\n", "* A chart summarizing at the phylum level the taxonomic classifications of the constituents of the sample\u2019s microbial community and comparing it to the average communities of several American Gut Project subsets (e.g. groups of people with similar age or similar diet) and to writer Michael Pollan's pre- and post-antibiotics samples. [Michael Pollan](http://michaelpollan.com/) was one of the first participants in the project and has given us permission to use his data as a reference.\n", "* A table detailing specific operational taxonomic units (OTUs) in the sample that differ significantly from the rest of the American Gut population in their relative abundance.\n", "\n", "Below are the major steps this notebook will take:\n", "\n", "1. Download the American Gut data and metadata from the European Bioinformatics Institute ([EBI](http://www.ebi.ac.uk/)), which is part of International Nucleotide Sequence Database Collaboration ([INSDC](http://www.insdc.org/)).\n", "2. Filter gammaproteobacterial OTUs (using multiple rounds of OTU picking) that exhibit inflated abundances due to blooms introduced before sample receipt at CU Boulder. See [here](http://americangut.org/?page_id=202) for further detail. \n", "3. Determine OTUs at 97% sequence identity (taxonomic resolution typically corresponding approximately to species-level assignments) by clustering against [Greengenes](http://greengenes.secondgenome.com) version 13_5.\n", "4. Trim sequences to 100 nucleotides in order to permit valid comparisons with other studies that provide pre-trimmed data, thus avoiding sequence length biases in OTU assignment.\n", "5. Merge the American Gut OTUs with the OTUs from the [Human Microbiome Project](http://www.ncbi.nlm.nih.gov/pubmed/22699609), the [Global Gut](http://www.ncbi.nlm.nih.gov/pubmed/22699611), and the [Personal Genome Project](http://personalgenomes.org/) microbiome data based on assigned taxonomy.\n", "6. 
Merge the sample metadata from all studies, which contains information on the study variables that describe each sample.\n", "7. Rarefy samples to 1000 sequences per sample (due to the relatively low coverage in the Human Microbiome Project) to normalize for differential sequencing depth.\n", "8. Compute beta diversity using unweighted [UniFrac](http://www.ncbi.nlm.nih.gov/pubmed/16332807).\n", "9. Compute principal coordinates using unweighted UniFrac distances and visualize using [Emperor](http://www.gigasciencejournal.com/content/2/1/16/).\n", "10. Build phylum level taxonomy summaries comparing each sample to similar groups (e.g., age, BMI, etc), and to [Michael Pollan](http://michaelpollan.com/).\n", "11. Determine abundant and enriched microbes within each sample relative to the American Gut population.\n", "12. Produce the full taxonomy summary for each sample using unrarefied data.\n", "13. Populate the results template (additional information about the results can be found here) to produce graphics and a frameable set of results.\n", "14. Produce a summary of the project.\n", "\n", "Additional information about the tools used in the American Gut Project and our contributions to the microbiome community can be found in the following publications:\n", "\n", "* [Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications.](http://www.ncbi.nlm.nih.gov/pubmed/21552244)\n", "* [EMPeror: a tool for visualizing high-throughput microbial community data.](http://www.ncbi.nlm.nih.gov/pubmed/24280061)\n", "* [UniFrac: a new phylogenetic method for comparing microbial communities.](http://www.ncbi.nlm.nih.gov/pubmed/16332807)\n", "* [UniFrac--an online tool for comparing microbial community diversity in a phylogenetic context.](http://www.ncbi.nlm.nih.gov/pubmed/16893466)\n", "* [Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities.](http://www.ncbi.nlm.nih.gov/pubmed/17220268)\n", "* [Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data.](http://www.ncbi.nlm.nih.gov/pubmed/19710709)\n", "* [UniFrac: an effective distance metric for microbial community comparison.](http://www.ncbi.nlm.nih.gov/pubmed/20827291)\n", "* [Linking long-term dietary patterns with gut microbial enterotypes.](http://www.ncbi.nlm.nih.gov/pubmed/21885731)\n", "* [A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets.](http://www.ncbi.nlm.nih.gov/pubmed/23326225)\n", "* [Structure, function and diversity of the healthy human microbiome.](http://www.ncbi.nlm.nih.gov/pubmed/22699609)\n", "* [A framework for human microbiome research.](http://www.ncbi.nlm.nih.gov/pubmed/22699610)\n", "* [The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome.](http://www.ncbi.nlm.nih.gov/pubmed/23587224)\n", "* [An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea.](http://www.ncbi.nlm.nih.gov/pubmed/22134646)\n", "* [The Earth Microbiome Project: Meeting report of the \"1 EMP meeting on sample selection and acquisition\" at Argonne National Laboratory October 6 2010.](http://www.ncbi.nlm.nih.gov/pubmed/21304728)\n", "* [Meeting report: the terabase metagenomics workshop and the vision of an Earth microbiome 
project.](http://www.ncbi.nlm.nih.gov/pubmed/21304727)\n", "\n", "More detail on our work on the effects of storage conditions can be found in these publications:\n", "\n", "* [Effect of storage conditions on the assessment of bacterial community structure in soil and human-associated samples.](http://www.ncbi.nlm.nih.gov/pubmed/20412303)\n", "* [Sampling and pyrosequencing methods for characterizing bacterial communities in the human gut using 16S sequence tags.](http://www.ncbi.nlm.nih.gov/pubmed/20673359)\n", "\n", "And more detail on our work on sequencing and data analysis protocols can be found in these publications:\n", "\n", "* [Short pyrosequencing reads suffice for accurate microbial community analysis.](http://www.ncbi.nlm.nih.gov/pubmed/17881377)\n", "* [Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers.](http://www.ncbi.nlm.nih.gov/pubmed/18723574)\n", "* [Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex.](http://www.ncbi.nlm.nih.gov/pubmed/18264105)\n", "* [Selection of primers for optimal taxonomic classification of environmental 16S rRNA gene sequences.](http://www.ncbi.nlm.nih.gov/pubmed/22237546)\n", "* [Comparison of Illumina paired-end and single-direction sequencing for microbial 16S rRNA gene amplicon surveys.](http://www.ncbi.nlm.nih.gov/pubmed/22170427)\n", "* [Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys.](http://www.ncbi.nlm.nih.gov/pubmed/21716311)\n", "* [PrimerProspector: de novo design and taxonomic analysis of barcoded polymerase chain reaction primers.](http://www.ncbi.nlm.nih.gov/pubmed/21349862)\n", "* [QIIME allows analysis of high-throughput community sequencing data.](http://www.ncbi.nlm.nih.gov/pubmed/20383131)\n", "* [Using QIIME to analyze 16S rRNA gene sequences from microbial communities.](http://www.ncbi.nlm.nih.gov/pubmed/22161565)\n", "* [Meta-analyses of studies of the human microbiota.](http://www.ncbi.nlm.nih.gov/pubmed/23861384)\n", "* [Advancing our understanding of the human microbiome using QIIME.](http://www.ncbi.nlm.nih.gov/pubmed/24060131)\n", "* [Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample.](http://www.ncbi.nlm.nih.gov/pubmed/20534432)\n", "* [Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms.](http://www.ncbi.nlm.nih.gov/pubmed/22402401)\n", "* [Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing.](http://www.ncbi.nlm.nih.gov/pubmed/23202435)\n", "* [Human gut microbiome viewed across age and geography.](http://www.ncbi.nlm.nih.gov/pubmed/22699611)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's first adjust a few settings and parameters for the data in the Notebook." 
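, "\n", "\n", "Before that, the optional cell below is a quick sanity check that the core Python dependencies listed above are importable. It is a minimal sketch: the module names are assumptions based on the requirement list, and the command line tools (the QIIME scripts, TexLive, Inkscape, Ghostscript) are not checked here, so it can be skipped." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Optional sanity check (minimal sketch): confirm the core Python dependencies\n", "# are importable. The module names below are assumptions based on the\n", "# requirement list above; command line tools are not checked.\n", "import importlib\n", "\n", "for module in ['IPython', 'skbio', 'brewer2mpl', 'americangut', 'qiime']:\n", "    try:\n", "        importlib.import_module(module)\n", "        print '%s: ok' % module\n", "    except ImportError as e:\n", "        print '%s: MISSING (%s)' % (module, e)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the environment in place, the settings for this run follow."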
] }, { "cell_type": "code", "collapsed": false, "input": [ "# process a small subset of the data if True\n", "debug = False\n", "\n", "# EBI accesions to process\n", "accessions = ['ERP003819', \n", " 'ERP003822', \n", " 'ERP003820', \n", " 'ERP003821', \n", " 'ERP005367', \n", " 'ERP005366', \n", " 'ERP005361', \n", " 'ERP005362', \n", " 'ERP005651', \n", " 'ERP005821', \n", " 'ERP005949', \n", " 'ERP006349', \n", " 'ERP008512', \n", " 'ERP008604', \n", " 'ERP008617']\n", "\n", "# set the path to the Greengenes 13_5 97% OTU tree\n", "reference_rep_set = '/shared/gg_13_8_otus/rep_set/97_otus.fasta' # e.g., path to 97_otus.fasta from Greengenes\n", "reference_taxonomy = '/shared/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt' # e.g., path to 97_otu_taxonomy.txt from Grengenes\n", "reference_tree = '/shared/gg_13_8_otus/trees/97_otus.tree' # e.g., path to 97_otus.tree from Grengenes" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, let\u2019s get our environment up and running. The script ``cluster_utils.ipy`` contains some helper methods for submitting jobs to Torque-based clusters, and we also have quite a few methods from Python and from the American Gut repository to import. Additionally, let\u2019s create a new directory work in, define some helper functions, and set up some paths." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run cluster_utils.ipy\n", "import os\n", "import locale\n", "import datetime\n", "import shutil\n", "from functools import partial\n", "from tempfile import mktemp\n", "from collections import defaultdict\n", "from cPickle import loads\n", "from itertools import izip\n", "from ftplib import FTP\n", "from tarfile import open as tar_open\n", "from gzip import open as gz_open\n", "from glob import glob\n", "\n", "from IPython.lib.display import FileLink\n", "\n", "from americangut.util import fetch_study, trim_fasta, concatenate_files\n", "from americangut.results_utils import (check_file, get_path, get_repository_dir, stage_static_files, \n", " parse_identifying_data, filter_mapping_file, clean_and_reformat_mapping,\n", " parse_previously_printed, bootstrap_result, MissingFigure,\n", " construct_svg_smash_commands, construct_phyla_plots_cmds,\n", " per_sample_taxa_summaries, construct_bootstrap_and_latex_commands,\n", " count_unique_sequences_per_otu, write_bloom_fasta,\n", " harvest) \n", "from americangut.util import count_samples, count_seqs, count_unique_participants\n", "\n", "locale.setlocale(locale.LC_ALL, 'en_US')\n", "\n", "# get the current absolute path\n", "current_dir = os.path.abspath('.')\n", "\n", "# get the path to the American Gut repository \n", "repo_dir = get_repository_dir()\n", "\n", "# create a place to do work\n", "prj_name = \"americangut_results_r1-14\"\n", "working_dir = os.path.join(current_dir, prj_name)\n", "os.makedirs(prj_name)\n", "\n", "# path wrappers\n", "get_relative_new_path = lambda x: os.path.join(working_dir, x)\n", "get_relative_existing_path = partial(get_path, working_dir)\n", "\n", "# set the number of processors parallel tasks will use\n", "NUM_PROCS = 100\n", "NUM_PROCS_OTU_PICKING = 64\n", "\n", "# bootstrap the submit method\n", "submit = partial(submit, prj_name)\n", "submit_smr = lambda x: submit_qsub(x, prj_name, queue='mem512gbq', extra_args='-l ncpus=64 -l walltime=72:00:00')\n", "\n", "# make sure our reference files exist\n", "check_file(reference_rep_set)\n", "check_file(reference_taxonomy)\n", "check_file(reference_tree)\n", 
"\n", "if debug:\n", " sequence_files = [os.path.join(repo_dir, 'data', 'AG_debug', 'test_seqs.fna'),\n", " os.path.join(repo_dir, 'data', 'AG_debug', 'test_seqs_2.fna')]\n", " mapping_files = [os.path.join(repo_dir, 'data', 'AG_debug', 'test_mapping.txt')]\n", "else:\n", " sequence_files = [get_relative_new_path(acc + '.fna') for acc in accessions]\n", " mapping_files = [get_relative_new_path(acc + '.txt') for acc in accessions]\n", " \n", "params_file = get_relative_new_path('sortmerna_pick_params.txt')\n", "with open(params_file, 'w') as f:\n", " f.write(\"pick_otus:otu_picking_method sortmerna\\n\")\n", " f.write(\"pick_otus:similarity 0.97\\n\")\n", " f.write(\"pick_otus:threads %d\\n\" % NUM_PROCS_OTU_PICKING)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The various computational steps in this workflow require the use of a variety of external scripts. Here, we construct a Python ``dict``, a type of lookup structure, that will let us easily refer to commands that we want to run later." ] }, { "cell_type": "code", "collapsed": false, "input": [ "scripts = {\n", " 'Merge OTU Tables':'merge_otu_tables.py -i %(input_a)s,%(input_b)s -o %(output)s',\n", " 'Single Rarifaction':'single_rarefaction.py -i %(input)s -o %(output)s -d %(depth)s',\n", " 'Parallel Beta Diversity':'parallel_beta_diversity.py -i %(input)s -o %(output)s -X %(job_prefix)s -O %(num_jobs)s -t %(gg97_tree)s',\n", " 'Principal Coordinates':'principal_coordinates.py -i %(input)s -o %(output)s',\n", " 'Merge Mapping Files':'merge_mapping_files.py -m %(input_a)s,%(input_b)s -o %(output)s',\n", " 'Filter Samples':'filter_samples_from_otu_table.py -i %(input)s -o %(output)s --sample_id_fp=%(sample_id_fp)s',\n", " 'Summarize OTU by Category':'summarize_otu_by_cat.py -m %(mapping)s -o %(output)s -n -i %(otu_table)s -c %(category)s',\n", " 'Filter Distance Matrix':'filter_distance_matrix.py -i %(input)s -o %(output)s --sample_id_fp=%(sample_ids)s',\n", " 'Summarize Taxa':'summarize_taxa.py -i %(input)s -o %(output)s -L %(level)s',\n", " 'Summarize Taxa Mapping':'summarize_taxa.py -i %(input)s -o %(output)s -L %(level)s -m %(mapping)s',\n", " 'Taxonomy Comparison':'taxonomy_comparison.py -i %(input)s -m %(mapping)s -l %(level)s -o %(output)s -c %(list_of_categories)s',\n", " 'Make Emperor':'make_emperor.py -i %(input)s -o %(output)s -m %(mapping)s',\n", " 'SVG Smash':'replace_svg_object.py -i %(input)s -o %(output)s --prefix %(prefix)s --sample_id=%(sample_id)s',\n", " 'Make Phyla Plots':\"make_phyla_plots_AGP.py -i %(input)s -m %(mapping)s -o %(output)s -c '%(categories)s' -s %(samples)s %(debug)s\",\n", " 'OTU Significance':\"generate_otu_signifigance_tables_AGP.py -i %(input)s -o %(output)s -s %(samples)s\",\n", " 'Create Titles':'create_titles.py -m %(mapping)s -f',\n", " 'SVG to PDF':'inkscape -z -D -f %(input)s -A %(output)s',\n", " 'Format Template':'format_file.py -i %(template)s -k %(keys_for_replace)s -v %(values_for_replace)s -K %(keys_for_insert)s -V %(values_for_insert)s -o %(output)s',\n", " 'To PDF':'module load texlive_2013; cd %(path)s; lualatex %(input)s',\n", " 'PDF Smash':'gs -r150 -q -sPAPERSIZE=ledger -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dFIXEDMEDIA -dPDFFitPage -dCompatibilityLevel=1.4 -sOutputFile=%(output)s -c 100000000 setvmthreshold -f %(pdfs)s',\n", " 'gunzip': 'gunzip -f %(input)s',\n", " 'Pick Closed Reference OTUs': 'pick_closed_reference_otus.py -i %(input)s -o %(output)s -r %(reference)s -t %(taxonomy)s ' + '-p ' + params_file,\n", 
" 'Filter Sequences': 'filter_fasta.py -f %(input)s -m %(otus)s -n -o %(output)s',\n", " 'QIIME Charts': 'qiimecharts run-config -i %(input)s',\n", " 'Filter Sequences to Fecal': 'filter_fasta.py -f %(input)s -o %(output)s --mapping_fp %(mapping)s --valid_states \"%(states)s\"',\n", " 'AGPlots': 'make_plots.py -m %(mapping)s -t %(taxa)s -f pdf -s %(identified)s -c %(metadata_cat)s -v %(metadata_val)s -o %(output_prefix)s -k %(key_taxa)s',\n", " 'Filter Distance Matrix by Metadata':'filter_distance_matrix.py -i %(input)s -o %(output)s -m %(mapping)s -s %(states)s',\n", " 'Beta Diversity Comparison':'beta_sample_rarefaction.py -i %(inputs)s -o %(output)s -n 10 --y_max 0.7 -t %(title)s -l %(labels)s -y %(ylabel)s -v'\n", " }" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, lets set up the paths for the subsequent processing to create the figures. The first PCoA plot is a combination of the American Gut, Human Microbiome Project, Personal Genome Project, and Global Gut datasets. These projects used three different sequencing technologies, and in order to combine them, we need to have BIOM tables based on sequence data trimmed to the same length. See [here](http://www.ncbi.nlm.nih.gov/pubmed/23861384) for a more detailed discussion about meta-analyses." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# working directory file paths for tables and metadata. \n", "# ag -> American Gut\n", "# pgp -> Personal Genome Project\n", "# hmp -> Human Microbiome Project\n", "# gg -> Global Gut\n", "#\n", "# _t_ -> table\n", "# _m_ -> mapping file\n", "#\n", "# 100nt -> trimmed to the first 100 nucleotides \n", " \n", "stage_static_files('fecal', working_dir, debug=debug)\n", "\n", "bloom_seqs = get_relative_existing_path('BLOOM.fasta')\n", "\n", "jobs = []\n", "for f in glob(os.path.join(working_dir, \"*.biom.gz\")):\n", " jobs.append(submit(scripts['gunzip'] % {'input': f}))\n", "res = wait_on(jobs)\n", "\n", "pgp_100nt_t_fp = get_relative_existing_path('PGP_100nt.biom')\n", "pgp_100nt_m_fp = get_relative_existing_path('PGP_100nt.txt')\n", "hmp_100nt_t_fp = get_relative_existing_path('HMPv35_100nt.biom')\n", "hmp_100nt_m_fp = get_relative_existing_path('HMPv35_100nt.txt')\n", "gg_100nt_t_fp = get_relative_existing_path('GG_100nt.biom')\n", "gg_100nt_m_fp = get_relative_existing_path('GG_100nt.txt')\n", " \n", "# participant static images\n", "template = get_relative_existing_path('template_gut.tex')\n", "aglogo = get_relative_existing_path('logoshape.pdf')\n", "fig1_legend = get_relative_existing_path('figure1_legend.pdf')\n", "fig2_legend = get_relative_existing_path('figure2_legend.pdf')\n", "fig2_2ndlegend = get_relative_existing_path('figure2_country_legend.pdf')\n", "fig3_legend = get_relative_existing_path('figure3_legend.pdf')\n", "fig1_ovals = get_relative_existing_path('figure1_ovals.png')\n", "fig2_ovals = get_relative_existing_path('figure2_ovals.png')\n", "fig4_overlay = get_relative_existing_path('figure4_overlay.pdf')\n", "ball_legend = get_relative_existing_path('ball_legend.pdf')\n", "title = get_relative_existing_path('youramericangutsampletext.pdf')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, lets fetch the American Gut data from EBI (if necessary)." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "to_obtain = []\n", "for acc, seqs, map_ in zip(accessions, sequence_files, mapping_files):\n", " if not os.path.exists(seqs) or not os.path.exists(map_):\n", " to_obtain.append((acc, map_, seqs))\n", "\n", "if not debug:\n", " for acc, map_, seqs in to_obtain:\n", " print \"Fetching %s\" % acc\n", " fetch_study(acc, map_, seqs)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the fetch from EBI can result in multiple individual mapping files, let\u2019s go ahead and merge them into a single file for ease of use downstream." ] }, { "cell_type": "code", "collapsed": false, "input": [ "if len(mapping_files) == 1:\n", " mapping_fp = mapping_files[0]\n", "else:\n", " mapping_fp = get_relative_new_path('combined_mapping.txt')\n", " \n", " merge_args = {'input_a': mapping_files[0], 'input_b': mapping_files[1], 'output': mapping_fp}\n", " res = wait_on(submit(scripts['Merge Mapping Files'] % merge_args))\n", " for m in mapping_files[2:]:\n", " merge_args = {'input_a': mapping_fp, 'input_b': m, 'output': mapping_fp}\n", " res = wait_on(submit(scripts['Merge Mapping Files'] % merge_args))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And lets do the same for the sequence files." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# concatenate all sequence files into one merged sequence file. This can take a while!\n", "merged_sequences_full_length_fp = get_relative_new_path('merged_sequences_full_length.fna')\n", "with open(merged_sequences_full_length_fp, 'w') as merged_seqs:\n", " concatenate_files([open(f, 'U') for f in sequence_files], merged_seqs)\n", "\n", "check_file(merged_sequences_full_length_fp)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The bloom sequences we observed were found only within the fecal samples. The first step in removing these sequences is to filter down the data to just those sequences corresponding to fecal samples." ] }, { "cell_type": "code", "collapsed": false, "input": [ "ag_just_fecal = os.path.splitext(os.path.basename(merged_sequences_full_length_fp))[0] + '-fecal.fna'\n", "filter_args = {'input': merged_sequences_full_length_fp,\n", " 'output': ag_just_fecal,\n", " 'mapping': mapping_fp,\n", " 'states': ':'.join(('BODY_SITE', 'UBERON:feces'))}\n", " \n", "wait_on(submit(scripts['Filter Sequences to Fecal'] % filter_args))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we're going to \"pick OTUs\" using the identified bloom sequences as the reference set." ] }, { "cell_type": "code", "collapsed": false, "input": [ "no_ext = os.path.splitext(os.path.basename(ag_just_fecal))[0]\n", "bloom_otus = ag_just_fecal + '-bloom-otus'\n", "bloom_hits = os.path.join(bloom_otus, 'sortmerna_picked_otus', no_ext + '_otus.txt')\n", "bloom_args = {\n", " 'input': ag_just_fecal,\n", " 'output': bloom_otus,\n", " 'reference': bloom_seqs,\n", " 'taxonomy': reference_taxonomy,\n", "}\n", "wait_on(submit_smr(scripts['Pick Closed Reference OTUs'] % bloom_args))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, lets filter out the sequences that recruited to the blooms from the full dataset." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "bloom_filtered = os.path.splitext(merged_sequences_full_length_fp)[0] + '-debloomed.fna'\n", "filter_args = {\n", " 'input': merged_sequences_full_length_fp,\n", " 'output': bloom_filtered,\n", " 'otus': bloom_hits}\n", "wait_on(submit(scripts['Filter Sequences'] % filter_args))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Part of the participant results depend on being able to combine the data with the Global Gut. However, those data were are only 100nt, so in order to reduce study bias, we're going to trim the American Gut data down to 100nt." ] }, { "cell_type": "code", "collapsed": false, "input": [ "bloom_filtered_100nt = os.path.splitext(bloom_filtered)[0] + '-100nt.fna'\n", "with open(bloom_filtered, 'U') as merged_seqs, open(bloom_filtered_100nt, 'w') as merged_seqs_trimmed:\n", " trim_fasta(merged_seqs, merged_seqs_trimmed, 100)\n", "\n", "check_file(bloom_filtered_100nt)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, lets pick OTUs for both the full length and the 100nt trim sequences. Splitting this into two cells due to an issue with SortMeRNA where the indexing potentially conflicts on multiple runs in parallel." ] }, { "cell_type": "code", "collapsed": false, "input": [ "params_file = get_relative_existing_path('sortmerna_pick_params.txt')\n", "with open(params_file, 'a') as f:\n", " f.write(\"pick_otus.py:sortmerna_db /home/mcdonadt/ResearchWork/gg_13_8_otus/97_otus_idx\\n\")\n", " \n", "full_length_otus = os.path.splitext(bloom_filtered)[0] + '-otus'\n", "full_length_args = {\n", " 'input': bloom_filtered,\n", " 'output': full_length_otus,\n", " 'reference': reference_rep_set,\n", " 'taxonomy': reference_taxonomy,\n", "}\n", "\n", "trimmed_otus = os.path.splitext(bloom_filtered_100nt)[0] + '-otus'\n", "trimmed_args = {\n", " 'input': bloom_filtered_100nt,\n", " 'output': trimmed_otus,\n", " 'reference': reference_rep_set,\n", " 'taxonomy': reference_taxonomy,\n", "}\n", "\n", "jobs = [submit_smr(scripts['Pick Closed Reference OTUs'] % full_length_args),\n", " submit_smr(scripts['Pick Closed Reference OTUs'] % trimmed_args)]\n", "wait_on(jobs)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setup and verify our paths to the AG mapping and BIOM table." ] }, { "cell_type": "code", "collapsed": false, "input": [ "ag_100nt_m_fp = get_relative_existing_path('combined_mapping.txt') \n", "ag_100nt_t_fp = get_relative_existing_path('merged_sequences_full_length-debloomed-100nt-otus/otu_table.biom')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If available: paths to identified data, password for the data, and the path to any previously printed results (by barcode)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "path_to_participant_data = None\n", "passwd_for_participant_data = None\n", "path_to_previously_printed = None" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Manage identified participant data if provided (e.g. to print participants\u2019 names on certificates). **Note: these data are _not_ provided for privacy reasons.** Processing can proceed without identified data, however." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "participants = parse_identifying_data(path_to_participant_data, passwd_for_participant_data)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "prev_printed = parse_previously_printed(path_to_previously_printed)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we need to reformat and clean the metadata to improve our ability to compare samples. Specifically, we are going to:\n", "\n", "* Group body sites into their major categories (e.g., transform \"forehead\" and \"skin of hand\" to just \"skin\")\n", "* Simplify country codes (e.g., GAZ:Venezuela to Venezuela)\n", "* Abbreviate experiment titles (e.g., American Gut Project to AGP)\n", "* create a hybrid field combining the experiment title with the body site" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# new file paths\n", "ag_100nt_m_massaged_fp = get_relative_new_path('AG_100nt_massaged.txt')\n", "gg_100nt_m_massaged_fp = get_relative_new_path('GG_100nt_massaged.txt')\n", "pgp_100nt_m_massaged_fp = get_relative_new_path('PGP_100nt_massaged.txt')\n", "hmp_100nt_m_massaged_fp = get_relative_new_path('HMP_100nt_massaged.txt')\n", "\n", "# parse PGP IDs list that are in the American Gut\n", "pgp_ids_fp = get_relative_existing_path('pgp_agp_barcodes.txt')\n", "pgp_ids = [l.strip() for l in open(pgp_ids_fp) if not l.startswith('#')]\n", "\n", "# massage\n", "clean_and_reformat_mapping(open(ag_100nt_m_fp, 'U'), open(ag_100nt_m_massaged_fp, 'w'), 'body_site', 'AGP', pgp_ids=pgp_ids)\n", "clean_and_reformat_mapping(open(gg_100nt_m_fp, 'U'), open(gg_100nt_m_massaged_fp, 'w'), 'body_site', 'GG')\n", "clean_and_reformat_mapping(open(pgp_100nt_m_fp, 'U'), open(pgp_100nt_m_massaged_fp, 'w'), 'body_site', 'PGP', pgp_ids=True)\n", "clean_and_reformat_mapping(open(hmp_100nt_m_fp, 'U'), open(hmp_100nt_m_massaged_fp, 'w'), 'bodysite', 'HMP')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've cleaned up the metadata, we need to merge it into a single mapping file." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "# setup output paths, (mm -> massaged mapping)\n", "hmp_pgp_mm_fp = get_relative_new_path('HMP_PGP_100nt_massaged.txt')\n", "ag_gg_mm_fp = get_relative_new_path('AG_GG_100nt_massaged.txt')\n", "hmp_pgp_ag_gg_mm_fp = get_relative_new_path('HMP_GG_AG_PGP_100nt_massaged.txt')\n", "\n", "hmp_pgp_cmd_args = {'input_a':hmp_100nt_m_massaged_fp,\n", " 'input_b':pgp_100nt_m_massaged_fp,\n", " 'output':hmp_pgp_mm_fp}\n", "\n", "ag_gg_cmd_args = {'input_a':ag_100nt_m_massaged_fp,\n", " 'input_b':gg_100nt_m_massaged_fp,\n", " 'output':ag_gg_mm_fp}\n", "\n", "hmp_pgp_ag_gg_cmd_args = {'input_a':hmp_pgp_mm_fp,\n", " 'input_b':ag_gg_mm_fp,\n", " 'output':hmp_pgp_ag_gg_mm_fp}\n", "\n", "# merge and block until completion\n", "hmp_pgp_job = submit(scripts['Merge Mapping Files'] % hmp_pgp_cmd_args)\n", "ag_gg_job = submit(scripts['Merge Mapping Files'] % ag_gg_cmd_args)\n", "jobs = wait_on([hmp_pgp_job, ag_gg_job])\n", "\n", "# merge and block until completion\n", "hmp_pgp_ag_gg_job = submit(scripts['Merge Mapping Files'] % hmp_pgp_ag_gg_cmd_args)\n", "jobs = wait_on(hmp_pgp_ag_gg_job)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to simplify compute on the downstream processes, we're going to filter out the metadata columns that we aren't interested in." ] }, { "cell_type": "code", "collapsed": false, "input": [ "fig1_m_fp = get_relative_new_path('HMP_GG_AG_PGP_figure1.txt')\n", "fig2_m_fp = get_relative_new_path('AG_GG_fecal_figure2.txt')\n", "fig3_m_fp = get_relative_new_path('AG_fecal_figure3.txt')\n", "fig4_m_fp = get_relative_new_path('AG_fecal_figure4.txt')\n", "oral_m_fp = get_relative_new_path('AG_oral.txt')\n", "skin_m_fp = get_relative_new_path('AG_skin.txt')\n", "\n", "# for the first PCoA, keep only these 5 columns regardless of value\n", "fig1_filter_criteria = {'TITLE_ACRONYM':None, \n", " 'SIMPLE_BODY_SITE':None,\n", " 'TITLE_BODY_SITE':None, \n", " 'HMP_SITE':None,\n", " 'IS_PGP': None}\n", "\n", "# for the second PCoA, we want only the same columns but only the fecal samples\n", "fig2_filter_criteria = {'TITLE_ACRONYM':None, \n", " 'AGE':None, \n", " 'SIMPLE_BODY_SITE':lambda x: x == 'FECAL', \n", " 'COUNTRY':None,\n", " 'IS_PGP': None}\n", "\n", "# for the third PCoA, we want only two columns and only fecal samples\n", "fig3_filter_criteria = {'TITLE_ACRONYM':None, \n", " 'SIMPLE_BODY_SITE':lambda x: x == 'FECAL',\n", " 'IS_PGP': None}\n", "\n", "# for the taxonomy figure, we want to retain fecal samples and a few additional categories\n", "fig4_filter_criteria = {'TITLE_ACRONYM':None,\n", " 'AGE_CATEGORY':None,\n", " 'SEX':None,\n", " 'BMI_CATEGORY':None,\n", " 'DIET_TYPE':None,\n", " 'SIMPLE_BODY_SITE':lambda x: x == 'FECAL'}\n", "\n", "oral_filter_criteria = {'TITLE_ACRONYM':None,\n", " 'AGE_CATEGORY':None,\n", " 'SEX':None,\n", " 'BMI_CATEGORY':None,\n", " 'DIET_TYPE':None,\n", " 'SIMPLE_BODY_SITE':lambda x: x == 'ORAL'}\n", "\n", "skin_filter_criteria = {'TITLE_ACRONYM':None,\n", " 'AGE_CATEGORY':None,\n", " 'SEX':None,\n", " 'BMI_CATEGORY':None,\n", " 'DIET_TYPE':None,\n", " 'SIMPLE_BODY_SITE':lambda x: x == 'SKIN'}\n", "\n", "\n", "filter_mapping_file(open(hmp_pgp_ag_gg_mm_fp, 'U'), open(fig1_m_fp, 'w'), fig1_filter_criteria)\n", "filter_mapping_file(open(ag_gg_mm_fp, 'U'), open(fig2_m_fp, 'w'), fig2_filter_criteria)\n", "filter_mapping_file(open(ag_100nt_m_massaged_fp, 'U'), open(fig3_m_fp, 'w'), fig3_filter_criteria)\n", 
"filter_mapping_file(open(ag_100nt_m_massaged_fp, 'U'), open(fig4_m_fp, 'w'), fig4_filter_criteria)\n", "filter_mapping_file(open(ag_100nt_m_massaged_fp, 'U'), open(oral_m_fp, 'w'), oral_filter_criteria)\n", "filter_mapping_file(open(ag_100nt_m_massaged_fp, 'U'), open(skin_m_fp, 'w'), skin_filter_criteria)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the figures, we'll need to merge and filter the tables in a few ways. For Figure 1, we want to place the American Gut population in the context of other large studies. To do so, we first need to merge the tables together. Since there are 4 tables to merge, we need to use two merge calls. (It is also feasible to use QIIME's ``parallel_merge_otu_tables.py here``). Figures 2 and 3 use a combination of Global Gut and American Gut data, but only fecal samples." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# resulting paths\n", "hmp_pgp_t_fp = get_relative_new_path(\"HMP_PGP_100nt.biom\")\n", "ag_gg_t_fp = get_relative_new_path(\"AG_GG_100nt.biom\")\n", "hmp_gg_ag_pgp_t_fp = get_relative_new_path(\"HMP_GG_AG_PGP_100nt.biom\")\n", "\n", "# setup the command arguments for each call\n", "hmp_pgp_cmd_args = {'input_a':hmp_100nt_t_fp, \n", " 'input_b':pgp_100nt_t_fp,\n", " 'output':hmp_pgp_t_fp}\n", "ag_gg_cmd_args = {'input_a':ag_100nt_t_fp,\n", " 'input_b':gg_100nt_t_fp,\n", " 'output':ag_gg_t_fp}\n", "hmp_gg_ag_pgp_cmd_args = {'input_a':ag_gg_t_fp,\n", " 'input_b':hmp_pgp_t_fp,\n", " 'output':hmp_gg_ag_pgp_t_fp}\n", "\n", "# merge and block until completion\n", "hmp_pgp_job = submit(scripts['Merge OTU Tables'] % hmp_pgp_cmd_args)\n", "ag_gg_job = submit(scripts['Merge OTU Tables'] % ag_gg_cmd_args)\n", "jobs = wait_on([hmp_pgp_job, ag_gg_job])\n", "\n", "# merge and block until completion\n", "hmp_gg_ag_pgp_job = submit(scripts['Merge OTU Tables'] % hmp_gg_ag_pgp_cmd_args)\n", "jobs = wait_on(hmp_gg_ag_pgp_job)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the data are combined, we need to rarefy them in order to normalize for sequencing depth. The Human Microbiome Project has the shallowest coverage per sample. When operating with the HMP dataset, a rarefaction depth of 1,000 sequences per sample is a reasonable balance between accurate microbial community representation and sample retention." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# resulting path\n", "hmp_gg_ag_pgp_t_1k_fp = get_relative_new_path(\"HMP_GG_AG_PGP_100nt_even1k.biom\")\n", "\n", "# setup the command arguments\n", "hmp_gg_ag_pgp_t_1k_cmd_args = {'input':hmp_gg_ag_pgp_t_fp,\n", " 'output':hmp_gg_ag_pgp_t_1k_fp,\n", " 'depth':'1000'}\n", "\n", "# rarifiy and block until completion\n", "hmp_gg_ag_pgp_t_1k_job = submit(scripts['Single Rarifaction'] % hmp_gg_ag_pgp_t_1k_cmd_args)\n", "res = wait_on(hmp_gg_ag_pgp_t_1k_job)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our rarefied OTU table and a corresponding mapping file that represents the AG, PGP, HMP, and GG datasets, we can compute the beta diversity of our samples. This step is computationally demanding and will run for a few hours on 100 processors. 
**Note, the final merge job for parallel beta diversity requires > 8GB of RAM**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# setup output directory\n", "bdiv_dir = lambda x: 'bdiv_' + os.path.basename(x).split('.',1)[0]\n", "hmp_gg_ag_pgp_1k_unweighted_unifrac_d = get_relative_new_path(bdiv_dir(hmp_gg_ag_pgp_t_1k_fp))\n", " \n", "# setup beta diversity arguments\n", "prefix = 'ag2_bdiv_'\n", "hmp_gg_ag_pgp_cmd_args = {'input':hmp_gg_ag_pgp_t_1k_fp,\n", " 'output':hmp_gg_ag_pgp_1k_unweighted_unifrac_d,\n", " 'job_prefix':prefix,\n", " 'num_jobs':200,\n", " 'gg97_tree':reference_tree}\n", "\n", "# submit and wait\n", "hmp_gg_ag_pgp_bdiv_job = submit_qsub(scripts['Parallel Beta Diversity'] % hmp_gg_ag_pgp_cmd_args, prj_name, queue='memroute', extra_args='-l pvmem=16gb')\n", "res = wait_on(hmp_gg_ag_pgp_bdiv_job, additional_prefix=prefix)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have calculated beta diversity for all of the samples, we need to filter the table down to the subsets that we're interested in. For Figure 1 we want to use all of the samples. But for Figure 2 and 3, we only want to use subsets of the full distance matrix we just calculated." ] }, { "cell_type": "code", "collapsed": false, "input": [ "uw_bdiv_path = get_relative_existing_path(os.path.join(hmp_gg_ag_pgp_1k_unweighted_unifrac_d, \n", " 'unweighted_unifrac_HMP_GG_AG_PGP_100nt_even1k.txt'))\n", "w_bdiv_path = get_relative_existing_path(os.path.join(hmp_gg_ag_pgp_1k_unweighted_unifrac_d, \n", " 'weighted_unifrac_HMP_GG_AG_PGP_100nt_even1k.txt')) \n", "\n", "full_bdiv = uw_bdiv_path\n", "fig_path = lambda x: full_bdiv.rsplit('.txt',1)[0] + '-' + x + '.txt'\n", "fig1_bdiv = fig_path('fig1')\n", "fig2_bdiv = fig_path('fig2')\n", "fig3_bdiv = fig_path('fig3')\n", "\n", "fig1_cmd_args = {'input':full_bdiv,\n", " 'output':fig1_bdiv,\n", " 'sample_ids':fig1_m_fp}\n", "fig2_cmd_args = {'input':full_bdiv,\n", " 'output':fig2_bdiv,\n", " 'sample_ids':fig2_m_fp}\n", "fig3_cmd_args = {'input':full_bdiv,\n", " 'output':fig3_bdiv,\n", " 'sample_ids':fig3_m_fp}\n", "\n", "jobs = []\n", "jobs.append(submit(scripts['Filter Distance Matrix'] % fig1_cmd_args))\n", "jobs.append(submit(scripts['Filter Distance Matrix'] % fig2_cmd_args))\n", "jobs.append(submit(scripts['Filter Distance Matrix'] % fig3_cmd_args))\n", "res = wait_on(jobs)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to visualize these data with [Emperor](http://www.gigasciencejournal.com/content/2/1/16/), we need to transform our Unifrac distance matrices into principal coordinates space." 
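, "\n", "\n", "Principal coordinates analysis is classical metric multidimensional scaling: the squared distances are double-centered and eigendecomposed, and each sample's coordinates are the eigenvectors scaled by the square roots of their eigenvalues. The next cell is a small numpy sketch of that computation on a toy matrix, for intuition only; the workflow itself relies on QIIME's ``principal_coordinates.py``." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Conceptual sketch of PCoA (classical MDS) on a toy distance matrix.\n", "# For intuition only; the actual step uses principal_coordinates.py below.\n", "import numpy as np\n", "\n", "def pcoa_sketch(distances):\n", "    d = np.asarray(distances, dtype=float)\n", "    n = d.shape[0]\n", "    # double-center the squared distances: B = -0.5 * J * D**2 * J\n", "    j = np.eye(n) - np.ones((n, n)) / n\n", "    b = -0.5 * j.dot(d ** 2).dot(j)\n", "    eigvals, eigvecs = np.linalg.eigh(b)\n", "    order = np.argsort(eigvals)[::-1]\n", "    eigvals, eigvecs = eigvals[order], eigvecs[:, order]\n", "    keep = eigvals > 0\n", "    # sample coordinates on each positive axis\n", "    return eigvecs[:, keep] * np.sqrt(eigvals[keep]), eigvals\n", "\n", "toy = [[0.0, 0.6, 0.7],\n", "       [0.6, 0.0, 0.3],\n", "       [0.7, 0.3, 0.0]]\n", "coords, eigvals = pcoa_sketch(toy)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here is the actual step, run on each of the three filtered UniFrac matrices:"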
] }, { "cell_type": "code", "collapsed": false, "input": [ "pc_path = lambda x: x.rsplit('.txt',1)[0] + '_pc.txt'\n", "\n", "# verify the expected files are present\n", "check_file(fig1_bdiv)\n", "check_file(fig2_bdiv)\n", "check_file(fig3_bdiv)\n", "\n", "fig1_pc = pc_path(fig1_bdiv)\n", "fig2_pc = pc_path(fig2_bdiv)\n", "fig3_pc = pc_path(fig3_bdiv)\n", "\n", "# setup our arguments\n", "fig1_cmd_args = {'input':fig1_bdiv,\n", " 'output':fig1_pc}\n", "fig2_cmd_args = {'input':fig2_bdiv,\n", " 'output':fig2_pc}\n", "fig3_cmd_args = {'input':fig3_bdiv,\n", " 'output':fig3_pc}\n", "\n", "# submit the jobs\n", "jobs = []\n", "jobs.append(submit(scripts['Principal Coordinates'] % fig1_cmd_args))\n", "jobs.append(submit(scripts['Principal Coordinates'] % fig2_cmd_args))\n", "jobs.append(submit(scripts['Principal Coordinates'] % fig3_cmd_args))\n", "job_results = wait_on(jobs)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make Figures 1, 2, and 3, we will use [Emperor](http://www.gigasciencejournal.com/content/2/1/16/), a WebGL-based Principal Coordinates viewer. While Emperor is the only tool that we're aware of that can effectively scale to datasets of this size for 3D visualization and painting of arbitrary metadata, the tie to WebGL makes its use here a little bit of a challenge. Specifically, we'll be able to generate the plots, but we cannot automatically generate the images from the Notebook.\n", "\n", "First, lets get Emperor up and running. in the following cell, we'll describe how what needs to happen to produce the images." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# quick little helper method\n", "emp_path = lambda x: x.rsplit('.txt',1)[0] + '-emp'\n", "\n", "# verify expected files are present\n", "check_file(fig1_pc)\n", "check_file(fig2_pc)\n", "check_file(fig3_pc)\n", "\n", "# setup output paths\n", "fig1_emp = emp_path(fig1_pc)\n", "fig2_emp = emp_path(fig2_pc)\n", "fig3_emp = emp_path(fig3_pc)\n", "fig3_filter = get_relative_new_path(\"figure3.biom\")\n", "fig3_taxa = get_relative_new_path(\"figure3_taxa\")\n", "fig3_taxa_mapping_fp = os.path.join(fig3_taxa, os.path.splitext(os.path.basename(fig3_m_fp))[0] + '_L2.txt')\n", "\n", "# setup arguments\n", "fig3_filter_args = {'input':hmp_gg_ag_pgp_t_1k_fp,\n", " 'output':fig3_filter,\n", " 'sample_id_fp':fig3_m_fp}\n", "fig3_summarize_args = {'input':fig3_filter,\n", " 'output':fig3_taxa,\n", " 'level':'2',\n", " 'mapping':fig3_m_fp}\n", "\n", "fig1_cmd_args = {'input':fig1_pc, \n", " 'output':fig1_emp, \n", " 'mapping':fig1_m_fp}\n", "fig2_cmd_args = {'input':fig2_pc, \n", " 'output':fig2_emp, \n", " 'mapping':fig2_m_fp}\n", "fig3_cmd_args = {'input':fig3_pc, \n", " 'output':fig3_emp, \n", " 'mapping':fig3_taxa_mapping_fp}\n", "\n", "# filter the table down to just the AG fecal samples\n", "filter_job = submit(scripts['Filter Samples'] % fig3_filter_args)\n", "res = wait_on(filter_job)\n", "\n", "taxa_job = submit(scripts['Summarize Taxa Mapping'] % fig3_summarize_args)\n", "res = wait_on(taxa_job)\n", "\n", "check_file(fig3_taxa_mapping_fp)\n", "\n", "jobs = []\n", "jobs.append(submit(scripts['Make Emperor'] % fig1_cmd_args))\n", "jobs.append(submit(scripts['Make Emperor'] % fig2_cmd_args))\n", "jobs.append(submit(scripts['Make Emperor'] % fig3_cmd_args))\n", "res = wait_on(jobs)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the required manual intervention:\n", 
"\n", "1. Click on the first URL link in the resulting pane\n", "2. Rotate the view to your a nice perspective\n", "3. Click the \"Options\" tab on the right side, set a filename prefix (e.g., figure1) and then click on the \"Multishot\" button. **NOTE: this will produce a lot of files on to your LOCAL computer!**\n", "4. Wait until all of the images have downloaded\n", "5. Pack up the local image files:\n", " - Change directories on your LOCAL computer to your downloads folder\n", " - Execute (make sure to replace **prefix** with the prefix specified in 3): tar czf **prefix**.tar.gz **prefix**.*\n", " - If to many files, you can do: \"find . -name \"Figure_1.*\" -type f -print0 | tar czf Figure_1.tar.gz -T - --null\"\n", " - (adapted from: http://stackoverflow.com/questions/5891866/find-files-and-tar-them-with-spaces)\n", "6. Copy the **prefix**.tar.gz file back to your compute resource, and place it in your home directory (~/ or $HOME)\n", "7. repeat for each figure\n", "\n", "If the figures are redone, then you may need to clear your web-browsers cache befone the HTML links below will work correct. Optionally, you could hit refresh a few times after clicking the links as well." ] }, { "cell_type": "code", "collapsed": false, "input": [ "emp_index = lambda x: os.path.join(x, 'index.html')\n", "\n", "# form the expected paths for Emperor\n", "fig1 = emp_index(fig1_emp)\n", "fig2 = emp_index(fig2_emp)\n", "fig3 = emp_index(fig3_emp)\n", "\n", "# verify the expected files are present\n", "check_file(fig1)\n", "check_file(fig2)\n", "check_file(fig3)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "FileLink(fig1, result_html_prefix='
Figure 1:
')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure 2 is slightly different: we need to enable gradient colors under options, and change the uninformative ages (None/NA/0.0) to a grey." ] }, { "cell_type": "code", "collapsed": false, "input": [ "FileLink(fig2, result_html_prefix='
Figure 2:
')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "FileLink(fig3, result_html_prefix='
Figure 3:
')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the images are created, we need to copy them to our working area. It is possible file paths will be different, in which case, you may need to change the DOWNLOAD_DIRECTORY variable in the next cell." ] }, { "cell_type": "code", "collapsed": false, "input": [ "DOWNLOAD_DIRECTORY = os.path.expandvars('$HOME')\n", "FIGURE1_EXPECTED_FILENAME = 'Figure_1.tar.gz'\n", "FIGURE2_EXPECTED_FILENAME = 'Figure_2.tar.gz'\n", "FIGURE3_EXPECTED_FILENAME = 'Figure_3.tar.gz'\n", "\n", "# helper function for creating wildcard paths\n", "source_path = lambda x,y: os.path.join(x,y)\n", "\n", "# setup the destination paths\n", "emperor_images = get_relative_new_path('emperor_images_svg')\n", "\n", "if not os.path.exists(emperor_images):\n", " os.mkdir(emperor_images)\n", "\n", "# setup the source paths\n", "figure1_src = source_path(DOWNLOAD_DIRECTORY, FIGURE1_EXPECTED_FILENAME)\n", "figure2_src = source_path(DOWNLOAD_DIRECTORY, FIGURE2_EXPECTED_FILENAME)\n", "figure3_src = source_path(DOWNLOAD_DIRECTORY, FIGURE3_EXPECTED_FILENAME)\n", "\n", "check_file(figure1_src)\n", "check_file(figure2_src)\n", "check_file(figure3_src)\n", "\n", "# unpack the tarballs\n", "jobs = []\n", "jobs.append(submit('tar xzf %s -C %s' % (figure1_src, emperor_images)))\n", "jobs.append(submit('tar xzf %s -C %s' % (figure2_src, emperor_images)))\n", "jobs.append(submit('tar xzf %s -C %s' % (figure3_src, emperor_images)))\n", "res = wait_on(jobs)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to produce the actual PDF files of the PCoA plots that will go into the individualized results." ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_ag_IDs = set([l.strip().split('\\t')[0] for l in open(ag_100nt_m_fp) if not l.startswith('#')])\n", "all_emp_SVGs = os.listdir(emperor_images)\n", "template_files = get_relative_new_path('template_files')\n", "\n", "if not os.path.exists(template_files):\n", " os.mkdir(template_files)\n", "\n", "svg_smash_args = {'input':emperor_images, \n", " 'output':template_files, \n", " 'prefix':None, \n", " 'sample_id':None}\n", "\n", "commands = construct_svg_smash_commands(all_emp_SVGs, all_ag_IDs, scripts['SVG Smash'], svg_smash_args)\n", "res = farm_commands(commands, 25)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now lets create the phylum level taxonomy summary plots." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "fig4_sex_fp = get_relative_new_path('fig4_sex.biom')\n", "fig4_age_fp = get_relative_new_path('fig4_age_category.biom')\n", "fig4_diet_fp = get_relative_new_path('fig4_diet.biom')\n", "fig4_bmi_fp = get_relative_new_path('fig4_bmi_category.biom')\n", "ag_fecal_t_1k_fp = get_relative_new_path('ag_fecal_even1k.biom')\n", "ag_oral_t_1k_fp = get_relative_new_path('ag_oral_even1k.biom')\n", "ag_skin_t_1k_fp = get_relative_new_path('ag_skin_even1k.biom')\n", "template_files = get_relative_existing_path('template_files')\n", "\n", "check_file(hmp_gg_ag_pgp_t_1k_fp)\n", "check_file(fig4_m_fp)\n", "check_file(oral_m_fp)\n", "check_file(skin_m_fp)\n", "\n", "filter_fecal_args = {'input':hmp_gg_ag_pgp_t_1k_fp,\n", " 'output':ag_fecal_t_1k_fp,\n", " 'sample_id_fp':fig4_m_fp}\n", "\n", "filter_oral_args = {'input':hmp_gg_ag_pgp_t_1k_fp,\n", " 'output':ag_oral_t_1k_fp,\n", " 'sample_id_fp':oral_m_fp}\n", "\n", "filter_skin_args = {'input':hmp_gg_ag_pgp_t_1k_fp,\n", " 'output':ag_skin_t_1k_fp,\n", " 'sample_id_fp':skin_m_fp}\n", "\n", "otu_by_cat_sex_args = {'otu_table':ag_fecal_t_1k_fp,\n", " 'output':fig4_sex_fp,\n", " 'mapping':fig4_m_fp,\n", " 'category':'SEX'}\n", "otu_by_cat_age_args = {'otu_table':ag_fecal_t_1k_fp,\n", " 'output':fig4_age_fp,\n", " 'mapping':fig4_m_fp,\n", " 'category':'AGE_CATEGORY'}\n", "otu_by_cat_diet_args = {'otu_table':ag_fecal_t_1k_fp,\n", " 'output':fig4_diet_fp,\n", " 'mapping':fig4_m_fp,\n", " 'category':'DIET_TYPE'}\n", "otu_by_cat_bmi_args = {'otu_table':ag_fecal_t_1k_fp,\n", " 'output':fig4_bmi_fp,\n", " 'mapping':fig4_m_fp,\n", " 'category':'BMI_CATEGORY'}\n", "\n", "phyla_plot_args = {'input':ag_fecal_t_1k_fp,\n", " 'output':template_files,\n", " 'mapping':fig4_m_fp,\n", " 'debug':\"-d\" if debug else \"\",\n", " 'categories':'SEX:%s, AGE_CATEGORY:%s, DIET_TYPE:%s, BMI_CATEGORY:%s' % (fig4_sex_fp, fig4_age_fp, fig4_diet_fp, fig4_bmi_fp)}\n", " \n", "# filter the ag table down to just the fecal samples for fig 4\n", "filter_jobs = []\n", "filter_jobs.append(submit(scripts['Filter Samples'] % filter_fecal_args))\n", "filter_jobs.append(submit(scripts['Filter Samples'] % filter_oral_args))\n", "filter_jobs.append(submit(scripts['Filter Samples'] % filter_skin_args))\n", "res = wait_on(filter_jobs)\n", "\n", "# get the summarized tables\n", "jobs = []\n", "jobs.append(submit(scripts['Summarize OTU by Category'] % otu_by_cat_sex_args))\n", "jobs.append(submit(scripts['Summarize OTU by Category'] % otu_by_cat_age_args))\n", "jobs.append(submit(scripts['Summarize OTU by Category'] % otu_by_cat_diet_args))\n", "jobs.append(submit(scripts['Summarize OTU by Category'] % otu_by_cat_bmi_args))\n", "res = wait_on(jobs)\n", "\n", "# farm out the phyla plots\n", "sample_ids = [l.split('\\t')[0] for l in open(fig4_m_fp) if not l.startswith('#')]\n", "commands = construct_phyla_plots_cmds(sample_ids, scripts['Make Phyla Plots'], phyla_plot_args)\n", "res = farm_commands(commands, 5)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And for the last figure, lets create the over represented taxonomy table." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "ag_fecal_t_norare_fp = get_relative_new_path('ag_fecal_norare_filtered.biom')\n", "ag_skin_t_norare_fp = get_relative_new_path('ag_skin_norare_filtered.biom')\n", "ag_oral_t_norare_fp = get_relative_new_path('ag_oral_norare_filtered.biom')\n", "\n", "fecal_tax_sum = get_relative_new_path('ag_fecal_norare_taxasum')\n", "skin_tax_sum = get_relative_new_path('ag_skin_norare_taxasum')\n", "oral_tax_sum = get_relative_new_path('ag_oral_norare_taxasum')\n", "\n", "template_files = get_relative_existing_path('template_files')\n", "\n", "filter_fecal_args = {'input':ag_100nt_t_fp,\n", " 'output':ag_fecal_t_norare_fp,\n", " 'sample_id_fp':fig4_m_fp}\n", "\n", "filter_skin_args = {'input':ag_100nt_t_fp,\n", " 'output':ag_skin_t_norare_fp,\n", " 'sample_id_fp':skin_m_fp}\n", "\n", "filter_oral_args = {'input':ag_100nt_t_fp,\n", " 'output':ag_oral_t_norare_fp,\n", " 'sample_id_fp':oral_m_fp}\n", "\n", "sum_fecal_args = {'input':ag_fecal_t_norare_fp,\n", " 'output':fecal_tax_sum,\n", " 'level':\"2,3,6\"}\n", "\n", "sum_skin_args = {'input':ag_skin_t_norare_fp,\n", " 'output':skin_tax_sum,\n", " 'level':\"2,3,6\"}\n", "\n", "sum_oral_args = {'input':ag_oral_t_norare_fp,\n", " 'output':oral_tax_sum,\n", " 'level':\"2,3,6\"}\n", "\n", "filter_jobs = []\n", "filter_jobs.append(submit(scripts['Filter Samples'] % filter_fecal_args))\n", "filter_jobs.append(submit(scripts['Filter Samples'] % filter_skin_args))\n", "filter_jobs.append(submit(scripts['Filter Samples'] % filter_oral_args))\n", "res = wait_on(filter_jobs)\n", "\n", "sum_jobs = []\n", "sum_jobs.append(submit(scripts['Summarize Taxa'] % sum_fecal_args))\n", "sum_jobs.append(submit(scripts['Summarize Taxa'] % sum_skin_args))\n", "sum_jobs.append(submit(scripts['Summarize Taxa'] % sum_oral_args))\n", "res = wait_on(sum_jobs)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "sum_fecal_args = {'input':ag_fecal_t_norare_fp,\n", " 'output':fecal_tax_sum,\n", " 'level':\"2,3,6\"}\n", "res = wait_on(submit(scripts['Summarize Taxa'] % sum_fecal_args))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "template_files = get_relative_existing_path('template_files')\n", "fig5_lvl6 = get_relative_existing_path('ag_fecal_norare_taxasum/ag_fecal_norare_filtered_L6.biom')\n", "\n", "sig_args = {'input':fig5_lvl6,\n", " 'output':template_files}\n", "\n", "from time import sleep\n", "jobs = []\n", "sample_ids = [l.split('\\t')[0] for l in open(fig4_m_fp) if not l.startswith('#')]\n", "for id_ in sample_ids:\n", " args = sig_args.copy()\n", " args['samples'] = id_\n", " cmd = scripts['OTU Significance'] % args\n", " jobs.append(submit(cmd))\n", " sleep(0.1)\n", "res = wait_on(jobs)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets also dump out the per sample taxonomy information that we can distribute on the participant website." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from biom.util import biom_open\n", "with biom_open(fig5_lvl6) as table:\n", " per_sample_taxa_summaries(table, get_relative_new_path('template_files/Figure_6_%s.txt'))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we construct a method that will return the necessary commands to populate a template per sample." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "static_paths = {'template': template,\n", " 'aglogo':aglogo,\n", " 'fig1_legend': fig1_legend,\n", " 'fig2_legend': fig2_legend,\n", " 'fig2_2ndlegend': fig2_2ndlegend,\n", " 'fig3_legend': fig3_legend,\n", " 'fig4_overlay': fig4_overlay,\n", " 'fig1_ovals': fig1_ovals,\n", " 'fig2_ovals': fig2_ovals,\n", " 'ball_legend': ball_legend,\n", " 'title': title,\n", " 'working_dir': working_dir}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now lets build up the list of commands that will populate the templates per sample." ] }, { "cell_type": "code", "collapsed": false, "input": [ "check_file(ag_100nt_m_massaged_fp)\n", "check_file(template)\n", "\n", "if not os.path.exists(get_relative_new_path('unidentified')):\n", " os.mkdir(get_relative_new_path('unidentified'))\n", "if participants is not None and os.path.exists(get_relative_new_path('identified')):\n", " os.mkdir(get_relative_new_path('identified'))\n", " \n", "base_setup_cmd = 'cd %s; %s'\n", "\n", "indiv_cmds, latex_cmds, missing = construct_bootstrap_and_latex_commands(all_ag_IDs, participants, \n", " get_relative_existing_path,\n", " static_paths, base_setup_cmd, \n", " scripts['To PDF'])\n", "_ = farm_commands(indiv_cmds, 25)\n", "_ = farm_commands(latex_cmds, 25)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These last few cells \"harvest\" the populated templates, which places them in a single folder. We can then \"smash\" or combine multiple templates together to simplify the results printing process." ] }, { "cell_type": "code", "collapsed": false, "input": [ "if participants is not None:\n", " res = farm_commands(harvest(get_relative_existing_path('identified')), 50)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "res = farm_commands(harvest(get_relative_existing_path('unidentified')), 50)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We only do this part if we have identifying information on hand." ] }, { "cell_type": "code", "collapsed": false, "input": [ "if participants is not None:\n", " harvest_path = get_path('identified/harvested')\n", " identified_smash = pdf_smash(harvest_path, 'identified', previously_printed=prev_printed)\n", " res = farm_commands(identified_smash, 1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we're done! You can view all of the intermediate files in the working directory which will be displayed below." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print working_dir" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next up, we're going to create the project summary information." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "# setup paths\n", "key_taxa = get_relative_existing_path('stacked_plots_key_taxa.txt')\n", "\n", "fecal_tax_sum_1k = get_relative_new_path('ag_fecal_1k_taxasum')\n", "oral_tax_sum_1k = get_relative_new_path('ag_oral_1k_taxasum')\n", "skin_tax_sum_1k = get_relative_new_path('ag_skin_1k_taxasum')\n", "\n", "fecal_tax_sum_1k_phy = get_relative_new_path('ag_fecal_1k_taxasum/ag_fecal_even1k_L2.txt')\n", "oral_tax_sum_1k_phy = get_relative_new_path('ag_oral_1k_taxasum/ag_oral_even1k_L2.txt')\n", "skin_tax_sum_1k_phy = get_relative_new_path('ag_skin_1k_taxasum/ag_skin_even1k_L2.txt')\n", "\n", "uw_unif_dm_fecal = get_relative_new_path('unweighted_unifrac_HMP_GG_AG_PGP_100nt_even1k_fecal.txt')\n", "uw_unif_dm_fecal_hmp = get_relative_new_path('unweighted_unifrac_HMP_even1k_fecal.txt')\n", "uw_unif_dm_fecal_agp = get_relative_new_path('unweighted_unifrac_AGP_even1k_fecal.txt')\n", "w_unif_dm_fecal = get_relative_new_path('weighted_unifrac_HMP_GG_AG_PGP_100nt_even1k_fecal.txt')\n", "w_unif_dm_fecal_hmp = get_relative_new_path('weighted_unifrac_HMP_even1k_fecal.txt')\n", "w_unif_dm_fecal_agp = get_relative_new_path('weighted_unifrac_AGP_even1k_fecal.txt')\n", "\n", "qiime_charts_config = get_relative_existing_path('metadata_charts.json')\n", "\n", "fig5 = get_relative_new_path('fig5.pdf')\n", "\n", "jobs = []\n", "\n", "# get summary counts\n", "agp_sample_count = count_samples(open(mapping_fp))\n", "agp_seq_count = count_seqs(open(merged_sequences_full_length_fp))\n", "agp_unique_participants = count_unique_participants(open(mapping_fp))\n", "pgp_sample_count = count_samples(open(ag_100nt_m_massaged_fp), criteria={'IS_PGP': 'Yes'})\n", "pgp_seq_count = count_seqs(open(bloom_filtered_100nt), subset=pgp_ids)\n", "pgp_unique_participants = count_unique_participants(open(ag_100nt_m_massaged_fp), criteria={'IS_PGP': 'Yes'})\n", "\n", "# from the GET2012 sampling which were not part of the American Gut\n", "pgp_sample_count += 439\n", "pgp_unique_participants += 86\n", "pgp_seq_count += 9509776\n", "\n", "# summarize taxa for the stacked taxa plots\n", "sum_fecal_args = {'input':ag_fecal_t_1k_fp,\n", " 'output':fecal_tax_sum_1k,\n", " 'level':\"2\"}\n", "\n", "sum_oral_args = {'input':ag_oral_t_1k_fp,\n", " 'output':oral_tax_sum_1k,\n", " 'level':\"2\"}\n", "\n", "sum_skin_args = {'input':ag_skin_t_1k_fp,\n", " 'output':skin_tax_sum_1k,\n", " 'level':\"2\"}\n", "jobs.append(submit(scripts['Summarize Taxa'] % sum_fecal_args))\n", "jobs.append(submit(scripts['Summarize Taxa'] % sum_oral_args))\n", "jobs.append(submit(scripts['Summarize Taxa'] % sum_skin_args))\n", "res = wait_on(jobs)\n", "\n", "jobs = []\n", "\n", "# setup the stacked taxa plots commands\n", "agplots_args = {'mapping': ag_100nt_m_massaged_fp,\n", " 'taxa': None,\n", " 'metadata_cat': 'SIMPLE_BODY_SITE',\n", " 'metadata_val': None,\n", " 'output_prefix': None,\n", " 'key_taxa': key_taxa}\n", "\n", "agplots_fecal = agplots_args.copy()\n", "agplots_fecal['taxa'] = fecal_tax_sum_1k_phy\n", "agplots_fecal['metadata_val'] = 'FECAL'\n", "agplots_fecal['output_prefix'] = 'ag_plots_fecal_'\n", "agplots_fecal['identified'] = 'fecal_identified.txt'\n", "\n", "agplots_oral = agplots_args.copy()\n", "agplots_oral['taxa'] = oral_tax_sum_1k_phy\n", "agplots_oral['metadata_val'] = 'ORAL'\n", "agplots_oral['output_prefix'] = 'ag_plots_oral_'\n", "agplots_oral['identified'] = 'oral_identified.txt'\n", "\n", "agplots_skin = agplots_args.copy()\n", "agplots_skin['taxa'] = skin_tax_sum_1k_phy\n", 
"agplots_skin['metadata_val'] = 'SKIN'\n", "agplots_skin['output_prefix'] = 'ag_plots_skin_'\n", "agplots_skin['identified'] = 'skin_identified.txt'\n", "\n", "# setup the initial filtering for the beta diversity HMP/AGP comparison\n", "filter_uwdm_to_fecal = {'input': fig1_bdiv,\n", " 'mapping': hmp_pgp_ag_gg_mm_fp,\n", " 'states': 'SIMPLE_BODY_SITE:FECAL',\n", " 'output': uw_unif_dm_fecal}\n", "filter_wdm_to_fecal = {'input': w_bdiv_path,\n", " 'mapping': hmp_pgp_ag_gg_mm_fp,\n", " 'states': 'SIMPLE_BODY_SITE:FECAL',\n", " 'output': w_unif_dm_fecal}\n", "\n", "# submit the stacked taxa plots, metadata summaries (QIIME Charts) and the initial beta diversity filtering commands\n", "jobs.append(submit(scripts['AGPlots'] % agplots_fecal))\n", "jobs.append(submit(scripts['AGPlots'] % agplots_oral))\n", "jobs.append(submit(scripts['AGPlots'] % agplots_skin))\n", "jobs.append(submit(scripts['QIIME Charts'] % {'input': qiime_charts_config}))\n", "jobs.append(submit(scripts['Filter Distance Matrix by Metadata'] % filter_uwdm_to_fecal))\n", "jobs.append(submit(scripts['Filter Distance Matrix by Metadata'] % filter_wdm_to_fecal))\n", "\n", "res = wait_on(jobs)\n", "\n", "jobs = []\n", "\n", "# second round of beta diversity filtering into project specific results\n", "filter_uwdm_to_fecal_hmp = {'input': uw_unif_dm_fecal,\n", " 'mapping': hmp_pgp_ag_gg_mm_fp,\n", " 'states': 'TITLE_ACRONYM:HMP',\n", " 'output': uw_unif_dm_fecal_hmp}\n", "\n", "filter_uwdm_to_fecal_agp = {'input': uw_unif_dm_fecal,\n", " 'mapping': hmp_pgp_ag_gg_mm_fp,\n", " 'states': 'TITLE_ACRONYM:AGP',\n", " 'output': uw_unif_dm_fecal_agp}\n", "\n", "filter_wdm_to_fecal_hmp = {'input': w_unif_dm_fecal,\n", " 'mapping': hmp_pgp_ag_gg_mm_fp,\n", " 'states': 'TITLE_ACRONYM:HMP',\n", " 'output': w_unif_dm_fecal_hmp}\n", "\n", "filter_wdm_to_fecal_agp = {'input': w_unif_dm_fecal,\n", " 'mapping': hmp_pgp_ag_gg_mm_fp,\n", " 'states': 'TITLE_ACRONYM:AGP',\n", " 'output': w_unif_dm_fecal_agp}\n", "\n", "jobs.append(submit(scripts['Filter Distance Matrix by Metadata'] % filter_uwdm_to_fecal_hmp))\n", "jobs.append(submit(scripts['Filter Distance Matrix by Metadata'] % filter_uwdm_to_fecal_agp))\n", "jobs.append(submit(scripts['Filter Distance Matrix by Metadata'] % filter_wdm_to_fecal_hmp))\n", "jobs.append(submit(scripts['Filter Distance Matrix by Metadata'] % filter_wdm_to_fecal_agp))\n", "res = wait_on(jobs)\n", "\n", "# perform the beta diversity comparison of the HMP and AGP\n", "bdiv_compare_w = {'inputs': ','.join([w_unif_dm_fecal_hmp, w_unif_dm_fecal_agp]),\n", " 'output': fig5 + '.dtest.pdf',\n", " 'labels': 'HMP,AGP',\n", " 'title': \"'Beta diversity added by sampled microbial communities'\",\n", " 'ylabel': \"'Diversity (weighted UniFrac)'\"}\n", "res = wait_on(submit(scripts['Beta Diversity Comparison'] % bdiv_compare_w))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we're going to get the PCoA shots sorted out. For this, we need a few different slices of beta diversity. First, we want to compare the AGP against the HMP, the AGP against the PGP, and finally the US fecal samples of the AGP against the Global Gut. We already have that last one computed though. 
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pc_path = lambda x: x.rsplit('.txt',1)[0] + '_pc.txt'\n", "emp_path = lambda x: x.rsplit('.txt',1)[0] + '-emp'\n", "emp_index = lambda x: os.path.join(x, 'index.html')\n", "\n", "# filter beta diversity to just HMP and AGP\n", "uw_unif_dm_hmp_and_agp = get_relative_new_path('unweighted_unifrac_HMP_AG_100nt_even1k.txt')\n", "uw_unif_dm_pgp_and_agp = get_relative_new_path('unweighted_unifrac_PGP_AG_100nt_even1k.txt')\n", "uw_unif_dm_gg_and_agp = get_relative_new_path('unweighted_unifrac_GG_AG_100nt_even1k.txt')\n", "\n", "filter_uwdm_to_hmp_and_agp = {'input': fig1_bdiv,\n", " 'mapping': hmp_pgp_ag_gg_mm_fp,\n", " 'states': 'TITLE_ACRONYM:HMP,AGP',\n", " 'output': uw_unif_dm_hmp_and_agp}\n", "\n", "filter_uwdm_to_pgp_and_agp = {'input': fig1_bdiv,\n", " 'mapping': hmp_pgp_ag_gg_mm_fp,\n", " 'states': 'TITLE_ACRONYM:AGP,PGP',\n", " 'output': uw_unif_dm_pgp_and_agp}\n", "# just show US/Venezuela/Malawi\n", "filter_uwdm_to_gg_and_agp = {'input': fig2_bdiv,\n", " 'mapping': hmp_pgp_ag_gg_mm_fp,\n", " 'states': \"'COUNTRY:United States of America,Malawi,Venezuela'\",\n", " 'output': uw_unif_dm_gg_and_agp}\n", "\n", "jobs = []\n", "jobs.append(submit(scripts['Filter Distance Matrix by Metadata'] % filter_uwdm_to_hmp_and_agp))\n", "jobs.append(submit(scripts['Filter Distance Matrix by Metadata'] % filter_uwdm_to_pgp_and_agp))\n", "jobs.append(submit(scripts['Filter Distance Matrix by Metadata'] % filter_uwdm_to_gg_and_agp))\n", "res = wait_on(jobs)\n", "\n", "# compute principal coordinates \n", "uw_unif_pc_hmp_and_agp = pc_path(uw_unif_dm_hmp_and_agp)\n", "uw_unif_pc_pgp_and_agp = pc_path(uw_unif_dm_pgp_and_agp)\n", "uw_unif_pc_gg_and_agp = pc_path(uw_unif_dm_gg_and_agp)\n", "\n", "hmp_agp_pc_cmd_args = {'input':uw_unif_dm_hmp_and_agp,\n", " 'output':uw_unif_pc_hmp_and_agp}\n", "pgp_agp_pc_cmd_args = {'input':uw_unif_dm_pgp_and_agp,\n", " 'output':uw_unif_pc_pgp_and_agp}\n", "gg_agp_pc_cmd_args = {'input':uw_unif_dm_gg_and_agp,\n", " 'output':uw_unif_pc_gg_and_agp}\n", "\n", "jobs = []\n", "jobs.append(submit(scripts['Principal Coordinates'] % hmp_agp_pc_cmd_args))\n", "jobs.append(submit(scripts['Principal Coordinates'] % pgp_agp_pc_cmd_args))\n", "jobs.append(submit(scripts['Principal Coordinates'] % gg_agp_pc_cmd_args))\n", "res = wait_on(jobs)\n", "\n", "# produce PCoAs\n", "uw_unif_emp_hmp_and_agp = emp_path(uw_unif_pc_hmp_and_agp)\n", "uw_unif_emp_pgp_and_agp = emp_path(uw_unif_pc_pgp_and_agp)\n", "uw_unif_emp_gg_and_agp = emp_path(uw_unif_pc_gg_and_agp)\n", "\n", "hmp_agp_emp_cmd_args = {'input':uw_unif_pc_hmp_and_agp, \n", " 'output':uw_unif_emp_hmp_and_agp, \n", " 'mapping':fig1_m_fp}\n", "pgp_agp_emp_cmd_args = {'input':uw_unif_pc_pgp_and_agp, \n", " 'output':uw_unif_emp_pgp_and_agp, \n", " 'mapping':fig1_m_fp}\n", "gg_agp_emp_cmd_args = {'input':uw_unif_pc_gg_and_agp, \n", " 'output':uw_unif_emp_gg_and_agp, \n", " 'mapping':fig2_m_fp}\n", "\n", "jobs = []\n", "jobs.append(submit(scripts['Make Emperor'] % hmp_agp_emp_cmd_args))\n", "jobs.append(submit(scripts['Make Emperor'] % pgp_agp_emp_cmd_args))\n", "jobs.append(submit(scripts['Make Emperor'] % gg_agp_emp_cmd_args))\n", "res = wait_on(jobs)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before with the PCoAs for the individual plots, we'll need to save SVGs and transfer them back. 
This requires some manual work in the Emperor visualizations:\n", "\n", "* For the first figure, color by TITLE_BODY_SITE and save the resulting figure as an SVG named \"agp_hmp\"\n", "* For the second figure, color by TITLE_BODY_SITE and save the resulting figure as an SVG named \"agp_pgp\"\n", "* For the third figure, first color by AGE as a gradient, remove points with the values \"no_data\", \"None\", \"0.0\", and \"na\", then save an SVG as \"agp_gg_age\". Next, color by COUNTRY and save as \"agp_gg_country\"\n", "\n", "The colors currently in use are:\n", "\n", "* AGP-Fecal - 0x9c0b18\n", "* AGP-Oral - 0x23529a\n", "* AGP-Skin - 0xee8735\n", "* HMP/PGP-Fecal - 0xe75a44\n", "* HMP/PGP-Oral - 0x74c7df\n", "* HMP/PGP-Skin - 0xf3f5a0\n", "\n", "Once these images are saved, copy them back to $HOME." ] }, { "cell_type": "code", "collapsed": false, "input": [ "FileLink(emp_index(uw_unif_emp_hmp_and_agp), result_html_prefix='

HMP and AGP, please save an SVG as \"agp_hmp\"
')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "FileLink(emp_index(uw_unif_emp_pgp_and_agp), result_html_prefix='

PGP and AGP, please save an SVG as \"agp_pgp\"
')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "FileLink(emp_index(uw_unif_emp_gg_and_agp), result_html_prefix='

GG and AGP, please save SVGs as \"agp_gg_age\" and \"agp_gg_country\"
')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now lets convert the SVGs to PDFs which greatly reduces the size of the resulting populated template." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# SVG to PDF\n", "agp_hmp_svg = source_path(DOWNLOAD_DIRECTORY, 'agp_hmp.svg')\n", "agp_pgp_svg = source_path(DOWNLOAD_DIRECTORY, 'agp_pgp.svg')\n", "agp_gg_age_svg = source_path(DOWNLOAD_DIRECTORY, 'agp_gg_age.svg')\n", "agp_gg_country_svg = source_path(DOWNLOAD_DIRECTORY, 'agp_gg_country.svg')\n", "\n", "check_file(agp_hmp_svg)\n", "check_file(agp_pgp_svg)\n", "check_file(agp_gg_age_svg)\n", "check_file(agp_gg_country_svg)\n", "\n", "agp_hmp_pdf = get_relative_new_path('agp_hmp.pdf')\n", "agp_pgp_pdf = get_relative_new_path('agp_pgp.pdf')\n", "agp_gg_age_pdf = get_relative_new_path('agp_gg_age.pdf')\n", "agp_gg_country_pdf = get_relative_new_path('agp_gg_country.pdf')\n", "\n", "agp_hmp_args = {'input': agp_hmp_svg, 'output': agp_hmp_pdf}\n", "agp_pgp_args = {'input': agp_pgp_svg, 'output': agp_pgp_pdf}\n", "agp_gg_age_args = {'input': agp_gg_age_svg, 'output': agp_gg_age_pdf}\n", "agp_gg_country_args = {'input': agp_gg_country_svg, 'output': agp_gg_country_pdf}\n", "\n", "jobs = []\n", "jobs.append(submit(scripts['SVG to PDF'] % agp_hmp_args))\n", "jobs.append(submit(scripts['SVG to PDF'] % agp_pgp_args))\n", "jobs.append(submit(scripts['SVG to PDF'] % agp_gg_age_args))\n", "jobs.append(submit(scripts['SVG to PDF'] % agp_gg_country_args))\n", "res = wait_on(jobs)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we're going to create the macros file for the resulting template. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "macros_fp = get_relative_new_path('mod1_macros.tex')\n", "\n", "# get the date and format for human consumption (e.g., January 1, 2014)\n", "cur_date = datetime.datetime.now()\n", "date_fmt = cur_date.strftime(\"%B %d, %Y\")\n", "\n", "# format the counts into comma separated (e.g., 1,234,567)\n", "agp_samples_fmt = locale.format(\"%d\", agp_sample_count, grouping=True)\n", "agp_participants_fmt = locale.format(\"%d\", agp_unique_participants, grouping=True)\n", "agp_sequences_fmt = locale.format(\"%d\", agp_seq_count, grouping=True)\n", "pgp_samples_fmt = locale.format(\"%d\", pgp_sample_count, grouping=True)\n", "pgp_participants_fmt = locale.format(\"%d\", pgp_unique_participants, grouping=True)\n", "pgp_sequences_fmt = locale.format(\"%d\", pgp_seq_count, grouping=True)\n", "\n", "# build the macro template for the latex document\n", "macro_template = [\"% release date\"]\n", "macro_template.append(\"\\def\\\\releaseDate{%s}\" % date_fmt)\n", "macro_template.append(\"% participants paragraph\")\n", "macro_template.append(\"\\def\\\\numSamples{%s}\" % agp_samples_fmt)\n", "macro_template.append(\"\\def\\\\numParticipants{%s}\" % agp_participants_fmt)\n", "\n", "# published studies\n", "macro_template.append(\"% participants table\")\n", "macro_template.append(\"\\def\\hmpAge{Adults}\")\n", "macro_template.append(\"\\def\\hmpLocation{USA}\")\n", "macro_template.append(\"\\def\\hmpSamples{4,788}\")\n", "macro_template.append(\"\\def\\hmpParticipants{242}\")\n", "macro_template.append(\"\\def\\hmpSequences{36,797,226}\")\n", "macro_template.append(\"\\def\\ggAge{Adults,Children}\")\n", "macro_template.append(\"\\def\\ggLocation{Venezuela, Malawi, USA}\")\n", 
"macro_template.append(\"\\def\\ggSamples{531}\")\n", "macro_template.append(\"\\def\\ggParticipants{531}\")\n", "macro_template.append(\"\\def\\ggSequences{1,093,740,274}\")\n", "\n", "# growing studies\n", "macro_template.append(\"\\def\\pgpAge{Adults}\")\n", "macro_template.append(\"\\def\\pgpLocation{USA}\")\n", "macro_template.append(\"\\def\\pgpSamples{%s}\" % pgp_samples_fmt)\n", "macro_template.append(\"\\def\\pgpParticipants{%s}\" % pgp_participants_fmt)\n", "macro_template.append(\"\\def\\pgpSequences{%s}\" % pgp_sequences_fmt)\n", "macro_template.append(\"\\def\\\\agpAge{Adults, Children}\")\n", "macro_template.append(\"\\def\\\\agpLocation{Global}\")\n", "macro_template.append(\"\\def\\\\agpSamples{%s}\" % agp_samples_fmt)\n", "macro_template.append(\"\\def\\\\agpParticipants{%s}\" % agp_participants_fmt)\n", "macro_template.append(\"\\def\\\\agpSequences{%s}\" % agp_sequences_fmt)\n", "\n", "macro_template.append(\"% diversity figure\")\n", "macro_template.append(\"\\def\\\\numParticipantsLowerEstimate{%s}\" % agp_participants_fmt)\n", "\n", "macros = open(macros_fp, 'w')\n", "macros.write('\\n'.join(macro_template))\n", "macros.write('\\n')\n", "macros.close()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we'll actually populate the template." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# summary static images\n", "for_mod1 = ['ag_plots_fecal_legend.pdf',\n", " 'ag_plots_fecal_stack.pdf',\n", " 'ag_plots_oral_stack.pdf',\n", " 'ag_plots_skin_stack.pdf',\n", " 'agp_gg_age.pdf',\n", " 'agp_gg_country.pdf',\n", " 'agp_hmp.pdf',\n", " 'agp_pgp.pdf',\n", " 'fig2.pdf',\n", " 'fig3.pdf',\n", " 'fig5.pdf',\n", " 'fig6_a.pdf',\n", " 'fig6_b.pdf',\n", " 'fig6_legend.pdf',\n", " 's1.pdf',\n", " 's2.pdf',\n", " 's3.pdf',\n", " 'logoshape.pdf',\n", " 'legend_age.pdf',\n", " 'legend_agp_gg.pdf',\n", " 'legend_agp_hmp_pgp.pdf'\n", " ]\n", "\n", "os.mkdir(get_relative_new_path('pdfs-mod1'))\n", "for f in for_mod1:\n", " shutil.copy(get_relative_existing_path(f), get_relative_new_path('pdfs-mod1/'))\n", " \n", "res = wait_on(submit(scripts['To PDF'] % {'path': working_dir, 'input': 'mod1_main.tex'}))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we're done!" ] } ], "metadata": {} } ] }