{
 "metadata": {
  "name": "",
  "signature": "sha256:52c403be5d640e7a556feaf493840e9752b3372f51d5c385996a6b9a250de8cd"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Filter and pick OTUs"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "* License: BSD\n",
      "* Copyright: American Gut Project, 2014"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Early in the American Gut Project, it was observed that some organisms bloomed likely as a result of increased shipping time and delay between when samples were collected and when they were put on ice (more detail can be found [here](http://americangut.org/?page_id=277)). The purpose of this notebook is to apply the filter developed in order to bioinformatically subtract these observed bloom sequences from fecal samples. It is important to apply this filter when combining data with the American Gut as to remove a potential study-effect bias as all fecal data in the American Gut has had this filter applied. The specific steps covered are:\n",
      "\n",
      "* Filter demultiplexed sequence data to only fecal samples\n",
      "* Determine what sequences in the fecal samples recruit to the observed bloom sequences\n",
      "* Remove the recruited bloom sequences from the demultiplexed sequence data\n",
      "* Trim the filtered demultiplexed data (to reduce study bias when combining with short reads, such as the data in [Yatsunenko et al 2012](http://www.nature.com/nature/journal/v486/n7402/abs/nature11053.html))\n",
      "* Pick OTUs against Greengenes 13_8 for the trimmed and full length filtered demultiplexed data\n",
      "\n",
      "The notebook is designed to support bloom filtering multiple datasets. \n",
      "\n",
      "The filtering is only intended to be applied to fecal data. As such, this notebook allows you to describe what metadata column and value to use so that only fecal samples are used."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%run cluster_utils.ipy\n",
      "import os\n",
      "from functools import partial\n",
      "\n",
      "# modify the following variable to set a base directory below your home\n",
      "YOUR_BASE_DIRECTORY = \"ICU\"\n",
      "current_home = os.environ['HOME']\n",
      "basedir = os.path.join(current_home, YOUR_BASE_DIRECTORY)\n",
      "working_dir = os.path.join(basedir, 'blooming')\n",
      "prj_name = 'blooming'\n",
      "trim_length = 100\n",
      "\n",
      "# define a submit function against qsub\n",
      "def submit(cmd_prefix, prj_name, queue, extra_args, cmd):\n",
      "    \"\"\"Submission wrapper to submit to qsub\"\"\"\n",
      "    cmd = \"%s; %s\" % (cmd_prefix, cmd)\n",
      "    return submit_qsub(cmd, job_name=prj_name, queue=queue,\n",
      "                       extra_args=extra_args)\n",
      "\n",
      "# define any environment specific details for any running compute job.\n",
      "# it is very likely that these will need to be changed to fit your environment.\n",
      "cmd_prefix = 'source ~/.bash_profile; workon ag'\n",
      "\n",
      "# bootstrap the submit method\n",
      "submit_8gb = partial(submit, cmd_prefix, prj_name, 'short8gb', '-l walltime=23:59:59')\n",
      "submit_smr = partial(submit, cmd_prefix, prj_name, 'long8gb', '-l nodes=1:ppn=64 -l walltime=72:00:00')\n",
      "\n",
      "# setup a working directory\n",
      "if not os.path.exists(working_dir):\n",
      "    os.mkdir(working_dir)\n",
      "path_joiner = partial(os.path.join, working_dir)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here is where we'll setup paths for our input datasets"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# setup the paths to your input datasets\n",
      "paths_to_input_datasets = [\"/Users/mcdonadt/scratch/ICU/r1/seqs_r1.fna\",\n",
      "                           \"/Users/mcdonadt/scratch/ICU/r2/seqs_r2.fna\"]\n",
      "paths_to_input_metadata = [\"/Users/mcdonadt/scratch/ICU/minor_mapping.txt\",\n",
      "                           \"/Users/mcdonadt/scratch/ICU/minor_mapping.txt\"]\n",
      "fecal_category_and_value = [('BODY_SITE', 'UBERON:feces'),\n",
      "                            ('BODY_SITE', 'UBERON:feces')]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, let's setup paths to our references"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# path to the bloom sequences (note, this can be obtained from https://github.com/biocore/American-Gut/tree/master/data/AG\n",
      "bloom_seqs = \"/Users/mcdonadt/ag/American-Gut/data/AG/BLOOM.fasta\"\n",
      "\n",
      "# path to Greengenes\n",
      "reference_rep_set = '/scratch/Users/mcdonadt/greengenes/gg_13_8_otus/rep_set/97_otus.fasta' # e.g., path to 97_otus.fasta from Greengenes\n",
      "reference_taxonomy = '/scratch/Users/mcdonadt/greengenes/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt' # e.g., path to 97_otu_taxonomy.txt from Grengenes"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "On this system, all of the nodes have 64 cores, so let's allow SortMeRNA to use all of them."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# setup SortMeRNA params file for OTU picking\n",
      "num_threads_otu_picking = 64\n",
      "params_file = path_joiner('sortmerna_pick_params.txt')\n",
      "with open(params_file, 'w') as f:\n",
      "    f.write(\"pick_otus:otu_picking_method sortmerna\\n\")\n",
      "    f.write(\"pick_otus:similarity 0.97\\n\")\n",
      "    f.write(\"pick_otus:threads %d\\n\" % num_threads_otu_picking)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next, we'll define the scripts were going to use for filtering."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "scripts = {'Filter Sequences': 'filter_fasta.py -f %(input)s -m %(otus)s -n -o %(output)s',\n",
      "           'Filter Sequences to Fecal': 'filter_fasta.py -f %(input)s -o %(output)s --mapping_fp %(mapping)s --valid_states \"%(states)s\"',\n",
      "           'Pick Closed Reference OTUs': 'pick_closed_reference_otus.py -i %(input)s -o %(output)s -r %(reference)s -t %(taxonomy)s ' + '-p ' + params_file,\n",
      "}           "
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Then, we'll filter the input datasets down to just the sequences associated with fecal samples"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "jobs = []\n",
      "just_fecal_datasets = []\n",
      "for d, m, cat in zip(paths_to_input_datasets, paths_to_input_metadata, fecal_category_and_value):\n",
      "    output = path_joiner(os.path.splitext(os.path.basename(d))[0] + '-fecal.fna')\n",
      "    just_fecal_datasets.append(output)\n",
      "    filter_args = {'input': d,\n",
      "                   'output': output,\n",
      "                   'mapping': m,\n",
      "                   'states': ':'.join(cat)}\n",
      "    \n",
      "    jobs.append(submit_8gb(scripts['Filter Sequences to Fecal'] % filter_args))\n",
      "wait_on(jobs)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Once we have the fecal data, we can filter for the previously observed blooms"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "filtered_outputs = []\n",
      "invalid_seqs = []\n",
      "jobs = []\n",
      "\n",
      "for d in just_fecal_datasets:\n",
      "    relative_name = os.path.basename(d)\n",
      "    no_ext = os.path.splitext(relative_name)[0]\n",
      "    \n",
      "    output = path_joiner(no_ext + '-bloomed')\n",
      "    filtered_outputs.append(output)\n",
      "    \n",
      "    failures = os.path.join(output, 'sortmerna_picked_otus', no_ext + '_otus.txt')\n",
      "    invalid_seqs.append(failures)\n",
      "    \n",
      "    bloom_args = {\n",
      "        'input': d,\n",
      "        'output': output,\n",
      "        'reference': bloom_seqs,\n",
      "        'taxonomy': reference_taxonomy,\n",
      "    }\n",
      "    \n",
      "    jobs.append(submit_smr(scripts['Pick Closed Reference OTUs'] % bloom_args))\n",
      "wait_on(jobs)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Since we have now identified the blooms, we can go back to the full input file and filter out the corresponding sequences"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "jobs = []\n",
      "bloom_filtered_seqs = []\n",
      "for full_file, failures in zip(paths_to_input_datasets, invalid_seqs):\n",
      "    output = os.path.splitext(full_file)[0] + '-bloomed.fna'\n",
      "    bloom_filtered_seqs.append(output)\n",
      "    \n",
      "    filter_args = {\n",
      "        'input': full_file,\n",
      "        'output': output,\n",
      "        'otus': failures}\n",
      "    jobs.append(submit_8gb(scripts['Filter Sequences'] % filter_args))\n",
      "wait_on(jobs)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Since within the American Gut, we also combine with HiSeq reads that are 100nt, let's go ahead and produce two datasets to pick OTUs with, one that is full length and one that is trimmed to just 100nt."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def trimmer(gen):\n",
      "    \"\"\"Trim seqs in gen to trim_length\"\"\"\n",
      "    from skbio import parse_fasta\n",
      "    return ((seq_id, seq[:trim_length]) for seq_id, seq in parse_fasta(gen))\n",
      "\n",
      "for_otu_picking = []\n",
      "for bfs in bloom_filtered_seqs:\n",
      "    trimmed_file = bfs + '-%dnt' % trim_length\n",
      "    for_otu_picking.append(bfs)\n",
      "    for_otu_picking.append(trimmed_file)\n",
      "    \n",
      "    with open(trimmed_file, 'w') as trimmed, open(bfs) as to_trim:\n",
      "        for payload in trimmer(to_trim):\n",
      "            trimmed.write(\">%s\\n%s\\n\" % payload)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "And finally, let's pick OTUs that can be used for subsequent analysis"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "jobs = []\n",
      "final_otus = []\n",
      "\n",
      "for fop in for_otu_picking:\n",
      "    output = fop + '-otus'\n",
      "    final_otus.append(output)\n",
      "    final_args = {\n",
      "        'input': fop,\n",
      "        'output': output,\n",
      "        'reference': reference_rep_set,\n",
      "        'taxonomy': reference_taxonomy,\n",
      "    }\n",
      "    \n",
      "    jobs.append(submit_smr(scripts['Pick Closed Reference OTUs'] % final_args))\n",
      "wait_on(jobs)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}