{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Several recent studies of amplicon taxonomic assignment methods have suggested that training Naive Bayes taxonomic classifiers against only the region of a sequence that was amplified, rather than a full length sequence, will give better taxonomic assignment results ([Mizrahi-Man et al. 2013](http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0053608), [Werner et al. 2012](http://www.nature.com/ismej/journal/v6/n1/full/ismej201182a.html)). \n",
    "\n",
    "This recipe shows how an alignment can be trimmed to cover only a region bound by a pair of primers, using scikit-bio's functionality for sequencing matching. The list of [``skbio.DNA``](http://scikit-bio.org/docs/0.5.0/generated/skbio.sequence.DNA.html) objects that results from this process could be used to train a Naive Bayes classifier.\n",
    "\n",
    "First we'll load an alignment into a [``skbio.TablularMSA``](http://scikit-bio.org/docs/0.5.0/generated/skbio.alignment.TabularMSA.html) object (*MSA* stands for *multiple sequence alignment*). \n",
    "We're going to work with the [qiime-default-reference](https://github.com/biocore/qiime-default-reference) so we have easy access to some sequences. If you want to adapt this recipe to create a region specific reference collection for another set of sequences, just assign the ``reference_alignment_filepath`` variable to a new filepath."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "TabularMSA[DNA]\n",
       "-----------------------------------------------------------------------\n",
       "Stats:\n",
       "    sequence count: 4797\n",
       "    position count: 7682\n",
       "-----------------------------------------------------------------------\n",
       "--------------------------------- ... ---------------------------------\n",
       "--------------------------------- ... ---------------------------------\n",
       "...\n",
       "--------------------------------- ... ---------------------------------\n",
       "--------------------------------- ... ---------------------------------"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import qiime_default_reference\n",
    "import skbio\n",
    "\n",
    "reference_alignment_filepath = qiime_default_reference.get_template_alignment()\n",
    "reference_alignment = skbio.TabularMSA.read(reference_alignment_filepath, constructor=skbio.DNA)\n",
    "reference_alignment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we'll define the forward and reverse primers as ``DNA`` objects. The primers that we're using here are pulled from [Supplementary File 1](http://www.nature.com/ismej/journal/v6/n8/extref/ismej20128x2.txt) of [Caporaso et al. 2012](http://www.nature.com/ismej/journal/v6/n8/full/ismej20128a.html). Note that we're reverse complementing the reverse primer when we load it here. scikit-bio does not automatically try to match the reverse or reverse complement of a sequence, as some programs like BLAST and uclust do (for the sake of computational efficency)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "fwd_primer = skbio.DNA(\"GTGCCAGCMGCCGCGGTAA\", {'id':'fwd-primer'})\n",
    "rev_primer = skbio.DNA(\"GGACTACHVGGGTWTCTAAT\", {'id':'rev-primer'}).reverse_complement()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These primers contain some degeneracies, or characters representing multiple bases. In practice, this means that each of these primer sequences actually represents a collection of sequences, once the degeneracies are expanded. We can see what non-degenerate sequences are represented by our degenerate forward primer sequence as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GTGCCAGCCGCCGCGGTAA\n",
      "GTGCCAGCAGCCGCGGTAA\n"
     ]
    }
   ],
   "source": [
    "for nondegenerate_sequence in fwd_primer.expand_degenerates():\n",
    "    print(nondegenerate_sequence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're going to match any of the non-degenerate variants of each primer in this example.\n",
    "\n",
    "The typical way to approach the problem of finding the boundaries of a short sequence in a longer sequence would be to use pairwise alignment. But, we're going to try a different approach here since pairwise alignment is inherently slow (it scales quadratically). Because these are sequencing primers, they're designed to be unique (so there shouldn't be multiple matches of a primer to a sequence), and they're designed to match as many sequences as possible. So let's try using regular expressions to match our sequencing primers in the reference database. Regular expression matching scales linearly, so is much faster to apply to many sequences. The scikit-bio [``GrammaredSequence``](http://scikit-bio.org/docs/0.5.0/generated/skbio.sequence.GrammaredSequence.html) objects contain a method, [``to_regex``](http://scikit-bio.org/docs/0.5.0/generated/skbio.sequence.GrammaredSequence.to_regex.html), that allows for conversion of a degenerate (or nondegenerate) sequence into a regular expression. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "re.compile(r'GTGCCAGC[CA]GCCGCGGTAA', re.UNICODE)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fwd_primer.to_regex()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "re.compile(r'ATTAGA[AT]ACCC[GCT][GAT]GTAGTCC', re.UNICODE)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rev_primer.to_regex()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use these to create a new regular expression that will match the sequence, excluding the forward and reverse primers (since those are nearly the same in all sequences, they won't be useful in training a taxonomic classifier, so we choose not to include them). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GTGCCAGC[CA]GCCGCGGTAA(.*)ATTAGA[AT]ACCC[GCT][GAT]GTAGTCC\n"
     ]
    }
   ],
   "source": [
    "regex = '{0}(.*){1}'.format(fwd_primer.to_regex().pattern,\n",
    "                            rev_primer.to_regex().pattern)\n",
    "print(regex)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's apply this regular expression to all of our reference sequences. This will let us find out how many reference sequences our pattern matches, and the start and stop positions of each match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3033 of 4797 (63.23%) sequences have exact matches to the regular expression.\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "starts = []\n",
    "stops = []\n",
    "\n",
    "seq_count = 0\n",
    "match_count = 0\n",
    "\n",
    "for seq in reference_alignment:\n",
    "    seq_count += 1\n",
    "    for match in seq.find_with_regex(regex, ignore=seq.gaps()):\n",
    "        match_count += 1\n",
    "        starts.append(match.start)\n",
    "        stops.append(match.stop)\n",
    "\n",
    "starts = pd.Series(starts)\n",
    "stops = pd.Series(stops)\n",
    "match_percentage = (match_count / seq_count) * 100\n",
    "print('{0} of {1} ({2:.2f}%) sequences have exact matches to the regular expression.'.format(match_count, seq_count, match_percentage))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we're matching our aligned reference sequences against an alignment, finding matches in around 60% of our sequences gives us an idea of how to slice all of our sequences, since the purpose of a multiple sequence alignment is to normalize the position numbers across all of the sequences in a sequence collection. One problem with matching against an alignment though is that the gaps in the alignment could make it harder to match our regular expression as the gaps would disrupt our matches. We get around this using the ``ignore`` parameter to [``DNA.find_with_regex``](http://scikit-bio.org/docs/0.5.0/generated/skbio.sequence.DNA.find_with_regex.html), which takes a boolean vector (a fancy name for an array or list of boolean values) indicating positions that should be ignored in the regular expression match.\n",
    "\n",
    "We can next look at the distribution of start positions and stop positions for each match. As you can see, these nearly always match the same position, and therefore tell us where we want to slice our aligned sequences to extract the region between our primers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    3033.000000\n",
       "mean     2263.022090\n",
       "std         1.916592\n",
       "min      2220.000000\n",
       "25%      2263.000000\n",
       "50%      2263.000000\n",
       "75%      2263.000000\n",
       "max      2358.000000\n",
       "dtype: float64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "starts.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    3033.000000\n",
       "mean     4051.132542\n",
       "std         7.281395\n",
       "min      4050.000000\n",
       "25%      4051.000000\n",
       "50%      4051.000000\n",
       "75%      4051.000000\n",
       "max      4452.000000\n",
       "dtype: float64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stops.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The positions that we want to slice our alignment at can be found as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "fwd_primer_end = int(starts.mode())\n",
    "rev_primer_start = int(stops.mode())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we're ready to filter out all positions outside of this range. We do this by specifying that range using [``TabularMSA.iloc``](http://scikit-bio.org/docs/0.5.0/generated/skbio.alignment.TabularMSA.iloc.html). This gives us a ``TabularMSA`` that has the same number of sequences that we started with, but fewer positions in each ``Sequence``. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "filtered_reference_alignment = reference_alignment.iloc[..., fwd_primer_end:rev_primer_start]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "TabularMSA[DNA]\n",
       "-----------------------------------------------------------------------\n",
       "Stats:\n",
       "    sequence count: 4797\n",
       "    position count: 1788\n",
       "-----------------------------------------------------------------------\n",
       "T--AC---AG-AG-GGT-GCA-A-G-CG-TTAA ... ----------G-TGGG-GAG-C-A-AACA--GG\n",
       "T--AC---GA-AG-GGG-GCT-A-G-CG-TTGT ... ---------GTGGGGG--AG-C-G-AACA--GG\n",
       "...\n",
       "C--AC---CG-GC-AGC-TTA-A-G-TG-GTGA ... ----------C-AGGG-GAG-C-G-AACG--GG\n",
       "T--AC---CA-GC-TCC-GCG-A-G-TG-GTTG ... ----------T-GGGG-GAG-C-A-AAGG--GG"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filtered_reference_alignment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because Naive Bayes classifiers are generally trained on unaligned sequences, we'd also want to remove all gaps from the alignment to get the unaligned sequences. These are the sequences that we'd want to use to train our classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "filtered_reference_sequences = [e.degap() for e in filtered_reference_alignment]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DNA\n",
       "---------------------------------------------------------------------\n",
       "Metadata:\n",
       "    'description': ''\n",
       "    'id': '1111561'\n",
       "Stats:\n",
       "    length: 253\n",
       "    has gaps: False\n",
       "    has degenerates: False\n",
       "    has definites: True\n",
       "    GC-content: 52.57%\n",
       "---------------------------------------------------------------------\n",
       "0   TACAGAGGGT GCAAGCGTTA ATCGGATTGA CTGGGCGTAA AGGGCGCGTA GGCGGTAAGA\n",
       "60  TAAGTCAGAT GTTAAAAACC CGAGCTCAAC TTGGGGACTG CATTTGAAAC TATCTCACTA\n",
       "120 GAGTACAGTA GAGGAGAGCG GAATTTCCGG TGTAGCGGTG AAATGCGTAG ATATCGGAAG\n",
       "180 GAACACCAGT GGCGAAGGCG GCTCTCTGGA CTGACACTGA CGCTGAGGCG CGAAAGCGTG\n",
       "240 GGGAGCAAAC AGG"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filtered_reference_sequences[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DNA\n",
       "---------------------------------------------------------------------\n",
       "Metadata:\n",
       "    'description': ''\n",
       "    'id': '4483258'\n",
       "Stats:\n",
       "    length: 253\n",
       "    has gaps: False\n",
       "    has degenerates: False\n",
       "    has definites: True\n",
       "    GC-content: 66.01%\n",
       "---------------------------------------------------------------------\n",
       "0   TACCAGCTCC GCGAGTGGTT GGGGCGTTTA CTGGGCCTAA AGCGCCCGTA GCCGGCCCGG\n",
       "60  TAAGTCACCC CTGAAATCCA TGGGCTCAAC CCATGGGCCG GGGGTGATAC TGCCGGGCTT\n",
       "120 GGGGGTGGGA GAGGCCGAGG GTACTCCCGA GGTAGGGGCG AAATCCGATA ATCTCGGGAG\n",
       "180 GACCACCAGT GGCGGAAGCG CTCGGCTGGA ACACGCCCGA CGGTGAGGGG CGAAAGCTGG\n",
       "240 GGGAGCAAAG GGG"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filtered_reference_sequences[-1]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}