{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this recipe we'll explore how to perform six-frame translation on a nucleotide sequence. Briefly, when an unknown  sequence is obtained from the environment, there are six possible reading frames for translation. The six reading frames for a sequence start at positions 1, 2, and 3 in the forward orientation (denoted ``1``, ``2``, and ``3`` in scikit-bio) and positions 1, 2, and 3 in the reverse orientation (denoted ``-1``, ``-2``, and ``-3`` in scikit-bio).\n",
    "\n",
    "Six-frame translation can be used to find the possible proteins a nucleotide sequence might encode."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First let's define a single RNA sequence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RNA\n",
       "---------------------------------------------------------------------\n",
       "Stats:\n",
       "    length: 392\n",
       "    has gaps: False\n",
       "    has degenerates: False\n",
       "    has definites: True\n",
       "    GC-content: 33.42%\n",
       "---------------------------------------------------------------------\n",
       "0   GUGGGCUUUU UUAUAUCUUA AUUUGCCAUA CUAAUUGCGG CAAUCGCAUG AGGCGUUUUA\n",
       "60  UUAUUCCAUA CACAUAACUU UUCGACUUUA GCUUCAGUAA GAUAUGCAAU CCUCAGGGUA\n",
       "...\n",
       "300 GCACACAAAU CAGUAAUAUU UUGAGGUGUU CCAUGUGCAU AUGCUGAAGA UAGUAAAACU\n",
       "360 GUAAAAAAAA CACCAAAUUU UAAUUUAAUC AU"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from __future__ import print_function\n",
    "\n",
    "from skbio import RNA\n",
    "rna = RNA(\"GUGGGCUUUUUUAUAUCUUAAUUUGCCAUACUAAUUGCGGCAAUCGCAUGAGGCGUUUUAUUAUUCCAUACACAUAACUUUUCGACUUUAGCUUCAGUAAGAUAUGCAAUCCUCAGGGUAUCCUUCAUCCUUUCAAUCGCUUUUUUUUGUGAAUCUAUAUGUUGACUACCUGGUACUUCUACUUGAAAAGUUGCACCAUUCUUAAAAGUAAUGAUAGCCAUCUCUCUUUUUCCAGCUAGAGAUUCUGUAUACGACAAUAUCUUAUCAUUUAGCGUAUGUAUUUGUGUGUUGUGGUAUUCUGCACACAAAUCAGUAAUAUUUUGAGGUGUUCCAUGUGCAUAUGCUGAAGAUAGUAAAACUGUAAAAAAAACACCAAAUUUUAAUUUAAUCAU\")\n",
    "rna"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next let's create a genetic code object. The default genetic code in scikit-bio is the vertebrate nuclear genetic code, but others exist which contain minor differences (e.g., codons code for different amino acids, or the set of stop codons is slightly different) and can be obtained via the `genetic_code` factory. Since we're going to translate the cholera toxin RNA sequence (produced by the *Vibrio cholerae* bacterium), we'll use NCBI's [Bacterial, Archaeal and Plant Plastid Code](http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG11) (transl_table=11):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GeneticCode (Bacterial, Archaeal and Plant Plastid)\n",
       "-------------------------------------------------------------------------\n",
       "  AAs  = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG\n",
       "Starts = ---M---------------M------------MMMM---------------M------------\n",
       "Base1  = UUUUUUUUUUUUUUUUCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG\n",
       "Base2  = UUUUCCCCAAAAGGGGUUUUCCCCAAAAGGGGUUUUCCCCAAAAGGGGUUUUCCCCAAAAGGGG\n",
       "Base3  = UCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAGUCAG"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from skbio import GeneticCode\n",
    "gc = GeneticCode.from_ncbi(11)\n",
    "gc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To perform six-frame translation of our RNA sequence to get a generator of the six translated sequence:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Protein\n",
      "---------------------------------------------------------------------\n",
      "Stats:\n",
      "    length: 130\n",
      "    has gaps: False\n",
      "    has degenerates: False\n",
      "    has definites: True\n",
      "    has stops: True\n",
      "---------------------------------------------------------------------\n",
      "0   VGFFIS*FAI LIAAIA*GVL LFHTHNFSTL ASVRYAILRV SFILSIAFFC ESIC*LPGTS\n",
      "60  T*KVAPFLKV MIAISLFPAR DSVYDNILSF SVCICVLWYS AHKSVIF*GV PCAYAEDSKT\n",
      "120 VKKTPNFNLI\n",
      "\n",
      "Protein\n",
      "---------------------------------------------------------------------\n",
      "Stats:\n",
      "    length: 130\n",
      "    has gaps: False\n",
      "    has degenerates: False\n",
      "    has definites: True\n",
      "    has stops: True\n",
      "---------------------------------------------------------------------\n",
      "0   WAFLYLNLPY *LRQSHEAFY YSIHITFRL* LQ*DMQSSGY PSSFQSLFFV NLYVDYLVLL\n",
      "60  LEKLHHS*K* **PSLFFQLE ILYTTISYHL AYVFVCCGIL HTNQ*YFEVF HVHMLKIVKL\n",
      "120 *KKHQILI*S\n",
      "\n",
      "Protein\n",
      "---------------------------------------------------------------------\n",
      "Stats:\n",
      "    length: 130\n",
      "    has gaps: False\n",
      "    has degenerates: False\n",
      "    has definites: True\n",
      "    has stops: True\n",
      "---------------------------------------------------------------------\n",
      "0   GLFYILICHT NCGNRMRRFI IPYT*LFDFS FSKICNPQGI LHPFNRFFL* IYMLTTWYFY\n",
      "60  LKSCTILKSN DSHLSFSS*R FCIRQYLII* RMYLCVVVFC TQISNILRCS MCIC*R**NC\n",
      "120 KKNTKF*FNH\n",
      "\n",
      "Protein\n",
      "---------------------------------------------------------------------\n",
      "Stats:\n",
      "    length: 130\n",
      "    has gaps: False\n",
      "    has degenerates: False\n",
      "    has definites: True\n",
      "    has stops: True\n",
      "---------------------------------------------------------------------\n",
      "0   MIKLKFGVFF TVLLSSAYAH GTPQNITDLC AEYHNTQIHT LNDKILSYTE SLAGKREMAI\n",
      "60  ITFKNGATFQ VEVPGSQHID SQKKAIERMK DTLRIAYLTE AKVEKLCVWN NKTPHAIAAI\n",
      "120 SMAN*DIKKP\n",
      "\n",
      "Protein\n",
      "---------------------------------------------------------------------\n",
      "Stats:\n",
      "    length: 130\n",
      "    has gaps: False\n",
      "    has degenerates: False\n",
      "    has definites: True\n",
      "    has stops: True\n",
      "---------------------------------------------------------------------\n",
      "0   *LN*NLVFFL QFYYLQHMHM EHLKILLICV QNTTTHKYIR *MIRYCRIQN L*LEKERWLS\n",
      "60  LLLRMVQLFK *KYQVVNI*I HKKKRLKG*R IP*GLHILLK LKSKSYVYGI IKRLMRLPQL\n",
      "120 VWQIKI*KSP\n",
      "\n",
      "Protein\n",
      "---------------------------------------------------------------------\n",
      "Stats:\n",
      "    length: 130\n",
      "    has gaps: False\n",
      "    has degenerates: False\n",
      "    has definites: True\n",
      "    has stops: True\n",
      "---------------------------------------------------------------------\n",
      "0   D*IKIWCFFY SFTIFSICTW NTSKYY*FVC RIPQHTNTYA K**DIVVYRI SSWKKRDGYH\n",
      "60  YF*EWCNFSS RSTR*STYRF TKKSD*KDEG YPEDCISY*S *SRKVMCME* *NASCDCRN*\n",
      "120 YGKLRYKKAH\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for e in gc.translate_six_frames(rna):\n",
    "    print(repr(e), end=\"\\n\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The six protein sequences represent each possible reading frame in the RNA sequence, but start and stop codons are not taken into account by default since we don't know if we have a full length sequence. \n",
    "\n",
    "If instead we want to look only at putative proteins coded by each sequence, we could require that translation start at a start codon, and stop at a stop codon, and then review the resulting sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Protein\n",
      "--------------------------\n",
      "Stats:\n",
      "    length: 6\n",
      "    has gaps: False\n",
      "    has degenerates: False\n",
      "    has definites: True\n",
      "    has stops: False\n",
      "--------------------------\n",
      "0 MGFFIS\n",
      "\n",
      "Protein\n",
      "--------------------------\n",
      "Stats:\n",
      "    length: 3\n",
      "    has gaps: False\n",
      "    has degenerates: False\n",
      "    has definites: True\n",
      "    has stops: False\n",
      "--------------------------\n",
      "0 MPY\n",
      "\n",
      "Protein\n",
      "--------------------------\n",
      "Stats:\n",
      "    length: 20\n",
      "    has gaps: False\n",
      "    has degenerates: False\n",
      "    has definites: True\n",
      "    has stops: False\n",
      "--------------------------\n",
      "0 MLICHTNCGN RMRRFIIPYT\n",
      "\n",
      "Protein\n",
      "---------------------------------------------------------------------\n",
      "Stats:\n",
      "    length: 124\n",
      "    has gaps: False\n",
      "    has degenerates: False\n",
      "    has definites: True\n",
      "    has stops: False\n",
      "---------------------------------------------------------------------\n",
      "0   MIKLKFGVFF TVLLSSAYAH GTPQNITDLC AEYHNTQIHT LNDKILSYTE SLAGKREMAI\n",
      "60  ITFKNGATFQ VEVPGSQHID SQKKAIERMK DTLRIAYLTE AKVEKLCVWN NKTPHAIAAI\n",
      "120 SMAN\n",
      "\n",
      "Protein\n",
      "----------------------------------------\n",
      "Stats:\n",
      "    length: 35\n",
      "    has gaps: False\n",
      "    has degenerates: False\n",
      "    has definites: True\n",
      "    has stops: False\n",
      "----------------------------------------\n",
      "0 MVFFLQFYYL QHMHMEHLKI LLICVQNTTT HKYIR\n",
      "\n",
      "Protein\n",
      "----------------------------\n",
      "Stats:\n",
      "    length: 24\n",
      "    has gaps: False\n",
      "    has degenerates: False\n",
      "    has definites: True\n",
      "    has stops: False\n",
      "----------------------------\n",
      "0 MKIWCFFYSF TIFSICTWNT SKYY\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for e in gc.translate_six_frames(rna, start='require', stop='require'):\n",
    "    print(repr(e), end=\"\\n\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the sequences starts with M (which is the amino acid encoded by ATG, the start codon in this genetic code) and that the stop translation character has been trimmed off. One of these, the ``-1`` orientation (i.e., the reverse complement of the input sequence) looks more like a real protein coding sequence due to its length than the others.\n",
    "\n",
    "As a next step, try searching the putative proteins against a reference database (e.g., by BLASTing them using [NCBI's `blastp` tool](http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome)) to figure out which of these might be an actual protein coding sequence and what it codes for. \n",
    "\n",
    "In the future, we'll have support for remote database searching in scikit-bio. You can track progress on that [here](https://github.com/biocore/scikit-bio/issues/225)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}