{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this recipe we look at how to find the longest run of a specified set of characters in a ``Sequence`` object. In this particular case we'll work with ``DNA`` sequences, and look for runs of purines and pyrimidines. In this context, I use the term *run* to mean a series of contiguous characters of interest (e.g., purines). We'll obtain the start and end positions of the longest stretch of various character sets.\n",
    "\n",
    "First, we'll configure the environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from skbio import DNA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's load a single sequence from a fasta file. In this case, this is a short 16S sequence. In practice, you might be loading a whole genome here, but the process will be the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DNA\n",
       "----------------------------------------------------------------------\n",
       "Metadata:\n",
       "    'description': ''\n",
       "    'id': '229854'\n",
       "Stats:\n",
       "    length: 1541\n",
       "    has gaps: False\n",
       "    has degenerates: True\n",
       "    has definites: True\n",
       "    GC-content: 52.30%\n",
       "----------------------------------------------------------------------\n",
       "0    GAGTTTGATC CTGGCTCAGA TTGAACGCTG GCGGCATGCT TAACACATGC AAGTCGAACG\n",
       "60   GCAGCATGAC TTAGCTTGCT AAGTTGATGG CGAGTGGCGA ACGGGTGAGT AACGCGTAGG\n",
       "...\n",
       "1440 CCATGGGAGT GGGCTGCACC AGAAGTAGAT AGTCTAACCG CAAGGGGGAC GTTTACCACG\n",
       "1500 GTGTGGTTCA TGACTGGGGT GAAGTCGTAA CAAGGTAGCC G"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna = DNA.read('data/single_sequence1.fasta', seq_num=1)\n",
    "dna"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To find the longest run of purines, we use ``DNA.find_motifs('purine-run')`` to find *all* purine runs. These come back as python ``slice`` objects, so we can find the start and stop positions of each purine run and then Python's ``max`` function to find the longest. We can also slice our sequence object directly using the resulting slices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "purine_runs = list(dna.find_motifs('purine-run', min_length=2))\n",
    "longest_purine = max(purine_runs, key=lambda x: x.stop - x.start)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DNA\n",
       "--------------------------\n",
       "Metadata:\n",
       "    'description': ''\n",
       "    'id': '229854'\n",
       "Stats:\n",
       "    length: 11\n",
       "    has gaps: False\n",
       "    has degenerates: False\n",
       "    has definites: True\n",
       "    GC-content: 54.55%\n",
       "--------------------------\n",
       "0 AAAGAGGGGG A"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna[longest_purine]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "slice(130, 141, None)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "longest_purine"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To find the longest run of pyrimidines:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pyrimidine_runs = list(dna.find_motifs('pyrimidine-run', min_length=2))\n",
    "longest_pyrimidine = max(pyrimidine_runs, key=lambda x: x.stop - x.start)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DNA\n",
       "--------------------------\n",
       "Metadata:\n",
       "    'description': ''\n",
       "    'id': '229854'\n",
       "Stats:\n",
       "    length: 6\n",
       "    has gaps: False\n",
       "    has degenerates: False\n",
       "    has definites: True\n",
       "    GC-content: 50.00%\n",
       "--------------------------\n",
       "0 CTCTTC"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna[longest_pyrimidine]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "slice(175, 181, None)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "longest_pyrimidine"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To find the longest run of some other character or pattern that don't have built-in support in scikit-bio's sequence objects, we can use ``find_with_regex``. For example, to find the longest stretch of ``T`` characters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "t_runs = list(dna.find_with_regex('([T]+)'))\n",
    "longest_t = max(t_runs, key=lambda x: x.stop - x.start)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DNA\n",
       "--------------------------\n",
       "Metadata:\n",
       "    'description': ''\n",
       "    'id': '229854'\n",
       "Stats:\n",
       "    length: 3\n",
       "    has gaps: False\n",
       "    has degenerates: False\n",
       "    has definites: True\n",
       "    GC-content: 0.00%\n",
       "--------------------------\n",
       "0 TTT"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna[longest_t]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "slice(3, 6, None)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "longest_t"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}