{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "In this recipe we look at how to find the longest run of a specified set of characters in a ``Sequence`` object. In this particular case we'll work with ``DNA`` sequences, and look for runs of purines and pyrimidines. In this context, I use the term *run* to mean a series of contiguous characters of interest (e.g., purines). We'll obtain the start and end positions of the longest stretch of various character sets.\n", "\n", "First, we'll configure the environment:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from skbio import DNA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's load a single sequence from a fasta file. In this case, this is a short 16S sequence. In practice, you might be loading a whole genome here, but the process will be the same." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DNA\n", "----------------------------------------------------------------------\n", "Metadata:\n", " 'description': ''\n", " 'id': '229854'\n", "Stats:\n", " length: 1541\n", " has gaps: False\n", " has degenerates: True\n", " has definites: True\n", " GC-content: 52.30%\n", "----------------------------------------------------------------------\n", "0 GAGTTTGATC CTGGCTCAGA TTGAACGCTG GCGGCATGCT TAACACATGC AAGTCGAACG\n", "60 GCAGCATGAC TTAGCTTGCT AAGTTGATGG CGAGTGGCGA ACGGGTGAGT AACGCGTAGG\n", "...\n", "1440 CCATGGGAGT GGGCTGCACC AGAAGTAGAT AGTCTAACCG CAAGGGGGAC GTTTACCACG\n", "1500 GTGTGGTTCA TGACTGGGGT GAAGTCGTAA CAAGGTAGCC G" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dna = DNA.read('data/single_sequence1.fasta', seq_num=1)\n", "dna" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find the longest run of purines, we use ``DNA.find_motifs('purine-run')`` to find *all* purine runs. These come back as python ``slice`` objects, so we can find the start and stop positions of each purine run and then Python's ``max`` function to find the longest. We can also slice our sequence object directly using the resulting slices." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "purine_runs = list(dna.find_motifs('purine-run', min_length=2))\n", "longest_purine = max(purine_runs, key=lambda x: x.stop - x.start)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DNA\n", "--------------------------\n", "Metadata:\n", " 'description': ''\n", " 'id': '229854'\n", "Stats:\n", " length: 11\n", " has gaps: False\n", " has degenerates: False\n", " has definites: True\n", " GC-content: 54.55%\n", "--------------------------\n", "0 AAAGAGGGGG A" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dna[longest_purine]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "slice(130, 141, None)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "longest_purine" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find the longest run of pyrimidines:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pyrimidine_runs = list(dna.find_motifs('pyrimidine-run', min_length=2))\n", "longest_pyrimidine = max(pyrimidine_runs, key=lambda x: x.stop - x.start)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DNA\n", "--------------------------\n", "Metadata:\n", " 'description': ''\n", " 'id': '229854'\n", "Stats:\n", " length: 6\n", " has gaps: False\n", " has degenerates: False\n", " has definites: True\n", " GC-content: 50.00%\n", "--------------------------\n", "0 CTCTTC" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dna[longest_pyrimidine]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "slice(175, 181, None)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "longest_pyrimidine" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find the longest run of some other character or pattern that don't have built-in support in scikit-bio's sequence objects, we can use ``find_with_regex``. For example, to find the longest stretch of ``T`` characters:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "t_runs = list(dna.find_with_regex('([T]+)'))\n", "longest_t = max(t_runs, key=lambda x: x.stop - x.start)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DNA\n", "--------------------------\n", "Metadata:\n", " 'description': ''\n", " 'id': '229854'\n", "Stats:\n", " length: 3\n", " has gaps: False\n", " has degenerates: False\n", " has definites: True\n", " GC-content: 0.00%\n", "--------------------------\n", "0 TTT" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dna[longest_t]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "slice(3, 6, None)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "longest_t" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }