{ "metadata": { "name": "", "signature": "sha256:d0b818846b182639011099ca56fd583aa186cae4284de49607ba34231dc100f2" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This IPython notebook file demonstrates some of the functionality available in __poretools__, and an example run report for a recent R7 chemistry nanopore run. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of the examples call out to R for plotting, so to make this work in this notebook we need to load the ``rpy2.ipython`` module." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load_ext rpy2.ipython" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 53 }, { "cell_type": "markdown", "metadata": {}, "source": [ "__poretools__ can run either on an individual FAST5 file, a directory containing FAST5 files, or a tar archive of FAST5 files. Here we set up a `$directory` variable for use for the rest of the tutorial. You could change this and run the same commands on your data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "directory='/mnt/borage/nick/nanopore/data/Flowcell6/downloads'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "!find $directory -maxdepth 1 -name \"*.fast5\" | wc -l" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "60196\r\n" ] } ], "prompt_number": 65 }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 60,196 FAST5 files in the directory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__poretools__ has a number of different command line options. Running __poretools__ with no parameters gives us a brief list (and complies with [Torsten's first rule](http://www.gigasciencejournal.com/content/2/1/15))" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!poretools" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "usage: poretools [-h] [-v]\r\n", " \r\n", " {combine,fastq,fasta,stats,hist,events,readstats,tabular,nucdist,qualdist,winner,squiggle,times,yield_plot}\r\n", " ...\r\n", "poretools: error: too few arguments\r\n" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get more information if we run __poretools__ with the -h (help) option." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!poretools -h" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "usage: poretools [-h] [-v]\r\n", " \r\n", " {combine,fastq,fasta,stats,hist,events,readstats,tabular,nucdist,qualdist,winner,squiggle,times,yield_plot}\r\n", " ...\r\n", "\r\n", "optional arguments:\r\n", " -h, --help show this help message and exit\r\n", " -v, --version Installed poretools version\r\n", "\r\n", "[sub-commands]:\r\n", " {combine,fastq,fasta,stats,hist,events,readstats,tabular,nucdist,qualdist,winner,squiggle,times,yield_plot}\r\n", " combine Combine a set of FAST5 files in a TAR achive\r\n", " fastq Extract FASTQ sequences from a set of FAST5 files\r\n", " fasta Extract FASTA sequences from a set of FAST5 files\r\n", " stats Get read size stats for a set of FAST5 files\r\n", " hist Plot read size histogram for a set of FAST5 files\r\n", " events Extract each nanopore event for each read.\r\n", " readstats Extract signal information for each read over time.\r\n", " tabular Extract the lengths and name/seq/quals from a set of\r\n", " FAST5 files in TAB delimited format\r\n", " nucdist Get the nucl. composition of a set of FAST5 files\r\n", " qualdist Get the qual score composition of a set of FAST5 files\r\n", " winner Get the longest read from a set of FAST5 files\r\n", " squiggle Plot the observed signals for FAST5 reads.\r\n", " times Return the start times from a set of FAST5 files in\r\n", " tabular format\r\n", " yield_plot Plot the yield over time for a set of FAST5 files\r\n" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start with a simple one, the ``stats`` command, this will give us some basic statistics about our reads.\n", "\n", "The ``-q`` option stops ``poretools`` outputting any warning messages." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!poretools stats -q $directory" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "total reads\t104969\r\n", "total base pairs\t550200969\r\n", "mean\t5241.56\r\n", "median\t4616\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "min\t5\r\n", "max\t154417\r\n" ] } ], "prompt_number": 68 }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do we have 104,969 reads from 60,196 FAST5 files? That's because forward, reverse and two-directional reads are all counted separately." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!poretools stats -q --type fwd $directory" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "total reads\t53914\r\n", "total base pairs\t290880019\r\n", "mean\t5395.26\r\n", "median\t4441\r\n", "min\t5\r\n", "max\t154417\r\n" ] } ], "prompt_number": 69 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have 53,914 forward reads in our total dataset." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!poretools stats -q --type rev $directory" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "total reads\t29622\r\n", "total base pairs\t124718901\r\n", "mean\t4210.35\r\n", "median\t3765\r\n", "min\t5\r\n", "max\t44835\r\n" ] } ], "prompt_number": 71 }, { "cell_type": "code", "collapsed": false, "input": [ "!poretools stats -q --type 2D $directory" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "total reads\t21433\r\n", "total base pairs\t134602049\r\n", "mean\t6280.13\r\n", "median\t6020\r\n", "min\t211\r\n", "max\t38598\r\n" ] } ], "prompt_number": 70 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have 21,433 two-direction reads, which is about 40% of the reads which have been base-called and about 72% of the reads that have a detectable complement strand." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!poretools readstats -q $directory > readstats.txt" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 73 }, { "cell_type": "code", "collapsed": false, "input": [ "!wc -l readstats.txt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "``readstats`` gives you a line per every FAST5 file in your dataset. The columns are:\n", "* start_time (represented as a UNIX timestamp, e.g. seconds since the UNIX epoch)\n", "* pore number\n", "* read number (not working ATM, the format changed)\n", "* length of forward read\n", "* length of reverse read" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!head -10 readstats.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1404984747\t491\tNone\t301\t0\r\n", "1405005921\t325\tNone\t299\t0\r\n", "None\t410\tNone\t0\t0\r\n", "1404940637\t415\tNone\t5271\t5113\r\n", "1405002720\t222\tNone\t3064\t0\r\n", "1404959466\t209\tNone\t5901\t5634\r\n", "1404949019\t12\tNone\t4524\t3576\r\n", "1404930734\t470\tNone\t345\t0\r\n", "1404936756\t491\tNone\t310\t0\r\n", "1404947432\t109\tNone\t697\t0\r\n" ] } ], "prompt_number": 75 }, { "cell_type": "markdown", "metadata": {}, "source": [ "One useful plot you can easily do with the output of read stats is to plot the number of events in forward reads against reverse reads. Ideally every read would have a similar number, which would indicate the hairpin is correctly attached and the strand translocation rate is controlled by the enzyme." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%R stats=read.table(\"readstats.txt\", sep=\"\\t\")\n", "%R stats=subset(stats, V4 < 20000 & V5 < 20000)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "<class 'pandas.core.frame.DataFrame'>\n", "Int64Index: 59048 entries, 0 to 59047\n", "Data columns (total 5 columns):\n", "V1 59048 non-null values\n", "V2 59048 non-null values\n", "V3 59048 non-null values\n", "V4 59048 non-null values\n", "V5 59048 non-null values\n", "dtypes: int32(2), object(3)\n", "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 79, "text": [ "