{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Quick guide to the ipyrad API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting Started" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Welcome! This tutorial will introduce you to the basics of working with ipyrad to assemble RADseq data.\n", "\n", "Note: this tutorial was created in a [Jupyter Notebook](http://jupyter.org/Jupyter) and assumes that you’re following-along in a notebook of your own. If you installed ipyrad then you will also have jupyter installed by default. For a little background on how jupyter notebooks see [Notebooks](notebooks.rst). All of the code below is written in Python. Follow along in your own notebook. \n", "\n", "To begin, we're going to import the ipyrad module and name it ip so that we can refer to it more easily:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import ipyrad as ip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assembly objects\n", "[Assembly objects](Assembly_objects) are used by ipyrad to access data stored on disk and to manipulate it. Each biological sample in a data set is represented as a [Sample object](Sample object), and a set of Samples is stored inside an Assembly object.\n", "\n", "The Assembly object has functions to assemble data, and to view or plot the resulting assembled files and statistics. Assembly objects can be copied or merged to allow branching events where different parameters can subsequently be applied to different Assembly objects. Examples of this are shown in the [workflow](workflow.rst). \n", "\n", "Every analysis begins by creating at least one Assembly object. Here we will name it \"data1\". 
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:ipyrad.core.assembly:New Assembly object `data1` created\n", "DEBUG:ipyrad.core.parallel:ipcontroller already running.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "New Assembly object `data1` created\n" ] } ], "source": [ "## create an Assembly object named data1. \n", "data1 = ip.Assembly(\"data1\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The printout tells us that we created the object `data1`, and also that it found 4 engines on our system that can be used for computation. An engine is simply a CPU. When working on a single machine it will usually be easiest to simply let the Assembly object connect to all available local engines. However, on HPC clusters you may need to modify the controller or the number of engines, as shown below:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## create an Assembly object linked to 8 engines using MPI\n", "data1 = ip.Assembly(\"data1\", N=4, controller=\"MPI\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more information about connecting CPUs for parallelization see [ipyparallel setup](ipyparallel_setup).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modifying Assembly parameters\n", "The arguments `get_params()` and `set_params()` are used to view and modify parameter settings of an Assembly object, respectively." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1 project_dir ./test_rad \n", " 2 raw_fastq_path ./data/sim_rad_test_R1_.fastq.gz \n", " 3 barcodes_path ./data/sim_rad_test_barcodes.txt \n", " 4 sorted_fastq_path \n", " 5 restriction_overhang ('TGCAG', '') \n", " 6 max_low_qual_bases 5 \n", " 7 engines_per_job 4 \n", " 8 mindepth_statistical 6 \n", " 9 mindepth_majrule 6 \n", " 10 datatype rad \n", " 11 clust_threshold 0.85 \n", " 12 minsamp 4 \n", " 13 max_shared_heterozygosity 0.25 \n", " 14 prefix_outname data1 \n", " 15 phred_Qscore_offset 33 \n", " 16 max_barcode_mismatch 1 \n", " 17 filter_adapters 0 \n", " 18 filter_min_trim_len 35 \n", " 19 ploidy 2 \n", " 20 max_stack_size 1000 \n", " 21 max_Ns_consens (5, 5) \n", " 22 max_Hs_consens (8, 8) \n", " 23 max_SNPs_locus (100, 100) \n", " 24 max_Indels_locus (5, 99) \n", " 25 trim_overhang (1, 2, 2, 1) \n", " 26 hierarchical_clustering 0 \n", " 27 assembly_method denovo \n", " 28 reference_sequence \n" ] } ], "source": [ "## setting/modifying parameters for this Assembly object\n", "data1.set_params('project_dir', \"./test_rad\")\n", "data1.set_params('raw_fastq_path', \"./data/sim_rad_test_R1_.fastq.gz\")\n", "data1.set_params('barcodes_path', \"./data/sim_rad_test_barcodes.txt\")\n", "data1.set_params('filter_adapters', 0)\n", "data1.set_params('datatype', 'rad')\n", "\n", "## print the parameters for `data`\n", "data1.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get more detailed information about each parameter use `ip.get_params_info()`, or look up their funcion in the documentation ([Parameters](parameters.rst)). To quickly look up the proper formatting for a parameter, you can use `ip.get_params_info(N)`, where N is the number of a parameter. 
Example:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " (10) clust_threshold -------------------------------------------------\n", " Clustering threshold. \n", " Examples:\n", " ----------------------------------------------------------------------\n", " data.setparams(10) = .85 ## clustering similarity threshold\n", " data.setparams(10) = .90 ## clustering similarity threshold\n", " data.setparams(10) = .95 ## very high values not recommended \n", " data.setparams(\"clust_threshold\") = .83 ## verbose\n", " ----------------------------------------------------------------------\n", " \n" ] } ], "source": [ "ip.get_params_info(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sample Objects\n", "Each biological sample in a data set is represented as a Sample object. Sample objects are created during `step1()` of an analysis, at which time they are linked to an Assembly object. \n", "\n", "When getting started raw data should be in one of two forms: \n", "+ Non-demultiplexed data files (accompanied by a barcode file).\n", "+ Demultiplexed data files. \n", "\n", "Note: For additional ways to add raw data files to a data set see [link_fastqs](link_fastqs). \n", "\n", "If the data are already demultiplexed then fastq files can be linked directly to the Assembly object, which in turn will create new Sample objects from them, or link them to existing Sample objects based on the file names (or pair of fastq files for paired data files). The files may be gzip compressed. If the data are not demultiplexed then you will have to run the step1 function below to demultiplex the raw data." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 new Samples created in `data1`.\n" ] } ], "source": [ "## This would link fastq files from the 'sorted_fastq_path' if present\n", "## Here it does nothing b/c there are no files in the sorted_fastq_path\n", "data1.link_fastqs()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: Demultiplexing raw data files\n", "Step1 uses barcode information to demultiplex data files found in param 2 ['raw_fastq_path']. It will create a Sample object for each barcoded sample. Below we use the step1() function to demultiplex. The `stats` attribute of an Assembly object is returned as a `pandas` data frame." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " state reads_raw\n", "1A_0 1 20099\n", "1B_0 1 19977\n", "1C_0 1 20114\n", "1D_0 1 19895\n", "2E_0 1 19928\n", "2F_0 1 19934\n", "2G_0 1 20026\n", "2H_0 1 19936\n", "3I_0 1 20084\n", "3J_0 1 20011\n", "3K_0 1 20117\n", "3L_0 1 19901\n" ] } ], "source": [ "## run step 1 to demultiplex the data\n", "data1.step1()\n", "\n", "## print the results for each Sample in data1\n", "print data1.stats" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "## remove the lane control sequence\n", "#data1.samples.pop(\"FGXCONTROL\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: Filter reads \n", "If for some reason we wanted to execute on just a subsample of our data, we could do this by selecting only certain samples to call the `step2` function on. Because `step2` is a function of `data`, it will always execute with the parameters that are linked to `data`. 
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:ipyrad.assemble.demultiplex:optim = 10000\n", "INFO:ipyrad.assemble.demultiplex:optim = 10000\n", "INFO:ipyrad.assemble.demultiplex:optim = 10000\n", "INFO:ipyrad.assemble.demultiplex:optim = 10000\n", "INFO:ipyrad.assemble.demultiplex:optim = 10000\n", "INFO:ipyrad.assemble.demultiplex:optim = 10000\n", "INFO:ipyrad.assemble.demultiplex:optim = 10000\n", "INFO:ipyrad.assemble.demultiplex:optim = 10000\n", "INFO:ipyrad.assemble.demultiplex:optim = 10000\n", "INFO:ipyrad.assemble.demultiplex:optim = 10000\n", "INFO:ipyrad.assemble.demultiplex:optim = 10000\n", "INFO:ipyrad.assemble.demultiplex:optim = 10000\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " state reads_raw reads_filtered\n", "1A_0 2 20099 20099\n", "1B_0 2 19977 19977\n", "1C_0 2 20114 20114\n", "1D_0 2 19895 19895\n", "2E_0 2 19928 19928\n", "2F_0 2 19934 19934\n", "2G_0 2 20026 20026\n", "2H_0 2 19936 19936\n", "3I_0 2 20084 20084\n", "3J_0 2 20011 20011\n", "3K_0 2 20117 20117\n", "3L_0 2 19901 19901\n" ] } ], "source": [ "## example of ways to run step 2 to filter and trim reads\n", "\n", "#data1.step2([\"1A_0\"]) ## run on a single sample\n", "#data1.step2([\"1B_0\", \"1C_0\"]) ## run on one or more samples\n", "data1.step2(force=True) ## run on all samples, overwrite finished\n", "\n", "## print the results\n", "print data1.stats" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#data1.samples[\"veitchii\"].files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Branching Assembly objects\n", "Let's imagine at this point that we are interested in clustering our data at two different clustering thresholds. We will try 0.90 and 0.85. First we need to make a copy/branch of the Assembly object. 
This will inherit the locations of the data linked in the first object, but will diverge in any operations applied to it from this point on. Thus, the two Assembly objects can share the same working directory and inherit shared files, but will diverge by creating new files linked to only one or the other. You can view the directories linked to an Assembly object with the `.dirs` attribute, shown below. The prefix_outname (param 14) of the new object is automatically set to the Assembly object name. \n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1 project_dir ./test_rad \n", " 2 raw_fastq_path ./data/sim_rad_test_R1_.fastq.gz \n", " 3 barcodes_path ./data/sim_rad_test_barcodes.txt \n", " 4 sorted_fastq_path \n", " 5 restriction_overhang ('TGCAG', '') \n", " 6 max_low_qual_bases 5 \n", " 7 engines_per_job 4 \n", " 8 mindepth_statistical 6 \n", " 9 mindepth_majrule 6 \n", " 10 datatype rad \n", " 11 clust_threshold 0.9 \n", " 12 minsamp 4 \n", " 13 max_shared_heterozygosity 0.25 \n", " 14 prefix_outname data2 \n", " 15 phred_Qscore_offset 33 \n", " 16 max_barcode_mismatch 1 \n", " 17 filter_adapters 0 \n", " 18 filter_min_trim_len 35 \n", " 19 ploidy 2 \n", " 20 max_stack_size 1000 \n", " 21 max_Ns_consens (5, 5) \n", " 22 max_Hs_consens (8, 8) \n", " 23 max_SNPs_locus (100, 100) \n", " 24 max_Indels_locus (5, 99) \n", " 25 trim_overhang (1, 2, 2, 1) \n", " 26 hierarchical_clustering 0 \n", " 27 assembly_method denovo \n", " 28 reference_sequence \n" ] } ], "source": [ "## create a branch of our Assembly object\n", "data2 = data1.branch(newname=\"data2\")\n", "\n", "## set the clustering threshold to 0.90\n", "data2.set_params(11, 0.90)\n", "\n", "## look at the inherited parameters\n", "data2.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3: Clustering within samples\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { 
"collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "DEBUG:ipyrad.core.parallel:ipcontroller already running.\n" ] } ], "source": [ "import ipyrad as ip\n", "data1 = ip.load_assembly(\"test_rad/data1\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "clustering 12 samples using 4 engines per job\n", " state reads_raw reads_filtered clusters_total clusters_kept\n", "1A_0 3 20099 20099 1000 1000\n", "1B_0 3 19977 19977 1000 1000\n", "1C_0 3 20114 20114 1000 1000\n", "1D_0 3 19895 19895 1000 1000\n", "2E_0 3 19928 19928 1000 1000\n", "2F_0 3 19934 19934 1000 1000\n", "2G_0 3 20026 20026 1000 1000\n", "2H_0 3 19936 19936 1000 1000\n", "3I_0 3 20084 20084 1000 1000\n", "3J_0 3 20011 20011 1000 1000\n", "3K_0 3 20117 20117 1000 1000\n", "3L_0 3 19901 19901 1000 1000\n" ] } ], "source": [ "## run step 3 to cluster reads within samples using vsearch\n", "data1.step3(force=True)\n", "\n", "## print the results\n", "print data1.stats" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "## run step 3 to cluster reads in data2 at 0.90 sequence similarity\n", "data2.step3(force=True) \n", "\n", "## print the results\n", "print data2.stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Branched Assembly objects\n", "And you can see below that the two Assembly objects are now working with several shared directories (working, fastq, edits) but with different clust directories (clust_0.85 and clust_0.9). 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print \"data1 directories:\"\n", "for (i,j) in data1.dirs.items():\n", " print \"{}\\t{}\".format(i, j)\n", " \n", "print \"\\ndata2 directories:\"\n", "for (i,j) in data2.dirs.items():\n", " print \"{}\\t{}\".format(i, j)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "## TODO, just make a [name]_stats directory in [work] for each data obj\n", "data1.statsfiles\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving stats outputs\n", "Example: two simple ways to save the stats data frame to a file." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data1.stats.to_csv(\"data1_results.csv\", sep=\"\\t\")\n", "data1.stats.to_latex(\"data1_results.tex\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example of plotting with _ipyrad_\n", "There are a a few simple plotting functions in _ipyrad_ useful for visualizing results. These are in the module `ipyrad.plotting`. Below is an interactive plot for visualizing the distributions of coverages across the 12 samples in the test data set. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
[depthplot output: a grid of twelve read-depth histograms, one per sample (1A_0 through 3L_0), with depth 0-50 on the x-axis and counts 0-150 on the y-axis]
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import ipyrad.plotting as iplot\n", "\n", "## plot for one or more selected samples\n", "#iplot.depthplot(data1, [\"1A_0\", \"1B_0\"])\n", "\n", "## plot for all samples in data1\n", "iplot.depthplot(data1)\n", "\n", "## save plot as pdf and html\n", "#iplot.depthplot(data1, outprefix=\"testfig\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "### Step 4: Joint estimation of heterozygosity and error rate\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "## run step 4\n", "data1.step4() \n", "\n", "## print the results\n", "print data1.stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 5: Consensus base calls\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#import ipyrad as ip\n", "\n", "## reload autosaved data. In case you quit and came back \n", "#data1 = ip.load_dataobj(\"test_rad/data1.assembly\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "## run step 5\n", "#data1.step5()\n", "\n", "## print the results\n", "#print data1.stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quick parameter explanations are always on-hand" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ip.get_params_info(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Log history \n", "A common problem at the end of an analysis, or while troubleshooting it, is that you find you've completely forgotten which parameters you used at what point, and when you changed them. 
Documenting or executing code inside Jupyter notebooks (like the one you're reading right now) is a great way to keep track of all of this. In addition, _ipyrad_ also stores a log history that time-stamps all modifications to Assembly objects. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for i in data1.log:\n", " print i\n", " \n", "print \"\\ndata2 log includes its pre-branching history with data1\"\n", "for i in data2.log:\n", " print i" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving Assembly objects\n", "Assembly objects can be saved and loaded so that interactive analyses can be started, stopped, and returned to quite easily. The format of these saved files is a serialized 'dill' object used by Python. Individual Sample objects are saved within Assembly objects. These objects do not contain the actual sequence data, but only link to it, and so are not very large. The information contained includes the parameters and log of Assembly objects, and the statistics and state of Sample objects. Assembly objects are autosaved each time an assembly `step` function is called, but you can also create your own checkpoints with the `save` command. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## save assembly object\n", "#ip.save_assembly(\"data1.p\")\n", "\n", "## load assembly object\n", "#data = ip.load_assembly(\"data1.p\")\n", "#print data.name" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }