{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# User guide to the *ipyrad* API\n", "Welcome! This tutorial will introduce you to the basic and advanced features of working with the *ipyrad* API to assemble RADseq data in Python. The API offers many advantages over the command-line interface, but requires a little more work up front to learn the necessary tools for using it. This includes knowing some very rudimentary Python, and setting up a Jupyter notebook. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import ipyrad as ip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting started with Jupyter notebooks\n", "\n", "This tutorial is an example of a [Jupyter Notebook](http://jupyter.org/Jupyter). If you've installed *ipyrad* then you already have jupyter installed as well, which you can start from the command-line (type `jupyter-notebook`) to launch an interactive notebook like this one. For some background on how jupyter notebooks work I would recommend searching on google, or watching this [YouTube video](https://www.youtube.com/watch?v=HW29067qVWk&t=47s). Once you have the hang of it, follow along with this code in your own notebook. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Connecting your notebook to a cluster\n", "We have a separate tutorial about using Jupyter notebooks and connecting Jupyter notebooks to a computing cluster (see [here](http://ipyrad.readthedocs.io/HPC_Tunnel.html#)). For this notebook I will assume that you are running this code in a Jupyter notebook, and that you have an *ipcluster* instance running either locally or remotely on a cluster. If an *ipcluster* instance is running on its default settings (default profile) then *ipyrad* will automatically use all available cores started on that cluster instance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Python libraries\n", "The only library we need to import is *ipyrad*. The *import* command is usually the first code called in a Python document to load any necessary packages. In the code below, we use a convenient trick in Python to tell it that we want to refer to *ipyrad* simply as *ip*. This saves us a little space since we might type the name many times. Below that, we use the print statement to print the version number of *ipyrad*. This is good practice to keep a record of which software version we are using. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.7.19\n" ] } ], "source": [ "## this is a comment, it is not executed, but the code below it is.\n", "import ipyrad as ip\n", "\n", "## here we print the version\n", "print ip.__version__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The *ipyrad* API data structures\n", "There are two main objects in *ipyrad*: Assembly class objects and Sample class objects. And in fact, most users will only ever interact with the Assembly class objects, since Sample objects are stored inside of the Assembly objects, and the Assembly objects have functions, such as merge, and branch, that are designed for manipulating and exchanging Samples between different Assemblies. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assembly Class objects\n", "Assembly objects are a unique data structure that ipyrad uses to store and organize information about how to Assemble RAD-seq data. 
It contains functions that can be applied to data, such as clustering and aligning sequences, and it stores information about which settings (parameters) to use for those functions, and which Samples they should be applied to. You can think of it mostly as a container that has a set of rules associated with it. \n", "\n", "To create a new Assembly object use the `ip.Assembly()` function and pass it the name of your new Assembly. Creating an object in this way has exactly the same effect as using the **-n {name}** argument in the *ipyrad* command line tool, except that in the API, instead of creating a params.txt file, we store the new Assembly information in a Python variable. This can be named anything you want. Below I name the variable *data1* so it is easy to remember that the Assembly name is also data1. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New Assembly: data1\n" ] } ], "source": [ "## create an Assembly object named data1. \n", "data1 = ip.Assembly(\"data1\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting parameters\n", "You now have an Assembly object with a default set of parameters associated with it, analogous to the params file in the command line tool. You can view and modify these parameters using two functions of the Assembly object, `set_params()` and `get_params()`. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 assembly_name data1 \n", "1 project_dir ./pedicularis \n", "2 raw_fastq_path \n", "3 barcodes_path \n", "4 sorted_fastq_path ./example_empirical_rad/*.gz \n", "5 assembly_method denovo \n", "6 reference_sequence \n", "7 datatype rad \n", "8 restriction_overhang ('TGCAG', '') \n", "9 max_low_qual_bases 5 \n", "10 phred_Qscore_offset 33 \n", "11 mindepth_statistical 6 \n", "12 mindepth_majrule 6 \n", "13 maxdepth 10000 \n", "14 clust_threshold 0.85 \n", "15 max_barcode_mismatch 0 \n", "16 filter_adapters 2 \n", "17 filter_min_trim_len 35 \n", "18 max_alleles_consens 2 \n", "19 max_Ns_consens (5, 5) \n", "20 max_Hs_consens (8, 8) \n", "21 min_samples_locus 4 \n", "22 max_SNPs_locus (20, 20) \n", "23 max_Indels_locus (8, 8) \n", "24 max_shared_Hs_locus 0.5 \n", "25 trim_reads (0, 0, 0, 0) \n", "26 trim_loci (0, 0, 0, 0) \n", "27 output_formats ['p', 's', 'v'] \n", "28 pop_assign_file \n" ] } ], "source": [ "## setting/modifying parameters for this Assembly object\n", "data1.set_params('project_dir', \"pedicularis\")\n", "data1.set_params('sorted_fastq_path', \"./example_empirical_rad/*.gz\")\n", "data1.set_params('filter_adapters', 2)\n", "data1.set_params('datatype', 'rad')\n", "\n", "## prints the parameters to the screen\n", "data1.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Instantaneous parameter (and error) checking \n", "A nice feature of the `set_params()` function in the *ipyrad* API is that it checks your parameter settings at the time that you change them to make sure that they are compatible. By contrast, the *ipyrad* CLI does not check params until you try to run a step function. Below you can see that an error is raised when we try to set the \"clust_threshold\" parameter to 2.0, since it requires a decimal value between 0 and 1. It's hard to catch every possible error, but we've tried to catch many of the most common errors in parameter settings. 
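\n", "\n", "Because the check happens immediately, you can also catch these errors with ordinary Python exception handling if you want a long script to continue past a bad setting. The short sketch below is an addition to this tutorial rather than one of the assembly steps: it deliberately triggers the same bad setting and catches it with the generic Exception class. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## a minimal sketch: catch a bad parameter setting and keep going.\n", "## the error message explains which value was rejected and why.\n", "try:\n", "    data1.set_params(\"clust_threshold\", 2.0)\n", "except Exception as err:\n", "    print \"caught a bad parameter setting:\"\n", "    print err" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same error is shown uncaught below, with its full traceback: 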
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "ename": "IPyradError", "evalue": " Error setting parameter 'clust_threshold'\n clust_threshold must be a decimal value between 0 and 1.\n You entered: 2.0\n ", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mIPyradError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m## this is expected to raise an error, since the clust_threshold\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;31m## parameter cannot be 2.0\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mdata1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mset_params\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"clust_threshold\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2.0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/home/deren/Documents/ipyrad/ipyrad/core/assembly.pyc\u001b[0m in \u001b[0;36mset_params\u001b[0;34m(self, param, newvalue)\u001b[0m\n\u001b[1;32m 810\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0minst\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 811\u001b[0m raise IPyradWarningExit(BAD_PARAMETER\\\n\u001b[0;32m--> 812\u001b[0;31m .format(param, inst, newvalue))\n\u001b[0m\u001b[1;32m 813\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 814\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/home/deren/Documents/ipyrad/ipyrad/assemble/util.pyc\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 50\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__init__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 51\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mipyrad\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__interactive__\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 52\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mIPyradError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 53\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 54\u001b[0m \u001b[0mSystemExit\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__init__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mIPyradError\u001b[0m: Error setting parameter 'clust_threshold'\n clust_threshold must be a decimal value between 0 and 1.\n You entered: 2.0\n " ] } ], "source": [ "## this should raise an error, since clust_threshold cannot be 2.0\n", "data1.set_params(\"clust_threshold\", 2.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Attributes of Assembly objects\n", "Assembly objects have many attributes which you can access to learn more about your Assembly. 
To see the full list of options you can type the name of your Assembly variable, followed by a '.', and then press the tab key. This will use tab-completion to list all of the available options. Below I print a few examples. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data1\n" ] } ], "source": [ "print data1.name" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fastqs : \n", "edits : \n", "clusts : \n", "consens : \n", "outfiles : \n", "\n" ] } ], "source": [ "## another example attribute listing directories\n", "## associated with this object. Most are empty b/c\n", "## we haven't started creating files yet. But you \n", "## can see that it shows the fastq directory. \n", "print data1.dirs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sample Class objects\n", "Sample Class objects correspond to individual samples in your study. They store the file paths pointing to the data that is saved on disk, and they store statistics about the results of each step of the Assembly. Sample class objects are stored inside Assembly class objects, and can be added, removed, or merged with other Sample class objects between different Assemblies. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating Samples\n", "Samples are created during step 1 of the ipyrad Assembly. This involves either demultiplexing raw data files or loading data files that are already demultiplexed. For this example we are loading demultiplexed data files. Because we've already entered the path to our data files in `sorted_fastq_path` of our Assembly object, we can go ahead and run step 1 to create Sample objects that are linked to the data files. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Assembly: data1\n", "[####################] 100% loading reads | 0:00:11 | s1 | \n" ] } ], "source": [ "## run step 1 to create Sample objects\n", "data1.run(\"1\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The `.run()` command\n", "The run function is equivalent to the `-s` argument in the ipyrad command line tool, and tells ipyrad which steps (1-7) of the assembly to run. If a step has already been run on your samples it will be skipped and a warning will be printed. You can force overwriting of existing data using the force flag. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Assembly: data1\n", "[####################] 100% loading reads | 0:00:10 | s1 | \n" ] } ], "source": [ "## The force flag allows you to re-run a step that is already finished\n", "data1.run(\"1\", force=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The run command will automatically parallelize work across all cores of a running ipcluster instance (remember, you should have started this outside of the notebook, or you can start it now). If ipcluster is running on the default profile then ipyrad will detect and use it when the run command is called. However, if you start an ipcluster instance with a specific profile name then you will need to connect to it using the ipyparallel library and then pass the connection client object to ipyrad. I'll show an example of that here. 
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "## this is the explicit way to connect to ipcluster\n", "import ipyparallel\n", "\n", "## connect to a running ipcluster instance\n", "ipyclient = ipyparallel.Client()\n", "\n", "## or, if you used a named profile then enter that\n", "ipyclient = ipyparallel.Client(profile=\"default\")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Assembly: data1\n", "[####################] 100% loading reads | 0:00:10 | s1 | \n" ] } ], "source": [ "## call the run function of ipyrad and pass it the ipyclient\n", "## process that you want the work distributed on.\n", "data1.run(\"1\", ipyclient=ipyclient, force=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Samples stored in an Assembly\n", "You can see below that after step 1 has been run there will be a collection of Sample objects stored in an Assembly that can be accessed from the attribute `.Samples`. They are stored as a dictionary in which the keys are Sample names and the values of the dictionary are the Sample objects. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'29154_superba': ,\n", " '30556_thamno': ,\n", " '30686_cyathophylla': ,\n", " '32082_przewalskii': ,\n", " '33413_thamno': ,\n", " '33588_przewalskii': ,\n", " '35236_rex': ,\n", " '35855_rex': ,\n", " '38362_rex': ,\n", " '39618_rex': ,\n", " '40578_rex': ,\n", " '41478_cyathophylloides': ,\n", " '41954_cyathophylloides': }" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Sample objects stored as a dictionary\n", "data1.samples" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### The progress bar\n", "As you can see running a step of the analysis prints a progress bar similar to what you would see in the *ipyrad* command line tool. There are some differences, however. It shows on the far right \"s1\" to indicate that this was step 1 of the assembly, and it does not print information about our cluster setup (e.g., number of nodes and cores). This was a stylistic choice to provide a cleaner output for analyses inside Jupyter notebooks. You can view the cluster information when running the step functions by adding the argument `show_cluster=True`. \n", "\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "host compute node: [4 cores] on oud\n", "Assembly: data1\n", "[####################] 100% loading reads | 0:00:10 | s1 | \n" ] } ], "source": [ "## run step 1 to create Samples objects\n", "data1.run(\"1\", show_cluster=True, force=True)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Viewing results of Assembly steps\n", "Results for each step are stored in Sample class objects, however, Assembly class objects have functions available for summarizing the stats of all Sample class objects that they contain, which provides an easy way to view results. This includes `.stats` attribute, and the `.stats_dfs` attributes for each step. 
" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " state reads_raw\n", "29154_superba 1 696994\n", "30556_thamno 1 1452316\n", "30686_cyathophylla 1 1253109\n", "32082_przewalskii 1 964244\n", "33413_thamno 1 636625\n", "33588_przewalskii 1 1002923\n", "35236_rex 1 1803858\n", "35855_rex 1 1409843\n", "38362_rex 1 1391175\n", "39618_rex 1 822263\n", "40578_rex 1 1707942\n", "41478_cyathophylloides 1 2199740\n", "41954_cyathophylloides 1 2199613\n" ] } ], "source": [ "## print full stats summary\n", "print data1.stats" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " reads_raw\n", "29154_superba 696994\n", "30556_thamno 1452316\n", "30686_cyathophylla 1253109\n", "32082_przewalskii 964244\n", "33413_thamno 636625\n", "33588_przewalskii 1002923\n", "35236_rex 1803858\n", "35855_rex 1409843\n", "38362_rex 1391175\n", "39618_rex 822263\n", "40578_rex 1707942\n", "41478_cyathophylloides 2199740\n", "41954_cyathophylloides 2199613\n" ] } ], "source": [ "## print full stats for step 1 (in this case it's the same but for other\n", "## steps the stats_dfs often contains more information.)\n", "print data1.stats_dfs.s1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Branching to subsample taxa\n", "Branching in the *ipyrad* API works the same as in the CLI, but in many ways is easier to use because you can access attributes of the Assembly objects much more easily, such as when you want to provide a list of Sample names in order to subsample (exclude samples) during the branching process. Below is an example. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Samples in data1:\n", "30686_cyathophylla\n", "33413_thamno\n", "30556_thamno\n", "32082_przewalskii\n", "29154_superba\n", "41478_cyathophylloides\n", "40578_rex\n", "35855_rex\n", "33588_przewalskii\n", "39618_rex\n", "38362_rex\n", "35236_rex\n", "41954_cyathophylloides\n" ] } ], "source": [ "## access all Sample names in data1\n", "allsamples = data1.samples.keys()\n", "print \"Samples in data1:\\n\", \"\\n\".join(allsamples)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Samples in data2:\n", "30686_cyathophylla\n", "33413_thamno\n", "41478_cyathophylloides\n", "29154_superba\n", "40578_rex\n", "35855_rex\n", "30556_thamno\n", "39618_rex\n", "38362_rex\n", "35236_rex\n", "41954_cyathophylloides\n" ] } ], "source": [ "## Drop the two samples from this list that have \"prz\" in their names.\n", "## This is a programmatic way to remove the outgroup samples.\n", "subs = [i for i in allsamples if \"prz\" not in i]\n", "\n", "## use branching to create new Assembly named 'data2'\n", "## with only Samples whose name is in the subs list\n", "data2 = data1.branch(\"data2\", subsamples=subs)\n", "print \"Samples in data2:\\n\", \"\\n\".join(data2.samples)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Branching to iterate over parameter settings\n", "This is the real bread and butter of the *ipyrad* API. \n", "\n", "You can write simple for-loops using Python code to apply a range of parameter settings to different branched assemblies. 
Furthermore, branching can be done in a way that greatly reduces the amount of computation needed to produce multiple data sets. Essentially, branching allows you to recycle intermediate states that are shared between branched Assemblies. This is particularly useful when assemblies differ by only one or a few parameters that are applied late in the assembly process. To set up efficient branching code in this way requires some prior knowledge about when (at which step) each parameter is applied in ipyrad. That information is available in the documentation (http://ipyrad.readthedocs.io/parameters.html). \n", "\n", "When setting up for-loop routines like the one below it may be helpful to break the script up among multiple cells of a Jupyter notebook so that you can easily restart from one step or another. It may also be useful to subsample your data set to a small number of samples to test the code first, and if all goes well, then proceed with your full data set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### An example to create many assemblies\n", "In the example below we will create 8 complete Assemblies, representing all combinations of two values each for three parameters (filter_adapters, clust_threshold, and min_samples_locus)." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New Assembly: base\n", "Assembly: base\n", "[####################] 100% sorting reads | 0:00:02 | s1 | \n", "[####################] 100% writing/compressing | 0:00:00 | s1 | \n" ] } ], "source": [ "## Start by creating an initial assembly, setting the path to your data, \n", "## and running step 1. I set a project-dir so that all of our data sets\n", "## will be grouped into a single directory called 'branch-test'.\n", "data = ip.Assembly(\"base\")\n", "data.set_params(\"project_dir\", \"branch-test\")\n", "data.set_params(\"raw_fastq_path\", \"./ipsimdata/rad_example_R1_.fastq.gz\")\n", "data.set_params(\"barcodes_path\", \"./ipsimdata/rad_example_barcodes.txt\")\n", "\n", "## step 1: load in the data\n", "data.run('1')" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Assembly: base_f1\n", "[####################] 100% processing reads | 0:00:03 | s2 | \n", "Assembly: base_f1_c85\n", "[####################] 100% dereplicating | 0:00:00 | s3 | | \n", "[####################] 100% clustering | 0:00:01 | s3 | \n", "[####################] 100% building clusters | 0:00:00 | s3 | \n", "[####################] 100% chunking | 0:00:00 | s3 | \n", "[####################] 100% aligning | 0:00:08 | s3 | \n", "[####################] 100% concatenating | 0:00:00 | s3 | \n", "[####################] 100% inferring [H, E] | 0:00:03 | s4 | \n", "[####################] 100% calculating depths | 0:00:00 | s5 | \n", "[####################] 100% chunking clusters | 0:00:00 | s5 | \n", "[####################] 100% consens calling | 0:00:13 | s5 | \n", "[####################] 100% concat/shuffle input | 0:00:00 | s6 | \n", "[####################] 100% clustering across | 0:00:01 | s6 | \n", "[####################] 100% building clusters | 0:00:00 | s6 | \n", "[####################] 100% aligning clusters | 0:00:04 | s6 | \n", "[####################] 100% database indels | 0:00:00 | s6 | \n", "[####################] 100% indexing clusters | 0:00:01 | s6 | \n", "[####################] 100% building database | 0:00:00 | s6 | \n", "Assembly: 
base_f1_c85_m4\n", "[####################] 100% filtering loci | 0:00:06 | s7 | \n", "[####################] 100% building loci/stats | 0:00:00 | s7 | \n", "[####################] 100% building vcf file | 0:00:03 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:00 | s7 | \n", "[####################] 100% writing outfiles | 0:00:00 | s7 | \n", "Outfiles written to: ~/Documents/ipyrad/tests/branch-test/base_f1_c85_m4_outfiles\n", "\n", "Assembly: base_f1_c85_m12\n", "[####################] 100% filtering loci | 0:00:00 | s7 | \n", "[####################] 100% building loci/stats | 0:00:00 | s7 | \n", "[####################] 100% building vcf file | 0:00:01 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:00 | s7 | \n", "[####################] 100% writing outfiles | 0:00:00 | s7 | \n", "Outfiles written to: ~/Documents/ipyrad/tests/branch-test/base_f1_c85_m12_outfiles\n", "\n", "Assembly: base_f1_c90\n", "[####################] 100% dereplicating | 0:00:00 | s3 | | \n", "[####################] 100% clustering | 0:00:01 | s3 | \n", "[####################] 100% building clusters | 0:00:00 | s3 | \n", "[####################] 100% chunking | 0:00:00 | s3 | \n", "[####################] 100% aligning | 0:00:09 | s3 | \n", "[####################] 100% concatenating | 0:00:00 | s3 | \n", "[####################] 100% inferring [H, E] | 0:00:03 | s4 | \n", "[####################] 100% calculating depths | 0:00:00 | s5 | \n", "[####################] 100% chunking clusters | 0:00:00 | s5 | \n", "[####################] 100% consens calling | 0:00:13 | s5 | \n", "[####################] 100% concat/shuffle input | 0:00:00 | s6 | \n", "[####################] 100% clustering across | 0:00:01 | s6 | \n", "[####################] 100% building clusters | 0:00:00 | s6 | \n", "[####################] 100% aligning clusters | 0:00:04 | s6 | \n", "[####################] 100% database indels | 0:00:00 | s6 | \n", "[####################] 100% indexing clusters | 0:00:00 | s6 | \n", "[####################] 100% building database | 0:00:00 | s6 | \n", "Assembly: base_f1_c90_m4\n", "[####################] 100% filtering loci | 0:00:00 | s7 | \n", "[####################] 100% building loci/stats | 0:00:00 | s7 | \n", "[####################] 100% building vcf file | 0:00:01 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:00 | s7 | \n", "[####################] 100% writing outfiles | 0:00:00 | s7 | \n", "Outfiles written to: ~/Documents/ipyrad/tests/branch-test/base_f1_c90_m4_outfiles\n", "\n", "Assembly: base_f1_c90_m12\n", "[####################] 100% filtering loci | 0:00:00 | s7 | \n", "[####################] 100% building loci/stats | 0:00:00 | s7 | \n", "[####################] 100% building vcf file | 0:00:01 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:00 | s7 | \n", "[####################] 100% writing outfiles | 0:00:00 | s7 | \n", "Outfiles written to: ~/Documents/ipyrad/tests/branch-test/base_f1_c90_m12_outfiles\n", "\n", "Assembly: base_f2\n", "[####################] 100% processing reads | 0:00:04 | s2 | \n", "Assembly: base_f2_c85\n", "[####################] 100% dereplicating | 0:00:00 | s3 | | \n", "[####################] 100% clustering | 0:00:00 | s3 | \n", 
"[####################] 100% building clusters | 0:00:00 | s3 | \n", "[####################] 100% chunking | 0:00:00 | s3 | \n", "[####################] 100% aligning | 0:00:09 | s3 | \n", "[####################] 100% concatenating | 0:00:00 | s3 | \n", "[####################] 100% inferring [H, E] | 0:00:03 | s4 | \n", "[####################] 100% calculating depths | 0:00:00 | s5 | \n", "[####################] 100% chunking clusters | 0:00:00 | s5 | \n", "[####################] 100% consens calling | 0:00:13 | s5 | \n", "[####################] 100% concat/shuffle input | 0:00:00 | s6 | \n", "[####################] 100% clustering across | 0:00:01 | s6 | \n", "[####################] 100% building clusters | 0:00:00 | s6 | \n", "[####################] 100% aligning clusters | 0:00:04 | s6 | \n", "[####################] 100% database indels | 0:00:00 | s6 | \n", "[####################] 100% indexing clusters | 0:00:00 | s6 | \n", "[####################] 100% building database | 0:00:00 | s6 | \n", "Assembly: base_f2_c85_m4\n", "[####################] 100% filtering loci | 0:00:00 | s7 | \n", "[####################] 100% building loci/stats | 0:00:00 | s7 | \n", "[####################] 100% building vcf file | 0:00:01 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:00 | s7 | \n", "[####################] 100% writing outfiles | 0:00:00 | s7 | \n", "Outfiles written to: ~/Documents/ipyrad/tests/branch-test/base_f2_c85_m4_outfiles\n", "\n", "Assembly: base_f2_c85_m12\n", "[####################] 100% filtering loci | 0:00:00 | s7 | \n", "[####################] 100% building loci/stats | 0:00:00 | s7 | \n", "[####################] 100% building vcf file | 0:00:01 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:00 | s7 | \n", "[####################] 100% writing outfiles | 0:00:00 | s7 | \n", "Outfiles written to: ~/Documents/ipyrad/tests/branch-test/base_f2_c85_m12_outfiles\n", "\n", "Assembly: base_f2_c90\n", "[####################] 100% dereplicating | 0:00:00 | s3 | | \n", "[####################] 100% clustering | 0:00:00 | s3 | \n", "[####################] 100% building clusters | 0:00:00 | s3 | \n", "[####################] 100% chunking | 0:00:00 | s3 | \n", "[####################] 100% aligning | 0:00:09 | s3 | \n", "[####################] 100% concatenating | 0:00:00 | s3 | \n", "[####################] 100% inferring [H, E] | 0:00:03 | s4 | \n", "[####################] 100% calculating depths | 0:00:00 | s5 | \n", "[####################] 100% chunking clusters | 0:00:00 | s5 | \n", "[####################] 100% consens calling | 0:00:14 | s5 | \n", "[####################] 100% concat/shuffle input | 0:00:00 | s6 | \n", "[####################] 100% clustering across | 0:00:01 | s6 | \n", "[####################] 100% building clusters | 0:00:00 | s6 | \n", "[####################] 100% aligning clusters | 0:00:04 | s6 | \n", "[####################] 100% database indels | 0:00:00 | s6 | \n", "[####################] 100% indexing clusters | 0:00:00 | s6 | \n", "[####################] 100% building database | 0:00:00 | s6 | \n", "Assembly: base_f2_c90_m4\n", "[####################] 100% filtering loci | 0:00:00 | s7 | \n", "[####################] 100% building loci/stats | 0:00:00 | s7 | \n", "[####################] 100% building vcf file | 0:00:01 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", 
"[####################] 100% building arrays | 0:00:00 | s7 | \n", "[####################] 100% writing outfiles | 0:00:00 | s7 | \n", "Outfiles written to: ~/Documents/ipyrad/tests/branch-test/base_f2_c90_m4_outfiles\n", "\n", "Assembly: base_f2_c90_m12\n", "[####################] 100% filtering loci | 0:00:00 | s7 | \n", "[####################] 100% building loci/stats | 0:00:00 | s7 | \n", "[####################] 100% building vcf file | 0:00:01 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:00 | s7 | \n", "[####################] 100% writing outfiles | 0:00:00 | s7 | \n", "Outfiles written to: ~/Documents/ipyrad/tests/branch-test/base_f2_c90_m12_outfiles\n", "\n" ] } ], "source": [ "## let's create a dictionary to hold the finished assemblies\n", "adict = {}\n", "\n", "## iterate over parameters settings creating a new named assembly\n", "for filter_setting in [1, 2]:\n", " ## create a new name for the assembly and branch\n", " newname = data.name + \"_f{}\".format(filter_setting)\n", " child1 = data.branch(newname)\n", " child1.set_params(\"filter_adapters\", filter_setting)\n", " child1.run(\"2\")\n", " \n", " ## iterate over clust thresholds\n", " for clust_threshold in ['0.85', '0.90']:\n", " newname = child1.name + \"_c{}\".format(clust_threshold[2:])\n", " child2 = child1.branch(newname)\n", " child2.set_params(\"clust_threshold\", clust_threshold)\n", " child2.run(\"3456\")\n", " \n", " ## iterate over min_sample coverage\n", " for min_samples_locus in [4, 12]:\n", " newname = child2.name + \"_m{}\".format(min_samples_locus)\n", " child3 = child2.branch(newname)\n", " child3.set_params(\"min_samples_locus\", min_samples_locus)\n", " child3.run(\"7\")\n", " \n", " ## store the complete assembly in the dictionary by its name\n", " ## so it is easy for us to access and retrieve, since we wrote\n", " ## over the variable name 'child' during the loop. You can do\n", " ## this using dictionaries, lists, etc., or, as you'll see below,\n", " ## we can use the 'load_json()' command to load a finished assembly\n", " ## from its saved file object.\n", " adict[newname] = child3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with your data programmatically\n", "A key benefit of using the ipyrad API is that all of the statistics of your analysis are more-or-less accessible through the Assembly and Sample objects. For example, if you want to examine how different minimum depth setting affect your heterozygosity estimates, then you can create two separate branches with different parameter values and access the heterozygosity estimates from the .stats attributes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Advanced branching and merging\n", "If you wanted to apply a set of parameters to only a subset of your Samples during part of the assembly you can do so easily with branching and merging. In the example below I create two new branches from the Assembly before the base-calling steps, where each Assembly selects a different subset of the samples. Then I run steps 4 and 5 with a different set of parameters applied to each, so that one makes haploid base calls and the other makes diploid base calls. Then I merged the Assemblies back together so that all Samples are assembled together in steps 6 and 7. 
" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Assembly: base\n", "[force] overwriting fastq files previously created by ipyrad.\n", "This _does not_ affect your original/raw data files.\n", "[####################] 100% sorting reads | 0:00:02 | s1 | \n", "[####################] 100% writing/compressing | 0:00:00 | s1 | \n", "[####################] 100% processing reads | 0:00:03 | s2 | \n", "[####################] 100% dereplicating | 0:00:00 | s3 | | \n", "[####################] 100% clustering | 0:00:01 | s3 | \n", "[####################] 100% building clusters | 0:00:00 | s3 | \n", "[####################] 100% chunking | 0:00:00 | s3 | \n", "[####################] 100% aligning | 0:00:10 | s3 | \n", "[####################] 100% concatenating | 0:00:00 | s3 | \n", "Assembly: data1\n", "[####################] 100% inferring [H, E] | 0:00:01 | s4 | \n", "[####################] 100% calculating depths | 0:00:00 | s5 | \n", "[####################] 100% chunking clusters | 0:00:00 | s5 | \n", "[####################] 100% consens calling | 0:00:04 | s5 | \n", "Assembly: data2\n", "Applying haploid-based test (infer E with H fixed to 0)\n", "[####################] 100% inferring [H, E] | 0:00:01 | s4 | \n", "[####################] 100% calculating depths | 0:00:00 | s5 | \n", "[####################] 100% chunking clusters | 0:00:00 | s5 | \n", "[####################] 100% consens calling | 0:00:04 | s5 | \n", "Assembly: data3\n", "[####################] 100% concat/shuffle input | 0:00:00 | s6 | \n", "[####################] 100% clustering across | 0:00:01 | s6 | \n", "[####################] 100% building clusters | 0:00:00 | s6 | \n", "[####################] 100% aligning clusters | 0:00:02 | s6 | \n", "[####################] 100% database indels | 0:00:00 | s6 | \n", "[####################] 100% indexing clusters | 0:00:00 | s6 | \n", "[####################] 100% building database | 0:00:00 | s6 | \n", "[####################] 100% filtering loci | 0:00:00 | s7 | \n", "[####################] 100% building loci/stats | 0:00:00 | s7 | \n", "[####################] 100% building vcf file | 0:00:01 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:00 | s7 | \n", "[####################] 100% writing outfiles | 0:00:00 | s7 | \n", "Outfiles written to: ~/Documents/ipyrad/tests/branch-test/data3_outfiles\n", "\n" ] } ], "source": [ "## run an assembly up to step 3\n", "data.run(\"123\", force=True)\n", "\n", "## select clade 1 from the sample names\n", "subs = [i for i in data.samples if \"1\" in i]\n", "\n", "## branch selecting only those samples\n", "data1 = data.branch(\"data1\", subs)\n", "\n", "## select clade 2 from the sample names\n", "subs = [i for i in data.samples if \"2\" in i]\n", "\n", "## branch selecting only those samples\n", "data2 = data.branch(\"data2\", subs)\n", "\n", "## make diploid base calls on 'data1' samples\n", "data1.set_params(\"max_alleles_consens\", 2)\n", "\n", "## make haploid base calls on 'data2' samples\n", "data2.set_params(\"max_alleles_consens\", 1)\n", "\n", "## run both assemblies through base-calling steps\n", "data1.run(\"45\", force=True)\n", "data2.run(\"45\", force=True)\n", "\n", "## merge assemblies back together for across-sample steps\n", "data3 = ip.merge(\"data3\", [data1, data2])\n", "data3.run(\"67\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 
Population assignments\n", "You can easily make population assignments in ipyrad using Python dictionaries. This is useful for applying min_samples_locus filters to different groups of samples to maximize data that is shared across all of your samples. For example, if we wanted to ensure that every locus had data that was shared across both clades in our data set then we would set a min_samples_locus value of 1 for each clade. You can see below that I use a list comprehension to select all samples in each clade based on the presence of characters in their names that define them (i.e., the presence of \"1\" for all samples in clade 1). When possible, this makes group assignments much easier than having to write every sample name by hand. " ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'clade1': (1, ['1B_0', '1A_0', '1C_0', '1D_0']),\n", " 'clade2': (1, ['2E_0', '2G_0', '2F_0', '2H_0'])}" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## create a branch for a population-filtered assembly\n", "pops = data3.branch(\"populations\")\n", "\n", "## assign samples to populations\n", "pops.populations = {\n", "    \"clade1\": (1, [i for i in pops.samples if \"1\" in i]),\n", "    \"clade2\": (1, [i for i in pops.samples if \"2\" in i]),\n", "}\n", "\n", "## print the population dictionary\n", "pops.populations" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Assembly: populations\n", "[####################] 100% filtering loci | 0:00:00 | s7 | \n", "[####################] 100% building loci/stats | 0:00:00 | s7 | \n", "[####################] 100% building vcf file | 0:00:01 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:00 | s7 | \n", "[####################] 100% writing outfiles | 0:00:00 | s7 | \n", "Outfiles written to: ~/Documents/ipyrad/tests/branch-test/populations_outfiles\n", "\n" ] } ], "source": [ "## run assembly\n", "pops.run(\"7\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving Assembly objects\n", "Assembly objects (and the Sample objects they contain) are automatically saved each time that you use the `.run()` function. However, you can also save at any time by calling the `.save()` function of an Assembly object, which updates the JSON file. Additionally, Assembly objects have a function called `.write_params()` which can be invoked to create a params file for use by the *ipyrad* command line tool. 
" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loading Assembly: data1\n", "from saved path: ~/Documents/ipyrad/tests/pedicularis/data1.json\n" ] } ], "source": [ "## save assembly object (also auto-saves after every run() command)\n", "data1.save()\n", "\n", "## load assembly object\n", "data1 = ip.load_json(\"pedicularis/data1.json\")\n", "\n", "## write params file for use by the CLI\n", "data1.write_params(force=True)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }