{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Cookbook for running BUCKy in parallel in a Jupyter notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook uses the *Pedicularis* example data set from the first empirical ipyrad tutorial. Here I show how to run BUCKy on a large set of loci parsed from the output file with the `.alleles.loci` ending. All code in this notebook is Python. You can simply follow along and execute this same code in a Jupyter notebook of your own. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Software requirements for this notebook\n", "All required software can be installed through conda by running the commented out code below in a terminal. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "## conda install -c BioBuilds mrbayes\n", "## conda install -c ipyrad ipyrad\n", "## conda install -c ipyrad bucky" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "## import Python libraries\n", "import ipyrad.analysis as ipa\n", "import ipyparallel as ipp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cluster setup\n", "To execute code in parallel we will use the `ipyparallel` Python library. A quick guide to starting a parallel cluster locally can be found [here](link), and instructions for setting up a remote cluster on a HPC cluster is available [here](http://ipyrad.readthedocs.io/HPC_Tunnel.html). In either case, this notebook assumes you have started an `ipcluster` instance that this notebook can find, which the cell below will test. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4 engines found\n" ] } ], "source": [ "## look for running ipcluster instance, and create load-balancer\n", "ipyclient = ipp.Client()\n", "print \"{} engines found\".format(len(ipyclient))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a bucky analysis object\n", "The two required arguments are the `name` and `data` arguments. The `data` argument should be a .loci file or a .alleles.loci file. The name will be used to name output files, which will be written to `{workdir}/{name}/{number}.nexus`. Bucky doesn't deal well with missing data, so loci will only be included if they contain data for all samples in the analysis. By default, all samples found in the loci file will be used, unless you enter a list of names (the `samples` argument) to subsample taxa, which we do here. It is best to select one individual per species or subspecies. You can set a number of additional parameters in the `.params` dictionary. Here I use the `maxloci` argument to limit the total number of loci so that the example analysis will finish faster. But in practice, BUCKy runs quite fast and I would typically just use all of your loci in a real analysis. 
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "## make a list of sample names you wish to include in your BUCKy analysis \n", "samples = [\n", " \"29154_superba\", \n", " \"30686_cyathophylla\", \n", " \"41478_cyathophylloides\", \n", " \"33413_thamno\", \n", " \"30556_thamno\",\n", " \"35236_rex\",\n", " \"40578_rex\", \n", " \"38362_rex\", \n", " \"33588_przewalskii\",\n", "]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "## initiate a bucky object\n", "c = ipa.bucky(\n", " name=\"buckytest\",\n", " data=\"analysis-ipyrad/pedic_outfiles/pedic.alleles.loci\",\n", " workdir=\"analysis-bucky\",\n", " samples=samples,\n", " minsnps=0,\n", " maxloci=100,\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bucky_alpha [0.1, 1.0, 10.0] \n", "bucky_nchains 4 \n", "bucky_niter 1000000 \n", "bucky_nreps 4 \n", "maxloci 100 \n", "mb_mcmc_burnin 100000 \n", "mb_mcmc_ngen 1000000 \n", "mb_mcmc_sample_freq 1000 \n", "minsnps 0 \n", "seed 416119234 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## print the params dictionary\n", "c.params" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write data to nexus files\n", "As you will see below, one step of this analysis is to convert the data into nexus files with a mrbayes code block. Let's run that step quickly here just to see what the converted files look like. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "wrote 100 nexus files to ~/Documents/ipyrad/tests/analysis-bucky/buckytest\n" ] } ], "source": [ "## This will write nexus files to {workdir}/{name}/[number].nex\n", "c.write_nexus_files(force=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### An example nexus file" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#NEXUS\r\n", "begin data;\r\n", "dimensions ntax=9 nchar=64;\r\n", "format datatype=dna interleave=yes gap=- missing=N;\r\n", "matrix\r\n", "30686_cyathophylla TCCTCGGCAGCCATTAGACCGGTGGAATATGCACCATGTACCGATCCTGGATAATCAAAACTCG\r\n", "33413_thamno TCCTCGGCAGCCATTAAACCGGTGGAGTATGCACCATGTACTGATCCTGGATAATCAAAACTTG\r\n", "33588_przewalskii TCCTCGGCAGCCATTAGACCGGTGGAGTGTGCACCATGCACCGATCCCGGATAATCAAAACTCG\r\n", "29154_superba TCCTCGGCAGCCATTAGACCGGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTCG\r\n", "41478_cyathophylloides TCCTCGGCAGCCATTAGACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTCG\r\n", "40578_rex TCCTCGGCAGCCATTAAACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG\r\n", "30556_thamno TCCTCGGCAGCCATTAAACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG\r\n", "38362_rex TCCTCGGCAGCCATTAAACCGGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG\r\n", "35236_rex TCCTCGGCAGCCATTAAACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG\r\n", "\r\n", " ;\r\n", "\r\n", "begin mrbayes;\r\n", "set autoclose=yes nowarn=yes;\r\n", "lset nst=6 rates=gamma;\r\n", "mcmc ngen=1000000 samplefreq=1000 printfreq=1000000;\r\n", "sump burnin=100000;\r\n", "sumt burnin=100000;\r\n", "end;\r\n" ] } ], "source": [ "## print an example nexus file\n", "! cat analysis-bucky/buckytest/1.nex" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Complete a BUCKy analysis\n", "There are four parts to a full BUCKy analysis. The first is converting the data into nexus files. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Complete a BUCKy analysis\n", "There are four parts to a full BUCKy analysis. The first is converting the data into nexus files, which we just did above. The remaining three are `.run_mrbayes()`, `.run_mbsum()`, and `.run_bucky()`, each of which uses the files produced by the step before it. You can use the `force` flag to overwrite existing files, and you should provide an `ipyclient` to distribute the jobs in parallel. Parallelization matters most for the MrBayes analyses, where more cores give approximately linear speed improvements. Calling `.run()` on a bucky analysis object performs all four steps in order. See the end of the notebook for results. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "wrote 100 nexus files to ~/Documents/ipyrad/tests/analysis-bucky/buckytest\n", "[####################] 100% [mb] infer gene-tree posteriors | 0:41:56 | \n", "[####################] 100% [mbsum] sum replicate runs | 0:00:02 | \n", "[####################] 100% [bucky] infer CF posteriors | 0:25:06 | \n" ] } ], "source": [ "## run the complete analysis\n", "c.run(force=True, ipyclient=ipyclient)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Alternatively, you can run each step separately" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "wrote 100 nexus files to ~/Documents/ipyrad/tests/analysis-bucky/buckytest\n" ] } ], "source": [ "## (1) This will write nexus files to {workdir}/{name}/{number}.nex\n", "c.write_nexus_files(force=True)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[####################] 100% [mb] infer gene-tree posteriors | 0:47:06 | \n" ] } ], "source": [ "## (2) distributes mrbayes jobs across the parallel client\n", "c.run_mrbayes(force=True, ipyclient=ipyclient)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[####################] 100% [mbsum] sum replicate runs | 0:00:07 | \n" ] } ], "source": [ "## (3) this step is fast, simply summing the gene-tree posteriors\n", "c.run_mbsum(force=True, ipyclient=ipyclient)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[####################] 100% [bucky] infer CF posteriors | 1:35:16 | \n" ] } ], "source": [ "## (4) infer concordance factors with BUCKy; this runs one job in\n", "## parallel for each alpha value in c.params.bucky_alpha\n", "c.run_bucky(force=True, ipyclient=ipyclient)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convenient access to results\n", "View the results in the file `{workdir}/{name}/CF-a{alpha}.concordance`. We haven't yet developed further ipyrad tools for parsing these results, but hope to do so in the future. The main results of interest are typically the Primary Concordance Tree and the Splits in the Primary Concordance Tree. " ] },
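{ "cell_type": "markdown", "metadata": {}, "source": [ "In the meantime, a few lines of Python are enough to pull the primary concordance tree out of a results file. The cell below is a simple string-parsing sketch based on the file layout shown underneath, not a built-in ipyrad function. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## sketch: grab the newick tree that follows the 'Primary Concordance\n", "## Tree Topology:' header (simple parsing, not an ipyrad function)\n", "with open(\"analysis-bucky/buckytest/CF-a1.0.concordance\") as infile:\n", "    lines = infile.readlines()\n", "for i, line in enumerate(lines):\n", "    if line.startswith(\"Primary Concordance Tree Topology:\"):\n", "        print(lines[i + 1].strip())" ] },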
" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "translate\r\n", " 1 30686_cyathophylla,\r\n", " 2 33413_thamno,\r\n", " 3 33588_przewalskii,\r\n", " 4 29154_superba,\r\n", " 5 41478_cyathophylloides,\r\n", " 6 40578_rex,\r\n", " 7 30556_thamno,\r\n", " 8 38362_rex,\r\n", " 9 35236_rex;\r\n", "\r\n", "Population Tree:\r\n", "(((((((1,4),5),3),2),8),6),7,9);\r\n", "\r\n", "Primary Concordance Tree Topology:\r\n", "(((((1,4),5),3),7),((2,8),6),9);\r\n", "\r\n", "Population Tree, With Branch Lengths In Estimated Coalescent Units:\r\n", "(((((((1:10.000,4:10.000):0.250,5:10.000):0.763,3:10.000):1.798,2:10.000):0.116,8:10.000):0.044,6:10.000):0.000,7:10.000,9:10.000);\r\n", "\r\n", "Primary Concordance Tree with Sample Concordance Factors:\r\n", "(((((1:1.000,4:1.000):0.465,5:1.000):0.578,3:1.000):0.754,7:1.000):0.255,((2:1.000,8:1.000):0.231,6:1.000):0.210,9:1.000);\r\n", "\r\n", "Four-way partitions in the Population Tree: sample-wide CF, coalescent units and Ties(if present)\r\n", "{1; 4|2,3,6,7,8,9; 5}\t0.481, 0.250, \r\n", "{1,4; 5|2,6,7,8,9; 3}\t0.689, 0.763, \r\n", "{1,4,5; 3|2; 6,7,8,9}\t0.890, 1.798, \r\n", "{1,2,3,4,5,8; 6|7; 9}\t0.327, 0.000, \r\n", "{1,3,4,5; 2|6,7,9; 8}\t0.406, 0.116, \r\n", "{1,2,3,4,5; 8|6; 7,9}\t0.362, 0.044, \r\n", "\r\n", "Splits in the Primary Concordance Tree: sample-wide and genome-wide mean CF (95% credibility), SD of mean sample-wide CF across runs\r\n", "{1,3,4,5|2,6,7,8,9} 0.754(0.630,0.830) 0.746(0.596,0.862)\t0.041\r\n", "{1,4,5|2,3,6,7,8,9} 0.578(0.470,0.690) 0.572(0.425,0.718)\t0.032\r\n", "{1,4|2,3,5,6,7,8,9} 0.465(0.310,0.600) 0.461(0.276,0.631)\t0.059\r\n", "{1,3,4,5,7|2,6,8,9} 0.255(0.110,0.400) 0.252(0.092,0.423)\t0.056\r\n", "{1,3,4,5,6,7,9|2,8} 0.231(0.040,0.340) 0.229(0.032,0.381)\t0.069\r\n", "{1,3,4,5,7,9|2,6,8} 0.210(0.070,0.350) 0.208(0.059,0.374)\t0.030\r\n", "\r\n", "Splits NOT in the Primary Concordance Tree but with estimated CF > 0.050:\r\n", "{1,5|2,3,4,6,7,8,9} 0.306(0.140,0.470) 0.304(0.122,0.503)\t0.074\r\n", "{1,2,3,4,5|6,7,8,9} 0.250(0.150,0.390) 0.248(0.121,0.410)\t0.030\r\n", "{1,2,3,4,5,8|6,7,9} 0.194(0.050,0.360) 0.192(0.027,0.389)\t0.095\r\n", "{1,2,3,4,5,7,8|6,9} 0.174(0.000,0.390) 0.173(0.000,0.409)\t0.117\r\n", "{1,2,3,4,5,6|7,8,9} 0.160(0.000,0.420) 0.158(0.000,0.441)\t0.138\r\n", "{1,2,5,6,7,8,9|3,4} 0.146(0.020,0.280) 0.145(0.009,0.300)\t0.032\r\n", "{1,2,3,4,5,6,9|7,8} 0.140(0.040,0.340) 0.139(0.029,0.341)\t0.040\r\n", "{1,3,4,5,9|2,6,7,8} 0.139(0.060,0.260) 0.138(0.038,0.283)\t0.051\r\n", "{1,2,3,4,5,9|6,7,8} 0.133(0.070,0.230) 0.132(0.045,0.255)\t0.018\r\n", "{1,2,3,4,5,8,9|6,7} 0.130(0.000,0.450) 0.130(0.000,0.467)\t0.093\r\n" ] } ], "source": [ "## print first 50 lines of a results files\n", "! head -n 50 analysis-bucky/buckytest/CF-a1.0.concordance" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }