{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Eaton & Ree (2013) single-end RAD data set\n", "Here we demonstrate a de novo assembly for an empirical RAD data set using the ipyrad Python API. This example was run on a workstation with 20 cores and takes about 10 minutes to assemble, but you should be able to run it on a 4-core laptop in ~30-60 minutes.\n", "\n", "For our example data we will use the 13-taxon *Pedicularis* data set from [Eaton and Ree (2013) (Open Access)](https://academic.oup.com/sysbio/article/62/5/689/1684460/Inferring-Phylogeny-and-Introgression-using-RADseq). This data set is composed of single-end 75bp reads from a RAD-seq library prepared with the PstI enzyme. The data set also serves as an example for several of our [analysis cookbooks](http://ipyrad.readthedocs.io/analysis.html) that demonstrate methods for analyzing RAD-seq data files. At the end of this notebook there are also several examples of how to use the ipyrad analysis tools to run downstream analyses in parallel. \n", "\n", "The figure below shows the ingroup taxa from this study and their sampling locations. The study includes all species within a small monophyletic clade of *Pedicularis*, including multiple individuals from 5 species and several subspecies, as well as an outgroup species. The sampling essentially spans from population-level variation, where species boundaries are unclear, to higher-level divergence, where species boundaries are quite distinct. This is a common scale at which RAD-seq data are often very useful. \n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup (software and data files)\n", "If you haven't done so yet, start by installing `ipyrad` using conda (see the ipyrad installation instructions) as well as the packages in the cell below. This is easiest to do in a terminal. Then open a jupyter notebook, like this one, and follow along with the tutorial by copying and executing the code in the cells, and adding your own documentation between them using markdown. Feel free to modify parameters to see their effects on the downstream results. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## conda install ipyrad -c ipyrad\n", "## conda install toytree -c eaton-lab\n", "## conda install sra-tools -c bioconda\n", "## conda install entrez-direct -c bioconda" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## imports\n", "import ipyrad as ip\n", "import ipyrad.analysis as ipa\n", "import ipyparallel as ipp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In contrast to the ipyrad CLI, the ipyrad API gives users much more fine-scale control over the parallelization of their analysis, but this also requires learning a little bit about the library that we use to do this, called `ipyparallel`. This library is designed for use with jupyter notebooks to allow massive-scale multi-processing while working *interactively*. \n", "\n", "Understanding the nuts and bolts of it might take a little while, but it is fairly easy to get started using it, especially in the way it is integrated with ipyrad. To start a parallel client you must run the command-line program '`ipcluster`'. This will essentially start a number of independent Python processes (kernels) to which we can then send bits of work. 
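\n", "\n", "For example, once the cluster is running you can confirm that the engines respond by sending them a trivial bit of work (a minimal sketch added here for illustration; it is not part of the original tutorial):\n", "\n", "```python\n", "import ipyparallel as ipp\n", "ipyclient = ipp.Client()\n", "## map a simple function across all engines and print the result\n", "print(ipyclient[:].map_sync(lambda x: x ** 2, range(10)))\n", "```\n", "\n", "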
The cluster can be stopped and restarted independently of this notebook, which is convenient for working on a cluster where connecting to many cores is not always immediately available. \n", "\n", "Open a terminal and type the following command to start an ipcluster instance with N engines. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## ipcluster start --n=20" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## connect to cluster\n", "ipyclient = ipp.Client()\n", "ipyclient.ids" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download the data set (Pedicularis)\n", "These data are archived on the NCBI sequence read archive (SRA) under accession ID SRP021469. As part of the ipyrad analysis tools we have a wrapper around the sra-tools software that can be used to query NCBI and download sequence data based on accession IDs. Run the code below to download the fastq data files associated with this study. The data will be saved to the specified directory, which will be created if it does not already exist. The compressed file size of the data is a little over 1GB. If you pass your ipyclient to the `.run()` command below then the download will be parallelized. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[####################] 100% Downloading fastq files | 0:01:19 | \n", "13 fastq files downloaded to /home/deren/Documents/ipyrad/tests/fastqs-Ped\n" ] } ], "source": [ "## download the Pedicularis data set from NCBI\n", "sra = ipa.sratools(accession=\"SRP021469\", workdir=\"fastqs-Ped\")\n", "sra.run(force=True, ipyclient=ipyclient)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create an Assembly object\n", "This object stores the parameters of the assembly and the organization of data files. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New Assembly: pedicularis\n" ] } ], "source": [ "## you must provide a name for the Assembly\n", "data = ip.Assembly(\"pedicularis\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set parameters for the Assembly. This will raise an error if any of the parameters are not allowed because they are the wrong type or outside the allowed range. 
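\n", "\n", "For example (a minimal sketch added for illustration; not part of the original tutorial), assigning a value outside the allowed range should be rejected with an informative error:\n", "\n", "```python\n", "## a throwaway Assembly just to demonstrate parameter checking\n", "tmp = ip.Assembly(\"paramtest\")\n", "try:\n", "    tmp.set_params(\"clust_threshold\", 2.0)   ## out of range: must be <= 1.0\n", "except Exception as err:\n", "    print(err)\n", "```\n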
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 assembly_name pedicularis \n", "1 project_dir ./analysis-ipyrad \n", "2 raw_fastq_path \n", "3 barcodes_path \n", "4 sorted_fastq_path ./example_empirical_rad/*.fastq.gz \n", "5 assembly_method denovo \n", "6 reference_sequence \n", "7 datatype rad \n", "8 restriction_overhang ('TGCAG', '') \n", "9 max_low_qual_bases 5 \n", "10 phred_Qscore_offset 33 \n", "11 mindepth_statistical 6 \n", "12 mindepth_majrule 6 \n", "13 maxdepth 10000 \n", "14 clust_threshold 0.9 \n", "15 max_barcode_mismatch 0 \n", "16 filter_adapters 2 \n", "17 filter_min_trim_len 35 \n", "18 max_alleles_consens 2 \n", "19 max_Ns_consens (5, 5) \n", "20 max_Hs_consens (5, 5) \n", "21 min_samples_locus 4 \n", "22 max_SNPs_locus (20, 20) \n", "23 max_Indels_locus (8, 8) \n", "24 max_shared_Hs_locus 0.5 \n", "25 trim_reads (0, 0, 0, 0) \n", "26 trim_loci (0, 5, 0, 0) \n", "27 output_formats ('p', 's', 'v', 'n', 'k', 'u') \n", "28 pop_assign_file \n" ] } ], "source": [ "## set parameters\n", "data.set_params(\"project_dir\", \"analysis-ipyrad\")\n", "data.set_params(\"sorted_fastq_path\", \"fastqs-Ped/*.fastq.gz\")\n", "data.set_params(\"clust_threshold\", \"0.90\")\n", "data.set_params(\"filter_adapters\", \"2\")\n", "data.set_params(\"max_Hs_consens\", (5, 5))\n", "data.set_params(\"trim_loci\", (0, 5, 0, 0))\n", "data.set_params(\"output_formats\", \"psvnkua\")\n", "\n", "## see/print all parameters\n", "data.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assemble the data set\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Assembly: pedicularis\n", "[####################] 100% loading reads | 0:00:04 | s1 | \n", "[####################] 100% processing reads | 0:01:14 | s2 | \n" ] } ], "source": [ "## run steps 1 & 2 of the assembly\n", "data.run(\"12\")" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " state reads_raw reads_passed_filter\n", "29154_superba 2 696994 689996\n", "30556_thamno 2 1452316 1440314\n", "30686_cyathophylla 2 1253109 1206947\n", "32082_przewalskii 2 964244 955480\n", "33413_thamno 2 636625 626084\n", "33588_przewalskii 2 1002923 993873\n", "35236_rex 2 1803858 1787366\n", "35855_rex 2 1409843 1397068\n", "38362_rex 2 1391175 1379626\n", "39618_rex 2 822263 813990\n", "40578_rex 2 1707942 1695523\n", "41478_cyathophylloides 2 2199740 2185364\n", "41954_cyathophylloides 2 2199613 2176210" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## access the stats of the assembly (so far) from the .stats attribute\n", "data.stats" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Assembly: pedicularis\n", "[####################] 100% dereplicating | 0:00:07 | s3 | \n", "[####################] 100% clustering | 0:05:08 | s3 | \n", "[####################] 100% building clusters | 0:00:26 | s3 | \n", "[####################] 100% chunking | 0:00:05 | s3 | \n", "[####################] 100% aligning | 0:03:06 | s3 | \n", "[####################] 100% concatenating | 0:00:15 | s3 | \n", "[####################] 100% inferring [H, E] | 0:01:10 | s4 | \n", "[####################] 100% calculating depths | 0:00:06 | s5 | \n", "[####################] 100% chunking clusters | 0:00:06 | s5 | \n", "[####################] 100% consens calling | 0:01:56 | s5 | \n", "[####################] 100% concat/shuffle input | 0:00:05 | s6 | \n", "[####################] 100% clustering across | 0:03:47 | s6 | \n", "[####################] 100% building clusters | 0:00:06 | s6 | \n", "[####################] 100% aligning clusters | 0:00:29 | s6 | \n", "[####################] 100% database indels | 0:00:14 | s6 | \n", "[####################] 100% indexing clusters | 0:00:03 | s6 | \n", "[####################] 100% building database | 0:00:29 | s6 | \n" ] } ], "source": [ "## run steps 3-6 of the assembly\n", "data.run(\"3456\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Branch to create several final data sets with different parameter settings\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Assembly: min4\n", "[####################] 100% filtering loci | 0:00:08 | s7 | \n", "[####################] 100% building loci/stats | 0:00:01 | s7 | \n", "[####################] 100% building vcf file | 0:00:09 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:04 | s7 | \n", "[####################] 100% writing outfiles | 0:00:05 | s7 | \n", "Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/min4_outfiles\n", "\n", "Assembly: min13\n", "[####################] 100% filtering loci | 0:00:02 | s7 | \n", "[####################] 100% building loci/stats | 0:00:01 | s7 | \n", "[####################] 100% building vcf file | 0:00:03 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:04 | s7 | \n", "[####################] 100% writing outfiles | 0:00:00 | s7 | \n", "Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/min13_outfiles\n", "\n", "Assembly: min11-pops\n", "[####################] 100% filtering loci | 0:00:02 | s7 | \n", "[####################] 100% building 
loci/stats | 0:00:01 | s7 | \n", "[####################] 100% building vcf file | 0:00:03 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:04 | s7 | \n", "[####################] 100% writing outfiles | 0:00:00 | s7 | \n", "Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/min11-pops_outfiles\n", "\n", "Assembly: nouts_min11\n", "[####################] 100% filtering loci | 0:00:03 | s7 | \n", "[####################] 100% building loci/stats | 0:00:02 | s7 | \n", "[####################] 100% building vcf file | 0:00:02 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:04 | s7 | \n", "[####################] 100% writing outfiles | 0:00:00 | s7 | \n", "Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/nouts_min11_outfiles\n", "\n" ] } ], "source": [ "## create a branch for outputs with min_samples = 4 (lots of missing data)\n", "min4 = data.branch(\"min4\")\n", "min4.set_params(\"min_samples_locus\", 4)\n", "min4.run(\"7\")\n", "\n", "## create a branch for outputs with min_samples = 13 (no missing data)\n", "min13 = data.branch(\"min13\")\n", "min13.set_params(\"min_samples_locus\", 13)\n", "min13.run(\"7\")\n", "\n", "## create a branch with no missing data for ingroups, but allow\n", "## missing data in the outgroups by setting population assignments.\n", "## The population min-sample values overrule the min-samples-locus param\n", "pops = data.branch(\"min11-pops\")\n", "pops.populations = {\n", " \"ingroup\": (11, [i for i in pops.samples if \"prz\" not in i]),\n", " \"outgroup\" : (0, [i for i in pops.samples if \"prz\" in i]),\n", " }\n", "pops.run(\"7\")\n", "\n", "## create a branch with no missing data and with outgroups removed\n", "nouts = data.branch(\"nouts_min11\", subsamples=[i for i in pops.samples if \"prz\" not in i])\n", "nouts.set_params(\"min_samples_locus\", 11)\n", "nouts.run(\"7\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### View final stats\n", "The .stats attribute shows a stats summary for each sample, and a number of stats dataframes can be accessed for each step from the .stats_dfs attribute of the Assembly. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " state reads_raw reads_passed_filter clusters_total \\\n", "29154_superba 6 696994 689996 134897 \n", "30556_thamno 6 1452316 1440314 212960 \n", "30686_cyathophylla 6 1253109 1206947 240824 \n", "32082_przewalskii 6 964244 955480 151762 \n", "33413_thamno 6 636625 626084 172709 \n", "33588_przewalskii 6 1002923 993873 158933 \n", "35236_rex 6 1803858 1787366 417632 \n", "35855_rex 6 1409843 1397068 184661 \n", "38362_rex 6 1391175 1379626 133541 \n", "39618_rex 6 822263 813990 148107 \n", "40578_rex 6 1707942 1695523 222328 \n", "41478_cyathophylloides 6 2199740 2185364 173188 \n", "41954_cyathophylloides 6 2199613 2176210 301343 \n", "\n", " clusters_hidepth hetero_est error_est reads_consens \n", "29154_superba 35143 0.010 0.002 34628 \n", "30556_thamno 52776 0.011 0.003 51701 \n", "30686_cyathophylla 54282 0.010 0.002 53347 \n", "32082_przewalskii 42487 0.013 0.002 41841 \n", "33413_thamno 31072 0.012 0.002 30674 \n", "33588_przewalskii 46555 0.013 0.002 45892 \n", "35236_rex 54709 0.010 0.001 54064 \n", "35855_rex 56595 0.013 0.003 55371 \n", "38362_rex 53156 0.008 0.002 52496 \n", "39618_rex 44016 0.010 0.002 43409 \n", "40578_rex 56653 0.011 0.002 55937 \n", "41478_cyathophylloides 55208 0.008 0.001 54482 \n", "41954_cyathophylloides 76537 0.008 0.003 75167 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## we can access the stats summary as a pandas dataframes. \n", "min4.stats" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "\r\n", "## The number of loci caught by each filter.\r\n", "## ipyrad API location: [assembly].stats_dfs.s7_filters\r\n", "\r\n", " total_filters applied_order retained_loci\r\n", "total_prefiltered_loci 96562 0 96562\r\n", "filtered_by_rm_duplicates 2939 2939 93623\r\n", "filtered_by_max_indels 189 189 93434\r\n", "filtered_by_max_snps 679 18 93416\r\n", "filtered_by_max_shared_het 865 708 92708\r\n", "filtered_by_min_sample 47958 46646 46062\r\n", "filtered_by_max_alleles 10694 4565 41497\r\n", "total_filtered_loci 41497 0 41497\r\n", "\r\n", "\r\n", "## The number of loci recovered for each Sample.\r\n", "## ipyrad API location: [assembly].stats_dfs.s7_samples\r\n", "\r\n", " sample_coverage\r\n", "29154_superba 21054\r\n", "30556_thamno 31790\r\n", "30686_cyathophylla 26648\r\n", "32082_przewalskii 13201\r\n", "33413_thamno 18747\r\n", "33588_przewalskii 15461\r\n", "35236_rex 33228\r\n", "35855_rex 33403\r\n", "38362_rex 33748\r\n", "39618_rex 28135\r\n", "40578_rex 34107\r\n", "41478_cyathophylloides 30826\r\n", "41954_cyathophylloides 28260\r\n", "\r\n", "\r\n", "## The number of loci for which N taxa have data.\r\n", "## ipyrad API location: [assembly].stats_dfs.s7_loci\r\n", "\r\n", " locus_coverage sum_coverage\r\n", "1 0 0\r\n", "2 0 0\r\n", "3 0 0\r\n", "4 5764 5764\r\n", "5 4002 9766\r\n", "6 3650 13416\r\n", "7 3139 16555\r\n", "8 3220 19775\r\n", "9 4153 23928\r\n", "10 4904 28832\r\n", "11 5321 34153\r\n", "12 4511 38664\r\n", "13 2833 41497\r\n", "\r\n", "\r\n", "## The distribution of SNPs (var and pis) per locus.\r\n", "## var = Number of loci with n variable sites (pis + autapomorphies)\r\n", "## pis = Number of loci with n parsimony informative site (minor allele in >1 sample)\r\n", "## ipyrad API location: [assembly].stats_dfs.s7_snps\r\n", "\r\n", " var sum_var pis sum_pis\r\n", "0 2348 0 11768 0\r\n", "1 4482 4482 10822 10822\r\n", "2 5870 16222 7699 
26220\r\n", "3 6314 35164 4914 40962\r\n", "4 5708 57996 2967 52830\r\n", "5 4857 82281 1679 61225\r\n", "6 3885 105591 874 66469\r\n", "7 2932 126115 478 69815\r\n", "8 1955 141755 188 71319\r\n", "9 1255 153050 72 71967\r\n", "10 827 161320 19 72157\r\n", "11 506 166886 10 72267\r\n", "12 262 170030 4 72315\r\n", "13 141 171863 3 72354\r\n", "14 78 172955 0 72354\r\n", "15 39 173540 0 72354\r\n", "16 19 173844 0 72354\r\n", "17 10 174014 0 72354\r\n", "18 6 174122 0 72354\r\n", "19 2 174160 0 72354\r\n", "20 1 174180 0 72354\r\n", "\r\n", "\r\n", "## Final Sample stats summary\r\n", "\r\n", " state reads_raw reads_passed_filter clusters_total clusters_hidepth hetero_est error_est reads_consens loci_in_assembly\r\n", "29154_superba 7 696994 689996 134897 35143 0.010 0.002 34628 21054\r\n", "30556_thamno 7 1452316 1440314 212960 52776 0.011 0.003 51701 31790\r\n", "30686_cyathophylla 7 1253109 1206947 240824 54282 0.010 0.002 53347 26648\r\n", "32082_przewalskii 7 964244 955480 151762 42487 0.013 0.002 41841 13201\r\n", "33413_thamno 7 636625 626084 172709 31072 0.012 0.002 30674 18747\r\n", "33588_przewalskii 7 1002923 993873 158933 46555 0.013 0.002 45892 15461\r\n", "35236_rex 7 1803858 1787366 417632 54709 0.010 0.001 54064 33228\r\n", "35855_rex 7 1409843 1397068 184661 56595 0.013 0.003 55371 33403\r\n", "38362_rex 7 1391175 1379626 133541 53156 0.008 0.002 52496 33748\r\n", "39618_rex 7 822263 813990 148107 44016 0.010 0.002 43409 28135\r\n", "40578_rex 7 1707942 1695523 222328 56653 0.011 0.002 55937 34107\r\n", "41478_cyathophylloides 7 2199740 2185364 173188 55208 0.008 0.001 54482 30826\r\n", "41954_cyathophylloides 7 2199613 2176210 301343 76537 0.008 0.003 75167 28260" ] } ], "source": [ "## or print the full stats file \n", "cat $min4.stats_files.s7" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sample_coverage\n", "29154_superba 21054\n", "30556_thamno 31790\n", "30686_cyathophylla 26648\n", "32082_przewalskii 13201\n", "33413_thamno 18747\n", "33588_przewalskii 15461\n", "35236_rex 33228\n", "35855_rex 33403\n", "38362_rex 33748\n", "39618_rex 28135\n", "40578_rex 34107\n", "41478_cyathophylloides 30826\n", "41954_cyathophylloides 28260" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## and we can access parts of the full stats outputs as dataframes \n", "min4.stats_dfs.s7_samples" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sample_coverage\n", "29154_superba 2833\n", "30556_thamno 2833\n", "30686_cyathophylla 2833\n", "32082_przewalskii 2833\n", "33413_thamno 2833\n", "33588_przewalskii 2833\n", "35236_rex 2833\n", "35855_rex 2833\n", "38362_rex 2833\n", "39618_rex 2833\n", "40578_rex 2833\n", "41478_cyathophylloides 2833\n", "41954_cyathophylloides 2833" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## compare this to the one above, coverage is more equal\n", "min13.stats_dfs.s7_samples" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sample_coverage\n", "29154_superba 5796\n", "30556_thamno 5796\n", "30686_cyathophylla 5796\n", "32082_przewalskii 3266\n", "33413_thamno 5796\n", "33588_przewalskii 3747\n", "35236_rex 5796\n", "35855_rex 5796\n", "38362_rex 5796\n", "39618_rex 5796\n", "40578_rex 5796\n", "41478_cyathophylloides 5796\n", "41954_cyathophylloides 5796" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## similarly, coverage is equal here among ingroups, but allows missing in outgroups\n", "pops.stats_dfs.s7_samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis tools\n", "We have a lot more information about analysis tools in the ipyrad documentation. But here I'll show just a quick example of how you can easily access the data files for these assemblies and use them in downstream analysis software. The ipyrad analysis tools include convenient wrappers to make it easier to parallelize analyses of RAD-seq data. Please see the full documentation for the ipyrad.analysis tools in the ipyrad documentation for more details. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import ipyrad as ip\n", "import ipyrad.analysis as ipa" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loading Assembly: min4\n", "from saved path: ~/Documents/ipyrad/tests/analysis-ipyrad/min4.json\n", "loading Assembly: min13\n", "from saved path: ~/Documents/ipyrad/tests/analysis-ipyrad/min13.json\n", "loading Assembly: nouts_min11\n", "from saved path: ~/Documents/ipyrad/tests/analysis-ipyrad/nouts_min11.json\n" ] } ], "source": [ "## you can re-load assemblies at a later time from their JSON file\n", "min4 = ip.load_json(\"analysis-ipyrad/min4.json\")\n", "min13 = ip.load_json(\"analysis-ipyrad/min13.json\")\n", "nouts = ip.load_json(\"analysis-ipyrad/nouts_min11.json\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### *RAxML* -- ML concatenation tree inference" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## conda install raxml -c bioconda\n", "## conda install toytree -c eaton-lab" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## create a raxml analysis object for the min13 data sets\n", "rax = ipa.raxml(\n", " name=min13.name, \n", " data=min13.outfiles.phy,\n", " workdir=\"analysis-raxml\",\n", " T=20,\n", " N=100, \n", " o=[i for i in min13.samples if \"prz\" in i],\n", " )" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "raxmlHPC-PTHREADS-SSE3 -f a -T 20 -m GTRGAMMA -N 100 -x 12345 -p 54321 -n min13 -w /home/deren/Documents/ipyrad/tests/analysis-raxml -s /home/deren/Documents/ipyrad/tests/analysis-ipyrad/min13_outfiles/min13.phy -o 33588_przewalskii,32082_przewalskii\n", "job min13 finished successfully\n" ] } ], "source": [ "## print the raxml command and call it\n", "print rax.command\n", "rax.run(force=True)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bestTree ~/Documents/ipyrad/tests/analysis-raxml/RAxML_bestTree.min13\n", "bipartitions ~/Documents/ipyrad/tests/analysis-raxml/RAxML_bipartitions.min13\n", "bipartitionsBranchLabels 
~/Documents/ipyrad/tests/analysis-raxml/RAxML_bipartitionsBranchLabels.min13\n", "bootstrap ~/Documents/ipyrad/tests/analysis-raxml/RAxML_bootstrap.min13\n", "info ~/Documents/ipyrad/tests/analysis-raxml/RAxML_info.min13" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## access the resulting tree files\n", "rax.trees" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
(toytree drawing omitted: RAxML bipartitions tree for the min13 assembly with bootstrap support values on nodes)
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "## plot a tree in the notebook with toytree\n", "import toytree\n", "tre = toytree.tree(rax.trees.bipartitions)\n", "tre.draw(\n", " width=350,\n", " height=400,\n", " node_labels=tre.get_node_values(\"support\"),\n", " );" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *tetrad* -- quartet tree inference " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loading seq array [13 taxa x 174180 bp]\n", "max unlinked SNPs per quartet (nloci): 39149\n" ] } ], "source": [ "## create a tetrad analysis object\n", "tet = ipa.tetrad(\n", " name=min4.name, \n", " seqfile=min4.outfiles.snpsphy,\n", " mapfile=min4.outfiles.snpsmap,\n", " nboots=100,\n", " )" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "host compute node: [20 cores] on tinus\n", "inferring 715 induced quartet trees\n", "[####################] 100% initial tree | 0:00:05 | \n", "[####################] 100% boot 100 | 0:02:23 | \n" ] } ], "source": [ "## run tree inference\n", "tet.run(ipyclient)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "boots ~/Documents/ipyrad/tests/analysis-tetrad/min4.boots\n", "cons ~/Documents/ipyrad/tests/analysis-tetrad/min4.cons\n", "nhx ~/Documents/ipyrad/tests/analysis-tetrad/min4.nhx\n", "tree ~/Documents/ipyrad/tests/analysis-tetrad/min4.tree" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## access tree files\n", "tet.trees" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
(toytree drawing omitted: tetrad consensus tree for the min4 assembly, rooted on the przewalskii samples, with bootstrap support values on nodes)
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "## plot results (just like above, but unrooted by default)\n", "## the consensus tree here differs from the ML tree above.\n", "import toytree\n", "qtre = toytree.tree(tet.trees.nhx)\n", "qtre.root(wildcard=\"prz\")\n", "qtre.draw(\n", " width=350,\n", " height=400,\n", " node_labels=qtre.get_node_values(\"support\"),\n", " );" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### *STRUCTURE* -- population cluster inference" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## conda install structure clumpp -c ipyrad" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## create a structure analysis object for the no-outgroup data set\n", "struct = ipa.structure(\n", " name=nouts.name, \n", " data=nouts.outfiles.str, \n", " mapfile=nouts.outfiles.snpsmap, \n", ")\n", "\n", "## set params for analysis (should be longer in real analyses)\n", "struct.mainparams.burnin=1000\n", "struct.mainparams.numreps=8000" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "submitted 10 structure jobs [nouts_min11-K-2]\n", "submitted 10 structure jobs [nouts_min11-K-3]\n", "submitted 10 structure jobs [nouts_min11-K-4]\n", "submitted 10 structure jobs [nouts_min11-K-5]\n", "submitted 10 structure jobs [nouts_min11-K-6]\n" ] } ], "source": [ "## run structure across 10 random replicates of sampled unlinked SNPs\n", "for kpop in [2, 3, 4, 5, 6]:\n", " struct.run(kpop=kpop, nreps=10, ipyclient=ipyclient)" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## wait for all of these jobs to finish\n", "ipyclient.wait()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "mean scores across 10 replicates.\n", "mean scores across 10 replicates.\n", "mean scores across 10 replicates.\n", "mean scores across 10 replicates.\n", "mean scores across 10 replicates.\n" ] } ], "source": [ "## collect results\n", "tables = {}\n", "for kpop in [2, 3, 4, 5, 6]:\n", " tables[kpop] = struct.get_clumpp_table(kpop)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## custom sorting order\n", "myorder = [\n", " \"41478_cyathophylloides\", \n", " \"41954_cyathophylloides\", \n", " \"29154_superba\",\n", " \"30686_cyathophylla\", \n", " \"33413_thamno\", \n", " \"30556_thamno\", \n", " \"35236_rex\", \n", " \"40578_rex\", \n", " \"35855_rex\",\n", " \"39618_rex\", \n", " \"38362_rex\",\n", "]" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
(toyplot barplot omitted: mean STRUCTURE ancestry proportions across 10 replicates at K=2)
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
(toyplot barplot omitted: mean STRUCTURE ancestry proportions across 10 replicates at K=3)
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
(toyplot barplot omitted: mean STRUCTURE ancestry proportions across 10 replicates at K=4)
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
(toyplot barplot omitted: mean STRUCTURE ancestry proportions across 10 replicates at K=5)
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
(toyplot barplot omitted: mean STRUCTURE ancestry proportions across 10 replicates at K=6)
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "## import toyplot (packaged with toytree) \n", "import toyplot\n", "\n", "## plot bars for each K-value (mean of 10 reps)\n", "for kpop in [2, 3, 4, 5, 6]:\n", " table = tables[kpop]\n", " table = table.ix[myorder]\n", " \n", " ## plot barplot w/ hover\n", " canvas, axes, mark = toyplot.bars(\n", " table, \n", " title=[[i] for i in table.index.tolist()],\n", " width=400, \n", " height=200, \n", " yshow=False, \n", " style={\"stroke\": toyplot.color.near_black},\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### *TREEMIX* -- ML tree & admixture co-inference" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## conda install treemix -c ipyrad" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "binary treemix \n", "bootstrap 0 \n", "climb 0 \n", "cormig 0 \n", "g (None, None) \n", "global_ 1 \n", "k 0 \n", "m 0 \n", "noss 0 \n", "root prz \n", "se 0 \n", "seed 737548365 " ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## group taxa into 'populations'\n", "imap = {\n", " \"prz\": [\"32082_przewalskii\", \"33588_przewalskii\"],\n", " \"cys\": [\"41478_cyathophylloides\", \"41954_cyathophylloides\"],\n", " \"cya\": [\"30686_cyathophylla\"],\n", " \"sup\": [\"29154_superba\"],\n", " \"cup\": [\"33413_thamno\"],\n", " \"tha\": [\"30556_thamno\"],\n", " \"rck\": [\"35236_rex\"],\n", " \"rex\": [\"35855_rex\", \"40578_rex\"],\n", " \"lip\": [\"39618_rex\", \"38362_rex\"], \n", "}\n", "\n", "## optional: loci will be filtered if they do not have data for at\n", "## least N samples in each species. Minimums cannot be <1.\n", "minmap = {\n", " \"prz\": 2,\n", " \"cys\": 2,\n", " \"cya\": 1,\n", " \"sup\": 1,\n", " \"cup\": 1,\n", " \"tha\": 1, \n", " \"rck\": 1,\n", " \"rex\": 2,\n", " \"lip\": 2,\n", " }\n", "\n", "## sets a random number seed\n", "import numpy\n", "numpy.random.seed(12349876)\n", "\n", "## create a treemix analysis object\n", "tmix = ipa.treemix(\n", " name=min13.name, \n", " data=min13.outfiles.snpsphy,\n", " mapfile=min13.outfiles.snpsmap,\n", " imap=imap,\n", " minmap=minmap, \n", " )\n", "\n", "## you can set additional parameter args here\n", "tmix.params.root = \"prz\"\n", "tmix.params.global_ = 1\n", "\n", "## print the full params\n", "tmix.params" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## a dictionary for storing treemix objects\n", "tdict = {}\n", "\n", "## iterate over values of m\n", "for rep in xrange(4):\n", " for mig in xrange(4):\n", " \n", " ## create new treemix object copy\n", " name = \"mig-{}-rep-{}\".format(mig, rep)\n", " tmp = tmix.copy(name)\n", "\n", " ## set params on new object\n", " tmp.params.m = mig\n", " \n", " ## run treemix analysis\n", " tmp.run()\n", " \n", " ## store the treemix object\n", " tdict[name] = tmp" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
(toyplot figure omitted: treemix graph for the mig-1-rep-1 result, drawn on a drift-parameter axis with one admixture edge of weight ~0.17)
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import toyplot\n", "\n", "## select a single result\n", "tmp = tdict[\"mig-1-rep-1\"]\n", "\n", "## draw the tree similar to the Treemix plotting R code\n", "## this code is rather new and will be expanded in the future.\n", "canvas = toyplot.Canvas(width=350, height=350)\n", "axes = canvas.cartesian(padding=25, margin=75)\n", "axes = tmp.draw(axes)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
(toyplot figure omitted: 4x4 grid of treemix graphs for 0-3 admixture edges across 4 replicates, drawn on drift-parameter axes with admixture-edge weights labeled)
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import toyplot\n", "import numpy as np\n", "\n", "## plot many results\n", "canvas = toyplot.Canvas(width=800, height=1200)\n", "idx = 0\n", "for mig in range(4):\n", " for rep in range(4):\n", " tmp = tdict[\"mig-{}-rep-{}\".format(mig, rep)]\n", " ax = canvas.cartesian(grid=(4, 4, idx), padding=25, margin=(25, 50, 100, 25))\n", " ax = tmp.draw(ax)\n", " idx += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ABBA-BABA admixture inference" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "bb = ipa.baba(\n", " data=min4.outfiles.loci,\n", " newick=\"analysis-raxml/RAxML_bestTree.min13\",\n", ")" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "database None \n", "mincov 1 \n", "nboots 1000 \n", "quiet False " ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## check params\n", "bb.params" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "36 tests generated from tree\n" ] } ], "source": [ "## generate all tests from the tree where 32082 is p4\n", "bb.generate_tests_from_tree(\n", " constraint_dict={\n", " \"p4\": [\"32082_przewalskii\"],\n", " \"p3\": [\"30556_thamno\"],\n", " }\n", " )" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[####################] 100% calculating D-stats | 0:00:37 | \n" ] } ], "source": [ "## run the tests in parallel\n", "bb.run(ipyclient=ipyclient)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " dstat bootmean bootstd Z ABBA BABA nloci\n", "12 0.198 0.199 0.027 7.291 593.156 396.844 10629\n", "27 -0.208 -0.207 0.031 6.648 370.500 565.000 10365\n", "20 0.208 0.208 0.031 6.625 565.000 370.500 10365\n", "16 0.186 0.186 0.032 5.830 555.250 381.250 10243\n", "26 -0.186 -0.186 0.033 5.708 381.250 555.250 10243" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bb.results_table.sort_values(by=\"Z\", ascending=False).head()" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'p1': ['35236_rex'],\n", " 'p2': ['35855_rex', '40578_rex'],\n", " 'p3': ['30556_thamno'],\n", " 'p4': ['32082_przewalskii']}" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## most significant result (more ABBA than BABA)\n", "bb.tests[12]" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'p1': ['40578_rex'],\n", " 'p2': ['35236_rex'],\n", " 'p3': ['30556_thamno'],\n", " 'p4': ['32082_przewalskii']}" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## the next most significant result (more BABA than ABBA)\n", "bb.tests[27]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *BPP* -- species tree inference and species delimitation" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## a dictionary mapping sample names to 'species' names\n", "imap = {\n", " \"prz\": [\"32082_przewalskii\", \"33588_przewalskii\"],\n", " \"cys\": [\"41478_cyathophylloides\", \"41954_cyathophylloides\"],\n", " \"cya\": [\"30686_cyathophylla\"],\n", " \"sup\": [\"29154_superba\"],\n", " \"cup\": [\"33413_thamno\"],\n", " \"tha\": [\"30556_thamno\"],\n", " \"rck\": [\"35236_rex\"],\n", " \"rex\": [\"35855_rex\", \"40578_rex\"],\n", " \"lip\": [\"39618_rex\", \"38362_rex\"], \n", " }\n", "\n", "## optional: loci will be filtered if they do not have data for at\n", "## least N samples/individuals in each species.\n", "minmap = {\n", " \"prz\": 2,\n", " \"cys\": 2,\n", " \"cya\": 1,\n", " \"sup\": 1,\n", " \"cup\": 1,\n", " \"tha\": 1, \n", " \"rck\": 1,\n", " \"rex\": 2,\n", " \"lip\": 2,\n", " }\n", "\n", "## a tree hypothesis (guidetree) (here based on tetrad results)\n", "## for the 'species' we've collapsed samples into.\n", "newick = \"((((((rex, lip), rck), tha), cup), (cys, (cya, sup))), prz);\"" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## initiate a bpp object \n", "b = ipa.bpp(\n", " name=min4.name,\n", " locifile=min4.outfiles.alleles,\n", " imap=imap,\n", " minmap=minmap,\n", " guidetree=newick,\n", " )" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "burnin 1000 \n", "cleandata 0 \n", "delimit_alg (0, 5) \n", "finetune (0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01)\n", "infer_delimit 0 \n", "infer_sptree 0 \n", "nsample 2000 \n", "sampfreq 20 \n", "seed 12345 \n", "tauprior (2, 2000, 1) \n", "thetaprior (2, 2000) \n", "usedata 1 " ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## set some optional params, leaving others at their defaults\n", "## you should definitely run these longer for real analyses\n", "b.params.burnin = 1000\n", "b.params.nsample = 2000\n", "b.params.sampfreq = 20\n", "\n", "## print params\n", "b.params" ] }, { 
"cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "maxloci 100 \n", "minmap {'cys': 4, 'rex': 4, 'cup': 2, 'rck': 2, 'cya': 2, 'lip': 4, 'sup': 2, 'tha': 2, 'prz': 4}\n", "minsnps 4 " ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## set some optional filters leaving others at their defaults\n", "b.filters.maxloci=100\n", "b.filters.minsnps=4\n", "\n", "## print filters\n", "b.filters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### run BPP\n", "You can either call '`write_bpp_files()`' to write input files for this data set to be run in BPP, and then call BPP on those files, or you can use the '`.run()`' command to run the data files directly, and in parallel on the cluster. If you specify multiple reps then a different random sample of loci will be selected, and different random seeds applied to each replicate. " ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "input files created for job min4 (100 loci)\n" ] } ], "source": [ "b.write_bpp_files()" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "submitted 2 bpp jobs [min4] (100 loci)\n" ] } ], "source": [ "b.run()" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## wait for all ipyclient jobs to finish\n", "ipyclient.wait()" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " count mean std min 25% \\\n", "theta_1cup 2000.0 0.002 4.720e-04 1.110e-03 1.949e-03 \n", "theta_2cya 2000.0 0.001 4.587e-04 1.944e-04 9.056e-04 \n", "theta_3cys 2000.0 0.001 2.569e-04 7.718e-04 1.091e-03 \n", "theta_4lip 2000.0 0.002 3.075e-04 9.518e-04 1.619e-03 \n", "theta_5prz 2000.0 0.007 7.235e-04 4.556e-03 6.150e-03 \n", "theta_6rck 2000.0 0.002 4.692e-04 1.090e-03 1.860e-03 \n", "... ... ... ... ... ... \n", "tau_13rexliprcktha 2000.0 0.002 2.376e-04 1.470e-03 1.854e-03 \n", "tau_14rexliprck 2000.0 0.002 2.367e-04 1.191e-03 1.781e-03 \n", "tau_15rexlip 2000.0 0.002 2.466e-04 9.497e-04 1.717e-03 \n", "tau_16cyscyasup 2000.0 0.005 4.718e-04 3.503e-03 4.881e-03 \n", "tau_17cyasup 2000.0 0.002 7.494e-04 3.230e-04 1.270e-03 \n", "lnL 2000.0 -13806.667 2.362e+01 -1.389e+04 -1.382e+04 \n", "\n", " 50% 75% max \n", "theta_1cup 0.002 0.003 0.005 \n", "theta_2cya 0.001 0.002 0.003 \n", "theta_3cys 0.001 0.001 0.002 \n", "theta_4lip 0.002 0.002 0.003 \n", "theta_5prz 0.007 0.007 0.009 \n", "theta_6rck 0.002 0.002 0.004 \n", "... ... ... ... \n", "tau_13rexliprcktha 0.002 0.002 0.003 \n", "tau_14rexliprck 0.002 0.002 0.003 \n", "tau_15rexlip 0.002 0.002 0.003 \n", "tau_16cyscyasup 0.005 0.006 0.007 \n", "tau_17cyasup 0.002 0.002 0.005 \n", "lnL -13806.152 -13790.543 -13733.031 \n", "\n", "[26 rows x 8 columns]" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## check results\n", "## parse the mcmc table with the pandas library\n", "import pandas as pd\n", "btable = pd.read_csv(b.files.mcmcfiles[0], sep=\"\\t\", index_col=0)\n", "btable.describe().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Success -- you're finished\n", "Now that you have an idea of the myriad ways you can assemble your data, and of some of the downstream analysis tools, you are ready to explore RAD-seq data. \n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }