{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Eaton & Ree (2013) single-end RAD data set\n",
"\n",
"Here we demonstrate a de novo assembly of an empirical RAD data set using the ipyrad Python API. This example was run on a workstation with 20 cores, where it takes about 10 minutes to assemble, but you should be able to run it on a 4-core laptop in ~30-60 minutes.\n",
"\n",
"For our example data we will use the 13-taxon *Pedicularis* data set from [Eaton and Ree (2013) (Open Access)](https://academic.oup.com/sysbio/article/62/5/689/1684460/Inferring-Phylogeny-and-Introgression-using-RADseq). This data set is composed of single-end 75bp reads from a RAD-seq library prepared with the PstI enzyme. The data set also serves as an example for several of our [analysis cookbooks](http://ipyrad.readthedocs.io/analysis.html) that demonstrate methods for analyzing RAD-seq data files. At the end of this notebook there are several examples of how to use the ipyrad analysis tools to run downstream analyses in parallel. \n",
"\n",
"The figure below shows the ingroup taxa from this study and their sampling locations. The study includes all species within a small monophyletic clade of *Pedicularis*, including multiple individuals from 5 species and several subspecies, as well as an outgroup species. The sampling spans from population-level variation, where species boundaries are unclear, to higher-level divergence, where species boundaries are quite distinct. RAD-seq data are often especially useful at this scale of divergence. \n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup (software and data files)\n",
"If you haven't done so yet, start by installing `ipyrad` using conda (see the ipyrad installation instructions) along with the packages in the cell below; this is easiest to do in a terminal. Then open a jupyter-notebook, like this one, and follow along with the tutorial by copying and executing the code in the cells, adding your own documentation between them using markdown. Feel free to modify parameters to see their effects on the downstream results. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"## conda install ipyrad -c ipyrad\n",
"## conda install toytree -c eaton-lab\n",
"## conda install sra-tools -c bioconda"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"## imports\n",
"import ipyrad as ip\n",
"import ipyrad.analysis as ipa\n",
"import ipyparallel as ipp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In contrast to the ipyrad CLI, the ipyrad API gives users much more fine-scale control over the parallelization of their analysis, but this also requires learning a little bit about the library that we use to do this, called `ipyparallel`. This library is designed for use with jupyter-notebooks to allow massive-scale multi-processing while working *interactively*. \n",
"\n",
"Understanding the nuts and bolts of it might take a little while, but it is fairly easy to get started, especially in the way it is integrated with ipyrad. To start a parallel client you must run the command-line program '`ipcluster`'. This starts a number of independent Python processes (engines) to which we can then send bits of work. The cluster can be stopped and restarted independently of this notebook, which is convenient when working on an HPC system where a large number of cores is not always immediately available. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Open a terminal and type the following command to start\n",
"# an ipcluster instance with 40 engines:\n",
"# ipcluster start -n 40 --cluster-id=\"ipyrad\" --daemonize\n",
"\n",
"# After the cluster is running you can attach to it with ipyparallel\n",
"ipyclient = ipp.Client(cluster_id=\"ipyrad\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download the data set (Pedicularis)\n",
"These data are archived on the NCBI sequence read archive (SRA) under accession id SRP021469. As part of the ipyrad analysis tools we have a wrapper around the sra-tools software that can be used to query NCBI and download sequence data based on accession IDs. Run the code below to download the fastq data files associated with this study. The data will be saved to the specified directory, which will be created if it does not already exist. The compressed data are a little over 1GB in size. If you pass your ipyclient to the `.run()` command below then the download will be parallelized. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[####################] 100% Downloading fastq files | 0:01:19 | \n",
"13 fastq files downloaded to /home/deren/Documents/ipyrad/tests/fastqs-Ped\n"
]
}
],
"source": [
"## download the Pedicularis data set from NCBI\n",
"sra = ipa.sratools(accessions=\"SRP021469\", workdir=\"fastqs-Ped\")\n",
"sra.run(force=True, ipyclient=ipyclient)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create an Assembly object\n",
"This object stores the parameters of the assembly and the organization of data files. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"New Assembly: pedicularis\n"
]
}
],
"source": [
"## you must provide a name for the Assembly\n",
"data = ip.Assembly(\"pedicularis\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set parameters for the Assembly. This will raise an error if a parameter value is not allowed, for example because it is the wrong type or outside the allowed range. "
]
},
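{
"cell_type": "markdown",
"metadata": {},
"source": [
"The range check behaves roughly like the standalone sketch below. This is not ipyrad's actual implementation; the function name and exception type here are hypothetical, and the sketch only illustrates the raise-on-bad-value behavior described above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## a minimal sketch of the kind of validation set_params performs\n",
"## (hypothetical helper, not ipyrad code)\n",
"def check_clust_threshold(value):\n",
"    ## values may arrive as strings or floats; cast first\n",
"    value = float(value)\n",
"    if not 0.0 < value <= 1.0:\n",
"        raise ValueError(\"clust_threshold must be in (0, 1]\")\n",
"    return value\n",
"\n",
"## a valid value (string form) passes and is cast to float\n",
"print(check_clust_threshold(\"0.90\"))\n",
"\n",
"## an out-of-range value raises an error\n",
"try:\n",
"    check_clust_threshold(2.0)\n",
"except ValueError as err:\n",
"    print(err)"
]
},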
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 assembly_name pedicularis \n",
"1 project_dir ./analysis-ipyrad \n",
"2 raw_fastq_path \n",
"3 barcodes_path \n",
"4 sorted_fastq_path ./example_empirical_rad/*.fastq.gz \n",
"5 assembly_method denovo \n",
"6 reference_sequence \n",
"7 datatype rad \n",
"8 restriction_overhang ('TGCAG', '') \n",
"9 max_low_qual_bases 5 \n",
"10 phred_Qscore_offset 33 \n",
"11 mindepth_statistical 6 \n",
"12 mindepth_majrule 6 \n",
"13 maxdepth 10000 \n",
"14 clust_threshold 0.9 \n",
"15 max_barcode_mismatch 0 \n",
"16 filter_adapters 2 \n",
"17 filter_min_trim_len 35 \n",
"18 max_alleles_consens 2 \n",
"19 max_Ns_consens (5, 5) \n",
"20 max_Hs_consens (5, 5) \n",
"21 min_samples_locus 4 \n",
"22 max_SNPs_locus (20, 20) \n",
"23 max_Indels_locus (8, 8) \n",
"24 max_shared_Hs_locus 0.5 \n",
"25 trim_reads (0, 0, 0, 0) \n",
"26 trim_loci (0, 5, 0, 0) \n",
"27 output_formats ('p', 's', 'v', 'n', 'k', 'u') \n",
"28 pop_assign_file \n"
]
}
],
"source": [
"## set parameters\n",
"data.set_params(\"project_dir\", \"analysis-ipyrad\")\n",
"data.set_params(\"sorted_fastq_path\", \"fastqs-Ped/*.fastq.gz\")\n",
"data.set_params(\"clust_threshold\", \"0.90\")\n",
"data.set_params(\"filter_adapters\", \"2\")\n",
"data.set_params(\"max_Hs_consens\", (5, 5))\n",
"data.set_params(\"trim_loci\", (0, 5, 0, 0))\n",
"data.set_params(\"output_formats\", \"psvnkua\")\n",
"\n",
"## see/print all parameters\n",
"data.get_params()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Assemble the data set\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assembly: pedicularis\n",
"[####################] 100% loading reads | 0:00:04 | s1 | \n",
"[####################] 100% processing reads | 0:01:14 | s2 | \n"
]
}
],
"source": [
"## run steps 1 & 2 of the assembly\n",
"data.run(\"12\")"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" state reads_raw reads_passed_filter\n",
"29154_superba 2 696994 689996\n",
"30556_thamno 2 1452316 1440314\n",
"30686_cyathophylla 2 1253109 1206947\n",
"32082_przewalskii 2 964244 955480\n",
"33413_thamno 2 636625 626084\n",
"33588_przewalskii 2 1002923 993873\n",
"35236_rex 2 1803858 1787366\n",
"35855_rex 2 1409843 1397068\n",
"38362_rex 2 1391175 1379626\n",
"39618_rex 2 822263 813990\n",
"40578_rex 2 1707942 1695523\n",
"41478_cyathophylloides 2 2199740 2185364\n",
"41954_cyathophylloides 2 2199613 2176210"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## access the stats of the assembly (so far) from the .stats attribute\n",
"data.stats"
]
},
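{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the stats table you can, for example, compute the fraction of raw reads that survived filtering. The sketch below hardcodes two of the counts printed above rather than reading them from the DataFrame, so it runs standalone; with a live Assembly you would use the `reads_raw` and `reads_passed_filter` columns of `data.stats` directly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## fraction of reads passing filter for two samples (counts copied\n",
"## from the stats table above; with a live Assembly you would use\n",
"## data.stats.reads_passed_filter / data.stats.reads_raw)\n",
"reads = {\n",
"    \"29154_superba\": (696994, 689996),\n",
"    \"41954_cyathophylloides\": (2199613, 2176210),\n",
"}\n",
"for name, (raw, passed) in reads.items():\n",
"    print(\"{}: {:.1f}% of reads passed\".format(name, 100.0 * passed / raw))"
]
},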
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assembly: pedicularis\n",
"[####################] 100% dereplicating | 0:00:07 | s3 | \n",
"[####################] 100% clustering | 0:05:08 | s3 | \n",
"[####################] 100% building clusters | 0:00:26 | s3 | \n",
"[####################] 100% chunking | 0:00:05 | s3 | \n",
"[####################] 100% aligning | 0:03:06 | s3 | \n",
"[####################] 100% concatenating | 0:00:15 | s3 | \n",
"[####################] 100% inferring [H, E] | 0:01:10 | s4 | \n",
"[####################] 100% calculating depths | 0:00:06 | s5 | \n",
"[####################] 100% chunking clusters | 0:00:06 | s5 | \n",
"[####################] 100% consens calling | 0:01:56 | s5 | \n",
"[####################] 100% concat/shuffle input | 0:00:05 | s6 | \n",
"[####################] 100% clustering across | 0:03:47 | s6 | \n",
"[####################] 100% building clusters | 0:00:06 | s6 | \n",
"[####################] 100% aligning clusters | 0:00:29 | s6 | \n",
"[####################] 100% database indels | 0:00:14 | s6 | \n",
"[####################] 100% indexing clusters | 0:00:03 | s6 | \n",
"[####################] 100% building database | 0:00:29 | s6 | \n"
]
}
],
"source": [
"## run steps 3-6 of the assembly\n",
"data.run(\"3456\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Branch to create several final data sets with different parameter settings\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assembly: min4\n",
"[####################] 100% filtering loci | 0:00:08 | s7 | \n",
"[####################] 100% building loci/stats | 0:00:01 | s7 | \n",
"[####################] 100% building vcf file | 0:00:09 | s7 | \n",
"[####################] 100% writing vcf file | 0:00:00 | s7 | \n",
"[####################] 100% building arrays | 0:00:04 | s7 | \n",
"[####################] 100% writing outfiles | 0:00:05 | s7 | \n",
"Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/min4_outfiles\n",
"\n",
"Assembly: min13\n",
"[####################] 100% filtering loci | 0:00:02 | s7 | \n",
"[####################] 100% building loci/stats | 0:00:01 | s7 | \n",
"[####################] 100% building vcf file | 0:00:03 | s7 | \n",
"[####################] 100% writing vcf file | 0:00:00 | s7 | \n",
"[####################] 100% building arrays | 0:00:04 | s7 | \n",
"[####################] 100% writing outfiles | 0:00:00 | s7 | \n",
"Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/min13_outfiles\n",
"\n",
"Assembly: min11-pops\n",
"[####################] 100% filtering loci | 0:00:02 | s7 | \n",
"[####################] 100% building loci/stats | 0:00:01 | s7 | \n",
"[####################] 100% building vcf file | 0:00:03 | s7 | \n",
"[####################] 100% writing vcf file | 0:00:00 | s7 | \n",
"[####################] 100% building arrays | 0:00:04 | s7 | \n",
"[####################] 100% writing outfiles | 0:00:00 | s7 | \n",
"Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/min11-pops_outfiles\n",
"\n",
"Assembly: nouts_min11\n",
"[####################] 100% filtering loci | 0:00:03 | s7 | \n",
"[####################] 100% building loci/stats | 0:00:02 | s7 | \n",
"[####################] 100% building vcf file | 0:00:02 | s7 | \n",
"[####################] 100% writing vcf file | 0:00:00 | s7 | \n",
"[####################] 100% building arrays | 0:00:04 | s7 | \n",
"[####################] 100% writing outfiles | 0:00:00 | s7 | \n",
"Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/nouts_min11_outfiles\n",
"\n"
]
}
],
"source": [
"## create a branch for outputs with min_samples = 4 (lots of missing data)\n",
"min4 = data.branch(\"min4\")\n",
"min4.set_params(\"min_samples_locus\", 4)\n",
"min4.run(\"7\")\n",
"\n",
"## create a branch for outputs with min_samples = 13 (no missing data)\n",
"min13 = data.branch(\"min13\")\n",
"min13.set_params(\"min_samples_locus\", 13)\n",
"min13.run(\"7\")\n",
"\n",
"## create a branch with no missing data for ingroups, but allow\n",
"## missing data in the outgroups by setting population assignments.\n",
"## The population min-sample values overrule the min-samples-locus param\n",
"pops = data.branch(\"min11-pops\")\n",
"pops.populations = {\n",
" \"ingroup\": (11, [i for i in pops.samples if \"prz\" not in i]),\n",
" \"outgroup\" : (0, [i for i in pops.samples if \"prz\" in i]),\n",
" }\n",
"pops.run(\"7\")\n",
"\n",
"## create a branch with no missing data and with outgroups removed\n",
"nouts = data.branch(\"nouts_min11\", subsamples=[i for i in pops.samples if \"prz\" not in i])\n",
"nouts.set_params(\"min_samples_locus\", 11)\n",
"nouts.run(\"7\")"
]
},
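{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effect of `min_samples_locus` can be pictured with a toy example (the loci and sample names below are made up, not from this data set): a locus is retained in the output only if at least that many samples have data for it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## toy illustration of the min_samples_locus filter (made-up data)\n",
"loci = {\n",
"    \"locus1\": [\"A\", \"B\", \"C\", \"D\", \"E\"],  ## samples with data\n",
"    \"locus2\": [\"A\", \"B\"],\n",
"    \"locus3\": [\"A\", \"B\", \"C\", \"D\"],\n",
"}\n",
"min_samples_locus = 4\n",
"kept = [name for name, samples in loci.items()\n",
"        if len(samples) >= min_samples_locus]\n",
"print(kept)"
]
},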
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View final stats\n",
"The `.stats` attribute shows a stats summary for each sample, and a number of per-step stats dataframes can be accessed from the `.stats_dfs` attribute of the Assembly. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
],
"source": [
"## import toyplot (packaged with toytree) \n",
"import toyplot\n",
"\n",
"## plot bars for each K-value (mean of 10 reps)\n",
"for kpop in [2, 3, 4, 5, 6]:\n",
" table = tables[kpop]\n",
" table = table.loc[myorder]\n",
" \n",
" ## plot barplot w/ hover\n",
" canvas, axes, mark = toyplot.bars(\n",
" table, \n",
" title=[[i] for i in table.index.tolist()],\n",
" width=400, \n",
" height=200, \n",
" yshow=False, \n",
" style={\"stroke\": toyplot.color.near_black},\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### *TREEMIX* -- ML tree & admixture co-inference"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"## conda install treemix -c ipyrad"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"binary treemix \n",
"bootstrap 0 \n",
"climb 0 \n",
"cormig 0 \n",
"g (None, None) \n",
"global_ 1 \n",
"k 0 \n",
"m 0 \n",
"noss 0 \n",
"root prz \n",
"se 0 \n",
"seed 737548365 "
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## group taxa into 'populations'\n",
"imap = {\n",
" \"prz\": [\"32082_przewalskii\", \"33588_przewalskii\"],\n",
" \"cys\": [\"41478_cyathophylloides\", \"41954_cyathophylloides\"],\n",
" \"cya\": [\"30686_cyathophylla\"],\n",
" \"sup\": [\"29154_superba\"],\n",
" \"cup\": [\"33413_thamno\"],\n",
" \"tha\": [\"30556_thamno\"],\n",
" \"rck\": [\"35236_rex\"],\n",
" \"rex\": [\"35855_rex\", \"40578_rex\"],\n",
" \"lip\": [\"39618_rex\", \"38362_rex\"], \n",
"}\n",
"\n",
"## optional: loci will be filtered if they do not have data for at\n",
"## least N samples in each species. Minimums cannot be <1.\n",
"minmap = {\n",
" \"prz\": 2,\n",
" \"cys\": 2,\n",
" \"cya\": 1,\n",
" \"sup\": 1,\n",
" \"cup\": 1,\n",
" \"tha\": 1, \n",
" \"rck\": 1,\n",
" \"rex\": 2,\n",
" \"lip\": 2,\n",
" }\n",
"\n",
"## sets a random number seed\n",
"import numpy\n",
"numpy.random.seed(12349876)\n",
"\n",
"## create a treemix analysis object\n",
"tmix = ipa.treemix(\n",
" name=min13.name, \n",
" data=min13.outfiles.snpsphy,\n",
" mapfile=min13.outfiles.snpsmap,\n",
" imap=imap,\n",
" minmap=minmap, \n",
" )\n",
"\n",
"## you can set additional parameter args here\n",
"tmix.params.root = \"prz\"\n",
"tmix.params.global_ = 1\n",
"\n",
"## print the full params\n",
"tmix.params"
]
},
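{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since treemix filters loci using both dicts, it is worth checking that they agree before running. The sketch below re-declares two small example dicts so it runs standalone; with the objects above you would drop the re-declaration and check the full `imap`/`minmap` instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## sanity-check that imap and minmap are consistent: every population\n",
"## needs a minimum, and no minimum can exceed its sample count\n",
"## (small example dicts; use the full imap/minmap from above instead)\n",
"imap = {\n",
"    \"prz\": [\"32082_przewalskii\", \"33588_przewalskii\"],\n",
"    \"cya\": [\"30686_cyathophylla\"],\n",
"}\n",
"minmap = {\"prz\": 2, \"cya\": 1}\n",
"for pop, samples in imap.items():\n",
"    assert pop in minmap, \"missing minmap entry for \" + pop\n",
"    assert 1 <= minmap[pop] <= len(samples), \"bad minimum for \" + pop\n",
"print(\"imap/minmap consistent\")"
]
},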
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"## a dictionary for storing treemix objects\n",
"tdict = {}\n",
"\n",
"## iterate over values of m\n",
"for rep in range(4):\n",
" for mig in range(4):\n",
" \n",
" ## create new treemix object copy\n",
" name = \"mig-{}-rep-{}\".format(mig, rep)\n",
" tmp = tmix.copy(name)\n",
"\n",
" ## set params on new object\n",
" tmp.params.m = mig\n",
" \n",
" ## run treemix analysis\n",
" tmp.run()\n",
" \n",
" ## store the treemix object\n",
" tdict[name] = tmp"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
],
"source": [
"import toyplot\n",
"\n",
"## select a single result\n",
"tmp = tdict[\"mig-1-rep-1\"]\n",
"\n",
"## draw the tree similar to the Treemix plotting R code\n",
"## this code is rather new and will be expanded in the future.\n",
"canvas = toyplot.Canvas(width=350, height=350)\n",
"axes = canvas.cartesian(padding=25, margin=75)\n",
"axes = tmp.draw(axes)"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"