{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# DaCosta & Sorenson (2016) single-end dd-RAD data set\n",
"Here we demonstrate a de novo assembly of an empirical RAD data set using the ipyrad Python API. This example was run on a workstation with 20 cores available and takes less than 10 minutes to run completely, but it can also be run on a laptop in under an hour.\n",
"\n",
"We will use the Lagonosticta and Vidua data set from [DaCosta & Sorenson 2016](https://www.ncbi.nlm.nih.gov/pubmed/26279345). This data set is composed of single-end 101bp reads from a ddRAD-seq library prepared with the SbfI and EcoRI enzymes and is available on NCBI by its study accession SRP059199. At the end of this notebook we also demonstrate the use of ipyrad.analysis tools to run downstream analyses on this data set. \n",
"\n",
"The figure below from this paper shows the general workflow in which two fairly distinct clades were sequenced together but then analyzed separately.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup (software and data files)\n",
"If you haven't done so yet, start by installing `ipyrad` using conda (see ipyrad installation instructions) as well as the packages in the cell below. This is easiest to do in a terminal. Then open a jupyter-notebook, like this one, and follow along with the tutorial by copying and executing the code in the cells, and adding your own documentation between them using markdown. Feel free to modify parameters to see their effects on the downstream results. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"## conda install ipyrad -c ipyrad\n",
"## conda install toytree -c eaton-lab\n",
"## conda install entrez-direct -c bioconda\n",
"## conda install sra-tools -c bioconda"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"## imports\n",
"import ipyrad as ip\n",
"import ipyrad.analysis as ipa\n",
"import ipyparallel as ipp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In contrast to the ipyrad CLI, the ipyrad API gives users much more fine-scale control over the parallelization of their analysis, but this also requires learning a little bit about the library that we use to do this, called `ipyparallel`. This library is designed for use with jupyter-notebooks to allow massive-scale multi-processing while working *interactively*. \n",
"\n",
"Understanding the nuts and bolts of it might take a little while, but it is fairly easy to get started, especially in the way it is integrated with ipyrad. To start a parallel client you must run the command-line program '`ipcluster`'. This will start a number of independent Python processes (kernels) to which we can then send bits of work. The cluster can be stopped and restarted independently of this notebook, which is convenient when working on an HPC system where many cores are not always immediately available. \n",
"\n",
"Open a terminal and type the following command to start an ipcluster instance with N engines. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"## ipcluster start --n=20"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## connect to cluster\n",
"ipyclient = ipp.Client()\n",
"ipyclient.ids"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download the data set (Finches)\n",
"These data are archived on the NCBI sequence read archive (SRA) under accession id SRP059199. For convenience, the data are also hosted at a public Dropbox link which is a bit easier to access. Run the code below to download and decompress the fastq data files, which will save them into a directory called fastqs-Finches/. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[####################] 100% Downloading fastq files | 0:00:52 | \n",
"24 fastq files downloaded to /home/deren/Documents/ipyrad/tests/fastqs-Finches\n"
]
}
],
"source": [
"## download the Finches data set from NCBI\n",
"sra = ipa.sratools(accession=\"SRP059199\", workdir=\"fastqs-Finches\")\n",
"sra.run(force=True, ipyclient=ipyclient)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create an Assembly object\n",
"This object stores the parameters of the assembly and the organization of data files. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"New Assembly: Finches\n"
]
}
],
"source": [
"## you must provide a name for the Assembly\n",
"data = ip.Assembly(\"Finches\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set the parameters for the Assembly. This will raise an error if any parameter value is not allowed because it is the wrong type or outside the allowed range. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 assembly_name Finches \n",
"1 project_dir ./analysis-ipyrad/Finches \n",
"2 raw_fastq_path \n",
"3 barcodes_path \n",
"4 sorted_fastq_path ./fastqs-Finches/*.fastq.gz \n",
"5 assembly_method denovo \n",
"6 reference_sequence \n",
"7 datatype ddrad \n",
"8 restriction_overhang ('CCTGCAGG', 'AATTC') \n",
"9 max_low_qual_bases 5 \n",
"10 phred_Qscore_offset 33 \n",
"11 mindepth_statistical 6 \n",
"12 mindepth_majrule 6 \n",
"13 maxdepth 10000 \n",
"14 clust_threshold 0.85 \n",
"15 max_barcode_mismatch 0 \n",
"16 filter_adapters 2 \n",
"17 filter_min_trim_len 35 \n",
"18 max_alleles_consens 2 \n",
"19 max_Ns_consens (5, 5) \n",
"20 max_Hs_consens (5, 5) \n",
"21 min_samples_locus 4 \n",
"22 max_SNPs_locus (20, 20) \n",
"23 max_Indels_locus (8, 8) \n",
"24 max_shared_Hs_locus 0.5 \n",
"25 trim_reads (0, 0, 0, 0) \n",
"26 trim_loci (0, 0, 0, 0) \n",
"27 output_formats ('p', 's', 'v', 'n', 'k', 'u') \n",
"28 pop_assign_file \n"
]
}
],
"source": [
"## set parameters\n",
"data.set_params(\"project_dir\", \"analysis-ipyrad/Finches\")\n",
"data.set_params(\"sorted_fastq_path\", \"fastqs-Finches/*.fastq.gz\")\n",
"data.set_params(\"datatype\", \"ddrad\")\n",
"data.set_params(\"restriction_overhang\", (\"CCTGCAGG\", \"AATTC\"))\n",
"data.set_params(\"clust_threshold\", \"0.85\")\n",
"data.set_params(\"filter_adapters\", \"2\")\n",
"data.set_params(\"max_Hs_consens\", (5, 5))\n",
"data.set_params(\"output_formats\", \"psvnkua\")\n",
"\n",
"## see/print all parameters\n",
"data.get_params()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Assemble the data set\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assembly: Finches\n",
"[####################] 100% loading reads | 0:00:03 | s1 | \n",
"[####################] 100% processing reads | 0:01:08 | s2 | \n"
]
}
],
"source": [
"## run steps 1 & 2 of the assembly\n",
"data.run(\"12\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" state reads_raw reads_passed_filter\n",
"Anomalospiza_imberbis 2 935241 889028\n",
"Clytospiza_monteiri 2 1220879 1168704\n",
"Lagonosticta_larvata 2 1001743 944653\n",
"Lagonosticta_rara 2 992534 934386\n",
"Lagonosticta_rhodopareia 2 1020850 961277\n",
"Lagonosticta_rubricata_congica 2 1064587 997387\n",
"Lagonosticta_rubricata_rubricata 2 1079701 1009532\n",
"Lagonosticta_rufopicta 2 898117 846834\n",
"Lagonosticta_sanguinodorsalis 2 1034739 980499\n",
"Lagonosticta_senegala_rendalli 2 834688 786701\n",
"Lagonosticta_senegala_rhodopsis 2 792799 749508\n",
"Vidua_chalybeata_amauropteryx 2 1031674 1028936\n",
"Vidua_chalybeata_neumanni 2 709824 678382\n",
"Vidua_fischeri 2 998161 935631\n",
"Vidua_hypocherina 2 889818 844395\n",
"Vidua_interjecta 2 741675 708862\n",
"Vidua_macroura_arenosa 2 939649 891987\n",
"Vidua_macroura_macroura 2 729322 690496\n",
"Vidua_obtusa 2 809186 763098\n",
"Vidua_orientalis 2 862619 824073\n",
"Vidua_paradisaea 2 833981 791532\n",
"Vidua_purpurascens 2 927116 883478\n",
"Vidua_raricola 2 956686 907898\n",
"Vidua_regia 2 1012887 965657"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## access the stats of the assembly (so far) from the .stats attribute\n",
"data.stats"
]
},
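{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the `.stats` attribute is a pandas DataFrame, it can also be sliced, sorted, or saved with standard pandas methods. As an optional example, the cell below writes the current stats table to a CSV file (the filename here is arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## optional: save the stats table to a CSV file (standard pandas method)\n",
"data.stats.to_csv(\"Finches_s2_stats.csv\")"
]
},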
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assembly: Finches\n",
"[####################] 100% dereplicating | 0:00:03 | s3 | \n",
"[####################] 100% clustering | 0:00:43 | s3 | \n",
"[####################] 100% building clusters | 0:00:05 | s3 | \n",
"[####################] 100% chunking | 0:00:01 | s3 | \n",
"[####################] 100% aligning | 0:05:02 | s3 | \n",
"[####################] 100% concatenating | 0:00:04 | s3 | \n",
"[####################] 100% inferring [H, E] | 0:00:35 | s4 | \n",
"[####################] 100% calculating depths | 0:00:03 | s5 | \n",
"[####################] 100% chunking clusters | 0:00:02 | s5 | \n",
"[####################] 100% consens calling | 0:01:24 | s5 | \n"
]
}
],
"source": [
"## run steps 3-5 (within-sample steps) of the assembly\n",
"data.run(\"345\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Branch to create separate data sets for Vidua and Lagonosticta"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"## create data set with only Vidua samples + outgroup\n",
"subs = [i for i in data.samples if \"Vidua\" in i] +\\\n",
" [i for i in data.samples if \"Anomalo\" in i]\n",
"vidua = data.branch(\"vidua\", subsamples=subs)\n",
"\n",
"## create data set with only Lagonosticta samples + outgroup\n",
"subs = [i for i in data.samples if \"Lagon\" in i] +\\\n",
" [i for i in data.samples if \"Clyto\" in i]\n",
"lagon = data.branch(\"lagon\", subsamples=subs)"
]
},
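{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, the `.samples` dictionary of each branched Assembly can be inspected to confirm that the expected samples were retained:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## optional: confirm the contents of each branch\n",
"print(vidua.name, len(vidua.samples))\n",
"print(lagon.name, len(lagon.samples))"
]
},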
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Assemble each data set through final steps\n",
"If you are pressed for time, you can assemble just one of the two data sets going forward. If so, you may want to choose Vidua, since that is the one we use with the downstream analysis tools at the end of this notebook. "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assembly: vidua\n",
"[####################] 100% concat/shuffle input | 0:00:01 | s6 | \n",
"[####################] 100% clustering across | 0:00:08 | s6 | \n",
"[####################] 100% building clusters | 0:00:02 | s6 | \n",
"[####################] 100% aligning clusters | 0:00:07 | s6 | \n",
"[####################] 100% database indels | 0:00:03 | s6 | \n",
"[####################] 100% indexing clusters | 0:00:01 | s6 | \n",
"[####################] 100% building database | 0:00:04 | s6 | \n"
]
}
],
"source": [
"vidua.run(\"6\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assembly: lagon\n",
"[####################] 100% concat/shuffle input | 0:00:01 | s6 | \n",
"[####################] 100% clustering across | 0:00:06 | s6 | \n",
"[####################] 100% building clusters | 0:00:01 | s6 | \n",
"[####################] 100% aligning clusters | 0:00:04 | s6 | \n",
"[####################] 100% database indels | 0:00:03 | s6 | \n",
"[####################] 100% indexing clusters | 0:00:01 | s6 | \n",
"[####################] 100% building database | 0:00:03 | s6 | \n"
]
}
],
"source": [
"lagon.run(\"6\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Branch to create several final data sets with different parameter settings\n",
"\n",
"Here we use nested for-loops to iterate over assemblies and parameter values. \n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assembly: vidua_min4\n",
"[####################] 100% filtering loci | 0:00:09 | s7 | \n",
"[####################] 100% building loci/stats | 0:00:01 | s7 | \n",
"[####################] 100% building vcf file | 0:00:04 | s7 | \n",
"[####################] 100% writing vcf file | 0:00:00 | s7 | \n",
"[####################] 100% building arrays | 0:00:07 | s7 | \n",
"[####################] 100% writing outfiles | 0:00:01 | s7 | \n",
"Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/vidua_min4_outfiles\n",
"\n",
"Assembly: vidua_min10\n",
"[####################] 100% filtering loci | 0:00:02 | s7 | \n",
"[####################] 100% building loci/stats | 0:00:02 | s7 | \n",
"[####################] 100% building vcf file | 0:00:02 | s7 | \n",
"[####################] 100% writing vcf file | 0:00:00 | s7 | \n",
"[####################] 100% building arrays | 0:00:07 | s7 | \n",
"[####################] 100% writing outfiles | 0:00:01 | s7 | \n",
"Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/vidua_min10_outfiles\n",
"\n",
"Assembly: lagon_min4\n",
"[####################] 100% filtering loci | 0:00:02 | s7 | \n",
"[####################] 100% building loci/stats | 0:00:01 | s7 | \n",
"[####################] 100% building vcf file | 0:00:01 | s7 | \n",
"[####################] 100% writing vcf file | 0:00:00 | s7 | \n",
"[####################] 100% building arrays | 0:00:06 | s7 | \n",
"[####################] 100% writing outfiles | 0:00:01 | s7 | \n",
"Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/lagon_min4_outfiles\n",
"\n",
"Assembly: lagon_min10\n",
"[####################] 100% filtering loci | 0:00:02 | s7 | \n",
"[####################] 100% building loci/stats | 0:00:01 | s7 | \n",
"[####################] 100% building vcf file | 0:00:02 | s7 | \n",
"[####################] 100% writing vcf file | 0:00:00 | s7 | \n",
"[####################] 100% building arrays | 0:00:06 | s7 | \n",
"[####################] 100% writing outfiles | 0:00:01 | s7 | \n",
"Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/lagon_min10_outfiles\n",
"\n"
]
}
],
"source": [
"## iterate over data set and parameters\n",
"for assembly in [vidua, lagon]:\n",
" for min_sample in [4, 10]:\n",
" \n",
" ## create new assembly, apply new name and parameters\n",
" newname = \"{}_min{}\".format(assembly.name, min_sample)\n",
" newdata = assembly.branch(newname)\n",
" newdata.set_params(\"min_samples_locus\", min_sample)\n",
" \n",
" ## run step 7 \n",
" newdata.run(\"7\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View final stats\n",
"The .stats attribute shows a stats summary for each sample, and per-step stats dataframes can be accessed from the .stats_dfs attribute of the Assembly. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"loading Assembly: vidua_min4\n",
"from saved path: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/vidua_min4.json\n"
]
},
{
"data": {
"text/html": [
"