{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assembly and analysis of Ficus RAD-seq data\n",
"\n",
"A RAD-seq library of 95 samples was prepared by Floragenex with the PstI restriction enzyme, followed by sonication and size selection. Stats reported by Floragenex include: AverageFragmentSize=386bp, Concentration=2.51ng/uL, Concentation=10nM. The library was sequenced on two lanes of Illumina HiSeq 3000 yielding 378,809,976 reads in lane 1, and 375,813,513 reads in lane 2, for a total of ~755M reads. "
]
},
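{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick arithmetic check of the total read count reported above (added for illustration; not part of the original pipeline):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## sanity check: total reads across the two lanes\n",
"lane1_reads = 378809976\n",
"lane2_reads = 375813513\n",
"print('total reads: {:,}'.format(lane1_reads + lane2_reads))"
]
},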
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### This notebook\n",
"This is a jupyter notebook, a tool used to create an executable document to full reproduce our analyses. This notebook contains all of the code to assemble the *Ficus* RAD-seq data set with *ipyrad*.\n",
"We begin by demultiplexing [The raw data](#The-raw-data). The demultiplexed data (will be) archived and available online. If you downloaded the demultiplexed data you can skip to section [The Demultiplexed Data](#The-demultiplexed-data) and begin by loading in those data. The data were assembled under a range of parameter settings, which you can see in the [Within-sample assembly](#Within-sample-assembly) section. Several Samples were filtered from the data set due to low coverage. The data was then clustered across Samples and final output files were created [Across-sample assembly](#Across-sample-assembly)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Required software\n",
"The following conda commands will locally install of the software required for this notebook. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"## conda install ipyrad -c ipyrad\n",
"## conda install toytree -c eaton-lab"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import ipyrad as ip\n",
"import toyplot"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ipyrad v.0.6.20\n",
"toyplot v.0.14.4\n"
]
}
],
"source": [
"## print software versions\n",
"print 'ipyrad v.{}'.format(ip.__version__)\n",
"print 'toyplot v.{}'.format(toyplot.__version__)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cluster info\n",
"I started an ipcluster instance on a 40 core workstation with the ipcluster command as shown below. The `cluster_info()` command shows that ipyrad is able to find all 40 cores on the cluster. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"##\n",
"## ipcluster start --n=40\n",
"##"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import ipyparallel as ipp\n",
"ipyclient = ipp.Client(profile=\"testing\")\n",
"#print ip.cluster_info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The raw data\n",
"The data came to us as two large 20GB files. The barcodes file was provided by Floragenex and maps sample names to barcodes that are contained inline in the sequences, and are 10bp in length. The barcodes are printed a little further below. I ran the program *fastQC* on the raw data files to do a quality check, the results of which are available here [lane1-fastqc](link) and here [lane2-fastqc](link). Overall, quality scores were very high and there was little (but some) adapter contamination, which we will filter out in the ipyrad analysis. "
]
},
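{
"cell_type": "markdown",
"metadata": {},
"source": [
"The exact *fastQC* invocation was not recorded in this notebook; a sketch of the likely command is below, assuming `fastqc` is on the PATH and writing to an output directory of our choosing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## run fastQC on each raw lane (sketch; the output dir name is an assumption)\n",
"## fastqc -o fastqc_results Ficus-1_S1_L001_R1_001.fastq.gz\n",
"## fastqc -o fastqc_results Ficus-2_S2_L002_R1_001.fastq.gz"
]
},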
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"## The reference genome link\n",
"reference = \"\"\"\\\n",
"ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/\\\n",
"002/002/945/GCA_002002945.1_F.carica_assembly01/\\\n",
"GCA_002002945.1_F.carica_assembly01_genomic.fna.gz\\\n",
"\"\"\"\n",
"\n",
"## Download the reference genome of F. carica\n",
"# ! wget $reference\n",
"\n",
"## decompress it\n",
"# ! gunzip ./GCA_002002945.1_F.carica_assembly01_genomic.fna.gz"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"## Locations of the raw data and barcodes file\n",
"lane1data = \"~/Documents/RADSEQ_DATA/Ficus/Ficus-1_S1_L001_R1_001.fastq.gz\"\n",
"lane2data = \"~/Documents/RADSEQ_DATA/Ficus/Ficus-2_S2_L002_R1_001.fastq.gz\"\n",
"barcodes = \"~/Documents/RADSEQ_DATA/barcodes/Ficus_Jander_2016_95barcodes.txt\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create an ipyrad Assembly object for each lane of data\n",
"We set the location to the data and barcodes info for each object, and set the max barcode mismatch parameter to zero (strict), allowing no mismatches. You can see the full barcode information [at this link](https://github.com/dereneaton/ficus-rad/blob/master/Ficus_Jander_2016_95barcodes.txt)."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" New Assembly: lane1\n",
" New Assembly: lane2\n"
]
}
],
"source": [
"## create an object to demultiplex each lane\n",
"demux1 = ip.Assembly(\"lane1\")\n",
"demux2 = ip.Assembly(\"lane2\")\n",
"\n",
"## set path to data, bcodes, and max_mismatch params\n",
"demux1.set_params(\"project_dir\", \"./ficus_demux_reads\")\n",
"demux1.set_params(\"raw_fastq_path\", lane1data)\n",
"demux1.set_params(\"barcodes_path\", barcodes)\n",
"demux1.set_params(\"max_barcode_mismatch\", 0)\n",
"\n",
"## set path to data, bcodes, and max_mismatch params\n",
"demux2.set_params(\"project_dir\", \"./ficus_demux_reads\")\n",
"demux2.set_params(\"raw_fastq_path\", lane2data)\n",
"demux2.set_params(\"barcodes_path\", barcodes)\n",
"demux2.set_params(\"max_barcode_mismatch\", 0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Demultiplex raw data from both lanes"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Assembly: lane1\n",
" [####################] 100% chunking large files | 0:21:55 | s1 | \n",
" [####################] 100% sorting reads | 0:43:43 | s1 | \n",
" [####################] 100% writing/compressing | 0:14:25 | s1 | \n",
"\n",
" Assembly: lane2\n",
" [####################] 100% chunking large files | 0:27:44 | s1 | \n",
" [####################] 100% sorting reads | 0:45:12 | s1 | \n",
" [####################] 100% writing/compressing | 0:15:53 | s1 | \n"
]
}
],
"source": [
"demux1.run(\"1\")\n",
"demux2.run(\"1\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The demultiplexed data\n",
"Now we have two directories with demultiplexed data, each with one gzipped fastq file corresponding to all of the reads matching to a particular Sample's barcode from that lane of sequencing. These are the data that (will be uploaded) to Genbank SRA when we publish. We will load the sorted fastq data at this step, to copy the same procedure that one would take if they were starting from access to the demultiplexed data."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"lib1_fastqs = \"./ficus_demux_reads/lane1_fastqs/*.gz\"\n",
"lib2_fastqs = \"./ficus_demux_reads/lane2_fastqs/*.gz\""
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Assembly: lib1\n",
" [####################] 100% loading reads | 0:01:47 | s1 | \n",
"\n",
" Assembly: lib2\n",
" [####################] 100% loading reads | 0:00:34 | s1 | \n"
]
}
],
"source": [
"lib1 = ip.Assembly(\"lib1\", quiet=True)\n",
"lib1.set_params(\"sorted_fastq_path\", lib1_fastqs)\n",
"lib1.run(\"1\")\n",
"\n",
"lib2 = ip.Assembly(\"lib2\", quiet=True)\n",
"lib2.set_params(\"sorted_fastq_path\", lib2_fastqs)\n",
"lib2.run(\"1\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Merge the two lanes of data into one Assembly\n",
"We will join these two demultiplexed libraries into a single analysis that has the set of parameters we will use to assemble the data set. To do this we use the `merge()` command in ipyrad. On this merged Assembly we will then set a number of parameter settings that we will use to assemble the data. \n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0 assembly_name merged \n",
" 1 project_dir ./analysis-ipyrad \n",
" 2 raw_fastq_path Merged: lane1, lane2 \n",
" 3 barcodes_path Merged: lane1, lane2 \n",
" 4 sorted_fastq_path Merged: lane1, lane2 \n",
" 5 assembly_method denovo \n",
" 6 reference_sequence \n",
" 7 datatype rad \n",
" 8 restriction_overhang ('TGCAG', '') \n",
" 9 max_low_qual_bases 5 \n",
" 10 phred_Qscore_offset 43 \n",
" 11 mindepth_statistical 6 \n",
" 12 mindepth_majrule 6 \n",
" 13 maxdepth 10000 \n",
" 14 clust_threshold 0.85 \n",
" 15 max_barcode_mismatch 0 \n",
" 16 filter_adapters 3 \n",
" 17 filter_min_trim_len 60 \n",
" 18 max_alleles_consens 2 \n",
" 19 max_Ns_consens (5, 5) \n",
" 20 max_Hs_consens (5, 5) \n",
" 21 min_samples_locus 4 \n",
" 22 max_SNPs_locus (20, 20) \n",
" 23 max_Indels_locus (8, 8) \n",
" 24 max_shared_Hs_locus 4 \n",
" 25 trim_reads (0, 0, 0, 0) \n",
" 26 trim_loci (0, 8, 0, 0) \n",
" 27 output_formats ('l', 'k', 's', 'a', 'p', 'n', 'v') \n",
" 28 pop_assign_file \n"
]
}
],
"source": [
"## named corresponding to some params we are changing\n",
"data = ip.merge(\"merged\", [demux1, demux2])\n",
"\n",
"## set several non-default parameters\n",
"data.set_params(\"project_dir\", \"analysis-ipyrad\")\n",
"data.set_params(\"filter_adapters\", 3)\n",
"data.set_params(\"phred_Qscore_offset\", 43)\n",
"data.set_params(\"max_Hs_consens\", (5, 5))\n",
"data.set_params(\"max_shared_Hs_locus\", 4)\n",
"data.set_params(\"filter_min_trim_len\", 60)\n",
"data.set_params(\"trim_loci\", (0, 8, 0, 0))\n",
"data.set_params(\"output_formats\", list(\"lksapnv\"))\n",
"\n",
"## print parameters for prosperity's sake\n",
"data.get_params()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create branch `ficus` and drop the control Sample\n",
"First we will drop the control sequence included for quality checking by Floragenex (FGXCONTROL). To do this we create a new branch using the argument `subsample` to include all Samples except FGXCONTROL. "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"## drop the Floragenex control sample if it is in the data\n",
"snames = [i for i in data.samples if i != \"FGXCONTROL\"]\n",
"\n",
"## working branch\n",
"data = data.branch(\"ficus\", subsamples=snames)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A summary of the number of reads per Sample."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"summary of raw read covereage\n",
"count 95\n",
"mean 5149596\n",
"std 7716026\n",
"min 14703\n",
"25% 289541\n",
"50% 1890328\n",
"75% 7402440\n",
"max 51339646\n",
"Name: reads_raw, dtype: int64\n"
]
}
],
"source": [
"print \"summary of raw read covereage\"\n",
"print data.stats.reads_raw.describe().astype(int)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering options\n",
"From looking closely at the data it appears there are som poor quality reads with adapter contamination, and also that there are some conspicuous long strings of poly repeats, which are probably due to the library being put on the sequencer in the wrong concentration (the facility failed to do a qPCR quantification). Setting the filter parameter in ipyrad to strict (2) uses 'cutadapt' to filter the reads. By default ipyrad would look just for the Illumina universal adapter, but I'm also adding a few additional poly-{A,C,G,T} sequences to be trimmed. These appeared to be somewhat common in the raw data, followed by nonsense. "
]
},
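{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a roughly equivalent standalone *cutadapt* call is sketched below. This is illustrative only: the flags and file names are assumptions, not the exact command ipyrad runs internally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## sketch of an equivalent cutadapt call (flags and filenames are illustrative)\n",
"## cutadapt -a AGATCGGAAGAGC -a 'A{10}' -a 'C{10}' -a 'G{10}' -a 'T{10}' \\\n",
"##     --minimum-length 60 -o sample.trimmed.fastq.gz sample.fastq.gz"
]
},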
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Assembly: ficus\n",
" [####################] 100% concatenating inputs | 0:02:43 | s2 | \n",
" [####################] 100% processing reads | 0:51:52 | s2 | \n"
]
}
],
"source": [
"## run step 2\n",
"data.run(\"2\", force=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Within-sample assembly\n",
"Steps 2-5 of ipyrad function to filter and cluster reads, and to call consensus haplotypes within samples. We'll look more closely at the stats for each step after it's finished. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### _reference_ & _de novo_ assemblies with ipyrad"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"## create new branches for assembly method\n",
"ficus_d = data.branch(\"ficus_d\")\n",
"ficus_r = data.branch(\"ficus_r\")\n",
"\n",
"## set reference info\n",
"reference = \"GCA_002002945.1_F.carica_assembly01_genomic.fna\"\n",
"ficus_r.set_params(\"reference_sequence\", reference)\n",
"ficus_r.set_params(\"assembly_method\", \"reference\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"## map reads to reference genome\n",
"ficus_r.run(\"3\", force=True)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Assembly: ficus_d\n",
" [####################] 100% dereplicating | 0:04:31 | s3 | \n",
" [####################] 100% clustering | 0:13:54 | s3 | \n",
" [####################] 100% building clusters | 0:00:46 | s3 | \n",
" [####################] 100% chunking | 0:00:11 | s3 | \n",
" [####################] 100% aligning | 0:23:58 | s3 | \n",
" [####################] 100% concatenating | 0:00:14 | s3 | \n"
]
}
],
"source": [
"## cluster reads denovo\n",
"ficus_d.run(\"3\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Assembly: ficus_d\n",
" [####################] 100% inferring [H, E] | 0:14:01 | s4 | \n"
]
}
],
"source": [
"ficus_d.run(\"4\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Branch to make consensus calls at different mindepth settings\n",
"Now that the reads are filtered and clustered within each Sample we want to try applying several different parameter settings for downstream analyses. One major difference will be in the minimum depth of sequencing we require to make a confident base call. We will leave one Assembly with the default setting of 6, which is somewhat conservative. We will also create a 'lowdepth' Assembly that allows base calls for depths as low as 2. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"ficus_dhi = ficus_d.branch(\"ficus_dhi\")\n",
"ficus_dlo = ficus_d.branch(\"ficus_dlo\")\n",
"ficus_dlo.set_params(\"mindepth_majrule\", 1)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Assembly: ficus_dlo\n",
" [####################] 100% calculating depths | 0:00:23 | s5 | \n",
" [####################] 100% chunking clusters | 0:00:21 | s5 | \n",
" [####################] 100% consens calling | 0:15:17 | s5 | \n",
"\n",
" Assembly: ficus_dhi\n",
" [####################] 100% calculating depths | 0:00:23 | s5 | \n",
" [####################] 100% chunking clusters | 0:00:21 | s5 | \n",
" [####################] 100% consens calling | 0:14:24 | s5 | \n"
]
}
],
"source": [
"ficus_dlo.run(\"5\")\n",
"ficus_dhi.run(\"5\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plot consens reads\n",
"Compare hidepth and lodepth assemblies. The difference is not actually that great. Regardless, the samples with very few reads are going to recover very few clusters. \n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
" \n",
"
\n",
" Save as .csv\n",
"
\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import numpy as np\n",
"import toyplot\n",
"\n",
"## stack columns of consens stats\n",
"zero = np.zeros(ficus_dhi.stats.shape[0])\n",
"upper = ficus_dhi.stats.reads_consens\n",
"lower = -1*ficus_dlo.stats.reads_consens\n",
"boundaries = np.column_stack((lower, zero, upper))\n",
"\n",
"## plot barplots\n",
"canvas = toyplot.Canvas(width=700, height=300)\n",
"axes = canvas.cartesian()\n",
"axes.bars(boundaries, baseline=None)\n",
"axes.y.ticks.show = True\n",
"axes.y.ticks.labels.angle = -90"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cluster data across Samples\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Assembly: ficus_dlo\n",
" [####################] 100% concat/shuffle input | 0:00:51 | s6 | \n",
" [####################] 100% clustering across | 3:26:51 | s6 | \n",
" [####################] 100% building clusters | 0:00:40 | s6 | \n",
" [####################] 100% aligning clusters | 0:03:15 | s6 | \n",
" [####################] 100% database indels | 0:01:00 | s6 | \n",
" [####################] 100% indexing clusters | 0:07:29 | s6 | \n",
" [####################] 100% building database | 0:51:26 | s6 | \n",
"\n",
" Assembly: ficus_dhi\n",
" [####################] 100% concat/shuffle input | 0:00:45 | s6 | \n",
" [####################] 100% clustering across | 2:05:50 | s6 | \n",
" [####################] 100% building clusters | 0:00:37 | s6 | \n",
" [####################] 100% aligning clusters | 0:02:56 | s6 | \n",
" [####################] 100% database indels | 0:00:50 | s6 | \n",
" [####################] 100% indexing clusters | 0:06:03 | s6 | \n",
" [####################] 100% building database | 0:41:25 | s6 | \n"
]
}
],
"source": [
"## run step 6 on full and subsampled data sets\n",
"ficus_dlo.run(\"6\")\n",
"ficus_dhi.run(\"6\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create branches that include/exclude Samples with little data\n",
"We have several Samples that recovered very little data, probably as a result of having low quality DNA extractions. Figs are hard. We'll assemble one data set that includes all of these samples, but since they are likely to have little information we'll also assemble most of our data sets without these low data samples. "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"ficus_dhi = ip.load_json(\"analysis-ipyrad/ficus_dhi.json\", 1)\n",
"ficus_dlo = ip.load_json(\"analysis-ipyrad/ficus_dlo.json\", 1)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Assembly: ficus_dhi\n",
" [####################] 100% filtering loci | 0:01:13 | s7 | \n",
" [####################] 100% building loci/stats | 0:00:51 | s7 | \n",
" [####################] 100% building alleles | 0:01:02 | s7 | \n",
" [####################] 100% building vcf file | 0:02:37 | s7 | \n",
" [####################] 100% writing vcf file | 0:00:01 | s7 | \n",
" [####################] 100% building arrays | 0:00:43 | s7 | \n",
" [####################] 100% writing outfiles | 0:02:12 | s7 | \n",
" Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dhi_outfiles\n",
"\n",
" Assembly: ficus_dlo\n",
" [####################] 100% filtering loci | 0:01:07 | s7 | \n",
" [####################] 100% building loci/stats | 0:00:59 | s7 | \n",
" [####################] 100% building alleles | 0:01:11 | s7 | \n",
" [####################] 100% building vcf file | 0:03:07 | s7 | \n",
" [####################] 100% writing vcf file | 0:00:01 | s7 | \n",
" [####################] 100% building arrays | 0:00:54 | s7 | \n",
" [####################] 100% writing outfiles | 0:02:20 | s7 | \n",
" Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dlo_outfiles\n"
]
}
],
"source": [
"## infer a min4 assembly to see which samples have least data\n",
"ficus_dhi.run(\"7\", force=True)\n",
"ficus_dlo.run(\"7\", force=True)"
]
},
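{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration of how the low-data Samples could be excluded, the sketch below branches to keep only Samples above a consens-read cutoff. The 10,000-read threshold and the branch name are hypothetical, not the values used in our final assemblies."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## hypothetical example: branch to exclude low-data Samples\n",
"## (the 10K consens-read cutoff is illustrative only)\n",
"keep = ficus_dhi.stats[ficus_dhi.stats.reads_consens >= 10000].index.tolist()\n",
"ficus_dhi_sub = ficus_dhi.branch(\"ficus_dhi_sub\", subsamples=keep)"
]
},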
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"