{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assembly and analysis of Ficus RAD-seq data\n", "\n", "A RAD-seq library of 95 samples was prepared by Floragenex with the PstI restriction enzyme, followed by sonication and size selection. Stats reported by Floragenex include: AverageFragmentSize=386bp, Concentration=2.51ng/uL, Concentration=10nM. The library was sequenced on two lanes of Illumina HiSeq 3000, yielding 378,809,976 reads in lane 1 and 375,813,513 reads in lane 2, for a total of ~755M reads. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### This notebook\n", "This is a Jupyter notebook, a tool used to create an executable document that fully reproduces our analyses. This notebook contains all of the code to assemble the *Ficus* RAD-seq data set with *ipyrad*.\n", "We begin by demultiplexing the raw data (see [The raw data](#The-raw-data)). The demultiplexed data (will be) archived and available online. If you downloaded the demultiplexed data you can skip to the section [The demultiplexed data](#The-demultiplexed-data) and begin by loading in those data. The data were assembled under a range of parameter settings, which you can see in the [Within-sample assembly](#Within-sample-assembly) section. Several Samples were filtered from the data set due to low coverage. The data were then clustered across Samples and final output files were created in the [Across-sample assembly](#Across-sample-assembly) section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required software\n", "The following conda commands will locally install all of the software required for this notebook. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## conda install ipyrad -c ipyrad\n", "## conda install toytree -c eaton-lab" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import ipyrad as ip\n", "import toyplot" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ipyrad v.0.6.20\n", "toyplot v.0.14.4\n" ] } ], "source": [ "## print software versions\n", "print 'ipyrad v.{}'.format(ip.__version__)\n", "print 'toyplot v.{}'.format(toyplot.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cluster info\n", "I started an ipcluster instance on a 40-core workstation with the `ipcluster` command shown below. The `cluster_info()` command shows that ipyrad is able to find all 40 cores on the cluster. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "##\n", "## ipcluster start --n=40\n", "##" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import ipyparallel as ipp\n", "ipyclient = ipp.Client(profile=\"testing\")\n", "#print ip.cluster_info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The raw data\n", "The data came to us as two large 20GB files. The barcodes file, provided by Floragenex, maps sample names to the 10bp barcodes contained inline in the sequences. The full barcode information is linked a little further below. I ran the program *FastQC* on the raw data files as a quality check; the results are available here [lane1-fastqc](link) and here [lane2-fastqc](link). 
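As a quick sanity check on the reported lane totals, the reads in a gzipped fastq file can be counted directly, since each fastq record spans exactly four lines. This is a minimal sketch; the path is a placeholder for wherever the lane-1 file actually lives.

```python
import gzip

## hypothetical path: substitute the real location of the lane-1 file
RAW = "./Ficus-1_S1_L001_R1_001.fastq.gz"

## count lines; each fastq record is exactly four lines
nlines = 0
with gzip.open(RAW) as infile:
    for line in infile:
        nlines += 1
print("reads in file: {}".format(nlines // 4))
```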
Overall, quality scores were very high and there was little (but some) adapter contamination, which we will filter out in the ipyrad analysis. " ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## The reference genome link\n", "reference = \"\"\"\\\n", "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/\\\n", "002/002/945/GCA_002002945.1_F.carica_assembly01/\\\n", "GCA_002002945.1_F.carica_assembly01_genomic.fna.gz\\\n", "\"\"\"\n", "\n", "## Download the reference genome of F. carica\n", "# ! wget $reference\n", "\n", "## decompress it\n", "# ! gunzip ./GCA_002002945.1_F.carica_assembly01_genomic.fna.gz" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## Locations of the raw data and barcodes file\n", "lane1data = \"~/Documents/RADSEQ_DATA/Ficus/Ficus-1_S1_L001_R1_001.fastq.gz\"\n", "lane2data = \"~/Documents/RADSEQ_DATA/Ficus/Ficus-2_S2_L002_R1_001.fastq.gz\"\n", "barcodes = \"~/Documents/RADSEQ_DATA/barcodes/Ficus_Jander_2016_95barcodes.txt\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create an ipyrad Assembly object for each lane of data\n", "We set the locations of the data and barcodes files for each object, and set the max barcode mismatch parameter to zero (strict: no mismatches allowed). You can see the full barcode information [at this link](https://github.com/dereneaton/ficus-rad/blob/master/Ficus_Jander_2016_95barcodes.txt)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " New Assembly: lane1\n", " New Assembly: lane2\n" ] } ], "source": [ "## create an object to demultiplex each lane\n", "demux1 = ip.Assembly(\"lane1\")\n", "demux2 = ip.Assembly(\"lane2\")\n", "\n", "## set path to data, bcodes, and max_mismatch params\n", "demux1.set_params(\"project_dir\", \"./ficus_demux_reads\")\n", "demux1.set_params(\"raw_fastq_path\", lane1data)\n", "demux1.set_params(\"barcodes_path\", barcodes)\n", "demux1.set_params(\"max_barcode_mismatch\", 0)\n", "\n", "## set path to data, bcodes, and max_mismatch params\n", "demux2.set_params(\"project_dir\", \"./ficus_demux_reads\")\n", "demux2.set_params(\"raw_fastq_path\", lane2data)\n", "demux2.set_params(\"barcodes_path\", barcodes)\n", "demux2.set_params(\"max_barcode_mismatch\", 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Demultiplex raw data from both lanes" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Assembly: lane1\n", " [####################] 100% chunking large files | 0:21:55 | s1 | \n", " [####################] 100% sorting reads | 0:43:43 | s1 | \n", " [####################] 100% writing/compressing | 0:14:25 | s1 | \n", "\n", " Assembly: lane2\n", " [####################] 100% chunking large files | 0:27:44 | s1 | \n", " [####################] 100% sorting reads | 0:45:12 | s1 | \n", " [####################] 100% writing/compressing | 0:15:53 | s1 | \n" ] } ], "source": [ "demux1.run(\"1\")\n", "demux2.run(\"1\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The demultiplexed data\n", "Now we have two directories of demultiplexed data, each with one gzipped fastq file containing all of the reads matching a particular Sample's barcode from that lane of sequencing. These are the data that (will be uploaded) to GenBank SRA when we publish. 
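Before loading, it is worth confirming that each lane directory contains one gzipped fastq file per barcode in the barcodes file. A minimal sketch, assuming the project_dir used above:

```python
import glob

## demultiplexed outputs are written to <project_dir>/<assembly_name>_fastqs/
lane1_files = glob.glob("./ficus_demux_reads/lane1_fastqs/*.gz")
lane2_files = glob.glob("./ficus_demux_reads/lane2_fastqs/*.gz")
print("lane1: {} files, lane2: {} files".format(len(lane1_files), len(lane2_files)))
```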
We will load the sorted fastq data at this step, to mirror the procedure one would follow when starting from the demultiplexed data." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "lib1_fastqs = \"./ficus_demux_reads/lane1_fastqs/*.gz\"\n", "lib2_fastqs = \"./ficus_demux_reads/lane2_fastqs/*.gz\"" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Assembly: lib1\n", " [####################] 100% loading reads | 0:01:47 | s1 | \n", "\n", " Assembly: lib2\n", " [####################] 100% loading reads | 0:00:34 | s1 | \n" ] } ], "source": [ "lib1 = ip.Assembly(\"lib1\", quiet=True)\n", "lib1.set_params(\"sorted_fastq_path\", lib1_fastqs)\n", "lib1.run(\"1\")\n", "\n", "lib2 = ip.Assembly(\"lib2\", quiet=True)\n", "lib2.set_params(\"sorted_fastq_path\", lib2_fastqs)\n", "lib2.run(\"1\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merge the two lanes of data into one Assembly\n", "We will join the two demultiplexed libraries into a single Assembly using the `merge()` command in ipyrad. On this merged Assembly we then set the non-default parameters that we will use to assemble the data set. \n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0 assembly_name merged \n", " 1 project_dir ./analysis-ipyrad \n", " 2 raw_fastq_path Merged: lane1, lane2 \n", " 3 barcodes_path Merged: lane1, lane2 \n", " 4 sorted_fastq_path Merged: lane1, lane2 \n", " 5 assembly_method denovo \n", " 6 reference_sequence \n", " 7 datatype rad \n", " 8 restriction_overhang ('TGCAG', '') \n", " 9 max_low_qual_bases 5 \n", " 10 phred_Qscore_offset 43 \n", " 11 mindepth_statistical 6 \n", " 12 mindepth_majrule 6 \n", " 13 maxdepth 10000 \n", " 14 clust_threshold 0.85 \n", " 15 max_barcode_mismatch 0 \n", " 16 filter_adapters 3 \n", " 17 filter_min_trim_len 60 \n", " 18 max_alleles_consens 2 \n", " 19 max_Ns_consens (5, 5) \n", " 20 max_Hs_consens (5, 5) \n", " 21 min_samples_locus 4 \n", " 22 max_SNPs_locus (20, 20) \n", " 23 max_Indels_locus (8, 8) \n", " 24 max_shared_Hs_locus 4 \n", " 25 trim_reads (0, 0, 0, 0) \n", " 26 trim_loci (0, 8, 0, 0) \n", " 27 output_formats ('l', 'k', 's', 'a', 'p', 'n', 'v') \n", " 28 pop_assign_file \n" ] } ], "source": [ "## merge the two lanes into a single Assembly\n", "data = ip.merge(\"merged\", [demux1, demux2])\n", "\n", "## set several non-default parameters\n", "data.set_params(\"project_dir\", \"analysis-ipyrad\")\n", "data.set_params(\"filter_adapters\", 3)\n", "data.set_params(\"phred_Qscore_offset\", 43)\n", "data.set_params(\"max_Hs_consens\", (5, 5))\n", "data.set_params(\"max_shared_Hs_locus\", 4)\n", "data.set_params(\"filter_min_trim_len\", 60)\n", "data.set_params(\"trim_loci\", (0, 8, 0, 0))\n", "data.set_params(\"output_formats\", list(\"lksapnv\"))\n", "\n", "## print parameters for posterity's sake\n", "data.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create branch `ficus` and drop the control Sample\n", "First we will drop the control sequence included for quality checking by Floragenex (FGXCONTROL). To do this we create a new branch using the argument `subsamples` to include all Samples except FGXCONTROL. 
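After the branching cell below runs, a quick check confirms that the control was dropped and counts the Samples that remain (a sketch):

```python
## the control should be absent from the new branch
assert "FGXCONTROL" not in data.samples

## report how many Samples were carried over
print("{} Samples in branch {}".format(len(data.samples), data.name))
```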
" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## drop the Floragenex control sample if it is in the data\n", "snames = [i for i in data.samples if i != \"FGXCONTROL\"]\n", "\n", "## working branch\n", "data = data.branch(\"ficus\", subsamples=snames)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A summary of the number of reads per Sample." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "summary of raw read covereage\n", "count 95\n", "mean 5149596\n", "std 7716026\n", "min 14703\n", "25% 289541\n", "50% 1890328\n", "75% 7402440\n", "max 51339646\n", "Name: reads_raw, dtype: int64\n" ] } ], "source": [ "print \"summary of raw read covereage\"\n", "print data.stats.reads_raw.describe().astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering options\n", "From looking closely at the data it appears there are som poor quality reads with adapter contamination, and also that there are some conspicuous long strings of poly repeats, which are probably due to the library being put on the sequencer in the wrong concentration (the facility failed to do a qPCR quantification). Setting the filter parameter in ipyrad to strict (2) uses 'cutadapt' to filter the reads. By default ipyrad would look just for the Illumina universal adapter, but I'm also adding a few additional poly-{A,C,G,T} sequences to be trimmed. These appeared to be somewhat common in the raw data, followed by nonsense. " ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Assembly: ficus\n", " [####################] 100% concatenating inputs | 0:02:43 | s2 | \n", " [####################] 100% processing reads | 0:51:52 | s2 | \n" ] } ], "source": [ "## run step 2\n", "data.run(\"2\", force=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Within-sample assembly\n", "Steps 2-5 of ipyrad function to filter and cluster reads, and to call consensus haplotypes within samples. We'll look more closely at the stats for each step after it's finished. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _reference_ & _de novo_ assemblies with ipyrad" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "## create new branches for assembly method\n", "ficus_d = data.branch(\"ficus_d\")\n", "ficus_r = data.branch(\"ficus_r\")\n", "\n", "## set reference info\n", "reference = \"GCA_002002945.1_F.carica_assembly01_genomic.fna\"\n", "ficus_r.set_params(\"reference_sequence\", reference)\n", "ficus_r.set_params(\"assembly_method\", \"reference\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## map reads to reference genome\n", "ficus_r.run(\"3\", force=True)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Assembly: ficus_d\n", " [####################] 100% dereplicating | 0:04:31 | s3 | \n", " [####################] 100% clustering | 0:13:54 | s3 | \n", " [####################] 100% building clusters | 0:00:46 | s3 | \n", " [####################] 100% chunking | 0:00:11 | s3 | \n", " [####################] 100% aligning | 0:23:58 | s3 | \n", " [####################] 100% concatenating | 0:00:14 | s3 | \n" ] } ], "source": [ "## cluster reads denovo\n", "ficus_d.run(\"3\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Assembly: ficus_d\n", " [####################] 100% inferring [H, E] | 0:14:01 | s4 | \n" ] } ], "source": [ "ficus_d.run(\"4\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Branch to make consensus calls at different mindepth settings\n", "Now that the reads are filtered and clustered within each Sample we want to try applying several different parameter settings for downstream analyses. One major difference will be in the minimum depth of sequencing we require to make a confident base call. We will leave one Assembly with the default setting of 6, which is somewhat conservative. We will also create a 'lowdepth' Assembly that allows base calls for depths as low as 2. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ficus_dhi = ficus_d.branch(\"ficus_dhi\")\n", "ficus_dlo = ficus_d.branch(\"ficus_dlo\")\n", "ficus_dlo.set_params(\"mindepth_majrule\", 1)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Assembly: ficus_dlo\n", " [####################] 100% calculating depths | 0:00:23 | s5 | \n", " [####################] 100% chunking clusters | 0:00:21 | s5 | \n", " [####################] 100% consens calling | 0:15:17 | s5 | \n", "\n", " Assembly: ficus_dhi\n", " [####################] 100% calculating depths | 0:00:23 | s5 | \n", " [####################] 100% chunking clusters | 0:00:21 | s5 | \n", " [####################] 100% consens calling | 0:14:24 | s5 | \n" ] } ], "source": [ "ficus_dlo.run(\"5\")\n", "ficus_dhi.run(\"5\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot consens reads\n", "Compare hidepth and lodepth assemblies. The difference is not actually that great. Regardless, the samples with very few reads are going to recover very few clusters. 
\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[toyplot bar plot omitted: reads_consens per Sample, ficus_dhi above the baseline and ficus_dlo (negated) below; x-axis ticks 0-90 (Samples), y-axis ticks -200000 to 200000 (consens reads)]
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "import toyplot\n", "\n", "## stack columns of consens stats\n", "zero = np.zeros(ficus_dhi.stats.shape[0])\n", "upper = ficus_dhi.stats.reads_consens\n", "lower = -1*ficus_dlo.stats.reads_consens\n", "boundaries = np.column_stack((lower, zero, upper))\n", "\n", "## plot barplots\n", "canvas = toyplot.Canvas(width=700, height=300)\n", "axes = canvas.cartesian()\n", "axes.bars(boundaries, baseline=None)\n", "axes.y.ticks.show = True\n", "axes.y.ticks.labels.angle = -90" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cluster data across Samples\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Assembly: ficus_dlo\n", " [####################] 100% concat/shuffle input | 0:00:51 | s6 | \n", " [####################] 100% clustering across | 3:26:51 | s6 | \n", " [####################] 100% building clusters | 0:00:40 | s6 | \n", " [####################] 100% aligning clusters | 0:03:15 | s6 | \n", " [####################] 100% database indels | 0:01:00 | s6 | \n", " [####################] 100% indexing clusters | 0:07:29 | s6 | \n", " [####################] 100% building database | 0:51:26 | s6 | \n", "\n", " Assembly: ficus_dhi\n", " [####################] 100% concat/shuffle input | 0:00:45 | s6 | \n", " [####################] 100% clustering across | 2:05:50 | s6 | \n", " [####################] 100% building clusters | 0:00:37 | s6 | \n", " [####################] 100% aligning clusters | 0:02:56 | s6 | \n", " [####################] 100% database indels | 0:00:50 | s6 | \n", " [####################] 100% indexing clusters | 0:06:03 | s6 | \n", " [####################] 100% building database | 0:41:25 | s6 | \n" ] } ], "source": [ "## run step 6 on full and subsampled data sets\n", "ficus_dlo.run(\"6\")\n", "ficus_dhi.run(\"6\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create branches that include/exclude Samples with little data\n", "We have several Samples that recovered very little data, probably as a result of having low quality DNA extractions. Figs are hard. We'll assemble one data set that includes all of these samples, but since they are likely to have little information we'll also assemble most of our data sets without these low data samples. 
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ficus_dhi = ip.load_json(\"analysis-ipyrad/ficus_dhi.json\", 1)\n", "ficus_dlo = ip.load_json(\"analysis-ipyrad/ficus_dlo.json\", 1)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Assembly: ficus_dhi\n", " [####################] 100% filtering loci | 0:01:13 | s7 | \n", " [####################] 100% building loci/stats | 0:00:51 | s7 | \n", " [####################] 100% building alleles | 0:01:02 | s7 | \n", " [####################] 100% building vcf file | 0:02:37 | s7 | \n", " [####################] 100% writing vcf file | 0:00:01 | s7 | \n", " [####################] 100% building arrays | 0:00:43 | s7 | \n", " [####################] 100% writing outfiles | 0:02:12 | s7 | \n", " Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dhi_outfiles\n", "\n", " Assembly: ficus_dlo\n", " [####################] 100% filtering loci | 0:01:07 | s7 | \n", " [####################] 100% building loci/stats | 0:00:59 | s7 | \n", " [####################] 100% building alleles | 0:01:11 | s7 | \n", " [####################] 100% building vcf file | 0:03:07 | s7 | \n", " [####################] 100% writing vcf file | 0:00:01 | s7 | \n", " [####################] 100% building arrays | 0:00:54 | s7 | \n", " [####################] 100% writing outfiles | 0:02:20 | s7 | \n", " Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dlo_outfiles\n" ] } ], "source": [ "## infer a min4 assembly to see which samples have least data\n", "ficus_dhi.run(\"7\", force=True)\n", "ficus_dlo.run(\"7\", force=True)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sample_coverage
A01_paraensis6133
A02_paraensis11534
A04_paraensis35023
A05_paraensis14597
A06_obtusifolia605
A07_obtusifolia455
A16_citrifolia3664
A18_citrifolia3804
A19_citrifolia9194
A26_popenoei542
\n", "
" ], "text/plain": [ " sample_coverage\n", "A01_paraensis 6133\n", "A02_paraensis 11534\n", "A04_paraensis 35023\n", "A05_paraensis 14597\n", "A06_obtusifolia 605\n", "A07_obtusifolia 455\n", "A16_citrifolia 3664\n", "A18_citrifolia 3804\n", "A19_citrifolia 9194\n", "A26_popenoei 542" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ficus_dlo.stats_dfs.s7_samples.head(10)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "excluded samples:\t\tconsens reads\n", " A07_obtusifolia \t3128\n", " A62_turbinata \t3400\n", " C02_citrifolia \t44\n", " A26_popenoei \t3374\n", " A63_turbinata \t1562\n", " A34_nymphaeifolia \t12204\n", " A75_colubrinae \t3982\n", " A29_popenoei \t1473\n", " A28_popenoei \t3168\n", " C33_triangle \t1373\n", " A38_nymphaeifolia \t6826\n", " C09_costaricana \t189\n", " A06_obtusifolia \t5760\n", " C01_bullenei \t3826\n", " C52_citrifolia \t269\n", " C54_citrifolia \t1409\n", " C32_triangleXtrigonata\t2389\n", " A27_popenoei \t2467\n", " C34_triangle \t374\n" ] } ], "source": [ "## get list of samples with > 1000 reads in the min4\n", "mask = ficus_dlo.stats_dfs.s7_samples.sample_coverage > 1000\n", "skeep = ficus_dlo.stats.index[mask].tolist()\n", "\n", "## print who was excluded\n", "print \"excluded samples:\\t\\tconsens reads\"\n", "for name, dat in ficus_dlo.samples.items():\n", " if name not in skeep:\n", " print \" {:<22}\\t{}\".format(name, int(dat.stats.reads_consens))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assemble final data sets\n", "Also with different thresholds for missing data. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Assembly: ficus_dlo_s4\n", " [####################] 100% filtering loci | 0:00:43 | s7 | \n", " [####################] 100% building loci/stats | 0:00:34 | s7 | \n", " [####################] 100% building alleles | 0:00:44 | s7 | \n", " [####################] 100% building vcf file | 0:01:26 | s7 | \n", " [####################] 100% writing vcf file | 0:00:01 | s7 | \n", " [####################] 100% building arrays | 0:00:40 | s7 | \n", " [####################] 100% writing outfiles | 0:01:21 | s7 | \n", " Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dlo_s4_outfiles\n", "\n", " Assembly: ficus_dlo_s10\n", " [####################] 100% filtering loci | 0:00:44 | s7 | \n", " [####################] 100% writing outfiles | 0:00:47 | s7 | \n", " Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dlo_s10_outfiles\n", "\n", " Assembly: ficus_dlo_s20\n", " [####################] 100% filtering loci | 0:00:44 | s7 | \n", " [####################] 100% building loci/stats | 0:00:33 | s7 | \n", " [####################] 100% building alleles | 0:00:41 | s7 | \n", " [####################] 100% building vcf file | 0:01:02 | s7 | \n", " [####################] 100% writing vcf file | 0:00:00 | s7 | \n", " [####################] 100% building arrays | 0:00:38 | s7 | \n", " [####################] 100% writing outfiles | 0:00:32 | s7 | \n", " Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dlo_s20_outfiles\n", "\n", " Assembly: ficus_dhi_s4\n", " [####################] 100% filtering loci | 0:00:36 | s7 | \n", " [####################] 100% building loci/stats | 0:00:27 | s7 | \n", " [####################] 100% building alleles | 0:00:37 | s7 | 
\n", " [####################] 100% building vcf file | 0:01:14 | s7 | \n", " [####################] 100% writing vcf file | 0:00:01 | s7 | \n", " [####################] 100% building arrays | 0:00:32 | s7 | \n", " [####################] 100% writing outfiles | 0:01:14 | s7 | \n", " Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dhi_s4_outfiles\n", "\n", " Assembly: ficus_dhi_s10\n", " [####################] 100% filtering loci | 0:00:37 | s7 | \n", " [####################] 100% building loci/stats | 0:00:27 | s7 | \n", " [####################] 100% building alleles | 0:00:35 | s7 | \n", " [####################] 100% building vcf file | 0:01:00 | s7 | \n", " [####################] 100% writing vcf file | 0:00:00 | s7 | \n", " [####################] 100% building arrays | 0:00:31 | s7 | \n", " [####################] 100% writing outfiles | 0:00:43 | s7 | \n", " Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dhi_s10_outfiles\n", "\n", " Assembly: ficus_dhi_s20\n", " [####################] 100% filtering loci | 0:00:36 | s7 | \n", " [####################] 100% building loci/stats | 0:00:27 | s7 | \n", " [####################] 100% building alleles | 0:00:35 | s7 | \n", " [####################] 100% building vcf file | 0:00:53 | s7 | \n", " [####################] 100% writing vcf file | 0:00:00 | s7 | \n", " [####################] 100% building arrays | 0:00:30 | s7 | \n", " [####################] 100% writing outfiles | 0:00:29 | s7 | \n", " Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/ficus_dhi_s20_outfiles\n" ] } ], "source": [ "## final min4s, min10s, and min20\n", "for assembly in [ficus_dlo, ficus_dhi]:\n", " for minsamp in [4, 10, 20]:\n", " ## new branch names\n", " subname = assembly.name + \"_s{}\".format(minsamp)\n", " \n", " ## make new branches with subsampling\n", " subdata = assembly.branch(subname, subsamples=skeep)\n", " \n", " ## set minsamp value\n", " subdata.set_params(\"min_samples_locus\", minsamp)\n", " \n", " ## run step7\n", " subdata.run(\"7\", force=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check some final stats" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_filtersapplied_orderretained_loci
total_prefiltered_loci2502340250234
filtered_by_rm_duplicates64586458243776
filtered_by_max_indels48894889238887
filtered_by_max_snps99804476234411
filtered_by_max_shared_het109859067225344
filtered_by_min_sample19777518681938525
filtered_by_max_alleles17038460933916
total_filtered_loci33916033916
\n", "
" ], "text/plain": [ " total_filters applied_order retained_loci\n", "total_prefiltered_loci 250234 0 250234\n", "filtered_by_rm_duplicates 6458 6458 243776\n", "filtered_by_max_indels 4889 4889 238887\n", "filtered_by_max_snps 9980 4476 234411\n", "filtered_by_max_shared_het 10985 9067 225344\n", "filtered_by_min_sample 197775 186819 38525\n", "filtered_by_max_alleles 17038 4609 33916\n", "total_filtered_loci 33916 0 33916" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#subdata = ip.load_json(\"analysis-ipyrad/ficus_dhi_s20\")\n", "subdata.stats_dfs.s7_filters" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sample_coverage
A01_paraensis3178
A02_paraensis6990
A04_paraensis26258
A05_paraensis10147
A16_citrifolia1870
A18_citrifolia2028
A19_citrifolia5755
A33_nymphaeifolia2344
A41_nymphaeifolia19448
A42_nymphaeifolia25988
A48_trigonata5099
A49_trigonata5963
A55_triangle5659
A59_dugandii22571
A60_dugandii1555
A61_turbinata4224
A65_pertusa26857
A67_bullenei12912
A69_bullenei1530
A70_bullenei9034
A71_bullenei1141
A72_bullenei2881
A77_colubrinae22439
A82_perforata17661
A83_perforata17932
A84_perforata26821
A85_perforata16701
A87_costaricana2200
A94_maxima18643
A95_insipida10999
A96_glabrata20708
A97_glabrata16130
B102_obtusifolia26421
B103_obtusifolia728
B118_maxima1542
B119_maxima17984
B120_maxima20001
B123_maxima15354
B126_insipida22977
B127_insipida18709
B128_insipida22847
B130_glabrataXmaxima22352
B131_glabrataXmaxima23084
B133_glabrata21720
B134_glabrata22720
C04_colubrinae24782
C11_costaricana26317
C12_dugandii7438
C14_dugandii26742
C15_insipida22907
C17_maxima18066
C18_maxima23139
C19_nymphaeifolia26665
C21_obtusifolia2627
C22_obtusifolia27114
C24_obtusifolia25269
C25_popenoei26463
C26_popenoei25572
C27_popenoei15584
C28_pertusa25187
C30_triangle7256
C31_triangle12661
C36_trigonata27118
C37_trigonata26778
C39_trigonata26270
C41_trigonata10822
C43_trigonata27126
C45_yoponensis18584
C46_yoponensis22783
C47_yoponensis23199
C48_tonduzii20918
C49_dugandii24327
C50_insipida23203
C51_perforata22246
C53_citrifolia3392
C5_colubrinae26818
\n", "
" ], "text/plain": [ " sample_coverage\n", "A01_paraensis 3178\n", "A02_paraensis 6990\n", "A04_paraensis 26258\n", "A05_paraensis 10147\n", "A16_citrifolia 1870\n", "A18_citrifolia 2028\n", "A19_citrifolia 5755\n", "A33_nymphaeifolia 2344\n", "A41_nymphaeifolia 19448\n", "A42_nymphaeifolia 25988\n", "A48_trigonata 5099\n", "A49_trigonata 5963\n", "A55_triangle 5659\n", "A59_dugandii 22571\n", "A60_dugandii 1555\n", "A61_turbinata 4224\n", "A65_pertusa 26857\n", "A67_bullenei 12912\n", "A69_bullenei 1530\n", "A70_bullenei 9034\n", "A71_bullenei 1141\n", "A72_bullenei 2881\n", "A77_colubrinae 22439\n", "A82_perforata 17661\n", "A83_perforata 17932\n", "A84_perforata 26821\n", "A85_perforata 16701\n", "A87_costaricana 2200\n", "A94_maxima 18643\n", "A95_insipida 10999\n", "A96_glabrata 20708\n", "A97_glabrata 16130\n", "B102_obtusifolia 26421\n", "B103_obtusifolia 728\n", "B118_maxima 1542\n", "B119_maxima 17984\n", "B120_maxima 20001\n", "B123_maxima 15354\n", "B126_insipida 22977\n", "B127_insipida 18709\n", "B128_insipida 22847\n", "B130_glabrataXmaxima 22352\n", "B131_glabrataXmaxima 23084\n", "B133_glabrata 21720\n", "B134_glabrata 22720\n", "C04_colubrinae 24782\n", "C11_costaricana 26317\n", "C12_dugandii 7438\n", "C14_dugandii 26742\n", "C15_insipida 22907\n", "C17_maxima 18066\n", "C18_maxima 23139\n", "C19_nymphaeifolia 26665\n", "C21_obtusifolia 2627\n", "C22_obtusifolia 27114\n", "C24_obtusifolia 25269\n", "C25_popenoei 26463\n", "C26_popenoei 25572\n", "C27_popenoei 15584\n", "C28_pertusa 25187\n", "C30_triangle 7256\n", "C31_triangle 12661\n", "C36_trigonata 27118\n", "C37_trigonata 26778\n", "C39_trigonata 26270\n", "C41_trigonata 10822\n", "C43_trigonata 27126\n", "C45_yoponensis 18584\n", "C46_yoponensis 22783\n", "C47_yoponensis 23199\n", "C48_tonduzii 20918\n", "C49_dugandii 24327\n", "C50_insipida 23203\n", "C51_perforata 22246\n", "C53_citrifolia 3392\n", "C5_colubrinae 26818" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subdata.stats_dfs.s7_samples" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
locus_coveragesum_coverage
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2024322432
2118874319
2212485567
239106477
248447321
258488169
268419010
278829892
2891110803
2984911652
3082912481
3179513276
3270513981
3368014661
3466415325
3563815963
3654416507
3753817045
3858917634
3957618210
4064418854
4168619540
4275420294
4377021064
4494122005
45101923024
46109424118
47107325191
48115026341
49112227463
50108328546
51112929675
5290830583
5384831431
5469132122
5557032692
5641633108
5731733425
5822033645
5912233767
607733844
613633880
621833898
631533913
64233915
65033915
66133916
67033916
68033916
69033916
70033916
71033916
72033916
73033916
74033916
75033916
76033916
\n", "
" ], "text/plain": [ " locus_coverage sum_coverage\n", "1 0 0\n", "2 0 0\n", "3 0 0\n", "4 0 0\n", "5 0 0\n", "6 0 0\n", "7 0 0\n", "8 0 0\n", "9 0 0\n", "10 0 0\n", "11 0 0\n", "12 0 0\n", "13 0 0\n", "14 0 0\n", "15 0 0\n", "16 0 0\n", "17 0 0\n", "18 0 0\n", "19 0 0\n", "20 2432 2432\n", "21 1887 4319\n", "22 1248 5567\n", "23 910 6477\n", "24 844 7321\n", "25 848 8169\n", "26 841 9010\n", "27 882 9892\n", "28 911 10803\n", "29 849 11652\n", "30 829 12481\n", "31 795 13276\n", "32 705 13981\n", "33 680 14661\n", "34 664 15325\n", "35 638 15963\n", "36 544 16507\n", "37 538 17045\n", "38 589 17634\n", "39 576 18210\n", "40 644 18854\n", "41 686 19540\n", "42 754 20294\n", "43 770 21064\n", "44 941 22005\n", "45 1019 23024\n", "46 1094 24118\n", "47 1073 25191\n", "48 1150 26341\n", "49 1122 27463\n", "50 1083 28546\n", "51 1129 29675\n", "52 908 30583\n", "53 848 31431\n", "54 691 32122\n", "55 570 32692\n", "56 416 33108\n", "57 317 33425\n", "58 220 33645\n", "59 122 33767\n", "60 77 33844\n", "61 36 33880\n", "62 18 33898\n", "63 15 33913\n", "64 2 33915\n", "65 0 33915\n", "66 1 33916\n", "67 0 33916\n", "68 0 33916\n", "69 0 33916\n", "70 0 33916\n", "71 0 33916\n", "72 0 33916\n", "73 0 33916\n", "74 0 33916\n", "75 0 33916\n", "76 0 33916" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subdata.stats_dfs.s7_loci" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Link to all of the final stats files on github" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ [ficus_dlo](...)\n", "+ [ficus_dhi](...)\n", "+ [ficus_dlo_s4](...)\n", "+ [ficus_dhi_s4](...)\n", "+ [ficus_dlo_s10](...)\n", "+ [ficus_dhi_s10](...)\n", "+ [ficus_dlo_s20](...)\n", "+ [ficus_dhi_s20](...)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }