{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Canarium GBS Assembly\n", "### *Federman et al.*\n", "\n", "This notebook provides all code necessary to reproduce the assembled GBS data sets used in Federman et al. (xxxx). Starting from demultiplexed fastq data files, we assemble the data into four complete data sets that were used in downstream analyses. All code in this notebook is written in Python and uses the *ipyrad* package for assembly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required software" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## conda install ipyrad -c ipyrad\n", "## conda install sra-tools -c bioconda\n", "## conda install entrez-direct -c bioconda" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import ipyrad as ip\n", "import ipyrad.analysis as ipa" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ipyrad v.0.7.20\n" ] } ], "source": [ "print \"ipyrad v.{}\".format(ip.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Connect to cluster" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "host compute node: [40 cores] on sacra\n" ] } ], "source": [ "import ipyparallel as ipp\n", "ipyclient = ipp.Client()\n", "ip.cluster_info(ipyclient)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Links to Sections\n", "\n", "+ [Download Sequence Data](#Download-sequence-data-from-SRA)\n", "+ [ipyrad Assembly](#ipyrad-Assembly)\n", "+ [Assembly stats](#Assembly-stats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download sequence data from SRA" ] }, { "cell_type": "code", 
"execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r", "Fetching project data..." ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>Run</th><th>spots</th><th>spots_with_mates</th><th>ScientificName</th><th>SampleName</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>SRR5534921</td><td>12818693</td><td>0</td><td>Canarium lamianum</td><td>SF327</td></tr>\n",
" <tr><th>1</th><td>SRR5534922</td><td>5675773</td><td>0</td><td>Canarium longistipulatum</td><td>D12950</td></tr>\n",
" <tr><th>2</th><td>SRR5534923</td><td>34404746</td><td>0</td><td>Canarium ovatum</td><td>D14269</td></tr>\n",
" <tr><th>3</th><td>SRR5534924</td><td>3382649</td><td>0</td><td>Canarium pilicarpum</td><td>5573</td></tr>\n",
" <tr><th>4</th><td>SRR5534925</td><td>16632442</td><td>0</td><td>Canarium obtusifolium</td><td>SF228</td></tr>\n",
" <tr><th>5</th><td>SRR5534926</td><td>10769159</td><td>0</td><td>Canarium odontophyllum</td><td>SFC1988</td></tr>\n",
" <tr><th>6</th><td>SRR5534927</td><td>47625</td><td>0</td><td>Canarium ntidifolium</td><td>4304</td></tr>\n",
" <tr><th>7</th><td>SRR5534928</td><td>17034881</td><td>0</td><td>Canarium obtusifolium</td><td>SF224</td></tr>\n",
" <tr><th>8</th><td>SRR5534929</td><td>211932</td><td>0</td><td>Canarium ferrugineum</td><td>SF343</td></tr>\n",
" <tr><th>9</th><td>SRR5534930</td><td>402493</td><td>0</td><td>Canarium galokense</td><td>D13101</td></tr>\n",
" <tr><th>10</th><td>SRR5534931</td><td>5796976</td><td>0</td><td>Canarium galokense</td><td>SF155</td></tr>\n",
" <tr><th>11</th><td>SRR5534932</td><td>3219083</td><td>0</td><td>Canarium globosum</td><td>SF200</td></tr>\n",
" <tr><th>12</th><td>SRR5534933</td><td>1902860</td><td>0</td><td>Canarium globosum</td><td>SF209</td></tr>\n",
" <tr><th>13</th><td>SRR5534934</td><td>9709559</td><td>0</td><td>Canarium indicum</td><td>D13374</td></tr>\n",
" <tr><th>14</th><td>SRR5534935</td><td>2192159</td><td>0</td><td>Canarium lamianum</td><td>D13063</td></tr>\n",
" <tr><th>15</th><td>SRR5534936</td><td>450626</td><td>0</td><td>Canarium lamianum</td><td>SF160</td></tr>\n",
" <tr><th>16</th><td>SRR5534937</td><td>12874061</td><td>0</td><td>Canarium pulchrebracteatum</td><td>SF286</td></tr>\n",
" <tr><th>17</th><td>SRR5534938</td><td>2499902</td><td>0</td><td>Canarium compressum</td><td>D13090</td></tr>\n",
" <tr><th>18</th><td>SRR5534939</td><td>7260851</td><td>0</td><td>Canarium ferrugineum</td><td>SF172</td></tr>\n",
" <tr><th>19</th><td>SRR5534940</td><td>15736904</td><td>0</td><td>Canarium pulchrebracteatum</td><td>SF276</td></tr>\n",
" <tr><th>20</th><td>SRR5534941</td><td>12539878</td><td>0</td><td>Canarium pilicarpum</td><td>D13052</td></tr>\n",
" <tr><th>21</th><td>SRR5534942</td><td>23355083</td><td>0</td><td>Canarium compressum</td><td>D13097</td></tr>\n",
" <tr><th>22</th><td>SRR5534943</td><td>2097537</td><td>0</td><td>Canarium egregium</td><td>D13103</td></tr>\n",
" <tr><th>23</th><td>SRR5534944</td><td>1033763</td><td>0</td><td>Canarium elegans</td><td>D12963</td></tr>\n",
" <tr><th>24</th><td>SRR5534945</td><td>2725698</td><td>0</td><td>Canarium bengalense</td><td>D13852</td></tr>\n",
" <tr><th>25</th><td>SRR5534946</td><td>12132085</td><td>0</td><td>Canarium betamponae</td><td>SF175</td></tr>\n",
" <tr><th>26</th><td>SRR5534947</td><td>18446248</td><td>0</td><td>Canarium betamponae</td><td>SF328</td></tr>\n",
" <tr><th>27</th><td>SRR5534948</td><td>400266</td><td>0</td><td>Canarium boivinii</td><td>D12962</td></tr>\n",
" <tr><th>28</th><td>SRR5534949</td><td>11647948</td><td>0</td><td>Canarium velutinifolium</td><td>D14505</td></tr>\n",
" <tr><th>29</th><td>SRR5534950</td><td>12852942</td><td>0</td><td>Canarium velutinifolium</td><td>D14504</td></tr>\n",
" <tr><th>30</th><td>SRR5534951</td><td>16100807</td><td>0</td><td>Canarium scholasticum</td><td>SF197</td></tr>\n",
" <tr><th>31</th><td>SRR5534952</td><td>186682</td><td>0</td><td>Canarium scholasticum</td><td>SF301</td></tr>\n",
" <tr><th>32</th><td>SRR5534953</td><td>1152381</td><td>0</td><td>Canarium planifolium</td><td>SF153</td></tr>\n",
" <tr><th>33</th><td>SRR5534954</td><td>3803237</td><td>0</td><td>Canarium multiflorum</td><td>D14501</td></tr>\n",
" <tr><th>34</th><td>SRR5534955</td><td>4734247</td><td>0</td><td>Canarium multiflorum</td><td>D14485</td></tr>\n",
" <tr><th>35</th><td>SRR5534956</td><td>10744745</td><td>0</td><td>Canarium multinerve</td><td>D14482</td></tr>\n",
" <tr><th>36</th><td>SRR5534957</td><td>2757099</td><td>0</td><td>Canarium multiflorum</td><td>D14513</td></tr>\n",
" <tr><th>37</th><td>SRR5534958</td><td>6645549</td><td>0</td><td>Canarium multiflorum</td><td>D14477</td></tr>\n",
" <tr><th>38</th><td>SRR5534959</td><td>225668</td><td>0</td><td>Canarium madagascariense</td><td>D13091</td></tr>\n",
" <tr><th>39</th><td>SRR5534960</td><td>7962974</td><td>0</td><td>Canarium multiflorum</td><td>D14480</td></tr>\n",
" <tr><th>40</th><td>SRR5534961</td><td>19015238</td><td>0</td><td>Canarium multiflorum</td><td>D14478</td></tr>\n",
" <tr><th>41</th><td>SRR5534962</td><td>668679</td><td>0</td><td>Canarium scholasticum</td><td>D13075</td></tr>\n",
" <tr><th>42</th><td>SRR5534963</td><td>294419</td><td>0</td><td>Canarium multinerve</td><td>D14492</td></tr>\n",
" <tr><th>43</th><td>SRR5534964</td><td>11466007</td><td>0</td><td>Canarium multinerve</td><td>D14483</td></tr>\n",
" <tr><th>44</th><td>SRR5534965</td><td>9829153</td><td>0</td><td>Canarium planifolium</td><td>SF164</td></tr>\n",
" <tr><th>45</th><td>SRR5534966</td><td>1788748</td><td>0</td><td>Canarium pulchrebracteatum</td><td>D14528</td></tr>\n",
" <tr><th>46</th><td>SRR5534967</td><td>5687824</td><td>0</td><td>Canarium velutinifolium</td><td>D14506</td></tr>\n",
" <tr><th>47</th><td>SRR5534968</td><td>1694555</td><td>0</td><td>Canarium ferrugineum</td><td>D13053</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Run spots spots_with_mates ScientificName \\\n", "0 SRR5534921 12818693 0 Canarium lamianum \n", "1 SRR5534922 5675773 0 Canarium longistipulatum \n", "2 SRR5534923 34404746 0 Canarium ovatum \n", "3 SRR5534924 3382649 0 Canarium pilicarpum \n", "4 SRR5534925 16632442 0 Canarium obtusifolium \n", "5 SRR5534926 10769159 0 Canarium odontophyllum \n", "6 SRR5534927 47625 0 Canarium ntidifolium \n", "7 SRR5534928 17034881 0 Canarium obtusifolium \n", "8 SRR5534929 211932 0 Canarium ferrugineum \n", "9 SRR5534930 402493 0 Canarium galokense \n", "10 SRR5534931 5796976 0 Canarium galokense \n", "11 SRR5534932 3219083 0 Canarium globosum \n", "12 SRR5534933 1902860 0 Canarium globosum \n", "13 SRR5534934 9709559 0 Canarium indicum \n", "14 SRR5534935 2192159 0 Canarium lamianum \n", "15 SRR5534936 450626 0 Canarium lamianum \n", "16 SRR5534937 12874061 0 Canarium pulchrebracteatum \n", "17 SRR5534938 2499902 0 Canarium compressum \n", "18 SRR5534939 7260851 0 Canarium ferrugineum \n", "19 SRR5534940 15736904 0 Canarium pulchrebracteatum \n", "20 SRR5534941 12539878 0 Canarium pilicarpum \n", "21 SRR5534942 23355083 0 Canarium compressum \n", "22 SRR5534943 2097537 0 Canarium egregium \n", "23 SRR5534944 1033763 0 Canarium elegans \n", "24 SRR5534945 2725698 0 Canarium bengalense \n", "25 SRR5534946 12132085 0 Canarium betamponae \n", "26 SRR5534947 18446248 0 Canarium betamponae \n", "27 SRR5534948 400266 0 Canarium boivinii \n", "28 SRR5534949 11647948 0 Canarium velutinifolium \n", "29 SRR5534950 12852942 0 Canarium velutinifolium \n", "30 SRR5534951 16100807 0 Canarium scholasticum \n", "31 SRR5534952 186682 0 Canarium scholasticum \n", "32 SRR5534953 1152381 0 Canarium planifolium \n", "33 SRR5534954 3803237 0 Canarium multiflorum \n", "34 SRR5534955 4734247 0 Canarium multiflorum \n", "35 SRR5534956 10744745 0 Canarium multinerve \n", "36 SRR5534957 2757099 0 Canarium multiflorum \n", "37 SRR5534958 6645549 0 Canarium multiflorum \n", "38 
SRR5534959 225668 0 Canarium madagascariense \n", "39 SRR5534960 7962974 0 Canarium multiflorum \n", "40 SRR5534961 19015238 0 Canarium multiflorum \n", "41 SRR5534962 668679 0 Canarium scholasticum \n", "42 SRR5534963 294419 0 Canarium multinerve \n", "43 SRR5534964 11466007 0 Canarium multinerve \n", "44 SRR5534965 9829153 0 Canarium planifolium \n", "45 SRR5534966 1788748 0 Canarium pulchrebracteatum \n", "46 SRR5534967 5687824 0 Canarium velutinifolium \n", "47 SRR5534968 1694555 0 Canarium ferrugineum \n", "\n", " SampleName \n", "0 SF327 \n", "1 D12950 \n", "2 D14269 \n", "3 5573 \n", "4 SF228 \n", "5 SFC1988 \n", "6 4304 \n", "7 SF224 \n", "8 SF343 \n", "9 D13101 \n", "10 SF155 \n", "11 SF200 \n", "12 SF209 \n", "13 D13374 \n", "14 D13063 \n", "15 SF160 \n", "16 SF286 \n", "17 D13090 \n", "18 SF172 \n", "19 SF276 \n", "20 D13052 \n", "21 D13097 \n", "22 D13103 \n", "23 D12963 \n", "24 D13852 \n", "25 SF175 \n", "26 SF328 \n", "27 D12962 \n", "28 D14505 \n", "29 D14504 \n", "30 SF197 \n", "31 SF301 \n", "32 SF153 \n", "33 D14501 \n", "34 D14485 \n", "35 D14482 \n", "36 D14513 \n", "37 D14477 \n", "38 D13091 \n", "39 D14480 \n", "40 D14478 \n", "41 D13075 \n", "42 D14492 \n", "43 D14483 \n", "44 SF164 \n", "45 D14528 \n", "46 D14506 \n", "47 D13053 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## get accession data from sra\n", "sra = ipa.sratools(accession=\"SRP106882\", workdir=\"./fastq-files\")\n", "\n", "## print run info for posterity\n", "run_info = sra.fetch_runinfo((1,4,6,29,30))\n", "run_info" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[####################] 100% Downloading fastq files | 0:24:21 | \n", "48 fastq files downloaded to /home/deren/Documents/Canarium/fastq-files\n" ] } ], "source": [ "## run parallel download\n", "sra.run(ipyclient=ipyclient)" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "## ipyrad Assembly\n", "\n", "Enter parameter values for the ipyrad assembly." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New Assembly: Canarium\n", "0 assembly_name Canarium \n", "1 project_dir ./analysis-ipyrad \n", "2 raw_fastq_path \n", "3 barcodes_path \n", "4 sorted_fastq_path ./fastq-files/*.gz \n", "5 assembly_method denovo \n", "6 reference_sequence \n", "7 datatype gbs \n", "8 restriction_overhang ('CWGC', 'CWGC') \n", "9 max_low_qual_bases 5 \n", "10 phred_Qscore_offset 33 \n", "11 mindepth_statistical 6 \n", "12 mindepth_majrule 6 \n", "13 maxdepth 10000 \n", "14 clust_threshold 0.9 \n", "15 max_barcode_mismatch 0 \n", "16 filter_adapters 2 \n", "17 filter_min_trim_len 35 \n", "18 max_alleles_consens 2 \n", "19 max_Ns_consens (5, 5) \n", "20 max_Hs_consens (8, 8) \n", "21 min_samples_locus 4 \n", "22 max_SNPs_locus (10, 10) \n", "23 max_Indels_locus (8, 8) \n", "24 max_shared_Hs_locus 4 \n", "25 trim_reads (0, 0, 0, 0) \n", "26 trim_loci (0, 5) \n", "27 output_formats ('l', 'p', 's', 'v', 'k', 'a') \n", "28 pop_assign_file \n" ] } ], "source": [ "## create an Assembly\n", "data = ip.Assembly(\"Canarium\")\n", "\n", "## set params\n", "data.set_params(\"project_dir\", \"analysis-ipyrad\")\n", "data.set_params(\"sorted_fastq_path\", \"./fastq-files/*.gz\")\n", "data.set_params(\"restriction_overhang\", (\"CWGC\", \"CWGC\"))\n", "data.set_params(\"datatype\", \"gbs\")\n", "data.set_params(\"clust_threshold\", 0.90)\n", "data.set_params(\"filter_adapters\", 2)\n", "data.set_params(\"max_SNPs_locus\", (10, 10))\n", "data.set_params(\"max_shared_Hs_locus\", 4)\n", "data.set_params(\"trim_reads\", (0, 0))\n", "data.set_params(\"trim_loci\", (0, 5))\n", "data.set_params(\"output_formats\", list(\"lpsvka\"))\n", "\n", "## print params for posterity\n", "data.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assemble reads within 
each Sample" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data.run(\"12\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Assembly: Canarium\n", "[####################] 100% dereplicating | 0:01:55 | s3 | | \n", "[####################] 100% clustering | 15:30:31 | s3 | \n", "[####################] 100% building clusters | 0:04:02 | s3 | \n", "[####################] 100% chunking | 0:00:42 | s3 | \n", "[####################] 100% aligning | 1:00:07 | s3 | \n", "[####################] 100% concatenating | 0:03:13 | s3 | \n" ] } ], "source": [ "data.run(\"3\", force=True)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Assembly: Canarium\n", "[####################] 100% inferring [H, E] | 0:12:06 | s4 | \n", "[####################] 100% calculating depths | 0:00:54 | s5 | \n", "[####################] 100% chunking clusters | 0:01:22 | s5 | \n", "[####################] 100% consens calling | 0:21:34 | s5 | \n" ] } ], "source": [ "data.run(\"45\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Assembly: Canarium\n", "[####################] 100% concat/shuffle input | 0:01:06 | s6 | \n", "[####################] 100% clustering across | 5:43:06 | s6 | \n", "[####################] 100% building clusters | 0:00:53 | s6 | \n", "[####################] 100% aligning clusters | 0:04:13 | s6 | \n", "[####################] 100% database indels | 0:02:13 | s6 | \n", "[####################] 100% indexing clusters | 0:02:19 | s6 | \n", "[####################] 100% building database | 0:24:48 | s6 | \n" ] } ], "source": [ "data.run(\"6\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "Assembly: Canarium\n", "[####################] 100% filtering loci | 0:00:39 | s7 | \n", "[####################] 100% building loci/stats | 0:00:29 | s7 | \n", "[####################] 100% building alleles | 0:00:36 | s7 | \n", "[####################] 100% building vcf file | 0:01:06 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:31 | s7 | \n", "[####################] 100% writing outfiles | 0:01:00 | s7 | \n", "Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium_outfiles\n", "\n" ] } ], "source": [ "data.run(\"7\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Full assembly stats" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loading Assembly: Canarium\n", "from saved path: ~/Documents/Canarium/analysis-ipyrad/Canarium.json\n" ] } ], "source": [ "## re-load assembly in case coming back to this notebook later\n", "data = ip.load_json(\"analysis-ipyrad/Canarium.json\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total N reads: 365012834\n", "mean N reads/sample: 7604434.04167\n", "S.D. N reads/sample: 7407655.17926\n" ] } ], "source": [ "## print some stats\n", "print \"total N reads:\", data.stats.reads_raw.sum()\n", "print \"mean N reads/sample:\", data.stats.reads_raw.mean()\n", "print \"S.D. N reads/sample:\", data.stats.reads_raw.std()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Excluding low-data samples" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>sample_coverage</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>4304</th><td>7</td></tr>\n",
" <tr><th>5573</th><td>37672</td></tr>\n",
" <tr><th>D12950</th><td>44861</td></tr>\n",
" <tr><th>D12962</th><td>1981</td></tr>\n",
" <tr><th>D12963</th><td>10141</td></tr>\n",
" <tr><th>D13052</th><td>75316</td></tr>\n",
" <tr><th>D13053</th><td>21812</td></tr>\n",
" <tr><th>D13063</th><td>28227</td></tr>\n",
" <tr><th>D13075</th><td>4096</td></tr>\n",
" <tr><th>D13090</th><td>27541</td></tr>\n",
" <tr><th>D13091</th><td>1059</td></tr>\n",
" <tr><th>D13097</th><td>81399</td></tr>\n",
" <tr><th>D13101</th><td>2860</td></tr>\n",
" <tr><th>D13103</th><td>26359</td></tr>\n",
" <tr><th>D13374</th><td>56701</td></tr>\n",
" <tr><th>D13852</th><td>13687</td></tr>\n",
" <tr><th>D14269</th><td>79114</td></tr>\n",
" <tr><th>D14477</th><td>61515</td></tr>\n",
" <tr><th>D14478</th><td>97836</td></tr>\n",
" <tr><th>D14480</th><td>70588</td></tr>\n",
" <tr><th>D14482</th><td>69750</td></tr>\n",
" <tr><th>D14483</th><td>71352</td></tr>\n",
" <tr><th>D14485</th><td>51841</td></tr>\n",
" <tr><th>D14492</th><td>1272</td></tr>\n",
" <tr><th>D14501</th><td>49981</td></tr>\n",
" <tr><th>D14504</th><td>77524</td></tr>\n",
" <tr><th>D14505</th><td>77445</td></tr>\n",
" <tr><th>D14506</th><td>60470</td></tr>\n",
" <tr><th>D14513</th><td>39505</td></tr>\n",
" <tr><th>D14528</th><td>26587</td></tr>\n",
" <tr><th>SF153</th><td>9269</td></tr>\n",
" <tr><th>SF155</th><td>53298</td></tr>\n",
" <tr><th>SF160</th><td>2856</td></tr>\n",
" <tr><th>SF164</th><td>69764</td></tr>\n",
" <tr><th>SF172</th><td>64881</td></tr>\n",
" <tr><th>SF175</th><td>69099</td></tr>\n",
" <tr><th>SF197</th><td>82914</td></tr>\n",
" <tr><th>SF200</th><td>42029</td></tr>\n",
" <tr><th>SF209</th><td>28325</td></tr>\n",
" <tr><th>SF224</th><td>81039</td></tr>\n",
" <tr><th>SF228</th><td>82338</td></tr>\n",
" <tr><th>SF276</th><td>82961</td></tr>\n",
" <tr><th>SF286</th><td>78681</td></tr>\n",
" <tr><th>SF301</th><td>785</td></tr>\n",
" <tr><th>SF327</th><td>72815</td></tr>\n",
" <tr><th>SF328</th><td>80219</td></tr>\n",
" <tr><th>SF343</th><td>894</td></tr>\n",
" <tr><th>SFC1988</th><td>51364</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " sample_coverage\n", "4304 7\n", "5573 37672\n", "D12950 44861\n", "D12962 1981\n", "D12963 10141\n", "D13052 75316\n", "D13053 21812\n", "D13063 28227\n", "D13075 4096\n", "D13090 27541\n", "D13091 1059\n", "D13097 81399\n", "D13101 2860\n", "D13103 26359\n", "D13374 56701\n", "D13852 13687\n", "D14269 79114\n", "D14477 61515\n", "D14478 97836\n", "D14480 70588\n", "D14482 69750\n", "D14483 71352\n", "D14485 51841\n", "D14492 1272\n", "D14501 49981\n", "D14504 77524\n", "D14505 77445\n", "D14506 60470\n", "D14513 39505\n", "D14528 26587\n", "SF153 9269\n", "SF155 53298\n", "SF160 2856\n", "SF164 69764\n", "SF172 64881\n", "SF175 69099\n", "SF197 82914\n", "SF200 42029\n", "SF209 28325\n", "SF224 81039\n", "SF228 82338\n", "SF276 82961\n", "SF286 78681\n", "SF301 785\n", "SF327 72815\n", "SF328 80219\n", "SF343 894\n", "SFC1988 51364" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## number of loci recovered per sample (sample_coverage)\n", "data.stats_dfs.s7_samples" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## make subset lists to exclude taxa with little data\n", "subs = [i.name for i in data.samples.values() if i.stats.reads_consens > 12000]\n", "subsnout = list(set(subs) - set([\"D14269\", \"D13374\", \"SFC1988\", \"D13852\"]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Final assemblies" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## make new branches\n", "Can_min4 = data.branch(\"Canarium-min4\", subsamples=subs)\n", "Can_min10 = data.branch(\"Canarium-min10\", subsamples=subs)\n", "Can_min20 = data.branch(\"Canarium-min20\", subsamples=subs)\n", "Can_min30nout = data.branch(\"Canarium-min30-nout\", subsamples=subsnout)\n", "\n", "## set params on new assemblies\n", "Can_min10.set_params(\"min_samples_locus\", 10)\n", 
"Can_min20.set_params(\"min_samples_locus\", 20)\n", "Can_min30nout.set_params(\"min_samples_locus\", 30)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Assembly: Canarium-min4\n", "[####################] 100% filtering loci | 0:00:50 | s7 | \n", "[####################] 100% building loci/stats | 0:00:29 | s7 | \n", "[####################] 100% building alleles | 0:00:36 | s7 | \n", "[####################] 100% building vcf file | 0:01:00 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:40 | s7 | \n", "[####################] 100% writing outfiles | 0:00:50 | s7 | \n", "Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium-min4_outfiles\n", "\n", "Assembly: Canarium-min10\n", "[####################] 100% filtering loci | 0:00:41 | s7 | \n", "[####################] 100% building loci/stats | 0:00:29 | s7 | \n", "[####################] 100% building alleles | 0:00:35 | s7 | \n", "[####################] 100% building vcf file | 0:00:49 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:39 | s7 | \n", "[####################] 100% writing outfiles | 0:00:27 | s7 | \n", "Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium-min10_outfiles\n", "\n", "Assembly: Canarium-min20\n", "[####################] 100% filtering loci | 0:00:41 | s7 | \n", "[####################] 100% building loci/stats | 0:00:29 | s7 | \n", "[####################] 100% building alleles | 0:00:35 | s7 | \n", "[####################] 100% building vcf file | 0:00:42 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:38 | s7 | \n", "[####################] 100% writing outfiles | 0:00:15 | s7 | \n", "Outfiles written to: 
~/Documents/Canarium/analysis-ipyrad/Canarium-min20_outfiles\n", "\n", "Assembly: Canarium-min30-nout\n", "[####################] 100% filtering loci | 0:00:37 | s7 | \n", "[####################] 100% building loci/stats | 0:00:28 | s7 | \n", "[####################] 100% building alleles | 0:00:33 | s7 | \n", "[####################] 100% building vcf file | 0:00:35 | s7 | \n", "[####################] 100% writing vcf file | 0:00:00 | s7 | \n", "[####################] 100% building arrays | 0:00:35 | s7 | \n", "[####################] 100% writing outfiles | 0:00:06 | s7 | \n", "Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium-min30-nout_outfiles\n", "\n" ] } ], "source": [ "## final assemblies\n", "Can_min4.run(\"7\", force=True)\n", "Can_min10.run(\"7\", force=True)\n", "Can_min20.run(\"7\", force=True)\n", "Can_min30nout.run(\"7\", force=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assembly stats\n", "See the GitHub page for stats of each assembly." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loading Assembly: Canarium-min4\n", "from saved path: ~/Documents/Canarium/analysis-ipyrad/Canarium-min4.json\n", "loading Assembly: Canarium-min10\n", "from saved path: ~/Documents/Canarium/analysis-ipyrad/Canarium-min10.json\n", "loading Assembly: Canarium-min20\n", "from saved path: ~/Documents/Canarium/analysis-ipyrad/Canarium-min20.json\n", "loading Assembly: Canarium-min30-nout\n", "from saved path: ~/Documents/Canarium/analysis-ipyrad/Canarium-min30-nout.json\n" ] } ], "source": [ "## reload assemblies from their JSON files\n", "Can_min4 = ip.load_json(\"analysis-ipyrad/Canarium-min4.json\")\n", "Can_min10 = ip.load_json(\"analysis-ipyrad/Canarium-min10.json\")\n", "Can_min20 = ip.load_json(\"analysis-ipyrad/Canarium-min20.json\")\n", "Can_min30nout = ip.load_json(\"analysis-ipyrad/Canarium-min30-nout.json\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }