{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Canarium GBS Assembly\n",
"### *Federman et al.*\n",
"\n",
"This notebook provides all code necessary to reproduce the assembled GBS data sets used in Federman et al. (xxxx). Starting from demultiplexed fastq data files, we assemble the data into four complete data sets that were used in downstream analyses. All code in this notebook is written in Python and uses the *ipyrad* package for assembly."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Required software"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"## conda install ipyrad -c ipyrad\n",
"## conda install sra-tools -c bioconda\n",
"## conda install entrez-direct -c bioconda"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Imports"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"import ipyrad as ip\n",
"import ipyrad.analysis as ipa"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ipyrad v.0.7.20\n"
]
}
],
"source": [
"print(\"ipyrad v.{}\".format(ip.__version__))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Connect to cluster"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"host compute node: [40 cores] on sacra\n"
]
}
],
"source": [
"import ipyparallel as ipp\n",
"ipyclient = ipp.Client()\n",
"ip.cluster_info(ipyclient)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Links to Sections\n",
"\n",
"+ [Download Sequence Data](#Download-sequence-data-from-SRA)\n",
"+ [ipyrad Assembly](#ipyrad-Assembly)\n",
"+ [Assembly stats](#Assembly-stats)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download sequence data from SRA"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\r",
"Fetching project data..."
]
},
{
"data": {
"text/plain": [
" Run spots spots_with_mates ScientificName \\\n",
"0 SRR5534921 12818693 0 Canarium lamianum \n",
"1 SRR5534922 5675773 0 Canarium longistipulatum \n",
"2 SRR5534923 34404746 0 Canarium ovatum \n",
"3 SRR5534924 3382649 0 Canarium pilicarpum \n",
"4 SRR5534925 16632442 0 Canarium obtusifolium \n",
"5 SRR5534926 10769159 0 Canarium odontophyllum \n",
"6 SRR5534927 47625 0 Canarium ntidifolium \n",
"7 SRR5534928 17034881 0 Canarium obtusifolium \n",
"8 SRR5534929 211932 0 Canarium ferrugineum \n",
"9 SRR5534930 402493 0 Canarium galokense \n",
"10 SRR5534931 5796976 0 Canarium galokense \n",
"11 SRR5534932 3219083 0 Canarium globosum \n",
"12 SRR5534933 1902860 0 Canarium globosum \n",
"13 SRR5534934 9709559 0 Canarium indicum \n",
"14 SRR5534935 2192159 0 Canarium lamianum \n",
"15 SRR5534936 450626 0 Canarium lamianum \n",
"16 SRR5534937 12874061 0 Canarium pulchrebracteatum \n",
"17 SRR5534938 2499902 0 Canarium compressum \n",
"18 SRR5534939 7260851 0 Canarium ferrugineum \n",
"19 SRR5534940 15736904 0 Canarium pulchrebracteatum \n",
"20 SRR5534941 12539878 0 Canarium pilicarpum \n",
"21 SRR5534942 23355083 0 Canarium compressum \n",
"22 SRR5534943 2097537 0 Canarium egregium \n",
"23 SRR5534944 1033763 0 Canarium elegans \n",
"24 SRR5534945 2725698 0 Canarium bengalense \n",
"25 SRR5534946 12132085 0 Canarium betamponae \n",
"26 SRR5534947 18446248 0 Canarium betamponae \n",
"27 SRR5534948 400266 0 Canarium boivinii \n",
"28 SRR5534949 11647948 0 Canarium velutinifolium \n",
"29 SRR5534950 12852942 0 Canarium velutinifolium \n",
"30 SRR5534951 16100807 0 Canarium scholasticum \n",
"31 SRR5534952 186682 0 Canarium scholasticum \n",
"32 SRR5534953 1152381 0 Canarium planifolium \n",
"33 SRR5534954 3803237 0 Canarium multiflorum \n",
"34 SRR5534955 4734247 0 Canarium multiflorum \n",
"35 SRR5534956 10744745 0 Canarium multinerve \n",
"36 SRR5534957 2757099 0 Canarium multiflorum \n",
"37 SRR5534958 6645549 0 Canarium multiflorum \n",
"38 SRR5534959 225668 0 Canarium madagascariense \n",
"39 SRR5534960 7962974 0 Canarium multiflorum \n",
"40 SRR5534961 19015238 0 Canarium multiflorum \n",
"41 SRR5534962 668679 0 Canarium scholasticum \n",
"42 SRR5534963 294419 0 Canarium multinerve \n",
"43 SRR5534964 11466007 0 Canarium multinerve \n",
"44 SRR5534965 9829153 0 Canarium planifolium \n",
"45 SRR5534966 1788748 0 Canarium pulchrebracteatum \n",
"46 SRR5534967 5687824 0 Canarium velutinifolium \n",
"47 SRR5534968 1694555 0 Canarium ferrugineum \n",
"\n",
" SampleName \n",
"0 SF327 \n",
"1 D12950 \n",
"2 D14269 \n",
"3 5573 \n",
"4 SF228 \n",
"5 SFC1988 \n",
"6 4304 \n",
"7 SF224 \n",
"8 SF343 \n",
"9 D13101 \n",
"10 SF155 \n",
"11 SF200 \n",
"12 SF209 \n",
"13 D13374 \n",
"14 D13063 \n",
"15 SF160 \n",
"16 SF286 \n",
"17 D13090 \n",
"18 SF172 \n",
"19 SF276 \n",
"20 D13052 \n",
"21 D13097 \n",
"22 D13103 \n",
"23 D12963 \n",
"24 D13852 \n",
"25 SF175 \n",
"26 SF328 \n",
"27 D12962 \n",
"28 D14505 \n",
"29 D14504 \n",
"30 SF197 \n",
"31 SF301 \n",
"32 SF153 \n",
"33 D14501 \n",
"34 D14485 \n",
"35 D14482 \n",
"36 D14513 \n",
"37 D14477 \n",
"38 D13091 \n",
"39 D14480 \n",
"40 D14478 \n",
"41 D13075 \n",
"42 D14492 \n",
"43 D14483 \n",
"44 SF164 \n",
"45 D14528 \n",
"46 D14506 \n",
"47 D13053 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## get accession data from sra\n",
"sra = ipa.sratools(accession=\"SRP106882\", workdir=\"./fastq-files\")\n",
"\n",
"## print run info for posterity\n",
"run_info = sra.fetch_runinfo((1,4,6,29,30))\n",
"run_info"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[####################] 100% Downloading fastq files | 0:24:21 | \n",
"48 fastq files downloaded to /home/deren/Documents/Canarium/fastq-files\n"
]
}
],
"source": [
"## run parallel download\n",
"sra.run(ipyclient=ipyclient)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ipyrad Assembly\n",
"\n",
"Create the Assembly object and set its parameter values. The full parameter list is printed below for reference."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"New Assembly: Canarium\n",
"0 assembly_name Canarium \n",
"1 project_dir ./analysis-ipyrad \n",
"2 raw_fastq_path \n",
"3 barcodes_path \n",
"4 sorted_fastq_path ./fastq-files/*.gz \n",
"5 assembly_method denovo \n",
"6 reference_sequence \n",
"7 datatype gbs \n",
"8 restriction_overhang ('CWGC', 'CWGC') \n",
"9 max_low_qual_bases 5 \n",
"10 phred_Qscore_offset 33 \n",
"11 mindepth_statistical 6 \n",
"12 mindepth_majrule 6 \n",
"13 maxdepth 10000 \n",
"14 clust_threshold 0.9 \n",
"15 max_barcode_mismatch 0 \n",
"16 filter_adapters 2 \n",
"17 filter_min_trim_len 35 \n",
"18 max_alleles_consens 2 \n",
"19 max_Ns_consens (5, 5) \n",
"20 max_Hs_consens (8, 8) \n",
"21 min_samples_locus 4 \n",
"22 max_SNPs_locus (10, 10) \n",
"23 max_Indels_locus (8, 8) \n",
"24 max_shared_Hs_locus 4 \n",
"25 trim_reads (0, 0, 0, 0) \n",
"26 trim_loci (0, 5) \n",
"27 output_formats ('l', 'p', 's', 'v', 'k', 'a') \n",
"28 pop_assign_file \n"
]
}
],
"source": [
"## create an Assembly\n",
"data = ip.Assembly(\"Canarium\")\n",
"\n",
"## set params\n",
"data.set_params(\"project_dir\", \"analysis-ipyrad\")\n",
"data.set_params(\"sorted_fastq_path\", \"./fastq-files/*.gz\")\n",
"data.set_params(\"restriction_overhang\", (\"CWGC\", \"CWGC\"))\n",
"data.set_params(\"datatype\", \"gbs\")\n",
"data.set_params(\"clust_threshold\", 0.90)\n",
"data.set_params(\"filter_adapters\", 2)\n",
"data.set_params(\"max_SNPs_locus\", (10, 10))\n",
"data.set_params(\"max_shared_Hs_locus\", 4)\n",
"data.set_params(\"trim_reads\", (0, 0))\n",
"data.set_params(\"trim_loci\", (0, 5))\n",
"data.set_params(\"output_formats\", list(\"lpsvka\"))\n",
"\n",
"## print params for posterity\n",
"data.get_params()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Assemble reads within each Sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data.run(\"12\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assembly: Canarium\n",
"[####################] 100% dereplicating | 0:01:55 | s3 | | \n",
"[####################] 100% clustering | 15:30:31 | s3 | \n",
"[####################] 100% building clusters | 0:04:02 | s3 | \n",
"[####################] 100% chunking | 0:00:42 | s3 | \n",
"[####################] 100% aligning | 1:00:07 | s3 | \n",
"[####################] 100% concatenating | 0:03:13 | s3 | \n"
]
}
],
"source": [
"data.run(\"3\", force=True)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assembly: Canarium\n",
"[####################] 100% inferring [H, E] | 0:12:06 | s4 | \n",
"[####################] 100% calculating depths | 0:00:54 | s5 | \n",
"[####################] 100% chunking clusters | 0:01:22 | s5 | \n",
"[####################] 100% consens calling | 0:21:34 | s5 | \n"
]
}
],
"source": [
"data.run(\"45\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assembly: Canarium\n",
"[####################] 100% concat/shuffle input | 0:01:06 | s6 | \n",
"[####################] 100% clustering across | 5:43:06 | s6 | \n",
"[####################] 100% building clusters | 0:00:53 | s6 | \n",
"[####################] 100% aligning clusters | 0:04:13 | s6 | \n",
"[####################] 100% database indels | 0:02:13 | s6 | \n",
"[####################] 100% indexing clusters | 0:02:19 | s6 | \n",
"[####################] 100% building database | 0:24:48 | s6 | \n"
]
}
],
"source": [
"data.run(\"6\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assembly: Canarium\n",
"[####################] 100% filtering loci | 0:00:39 | s7 | \n",
"[####################] 100% building loci/stats | 0:00:29 | s7 | \n",
"[####################] 100% building alleles | 0:00:36 | s7 | \n",
"[####################] 100% building vcf file | 0:01:06 | s7 | \n",
"[####################] 100% writing vcf file | 0:00:00 | s7 | \n",
"[####################] 100% building arrays | 0:00:31 | s7 | \n",
"[####################] 100% writing outfiles | 0:01:00 | s7 | \n",
"Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium_outfiles\n",
"\n"
]
}
],
"source": [
"data.run(\"7\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Full assembly stats"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"loading Assembly: Canarium\n",
"from saved path: ~/Documents/Canarium/analysis-ipyrad/Canarium.json\n"
]
}
],
"source": [
"## re-load assembly in case coming back to this notebook later\n",
"data = ip.load_json(\"analysis-ipyrad/Canarium.json\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"total N reads: 365012834\n",
"mean N reads/sample: 7604434.04167\n",
"S.D. N reads/sample: 7407655.17926\n"
]
}
],
"source": [
"## print some stats\n",
"print \"total N reads:\", data.stats.reads_raw.sum()\n",
"print \"mean N reads/sample:\", data.stats.reads_raw.mean()\n",
"print \"S.D. N reads/sample:\", data.stats.reads_raw.std()\n"
]
},
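{
"cell_type": "markdown",
"metadata": {},
"source": [
"A note on the summary statistics printed above: as far as we are aware, `data.stats` is a pandas DataFrame, so `.std()` returns the sample standard deviation (`ddof=1`) by default. The block below is a minimal sketch, not part of the original analysis, that applies the same three calls to a small Series. It assumes only that pandas is installed; the variable name `reads` is hypothetical, and the three values are the spot counts of the first three runs in the run-info table above.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# spot counts of the first three runs listed in the run-info table (illustration only)\n",
"reads = pd.Series([12818693, 5675773, 34404746], index=['SF327', 'D12950', 'D14269'])\n",
"\n",
"# the same three summaries as in the cell above\n",
"print(reads.sum())\n",
"print(reads.mean())\n",
"print(reads.std())        # sample standard deviation (ddof=1, the pandas default)\n",
"\n",
"# population standard deviation, for comparison\n",
"print(reads.std(ddof=0))\n",
"```\n",
"\n",
"Pass `ddof=0` if the population standard deviation is wanted instead."
]
},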
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Excluding low-data samples"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sample_coverage | \n",
"
\n",
" \n",
" \n",
" \n",
" 4304 | \n",
" 7 | \n",
"
\n",
" \n",
" 5573 | \n",
" 37672 | \n",
"
\n",
" \n",
" D12950 | \n",
" 44861 | \n",
"
\n",
" \n",
" D12962 | \n",
" 1981 | \n",
"
\n",
" \n",
" D12963 | \n",
" 10141 | \n",
"
\n",
" \n",
" D13052 | \n",
" 75316 | \n",
"
\n",
" \n",
" D13053 | \n",
" 21812 | \n",
"
\n",
" \n",
" D13063 | \n",
" 28227 | \n",
"
\n",
" \n",
" D13075 | \n",
" 4096 | \n",
"
\n",
" \n",
" D13090 | \n",
" 27541 | \n",
"
\n",
" \n",
" D13091 | \n",
" 1059 | \n",
"
\n",
" \n",
" D13097 | \n",
" 81399 | \n",
"
\n",
" \n",
" D13101 | \n",
" 2860 | \n",
"
\n",
" \n",
" D13103 | \n",
" 26359 | \n",
"
\n",
" \n",
" D13374 | \n",
" 56701 | \n",
"
\n",
" \n",
" D13852 | \n",
" 13687 | \n",
"
\n",
" \n",
" D14269 | \n",
" 79114 | \n",
"
\n",
" \n",
" D14477 | \n",
" 61515 | \n",
"
\n",
" \n",
" D14478 | \n",
" 97836 | \n",
"
\n",
" \n",
" D14480 | \n",
" 70588 | \n",
"
\n",
" \n",
" D14482 | \n",
" 69750 | \n",
"
\n",
" \n",
" D14483 | \n",
" 71352 | \n",
"
\n",
" \n",
" D14485 | \n",
" 51841 | \n",
"
\n",
" \n",
" D14492 | \n",
" 1272 | \n",
"
\n",
" \n",
" D14501 | \n",
" 49981 | \n",
"
\n",
" \n",
" D14504 | \n",
" 77524 | \n",
"
\n",
" \n",
" D14505 | \n",
" 77445 | \n",
"
\n",
" \n",
" D14506 | \n",
" 60470 | \n",
"
\n",
" \n",
" D14513 | \n",
" 39505 | \n",
"
\n",
" \n",
" D14528 | \n",
" 26587 | \n",
"
\n",
" \n",
" SF153 | \n",
" 9269 | \n",
"
\n",
" \n",
" SF155 | \n",
" 53298 | \n",
"
\n",
" \n",
" SF160 | \n",
" 2856 | \n",
"
\n",
" \n",
" SF164 | \n",
" 69764 | \n",
"
\n",
" \n",
" SF172 | \n",
" 64881 | \n",
"
\n",
" \n",
" SF175 | \n",
" 69099 | \n",
"
\n",
" \n",
" SF197 | \n",
" 82914 | \n",
"
\n",
" \n",
" SF200 | \n",
" 42029 | \n",
"
\n",
" \n",
" SF209 | \n",
" 28325 | \n",
"
\n",
" \n",
" SF224 | \n",
" 81039 | \n",
"
\n",
" \n",
" SF228 | \n",
" 82338 | \n",
"
\n",
" \n",
" SF276 | \n",
" 82961 | \n",
"
\n",
" \n",
" SF286 | \n",
" 78681 | \n",
"
\n",
" \n",
" SF301 | \n",
" 785 | \n",
"
\n",
" \n",
" SF327 | \n",
" 72815 | \n",
"
\n",
" \n",
" SF328 | \n",
" 80219 | \n",
"
\n",
" \n",
" SF343 | \n",
" 894 | \n",
"
\n",
" \n",
" SFC1988 | \n",
" 51364 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sample_coverage\n",
"4304 7\n",
"5573 37672\n",
"D12950 44861\n",
"D12962 1981\n",
"D12963 10141\n",
"D13052 75316\n",
"D13053 21812\n",
"D13063 28227\n",
"D13075 4096\n",
"D13090 27541\n",
"D13091 1059\n",
"D13097 81399\n",
"D13101 2860\n",
"D13103 26359\n",
"D13374 56701\n",
"D13852 13687\n",
"D14269 79114\n",
"D14477 61515\n",
"D14478 97836\n",
"D14480 70588\n",
"D14482 69750\n",
"D14483 71352\n",
"D14485 51841\n",
"D14492 1272\n",
"D14501 49981\n",
"D14504 77524\n",
"D14505 77445\n",
"D14506 60470\n",
"D14513 39505\n",
"D14528 26587\n",
"SF153 9269\n",
"SF155 53298\n",
"SF160 2856\n",
"SF164 69764\n",
"SF172 64881\n",
"SF175 69099\n",
"SF197 82914\n",
"SF200 42029\n",
"SF209 28325\n",
"SF224 81039\n",
"SF228 82338\n",
"SF276 82961\n",
"SF286 78681\n",
"SF301 785\n",
"SF327 72815\n",
"SF328 80219\n",
"SF343 894\n",
"SFC1988 51364"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## number of loci recovered per sample in the final (step 7) data set\n",
"data.stats_dfs.s7_samples"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"## make subset lists: subs keeps samples with more than 12000 consens reads; subsnout also drops four named samples\n",
"subs = [i.name for i in data.samples.values() if i.stats.reads_consens > 12000]\n",
"subsnout = list(set(subs) - set([\"D14269\", \"D13374\", \"SFC1988\", \"D13852\"]))"
]
},
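{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two lists above amount to a threshold filter followed by a set difference. The block below is a minimal, illustrative sketch of that logic on a plain dictionary. It does not use ipyrad; the names `consens_counts`, `min_consens`, and `drop`, the sample names, and the counts are all hypothetical placeholders (in the cell above the counts come from `i.stats.reads_consens`).\n",
"\n",
"```python\n",
"# hypothetical consens-read counts keyed by sample name (illustration only)\n",
"consens_counts = {'sampleA': 50000, 'sampleB': 900, 'sampleC': 30000}\n",
"min_consens = 12000  # same cutoff as in the cell above\n",
"\n",
"# keep samples whose count is strictly greater than the cutoff\n",
"subs = sorted(name for name, n in consens_counts.items() if n > min_consens)\n",
"\n",
"# drop a further set of names with a set difference\n",
"drop = {'sampleC'}\n",
"subsnout = sorted(set(subs) - drop)\n",
"\n",
"print(subs)\n",
"print(subsnout)\n",
"```\n",
"\n",
"Sorting the results keeps the sample order reproducible, since a Python `set` has no guaranteed order."
]
},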
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Final assemblies"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"## make new branches\n",
"Can_min4 = data.branch(\"Canarium-min4\", subsamples=subs)\n",
"Can_min10 = data.branch(\"Canarium-min10\", subsamples=subs)\n",
"Can_min20 = data.branch(\"Canarium-min20\", subsamples=subs)\n",
"Can_min30nout = data.branch(\"Canarium-min30-nout\", subsamples=subsnout)\n",
"\n",
"## set params on new assemblies (Canarium-min4 keeps the min_samples_locus value of 4 from the parent)\n",
"Can_min10.set_params(\"min_samples_locus\", 10)\n",
"Can_min20.set_params(\"min_samples_locus\", 20)\n",
"Can_min30nout.set_params(\"min_samples_locus\", 30)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assembly: Canarium-min4\n",
"[####################] 100% filtering loci | 0:00:50 | s7 | \n",
"[####################] 100% building loci/stats | 0:00:29 | s7 | \n",
"[####################] 100% building alleles | 0:00:36 | s7 | \n",
"[####################] 100% building vcf file | 0:01:00 | s7 | \n",
"[####################] 100% writing vcf file | 0:00:00 | s7 | \n",
"[####################] 100% building arrays | 0:00:40 | s7 | \n",
"[####################] 100% writing outfiles | 0:00:50 | s7 | \n",
"Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium-min4_outfiles\n",
"\n",
"Assembly: Canarium-min10\n",
"[####################] 100% filtering loci | 0:00:41 | s7 | \n",
"[####################] 100% building loci/stats | 0:00:29 | s7 | \n",
"[####################] 100% building alleles | 0:00:35 | s7 | \n",
"[####################] 100% building vcf file | 0:00:49 | s7 | \n",
"[####################] 100% writing vcf file | 0:00:00 | s7 | \n",
"[####################] 100% building arrays | 0:00:39 | s7 | \n",
"[####################] 100% writing outfiles | 0:00:27 | s7 | \n",
"Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium-min10_outfiles\n",
"\n",
"Assembly: Canarium-min20\n",
"[####################] 100% filtering loci | 0:00:41 | s7 | \n",
"[####################] 100% building loci/stats | 0:00:29 | s7 | \n",
"[####################] 100% building alleles | 0:00:35 | s7 | \n",
"[####################] 100% building vcf file | 0:00:42 | s7 | \n",
"[####################] 100% writing vcf file | 0:00:00 | s7 | \n",
"[####################] 100% building arrays | 0:00:38 | s7 | \n",
"[####################] 100% writing outfiles | 0:00:15 | s7 | \n",
"Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium-min20_outfiles\n",
"\n",
"Assembly: Canarium-min30-nout\n",
"[####################] 100% filtering loci | 0:00:37 | s7 | \n",
"[####################] 100% building loci/stats | 0:00:28 | s7 | \n",
"[####################] 100% building alleles | 0:00:33 | s7 | \n",
"[####################] 100% building vcf file | 0:00:35 | s7 | \n",
"[####################] 100% writing vcf file | 0:00:00 | s7 | \n",
"[####################] 100% building arrays | 0:00:35 | s7 | \n",
"[####################] 100% writing outfiles | 0:00:06 | s7 | \n",
"Outfiles written to: ~/Documents/Canarium/analysis-ipyrad/Canarium-min30-nout_outfiles\n",
"\n"
]
}
],
"source": [
"## final assemblies\n",
"Can_min4.run(\"7\", force=True)\n",
"Can_min10.run(\"7\", force=True)\n",
"Can_min20.run(\"7\", force=True)\n",
"Can_min30nout.run(\"7\", force=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Assembly stats\n",
"See the GitHub page for the stats of each assembly."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"loading Assembly: Canarium-min4\n",
"from saved path: ~/Documents/Canarium/analysis-ipyrad/Canarium-min4.json\n",
"loading Assembly: Canarium-min10\n",
"from saved path: ~/Documents/Canarium/analysis-ipyrad/Canarium-min10.json\n",
"loading Assembly: Canarium-min20\n",
"from saved path: ~/Documents/Canarium/analysis-ipyrad/Canarium-min20.json\n",
"loading Assembly: Canarium-min30-nout\n",
"from saved path: ~/Documents/Canarium/analysis-ipyrad/Canarium-min30-nout.json\n"
]
}
],
"source": [
"## reload assemblies from their JSON files\n",
"Can_min4 = ip.load_json(\"analysis-ipyrad/Canarium-min4.json\")\n",
"Can_min10 = ip.load_json(\"analysis-ipyrad/Canarium-min10.json\")\n",
"Can_min20 = ip.load_json(\"analysis-ipyrad/Canarium-min20.json\")\n",
"Can_min30nout = ip.load_json(\"analysis-ipyrad/Canarium-min30-nout.json\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 1
}