{
"metadata": {
"name": "",
"signature": "sha256:e68f8510cc8a35217d86e9e8d72c074e2e737ec220f51784e6dff6cc9461dc53"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In an additional test of PyNAST alignment of V2 sequences with the core set versus the PyNAST 85% OTU alignment added in [biocore/qiime-default-reference#15](https://github.com/biocore/qiime-default-reference/pull/15), I find that only slightly more sequences fail to align with the PyNAST 85% OTU alignment than with the old core set. Details of my test follow. Since we don't have an inherent way to say which is better (i.e., it's possible that we want those additional sequences to fail), and we have complete understanding of how the PyNAST 85% OTU alignment was generated (which is not true for the core set), we'll move forward with the plan to use the PyNAST 85% OTU alignment as the default reference template alignment. \n",
"\n",
"# V2 sequence collection\n",
"\n",
"My first test was performed with the V2 sequences deposited under study id ``449`` in the QIIME database (the [Costello Whole Body](http://www.sciencemag.org/content/326/5960/1694.abstract)) dataset. I ran the following command to align representative sequences against the PyNAST 85% OTU alignment:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## Uncomment to remove output from previous run.\n",
"!rm -r output\n",
"import pandas as pd\n",
"from os.path import join, abspath\n",
"from qiime_default_reference import get_reference_sequences\n",
"from bfillings.formatdb import build_blast_db_from_fasta_path \n",
"\n",
"gg_97_reference = get_reference_sequences()\n",
"db_name, db_files = build_blast_db_from_fasta_path(gg_97_reference, output_dir='./')\n",
"\n",
"!wget http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/core_set_aligned.fasta.imputed\n",
"core_set_template_fp = abspath(\"core_set_aligned.fasta.imputed\")\n",
"rep_set_449_fp = abspath(\"data/rep_set.449.fna\")\n",
"rep_set_449_out_dir = abspath(\"output/449_out\")\n",
"rep_set_550_fp = abspath(\"data/rep_set.550.1.fna\")\n",
"rep_set_550_out_dir = abspath(\"output/550_out\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"rm: output: No such file or directory\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"--2015-04-16 09:15:50-- http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/core_set_aligned.fasta.imputed\r\n",
"Resolving greengenes.lbl.gov... "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"128.32.248.7\r\n",
"Connecting to greengenes.lbl.gov|128.32.248.7|:80... "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"connected.\r\n",
"HTTP request sent, awaiting response... "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"200 OK\r\n",
"Length: 37975992 (36M) [text/plain]\r\n",
"Saving to: 'core_set_aligned.fasta.imputed'\r\n",
"\r\n",
"\r",
" 0% [ ] 0 --.-K/s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
" 0% [ ] 39,393 184KB/s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
" 0% [ ] 303,104 651KB/s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
" 2% [ ] 924,192 1.32MB/s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
" 4% [> ] 1,769,616 1.92MB/s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
" 7% [=> ] 2,798,352 2.45MB/s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"10% [===> ] 4,151,304 3.04MB/s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"15% [====> ] 5,754,600 3.63MB/s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"20% [======> ] 7,690,320 4.25MB/s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"21% [=======> ] 8,132,184 4.01MB/s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"24% [========> ] 9,289,512 4.12MB/s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"27% [=========> ] 10,255,320 4.14MB/s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"29% [==========> ] 11,225,216 4.16MB/s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"32% [===========> ] 12,184,200 4.17MB/s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"34% [============> ] 13,154,096 4.18MB/s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"37% [=============> ] 14,111,712 4.20MB/s eta 5s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"39% [==============> ] 15,078,888 4.20MB/s eta 5s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"42% [===============> ] 16,041,960 4.41MB/s eta 5s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"44% [================> ] 16,921,568 4.56MB/s eta 5s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"46% [=================> ] 17,712,288 4.72MB/s eta 5s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"48% [=================> ] 18,345,672 4.66MB/s eta 5s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"50% [==================> ] 18,988,632 4.57MB/s eta 5s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"51% [===================> ] 19,630,224 4.42MB/s eta 5s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"53% [===================> ] 20,278,640 4.08MB/s eta 5s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"55% [====================> ] 20,917,512 3.79MB/s eta 5s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"56% [=====================> ] 21,560,472 3.75MB/s eta 4s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"58% [=====================> ] 22,206,168 3.72MB/s eta 4s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"60% [======================> ] 22,853,216 3.54MB/s eta 4s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"61% [=======================> ] 23,492,088 3.46MB/s eta 4s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"63% [=======================> ] 24,141,888 3.38MB/s eta 4s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"65% [========================> ] 24,778,008 3.31MB/s eta 3s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"66% [=========================> ] 25,427,792 3.18MB/s eta 3s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"68% [=========================> ] 26,058,456 3.10MB/s eta 3s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"70% [==========================> ] 26,715,096 3.03MB/s eta 3s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"72% [===========================> ] 27,353,952 2.96MB/s eta 3s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"73% [===========================> ] 27,988,704 2.88MB/s eta 3s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"75% [============================> ] 28,518,104 2.85MB/s eta 3s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"75% [============================> ] 28,832,760 2.76MB/s eta 3s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"76% [============================> ] 29,163,800 2.68MB/s eta 3s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"77% [=============================> ] 29,479,824 2.58MB/s eta 3s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"78% [=============================> ] 29,799,936 2.50MB/s eta 2s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"79% [=============================> ] 30,122,784 2.42MB/s eta 2s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"80% [==============================> ] 30,444,264 2.33MB/s eta 2s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"81% [==============================> ] 30,768,480 2.22MB/s eta 2s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"81% [==============================> ] 31,087,224 2.15MB/s eta 2s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"82% [===============================> ] 31,414,160 2.07MB/s eta 2s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"83% [===============================> ] 31,735,640 1.99MB/s eta 2s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"84% [===============================> ] 32,057,120 1.87MB/s eta 2s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"85% [================================> ] 32,371,776 1.79MB/s eta 2s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"86% [================================> ] 32,700,080 1.72MB/s eta 2s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"86% [================================> ] 33,014,736 1.64MB/s eta 2s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"87% [=================================> ] 33,343,040 1.51MB/s eta 2s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"88% [=================================> ] 33,659,064 1.46MB/s eta 2s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"89% [=================================> ] 33,979,176 1.45MB/s eta 2s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"90% [==================================> ] 34,300,656 1.45MB/s eta 2s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"91% [==================================> ] 34,622,136 1.45MB/s eta 1s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"92% [==================================> ] 34,943,616 1.45MB/s eta 1s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"92% [===================================> ] 35,265,096 1.45MB/s eta 1s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"93% [===================================> ] 35,587,944 1.45MB/s eta 1s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"94% [===================================> ] 35,908,056 1.45MB/s eta 1s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"95% [====================================> ] 36,230,904 1.45MB/s eta 1s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"96% [====================================> ] 36,551,016 1.45MB/s eta 1s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"97% [====================================> ] 36,873,864 1.45MB/s eta 1s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"97% [=====================================> ] 37,200,800 1.45MB/s eta 1s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"98% [=====================================> ] 37,516,824 1.45MB/s eta 1s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"99% [=====================================> ] 37,838,304 1.45MB/s eta 0s "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r",
"100%[======================================>] 37,975,992 1.47MB/s in 14s \r\n",
"\r\n",
"2015-04-16 09:16:07 (2.62 MB/s) - 'core_set_aligned.fasta.imputed' saved [37975992/37975992]\r\n",
"\r\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pyn_out_qdr = join(rep_set_449_out_dir, \"qdr012\")\n",
"!parallel_align_seqs_pynast.py -i $rep_set_449_fp -o $pyn_out_qdr --jobs_to_start 8\n",
"fasta_fps = join(pyn_out_qdr, \"*fasta\")\n",
"!count_seqs.py -i \"$fasta_fps\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r\n",
"204 : /home/ubuntu/qdr-issue14/output/449_out/qdr012/rep_set.449_failures.fasta (Sequence lengths (mean +/- std): 181.9706 +/- 56.2609)\r\n",
"10200 : /home/ubuntu/qdr-issue14/output/449_out/qdr012/rep_set.449_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)\r\n",
"10404 : Total\r\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This tells us that 204 sequences failed to align with PyNAST out of 10404. \n",
"\n",
"Next, we perform the same alignment against the core set. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pyn_out_core = join(rep_set_449_out_dir, \"core-set\")\n",
"!parallel_align_seqs_pynast.py -i $rep_set_449_fp -o $pyn_out_core --jobs_to_start 8 -t $core_set_template_fp\n",
"fasta_fps = join(pyn_out_core, \"*fasta\")\n",
"!count_seqs.py -i \"$fasta_fps\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r\n",
"171 : /home/ubuntu/qdr-issue14/output/449_out/core-set/rep_set.449_failures.fasta (Sequence lengths (mean +/- std): 166.0292 +/- 46.7550)\r\n",
"10233 : /home/ubuntu/qdr-issue14/output/449_out/core-set/rep_set.449_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)\r\n",
"10404 : Total\r\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This tells us that 171 sequences failed to align with PyNAST out of 10404, so slightly less when aligning against the core set.\n",
"\n",
"However, it's important to note that different sequences failed to align with the two data sets (i.e., the sequences that fail to align against the core set are not a subset of the sequences that fail to align against the 85% PyNAST-aligned OTUs). We therefore don't immediately know which is better."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# find the sequences that didn't align against the 85% PyNAST-aligned OTUs\n",
"# but did align against the core set\n",
"!filter_fasta.py -f $pyn_out_qdr/rep_set.449_failures.fasta -o $pyn_out_qdr/failures-that-aligned-v-core-set.fasta -a $pyn_out_core/rep_set.449_aligned.fasta\n",
"!count_seqs.py -i $pyn_out_qdr/failures-that-aligned-v-core-set.fasta"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r\n",
"47 : /home/ubuntu/qdr-issue14/output/449_out/qdr012/failures-that-aligned-v-core-set.fasta (Sequence lengths (mean +/- std): 256.3404 +/- 16.3086)\r\n",
"47 : Total\r\n"
]
}
],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# find the sequences that didn't align against the core set\n",
"# but did align against the 85% PyNAST-aligned OTUs\n",
"!filter_fasta.py -f $pyn_out_core/rep_set.449_failures.fasta -o $pyn_out_core/failures-that-aligned-v-pyn-85-otus.fasta -a $pyn_out_qdr/rep_set.449_aligned.fasta\n",
"!count_seqs.py -i $pyn_out_core/failures-that-aligned-v-pyn-85-otus.fasta\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r\n",
"14 : /home/ubuntu/qdr-issue14/output/449_out/core-set/failures-that-aligned-v-pyn-85-otus.fasta (Sequence lengths (mean +/- std): 236.9286 +/- 11.0677)\r\n",
"14 : Total\r\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So what do the BLAST alignments look like if we query these sequences against the Greengenes 13_8 97% OTUs?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"blast_out_qdr = join(pyn_out_qdr, \"failures-that-aligned-v-core-set-blast-results\")\n",
"\n",
"!parallel_blast.py -i $pyn_out_qdr/failures-that-aligned-v-core-set.fasta -b $db_name -o $blast_out_qdr -O 8"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"blast_headers = !head $blast_out_qdr/failures-that-aligned-v-core-set_blast_out.txt\n",
"blast_headers = blast_headers[3].strip('# Fields: ').split(', ')\n",
"blast_out = !grep -v '^#' $blast_out_qdr/failures-that-aligned-v-core-set_blast_out.txt\n",
"blast_out = [l.split('\\t') for l in blast_out]\n",
"blast_results_qdr = pd.DataFrame(blast_out, columns=blast_headers).convert_objects(convert_numeric=True)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"blast_results_qdr"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" Query id | \n",
" Subject id | \n",
" % identity | \n",
" alignment length | \n",
" mismatches | \n",
" gap openings | \n",
" q. start | \n",
" q. end | \n",
" s. start | \n",
" s. end | \n",
" e-value | \n",
" bit scor | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" New.CleanUp.ReferenceOTU10449 | \n",
" 853867 | \n",
" 100.00 | \n",
" 149 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 149 | \n",
" 317 | \n",
" 169 | \n",
" 0 | \n",
" 295 | \n",
"
\n",
" \n",
" 1 | \n",
" New.CleanUp.ReferenceOTU10796 | \n",
" 960131 | \n",
" 100.00 | \n",
" 131 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 131 | \n",
" 317 | \n",
" 187 | \n",
" 0 | \n",
" 260 | \n",
"
\n",
" \n",
" 2 | \n",
" New.CleanUp.ReferenceOTU11004 | \n",
" 960131 | \n",
" 99.32 | \n",
" 148 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 148 | \n",
" 317 | \n",
" 170 | \n",
" 0 | \n",
" 285 | \n",
"
\n",
" \n",
" 3 | \n",
" New.CleanUp.ReferenceOTU11548 | \n",
" 509773 | \n",
" 100.00 | \n",
" 132 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 132 | \n",
" 336 | \n",
" 205 | \n",
" 0 | \n",
" 262 | \n",
"
\n",
" \n",
" 4 | \n",
" New.CleanUp.ReferenceOTU12392 | \n",
" 853867 | \n",
" 100.00 | \n",
" 156 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 156 | \n",
" 317 | \n",
" 162 | \n",
" 0 | \n",
" 309 | \n",
"
\n",
" \n",
" 5 | \n",
" New.CleanUp.ReferenceOTU12511 | \n",
" 853867 | \n",
" 98.72 | \n",
" 156 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 155 | \n",
" 317 | \n",
" 162 | \n",
" 0 | \n",
" 285 | \n",
"
\n",
" \n",
" 6 | \n",
" New.CleanUp.ReferenceOTU12532 | \n",
" 1000986 | \n",
" 98.84 | \n",
" 173 | \n",
" 2 | \n",
" 0 | \n",
" 1 | \n",
" 173 | \n",
" 304 | \n",
" 132 | \n",
" 0 | \n",
" 327 | \n",
"
\n",
" \n",
" 7 | \n",
" New.CleanUp.ReferenceOTU13882 | \n",
" 129798 | \n",
" 97.04 | \n",
" 169 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 169 | \n",
" 368 | \n",
" 204 | \n",
" 0 | \n",
" 274 | \n",
"
\n",
" \n",
" 8 | \n",
" New.CleanUp.ReferenceOTU14202 | \n",
" 960131 | \n",
" 98.18 | \n",
" 165 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 165 | \n",
" 317 | \n",
" 155 | \n",
" 0 | \n",
" 297 | \n",
"
\n",
" \n",
" 9 | \n",
" New.CleanUp.ReferenceOTU15614 | \n",
" 4366889 | \n",
" 96.49 | \n",
" 171 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 171 | \n",
" 303 | \n",
" 138 | \n",
" 0 | \n",
" 272 | \n",
"
\n",
" \n",
" 10 | \n",
" New.CleanUp.ReferenceOTU16729 | \n",
" 4467774 | \n",
" 99.45 | \n",
" 181 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 181 | \n",
" 308 | \n",
" 128 | \n",
" 0 | \n",
" 351 | \n",
"
\n",
" \n",
" 11 | \n",
" New.CleanUp.ReferenceOTU17453 | \n",
" 870751 | \n",
" 100.00 | \n",
" 137 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 137 | \n",
" 317 | \n",
" 181 | \n",
" 0 | \n",
" 272 | \n",
"
\n",
" \n",
" 12 | \n",
" New.CleanUp.ReferenceOTU18041 | \n",
" 4422519 | \n",
" 99.04 | \n",
" 209 | \n",
" 0 | \n",
" 2 | \n",
" 1 | \n",
" 209 | \n",
" 349 | \n",
" 143 | \n",
" 0 | \n",
" 383 | \n",
"
\n",
" \n",
" 13 | \n",
" New.CleanUp.ReferenceOTU18339 | \n",
" 4467774 | \n",
" 99.43 | \n",
" 176 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 176 | \n",
" 308 | \n",
" 134 | \n",
" 0 | \n",
" 333 | \n",
"
\n",
" \n",
" 14 | \n",
" New.CleanUp.ReferenceOTU18724 | \n",
" 965048 | \n",
" 98.89 | \n",
" 180 | \n",
" 0 | \n",
" 2 | \n",
" 1 | \n",
" 178 | \n",
" 313 | \n",
" 134 | \n",
" 0 | \n",
" 325 | \n",
"
\n",
" \n",
" 15 | \n",
" New.CleanUp.ReferenceOTU1891 | \n",
" 4308647 | \n",
" 94.62 | \n",
" 186 | \n",
" 5 | \n",
" 3 | \n",
" 1 | \n",
" 186 | \n",
" 346 | \n",
" 166 | \n",
" 0 | \n",
" 270 | \n",
"
\n",
" \n",
" 16 | \n",
" New.CleanUp.ReferenceOTU2098 | \n",
" 979261 | \n",
" 100.00 | \n",
" 157 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 157 | \n",
" 317 | \n",
" 161 | \n",
" 0 | \n",
" 311 | \n",
"
\n",
" \n",
" 17 | \n",
" New.CleanUp.ReferenceOTU21175 | \n",
" 4347865 | \n",
" 100.00 | \n",
" 141 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 141 | \n",
" 334 | \n",
" 194 | \n",
" 0 | \n",
" 280 | \n",
"
\n",
" \n",
" 18 | \n",
" New.CleanUp.ReferenceOTU21766 | \n",
" 4467774 | \n",
" 100.00 | \n",
" 178 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 178 | \n",
" 308 | \n",
" 131 | \n",
" 0 | \n",
" 353 | \n",
"
\n",
" \n",
" 19 | \n",
" New.CleanUp.ReferenceOTU21849 | \n",
" 853867 | \n",
" 99.36 | \n",
" 156 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 155 | \n",
" 317 | \n",
" 162 | \n",
" 0 | \n",
" 293 | \n",
"
\n",
" \n",
" 20 | \n",
" New.CleanUp.ReferenceOTU22455 | \n",
" 4417848 | \n",
" 97.92 | \n",
" 192 | \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
" 193 | \n",
" 306 | \n",
" 118 | \n",
" 0 | \n",
" 335 | \n",
"
\n",
" \n",
" 21 | \n",
" New.CleanUp.ReferenceOTU22842 | \n",
" 379925 | \n",
" 98.63 | \n",
" 146 | \n",
" 0 | \n",
" 2 | \n",
" 1 | \n",
" 146 | \n",
" 298 | \n",
" 155 | \n",
" 0 | \n",
" 258 | \n",
"
\n",
" \n",
" 22 | \n",
" New.CleanUp.ReferenceOTU22953 | \n",
" 4368216 | \n",
" 100.00 | \n",
" 180 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 180 | \n",
" 315 | \n",
" 136 | \n",
" 0 | \n",
" 357 | \n",
"
\n",
" \n",
" 23 | \n",
" New.CleanUp.ReferenceOTU2306 | \n",
" 1064036 | \n",
" 97.59 | \n",
" 166 | \n",
" 2 | \n",
" 2 | \n",
" 1 | \n",
" 166 | \n",
" 326 | \n",
" 163 | \n",
" 0 | \n",
" 281 | \n",
"
\n",
" \n",
" 24 | \n",
" New.CleanUp.ReferenceOTU23446 | \n",
" 979261 | \n",
" 100.00 | \n",
" 144 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 144 | \n",
" 317 | \n",
" 174 | \n",
" 0 | \n",
" 285 | \n",
"
\n",
" \n",
" 25 | \n",
" New.CleanUp.ReferenceOTU24551 | \n",
" 867184 | \n",
" 93.95 | \n",
" 215 | \n",
" 4 | \n",
" 7 | \n",
" 1 | \n",
" 215 | \n",
" 329 | \n",
" 124 | \n",
" 0 | \n",
" 272 | \n",
"
\n",
" \n",
" 26 | \n",
" New.CleanUp.ReferenceOTU24622 | \n",
" 1061772 | \n",
" 100.00 | \n",
" 149 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 149 | \n",
" 317 | \n",
" 169 | \n",
" 0 | \n",
" 295 | \n",
"
\n",
" \n",
" 27 | \n",
" New.CleanUp.ReferenceOTU24747 | \n",
" 960131 | \n",
" 99.32 | \n",
" 148 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 148 | \n",
" 317 | \n",
" 170 | \n",
" 0 | \n",
" 285 | \n",
"
\n",
" \n",
" 28 | \n",
" New.CleanUp.ReferenceOTU24753 | \n",
" 129798 | \n",
" 96.36 | \n",
" 165 | \n",
" 0 | \n",
" 5 | \n",
" 1 | \n",
" 165 | \n",
" 368 | \n",
" 210 | \n",
" 0 | \n",
" 242 | \n",
"
\n",
" \n",
" 29 | \n",
" New.CleanUp.ReferenceOTU303 | \n",
" 1136387 | \n",
" 92.97 | \n",
" 185 | \n",
" 13 | \n",
" 0 | \n",
" 1 | \n",
" 185 | \n",
" 310 | \n",
" 126 | \n",
" 0 | \n",
" 264 | \n",
"
\n",
" \n",
" 30 | \n",
" New.CleanUp.ReferenceOTU3067 | \n",
" 495451 | \n",
" 96.94 | \n",
" 196 | \n",
" 3 | \n",
" 3 | \n",
" 1 | \n",
" 196 | \n",
" 311 | \n",
" 119 | \n",
" 0 | \n",
" 317 | \n",
"
\n",
" \n",
" 31 | \n",
" New.CleanUp.ReferenceOTU3375 | \n",
" 4477696 | \n",
" 100.00 | \n",
" 148 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 148 | \n",
" 329 | \n",
" 182 | \n",
" 0 | \n",
" 293 | \n",
"
\n",
" \n",
" 32 | \n",
" New.CleanUp.ReferenceOTU3384 | \n",
" 945061 | \n",
" 94.59 | \n",
" 222 | \n",
" 12 | \n",
" 0 | \n",
" 1 | \n",
" 222 | \n",
" 315 | \n",
" 94 | \n",
" 0 | \n",
" 345 | \n",
"
\n",
" \n",
" 33 | \n",
" New.CleanUp.ReferenceOTU3739 | \n",
" 4467774 | \n",
" 100.00 | \n",
" 188 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 188 | \n",
" 308 | \n",
" 121 | \n",
" 0 | \n",
" 373 | \n",
"
\n",
" \n",
" 34 | \n",
" New.CleanUp.ReferenceOTU4059 | \n",
" 635216 | \n",
" 98.58 | \n",
" 141 | \n",
" 0 | \n",
" 2 | \n",
" 1 | \n",
" 139 | \n",
" 300 | \n",
" 160 | \n",
" 0 | \n",
" 248 | \n",
"
\n",
" \n",
" 35 | \n",
" New.CleanUp.ReferenceOTU4528 | \n",
" 979261 | \n",
" 98.61 | \n",
" 144 | \n",
" 2 | \n",
" 0 | \n",
" 1 | \n",
" 144 | \n",
" 317 | \n",
" 174 | \n",
" 0 | \n",
" 270 | \n",
"
\n",
" \n",
" 36 | \n",
" New.CleanUp.ReferenceOTU5484 | \n",
" 934488 | \n",
" 100.00 | \n",
" 144 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 144 | \n",
" 317 | \n",
" 174 | \n",
" 0 | \n",
" 285 | \n",
"
\n",
" \n",
" 37 | \n",
" New.CleanUp.ReferenceOTU6173 | \n",
" 1110261 | \n",
" 96.22 | \n",
" 185 | \n",
" 5 | \n",
" 2 | \n",
" 1 | \n",
" 184 | \n",
" 318 | \n",
" 135 | \n",
" 0 | \n",
" 295 | \n",
"
\n",
" \n",
" 38 | \n",
" New.CleanUp.ReferenceOTU6680 | \n",
" 979261 | \n",
" 99.31 | \n",
" 144 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 144 | \n",
" 317 | \n",
" 174 | \n",
" 0 | \n",
" 278 | \n",
"
\n",
" \n",
" 39 | \n",
" New.CleanUp.ReferenceOTU720 | \n",
" 932077 | \n",
" 100.00 | \n",
" 144 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 144 | \n",
" 313 | \n",
" 170 | \n",
" 0 | \n",
" 285 | \n",
"
\n",
" \n",
" 40 | \n",
" New.CleanUp.ReferenceOTU7237 | \n",
" 979261 | \n",
" 100.00 | \n",
" 144 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 144 | \n",
" 317 | \n",
" 174 | \n",
" 0 | \n",
" 285 | \n",
"
\n",
" \n",
" 41 | \n",
" New.CleanUp.ReferenceOTU7455 | \n",
" 129798 | \n",
" 100.00 | \n",
" 139 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 139 | \n",
" 368 | \n",
" 230 | \n",
" 0 | \n",
" 276 | \n",
"
\n",
" \n",
" 42 | \n",
" New.CleanUp.ReferenceOTU7553 | \n",
" 335887 | \n",
" 99.40 | \n",
" 168 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 168 | \n",
" 337 | \n",
" 171 | \n",
" 0 | \n",
" 317 | \n",
"
\n",
" \n",
" 43 | \n",
" New.CleanUp.ReferenceOTU7914 | \n",
" 853867 | \n",
" 100.00 | \n",
" 156 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 156 | \n",
" 317 | \n",
" 162 | \n",
" 0 | \n",
" 309 | \n",
"
\n",
" \n",
" 44 | \n",
" New.CleanUp.ReferenceOTU7942 | \n",
" 4442170 | \n",
" 100.00 | \n",
" 154 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 154 | \n",
" 310 | \n",
" 157 | \n",
" 0 | \n",
" 305 | \n",
"
\n",
" \n",
" 45 | \n",
" New.CleanUp.ReferenceOTU8857 | \n",
" 669486 | \n",
" 98.26 | \n",
" 172 | \n",
" 3 | \n",
" 0 | \n",
" 1 | \n",
" 172 | \n",
" 330 | \n",
" 159 | \n",
" 0 | \n",
" 317 | \n",
"
\n",
" \n",
" 46 | \n",
" New.CleanUp.ReferenceOTU9075 | \n",
" 960131 | \n",
" 99.35 | \n",
" 153 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 152 | \n",
" 317 | \n",
" 165 | \n",
" 0 | \n",
" 287 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": [
" Query id Subject id % identity alignment length \\\n",
"0 New.CleanUp.ReferenceOTU10449 853867 100.00 149 \n",
"1 New.CleanUp.ReferenceOTU10796 960131 100.00 131 \n",
"2 New.CleanUp.ReferenceOTU11004 960131 99.32 148 \n",
"3 New.CleanUp.ReferenceOTU11548 509773 100.00 132 \n",
"4 New.CleanUp.ReferenceOTU12392 853867 100.00 156 \n",
"5 New.CleanUp.ReferenceOTU12511 853867 98.72 156 \n",
"6 New.CleanUp.ReferenceOTU12532 1000986 98.84 173 \n",
"7 New.CleanUp.ReferenceOTU13882 129798 97.04 169 \n",
"8 New.CleanUp.ReferenceOTU14202 960131 98.18 165 \n",
"9 New.CleanUp.ReferenceOTU15614 4366889 96.49 171 \n",
"10 New.CleanUp.ReferenceOTU16729 4467774 99.45 181 \n",
"11 New.CleanUp.ReferenceOTU17453 870751 100.00 137 \n",
"12 New.CleanUp.ReferenceOTU18041 4422519 99.04 209 \n",
"13 New.CleanUp.ReferenceOTU18339 4467774 99.43 176 \n",
"14 New.CleanUp.ReferenceOTU18724 965048 98.89 180 \n",
"15 New.CleanUp.ReferenceOTU1891 4308647 94.62 186 \n",
"16 New.CleanUp.ReferenceOTU2098 979261 100.00 157 \n",
"17 New.CleanUp.ReferenceOTU21175 4347865 100.00 141 \n",
"18 New.CleanUp.ReferenceOTU21766 4467774 100.00 178 \n",
"19 New.CleanUp.ReferenceOTU21849 853867 99.36 156 \n",
"20 New.CleanUp.ReferenceOTU22455 4417848 97.92 192 \n",
"21 New.CleanUp.ReferenceOTU22842 379925 98.63 146 \n",
"22 New.CleanUp.ReferenceOTU22953 4368216 100.00 180 \n",
"23 New.CleanUp.ReferenceOTU2306 1064036 97.59 166 \n",
"24 New.CleanUp.ReferenceOTU23446 979261 100.00 144 \n",
"25 New.CleanUp.ReferenceOTU24551 867184 93.95 215 \n",
"26 New.CleanUp.ReferenceOTU24622 1061772 100.00 149 \n",
"27 New.CleanUp.ReferenceOTU24747 960131 99.32 148 \n",
"28 New.CleanUp.ReferenceOTU24753 129798 96.36 165 \n",
"29 New.CleanUp.ReferenceOTU303 1136387 92.97 185 \n",
"30 New.CleanUp.ReferenceOTU3067 495451 96.94 196 \n",
"31 New.CleanUp.ReferenceOTU3375 4477696 100.00 148 \n",
"32 New.CleanUp.ReferenceOTU3384 945061 94.59 222 \n",
"33 New.CleanUp.ReferenceOTU3739 4467774 100.00 188 \n",
"34 New.CleanUp.ReferenceOTU4059 635216 98.58 141 \n",
"35 New.CleanUp.ReferenceOTU4528 979261 98.61 144 \n",
"36 New.CleanUp.ReferenceOTU5484 934488 100.00 144 \n",
"37 New.CleanUp.ReferenceOTU6173 1110261 96.22 185 \n",
"38 New.CleanUp.ReferenceOTU6680 979261 99.31 144 \n",
"39 New.CleanUp.ReferenceOTU720 932077 100.00 144 \n",
"40 New.CleanUp.ReferenceOTU7237 979261 100.00 144 \n",
"41 New.CleanUp.ReferenceOTU7455 129798 100.00 139 \n",
"42 New.CleanUp.ReferenceOTU7553 335887 99.40 168 \n",
"43 New.CleanUp.ReferenceOTU7914 853867 100.00 156 \n",
"44 New.CleanUp.ReferenceOTU7942 4442170 100.00 154 \n",
"45 New.CleanUp.ReferenceOTU8857 669486 98.26 172 \n",
"46 New.CleanUp.ReferenceOTU9075 960131 99.35 153 \n",
"\n",
" mismatches gap openings q. start q. end s. start s. end e-value \\\n",
"0 0 0 1 149 317 169 0 \n",
"1 0 0 1 131 317 187 0 \n",
"2 1 0 1 148 317 170 0 \n",
"3 0 0 1 132 336 205 0 \n",
"4 0 0 1 156 317 162 0 \n",
"5 1 1 1 155 317 162 0 \n",
"6 2 0 1 173 304 132 0 \n",
"7 1 3 1 169 368 204 0 \n",
"8 1 1 1 165 317 155 0 \n",
"9 1 3 1 171 303 138 0 \n",
"10 1 0 1 181 308 128 0 \n",
"11 0 0 1 137 317 181 0 \n",
"12 0 2 1 209 349 143 0 \n",
"13 0 1 1 176 308 134 0 \n",
"14 0 2 1 178 313 134 0 \n",
"15 5 3 1 186 346 166 0 \n",
"16 0 0 1 157 317 161 0 \n",
"17 0 0 1 141 334 194 0 \n",
"18 0 0 1 178 308 131 0 \n",
"19 0 1 1 155 317 162 0 \n",
"20 1 2 2 193 306 118 0 \n",
"21 0 2 1 146 298 155 0 \n",
"22 0 0 1 180 315 136 0 \n",
"23 2 2 1 166 326 163 0 \n",
"24 0 0 1 144 317 174 0 \n",
"25 4 7 1 215 329 124 0 \n",
"26 0 0 1 149 317 169 0 \n",
"27 1 0 1 148 317 170 0 \n",
"28 0 5 1 165 368 210 0 \n",
"29 13 0 1 185 310 126 0 \n",
"30 3 3 1 196 311 119 0 \n",
"31 0 0 1 148 329 182 0 \n",
"32 12 0 1 222 315 94 0 \n",
"33 0 0 1 188 308 121 0 \n",
"34 0 2 1 139 300 160 0 \n",
"35 2 0 1 144 317 174 0 \n",
"36 0 0 1 144 317 174 0 \n",
"37 5 2 1 184 318 135 0 \n",
"38 1 0 1 144 317 174 0 \n",
"39 0 0 1 144 313 170 0 \n",
"40 0 0 1 144 317 174 0 \n",
"41 0 0 1 139 368 230 0 \n",
"42 0 1 1 168 337 171 0 \n",
"43 0 0 1 156 317 162 0 \n",
"44 0 0 1 154 310 157 0 \n",
"45 3 0 1 172 330 159 0 \n",
"46 0 1 1 152 317 165 0 \n",
"\n",
" bit scor \n",
"0 295 \n",
"1 260 \n",
"2 285 \n",
"3 262 \n",
"4 309 \n",
"5 285 \n",
"6 327 \n",
"7 274 \n",
"8 297 \n",
"9 272 \n",
"10 351 \n",
"11 272 \n",
"12 383 \n",
"13 333 \n",
"14 325 \n",
"15 270 \n",
"16 311 \n",
"17 280 \n",
"18 353 \n",
"19 293 \n",
"20 335 \n",
"21 258 \n",
"22 357 \n",
"23 281 \n",
"24 285 \n",
"25 272 \n",
"26 295 \n",
"27 285 \n",
"28 242 \n",
"29 264 \n",
"30 317 \n",
"31 293 \n",
"32 345 \n",
"33 373 \n",
"34 248 \n",
"35 270 \n",
"36 285 \n",
"37 295 \n",
"38 278 \n",
"39 285 \n",
"40 285 \n",
"41 276 \n",
"42 317 \n",
"43 309 \n",
"44 305 \n",
"45 317 \n",
"46 287 "
]
}
],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"blast_results_qdr.mean()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": [
"Subject id 1729626.042553\n",
"% identity 98.667660\n",
"alignment length 163.468085\n",
"mismatches 1.276596\n",
"gap openings 0.936170\n",
"q. start 1.021277\n",
"q. end 163.319149\n",
"s. start 320.723404\n",
"s. end 159.234043\n",
"e-value 0.000000\n",
"bit scor 297.787234\n",
"dtype: float64"
]
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we see that the mean alignment length is $163$, but the mean/standard deviation of the sequences provided as input above was $256 +/- 16$ ($163/256 \\approx 63\\%$ of sequence length is aligned). It therefore looks like we're getting partial alignments, which may be indicative of chimeras or sequence quality drop-off. This suggests that these sequences *may* be aligning to sequences that were in the core set, but which were dropped from in Greengenes 13_8 (or some earlier version of the database)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"blast_out_core = join(pyn_out_core, \"failures-that-aligned-v-pyn-85-otus-blast-results\")\n",
"\n",
"!parallel_blast.py -i $pyn_out_core/failures-that-aligned-v-pyn-85-otus.fasta -b $db_name -o $blast_out_core -O 8"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"blast_headers = !head $blast_out_core/failures-that-aligned-v-pyn-85-otus_blast_out.txt\n",
"blast_headers = blast_headers[3].strip('# Fields: ').split(', ')\n",
"blast_out = !grep -v '^#' $blast_out_core/failures-that-aligned-v-pyn-85-otus_blast_out.txt\n",
"blast_out = [l.split('\\t') for l in blast_out]\n",
"blast_results_core = pd.DataFrame(blast_out, columns=blast_headers).convert_objects(convert_numeric=True)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"blast_results_core"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Query id | \n",
" Subject id | \n",
" % identity | \n",
" alignment length | \n",
" mismatches | \n",
" gap openings | \n",
" q. start | \n",
" q. end | \n",
" s. start | \n",
" s. end | \n",
" e-value | \n",
" bit scor | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" New.CleanUp.ReferenceOTU11778 | \n",
" 2426773 | \n",
" 99.17 | \n",
" 120 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 120 | \n",
" 486 | \n",
" 367 | \n",
" 0 | \n",
" 230 | \n",
"
\n",
" \n",
" 1 | \n",
" New.CleanUp.ReferenceOTU14235 | \n",
" 1100972 | \n",
" 98.84 | \n",
" 172 | \n",
" 2 | \n",
" 0 | \n",
" 1 | \n",
" 172 | \n",
" 316 | \n",
" 145 | \n",
" 0 | \n",
" 325 | \n",
"
\n",
" \n",
" 2 | \n",
" New.CleanUp.ReferenceOTU14439 | \n",
" 3339747 | \n",
" 99.55 | \n",
" 223 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 223 | \n",
" 305 | \n",
" 84 | \n",
" 0 | \n",
" 426 | \n",
"
\n",
" \n",
" 3 | \n",
" New.CleanUp.ReferenceOTU15749 | \n",
" 937813 | \n",
" 95.58 | \n",
" 181 | \n",
" 7 | \n",
" 1 | \n",
" 1 | \n",
" 181 | \n",
" 323 | \n",
" 144 | \n",
" 0 | \n",
" 287 | \n",
"
\n",
" \n",
" 4 | \n",
" New.CleanUp.ReferenceOTU16523 | \n",
" 4484060 | \n",
" 96.49 | \n",
" 228 | \n",
" 2 | \n",
" 4 | \n",
" 1 | \n",
" 228 | \n",
" 351 | \n",
" 130 | \n",
" 0 | \n",
" 361 | \n",
"
\n",
" \n",
" 5 | \n",
" New.CleanUp.ReferenceOTU20112 | \n",
" 930873 | \n",
" 99.37 | \n",
" 159 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 159 | \n",
" 323 | \n",
" 166 | \n",
" 0 | \n",
" 299 | \n",
"
\n",
" \n",
" 6 | \n",
" New.CleanUp.ReferenceOTU20493 | \n",
" 930873 | \n",
" 98.78 | \n",
" 164 | \n",
" 2 | \n",
" 0 | \n",
" 1 | \n",
" 164 | \n",
" 323 | \n",
" 160 | \n",
" 0 | \n",
" 309 | \n",
"
\n",
" \n",
" 7 | \n",
" New.CleanUp.ReferenceOTU2210 | \n",
" 4466460 | \n",
" 97.56 | \n",
" 164 | \n",
" 4 | \n",
" 0 | \n",
" 1 | \n",
" 164 | \n",
" 307 | \n",
" 144 | \n",
" 0 | \n",
" 293 | \n",
"
\n",
" \n",
" 8 | \n",
" New.CleanUp.ReferenceOTU2265 | \n",
" 827969 | \n",
" 98.36 | \n",
" 183 | \n",
" 3 | \n",
" 0 | \n",
" 1 | \n",
" 183 | \n",
" 319 | \n",
" 137 | \n",
" 0 | \n",
" 339 | \n",
"
\n",
" \n",
" 9 | \n",
" New.CleanUp.ReferenceOTU23577 | \n",
" 1035532 | \n",
" 100.00 | \n",
" 179 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 179 | \n",
" 311 | \n",
" 133 | \n",
" 0 | \n",
" 355 | \n",
"
\n",
" \n",
" 10 | \n",
" New.CleanUp.ReferenceOTU24778 | \n",
" 495095 | \n",
" 98.35 | \n",
" 182 | \n",
" 3 | \n",
" 0 | \n",
" 1 | \n",
" 182 | \n",
" 323 | \n",
" 142 | \n",
" 0 | \n",
" 337 | \n",
"
\n",
" \n",
" 11 | \n",
" New.CleanUp.ReferenceOTU2799 | \n",
" 4302012 | \n",
" 91.42 | \n",
" 233 | \n",
" 16 | \n",
" 3 | \n",
" 1 | \n",
" 230 | \n",
" 304 | \n",
" 73 | \n",
" 0 | \n",
" 281 | \n",
"
\n",
" \n",
" 12 | \n",
" New.CleanUp.ReferenceOTU5311 | \n",
" 1035532 | \n",
" 100.00 | \n",
" 179 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 179 | \n",
" 311 | \n",
" 133 | \n",
" 0 | \n",
" 355 | \n",
"
\n",
" \n",
" 13 | \n",
" New.CleanUp.ReferenceOTU8921 | \n",
" 1035532 | \n",
" 99.44 | \n",
" 179 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 179 | \n",
" 311 | \n",
" 133 | \n",
" 0 | \n",
" 347 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 13,
"text": [
" Query id Subject id % identity alignment length \\\n",
"0 New.CleanUp.ReferenceOTU11778 2426773 99.17 120 \n",
"1 New.CleanUp.ReferenceOTU14235 1100972 98.84 172 \n",
"2 New.CleanUp.ReferenceOTU14439 3339747 99.55 223 \n",
"3 New.CleanUp.ReferenceOTU15749 937813 95.58 181 \n",
"4 New.CleanUp.ReferenceOTU16523 4484060 96.49 228 \n",
"5 New.CleanUp.ReferenceOTU20112 930873 99.37 159 \n",
"6 New.CleanUp.ReferenceOTU20493 930873 98.78 164 \n",
"7 New.CleanUp.ReferenceOTU2210 4466460 97.56 164 \n",
"8 New.CleanUp.ReferenceOTU2265 827969 98.36 183 \n",
"9 New.CleanUp.ReferenceOTU23577 1035532 100.00 179 \n",
"10 New.CleanUp.ReferenceOTU24778 495095 98.35 182 \n",
"11 New.CleanUp.ReferenceOTU2799 4302012 91.42 233 \n",
"12 New.CleanUp.ReferenceOTU5311 1035532 100.00 179 \n",
"13 New.CleanUp.ReferenceOTU8921 1035532 99.44 179 \n",
"\n",
" mismatches gap openings q. start q. end s. start s. end e-value \\\n",
"0 1 0 1 120 486 367 0 \n",
"1 2 0 1 172 316 145 0 \n",
"2 0 1 1 223 305 84 0 \n",
"3 7 1 1 181 323 144 0 \n",
"4 2 4 1 228 351 130 0 \n",
"5 0 1 1 159 323 166 0 \n",
"6 2 0 1 164 323 160 0 \n",
"7 4 0 1 164 307 144 0 \n",
"8 3 0 1 183 319 137 0 \n",
"9 0 0 1 179 311 133 0 \n",
"10 3 0 1 182 323 142 0 \n",
"11 16 3 1 230 304 73 0 \n",
"12 0 0 1 179 311 133 0 \n",
"13 1 0 1 179 311 133 0 \n",
"\n",
" bit scor \n",
"0 230 \n",
"1 325 \n",
"2 426 \n",
"3 287 \n",
"4 361 \n",
"5 299 \n",
"6 309 \n",
"7 293 \n",
"8 339 \n",
"9 355 \n",
"10 337 \n",
"11 281 \n",
"12 355 \n",
"13 347 "
]
}
],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"blast_results_core.mean()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"Subject id 1953517.357143\n",
"% identity 98.065000\n",
"alignment length 181.857143\n",
"mismatches 2.928571\n",
"gap openings 0.714286\n",
"q. start 1.000000\n",
"q. end 181.642857\n",
"s. start 329.500000\n",
"s. end 149.357143\n",
"e-value 0.000000\n",
"bit scor 324.571429\n",
"dtype: float64"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here it also looks like these are partial alignments, but average a little longer ($181/236 \\approx 77\\% $ of sequence length aligned). These are within the threshold for PyNAST to align them (default is 75% of the median input sequence length), which is why we see them align to the 85% OTU alignment. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# V4 sequence collection\n",
"\n",
"My next test was performed with the V4 sequences deposited under study id ``550`` in the QIIME database (the [Moving Pictures](http://genomebiology.com/2011/12/5/r50)) dataset. These tests are less extensive than the ones above, just because I've run out of time for now."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pyn_out_qdr = join(rep_set_550_out_dir, \"qdr012\")\n",
"!parallel_align_seqs_pynast.py -i $rep_set_550_fp -o $pyn_out_qdr --jobs_to_start 8\n",
"fasta_fps = join(pyn_out_qdr, \"*fasta\")\n",
"!count_seqs.py -i \"$fasta_fps\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r\n",
"1475 : /home/ubuntu/qdr-issue14/output/550_out/qdr012/rep_set.550.1_failures.fasta (Sequence lengths (mean +/- std): 92.4312 +/- 18.7288)\r\n",
"13594 : /home/ubuntu/qdr-issue14/output/550_out/qdr012/rep_set.550.1_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)\r\n",
"15069 : Total\r\n"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is 9.8% failing to align against the 85% PyNAST-aligned OTUs. \n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pyn_out_core = join(rep_set_550_out_dir, \"core-set\")\n",
"!parallel_align_seqs_pynast.py -i $rep_set_550_fp -o $pyn_out_core --jobs_to_start 8 -t $core_set_template_fp\n",
"fasta_fps = join(pyn_out_core, \"*fasta\")\n",
"!count_seqs.py -i \"$fasta_fps\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r\n",
"1596 : /home/ubuntu/qdr-issue14/output/550_out/core-set/rep_set.550.1_failures.fasta (Sequence lengths (mean +/- std): 95.2481 +/- 20.8748)\r\n",
"13473 : /home/ubuntu/qdr-issue14/output/550_out/core-set/rep_set.550.1_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)\r\n",
"15069 : Total\r\n"
]
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is 10.6% failing to align against the core set. \n",
"\n",
"So, unlike what we saw with V2 above, we're now having fewer sequences failing to align to the Greengenes 85% OTUs than failed to align to the core set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The analyses presented here were originally described in [biocore/qiime-default-reference#14](https://github.com/biocore/qiime-default-reference/issues/14). These were ported to a separate repository for reproducibility."
]
}
],
"metadata": {}
}
]
}