{ "metadata": { "name": "", "signature": "sha256:e68f8510cc8a35217d86e9e8d72c074e2e737ec220f51784e6dff6cc9461dc53" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "In an additional test of PyNAST alignment of V2 sequences with the core set versus the PyNAST 85% OTU alignment added in [biocore/qiime-default-reference#15](https://github.com/biocore/qiime-default-reference/pull/15), I find that only slightly more sequences fail to align with the PyNAST 85% OTU alignment than with the old core set. Details of my test follow. Since we don't have an inherent way to say which is better (i.e., it's possible that we want those additional sequences to fail), and we have complete understanding of how the PyNAST 85% OTU alignment was generated (which is not true for the core set), we'll move forward with the plan to use the PyNAST 85% OTU alignment as the default reference template alignment. \n", "\n", "# V2 sequence collection\n", "\n", "My first test was performed with the V2 sequences deposited under study id ``449`` in the QIIME database (the [Costello Whole Body](http://www.sciencemag.org/content/326/5960/1694.abstract)) dataset. I ran the following command to align representative sequences against the PyNAST 85% OTU alignment:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "## Uncomment to remove output from previous run.\n", "!rm -r output\n", "import pandas as pd\n", "from os.path import join, abspath\n", "from qiime_default_reference import get_reference_sequences\n", "from bfillings.formatdb import build_blast_db_from_fasta_path \n", "\n", "gg_97_reference = get_reference_sequences()\n", "db_name, db_files = build_blast_db_from_fasta_path(gg_97_reference, output_dir='./')\n", "\n", "!wget http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/core_set_aligned.fasta.imputed\n", "core_set_template_fp = abspath(\"core_set_aligned.fasta.imputed\")\n", "rep_set_449_fp = abspath(\"data/rep_set.449.fna\")\n", "rep_set_449_out_dir = abspath(\"output/449_out\")\n", "rep_set_550_fp = abspath(\"data/rep_set.550.1.fna\")\n", "rep_set_550_out_dir = abspath(\"output/550_out\")" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "rm: output: No such file or directory\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "--2015-04-16 09:15:50-- http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/core_set_aligned.fasta.imputed\r\n", "Resolving greengenes.lbl.gov... " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "128.32.248.7\r\n", "Connecting to greengenes.lbl.gov|128.32.248.7|:80... " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "connected.\r\n", "HTTP request sent, awaiting response... " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "200 OK\r\n", "Length: 37975992 (36M) [text/plain]\r\n", "Saving to: 'core_set_aligned.fasta.imputed'\r\n", "\r\n", "\r", " 0% [ ] 0 --.-K/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", " 0% [ ] 39,393 184KB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", " 0% [ ] 303,104 651KB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", " 2% [ ] 924,192 1.32MB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", " 4% [> ] 1,769,616 1.92MB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", " 7% [=> ] 2,798,352 2.45MB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "10% [===> ] 4,151,304 3.04MB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "15% [====> ] 5,754,600 3.63MB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "20% [======> ] 7,690,320 4.25MB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "21% [=======> ] 8,132,184 4.01MB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "24% [========> ] 9,289,512 4.12MB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "27% [=========> ] 10,255,320 4.14MB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "29% [==========> ] 11,225,216 4.16MB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "32% [===========> ] 12,184,200 4.17MB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "34% [============> ] 13,154,096 4.18MB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "37% [=============> ] 14,111,712 4.20MB/s eta 5s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "39% [==============> ] 15,078,888 4.20MB/s eta 5s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "42% [===============> ] 16,041,960 4.41MB/s eta 5s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "44% [================> ] 16,921,568 4.56MB/s eta 5s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "46% [=================> ] 17,712,288 4.72MB/s eta 5s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "48% [=================> ] 18,345,672 4.66MB/s eta 5s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "50% [==================> ] 18,988,632 4.57MB/s eta 5s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "51% [===================> ] 19,630,224 4.42MB/s eta 5s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "53% [===================> ] 20,278,640 4.08MB/s eta 5s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "55% [====================> ] 20,917,512 3.79MB/s eta 5s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "56% [=====================> ] 21,560,472 3.75MB/s eta 4s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "58% [=====================> ] 22,206,168 3.72MB/s eta 4s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "60% [======================> ] 22,853,216 3.54MB/s eta 4s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "61% [=======================> ] 23,492,088 3.46MB/s eta 4s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "63% [=======================> ] 24,141,888 3.38MB/s eta 4s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "65% [========================> ] 24,778,008 3.31MB/s eta 3s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "66% [=========================> ] 25,427,792 3.18MB/s eta 3s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "68% [=========================> ] 26,058,456 3.10MB/s eta 3s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "70% [==========================> ] 26,715,096 3.03MB/s eta 3s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "72% [===========================> ] 27,353,952 2.96MB/s eta 3s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "73% [===========================> ] 27,988,704 2.88MB/s eta 3s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "75% [============================> ] 28,518,104 2.85MB/s eta 3s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "75% [============================> ] 28,832,760 2.76MB/s eta 3s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "76% [============================> ] 29,163,800 2.68MB/s eta 3s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "77% [=============================> ] 29,479,824 2.58MB/s eta 3s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "78% [=============================> ] 29,799,936 2.50MB/s eta 2s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "79% [=============================> ] 30,122,784 2.42MB/s eta 2s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "80% [==============================> ] 30,444,264 2.33MB/s eta 2s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "81% [==============================> ] 30,768,480 2.22MB/s eta 2s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "81% [==============================> ] 31,087,224 2.15MB/s eta 2s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "82% [===============================> ] 31,414,160 2.07MB/s eta 2s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "83% [===============================> ] 31,735,640 1.99MB/s eta 2s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "84% [===============================> ] 32,057,120 1.87MB/s eta 2s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "85% [================================> ] 32,371,776 1.79MB/s eta 2s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "86% [================================> ] 32,700,080 1.72MB/s eta 2s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "86% [================================> ] 33,014,736 1.64MB/s eta 2s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "87% [=================================> ] 33,343,040 1.51MB/s eta 2s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "88% [=================================> ] 33,659,064 1.46MB/s eta 2s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "89% [=================================> ] 33,979,176 1.45MB/s eta 2s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "90% [==================================> ] 34,300,656 1.45MB/s eta 2s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "91% [==================================> ] 34,622,136 1.45MB/s eta 1s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "92% [==================================> ] 34,943,616 1.45MB/s eta 1s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "92% [===================================> ] 35,265,096 1.45MB/s eta 1s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "93% [===================================> ] 35,587,944 1.45MB/s eta 1s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "94% [===================================> ] 35,908,056 1.45MB/s eta 1s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "95% [====================================> ] 36,230,904 1.45MB/s eta 1s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "96% [====================================> ] 36,551,016 1.45MB/s eta 1s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "97% [====================================> ] 36,873,864 1.45MB/s eta 1s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "97% [=====================================> ] 37,200,800 1.45MB/s eta 1s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "98% [=====================================> ] 37,516,824 1.45MB/s eta 1s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "99% [=====================================> ] 37,838,304 1.45MB/s eta 0s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "100%[======================================>] 37,975,992 1.47MB/s in 14s \r\n", "\r\n", "2015-04-16 09:16:07 (2.62 MB/s) - 'core_set_aligned.fasta.imputed' saved [37975992/37975992]\r\n", "\r\n" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "pyn_out_qdr = join(rep_set_449_out_dir, \"qdr012\")\n", "!parallel_align_seqs_pynast.py -i $rep_set_449_fp -o $pyn_out_qdr --jobs_to_start 8\n", "fasta_fps = join(pyn_out_qdr, \"*fasta\")\n", "!count_seqs.py -i \"$fasta_fps\"" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "204 : /home/ubuntu/qdr-issue14/output/449_out/qdr012/rep_set.449_failures.fasta (Sequence lengths (mean +/- std): 181.9706 +/- 56.2609)\r\n", "10200 : /home/ubuntu/qdr-issue14/output/449_out/qdr012/rep_set.449_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)\r\n", "10404 : Total\r\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tells us that 204 sequences failed to align with PyNAST out of 10404. \n", "\n", "Next, we perform the same alignment against the core set. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "pyn_out_core = join(rep_set_449_out_dir, \"core-set\")\n", "!parallel_align_seqs_pynast.py -i $rep_set_449_fp -o $pyn_out_core --jobs_to_start 8 -t $core_set_template_fp\n", "fasta_fps = join(pyn_out_core, \"*fasta\")\n", "!count_seqs.py -i \"$fasta_fps\"" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "171 : /home/ubuntu/qdr-issue14/output/449_out/core-set/rep_set.449_failures.fasta (Sequence lengths (mean +/- std): 166.0292 +/- 46.7550)\r\n", "10233 : /home/ubuntu/qdr-issue14/output/449_out/core-set/rep_set.449_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)\r\n", "10404 : Total\r\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tells us that 171 sequences failed to align with PyNAST out of 10404, so slightly less when aligning against the core set.\n", "\n", "However, it's important to note that different sequences failed to align with the two data sets (i.e., the sequences that fail to align against the core set are not a subset of the sequences that fail to align against the 85% PyNAST-aligned OTUs). We therefore don't immediately know which is better." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# find the sequences that didn't align against the 85% PyNAST-aligned OTUs\n", "# but did align against the core set\n", "!filter_fasta.py -f $pyn_out_qdr/rep_set.449_failures.fasta -o $pyn_out_qdr/failures-that-aligned-v-core-set.fasta -a $pyn_out_core/rep_set.449_aligned.fasta\n", "!count_seqs.py -i $pyn_out_qdr/failures-that-aligned-v-core-set.fasta" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "47 : /home/ubuntu/qdr-issue14/output/449_out/qdr012/failures-that-aligned-v-core-set.fasta (Sequence lengths (mean +/- std): 256.3404 +/- 16.3086)\r\n", "47 : Total\r\n" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "# find the sequences that didn't align against the core set\n", "# but did align against the 85% PyNAST-aligned OTUs\n", "!filter_fasta.py -f $pyn_out_core/rep_set.449_failures.fasta -o $pyn_out_core/failures-that-aligned-v-pyn-85-otus.fasta -a $pyn_out_qdr/rep_set.449_aligned.fasta\n", "!count_seqs.py -i $pyn_out_core/failures-that-aligned-v-pyn-85-otus.fasta\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "14 : /home/ubuntu/qdr-issue14/output/449_out/core-set/failures-that-aligned-v-pyn-85-otus.fasta (Sequence lengths (mean +/- std): 236.9286 +/- 11.0677)\r\n", "14 : Total\r\n" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "So what do the BLAST alignments look like if we query these sequences against the Greengenes 13_8 97% OTUs?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "blast_out_qdr = join(pyn_out_qdr, \"failures-that-aligned-v-core-set-blast-results\")\n", "\n", "!parallel_blast.py -i $pyn_out_qdr/failures-that-aligned-v-core-set.fasta -b $db_name -o $blast_out_qdr -O 8" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "blast_headers = !head $blast_out_qdr/failures-that-aligned-v-core-set_blast_out.txt\n", "blast_headers = blast_headers[3].strip('# Fields: ').split(', ')\n", "blast_out = !grep -v '^#' $blast_out_qdr/failures-that-aligned-v-core-set_blast_out.txt\n", "blast_out = [l.split('\\t') for l in blast_out]\n", "blast_results_qdr = pd.DataFrame(blast_out, columns=blast_headers).convert_objects(convert_numeric=True)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "blast_results_qdr" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Query idSubject id% identityalignment lengthmismatchesgap openingsq. startq. ends. starts. ende-valuebit scor
0 New.CleanUp.ReferenceOTU10449 853867 100.00 149 0 0 1 149 317 169 0 295
1 New.CleanUp.ReferenceOTU10796 960131 100.00 131 0 0 1 131 317 187 0 260
2 New.CleanUp.ReferenceOTU11004 960131 99.32 148 1 0 1 148 317 170 0 285
3 New.CleanUp.ReferenceOTU11548 509773 100.00 132 0 0 1 132 336 205 0 262
4 New.CleanUp.ReferenceOTU12392 853867 100.00 156 0 0 1 156 317 162 0 309
5 New.CleanUp.ReferenceOTU12511 853867 98.72 156 1 1 1 155 317 162 0 285
6 New.CleanUp.ReferenceOTU12532 1000986 98.84 173 2 0 1 173 304 132 0 327
7 New.CleanUp.ReferenceOTU13882 129798 97.04 169 1 3 1 169 368 204 0 274
8 New.CleanUp.ReferenceOTU14202 960131 98.18 165 1 1 1 165 317 155 0 297
9 New.CleanUp.ReferenceOTU15614 4366889 96.49 171 1 3 1 171 303 138 0 272
10 New.CleanUp.ReferenceOTU16729 4467774 99.45 181 1 0 1 181 308 128 0 351
11 New.CleanUp.ReferenceOTU17453 870751 100.00 137 0 0 1 137 317 181 0 272
12 New.CleanUp.ReferenceOTU18041 4422519 99.04 209 0 2 1 209 349 143 0 383
13 New.CleanUp.ReferenceOTU18339 4467774 99.43 176 0 1 1 176 308 134 0 333
14 New.CleanUp.ReferenceOTU18724 965048 98.89 180 0 2 1 178 313 134 0 325
15 New.CleanUp.ReferenceOTU1891 4308647 94.62 186 5 3 1 186 346 166 0 270
16 New.CleanUp.ReferenceOTU2098 979261 100.00 157 0 0 1 157 317 161 0 311
17 New.CleanUp.ReferenceOTU21175 4347865 100.00 141 0 0 1 141 334 194 0 280
18 New.CleanUp.ReferenceOTU21766 4467774 100.00 178 0 0 1 178 308 131 0 353
19 New.CleanUp.ReferenceOTU21849 853867 99.36 156 0 1 1 155 317 162 0 293
20 New.CleanUp.ReferenceOTU22455 4417848 97.92 192 1 2 2 193 306 118 0 335
21 New.CleanUp.ReferenceOTU22842 379925 98.63 146 0 2 1 146 298 155 0 258
22 New.CleanUp.ReferenceOTU22953 4368216 100.00 180 0 0 1 180 315 136 0 357
23 New.CleanUp.ReferenceOTU2306 1064036 97.59 166 2 2 1 166 326 163 0 281
24 New.CleanUp.ReferenceOTU23446 979261 100.00 144 0 0 1 144 317 174 0 285
25 New.CleanUp.ReferenceOTU24551 867184 93.95 215 4 7 1 215 329 124 0 272
26 New.CleanUp.ReferenceOTU24622 1061772 100.00 149 0 0 1 149 317 169 0 295
27 New.CleanUp.ReferenceOTU24747 960131 99.32 148 1 0 1 148 317 170 0 285
28 New.CleanUp.ReferenceOTU24753 129798 96.36 165 0 5 1 165 368 210 0 242
29 New.CleanUp.ReferenceOTU303 1136387 92.97 185 13 0 1 185 310 126 0 264
30 New.CleanUp.ReferenceOTU3067 495451 96.94 196 3 3 1 196 311 119 0 317
31 New.CleanUp.ReferenceOTU3375 4477696 100.00 148 0 0 1 148 329 182 0 293
32 New.CleanUp.ReferenceOTU3384 945061 94.59 222 12 0 1 222 315 94 0 345
33 New.CleanUp.ReferenceOTU3739 4467774 100.00 188 0 0 1 188 308 121 0 373
34 New.CleanUp.ReferenceOTU4059 635216 98.58 141 0 2 1 139 300 160 0 248
35 New.CleanUp.ReferenceOTU4528 979261 98.61 144 2 0 1 144 317 174 0 270
36 New.CleanUp.ReferenceOTU5484 934488 100.00 144 0 0 1 144 317 174 0 285
37 New.CleanUp.ReferenceOTU6173 1110261 96.22 185 5 2 1 184 318 135 0 295
38 New.CleanUp.ReferenceOTU6680 979261 99.31 144 1 0 1 144 317 174 0 278
39 New.CleanUp.ReferenceOTU720 932077 100.00 144 0 0 1 144 313 170 0 285
40 New.CleanUp.ReferenceOTU7237 979261 100.00 144 0 0 1 144 317 174 0 285
41 New.CleanUp.ReferenceOTU7455 129798 100.00 139 0 0 1 139 368 230 0 276
42 New.CleanUp.ReferenceOTU7553 335887 99.40 168 0 1 1 168 337 171 0 317
43 New.CleanUp.ReferenceOTU7914 853867 100.00 156 0 0 1 156 317 162 0 309
44 New.CleanUp.ReferenceOTU7942 4442170 100.00 154 0 0 1 154 310 157 0 305
45 New.CleanUp.ReferenceOTU8857 669486 98.26 172 3 0 1 172 330 159 0 317
46 New.CleanUp.ReferenceOTU9075 960131 99.35 153 0 1 1 152 317 165 0 287
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ " Query id Subject id % identity alignment length \\\n", "0 New.CleanUp.ReferenceOTU10449 853867 100.00 149 \n", "1 New.CleanUp.ReferenceOTU10796 960131 100.00 131 \n", "2 New.CleanUp.ReferenceOTU11004 960131 99.32 148 \n", "3 New.CleanUp.ReferenceOTU11548 509773 100.00 132 \n", "4 New.CleanUp.ReferenceOTU12392 853867 100.00 156 \n", "5 New.CleanUp.ReferenceOTU12511 853867 98.72 156 \n", "6 New.CleanUp.ReferenceOTU12532 1000986 98.84 173 \n", "7 New.CleanUp.ReferenceOTU13882 129798 97.04 169 \n", "8 New.CleanUp.ReferenceOTU14202 960131 98.18 165 \n", "9 New.CleanUp.ReferenceOTU15614 4366889 96.49 171 \n", "10 New.CleanUp.ReferenceOTU16729 4467774 99.45 181 \n", "11 New.CleanUp.ReferenceOTU17453 870751 100.00 137 \n", "12 New.CleanUp.ReferenceOTU18041 4422519 99.04 209 \n", "13 New.CleanUp.ReferenceOTU18339 4467774 99.43 176 \n", "14 New.CleanUp.ReferenceOTU18724 965048 98.89 180 \n", "15 New.CleanUp.ReferenceOTU1891 4308647 94.62 186 \n", "16 New.CleanUp.ReferenceOTU2098 979261 100.00 157 \n", "17 New.CleanUp.ReferenceOTU21175 4347865 100.00 141 \n", "18 New.CleanUp.ReferenceOTU21766 4467774 100.00 178 \n", "19 New.CleanUp.ReferenceOTU21849 853867 99.36 156 \n", "20 New.CleanUp.ReferenceOTU22455 4417848 97.92 192 \n", "21 New.CleanUp.ReferenceOTU22842 379925 98.63 146 \n", "22 New.CleanUp.ReferenceOTU22953 4368216 100.00 180 \n", "23 New.CleanUp.ReferenceOTU2306 1064036 97.59 166 \n", "24 New.CleanUp.ReferenceOTU23446 979261 100.00 144 \n", "25 New.CleanUp.ReferenceOTU24551 867184 93.95 215 \n", "26 New.CleanUp.ReferenceOTU24622 1061772 100.00 149 \n", "27 New.CleanUp.ReferenceOTU24747 960131 99.32 148 \n", "28 New.CleanUp.ReferenceOTU24753 129798 96.36 165 \n", "29 New.CleanUp.ReferenceOTU303 1136387 92.97 185 \n", "30 New.CleanUp.ReferenceOTU3067 495451 96.94 196 \n", "31 New.CleanUp.ReferenceOTU3375 4477696 100.00 148 \n", "32 New.CleanUp.ReferenceOTU3384 945061 94.59 222 \n", "33 New.CleanUp.ReferenceOTU3739 4467774 100.00 188 \n", "34 New.CleanUp.ReferenceOTU4059 635216 98.58 141 \n", "35 New.CleanUp.ReferenceOTU4528 979261 98.61 144 \n", "36 New.CleanUp.ReferenceOTU5484 934488 100.00 144 \n", "37 New.CleanUp.ReferenceOTU6173 1110261 96.22 185 \n", "38 New.CleanUp.ReferenceOTU6680 979261 99.31 144 \n", "39 New.CleanUp.ReferenceOTU720 932077 100.00 144 \n", "40 New.CleanUp.ReferenceOTU7237 979261 100.00 144 \n", "41 New.CleanUp.ReferenceOTU7455 129798 100.00 139 \n", "42 New.CleanUp.ReferenceOTU7553 335887 99.40 168 \n", "43 New.CleanUp.ReferenceOTU7914 853867 100.00 156 \n", "44 New.CleanUp.ReferenceOTU7942 4442170 100.00 154 \n", "45 New.CleanUp.ReferenceOTU8857 669486 98.26 172 \n", "46 New.CleanUp.ReferenceOTU9075 960131 99.35 153 \n", "\n", " mismatches gap openings q. start q. end s. start s. end e-value \\\n", "0 0 0 1 149 317 169 0 \n", "1 0 0 1 131 317 187 0 \n", "2 1 0 1 148 317 170 0 \n", "3 0 0 1 132 336 205 0 \n", "4 0 0 1 156 317 162 0 \n", "5 1 1 1 155 317 162 0 \n", "6 2 0 1 173 304 132 0 \n", "7 1 3 1 169 368 204 0 \n", "8 1 1 1 165 317 155 0 \n", "9 1 3 1 171 303 138 0 \n", "10 1 0 1 181 308 128 0 \n", "11 0 0 1 137 317 181 0 \n", "12 0 2 1 209 349 143 0 \n", "13 0 1 1 176 308 134 0 \n", "14 0 2 1 178 313 134 0 \n", "15 5 3 1 186 346 166 0 \n", "16 0 0 1 157 317 161 0 \n", "17 0 0 1 141 334 194 0 \n", "18 0 0 1 178 308 131 0 \n", "19 0 1 1 155 317 162 0 \n", "20 1 2 2 193 306 118 0 \n", "21 0 2 1 146 298 155 0 \n", "22 0 0 1 180 315 136 0 \n", "23 2 2 1 166 326 163 0 \n", "24 0 0 1 144 317 174 0 \n", "25 4 7 1 215 329 124 0 \n", "26 0 0 1 149 317 169 0 \n", "27 1 0 1 148 317 170 0 \n", "28 0 5 1 165 368 210 0 \n", "29 13 0 1 185 310 126 0 \n", "30 3 3 1 196 311 119 0 \n", "31 0 0 1 148 329 182 0 \n", "32 12 0 1 222 315 94 0 \n", "33 0 0 1 188 308 121 0 \n", "34 0 2 1 139 300 160 0 \n", "35 2 0 1 144 317 174 0 \n", "36 0 0 1 144 317 174 0 \n", "37 5 2 1 184 318 135 0 \n", "38 1 0 1 144 317 174 0 \n", "39 0 0 1 144 313 170 0 \n", "40 0 0 1 144 317 174 0 \n", "41 0 0 1 139 368 230 0 \n", "42 0 1 1 168 337 171 0 \n", "43 0 0 1 156 317 162 0 \n", "44 0 0 1 154 310 157 0 \n", "45 3 0 1 172 330 159 0 \n", "46 0 1 1 152 317 165 0 \n", "\n", " bit scor \n", "0 295 \n", "1 260 \n", "2 285 \n", "3 262 \n", "4 309 \n", "5 285 \n", "6 327 \n", "7 274 \n", "8 297 \n", "9 272 \n", "10 351 \n", "11 272 \n", "12 383 \n", "13 333 \n", "14 325 \n", "15 270 \n", "16 311 \n", "17 280 \n", "18 353 \n", "19 293 \n", "20 335 \n", "21 258 \n", "22 357 \n", "23 281 \n", "24 285 \n", "25 272 \n", "26 295 \n", "27 285 \n", "28 242 \n", "29 264 \n", "30 317 \n", "31 293 \n", "32 345 \n", "33 373 \n", "34 248 \n", "35 270 \n", "36 285 \n", "37 295 \n", "38 278 \n", "39 285 \n", "40 285 \n", "41 276 \n", "42 317 \n", "43 309 \n", "44 305 \n", "45 317 \n", "46 287 " ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "blast_results_qdr.mean()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "Subject id 1729626.042553\n", "% identity 98.667660\n", "alignment length 163.468085\n", "mismatches 1.276596\n", "gap openings 0.936170\n", "q. start 1.021277\n", "q. end 163.319149\n", "s. start 320.723404\n", "s. end 159.234043\n", "e-value 0.000000\n", "bit scor 297.787234\n", "dtype: float64" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see that the mean alignment length is $163$, but the mean/standard deviation of the sequences provided as input above was $256 +/- 16$ ($163/256 \\approx 63\\%$ of sequence length is aligned). It therefore looks like we're getting partial alignments, which may be indicative of chimeras or sequence quality drop-off. This suggests that these sequences *may* be aligning to sequences that were in the core set, but which were dropped from in Greengenes 13_8 (or some earlier version of the database)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "blast_out_core = join(pyn_out_core, \"failures-that-aligned-v-pyn-85-otus-blast-results\")\n", "\n", "!parallel_blast.py -i $pyn_out_core/failures-that-aligned-v-pyn-85-otus.fasta -b $db_name -o $blast_out_core -O 8" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "blast_headers = !head $blast_out_core/failures-that-aligned-v-pyn-85-otus_blast_out.txt\n", "blast_headers = blast_headers[3].strip('# Fields: ').split(', ')\n", "blast_out = !grep -v '^#' $blast_out_core/failures-that-aligned-v-pyn-85-otus_blast_out.txt\n", "blast_out = [l.split('\\t') for l in blast_out]\n", "blast_results_core = pd.DataFrame(blast_out, columns=blast_headers).convert_objects(convert_numeric=True)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "blast_results_core" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Query idSubject id% identityalignment lengthmismatchesgap openingsq. startq. ends. starts. ende-valuebit scor
0 New.CleanUp.ReferenceOTU11778 2426773 99.17 120 1 0 1 120 486 367 0 230
1 New.CleanUp.ReferenceOTU14235 1100972 98.84 172 2 0 1 172 316 145 0 325
2 New.CleanUp.ReferenceOTU14439 3339747 99.55 223 0 1 1 223 305 84 0 426
3 New.CleanUp.ReferenceOTU15749 937813 95.58 181 7 1 1 181 323 144 0 287
4 New.CleanUp.ReferenceOTU16523 4484060 96.49 228 2 4 1 228 351 130 0 361
5 New.CleanUp.ReferenceOTU20112 930873 99.37 159 0 1 1 159 323 166 0 299
6 New.CleanUp.ReferenceOTU20493 930873 98.78 164 2 0 1 164 323 160 0 309
7 New.CleanUp.ReferenceOTU2210 4466460 97.56 164 4 0 1 164 307 144 0 293
8 New.CleanUp.ReferenceOTU2265 827969 98.36 183 3 0 1 183 319 137 0 339
9 New.CleanUp.ReferenceOTU23577 1035532 100.00 179 0 0 1 179 311 133 0 355
10 New.CleanUp.ReferenceOTU24778 495095 98.35 182 3 0 1 182 323 142 0 337
11 New.CleanUp.ReferenceOTU2799 4302012 91.42 233 16 3 1 230 304 73 0 281
12 New.CleanUp.ReferenceOTU5311 1035532 100.00 179 0 0 1 179 311 133 0 355
13 New.CleanUp.ReferenceOTU8921 1035532 99.44 179 1 0 1 179 311 133 0 347
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ " Query id Subject id % identity alignment length \\\n", "0 New.CleanUp.ReferenceOTU11778 2426773 99.17 120 \n", "1 New.CleanUp.ReferenceOTU14235 1100972 98.84 172 \n", "2 New.CleanUp.ReferenceOTU14439 3339747 99.55 223 \n", "3 New.CleanUp.ReferenceOTU15749 937813 95.58 181 \n", "4 New.CleanUp.ReferenceOTU16523 4484060 96.49 228 \n", "5 New.CleanUp.ReferenceOTU20112 930873 99.37 159 \n", "6 New.CleanUp.ReferenceOTU20493 930873 98.78 164 \n", "7 New.CleanUp.ReferenceOTU2210 4466460 97.56 164 \n", "8 New.CleanUp.ReferenceOTU2265 827969 98.36 183 \n", "9 New.CleanUp.ReferenceOTU23577 1035532 100.00 179 \n", "10 New.CleanUp.ReferenceOTU24778 495095 98.35 182 \n", "11 New.CleanUp.ReferenceOTU2799 4302012 91.42 233 \n", "12 New.CleanUp.ReferenceOTU5311 1035532 100.00 179 \n", "13 New.CleanUp.ReferenceOTU8921 1035532 99.44 179 \n", "\n", " mismatches gap openings q. start q. end s. start s. end e-value \\\n", "0 1 0 1 120 486 367 0 \n", "1 2 0 1 172 316 145 0 \n", "2 0 1 1 223 305 84 0 \n", "3 7 1 1 181 323 144 0 \n", "4 2 4 1 228 351 130 0 \n", "5 0 1 1 159 323 166 0 \n", "6 2 0 1 164 323 160 0 \n", "7 4 0 1 164 307 144 0 \n", "8 3 0 1 183 319 137 0 \n", "9 0 0 1 179 311 133 0 \n", "10 3 0 1 182 323 142 0 \n", "11 16 3 1 230 304 73 0 \n", "12 0 0 1 179 311 133 0 \n", "13 1 0 1 179 311 133 0 \n", "\n", " bit scor \n", "0 230 \n", "1 325 \n", "2 426 \n", "3 287 \n", "4 361 \n", "5 299 \n", "6 309 \n", "7 293 \n", "8 339 \n", "9 355 \n", "10 337 \n", "11 281 \n", "12 355 \n", "13 347 " ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "blast_results_core.mean()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "Subject id 1953517.357143\n", "% identity 98.065000\n", "alignment length 181.857143\n", "mismatches 2.928571\n", "gap openings 0.714286\n", "q. start 1.000000\n", "q. end 181.642857\n", "s. start 329.500000\n", "s. end 149.357143\n", "e-value 0.000000\n", "bit scor 324.571429\n", "dtype: float64" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here it also looks like these are partial alignments, but average a little longer ($181/236 \\approx 77\\% $ of sequence length aligned). These are within the threshold for PyNAST to align them (default is 75% of the median input sequence length), which is why we see them align to the 85% OTU alignment. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# V4 sequence collection\n", "\n", "My next test was performed with the V4 sequences deposited under study id ``550`` in the QIIME database (the [Moving Pictures](http://genomebiology.com/2011/12/5/r50)) dataset. These tests are less extensive than the ones above, just because I've run out of time for now." ] }, { "cell_type": "code", "collapsed": false, "input": [ "pyn_out_qdr = join(rep_set_550_out_dir, \"qdr012\")\n", "!parallel_align_seqs_pynast.py -i $rep_set_550_fp -o $pyn_out_qdr --jobs_to_start 8\n", "fasta_fps = join(pyn_out_qdr, \"*fasta\")\n", "!count_seqs.py -i \"$fasta_fps\"" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "1475 : /home/ubuntu/qdr-issue14/output/550_out/qdr012/rep_set.550.1_failures.fasta (Sequence lengths (mean +/- std): 92.4312 +/- 18.7288)\r\n", "13594 : /home/ubuntu/qdr-issue14/output/550_out/qdr012/rep_set.550.1_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)\r\n", "15069 : Total\r\n" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is 9.8% failing to align against the 85% PyNAST-aligned OTUs. \n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pyn_out_core = join(rep_set_550_out_dir, \"core-set\")\n", "!parallel_align_seqs_pynast.py -i $rep_set_550_fp -o $pyn_out_core --jobs_to_start 8 -t $core_set_template_fp\n", "fasta_fps = join(pyn_out_core, \"*fasta\")\n", "!count_seqs.py -i \"$fasta_fps\"" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "1596 : /home/ubuntu/qdr-issue14/output/550_out/core-set/rep_set.550.1_failures.fasta (Sequence lengths (mean +/- std): 95.2481 +/- 20.8748)\r\n", "13473 : /home/ubuntu/qdr-issue14/output/550_out/core-set/rep_set.550.1_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)\r\n", "15069 : Total\r\n" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is 10.6% failing to align against the core set. \n", "\n", "So, unlike what we saw with V2 above, we're now having fewer sequences failing to align to the Greengenes 85% OTUs than failed to align to the core set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The analyses presented here were originally described in [biocore/qiime-default-reference#14](https://github.com/biocore/qiime-default-reference/issues/14). These were ported to a separate repository for reproducibility." ] } ], "metadata": {} } ] }