{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "###BLASTN C.gigas against NCBI nt DB " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Code explanation (by line)\n", "1. Calls blastn program (!blastn) and specifies to use blastn (-task blastn).\n", "2. Specifies which query file to blast.\n", "3. Specifies which database file to blast against (located on Hummingbird /Volumes/Data/blast_dbs).\n", "4. Specifies output format. In this case, output format 6 with subject scientific names.\n", "5. Specifies maximum number of DB matches to save.\n", "6. Specifies number of CPUs to use.\n", "7. Output file name.\n", "8. Directs stderr output to file instead of printing to screen." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "^C\r\n" ] } ], "source": [ "!blastn -task blastn \\\n", "-query Owl/halfshell/EmmaBS400.fa \\\n", "-db nt \\\n", "-outfmt \"6 sscinames\" \\\n", "-max_target_seqs 1 \\\n", "-num_threads 16 \\\n", "-out 20150501_nt_blastn.tab \\\n", "2> 20150501_nt_blastn.err" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interrupted kernel because I glanced at the output file and saw this (after running for ~6hrs!):" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Methanohalophilus mahii DSM 5219\r\n", "uncultured organism\r\n", "uncultured organism\r\n", "Staphylococcus saprophyticus subsp. saprophyticus ATCC 15305\r\n", "Mus musculus\r\n", "Mus musculus\r\n", "Mus musculus\r\n", "Mus musculus\r\n", "Mus musculus\r\n", "Mus musculus\r\n", "Brassica rapa subsp. pekinensis\r\n", "Vanderwaltozyma polyspora DSM 70294\r\n", "Leishmania braziliensis MHOM/BR/75/M2904\r\n", "Dictyostelium discoideum AX4\r\n", "Dictyostelium discoideum AX4\r\n", "Homo sapiens\r\n", "Maribacter sp. HTCC2170\r\n", "Dictyostelium discoideum AX4\r\n", "Mus musculus\r\n", "Dictyostelium discoideum AX4\r\n", "Dictyostelium fasciculatum\r\n", "Chrysemys picta bellii\r\n", "Rattus norvegicus\r\n", "Rattus norvegicus\r\n", "Rattus norvegicus\r\n", "Rattus norvegicus\r\n", "Rattus norvegicus\r\n", "Beta vulgaris\r\n", "Beta vulgaris\r\n", "Beta vulgaris\r\n" ] } ], "source": [ "!head -30 20150501_nt_blastn.tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The issue is that the output only contains the species info and not the rest of the formatting (e-vals, bit scores, etc) that are part of BLASTn format 6.\n", "\n", "It turns out that specifying format 6 just specifies tab-delimited output. If an \"additional\" feature is specified from the default output of format 6, then the defaults no longer apply; you have to specify them all." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Re-run BLAST with correct output formatting" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!blastn -task blastn \\\n", "-query Owl/halfshell/EmmaBS400.fa \\\n", "-db nt \\\n", "-outfmt \"6 qseqid sseqid pident length evalue stitle sscinames\" \\\n", "-max_target_seqs 1 \\\n", "-num_threads 16 \\\n", "-out 20150501_nt_blastn.tab \\\n", "2> 20150501_nt_blastn.err" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####Verify file output" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_1\tgi|292665689|gb|CP001994.1|\t86.00\t50\t1e-04\tMethanohalophilus mahii DSM 5219, complete genome\tMethanohalophilus mahii DSM 5219\r\n", "20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_2\tgi|534503879|gb|KF524813.1|\t99.70\t4658\t0.0\tUncultured organism clone 89_8 Microvirus J protein, Capsid proteins, Major spike protein (G protein), Microvirus H protein, Bacteriophage replication gene A protein, and Phage protein C genes, complete cds; and Bacteriophage scaffolding protein D gene, partial cds\tuncultured organism\r\n", "20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_2\tgi|534503879|gb|KF524813.1|\t100.00\t802\t0.0\tUncultured organism clone 89_8 Microvirus J protein, Capsid proteins, Major spike protein (G protein), Microvirus H protein, Bacteriophage replication gene A protein, and Phage protein C genes, complete cds; and Bacteriophage scaffolding protein D gene, partial cds\tuncultured organism\r\n", "20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_3\tgi|72493824|dbj|AP008934.1|\t81.13\t53\t0.009\tStaphylococcus saprophyticus subsp. saprophyticus ATCC 15305 DNA, complete genome\tStaphylococcus saprophyticus subsp. saprophyticus ATCC 15305\r\n", "20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4\tgi|28475607|gb|AC117201.2|\t70.56\t180\t2e-04\tMus musculus BAC clone RP23-216E6 from 16, complete sequence\tMus musculus\r\n", "20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4\tgi|28475607|gb|AC117201.2|\t70.81\t161\t0.033\tMus musculus BAC clone RP23-216E6 from 16, complete sequence\tMus musculus\r\n", "20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4\tgi|28475607|gb|AC117201.2|\t70.25\t158\t0.11\tMus musculus BAC clone RP23-216E6 from 16, complete sequence\tMus musculus\r\n", "20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4\tgi|28475607|gb|AC117201.2|\t69.44\t180\t0.11\tMus musculus BAC clone RP23-216E6 from 16, complete sequence\tMus musculus\r\n", "20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4\tgi|28475607|gb|AC117201.2|\t68.78\t189\t0.40\tMus musculus BAC clone RP23-216E6 from 16, complete sequence\tMus musculus\r\n", "20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4\tgi|28475607|gb|AC117201.2|\t69.70\t165\t1.4\tMus musculus BAC clone RP23-216E6 from 16, complete sequence\tMus musculus\r\n" ] } ], "source": [ "!head -10 20150501_nt_blastn.tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####Count number of matched sequences" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 67326 20150501_nt_blastn.tab\r\n" ] } ], "source": [ "!wc -l 20150501_nt_blastn.tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####Check best matches based on e-value\n", "The code below cuts columns (fields; -f) 5 and 7 (e-value and species) of the input file, sorts and displays first 100 entries (head -100)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0\tAlteromonas macleodii str. 'Ionian Sea U4'\r\n", "0.0\tAlteromonas macleodii str. 'Ionian Sea U4'\r\n", "0.0\tCanis lupus familiaris\r\n", "0.0\tComamonas sp. 7D-2\r\n", "0.0\tComamonas sp. 7D-2\r\n", "0.0\tComamonas sp. 7D-2\r\n", "0.0\tComamonas sp. 7D-2\r\n", "0.0\tCrassostrea angulata\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCrassostrea gigas\r\n", "0.0\tCycloclasticus zancles 7-ME\r\n", "0.0\tCycloclasticus zancles 7-ME\r\n", "0.0\tCycloclasticus zancles 7-ME\r\n", "0.0\tDinoroseobacter shibae DFL 12\r\n", "0.0\tDinoroseobacter shibae DFL 12\r\n", "0.0\tDinoroseobacter shibae DFL 12\r\n", "0.0\tDinoroseobacter shibae DFL 12\r\n", "0.0\tDinoroseobacter shibae DFL 12\r\n", "0.0\tDinoroseobacter shibae DFL 12\r\n", "0.0\tDinoroseobacter shibae DFL 12\r\n", "0.0\tDinoroseobacter shibae DFL 12\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "0.0\tHomo sapiens\r\n", "sort: write failed: standard output: Broken pipe\r\n", "sort: write error\r\n" ] } ], "source": [ "!cut -f5,7 20150501_nt_blastn.tab | sort | head -100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Extract all ouput with e-values less than or equal to 0.001\n", "The code uses awk to obtain all lines in the file that have e-values less than or equal to 0.001 (```$5<=0.001```). The e-values are found in column (i.e. field) 5 of the input file ($5)." ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "awk ' $5<=0.001' 20150501_nt_blastn.tab > 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Count number of lines produced from previous command" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 40700 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt\r\n" ] } ], "source": [ "!wc -l 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Store number of lines as Python variable\n", "The code stores the number of lines in the Python variable \"total\" from the output of the bash command (designated by the \"!\")." ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": true }, "outputs": [], "source": [ "total = !wc -l < 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Verify variable has correct number\n", "Notice that the number of lines is stored as a string list. This is denoted by the brackets and single quotes in the output." ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[' 40700']\n" ] } ], "source": [ "print total" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Convert string list variable to integer\n", "This enables the number to be more easily used for subsequent calculations." ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [], "source": [ "total = int(total[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Verify variable is now and integer and no longer a string list\n", "This is confirmed by the lack of brackets around the output." ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "40700\n" ] } ], "source": [ "print total" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Count the number of each species in the BLAST matches\n", "The code uses awk again to extract all lines with e-values <=0.001. Those lines are then cut on column (i.e. field; -f) 7, sorted (which is necessary for the subsequent \"uniq\" command), and the uniq entries are counted (```uniq -c```)." ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "awk ' $5<=0.001' 20150501_nt_blastn.tab | cut -f7 | sort | uniq -c > 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Check output file" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2 'Nostoc azollae' 0708\r\n", " 1 Acanthamoeba castellanii\r\n", " 13 Acanthamoeba castellanii mamavirus\r\n", " 89 Acanthamoeba polyphaga mimivirus\r\n", " 1 Acanthocheilonema viteae\r\n", " 2 Acanthocystis turfacea Chlorella virus TN603.4.2\r\n", " 1 Acanthopagrus schlegelii\r\n", " 4 Acaryochloris marina MBIC11017\r\n", " 1 Acaryochloris sp. HICR111A\r\n", " 2 Acetobacter pasteurianus 386B\r\n" ] } ], "source": [ "!head 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Remove leading whitespace from file\n", "Although difficult to notice, the previous command where the number of unique entities are counted results in an output with a bunch of spaces at the beginning of each line that I don't want.\n", "\n", "The command below utilizes sed. Sed creates a backup of the input file (```-i.bu```; \".bu\" is a custom suffix for the backup file - you can enter anything you'd like) and then removes (substitutes, thus the ```s/```) all spaces (``` */```) found only at the beginning (```^```) of each line." ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!sed -i.bu 's/^ *//g' 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Verify removal of leading whitespaces\n", "Note how the output is shifted to the left compared to how it was above." ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 'Nostoc azollae' 0708\r\n", "1 Acanthamoeba castellanii\r\n", "13 Acanthamoeba castellanii mamavirus\r\n", "89 Acanthamoeba polyphaga mimivirus\r\n", "1 Acanthocheilonema viteae\r\n", "2 Acanthocystis turfacea Chlorella virus TN603.4.2\r\n", "1 Acanthopagrus schlegelii\r\n", "4 Acaryochloris marina MBIC11017\r\n", "1 Acaryochloris sp. HICR111A\r\n", "2 Acetobacter pasteurianus 386B\r\n", "3 Acetobacterium woodii DSM 1030\r\n", "8 Acetohalobium arabaticum DSM 5501\r\n", "2 Achromobacter denitrificans\r\n", "1 Acidianus hospitalis W1\r\n", "1 Acidiphilium cryptum JF-5\r\n", "1 Acidovorax sp. KKS102\r\n", "1 Aciduliprofundum boonei T469\r\n", "2 Acinetobacter baumannii\r\n", "2 Acinetobacter baumannii AYE\r\n", "1 Acinetobacter baumannii BJAB07104\r\n", "4 Acinetobacter baumannii BJAB0715\r\n", "2 Acinetobacter baumannii BJAB0868\r\n", "1 Acinetobacter baumannii SDF\r\n", "1 Acinetobacter calcoaceticus\r\n", "11 Acinetobacter calcoaceticus PHEA-2\r\n", "4 Acinetobacter oleivorans DR1\r\n", "8 Acinetobacter sp. ADP1\r\n", "2 Acinetobacter sp. C42\r\n", "1 Acinetobacter sp. M-1\r\n", "1 Acontias meleagris\r\n" ] } ], "source": [ "!head -30 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Count number of unique species" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1608 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt\r\n" ] } ], "source": [ "!wc -l 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Copy file to Eagle" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!cp 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt /Volumes/Eagle/Arabidopsis/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###View top 15 species of BLAST matches\n", "The code sorts the file in reverse (```-r```) numerical (```-n```) order on the first column (```-k1```) and displays the first 15 lines (```head -15```)." ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "9737 Dictyostelium discoideum AX4\r\n", "8248 Danio rerio\r\n", "6047 Homo sapiens\r\n", "2739 Mus musculus\r\n", "956 Rattus norvegicus\r\n", "411 Solanum lycopersicum\r\n", "395 Volvox carteri f. nagariensis\r\n", "377 Dictyostelium fasciculatum\r\n", "369 Hucho taimen\r\n", "368 Dictyostelium purpureum\r\n", "365 Schistosoma mansoni\r\n", "360 Botryotinia fuckeliana\r\n", "312 Lodderomyces elongisporus NRRL YB-4239\r\n", "265 Crassostrea gigas\r\n", "248 Octadecabacter antarcticus 307\r\n" ] } ], "source": [ "!sort -rn -k1 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt | head -15" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Store Crassostrea gigas counts in variable\n", "The code uses bash grep to find any lines with \"Crassostrea gigas\" on them and then uses awk to print the first column (i.e. field; $1) which contains the count." ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gigas_count = !grep 'Crassostrea gigas' 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt | awk '{ print $1 }'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Confirm count is stored in variable" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['265']\n" ] } ], "source": [ "print gigas_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Convert Python variable value from string list to floating number\n", "Conversion to float is necessary based on the desired output we want later. The output I want later will be a value less than 1. Integers can only display whole numbers." ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gigas_count = float(gigas_count[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Verify conversion from string list to float" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "265.0\n" ] } ], "source": [ "print gigas_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Calculate percentage of Crassostrea gigas BLAST hits" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.651105651106\n" ] } ], "source": [ "print (gigas_count/total) * 100" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }