{ "metadata": { "name": "", "signature": "sha256:19ae06d0c1f8a796f398370b1b272489d164402642b988108d99e7db316128dc" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Bioinformatics for Environmental Sciences " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Shortcuts: (Links to primary content on this page) \n", "[Blast Databases](http://nbviewer.ipython.org/github/sr320/ipython_nb/blob/master/fish546/w1_Intro.ipynb#Databases) \n", "[Fasta files of interest](http://nbviewer.ipython.org/github/sr320/ipython_nb/blob/master/fish546/w1_Intro.ipynb#List-of-Fastas-for-BLAST) \n", "[Steven does Module 1](#Steven-does-Module-1)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Syllabus\n", "-- Student Inquiry Based (problem solving) \n", "-- Class Implemented \n", "-- Discovery Driven \n", "\n", "_4 tenets_ \n", "- Terminal Projects \n", "- Instructional Unit \n", "- Modules (Blast, GO ontology, RNA-seq, Enrichment, Metagenomics, Proteomics, Array) \n", "- Engagement (Sharing, Answering Questions, Asking Questions, Lead Discussion, Organize, Host Hangout etc)\n", "\n", "**Everyone needs a Public Lab Notebook**\n", "\n", "##Central Theme: Open and Reproducible Science##\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####Competencies you should become familiar with\n", "- command-line (basic navigation and file manipulation)\n", "- markdown\n", "- IPython\n", "- R\n", "- SQLShare (including python client)\n", "- iPlant\n", "- Galaxy\n", "- GitHub\n", "- Hyak" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Mainly just file manipulation" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!pwd" ], "language": "python", "metadata": {}, "outputs": [ { 
"output_type": "stream", "stream": "stdout", "text": [ "/Applications/BLAST/ncbi-blast-2.2.26+\r\n" ] } ], "prompt_number": 37 }, { "cell_type": "code", "collapsed": false, "input": [ "!head w1_Intro.ipynb" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "{\r\n", " \"metadata\": {\r\n", " \"name\": \"\"\r\n", " },\r\n", " \"nbformat\": 3,\r\n", " \"nbformat_minor\": 0,\r\n", " \"worksheets\": [\r\n", " {\r\n", " \"cells\": [\r\n", " {\r\n" ] } ], "prompt_number": 2 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Tentative Schedule" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Blast\n", "2. RNA-seq\n", "3. Assembly\n", "4. " ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Module 1 Blast" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assignment: Blast a large fasta file and create a tab-delimited output file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Screenshot of Blast page at NCBI. 
\n", "\"blast_187C98FA_png\"/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"BLAST__Release_Notes_-_BLAST\u00ae_Help_-_NCBI_Bookshelf_187C99D4_png\"/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download Stand-alone BLAST" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"National_Center_for_Biotechnology_Information_187C9A28_png\"/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pwd" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ "u'/Users/sr320/Dropbox/Steven/ipython_nb/fish546'" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "cd /Applications/BLAST/ncbi-blast-2.2.26+/" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "/Applications/BLAST/ncbi-blast-2.2.26+\n" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "ls" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "ChangeLog README \u001b[34mdoc\u001b[m\u001b[m/\r\n", "LICENSE \u001b[34mbin\u001b[m\u001b[m/ ncbi_package_info\r\n" ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "cd bin" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "/Applications/BLAST/ncbi-blast-2.2.26+/bin\n" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "ls" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u001b[31mblast_formatter\u001b[m\u001b[m* \u001b[31mblastx\u001b[m\u001b[m* \u001b[31mmakembindex\u001b[m\u001b[m* \u001b[31mtblastn\u001b[m\u001b[m*\r\n", 
"\u001b[31mblastdb_aliastool\u001b[m\u001b[m* \u001b[31mconvert2blastmask\u001b[m\u001b[m* \u001b[31mmakeprofiledb\u001b[m\u001b[m* \u001b[31mtblastx\u001b[m\u001b[m*\r\n", "\u001b[31mblastdbcheck\u001b[m\u001b[m* \u001b[31mdeltablast\u001b[m\u001b[m* \u001b[31mpsiblast\u001b[m\u001b[m* \u001b[31mupdate_blastdb.pl\u001b[m\u001b[m*\r\n", "\u001b[31mblastdbcmd\u001b[m\u001b[m* \u001b[31mdustmasker\u001b[m\u001b[m* \u001b[31mrpsblast\u001b[m\u001b[m* \u001b[31mwindowmasker\u001b[m\u001b[m*\r\n", "\u001b[31mblastn\u001b[m\u001b[m* \u001b[31mlegacy_blast.pl\u001b[m\u001b[m* \u001b[31mrpstblastn\u001b[m\u001b[m*\r\n", "\u001b[31mblastp\u001b[m\u001b[m* \u001b[31mmakeblastdb\u001b[m\u001b[m* \u001b[31msegmasker\u001b[m\u001b[m*\r\n" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "!blastn -help" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "USAGE\r\n", " blastn [-h] [-help] [-import_search_strategy filename]\r\n", " [-export_search_strategy filename] [-task task_name] [-db database_name]\r\n", " [-dbsize num_letters] [-gilist filename] [-seqidlist filename]\r\n", " [-negative_gilist filename] [-entrez_query entrez_query]\r\n", " [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]\r\n", " [-subject subject_input_file] [-subject_loc range] [-query input_file]\r\n", " [-out output_file] [-evalue evalue] [-word_size int_value]\r\n", " [-gapopen open_penalty] [-gapextend extend_penalty]\r\n", " [-perc_identity float_value] [-xdrop_ungap float_value]\r\n", " [-xdrop_gap float_value] [-xdrop_gap_final float_value]\r\n", " [-searchsp int_value] [-max_hsps_per_subject int_value] [-penalty penalty]\r\n", " [-reward reward] [-no_greedy] [-min_raw_gapped_score int_value]\r\n", " [-template_type type] [-template_length int_value] [-dust DUST_options]\r\n", " [-filtering_db filtering_database]\r\n", " [-window_masker_taxid window_masker_taxid]\r\n", " 
[-window_masker_db window_masker_db] [-soft_masking soft_masking]\r\n", " [-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value]\r\n", " [-best_hit_score_edge float_value] [-window_size int_value]\r\n", " [-off_diagonal_range int_value] [-use_index boolean] [-index_name string]\r\n", " [-lcase_masking] [-query_loc range] [-strand strand] [-parse_deflines]\r\n", " [-outfmt format] [-show_gis] [-num_descriptions int_value]\r\n", " [-num_alignments int_value] [-html] [-max_target_seqs num_sequences]\r\n", " [-num_threads int_value] [-remote] [-version]\r\n", "\r\n", "DESCRIPTION\r\n", " Nucleotide-Nucleotide BLAST 2.2.26+\r\n", "\r\n", "OPTIONAL ARGUMENTS\r\n", " -h\r\n", " Print USAGE and DESCRIPTION; ignore other arguments\r\n", " -help\r\n", " Print USAGE, DESCRIPTION and ARGUMENTS description; ignore other arguments\r\n", " -version\r\n", " Print version number; ignore other arguments\r\n", "\r\n", " *** Input query options\r\n", " -query \r\n", " Input file name\r\n", " Default = `-'\r\n", " -query_loc \r\n", " Location on the query sequence in 1-based offsets (Format: start-stop)\r\n", " -strand \r\n", " Query strand(s) to search against database/subject\r\n", " Default = `both'\r\n", "\r\n", " *** General search options\r\n", " -task \r\n", " Task to execute\r\n", " Default = `megablast'\r\n", " -db \r\n", " BLAST database name\r\n", " * Incompatible with: subject, subject_loc\r\n", " -out \r\n", " Output file name\r\n", " Default = `-'\r\n", " -evalue \r\n", " Expectation value (E) threshold for saving hits \r\n", " Default = `10'\r\n", " -word_size =4>\r\n", " Word size for wordfinder algorithm (length of best perfect match)\r\n", " -gapopen \r\n", " Cost to open a gap\r\n", " -gapextend \r\n", " Cost to extend a gap\r\n", " -penalty \r\n", " Penalty for a nucleotide mismatch\r\n", " -reward =0>\r\n", " Reward for a nucleotide match\r\n", " -use_index \r\n", " Use MegaBLAST database index\r\n", " -index_name \r\n", " MegaBLAST database index 
name\r\n", "\r\n", " *** BLAST-2-Sequences options\r\n", " -subject \r\n", " Subject sequence(s) to search\r", "\r\n", " * Incompatible with: db, gilist, seqidlist, negative_gilist,\r\n", " db_soft_mask, db_hard_mask\r\n", " -subject_loc \r\n", " Location on the subject sequence in 1-based offsets (Format: start-stop)\r\n", " * Incompatible with: db, gilist, seqidlist, negative_gilist,\r\n", " db_soft_mask, db_hard_mask, remote\r\n", "\r\n", " *** Formatting options\r\n", " -outfmt \r\n", " alignment view options:\r\n", " 0 = pairwise,\r\n", " 1 = query-anchored showing identities,\r\n", " 2 = query-anchored no identities,\r\n", " 3 = flat query-anchored, show identities,\r\n", " 4 = flat query-anchored, no identities,\r\n", " 5 = XML Blast output,\r\n", " 6 = tabular,\r\n", " 7 = tabular with comment lines,\r\n", " 8 = Text ASN.1,\r\n", " 9 = Binary ASN.1,\r\n", " 10 = Comma-separated values,\r\n", " 11 = BLAST archive format (ASN.1) \r\n", " \r\n", " Options 6, 7, and 10 can be additionally configured to produce\r\n", " a custom format specified by space delimited format specifiers.\r\n", " The supported format specifiers are:\r\n", " \t qseqid means Query Seq-id\r\n", " \t qgi means Query GI\r\n", " \t qacc means Query accesion\r\n", " \t qaccver means Query accesion.version\r\n", " \t qlen means Query sequence length\r\n", " \t sseqid means Subject Seq-id\r\n", " \t sallseqid means All subject Seq-id(s), separated by a ';'\r\n", " \t sgi means Subject GI\r\n", " \t sallgi means All subject GIs\r\n", " \t sacc means Subject accession\r\n", " \t saccver means Subject accession.version\r\n", " \t sallacc means All subject accessions\r\n", " \t slen means Subject sequence length\r\n", " \t qstart means Start of alignment in query\r\n", " \t qend means End of alignment in query\r\n", " \t sstart means Start of alignment in subject\r\n", " \t send means End of alignment in subject\r\n", " \t qseq means Aligned part of query sequence\r\n", " \t sseq means Aligned part 
of subject sequence\r\n", " \t evalue means Expect value\r\n", " \t bitscore means Bit score\r\n", " \t score means Raw score\r\n", " \t length means Alignment length\r\n", " \t pident means Percentage of identical matches\r\n", " \t nident means Number of identical matches\r\n", " \t mismatch means Number of mismatches\r\n", " \t positive means Number of positive-scoring matches\r\n", " \t gapopen means Number of gap openings\r\n", " \t gaps means Total number of gaps\r\n", " \t ppos means Percentage of positive-scoring matches\r\n", " \t frames means Query and subject frames separated by a '/'\r\n", " \t qframe means Query frame\r\n", " \t sframe means Subject frame\r\n", " \t btop means Blast traceback operations (BTOP)\r\n", " When not provided, the default value is:\r\n", " 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send\r\n", " evalue bitscore', which is equivalent to the keyword 'std'\r\n", " Default = `0'\r\n", " -show_gis\r\n", " Show NCBI GIs in deflines?\r\n", " -num_descriptions =0>\r\n", " Number of database sequences to show one-line descriptions for\r\n", " Default = `500'\r\n", " * Incompatible with: max_target_seqs\r\n", " -num_alignments =0>\r\n", " Number of database sequences to show alignments for\r\n", " Default = `250'\r\n", " * Incompatible with: max_target_seqs\r\n", " -html\r\n", " Produce HTML output?\r\n", "\r\n", " *** Query filtering options\r\n", " -dust \r\n", " Filter query sequence with DUST (Format: 'yes', 'level window linker', or\r\n", " 'no' to disable)\r\n", " Default = `20 64 1'\r\n", " -filtering_db \r\n", " BLAST database containing filtering elements (i.e.: repeats)\r\n", " -window_masker_taxid \r\n", " Enable WindowMasker filtering using a Taxonomic ID\r\n", " -window_masker_db \r\n", " Enable WindowMasker filtering using this repeats database.\r\n", " -soft_masking \r\n", " Apply filtering locations as soft masks\r\n", " Default = `true'\r\n", " -lcase_masking\r\n", " Use lower case filtering in 
query and subject sequence(s)?\r\n", "\r\n", " *** Restrict search or results\r\n", " -gilist \r\n", " Restrict search of database to list of GI's\r\n", " * Incompatible with: negative_gilist, seqidlist, remote, subject,\r\n", " subject_loc\r\n", " -seqidlist \r\n", " Restrict search of database to list of SeqId's\r\n", " * Incompatible with: gilist, negative_gilist, remote, subject,\r\n", " subject_loc\r\n", " -negative_gilist \r\n", " Restrict search of database to everything except the listed GIs\r\n", " * Incompatible with: gilist, seqidlist, remote, subject, subject_loc\r\n", " -entrez_query \r\n", " Restrict search with the given Entrez query\r\n", " * Requires: remote\r\n", " -db_soft_mask \r\n", " Filtering algorithm ID to apply to the BLAST database as soft masking\r\n", " * Incompatible with: db_hard_mask, subject, subject_loc\r\n", " -db_hard_mask \r\n", " Filtering algorithm ID to apply to the BLAST database as hard masking\r\n", " * Incompatible with: db_soft_mask, subject, subject_loc\r\n", " -perc_identity \r\n", " Percent identity\r\n", " -culling_limit =0>\r\n", " If the query range of a hit is enveloped by that of at least this many\r\n", " higher-scoring hits, delete the hit\r\n", " * Incompatible with: best_hit_overhang, best_hit_score_edge\r\n", " -best_hit_overhang =0 and =<0.5)>\r\n", " Best Hit algorithm overhang value (recommended value: 0.1)\r\n", " * Incompatible with: culling_limit\r\n", " -best_hit_score_edge =0 and =<0.5)>\r\n", " Best Hit algorithm score edge value (recommended value: 0.1)\r\n", " * Incompatible with: culling_limit\r\n", " -max_target_seqs =1>\r\n", " Maximum number of aligned sequences to keep\r\n", " * Incompatible with: num_descriptions, num_alignments\r\n", "\r\n", " *** Discontiguous MegaBLAST options\r\n", " -template_type \r\n", " Discontiguous MegaBLAST template type\r\n", " * Requires: template_length\r\n", " -template_length \r\n", " Discontiguous MegaBLAST template length\r\n", " * Requires: 
template_type\r\n", "\r\n", " *** Statistical options\r\n", " -dbsize \r\n", " Effective length of the database \r\n", " -searchsp =0>\r\n", " Effective length of the search space\r\n", " -max_hsps_per_subject =0>\r\n", " Override maximum number of HSPs per subject to save for ungapped searches\r\n", " (0 means do not override)\r\n", " Default = `0'\r\n", "\r\n", " *** Search strategy options\r\n", " -import_search_strategy \r\n", " Search strategy to use\r\n", " * Incompatible with: export_search_strategy\r\n", " -export_search_strategy \r\n", " File name to record the search strategy used\r\n", " * Incompatible with: import_search_strategy\r\n", "\r\n", " *** Extension options\r\n", " -xdrop_ungap \r\n", " X-dropoff value (in bits) for ungapped extensions\r\n", " -xdrop_gap \r\n", " X-dropoff value (in bits) for preliminary gapped extensions\r\n", " -xdrop_gap_final \r\n", " X-dropoff value (in bits) for final gapped alignment\r\n", " -no_greedy\r\n", " Use non-greedy dynamic programming extension\r\n", " -min_raw_gapped_score \r\n", " Minimum raw gapped score to keep an alignment in the preliminary gapped and\r\n", " traceback stages\r\n", " -ungapped\r\n", " Perform ungapped alignment only?\r\n", " -window_size =0>\r\n", " Multiple hits window size, use 0 to specify 1-hit algorithm\r\n", " -off_diagonal_range =0>\r\n", " Number of off-diagonals to search for the 2nd hit, use 0 to turn off\r\n", " Default = `0'\r\n", "\r\n", " *** Miscellaneous options\r\n", " -parse_deflines\r\n", " Should the query and subject defline(s) be parsed?\r\n", " -num_threads =1>\r\n", " Number of threads (CPUs) to use in the BLAST search\r\n", " Default = `1'\r\n", " * Incompatible with: remote\r\n", " -remote\r\n", " Execute search remotely?\r\n", " * Incompatible with: gilist, seqidlist, negative_gilist, subject_loc,\r\n", " num_threads\r\n", "\r\n" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Things to consider \n", "\n", "1. 
Type of Blast \n", "2. Databases " ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Databases" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can make your own database from any fasta format file.\n", "Some of the commonly used databases are found at NCBI and UniProt. \n", "A list of all NCBI databases for download is available at\n", ". Preformatted and fasta files are available.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fasta files can be downloaded from UniProt at . \n", "\n", "Do note that all of these fasta files / databases are routinely updated, so it is important to know where to get the most recent version." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I have downloaded a number of these locally; they can be viewed at \n", ". In order to use the files \"locally\" you will need to mount the computer `hummmingbird.fish.washington.edu`. A description of these databases is shown below." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.display import HTML\n", "HTML('')" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 1, "text": [ "" ] } ], "prompt_number": 1 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "List of Fastas for BLAST" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| ID | Description | url | comment | \n", "|-----|-------------|-----|---------|\n", "| Cgigas_v9tran | Pacific oyster transcriptome; [.gz] 28027 gene (CDS only) sequences (via gigadb.org) | | note |\n", "| Cgigas_v9prot | Pacific oyster proteome; [.gz] 28027 protein sequences (via gigadb.org) | | note |\n", "| V_tubiashii | Assembly of Vibrio tubiashii (RE22) partial genome | | note |\n", "| Olurida_v3tran | Olympia oyster transcriptome version 3 | | note |\n", "| mystery | mystery | | note |" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#strip everything after the first space on each line (shortens the fasta headers); %s/ .*// is vim syntax, so use sed in the shell\n", "!sed 's/ .*//' 
Olurida_transcriptome_v3.fasta > test" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 59 }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "A few random uses of BLAST." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- check short read sequences\n", "- compare species\n", "- look for contamination\n", "- " ] }, { "cell_type": "code", "collapsed": false, "input": [ "#can I tunnel into genefish to run blast" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 38 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Steven does Module 1" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Step 1. Installing Blast" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download appropriate software from " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I downloaded the file to this location on my computer `/Volumes/Bay3/Software/`. 
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "cd /Volumes/Bay3/Software" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "/Volumes/Bay3/Software\n" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "#command that lists files (-1 = one file per line), only those that start with \"ncbi\"\n", "!ls -1 ncbi*" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "ncbi-blast-2.2.27+-universal-macosx.tar.gz\r\n", "ncbi-blast-2.2.28+-universal-macosx.tar.gz\r\n", "ncbi-blast-2.2.29+-universal-macosx.tar.gz\r\n", "\r\n", "ncbi-blast-2.2.26+:\r\n", "ChangeLog\r\n", "LICENSE\r\n", "README\r\n", "\u001b[34mbin\u001b[m\u001b[m\r\n", "blastdb.ncbirc\r\n", "\u001b[34mdb\u001b[m\u001b[m\r\n", "\u001b[34mdoc\u001b[m\u001b[m\r\n", "ncbi_package_info\r\n", "\u001b[34mout\u001b[m\u001b[m\r\n", "\u001b[34mquery\u001b[m\u001b[m\r\n", "\r\n", "ncbi-blast-2.2.27+:\r\n", "\u001b[31mChangeLog\u001b[m\u001b[m\r\n", "\u001b[31mLICENSE\u001b[m\u001b[m\r\n", "\u001b[31mREADME\u001b[m\u001b[m\r\n", "\u001b[34mbin\u001b[m\u001b[m\r\n", "\u001b[34mdb\u001b[m\u001b[m\r\n", "\u001b[34mdoc\u001b[m\u001b[m\r\n", "\u001b[31mncbi_package_info\u001b[m\u001b[m\r\n", "\r\n", "ncbi-blast-2.2.28+:\r\n", "ChangeLog\r\n", "LICENSE\r\n", "README\r\n", "\u001b[34mbin\u001b[m\u001b[m\r\n", "\u001b[34mdb\u001b[m\u001b[m\r\n", "\u001b[34mdoc\u001b[m\u001b[m\r\n", "ncbi_package_info\r\n" ] } ], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "#unzipping [-]x --extract --get; -v, --verbose; -z, --gzip; -f, --file F\n", "!tar -xzvf ncbi-blast-2.2.29+-universal-macosx.tar.gz" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "x ncbi-blast-2.2.29+/\r\n", "x ncbi-blast-2.2.29+/bin/\r\n", "x ncbi-blast-2.2.29+/bin/makembindex" ] }, { "output_type": "stream", 
"stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/tblastn" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/psiblast" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/rpsblast" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/legacy_blast.pl\r\n", "x ncbi-blast-2.2.29+/bin/blastdbcmd" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/makeblastdb" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/tblastx" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/blastn" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/blastp" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/segmasker" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/dustmasker" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/blastx" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/blast_formatter" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/windowmasker" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/blastdb_aliastool" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/convert2blastmask" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/update_blastdb.pl\r\n", "x ncbi-blast-2.2.29+/bin/deltablast" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/blastdbcheck" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x 
ncbi-blast-2.2.29+/bin/rpstblastn" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/bin/makeprofiledb" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "x ncbi-blast-2.2.29+/doc/\r\n", "x ncbi-blast-2.2.29+/doc/README.txt\r\n", "x ncbi-blast-2.2.29+/README\r\n", "x ncbi-blast-2.2.29+/ncbi_package_info\r\n", "x ncbi-blast-2.2.29+/LICENSE\r\n", "x ncbi-blast-2.2.29+/ChangeLog\r\n" ] } ], "prompt_number": 28 }, { "cell_type": "code", "collapsed": false, "input": [ "#command that lists files (-1 = one file per line), only those that start with \"ncbi\"\n", "!ls -1 ncbi*" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "ncbi-blast-2.2.27+-universal-macosx.tar.gz\r\n", "ncbi-blast-2.2.28+-universal-macosx.tar.gz\r\n", "ncbi-blast-2.2.29+-universal-macosx.tar.gz\r\n", "\r\n", "ncbi-blast-2.2.26+:\r\n", "ChangeLog\r\n", "LICENSE\r\n", "README\r\n", "\u001b[34mbin\u001b[m\u001b[m\r\n", "blastdb.ncbirc\r\n", "\u001b[34mdb\u001b[m\u001b[m\r\n", "\u001b[34mdoc\u001b[m\u001b[m\r\n", "ncbi_package_info\r\n", "\u001b[34mout\u001b[m\u001b[m\r\n", "\u001b[34mquery\u001b[m\u001b[m\r\n", "\r\n", "ncbi-blast-2.2.27+:\r\n", "\u001b[31mChangeLog\u001b[m\u001b[m\r\n", "\u001b[31mLICENSE\u001b[m\u001b[m\r\n", "\u001b[31mREADME\u001b[m\u001b[m\r\n", "\u001b[34mbin\u001b[m\u001b[m\r\n", "\u001b[34mdb\u001b[m\u001b[m\r\n", "\u001b[34mdoc\u001b[m\u001b[m\r\n", "\u001b[31mncbi_package_info\u001b[m\u001b[m\r\n", "\r\n", "ncbi-blast-2.2.28+:\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "ChangeLog\r\n", "LICENSE\r\n", "README\r\n", "\u001b[34mbin\u001b[m\u001b[m\r\n", "\u001b[34mdb\u001b[m\u001b[m\r\n", "\u001b[34mdoc\u001b[m\u001b[m\r\n", "ncbi_package_info\r\n", "\r\n", "ncbi-blast-2.2.29+:\r\n", "ChangeLog\r\n", "LICENSE\r\n", "README\r\n", "\u001b[34mbin\u001b[m\u001b[m\r\n", "\u001b[34mdoc\u001b[m\u001b[m\r\n", "ncbi_package_info\r\n" ] } ], 
"prompt_number": 29 }, { "cell_type": "code", "collapsed": false, "input": [ "cd ncbi-blast-2.2.29+/" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "/Volumes/Bay3/Software/ncbi-blast-2.2.29+\n" ] } ], "prompt_number": 30 }, { "cell_type": "code", "collapsed": false, "input": [ "cd bin" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "/Volumes/Bay3/Software/ncbi-blast-2.2.29+/bin\n" ] } ], "prompt_number": 31 }, { "cell_type": "code", "collapsed": false, "input": [ "ls -1" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u001b[31mblast_formatter\u001b[m\u001b[m*\r\n", "\u001b[31mblastdb_aliastool\u001b[m\u001b[m*\r\n", "\u001b[31mblastdbcheck\u001b[m\u001b[m*\r\n", "\u001b[31mblastdbcmd\u001b[m\u001b[m*\r\n", "\u001b[31mblastn\u001b[m\u001b[m*\r\n", "\u001b[31mblastp\u001b[m\u001b[m*\r\n", "\u001b[31mblastx\u001b[m\u001b[m*\r\n", "\u001b[31mconvert2blastmask\u001b[m\u001b[m*\r\n", "\u001b[31mdeltablast\u001b[m\u001b[m*\r\n", "\u001b[31mdustmasker\u001b[m\u001b[m*\r\n", "\u001b[31mlegacy_blast.pl\u001b[m\u001b[m*\r\n", "\u001b[31mmakeblastdb\u001b[m\u001b[m*\r\n", "\u001b[31mmakembindex\u001b[m\u001b[m*\r\n", "\u001b[31mmakeprofiledb\u001b[m\u001b[m*\r\n", "\u001b[31mpsiblast\u001b[m\u001b[m*\r\n", "\u001b[31mrpsblast\u001b[m\u001b[m*\r\n", "\u001b[31mrpstblastn\u001b[m\u001b[m*\r\n", "\u001b[31msegmasker\u001b[m\u001b[m*\r\n", "\u001b[31mtblastn\u001b[m\u001b[m*\r\n", "\u001b[31mtblastx\u001b[m\u001b[m*\r\n", "\u001b[31mupdate_blastdb.pl\u001b[m\u001b[m*\r\n", "\u001b[31mwindowmasker\u001b[m\u001b[m*\r\n" ] } ], "prompt_number": 33 }, { "cell_type": "code", "collapsed": false, "input": [ "#check to see if \"works\"\n", "!blastx -h" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "USAGE\r\n", " blastx [-h] [-help] 
[-import_search_strategy filename]\r\n", " [-export_search_strategy filename] [-db database_name]\r\n", " [-dbsize num_letters] [-gilist filename] [-seqidlist filename]\r\n", " [-negative_gilist filename] [-entrez_query entrez_query]\r\n", " [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]\r\n", " [-subject subject_input_file] [-subject_loc range] [-query input_file]\r\n", " [-out output_file] [-evalue evalue] [-word_size int_value]\r\n", " [-gapopen open_penalty] [-gapextend extend_penalty]\r\n", " [-xdrop_ungap float_value] [-xdrop_gap float_value]\r\n", " [-xdrop_gap_final float_value] [-searchsp int_value]\r\n", " [-max_hsps_per_subject int_value] [-max_intron_length length]\r\n", " [-seg SEG_options] [-soft_masking soft_masking] [-matrix matrix_name]\r\n", " [-threshold float_value] [-culling_limit int_value]\r\n", " [-best_hit_overhang float_value] [-best_hit_score_edge float_value]\r\n", " [-window_size int_value] [-ungapped] [-lcase_masking] [-query_loc range]\r\n", " [-strand strand] [-parse_deflines] [-query_gencode int_value]\r\n", " [-outfmt format] [-show_gis] [-num_descriptions int_value]\r\n", " [-num_alignments int_value] [-html] [-max_target_seqs num_sequences]\r\n", " [-num_threads int_value] [-remote] [-comp_based_stats compo]\r\n", " [-use_sw_tback] [-version]\r\n", "\r\n", "DESCRIPTION\r\n", " Translated Query-Protein Subject BLAST 2.2.28+\r\n", "\r\n", "Use '-help' to print detailed descriptions of command line arguments\r\n" ] } ], "prompt_number": 35 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Step 2: Make a database to blast" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I would like to make a database of UniProt/Swiss-prot." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Screenshot:\n", "\n", "\"Download_187DDB0F.png\"" ] }, { "cell_type": "code", "collapsed": false, "input": [ "cd /Volumes/Bay3/Software/" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "/Volumes/Bay3/Software\n" ] } ], "prompt_number": 44 }, { "cell_type": "code", "collapsed": false, "input": [ "cd ncbi-blast-2.2.29+/" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[Errno 2] No such file or directory: 'ncbi-blast-2.2.29+/'\n", "/Volumes/Bay3/Software/ncbi-blast-2.2.29+/db\n" ] } ], "prompt_number": 53 }, { "cell_type": "code", "collapsed": false, "input": [ "cd db" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[Errno 2] No such file or directory: 'db'\n", "/Volumes/Bay3/Software/ncbi-blast-2.2.29+/db\n" ] } ], "prompt_number": 54 }, { "cell_type": "code", "collapsed": false, "input": [ "ls" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "uniprot_sprot.fasta.gz\r\n" ] } ], "prompt_number": 56 }, { "cell_type": "code", "collapsed": false, "input": [ "!gzip -d uniprot_sprot.fasta.gz" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 62 }, { "cell_type": "code", "collapsed": false, "input": [ "ls" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "uniprot_sprot.fasta\r\n" ] } ], "prompt_number": 63 }, { "cell_type": "code", "collapsed": false, "input": [ "pwd" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 65, "text": [ "u'/Volumes/Bay3/Software/ncbi-blast-2.2.29+/db'" ] } ], "prompt_number": 65 }, { "cell_type": "code", "collapsed": false, "input": [ "#note I am working in dir db, thus can just use file 
names. Most times you might use the complete path.\n", "!makeblastdb -in uniprot_sprot.fasta -dbtype prot -out uniprot_sprot_r2013_12 " ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", "\r\n", "Building a new DB, current time: 01/08/2014 11:34:36\r\n", "New DB name: uniprot_sprot_r2013_12\r\n", "New DB title: uniprot_sprot.fasta\r\n", "Sequence type: Protein\r\n", "Keep Linkouts: T\r\n", "Keep MBits: T\r\n", "Maximum file size: 1000000000B\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Adding sequences from FASTA; added 541954 sequences in 53.9535 seconds.\r\n" ] } ], "prompt_number": 66 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Step 3: Get a query sequence file" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#creating new directory; \n", "!pwd" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "/Volumes/Bay3/Software/ncbi-blast-2.2.29+/db\r\n" ] } ], "prompt_number": 79 }, { "cell_type": "code", "collapsed": false, "input": [ "cd .." 
], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "/Volumes/Bay3/Software/ncbi-blast-2.2.29+\n" ] } ], "prompt_number": 80 }, { "cell_type": "code", "collapsed": false, "input": [ "!mkdir query" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 83 }, { "cell_type": "code", "collapsed": false, "input": [ "ls" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "ChangeLog README \u001b[34mdb\u001b[m\u001b[m/ ncbi_package_info\r\n", "LICENSE \u001b[34mbin\u001b[m\u001b[m/ \u001b[34mdoc\u001b[m\u001b[m/ \u001b[34mquery\u001b[m\u001b[m/\r\n" ] } ], "prompt_number": 84 }, { "cell_type": "code", "collapsed": false, "input": [ "cd query/" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "/Volumes/Bay3/Software/ncbi-blast-2.2.29+/query\n" ] } ], "prompt_number": 85 }, { "cell_type": "code", "collapsed": false, "input": [ "#getting file from url to local location\n", "!wget http://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "--2014-01-08 11:40:14-- http://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa\r\n", "Resolving eagle.fish.washington.edu... " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "128.95.149.81\r\n", "Connecting to eagle.fish.washington.edu|128.95.149.81|:80... connected.\r\n", "HTTP request sent, awaiting response... 
200 OK\r\n", "Length: 2030182 (1.9M) [text/plain]\r\n", "Saving to: `Ab_4denovo_CLC6_a.fa'\r\n", "\r\n", "\r", " 0% [ ] 0 --.-K/s \r", "100%[======================================>] 2,030,182 --.-K/s in 0.03s \r\n", "\r\n", "2014-01-08 11:40:14 (68.2 MB/s) - `Ab_4denovo_CLC6_a.fa' saved [2030182/2030182]\r\n", "\r\n" ] } ], "prompt_number": 86 }, { "cell_type": "code", "collapsed": false, "input": [ "#let's preview the first lines of the file\n", "!head Ab_4denovo_CLC6_a.fa" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ ">solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_1\r\n", "ACACCCCACCCCAACGCACCCTCACCCCCACCCCAACAATCCATGATTGAATACTTCATC\r\n", "TATCCAAGACAAACTCCTCCTACAATCCATGATAGAATTCCTCCAAAAATAATTTCACAC\r\n", "TGAAACTCCGGTATCCGAGTTATTTTGTTCCCAGTAAAATGGCATCAACAAAAGTAGGTC\r\n", "TGGATTAACGAACCAATGTTGCTGCGTAATATCCCATTGACATATCTTGTCGATTCCTAC\r\n", "CAGGATCCGGACTGACGAGATTTCACTGTACGTTTATGCAAGTCATTTCCATATATAAAA\r\n", "TTGGATCTTATTTGCACAGTTAAATGTCTCTATGCTTATTTATAAATCAATGCCCGTAAG\r\n", "CTCCTAATATTTCTCTTTTCGTCCGACGAGCAAACAGTGAGTTTACTGTGGCCTTCAGCA\r\n", "AAAGTATTGATGTTGTAAATCTCAGTTGTGATTGAACAATTTGCCTCACTAGAAGTAGCC\r\n", "TTC\r\n" ] } ], "prompt_number": 87 }, { "cell_type": "code", "collapsed": false, "input": [ "#count lines, words, and bytes\n", "!wc Ab_4denovo_CLC6_a.fa" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " 35092 35092 2030182 Ab_4denovo_CLC6_a.fa\r\n" ] } ], "prompt_number": 88 }, { "cell_type": "code", "collapsed": false, "input": [ "#how many sequences? 
let's count \">\" since each contig header starts with one\n", "!grep -c \">\" Ab_4denovo_CLC6_a.fa" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "5490\r\n" ] } ], "prompt_number": 90 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Step 4: Run Blast" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "#using full paths; the out/ directory must exist before blastx runs\n", "!blastx -query /Volumes/Bay3/Software/ncbi-blast-2.2.29\\+/query/Ab_4denovo_CLC6_a.fa -db /Volumes/Bay3/Software/ncbi-blast-2.2.29\\+/db/uniprot_sprot_r2013_12 -out /Volumes/Bay3/Software/ncbi-blast-2.2.29\\+/out/Ab_4denovo_CLC6_a_uniprot_blastx.tab -evalue 1E-20 -max_target_seqs 1 -outfmt 6" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!head /Volumes/Bay3/Software/ncbi-blast-2.2.29\\+/out/Ab_4denovo_CLC6_a_uniprot_blastx.tab" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3\tsp|O42248|GBLP_DANRE\t82.46\t171\t30\t0\t1\t513\t35\t205\t1e-101\t 301\r\n", "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5\tsp|Q08013|SSRG_RAT\t75.38\t65\t16\t0\t3\t197\t121\t185\t1e-27\t 104\r\n", "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_6\tsp|P12234|MPCP_BOVIN\t76.62\t77\t18\t0\t2\t232\t286\t362\t2e-23\t98.6\r\n", "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_9\tsp|Q41629|ADT1_WHEAT\t82.26\t62\t11\t0\t3\t188\t170\t231\t3e-27\t 104\r\n", "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_13\tsp|Q32NG4|PDDC1_XENLA\t54.44\t90\t40\t1\t1\t270\t140\t228\t1e-27\t 106\r\n", "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_23\tsp|Q9GNE2|RL23_AEDAE\t97.22\t72\t2\t0\t67\t282\t14\t85\t1e-42\t 142\r\n", "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_31\tsp|Q3V1H3|HPHL1_MOUSE\t53.38\t133\t59\t1\t2\t391\t23\t155\t5e-42\t 153\r\n", 
"solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_32\tsp|Q641Y2|NDUS2_RAT\t88.03\t117\t14\t0\t2\t352\t334\t450\t1e-70\t 224\r\n", "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_37\tsp|Q9D3D9|ATPD_MOUSE\t56.10\t123\t54\t0\t2\t370\t46\t168\t7e-42\t 144\r\n", "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_39\tsp|Q39613|CYPH_CATRO\t75.00\t120\t23\t1\t55\t393\t1\t120\t7e-49\t 160\r\n" ] } ], "prompt_number": 94 }, { "cell_type": "code", "collapsed": false, "input": [ "!wc /Volumes/Bay3/Software/ncbi-blast-2.2.29\\+/out/Ab_4denovo_CLC6_a_uniprot_blastx.tab" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " 664 7968 84910 /Volumes/Bay3/Software/ncbi-blast-2.2.29+/out/Ab_4denovo_CLC6_a_uniprot_blastx.tab\r\n" ] } ], "prompt_number": 95 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }