{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# BLAST on the Command Line and Integrating with Python\n", "\n", "This will get example sequence data from an external source, prepare it, and then run BLAST with it. Then, subsequent steps will parse the resulting data into something useful in Python. And, even cover how to convert to Excel. \n", "\n", "To make things easier and to allow integrating with the nice features provided by the Jupyter environment, we will access the command line from inside a Jupyter notebook using the `!` symbol at the start of the command to direct it to the shell. (Alternatively, cells to direct to the shell can be started with `%%bash` on the first line. Or, if you feel more confortable working on the actual command line, you can open a terminal from within this active Jupyter environment by clicking the Jupyter logo in the upper left corner of this notebook and then selecting `New` > `Terminal` form the dropdown on the right of the Jupyter environment dashboard and run the initial parts without the `!` or `%%bash` tags.)\n", "\n", "
\n", "

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

\n", "\n", "

\n", " Some tips:\n", "

\n", "

\n", "
\n", "\n", "-----\n", "\n", "**CREDITS/RESOURCES**:\n", "\n", "The portions dealing with running BLAST on command line are based on:\n", "\n", "- [Titus Brown's Running command-line BLAST](https://angus.readthedocs.io/en/2017/running-command-line-blast.html) - from 2017 (early 2019 version with R for downstream [here](https://hackmd.io/FVzYhD6lRoOPFFZvW2GAGw?view))\n", "- [Shawn T. O’Neil's A Primer for Computational Biology: Chapter 7 Command Line BLAST](http://library.open.oregonstate.edu/computationalbiology/chapter/command-line-blast/)\n", "\n", "\n", "See [here](https://medium.com/@auguste.dutcher/turn-blast-results-into-a-presence-absence-matrix-cc44429c814) for illustration of the approach used for feeding the results into Python, and [here](https://blastedbio.blogspot.com/2014/11/column-headers-in-blast-tabular-and-csv.html) for a listing of explanations for the text codes involved in the flags. \n", "([Using NCBI BLAST+ Service with Biopython](https://github.com/widdowquinn/Teaching-SfAM-ECS/blob/61ee5e98e7c1327cfc2bfc2ca6e44c2bdefcaeed/workshop/02b-blast.ipynb) might also be of interest.)\n", "\n", "The Tables [here](https://www.ncbi.nlm.nih.gov/books/NBK279684/) are particularly useful for enhancing your queries with custom options.\n", "\n", "My script [`blast_to_df.py`](https://github.com/fomightez/sequencework/blob/master/blast-utilities/blast_to_df.py) is used as well and can be found in my [blast-utilities repository](https://github.com/fomightez/sequencework/tree/master/blast-utilities).\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing to run BLAST on the command line\n", "\n", "Typically need for command line BLAST:\n", "\n", "- command line-based BLAST+ applications on a computer\n", "- Have a database to query\n", "- Sequences to match against the database\n", "\n", "Normally the first is the major hurdle, but here that is already handled. **BLAST WILL WORK IN THE ACTIVE JUPYTER ENVIRONMENT AT LAUNCH**. 
\n", "(For those curious, see [here](https://www.ncbi.nlm.nih.gov/books/NBK279671/) for the official documentation; however, [conda makes it easier](https://anaconda.org/bioconda/blast).)\n", "\n", "If you got this far, you probably already have sequences you need to search.\n", "\n", "We are going to make a custom database here.\n", "\n", "We'll use the FASTA-formatted sequence of *S. cerevisiae* mitochondrial genome to make the database in this simple example. Retrieve that with the command in the next cell. \n", "Click in the next cell and hit `shift-enter` or press the `Run` button on the toolbar above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!curl -O http://sgd-archive.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Go ahead and run the code in the next cell to make the database using the downloaded sequence. (The command to make the BLAST database below based on [here](http://library.open.oregonstate.edu/computationalbiology/chapter/command-line-blast/) and reviewing pertinent topics [here](https://angus.readthedocs.io/en/2017/ ).)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!makeblastdb -in chrmt.fsa -dbtype nucl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the use of the exclamation point to direct the command to the command line shell.\n", "\n", "Now that you have indexed database for mitochdondrial chromosome, you need some sequences to BLAST against it.\n", "\n", "The next cell will get a few FASTA-formatted sequences of other mitochondria assembled primarily from long read data, see [Yue et al., 2017 PMID: 28416820](https://www.ncbi.nlm.nih.gov/pubmed/28416820). It will also fix the first line of each of those so they have unique identifiers based on the file name. Otherwise they would all be `chrMT`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "curl -LO http://yjx1217.github.io/Yeast_PacBio_2016/data/Mitochondrial_Genome/S288C.mt.genome.fa.gz\n", "gunzip -f S288C.mt.genome.fa.gz\n", "# Up until sometime around September / October 2018, the above referenced file was named `S288c.mt.genome.fa.gz` on that server and\n", "# I built things around that. The name has since changed to the more-correct `S288C.mt.genome.fa.gz`, but I am going \n", "# to convert name to earlier form to make subsequent commands work, be more distinguished from SGD S288C reference sequence,\n", "# and make things continue match what was seen before.\n", "mv S288C.mt.genome.fa S288c.mt.genome.fa\n", "curl -LO http://yjx1217.github.io/Yeast_PacBio_2016/data/Mitochondrial_Genome/SK1.mt.genome.fa.gz\n", "gunzip -f SK1.mt.genome.fa.gz\n", "curl -LO http://yjx1217.github.io/Yeast_PacBio_2016/data/Mitochondrial_Genome/Y12.mt.genome.fa.gz\n", "gunzip -f Y12.mt.genome.fa.gz\n", "var=$(echo \"S288c.mt.genome.fa\" | cut -d'.' --complement -f2-).mito\n", "sed -i \"1s/.*/>$var/\" S288c.mt.genome.fa\n", "var=$(echo \"SK1.mt.genome.fa\" | cut -d'.' --complement -f2-).mito\n", "sed -i \"1s/.*/>$var/\" SK1.mt.genome.fa\n", "var=$(echo \"Y12.mt.genome.fa\" | cut -d'.' --complement -f2-).mito\n", "sed -i \"1s/.*/>$var/\" Y12.mt.genome.fa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note we used the `%%bash` cell magics in the notebook cell to signal that this large code block was all to be run in the command line shell.\n", "\n", "Now you are prepared to run BLAST on the command line." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Running BLAST on the command line (basics)\n", "\n", "To start simple, you can run one of the sequences against the database with the basic command next. It will output a lot of data. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "!blastn -query S288c.mt.genome.fa -db chrmt.fsa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Already, you'll note that large amount of data is not easy to navigate here.\n", "\n", "We can address that some before we move on to using Python to make it much easier to deal with the results.\n", "\n", "For dealing with all this data, without Python you can direct the output to a file and then access portions of it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!blastn -query S288c.mt.genome.fa -db chrmt.fsa -out S288C.mt.x.chrmt.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that there is a file, you can view the first 45 lines by running the next command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!head -45 S288C.mt.x.chrmt.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That shows that there is a 57 kb fragment that matches very well the sequence we used to build the database.\n", "\n", "Likewise `tail` could be used and combinations to look at parts of the file, like in the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!tail -21979 S288C.mt.x.chrmt.txt | head -45" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But that is clunky when there are much better ways using Python to read BLAST results that makes them easily queried by programmatic means.\n", "\n", "To do this we need to move from outputing in a human readable form to something more parseable by a computer. 
We will do that in the next section, but first I'll end this section by briefly pointing out another advantange running BLAST on the the command line affords by showing one example of using flags to control settings in depth.\n", "\n", "There are a lot of settings that can be adjusted/controlled via the command line and the command line exposes many more (I think) than are available in the web browser based interface (I think). Additionally, trying and adjusting these on the command line can be much more efficient if you start combining it with programmatic means that the this notebook and others in this series will try to introduce.\n", "\n", "When we ran the command `!blastn -query S288c.mt.genome.fa -db chrmt.fsa -out S288C.mt.x.chrmt.txt` above there were a lot of options that we let be determined by the software or the defaults programmed into the software. However, you don't need to accept those defaults and can adjust things as needed. Here is an example command were we use what are called 'flags' to provide parameters for various settings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!blastn -query S288c.mt.genome.fa -db chrmt.fsa -task blastn -gapopen 5 -gapextend 2 -reward 2 -penalty -3 -word_size 11 -out flag_example.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Tables [here](https://www.ncbi.nlm.nih.gov/books/NBK279684/) are particularly helpful for understanding what those flags mean. Briefly, we controlled a lot of the settings BLAST uses, and you'll note this query is computationally much more demanding as it takes longer to run this command than `!blastn -query S288c.mt.genome.fa -db chrmt.fsa -out S288C.mt.x.chrmt.txt` took run above. I just wanted to touch on this ability here; you can explore this further in this binder with your own sequences. 
However, because experimenting with these settings can generate a lot of data, you may wish to return to trying things after seeing the next section because you can more easily deal with the generated data using a few abilitues Python adds.\n", "\n", "**Note: If you are not seeing hits, you may be using too stringent of setings as megablast is the default currently run with the command `blastn`** as discussed in [this post entitled 'When Blast is not Blast'](http://www.sixthresearcher.com/when-blast-is-not-blast/). You **probably want `-task blastn` included**.\n", "\n", "\n", "### Running BLAST on the command line and sending output to HTML" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "html_text = !blastn -query S288c.mt.genome.fa -db chrmt.fsa -task blastn -gapopen 5 -gapextend 2 -reward 2 -penalty -3 -word_size 11 -html" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "from IPython.core.display import HTML\n", "HTML(html_text.n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why `html_text.n` is used will become understandable in the course of working through [the next notebook, entitled 'BLAST on the Command Line with Advanced Python' here](BLAST%20on%20the%20Command%20Line%20with%20Advanced%20Python.ipynb#BLAST-on-the-Command-Line-with-Advanced-Python). For now, it will suffice to say that I'm using a shortcut to get the HTML text output here to show you can get the more formatted version by [using the HTML flag](https://www.biostars.org/p/315104/#315109) and then display it in Jupyter, if you need. 
However, the goal of this is to use BLAST in Jupyter to automate and so the focus is meant to be on computationally parseable output and not the human-readable variations.\n", "\n", "The next sections and the next notebook build towards that more progammatic direction using Blast on the command line in conjunction with Python/Jupyter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running BLAST on the command line and sending the output to parseable files\n", "\n", "As I mentioned above, to make things more efficient and easier to parse, we'd like to move towards letting the computer handle our BLAST data. To do this we'll build on the approach outline at the start of the document [here](https://medium.com/@auguste.dutcher/turn-blast-results-into-a-presence-absence-matrix-cc44429c814) that involves using 'flags' to control the settings as just discussed above. \n", "\n", "See [here](https://blastedbio.blogspot.com/2014/11/column-headers-in-blast-tabular-and-csv.html) for a listing of explanations for the text codes involved in the flags. \n", "\n", "Run the next cell to make a new file `S288C.mt.x.chrmt.c.txt`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!blastn -query S288c.mt.genome.fa -db chrmt.fsa -outfmt \"6 qseqid sseqid stitle pident qcovs length mismatch gapopen qstart qend sstart send qframe sframe frames evalue bitscore qseq sseq\" -task blastn -out S288C.mt.x.chrmt.c.txt\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's peek at what that looks like by looking at the first line." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!head -1 S288C.mt.x.chrmt.c.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll notice that definitely look different, and we've moved away from 'human readable' for certain.\n", "\n", "Next, we'll use Python, which is great for working with text based data to parse this into a much more convenient form. Plus, we'll make it human readable at the same time. Then with that in hand will return to using BLAST by moving into adding in options to the BLAST run commands to control the settings in order customize the queries beyond the defaults.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Integrating BLAST results with Python\n", "\n", "We are going to use dataframes as form to handle the results. These are a really conevenient datatype and Python has an entire library, called Pandas, for dealing with them. This Jupyter environment already has the Pandas Python module installed. We'll use a script that will access it when it runs to take the BLAST results and make them into a dataframe. \n", "\n", "To get that script, you can run the next cell. (It is not included in the repository where this launches from to insure you always get the most current version, which is assumed to be the best available at the time.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/blast-utilities/blast_to_df.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have the script now. And we already have a file for it to process. To process the results, run the next command where we use Python to run the script and direct it at the results file, `S288C.mt.x.chrmt.c.txt`, we made just a few cells ago." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run blast_to_df.py S288C.mt.x.chrmt.c.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As of writing this, the script we are using outputs a file that is a binary, compact form of the dataframe. (That means it is tiny and not human readable. It is called 'pickled'. Saving in that form may seem odd, but as illustrated [here](#Output-to-more-universal,-table-like-formats) below this is is a very malleable form. And even more pertinent for dealing with BLAST results in Jupyter notebooks, there is actually an easier way to interact with this script when in Jupyter notebook that skips saving this intermediate file. Please see the next notebook on using BLAST results with Python for more about that if you find that awkward. To lessen the Python wizardry for now, I am going to take the file intermediate path, and perhaps I should have saved it in a tab-separated form, but that would take up more space and some of these files can get rather large. It is easy to convert back and forth using the pickled form assuming you can match the Pandas/Python versions. I have found Python 3 to Python 3 you need to convert to the text based table and use that as an intermediate. In other words, the current Python3/Pandas pickled form is not directly readable by current Python 2.7/Pandas.)\n", "\n", "We can take that file where the dataframe is pickled, and bring it into active memory in this notebook with another command form the Pandas library. First we have to import the Pandas library because where we used earlier was in termporary environment where `blast_to_df.py` ran in the last cell, which actually is different than the computational envrionment associated with this notebook.\n", "\n", "Run the next command to bring the dataframe into active memory." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_pickle(\"BLAST_pickled_df.pkl\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When that last cell ran, you won't notice any output, but something happened. We can look at that dataframe by calling it in a cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll notice that it is so large, that the Jupyter environment represents just the head and tail to make it more reasonable. There are ways you can have Jupyter display it all which we won't go into here. \n", "\n", "Instead we'll start to show some methods of dataframes that make them convenient. For example, you can use the `head` method to see the start like we used on the command line above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So Pandas dataframes are a excellent way to store data in a way that clear but at the same time provides a lot of power and convenience for dealing with rather large sets of data. We'll only touch upon this a little here. For example, there are more convenience methods and functions that Pandas, such as `df.describe()` and `df.tail`, that you can explore.\n", "\n", "But let's run BLAST again with different data, and bring the results into a dataframe in a more direct way. And we'll then demonstrate the utility of the dataframe form with an applied example demonstrating the ease with which you can extract relevant data using Pandas. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Demonstrating the Utility of Having the BLAST Results in Python\n", "\n", "Above the steps were laid out step-by-step to make clear what is happening. 
However, this can be done in a more compact way all in a single Jupyter cell. (In fact, if you were really doing an analysis of this collection of PacBio generated sequences, you'd take another step [as dicussed below](#Practical-BLAST:-Making-things-easier-by-having-related-sequences-to-be-queried-in-same-file).) The next cell will run another BLAST and make the data in a dataframe.\n", "\n", "(This will depend on running the ['Preparation' section above](#Preparing-to-run-BLAST-on-the-command-line) and additionally the following two commands from above:\n", "\n", "- `!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/blast-utilities/blast_to_df.py`\n", "- `import pandas as pd` \n", "\n", "And so if you are picking this back up later, run [the first section](#Preparing-to-run-BLAST-on-the-command-line) and then add some cells and run those extra two commands.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!blastn -query Y12.mt.genome.fa -db chrmt.fsa -outfmt \"6 qseqid sseqid stitle pident qcovs length mismatch gapopen qstart qend sstart send qframe sframe frames evalue bitscore qseq sseq\" -task blastn -out Y12.mt.x.chrmt.c.txt\n", "%run blast_to_df.py Y12.mt.x.chrmt.c.txt --df_output next_BLAST_pickled_df.pkl\n", "df = pd.read_pickle(\"next_BLAST_pickled_df.pkl\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From my work with these genomes, I know that what the curating researchers defined as \"start\" position of the circular chromosome differs from the arbitrary \"start\" that the Saccharomyces Genome Database (SGD) uses for S288C mitochdonrial start position. 
For comparing the position of genes, etc., it is important for me to be able to match these other sequences to the SGD reference sequence.\n", "\n", "Again, from my previous work, I know that in most cases for these sequence, the \"start\" used by the curating researchers, is easily located from the BLAST data be finding the segment that matches the start of the SGD sequence. That translates to wanting to find where the value in the `sstart` (subject start) column matches 1. For the PacBio S288C that we looked at in the first step-by-step example above, the second row in the result viewable `df.head()` revealed that. But it didn't here. We can find it easily though via the power of Pandas by coding a conditional selection to say we want any rows there `sstart` equals `1`. The next cell will run this subsetting operation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_loc_df = df[df['sstart']==1]\n", "start_loc_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Running that reveals the answer, the `qstart`(query start) position of 54123 matches the SGD reference sequence position 1.\n", "In fact, we can extract that exact value using Pandas to references it specifically. We'll assign that to a variable in the next cell because we are going to use it next. Additionally we are going to make sure we sort the `start_loc_df` on `bitscore`. This sorting is serving an pedagogical and practical purpose here:\n", " #1 it exposes another Pandas operation that easy to perform.\n", " #2 it would be important if there were two matches satisfying `df['sstart']==1` that we take the better scoring one. This situation is not happening with this example, but does with other examples from this set of a dozen PacBio sequences (only three of which did we actually retrieve in this example notebook), and so I am adding it here to generalize the processs further." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_loc_df = start_loc_df.sort_values(['sstart','bitscore'],ascending=[1,0])\n", "start_loc = start_loc_df.iloc[0][\"qstart\"]\n", "start_loc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So now you have `start_loc` defined as the position in the PacBio Y12 mitochondrial sequence as the one that matches the SGD sequence. To further demonstrate the utility of combining BLAST results with Python, in the next cell you can adjust the sequence of the PacBio Y12 mitochondrial sequence to match the 'start' position SGD uses." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from Bio import SeqIO\n", "with open(\"Y12.mt.genome.fa\") as handle:\n", " mito_seq = SeqIO.read(handle, \"fasta\")\n", "\n", "# fix, based on where it says \"i.e. shift the starting point on this plasmid,\" @ \n", "#http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html\n", "left = mito_seq[:start_loc-1] # use one less because of zero indexing in Python\n", "right = mito_seq[start_loc-1:]\n", "new_mito_seq = right + left\n", "\n", "# write result after fix\n", "SeqIO.write(\n", " new_mito_seq, \"Y12.mt.genome.START_ADJUSTED.fa\", \"fasta\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The saved sequence `Y12.mt.genome.START_ADJUSTED.fa` should begin at that same point, and now it would be ready for use with the SGD reference sequence (available [here](https://sgd-archive.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa) or already downloaded into this active Jupyter environment as `chrmt.fsa` in the course of the preparation steps) in an alignment by [Kalign](https://www.ebi.ac.uk/Tools/msa/kalign/). 
If you were to do that, you'd see the both ends of that long alignment actually align.\n", "\n", "(In case you are new to the Jupyter environment, to get to the dashboard where the file directory is visible, click on the Jupyter logo in the upper left of this notebook page.)\n", "\n", "If we had tried running the alignment without moving the parts to match the SGD sequence, the sequences would align in the 'middle', but the two 'ends' would be left overhanging. In other words, without the corrective shift, the large 'start' segment of the SGD sequence would appear as an unaligned overhang with about 33 thousand base pairs into the alignment where upon the Y12 'start' sequence would finally match SGD sequence farily well, and that would be used as an anchor for this 'middle' matching. After that matching segment a large fragment of Y12 would then remain overhanging. \n", "\n", "You can demonstrate those two situations I just described easily for yourself with the sequences available in this Jupyter environment by submitting them to [Kalign](https://www.ebi.ac.uk/Tools/msa/kalign/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Aside: Other Alternatives**\n", " \n", "Of course, you can work with tabular BLAST output other ways. 
For example, AWK used on the command line can easily grab the best hit as described here by Mick Watson:\n", "\n", ">\"Also if you have multiline tabular BLAST results, just get top hit by\n", "awk '!a[$1]++'\" \n", "-Source: https://twitter.com/BioMickWatson/status/1017486331511963648\n", " \n", "Demonstrating that:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat S288C.mt.x.chrmt.c.txt|awk '!a[$1]++'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practical BLAST: Making things easier by having related sequences to be queried in same file\n", "\n", "This section might be better titled, **\"How you'd really do what was shown above\"**.\n", "\n", "Above we dealt with PacBio-generated sequences S228C and then Y12 in turn comparing them the SGD reference sequence for S288C. However, it is much more convenient to combine the asociated sequences you need to compare into one file and submit in one BLAST job. As you'll see, the dataframe structure makes it easy to keep these combined, but easily accessible individually. This enhances reproducibility because you then have one master file to deal with and you know you had the same settings applied to all the involved queries. This is espeically important once you start using 'flags' to customize settings.\n", "\n", "Let's combine all the PacBio data we have been dealing with above and one we haven't touched yet in one. (There are actually a total of a dozen in the real set.) First we'll concatenate all the individual FASTA files we were dealing with into one file that file that consist of multiple squences in FASTA format that is sometimes referred to as Multi-FASTA format. 
Then the next steps are just the same as above.\n", "\n", "(As with the last section, the next cell working will depend on running the ['Preparation' section above](#Preparing-to-run-BLAST-on-the-command-line) and additionally the following two commands from above:\n", "\n", "- `!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/blast-utilities/blast_to_df.py`\n", "- `import pandas as pd` \n", "\n", "And so if you are picking this back up later, run [the first section](#Preparing-to-run-BLAST-on-the-command-line) and then add some cells and run those extra two commands.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat S288c.mt.genome.fa Y12.mt.genome.fa SK1.mt.genome.fa > pacbio.mt.fa\n", "!blastn -query pacbio.mt.fa -db chrmt.fsa -outfmt \"6 qseqid sseqid stitle pident qcovs length mismatch gapopen qstart qend sstart send qframe sframe frames evalue bitscore qseq sseq\" -task blastn -out pacbio.mt.x.chrmt.c.txt\n", "%run blast_to_df.py pacbio.mt.x.chrmt.c.txt --df_output pb.x.chrmt_pickled_df.pkl\n", "df = pd.read_pickle(\"pb.x.chrmt_pickled_df.pkl\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Promising. But the `qseqid`(query sequence identifier) only shows data for `S288c.mito` visible right now. Let's run the `df.tail()` command next." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, there is `SK1.mito` that we haven't touched prior to this and a high index approaching 5000 on the far left. So all the same data we have seen before, plus the `SK1` data is in there.\n", "\n", "We can easily access indivudal parts by adding conditions based on the `qseqid` to selections for the dataframe filtering/subsetting.\n", "\n", "For example, above we got the 'start' value for Y12 that corresponded to position #1 of the SGD sequence. 
You can repeat that for SK1 by running the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SK1_df = df[df['qseqid'] == \"SK1.mito\"]\n", "start_loc_df = SK1_df.sort_values(['sstart','bitscore'],ascending=[1,0])\n", "start_loc = start_loc_df.iloc[0][\"qstart\"]\n", "start_loc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yay, a number. You can see that more in context by showing the start of the pertinent dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_loc_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So you could again adjust the sequence based on that number so the 'starts' match. There is no need to actually do that because the point was to illustrate `pb.x.chrmt_pickled_df.pkl` is easier to generate and much better to deal with than the individual approaches I had done earlier as to make this process less overwhelming." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Output to more universal, table-like formats\n", "\n", "I've tried to sell you on the power of the Python/Pandas dataframe, but it isn't for all uses or everyone. However, most everyone is accustomed to dealing with text based tables or even Excel. In fact, a text-based based table perhaps tab or comma-delimited would be the better way to archive the data we are generating here. Python/Pandas makes it easy to go from the dataframe form to these tabular forms. 
You can even go back later from the table to the dataframe, which may be inportant if you are going to different versions of Python/Pandas as I briefly mentioned parenthetically above.\n", "\n", "**First, generating a text-based table.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Save / write a TSV-formatted (tab-separated values/ tab-delimited) file\n", "df.to_csv('blast_data.tsv', sep='\\t',index = False) #add `,header=False` to leave off header, too" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because `df.to_csv()` defaults to dealing with csv, you can simply use `df.to_csv('example.csv',index = False)` for comma-delimited (comma-separated) files.\n", "\n", "You can see that worked by looking at the first line with the next command. (Feel free to make the number higher or delete the number all together. I restricted it just to first line to make output smaller.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!head -1 blast_data.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you had need to go back from a tab-separated table to a dataframe, you can run something like in the following cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reverted_df = pd.read_csv('blast_data.tsv', sep='\\t')\n", "reverted_df.to_pickle('reverted_df.pkl') # OPTIONAL: pickle that data too" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a comma-delimited (CSV) file you'd use `df = pd.read_csv('example.csv')` because `pd.read_csv()` method defaults to comma as the separator (`sep` parameter).\n", "\n", "You can verify that read from the text-based table by viewing it with the next line." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reverted_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Generating an Excel spreadsheet from a dataframe.**\n", "\n", "Because this is a specialized need, an extra module that I didn't bother installing by default is required, and so it needs to be installed before generating the Excel file. Running the next cell will do both the installation and the export." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install openpyxl\n", "# save to Excel (KEEPS any multi-index, and makes it sparse to look good in Excel straight out of Python)\n", "df.to_excel('blast_data.xlsx') # after openpyxl installed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll need to download the file to your computer first and then view it locally, as there is no viewer in the Jupyter environment.\n", "\n", "Additionally, it is possible to add styles to dataframes, and those styles, such as shading of cells and coloring of text, will be carried over to the Excel document as well.\n", "\n", "Excel files can be read into dataframes directly, without needing to go through a text-based intermediate first, using the `xlrd` module. (Note that `xlrd` 2.0 and later only reads the older `.xls` format; for `.xlsx` files, recent versions of Pandas use `openpyxl`, installed above, instead.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install xlrd\n", "# read Excel, after xlrd installed\n", "df_from_excel = pd.read_excel('blast_data.xlsx') # after xlrd installed " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That can be viewed to convince yourself it worked by running the next command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_from_excel.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "\n", "## Where to next?\n", "\n", "Hopefully, you now have an environment in which you can run BLAST and readily manipulate the output. 
Your next step would be to add your own data and repeat. You can get to the Jupyter dashboard by clicking the Jupyter logo in the upper left corner of this notebook. There you can upload files from your computer using a typical GUI file-handling interface. Or, you can use the command line to retrieve files.\n", "\n", "Note that you can also use that same Jupyter dashboard to launch new notebooks and start working with your own data in a fresh notebook in the same session in which you ran the examples above.\n", "\n", "Remember, if you make any useful files, download them ASAP, as these Binder-associated instances are temporary. Additionally, grab the pickled dataframe files as well, since they can be read in again to continue an analysis of any results without re-running any BLAST.\n", "\n", "Additionally, now that you can quickly reformat the data into useful forms, you may want to return to the brief introduction to 'flags' above to adjust options and explore them, taking advantage of how running BLAST on the command line lets you quickly test various options and compare and contrast the output. Related to that, you may want to learn more about Pandas. 
Here are some resources for that:\n", "\n", "- [Chris Albon's Data Wrangling section has a lot of concise Pandas recipes](https://chrisalbon.com/python/data_wrangling/), search 'pandas' in that list\n", "- [10 Minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)\n", "- [Pandas Intro](http://www.stephaniehicks.com/learnPython/pages/pandas.html) \n", "- [Shane Lynn's Using iloc, loc, & ix to select rows and columns in Pandas DataFrames](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/)\n", "- [My gist of useful Pandas snippets](https://gist.github.com/fomightez/ef57387b5d23106fabd4e02dab6819b4)\n", "- [Things in Pandas I Wish I'd Known Earlier](http://nbviewer.jupyter.org/github/rasbt/python_reference/blob/master/tutorials/things_in_pandas.ipynb)\n", "\n", "\n", "Beyond that...\n", "\n", "There is [another notebook, entitled 'BLAST on the Command Line with Advanced Python', here](BLAST%20on%20the%20Command%20Line%20with%20Advanced%20Python.ipynb#BLAST-on-the-Command-Line-with-Advanced-Python) in this repo that shows different ways these steps can be streamlined when running in the Jupyter environment. For example, there isn't a need to rely on file intermediates. If you are using Jupyter notebooks to make a workflow or pipeline, you definitely want to take a look at this approach. Please see [here](BLAST%20on%20the%20Command%20Line%20with%20Advanced%20Python.ipynb#BLAST-on-the-Command-Line-with-Advanced-Python) for more advanced use of BLAST and Python.\n", "\n", "\n", "**Is the Binder-launchable version that you are using here too limiting for your needs?**\n", "\n", "NCBI has made available an Amazon Machine Image (AMI) that will generate a remote cloud-compute instance already set up to run standalone BLAST; see [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=CloudBlast). Additional documentation is [here](http://ncbi.github.io/blast-cloud/). 
\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 2 }