{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Searching for homologs among deduced proteins from a genome using BLAST and Python\n", "\n", "This notebook will take three specified budding yeast genes and identify & collect the protein sequences of homologs among a collection of deduced proteins available for a genome from Ensembl.\n", "\n", "This notebook builds on the previous ones. See those for introduction and credits. \n", "\n", "This is meant to demonstrate using BLAST+, Python, and a few of my Python scripts to accomplish a task in a series of steps. \n", "\n", "How to modify the input to use your own sequence or sequences is described.\n", "\n", "The file used here contains the deduced sequences predicted proteins encoded by the genome of *Encephalitozoon cuniculi*. (The details of these sequences are just below in an expandable section; however, the option doesn't render statically via Github; use [nbviewer](https://nbviewer.jupyter.org/github/fomightez/blast-binder/blob/master/notebooks/Searching%20for%20homologs%20among%20deduced%20proteins%20from%20a%20genome%20using%20BLAST%20and%20Python.ipynb) or [launch an active session via MyBinder](https://mybinder.org/v2/gh/fomightez/blast-binder/master?filepath=notebooks/Searching%20for%20homologs%20among%20deduced%20proteins%20from%20a%20genome%20using%20BLAST%20and%20Python.ipynb).)\n", "\n", "
\n", " Click the arrow below to expand to fuller explanation of the sequence source.\n", " \n", "From [the Ensembl genome page for Encephalitozoon cuniculi](https://fungi.ensembl.org/Encephalitozoon_cuniculi_ecuniii_l_gca_001078035/Info/Index), under 'Gene Annoation', you'd click 'FASTA' next to the text 'Download genes, cDNAs, ncRNA, proteins -'. That leads to the [Downloads page](ftp://ftp.ensemblgenomes.org/pub/fungi/release-48/fasta/fungi_microsporidia1_collection/encephalitozoon_cuniculi_ecuniii_l_gca_001078035)\n", "which lists:\n", "\n", "```bash\n", "cdna/\t\t8/17/20, 1:24:00 PM\n", "cds/\t\t8/17/20, 1:24:00 PM\n", "dna/\t\t8/17/20, 1:24:00 PM\n", "dna_index/\t\t8/17/20, 1:24:00 PM\n", "ncrna/\t\t8/17/20, 1:24:00 PM\n", "pep/\t\t8/17/20, 1:24:00 PM\n", "```\n", "\n", "Specifically the deduced protein sequences is the `pep` entry at the bottom of that list and can be directly retrieved from:\n", "ftp://ftp.ensemblgenomes.org/pub/fungi/release-48/fasta/fungi_microsporidia1_collection/encephalitozoon_cuniculi_ecuniii_l_gca_001078035/pep/\n", "\n", "Source information for the Encephalitozoon cuniculi sequence data: \n", "https://www.ebi.ac.uk/ena/browser/view/GCA_001078035.1\n", " \n", "Sadly, the sequence file we use here cannot be directly retrieved via `curl` or `wget` from the source because the port for ftp is blocked on MyBinder.org to prevent abuse. However, you could easily download the file and then upload it to the active session running via MyBinder.org. To save on that step it is already included in this repository and so is present when the session launches. \n", "
\n", "\n", "-----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparation\n", "\n", "In this demonstration we'll get protein sequences for three genes of *S. cerevisiae* and make a multi-FASTA file from them. If you already have your own protein sequences in a file, you can upload that to this running session and skip to the step below where the input file is specified and edit the `seq_file_to_use` variable to be the name of your file. The sequence file in FASTA format you provide will need to be in the `notebooks` directory alongside this notebook. This will be revisited in greater detail below. \n", "\n", "To get the demonstration protein sequences, we'll use a script I developed that fetches protein sequences for *S. cerevisiae* genes when provided wit a gene's systematic name, standard name, or alias as defined at gene page at yeastgenome.org. See more about that script [here](https://github.com/fomightez/yeastmine#descriptions-of-the-scripts)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 11399 100 11399 0 0 50823 0 --:--:-- --:--:-- --:--:-- 50888\n" ] } ], "source": [ "!curl -O https://raw.githubusercontent.com/fomightez/yeastmine/master/get_protein_seq_as_FASTA.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The intermine webservices package that script relies on has not been updated on package distribution sites in a while and so we are going to get the current version directly from Github using the following command. We'll also add a package called `pyfaidx` that makes handling FASTA-formated files easier." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting https://github.com/intermine/intermine-ws-python/archive/dev.zip\n", " Downloading https://github.com/intermine/intermine-ws-python/archive/dev.zip\n", "\u001b[2K \u001b[32m-\u001b[0m \u001b[32m150.5 kB\u001b[0m \u001b[31m4.9 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m\n", "\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25ldone\n", "\u001b[?25hBuilding wheels for collected packages: intermine\n", " Building wheel for intermine (setup.py) ... \u001b[?25ldone\n", "\u001b[?25h Created wheel for intermine: filename=intermine-1.13.0-py3-none-any.whl size=75660 sha256=5f2f7c4d4b2ecae2cac9ff52c43a956dd2dc03ad8c7c2f24ffc543e1d6d299c5\n", " Stored in directory: /tmp/pip-ephem-wheel-cache-_wq3fh2r/wheels/74/4c/cc/b2bc6341109f116ae83cc8424243b1a820cd4fcdf99a4f6932\n", "Successfully built intermine\n", "Installing collected packages: intermine\n", "Successfully installed intermine-1.13.0\n", "Note: you may need to restart the kernel to use updated packages.\n", "Collecting pyfaidx\n", " Downloading pyfaidx-0.8.1.1-py3-none-any.whl.metadata (25 kB)\n", "Requirement already satisfied: setuptools in /srv/conda/envs/notebook/lib/python3.10/site-packages (from pyfaidx) (68.2.2)\n", "Requirement already satisfied: importlib-metadata in /srv/conda/envs/notebook/lib/python3.10/site-packages (from pyfaidx) (6.8.0)\n", "Requirement already satisfied: zipp>=0.5 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from importlib-metadata->pyfaidx) (3.17.0)\n", "Downloading pyfaidx-0.8.1.1-py3-none-any.whl (28 kB)\n", "Installing collected packages: pyfaidx\n", "Successfully installed pyfaidx-0.8.1.1\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install https://github.com/intermine/intermine-ws-python/archive/dev.zip\n", "%pip install pyfaidx" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll use it to get the protein sequences corresponding to the three genes form a list. \n", "(If you see, errors like `ModuleNotFoundError: No module named 'urlparse'`...`ImportError: cannot import name 'MutableMapping' from 'collections' (/srv/conda/envs/notebook/lib/python3.10/collections/__init__.py)` as you run this next cell, then read below this next cell for possible solution.)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "looking up the gene associated with NOP1...getting protein sequence...\n", "\n", "File of protein sequence saved as 'S288C_YDL014W_NOP1_protein.fsa'.\n", "Finished.\n", "looking up the gene associated with VPH1...getting protein sequence...\n", "\n", "File of protein sequence saved as 'S288C_YOR270C_VPH1_protein.fsa'.\n", "Finished.\n", "looking up the gene associated with RPB1...getting protein sequence...\n", "\n", "File of protein sequence saved as 'S288C_YDL140C_RPO21_protein.fsa'.\n", "Finished.\n" ] } ], "source": [ "seqs_to_get = [\"NOP1\",\"VPH1\",\"RPB1\"]\n", "for seq in seqs_to_get:\n", " %run get_protein_seq_as_FASTA.py {seq}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(If you see `basil2` among the output, disregard it. I believe it simply an indicator that it is using the development version of the code.) \n", "(If you saw errors, like `ModuleNotFoundError: No module named 'urlparse'`...`ImportError: cannot import name 'MutableMapping' from 'collections' (/srv/conda/envs/notebook/lib/python3.10/collections/__init__.py)` as you run the cell, it is possibly [because intermine needs updating](https://github.com/intermine/intermine/issues/2424). Foruntately it is easily fixed by making a new cell and running two commands: Rename the code to fix it by running in the new cell `!mv possible_webservice_fix_for_intermine_code.py webservice.py`. Then to fix run `!cp ./webservice.py /srv/conda/envs/notebook/lib/python3.10/site-packages/intermine/webservice.py`. Now delete that extra cell and try running the cell above again, and then if it works continue on with the rest.)\n", "\n", "With those protein sequences we'll combine them into one multi-FASTA sequence file using the cell below. (Most of the code in it is concerned with getting a list of the individual FASTA files without needing to hardcode them into this notebook. The putting the files together is really done in the `!cat` command.)\n", "\n", "The collected individual sequences are funneled into one sequence file by running the next cells. This may seem counterproductive because later we split the sequences back out to individual files again. However, it is done in this manner because BLAST needs a file as input and this way it is easier for others to plug-in their own sequences in FASTA form right into this notebook by uploading it here and editing the value of file name for the `seq_file_to_use` variable below. This way they don't need to relist their gene names to adapt this notebook. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import fnmatch\n", "fasta_files = []\n", "name_part_to_match = \".fsa\"\n", "for file in os.listdir('.'):\n", " if fnmatch.fnmatch(file, '*'+name_part_to_match):\n", " #print (file)\n", " #first_part_filen = file.rsplit(extension_to_handle,1)[0]\n", " fasta_files.append(file)\n", "!cat {\" \".join(fasta_files)} >protein_sequences.fasta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That should now produce a file named `protein_sequences.fasta` in the current working directory. We can check that is the case. If you run the code in the next cell it will first display the current working directory, and then the list of files in that directory. We use the `%%bash%` magic tag at the top of the cell to specify to run those commands in the shell. (For those that are Jupyter power users, you may realize this is unnecessary for these two commands; however, it is used here as what unix commands work without anything special is one of those things that novices find hard to follow when learning on Jupyter.)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/jovyan/notebooks\n", "-*--above is working directory; below is the list of files in that directory-*--\n", "BLAST on Command Line and Integrating with Python.ipynb\n", "BLAST on the Command Line with Advanced Python.ipynb\n", "Encephalitozoon_cuniculi_ecuniii_l_gca_001078035.ECIIIL.dna.toplevel.fa.gz\n", "Encephalitozoon_cuniculi_ecuniii_l_gca_001078035.ECIIIL.pep.all.fa.gz\n", "get_protein_seq_as_FASTA.py\n", "GSD\n", "protein_sequences.fasta\n", "S288C_YDL014W_NOP1_protein.fsa\n", "S288C_YDL140C_RPO21_protein.fsa\n", "S288C_YOR270C_VPH1_protein.fsa\n", "Searching for coding sequences in genomes using BLAST and Python.ipynb\n", "Searching for homologs among deduced proteins from a genome using BLAST and Python.ipynb\n", "Untitled.ipynb\n", "webservice.py\n" ] } ], "source": [ "%%bash\n", "pwd\n", "echo \"-*--above is working directory; below is the list of files in that directory-*--\"\n", "ls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see `protein_sequences.fasta` listed there.\n", "\n", "The file `protein_sequences.fasta` contains the sequences we'll use in the remainder of this notebook to look for homologs among the deduced protein sequences from *Encephalitozoon cuniculi*. If you already have a file of your own protein sequence(s) in FASTA format, you have some choices. One is to simply edit the content of `protein_sequences.fasta` to replace it with your sequence or sequences. The more direct way, and one that allows you to skip any of the above cells once you know what you are doing is that you upload your sequence file to this session and change the file name below to match your sequence file. To do that you edit the text between the quotes. **Note that the sequence file in FASTA format you provide will need to be located in the `notebooks` directory alongside this notebook.** To help in troubleshooting the upload and placing the file in the correct directory, you should see your file listed now if you re-run the cell above." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "seq_file_to_use = \"protein_sequences.fasta\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sadly, the file containing the deduced sequences predicted proteins encoded by the genome of *Encephalitozoon cuniculi* we use here cannot be directly retrieved via `curl` or `wget` from the source because the port for ftp is blocked on MyBinder.org to prevent abuse. However, you could easily download the file and then upload it to the active session running via MyBinder.org. To save on that step it is already included in this repository and so is present when the session launches. \n", "\n", "We need to first extract the sequence file from a compressed form using the following command:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "gunzip Encephalitozoon_cuniculi_ecuniii_l_gca_001078035.ECIIIL.pep.all.fa.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's rename the sequence file something more manageable, `ec.pep.fas`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "mv Encephalitozoon_cuniculi_ecuniii_l_gca_001078035.ECIIIL.pep.all.fa ec.pep.fas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the `blast_to_df` script that will be used to convert the raw output from BLAST to something useful in Python by running the following cell. We'll also go ahead and import the main function from the script that allows sending BLAST output to Python dataframes so that we can use it here. Also we prepare to make files from each sequence so that we can feed as input into BLAST." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 13771 100 13771 0 0 57280 0 --:--:-- --:--:-- --:--:-- 57141\n" ] } ], "source": [ "import os\n", "file_needed = \"blast_to_df.py\"\n", "if not os.path.isfile(file_needed):\n", " !curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/blast-utilities/blast_to_df.py\n", "import pandas as pd\n", "\n", "from blast_to_df import blast_to_df\n", "from pyfaidx import Fasta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll also define a function we can use later to save the individual sequences as single files for input into BLAST. This apparent counterproductivity approach where we start out with single files of the sequences in FASTA format and later save them again is to make it easier for researchers with sequences already in a file to plug into this workflow as mentioned above." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def one_seq_to_file(prot_seq, description, name_to_use):\n", " '''\n", " function to save a sequence as a file in FASTA-format\n", " \n", " NOTE THAT NOT JUST USING `!faidx --split-files ` because\n", " want to know what files will be named for each. In fact can use same name\n", " because will delete.\n", " '''\n", " # save the protein sequence as FASTA\n", " chunk_size = 70 #<---amino acids per line to have in FASTA (note this chunking to\n", " # multiple lines is the opposite of what PatMatch's `unjustify_fasta.pl` does)\n", " prot_seq_chunks = [prot_seq[i:i+chunk_size] for i in range(\n", " 0, len(prot_seq),chunk_size)]\n", " prot_seq_fa = \">\" + description + \"\\n\"+ \"\\n\".join(prot_seq_chunks)\n", " p_output_file_name = name_to_use\n", " with open(p_output_file_name, 'w') as output:\n", " output.write(prot_seq_fa)\n", " sys.stderr.write(\"\\n\\nProtein sequence saved as \"\n", " \"`{}`.\".format(p_output_file_name)) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you are prepared to run BLAST to search the deduced protein sequences from the *Encephalitozoon cuniculi* genome for the homologs of the three yeast genes. If you provided sequences as part of this, it will look for homologs of those among the deduced protein sequences from the *Encephalitozoon cuniculi* genome." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use BLAST to search the deduced proteins sequences for matches to the specified genes\n", "\n", "This going to use protein sequences from the *Encephalitozoon cuniculi* genome and make a database so it is searchable and then search for matches to each the genes. The information on the best match will be collected. One use for that information will be collecting the corresponding sequences later." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Building a new DB, current time: 03/06/2024 22:04:27\n", "New DB name: /home/jovyan/notebooks/ec.pep.fas\n", "New DB title: ec.pep.fas\n", "Sequence type: Protein\n", "Keep MBits: T\n", "Maximum file size: 3000000000B\n", "Adding sequences from FASTA; added 1834 sequences in 0.137166 seconds.\n", "\n", "\n" ] } ], "source": [ "!makeblastdb -in ec.pep.fas -dbtype prot" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "\n", "Protein sequence saved as `blast_input_seq.fa`.Provided results read and converted to a dataframe...\n", "\n", "A dataframe of the data has been saved as a file\n", "in a manner where other Python programs can access it (pickled form).\n", "RESULTING DATAFRAME is stored as ==> 'BLAST_pickled_df.pkl'\n", "\n", "Returning a dataframe with the information as well.\n", "\n", "Protein sequence saved as `blast_input_seq.fa`.Provided results read and converted to a dataframe...\n", "\n", "A dataframe of the data has been saved as a file\n", "in a manner where other Python programs can access it (pickled form).\n", "RESULTING DATAFRAME is stored as ==> 'BLAST_pickled_df.pkl'\n", "\n", "Returning a dataframe with the information as well.\n", "\n", "Protein sequence saved as `blast_input_seq.fa`.Provided results read and converted to a dataframe...\n", "\n", "A dataframe of the data has been saved as a file\n", "in a manner where other Python programs can access it (pickled form).\n", "RESULTING DATAFRAME is stored as ==> 'BLAST_pickled_df.pkl'\n", "\n", "Returning a dataframe with the information as well." ] } ], "source": [ "records = Fasta(seq_file_to_use)\n", "dfs = []\n", "for seq in records:\n", " one_seq_to_file(str(seq), seq.long_name, \"blast_input_seq.fa\") # if you are seeing the caret ,a.k.a., less-than-symbol,\n", " # get included in the name, switch `seq.long_name` to `seq.name`. I've seen where if the FASTA sequence file isn't \n", " # standard and instead includes a blank line above the description lines (after the first), the `seq.long_name` will\n", " # include `>` at the start. I chose to use `seq.long_name` to pass so that one has the richest set of\n", " # options for adpating the code for naming of the files.\n", " result = !blastp -query blast_input_seq.fa -db ec.pep.fas -outfmt \"6 qseqid sseqid stitle pident qcovs length mismatch gapopen qstart qend sstart send qframe sframe frames evalue bitscore qseq sseq\" -task blastp\n", " dfs.append(blast_to_df(result.n))\n", " !rm blast_input_seq.fa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`dfs` results in a list of dataframes, one for each search. \n", "\n", "For example, the first one in the list can be displayed by running the following code:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
qseqidsseqidstitlepidentqcovslengthmismatchgapopenqstartqendsstartsendqframesframeframesevaluebitscoreqseqsseq
0NOP1KMV65234KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:9...58.1238227710045732410279111/12.610000e-112324.0GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RH...GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRF...
1NOP1KMV65287KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:1...28.8892590494160240141224111/11.100000e-0233.1DELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGR...EDFYITP----YYAGHVLGAAMFHV--VVGDQSVVYTGDYSTTPDK...
2NOP1KMV66016KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:1...22.05921685112172841075111/19.800000e-0228.9IIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLK...VTPALKERRH--TYAIIVGIIDVTFVLLQRKDGEREICSVANLHLE...
3NOP1KMV66763KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:9...28.289441529271315632174111/14.900000e+0025.0GGSRGGFGGRGGSRGGARGG-SRGGFGGR-GGSRGGARGGSRGGFG...SGGRRGAGQKSQRIRGYRGGTSSGSHLGRSGGKRSDRSGCERGSFP...
4NOP1KMV65333KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:2...31.9151447320222268463509111/19.400000e+0023.9EDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVEDSRPVFRFKPAIAPVQCAIGYLIHFDEFNEHILNIKKFLTDNGLVV
\n", "
" ], "text/plain": [ " qseqid sseqid stitle pident \\\n", "0 NOP1 KMV65234 KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:9... 58.123 \n", "1 NOP1 KMV65287 KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:1... 28.889 \n", "2 NOP1 KMV66016 KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:1... 22.059 \n", "3 NOP1 KMV66763 KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:9... 28.289 \n", "4 NOP1 KMV65333 KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:2... 31.915 \n", "\n", " qcovs length mismatch gapopen qstart qend sstart send qframe \\\n", "0 82 277 100 4 57 324 10 279 1 \n", "1 25 90 49 4 160 240 141 224 1 \n", "2 21 68 51 1 217 284 10 75 1 \n", "3 44 152 92 7 13 156 32 174 1 \n", "4 14 47 32 0 222 268 463 509 1 \n", "\n", " sframe frames evalue bitscore \\\n", "0 1 1/1 2.610000e-112 324.0 \n", "1 1 1/1 1.100000e-02 33.1 \n", "2 1 1/1 9.800000e-02 28.9 \n", "3 1 1/1 4.900000e+00 25.0 \n", "4 1 1/1 9.400000e+00 23.9 \n", "\n", " qseq \\\n", "0 GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RH... \n", "1 DELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGR... \n", "2 IIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLK... \n", "3 GGSRGGFGGRGGSRGGARGG-SRGGFGGR-GGSRGGARGGSRGGFG... \n", "4 EDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVV \n", "\n", " sseq \n", "0 GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRF... \n", "1 EDFYITP----YYAGHVLGAAMFHV--VVGDQSVVYTGDYSTTPDK... \n", "2 VTPALKERRH--TYAIIVGIIDVTFVLLQRKDGEREICSVANLHLE... \n", "3 SGGRRGAGQKSQRIRGYRGGTSSGSHLGRSGGKRSDRSGCERGSFP... \n", "4 EDSRPVFRFKPAIAPVQCAIGYLIHFDEFNEHILNIKKFLTDNGLVV " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfs[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Displaying the results for the first one in the list, at index zero, shows that the first hit for Nop1p, the highly conserved protein fibrillarin, has high percent identity over 277 residues and a good bitscore. The others are not as good.\n", "\n", "(Note if you'd prefer to see the BLAST results as text this is covered after a few more cells below under 'Text result'.)\n", "\n", "Because the description in the `stitle` column gets cut off, this following trick can be used to see more of that and that column. The sequences aren't as useful but they are shown," ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
qseqidsseqidstitlepidentqcovslengthmismatchgapopenqstartqendsstartsendqframesframeframesevaluebitscoreqseqsseq
0NOP1KMV65234KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:96494:97369:1 gene:M970_100740 transcript:KMV65234 gene_biotype:protein_coding transcript_biotype:protein_coding description:fibrillarin58.1238227710045732410279111/12.610000e-112324.0GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RHAGVYIARGKEDLLVTKNMAPGESVYGEKRISVEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIMGGLDELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGRELISMAKKRPNIIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCIDSTVDAETVFAREVQKLREERIKPLEQLTLEPYERDHCIVVGRYMRSGGRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRFPGVYISRGKEDLLLTRNLVPGVSVYGEKRVAVD-------LEGMKVEYRVWNAYRSKLAAGIVCGAENIHMEPGSKVLYLGASSGTTVSHVSDIVGKDGVVYAVEFSERSGRDLINMSMKRPNIVPIIEDARYPSRYRMLVPIVDCIFSDVSQPDQTRIVALNAQYFLKEGGGVDVSIKANCVNSAVPAETVFADEVNILRKNSIRPKEQVTLEPFEKDHAMIIGRFKLSA
1NOP1KMV65287KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:168660:170150:1 gene:M970_101280 transcript:KMV65287 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein28.8892590494160240141224111/11.100000e-0233.1DELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGREL--ISMAKKRPNII-------PIIEDARHPQKYRMLIGMVDCVEDFYITP----YYAGHVLGAAMFHV--VVGDQSVVYTGDYSTTPDKHLGPASIKCIRPDLLITESTYGSITRDCRKVKEREFLKAVSDCV
2NOP1KMV66016KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:150158:150508:-1 gene:M970_061160 transcript:KMV66016 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein22.05921685112172841075111/19.800000e-0228.9IIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCIDSTVDAETVTPALKERRH--TYAIIVGIIDVTFVLLQRKDGEREICSVANLHLEDEAFDIKGLSAEEIGKLIPEDT
3NOP1KMV66763KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:92624:94390:-1 gene:M970_010930 transcript:KMV66763 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein28.289441529271315632174111/14.900000e+0025.0GGSRGGFGGRGGSRGGARGG-SRGGFGGR-GGSRGGARGGSRGGFGGRGGSRGGARGGSR-----GGRGGAAGGARGGAKVVIEPHRHAGVYI-ARGKEDLLVTKNMAPGESVYGEKRISVEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIMSGGRRGAGQKSQRIRGYRGGTSSGSHLGRSGGKRSDRSGCERGSFPMKGEAVTGGRYELRMSEDIGVQGPTGEGGQQGERCQEE------VYVDAKERDSSNRRCSVRPANEVFGREMCRLHSSPLEDG--KDRIE-KIESMLRNDLADGIV
4NOP1KMV65333KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:231151:232890:1 gene:M970_101740 transcript:KMV65333 gene_biotype:protein_coding transcript_biotype:protein_coding description:glycyl-tRNA synthetase31.9151447320222268463509111/19.400000e+0023.9EDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVEDSRPVFRFKPAIAPVQCAIGYLIHFDEFNEHILNIKKFLTDNGLVV
\n", "
" ], "text/plain": [ " qseqid sseqid \\\n", "0 NOP1 KMV65234 \n", "1 NOP1 KMV65287 \n", "2 NOP1 KMV66016 \n", "3 NOP1 KMV66763 \n", "4 NOP1 KMV65333 \n", "\n", " stitle \\\n", "0 KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:96494:97369:1 gene:M970_100740 transcript:KMV65234 gene_biotype:protein_coding transcript_biotype:protein_coding description:fibrillarin \n", "1 KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:168660:170150:1 gene:M970_101280 transcript:KMV65287 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein \n", "2 KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:150158:150508:-1 gene:M970_061160 transcript:KMV66016 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein \n", "3 KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:92624:94390:-1 gene:M970_010930 transcript:KMV66763 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein \n", "4 KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:231151:232890:1 gene:M970_101740 transcript:KMV65333 gene_biotype:protein_coding transcript_biotype:protein_coding description:glycyl-tRNA synthetase \n", "\n", " pident qcovs length mismatch gapopen qstart qend sstart send \\\n", "0 58.123 82 277 100 4 57 324 10 279 \n", "1 28.889 25 90 49 4 160 240 141 224 \n", "2 22.059 21 68 51 1 217 284 10 75 \n", "3 28.289 44 152 92 7 13 156 32 174 \n", "4 31.915 14 47 32 0 222 268 463 509 \n", "\n", " qframe sframe frames evalue bitscore \\\n", "0 1 1 1/1 2.610000e-112 324.0 \n", "1 1 1 1/1 1.100000e-02 33.1 \n", "2 1 1 1/1 9.800000e-02 28.9 \n", "3 1 1 1/1 4.900000e+00 25.0 \n", "4 1 1 1/1 9.400000e+00 23.9 \n", "\n", " qseq \\\n", "0 GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RHAGVYIARGKEDLLVTKNMAPGESVYGEKRISVEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIMGGLDELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGRELISMAKKRPNIIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCIDSTVDAETVFAREVQKLREERIKPLEQLTLEPYERDHCIVVGRYMRSG \n", "1 DELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGREL--ISMAKKRPNII-------PIIEDARHPQKYRMLIGMVDCV \n", "2 IIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCIDSTVDAET \n", "3 GGSRGGFGGRGGSRGGARGG-SRGGFGGR-GGSRGGARGGSRGGFGGRGGSRGGARGGSR-----GGRGGAAGGARGGAKVVIEPHRHAGVYI-ARGKEDLLVTKNMAPGESVYGEKRISVEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIM \n", "4 EDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVV \n", "\n", " sseq \n", "0 GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRFPGVYISRGKEDLLLTRNLVPGVSVYGEKRVAVD-------LEGMKVEYRVWNAYRSKLAAGIVCGAENIHMEPGSKVLYLGASSGTTVSHVSDIVGKDGVVYAVEFSERSGRDLINMSMKRPNIVPIIEDARYPSRYRMLVPIVDCIFSDVSQPDQTRIVALNAQYFLKEGGGVDVSIKANCVNSAVPAETVFADEVNILRKNSIRPKEQVTLEPFEKDHAMIIGRFKLSA \n", "1 EDFYITP----YYAGHVLGAAMFHV--VVGDQSVVYTGDYSTTPDKHLGPASIKCIRPDLLITESTYGSITRDCRKVKEREFLKAVSDCV \n", "2 VTPALKERRH--TYAIIVGIIDVTFVLLQRKDGEREICSVANLHLEDEAFDIKGLSAEEIGKLIPEDT \n", "3 SGGRRGAGQKSQRIRGYRGGTSSGSHLGRSGGKRSDRSGCERGSFPMKGEAVTGGRYELRMSEDIGVQGPTGEGGQQGERCQEE------VYVDAKERDSSNRRCSVRPANEVFGREMCRLHSSPLEDG--KDRIE-KIESMLRNDLADGIV \n", "4 EDSRPVFRFKPAIAPVQCAIGYLIHFDEFNEHILNIKKFLTDNGLVV " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#trick from based on https://stackoverflow.com/a/30691921 and https://stackoverflow.com/a/25352191/8508004\n", "with pd.option_context('display.max_rows', None, 'display.max_columns', None,'display.max_colwidth', None):\n", " display(dfs[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The individual dataframes of results can be used in some of the ways demonstrated earlier in this series of notebooks. \n", "For example, it would be easy to filter based on the values in the bitscore column, like in the following cell. Note to make the filter step easier to read syntax-wise, first I assign the first dataframe from the list of datafames to a simpler moniker, of just `df`. \n", "Then I specify to subset that dataframe to just cases where the values in the dataframe's 'bitscore' column exceed 100. " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
qseqidsseqidstitlepidentqcovslengthmismatchgapopenqstartqendsstartsendqframesframeframesevaluebitscoreqseqsseq
0NOP1KMV65234KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:9...58.1238227710045732410279111/12.610000e-112324.0GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RH...GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRF...
\n", "
" ], "text/plain": [ " qseqid sseqid stitle pident \\\n", "0 NOP1 KMV65234 KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:9... 58.123 \n", "\n", " qcovs length mismatch gapopen qstart qend sstart send qframe \\\n", "0 82 277 100 4 57 324 10 279 1 \n", "\n", " sframe frames evalue bitscore \\\n", "0 1 1/1 2.610000e-112 324.0 \n", "\n", " qseq \\\n", "0 GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RH... \n", "\n", " sseq \n", "0 GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRF... " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = dfs[0].copy() #make a copy assigned to `df` so next line easier to read\n", "df[df['bitscore'] > 100]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text result\n", "\n", "If you are more used to viewing the text output from BLAST to assess how things worked, you can modify the code used to generate the dataframes above in order show that instead of generating dataframes. In an attempt to make it easier to scroll through to view the results by limiting the output in the window, the results will be limited to the first gene in the list, Nop1, by adding `break` so the `for` loop ends when it is hit. \n", "Here is a reworked version to show the text output." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "\n", "Protein sequence saved as `blast_input_seq.fa`." ] }, { "name": "stdout", "output_type": "stream", "text": [ "BLASTP 2.15.0+\n", "\n", "\n", "Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.\n", "Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.\n", "Lipman (1997), \"Gapped BLAST and PSI-BLAST: a new generation of\n", "protein database search programs\", Nucleic Acids Res. 25:3389-3402.\n", "\n", "\n", "Reference for composition-based statistics: Alejandro A. Schaffer,\n", "L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri\n", "I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),\n", "\"Improving the accuracy of PSI-BLAST protein database searches with\n", "composition-based statistics and other refinements\", Nucleic Acids\n", "Res. 29:2994-3005.\n", "\n", "\n", "\n", "Database: ec.pep.fas\n", " 1,834 sequences; 653,245 total letters\n", "\n", "\n", "\n", "Query= NOP1 YDL014W\n", "\n", "Length=328\n", " Score E\n", "Sequences producing significant alignments: (Bits) Value\n", "\n", "KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:96494:97369:1 gene:M... 324 3e-112\n", "KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:168660:170150:1 gene... 33.1 0.011 \n", "KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:150158:150508:-1 gen... 28.9 0.098 \n", "KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:92624:94390:-1 gene:... 25.0 4.9 \n", "KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:231151:232890:1 gene... 23.9 9.4 \n", "\n", "\n", ">KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:96494:97369:1 gene:M970_100740 \n", "transcript:KMV65234 gene_biotype:protein_coding \n", "transcript_biotype:protein_coding description:fibrillarin\n", "Length=291\n", "\n", " Score = 324 bits (830), Expect = 3e-112, Method: Compositional matrix adjust.\n", " Identities = 161/277 (58%), Positives = 208/277 (75%), Gaps = 16/277 (6%)\n", "\n", "Query 57 GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RHAGVYIARGKEDLLV 107\n", " GR +GG R G GRG +G KV++EPH R GVYI+RGKEDLL+\n", "Sbjct 10 GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRFPGVYISRGKEDLLL 69\n", "\n", "Query 108 TKNMAPGESVYGEKRISVEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIMGGLDELFIAPG 167\n", " T+N+ PG SVYGEKR++V+ + KVEYRVWN +RSKLAAGI+ G + + + PG\n", "Sbjct 70 TRNLVPGVSVYGEKRVAVD-------LEGMKVEYRVWNAYRSKLAAGIVCGAENIHMEPG 122\n", "\n", "Query 168 KKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGRELISMAKKRPNIIPIIEDARHP 227\n", " KVLYLGA+SGT+VSHVSD+VG +GVVYAVEFS R GR+LI+M+ KRPNI+PIIEDAR+P\n", "Sbjct 123 SKVLYLGASSGTTVSHVSDIVGKDGVVYAVEFSERSGRDLINMSMKRPNIVPIIEDARYP 182\n", "\n", "Query 228 QKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCIDSTVDAETVFA 287\n", " +YRML+ +VDC+F+DV+QPDQ RI+ALN+ FLK+ GGV +SIKANC++S V AETVFA\n", "Sbjct 183 SRYRMLVPIVDCIFSDVSQPDQTRIVALNAQYFLKEGGGVDVSIKANCVNSAVPAETVFA 242\n", "\n", "Query 288 REVQKLREERIKPLEQLTLEPYERDHCIVVGRYMRSG 324\n", " EV LR+ I+P EQ+TLEP+E+DH +++GR+ S \n", "Sbjct 243 DEVNILRKNSIRPKEQVTLEPFEKDHAMIIGRFKLSA 279\n", "\n", "\n", ">KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:168660:170150:1 \n", "gene:M970_101280 transcript:KMV65287 gene_biotype:protein_coding \n", "transcript_biotype:protein_coding description:hypothetical \n", "protein\n", "Length=496\n", "\n", " Score = 33.1 bits (74), Expect = 0.011, Method: Compositional matrix adjust.\n", " Identities = 26/90 (29%), Positives = 42/90 (47%), Gaps = 15/90 (17%)\n", "\n", "Query 160 DELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGREL--ISMAKKRPNI 217\n", " ++ +I P Y G G ++ HV VVG + VVY ++S P + L S+ RP++\n", "Sbjct 141 EDFYITP----YYAGHVLGAAMFHV--VVGDQSVVYTGDYSTTPDKHLGPASIKCIRPDL 194\n", "\n", "Query 218 I-------PIIEDARHPQKYRMLIGMVDCV 240\n", " + I D R ++ L + DCV\n", "Sbjct 195 LITESTYGSITRDCRKVKEREFLKAVSDCV 224\n", "\n", "\n", ">KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:150158:150508:-1 \n", "gene:M970_061160 transcript:KMV66016 gene_biotype:protein_coding \n", "transcript_biotype:protein_coding description:hypothetical \n", "protein\n", "Length=106\n", "\n", " Score = 28.9 bits (63), Expect = 0.098, Method: Composition-based stats.\n", " Identities = 15/68 (22%), Positives = 35/68 (51%), Gaps = 2/68 (3%)\n", "\n", "Query 217 IIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCI 276\n", " + P +++ RH Y +++G++D F + + D R I +++ L+D+ + + A I\n", "Sbjct 10 VTPALKERRH--TYAIIVGIIDVTFVLLQRKDGEREICSVANLHLEDEAFDIKGLSAEEI 67\n", "\n", "Query 277 DSTVDAET 284\n", " + +T\n", "Sbjct 68 GKLIPEDT 75\n", "\n", "\n", ">KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:92624:94390:-1 gene:M970_010930 \n", "transcript:KMV66763 gene_biotype:protein_coding \n", "transcript_biotype:protein_coding description:hypothetical \n", "protein\n", "Length=588\n", "\n", " Score = 25.0 bits (53), Expect = 4.9, Method: Compositional matrix adjust.\n", " Identities = 43/152 (28%), Positives = 65/152 (43%), Gaps = 17/152 (11%)\n", "\n", "Query 13 GGSRGGFGGRGGSRGGARGG-SRGGFGGR-GGSRGGARGGSRGGFGGRGGSRGGARGGSR 70\n", " G R G G + G RGG S G GR GG R G RG F +G + G R R\n", "Sbjct 32 SGGRRGAGQKSQRIRGYRGGTSSGSHLGRSGGKRSDRSGCERGSFPMKGEAVTGGRYELR 91\n", "\n", "Query 71 -----GGRGGAAGGARGGAKVVIEPHRHAGVYI-ARGKEDLLVTKNMAPGESVYGEKRIS 124\n", " G +G G + G + E VY+ A+ ++ ++ P V+G + \n", "Sbjct 92 MSEDIGVQGPTGEGGQQGERCQEE------VYVDAKERDSSNRRCSVRPANEVFGREMCR 145\n", "\n", "Query 125 VEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIM 156\n", " + EDG ++E ++ + R+ LA GI+\n", "Sbjct 146 LHSSPLEDG--KDRIE-KIESMLRNDLADGIV 174\n", "\n", "\n", ">KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:231151:232890:1 \n", "gene:M970_101740 transcript:KMV65333 gene_biotype:protein_coding \n", "transcript_biotype:protein_coding description:glycyl-tRNA \n", "synthetase\n", "Length=579\n", "\n", " Score = 23.9 bits (50), Expect = 9.4, Method: Compositional matrix adjust.\n", " Identities = 15/47 (32%), Positives = 21/47 (45%), Gaps = 0/47 (0%)\n", "\n", "Query 222 EDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVV 268\n", " ED+R +++ I V C + D+ LN FL D G VV\n", "Sbjct 463 EDSRPVFRFKPAIAPVQCAIGYLIHFDEFNEHILNIKKFLTDNGLVV 509\n", "\n", "\n", "\n", "Lambda K H a alpha\n", " 0.319 0.141 0.423 0.792 4.96 \n", "\n", "Gapped\n", "Lambda K H a alpha sigma\n", " 0.267 0.0410 0.140 1.90 42.6 43.6 \n", "\n", "Effective search space used: 126581391\n", "\n", "\n", " Database: ec.pep.fas\n", " Posted date: Mar 6, 2024 10:04 PM\n", " Number of letters in database: 653,245\n", " Number of sequences in database: 1,834\n", "\n", "\n", "\n", "Matrix: BLOSUM62\n", "Gap Penalties: Existence: 11, Extension: 1\n", "Neighboring words threshold: 11\n", "Window for multiple hits: 40\n" ] } ], "source": [ "records = Fasta(seq_file_to_use)\n", "for seq in records:\n", " one_seq_to_file(str(seq), seq.long_name, \"blast_input_seq.fa\")\n", " !blastp -query blast_input_seq.fa -db ec.pep.fas -task blastp\n", " !rm blast_input_seq.fa\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It clearly shows in the description of the best match for Nop1p that it is annotated as fibrillarin, and so it seems indeed the top hit identified is the ortholog of *S. cerevisiae* Nop1p.\n", "\n", "To make it easier to review the results, you send them to a text file that you review here or download. This modification of the code when run, will send **the search results for all the sequences** to a separate file." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "\n", "Protein sequence saved as `blast_input_seq.fa`.\n", "\n", "Protein sequence saved as `blast_input_seq.fa`.\n", "\n", "Protein sequence saved as `blast_input_seq.fa`." ] } ], "source": [ "records = Fasta(seq_file_to_use)\n", "!touch results.txt\n", "!rm results.txt # adding a deletion so if you run this cell more than once it\n", "# will start a fresh file and then append to it\n", "!touch results.txt\n", "for seq in records:\n", " one_seq_to_file(str(seq), seq.long_name, \"blast_input_seq.fa\")\n", " !blastp -query blast_input_seq.fa -db ec.pep.fas -task blastp >>results.txt\n", " !rm blast_input_seq.fa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are using the demonstration with the sequences as it was originally writen, the following command will show the size of the `results.txt` file as 43k." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-r--r-- 1 jovyan jovyan 43K Mar 6 22:04 results.txt\n" ] } ], "source": [ "%%bash\n", "ls -lah results.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Go to the dashboard by clicking on the Jupyter icon in the upper left and then you can review it like a text file. Or download this file if you'd like to keep it, or if you think you can view it better outside of Jupyter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 4 }