{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Searching for homologs among deduced proteins from a genome using BLAST and Python\n",
    "\n",
    "This notebook will take three specified budding yeast genes and identify & collect the protein sequences of homologs among a collection of deduced proteins available for a genome from Ensembl.\n",
    "\n",
    "This notebook builds on the previous ones. See those for introduction and credits. \n",
    "\n",
    "This is meant to demonstrate using BLAST+, Python, and a few of my Python scripts to accomplish a task in a series of steps. \n",
    "\n",
    "How to modify the input to use your own sequence or sequences is described.\n",
    "\n",
    "The file used here contains the deduced sequences predicted proteins encoded by the genome of *Encephalitozoon cuniculi*. (The details of these sequences are just below in an expandable section; however, the option doesn't render statically via Github; use [nbviewer](https://nbviewer.jupyter.org/github/fomightez/blast-binder/blob/master/notebooks/Searching%20for%20homologs%20among%20deduced%20proteins%20from%20a%20genome%20using%20BLAST%20and%20Python.ipynb) or [launch an active session via MyBinder](https://mybinder.org/v2/gh/fomightez/blast-binder/master?filepath=notebooks/Searching%20for%20homologs%20among%20deduced%20proteins%20from%20a%20genome%20using%20BLAST%20and%20Python.ipynb).)\n",
    "\n",
    "<details>\n",
    "  <summary>Click the arrow below to expand to fuller explanation of the sequence source.</summary>\n",
    "  \n",
    "From [the Ensembl genome page for Encephalitozoon cuniculi](https://fungi.ensembl.org/Encephalitozoon_cuniculi_ecuniii_l_gca_001078035/Info/Index), under 'Gene Annoation', you'd click 'FASTA' next to the text 'Download genes, cDNAs, ncRNA, proteins -'. That leads to the [Downloads page](ftp://ftp.ensemblgenomes.org/pub/fungi/release-48/fasta/fungi_microsporidia1_collection/encephalitozoon_cuniculi_ecuniii_l_gca_001078035)\n",
    "which lists:\n",
    "\n",
    "```bash\n",
    "cdna/\t\t8/17/20, 1:24:00 PM\n",
    "cds/\t\t8/17/20, 1:24:00 PM\n",
    "dna/\t\t8/17/20, 1:24:00 PM\n",
    "dna_index/\t\t8/17/20, 1:24:00 PM\n",
    "ncrna/\t\t8/17/20, 1:24:00 PM\n",
    "pep/\t\t8/17/20, 1:24:00 PM\n",
    "```\n",
    "\n",
    "Specifically the deduced protein sequences is the `pep` entry at the bottom of that list and can be directly retrieved from:\n",
    "ftp://ftp.ensemblgenomes.org/pub/fungi/release-48/fasta/fungi_microsporidia1_collection/encephalitozoon_cuniculi_ecuniii_l_gca_001078035/pep/\n",
    "\n",
    "Source information for the Encephalitozoon cuniculi sequence data:  \n",
    "https://www.ebi.ac.uk/ena/browser/view/GCA_001078035.1\n",
    "    \n",
    "Sadly, the sequence file we use here cannot be directly retrieved via `curl` or `wget` from the source because the port for ftp is blocked on MyBinder.org to prevent abuse. However, you could easily download the file and then upload it to the active session running via MyBinder.org. To save on that step it is already included in this repository and so is present when the session launches. \n",
    "</details>\n",
    "\n",
    "-----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparation\n",
    "\n",
    "In this demonstration we'll get protein sequences for three genes of *S. cerevisiae* and make a multi-FASTA file from them. If you already have your own protein sequences in a file, you can upload that to this running session and skip to the step below where the input file is specified and edit the `seq_file_to_use` variable to be the name of your file. The sequence file in FASTA format you provide will need to be in the `notebooks` directory alongside this notebook. This will be revisited in greater detail below. \n",
    "\n",
    "To get the demonstration protein sequences, we'll use a script I developed that fetches protein sequences for *S. cerevisiae* genes when provided wit  a gene's systematic name, standard name, or alias as defined at gene page at yeastgenome.org. See more about that script [here](https://github.com/fomightez/yeastmine#descriptions-of-the-scripts)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100 11399  100 11399    0     0  50823      0 --:--:-- --:--:-- --:--:-- 50888\n"
     ]
    }
   ],
   "source": [
    "!curl -O https://raw.githubusercontent.com/fomightez/yeastmine/master/get_protein_seq_as_FASTA.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The intermine webservices package that script relies on has not been updated on package distribution sites in a while and so we are going to get the current version directly from Github using the following command. We'll also add a package called `pyfaidx` that makes handling FASTA-formated files easier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting https://github.com/intermine/intermine-ws-python/archive/dev.zip\n",
      "  Downloading https://github.com/intermine/intermine-ws-python/archive/dev.zip\n",
      "\u001b[2K     \u001b[32m-\u001b[0m \u001b[32m150.5 kB\u001b[0m \u001b[31m4.9 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m\n",
      "\u001b[?25h  Preparing metadata (setup.py) ... \u001b[?25ldone\n",
      "\u001b[?25hBuilding wheels for collected packages: intermine\n",
      "  Building wheel for intermine (setup.py) ... \u001b[?25ldone\n",
      "\u001b[?25h  Created wheel for intermine: filename=intermine-1.13.0-py3-none-any.whl size=75660 sha256=5f2f7c4d4b2ecae2cac9ff52c43a956dd2dc03ad8c7c2f24ffc543e1d6d299c5\n",
      "  Stored in directory: /tmp/pip-ephem-wheel-cache-_wq3fh2r/wheels/74/4c/cc/b2bc6341109f116ae83cc8424243b1a820cd4fcdf99a4f6932\n",
      "Successfully built intermine\n",
      "Installing collected packages: intermine\n",
      "Successfully installed intermine-1.13.0\n",
      "Note: you may need to restart the kernel to use updated packages.\n",
      "Collecting pyfaidx\n",
      "  Downloading pyfaidx-0.8.1.1-py3-none-any.whl.metadata (25 kB)\n",
      "Requirement already satisfied: setuptools in /srv/conda/envs/notebook/lib/python3.10/site-packages (from pyfaidx) (68.2.2)\n",
      "Requirement already satisfied: importlib-metadata in /srv/conda/envs/notebook/lib/python3.10/site-packages (from pyfaidx) (6.8.0)\n",
      "Requirement already satisfied: zipp>=0.5 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from importlib-metadata->pyfaidx) (3.17.0)\n",
      "Downloading pyfaidx-0.8.1.1-py3-none-any.whl (28 kB)\n",
      "Installing collected packages: pyfaidx\n",
      "Successfully installed pyfaidx-0.8.1.1\n",
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install https://github.com/intermine/intermine-ws-python/archive/dev.zip\n",
    "%pip install pyfaidx"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll use it to get the protein sequences corresponding to the three genes form a list.  \n",
    "(If you see, errors like `ModuleNotFoundError: No module named 'urlparse'`...`ImportError: cannot import name 'MutableMapping' from 'collections' (/srv/conda/envs/notebook/lib/python3.10/collections/__init__.py)` as you run this next cell, then read below this next cell for possible solution.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "looking up the gene associated with NOP1...getting protein sequence...\n",
      "\n",
      "File of protein sequence saved as 'S288C_YDL014W_NOP1_protein.fsa'.\n",
      "Finished.\n",
      "looking up the gene associated with VPH1...getting protein sequence...\n",
      "\n",
      "File of protein sequence saved as 'S288C_YOR270C_VPH1_protein.fsa'.\n",
      "Finished.\n",
      "looking up the gene associated with RPB1...getting protein sequence...\n",
      "\n",
      "File of protein sequence saved as 'S288C_YDL140C_RPO21_protein.fsa'.\n",
      "Finished.\n"
     ]
    }
   ],
   "source": [
    "seqs_to_get = [\"NOP1\",\"VPH1\",\"RPB1\"]\n",
    "for seq in seqs_to_get:\n",
    "    %run get_protein_seq_as_FASTA.py {seq}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(If you see `basil2` among the output, disregard it. I believe it simply an indicator that it is using the development version of the code.)  \n",
    "(If you saw errors, like `ModuleNotFoundError: No module named 'urlparse'`...`ImportError: cannot import name 'MutableMapping' from 'collections' (/srv/conda/envs/notebook/lib/python3.10/collections/__init__.py)` as you run the cell, it is possibly [because intermine needs updating](https://github.com/intermine/intermine/issues/2424). Foruntately it is easily fixed by making a new cell and running two commands: Rename the code to fix it by running in the new cell `!mv possible_webservice_fix_for_intermine_code.py webservice.py`. Then to fix run `!cp ./webservice.py /srv/conda/envs/notebook/lib/python3.10/site-packages/intermine/webservice.py`. Now delete that extra cell and try running the cell above again, and then if it works continue on with the rest.)\n",
    "\n",
    "With those protein sequences we'll combine them into one multi-FASTA sequence file using the cell below. (Most of the code in it is concerned with getting a list of the individual FASTA files without needing to hardcode them into this notebook. The putting the files together is really done in the `!cat` command.)\n",
    "\n",
    "The collected individual sequences are funneled into one sequence file by running the next cells. This may seem counterproductive because later we split the sequences back out to individual files again. However, it is done in this manner because BLAST needs a file as input and this way it is easier for others to plug-in their own sequences in FASTA form right into this notebook by uploading it here and editing the value of file name for the `seq_file_to_use` variable below. This way they don't need to relist their gene names to adapt this notebook. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import fnmatch\n",
    "fasta_files = []\n",
    "name_part_to_match = \".fsa\"\n",
    "for file in os.listdir('.'):\n",
    "    if fnmatch.fnmatch(file, '*'+name_part_to_match):\n",
    "        #print (file)\n",
    "        #first_part_filen = file.rsplit(extension_to_handle,1)[0]\n",
    "        fasta_files.append(file)\n",
    "!cat {\" \".join(fasta_files)} >protein_sequences.fasta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That should now produce a file named `protein_sequences.fasta` in the current working directory. We can check that is the case. If you run the code in the next cell it will first display the current working directory, and then the list of files in that directory. We use the `%%bash%` magic tag at the top of the cell to specify to run those commands in the shell. (For those that are Jupyter power users, you may realize this is unnecessary for these two commands; however, it is used here as what unix commands work without anything special is one of those things that novices find hard to follow when learning on Jupyter.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/jovyan/notebooks\n",
      "-*--above is working directory; below is the list of files in that directory-*--\n",
      "BLAST on Command Line and Integrating with Python.ipynb\n",
      "BLAST on the Command Line with Advanced Python.ipynb\n",
      "Encephalitozoon_cuniculi_ecuniii_l_gca_001078035.ECIIIL.dna.toplevel.fa.gz\n",
      "Encephalitozoon_cuniculi_ecuniii_l_gca_001078035.ECIIIL.pep.all.fa.gz\n",
      "get_protein_seq_as_FASTA.py\n",
      "GSD\n",
      "protein_sequences.fasta\n",
      "S288C_YDL014W_NOP1_protein.fsa\n",
      "S288C_YDL140C_RPO21_protein.fsa\n",
      "S288C_YOR270C_VPH1_protein.fsa\n",
      "Searching for coding sequences in genomes using BLAST and Python.ipynb\n",
      "Searching for homologs among deduced proteins from a genome using BLAST and Python.ipynb\n",
      "Untitled.ipynb\n",
      "webservice.py\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "pwd\n",
    "echo \"-*--above is working directory; below is the list of files in that directory-*--\"\n",
    "ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should see `protein_sequences.fasta` listed there.\n",
    "\n",
    "The file `protein_sequences.fasta` contains the sequences we'll use in the remainder of this notebook to look for homologs among the deduced protein sequences from *Encephalitozoon cuniculi*. If you already have a file of your own protein sequence(s) in FASTA format, you have some choices. One is to simply edit the content of `protein_sequences.fasta` to replace it with your sequence or sequences. The more direct way, and one that allows you to skip any of the above cells once you know what you are doing is that you upload your sequence file to this session and change the file name below to match your sequence file. To do that you edit the text between the quotes. **Note that the sequence file in FASTA format you provide will need to be located in the `notebooks` directory alongside this notebook.** To help in troubleshooting the upload and placing the file in the correct directory, you should see your file listed now if you re-run the cell above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "seq_file_to_use = \"protein_sequences.fasta\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sadly, the file containing the deduced sequences predicted proteins encoded by the genome of *Encephalitozoon cuniculi* we use here cannot be directly retrieved via `curl` or `wget` from the source because the port for ftp is blocked on MyBinder.org to prevent abuse. However, you could easily download the file and then upload it to the active session running via MyBinder.org. To save on that step it is already included in this repository and so is present when the session launches. \n",
    "\n",
    "We need to first extract the sequence file from a compressed form using the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "gunzip Encephalitozoon_cuniculi_ecuniii_l_gca_001078035.ECIIIL.pep.all.fa.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's rename the sequence file something more manageable, `ec.pep.fas`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "mv Encephalitozoon_cuniculi_ecuniii_l_gca_001078035.ECIIIL.pep.all.fa ec.pep.fas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the `blast_to_df` script that will be used to convert the raw output from BLAST to something useful in Python by running the following cell.  We'll also go ahead and import the main function from the script that allows sending BLAST output to Python dataframes so that we can use it here. Also we prepare to make files from each sequence so that we can feed as input into BLAST."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100 13771  100 13771    0     0  57280      0 --:--:-- --:--:-- --:--:-- 57141\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "file_needed = \"blast_to_df.py\"\n",
    "if not os.path.isfile(file_needed):\n",
    "    !curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/blast-utilities/blast_to_df.py\n",
    "import pandas as pd\n",
    "\n",
    "from blast_to_df import blast_to_df\n",
    "from pyfaidx import Fasta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll also define a function we can use later to save the individual sequences as single files for input into BLAST. This apparent counterproductivity approach where we start out with single files of the sequences in FASTA format and later save them again is to make it easier for researchers with sequences already in a file to plug into this workflow as mentioned above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def one_seq_to_file(prot_seq, description, name_to_use):\n",
    "    '''\n",
    "    function to save a sequence as a file in FASTA-format\n",
    "    \n",
    "    NOTE THAT NOT JUST USING `!faidx --split-files <seq_file>` because\n",
    "    want to know what files will be named for each. In fact can use same name\n",
    "    because will delete.\n",
    "    '''\n",
    "    # save the protein sequence as FASTA\n",
    "    chunk_size = 70 #<---amino acids per line to have in FASTA (note this chunking to\n",
    "    # multiple lines is the opposite of what PatMatch's `unjustify_fasta.pl` does)\n",
    "    prot_seq_chunks = [prot_seq[i:i+chunk_size] for i in range(\n",
    "        0, len(prot_seq),chunk_size)]\n",
    "    prot_seq_fa = \">\" + description + \"\\n\"+ \"\\n\".join(prot_seq_chunks)\n",
    "    p_output_file_name = name_to_use\n",
    "    with open(p_output_file_name, 'w') as output:\n",
    "        output.write(prot_seq_fa)\n",
    "    sys.stderr.write(\"\\n\\nProtein sequence saved as \"\n",
    "        \"`{}`.\".format(p_output_file_name))  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you are prepared to run BLAST to search the deduced protein sequences from the *Encephalitozoon cuniculi* genome for the homologs of the three yeast genes. If you provided sequences as part of this, it will look for homologs of those among the deduced protein sequences from the *Encephalitozoon cuniculi* genome."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use BLAST to search the deduced proteins sequences for matches to the specified genes\n",
    "\n",
    "This going to use protein sequences from the *Encephalitozoon cuniculi* genome and make a database so it is searchable and then search for matches to each the genes. The information on the best match will be collected. One use for that information will be collecting the corresponding sequences later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "Building a new DB, current time: 03/06/2024 22:04:27\n",
      "New DB name:   /home/jovyan/notebooks/ec.pep.fas\n",
      "New DB title:  ec.pep.fas\n",
      "Sequence type: Protein\n",
      "Keep MBits: T\n",
      "Maximum file size: 3000000000B\n",
      "Adding sequences from FASTA; added 1834 sequences in 0.137166 seconds.\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!makeblastdb -in ec.pep.fas -dbtype prot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "Protein sequence saved as `blast_input_seq.fa`.Provided results read and converted to a dataframe...\n",
      "\n",
      "A dataframe of the data has been saved as a file\n",
      "in a manner where other Python programs can access it (pickled form).\n",
      "RESULTING DATAFRAME is stored as ==> 'BLAST_pickled_df.pkl'\n",
      "\n",
      "Returning a dataframe with the information as well.\n",
      "\n",
      "Protein sequence saved as `blast_input_seq.fa`.Provided results read and converted to a dataframe...\n",
      "\n",
      "A dataframe of the data has been saved as a file\n",
      "in a manner where other Python programs can access it (pickled form).\n",
      "RESULTING DATAFRAME is stored as ==> 'BLAST_pickled_df.pkl'\n",
      "\n",
      "Returning a dataframe with the information as well.\n",
      "\n",
      "Protein sequence saved as `blast_input_seq.fa`.Provided results read and converted to a dataframe...\n",
      "\n",
      "A dataframe of the data has been saved as a file\n",
      "in a manner where other Python programs can access it (pickled form).\n",
      "RESULTING DATAFRAME is stored as ==> 'BLAST_pickled_df.pkl'\n",
      "\n",
      "Returning a dataframe with the information as well."
     ]
    }
   ],
   "source": [
    "records = Fasta(seq_file_to_use)\n",
    "dfs = []\n",
    "for seq in records:\n",
    "    one_seq_to_file(str(seq), seq.long_name, \"blast_input_seq.fa\") # if you are seeing the caret ,a.k.a., less-than-symbol,\n",
    "    # get included in the name, switch `seq.long_name` to `seq.name`. I've seen where if the FASTA sequence file isn't \n",
    "    # standard and instead includes a blank line above the description lines (after the first), the `seq.long_name` will\n",
    "    # include `>` at the start. I chose to use `seq.long_name` to pass so that one has the richest set of\n",
    "    # options for adpating the code for naming of the files.\n",
    "    result = !blastp -query blast_input_seq.fa -db ec.pep.fas -outfmt \"6 qseqid sseqid stitle pident qcovs length mismatch gapopen qstart qend sstart send qframe sframe frames evalue bitscore qseq sseq\" -task blastp\n",
    "    dfs.append(blast_to_df(result.n))\n",
    "    !rm blast_input_seq.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`dfs` results in a list of dataframes, one for each search.  \n",
    "\n",
    "For example, the first one in the list can be displayed by running the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>qseqid</th>\n",
       "      <th>sseqid</th>\n",
       "      <th>stitle</th>\n",
       "      <th>pident</th>\n",
       "      <th>qcovs</th>\n",
       "      <th>length</th>\n",
       "      <th>mismatch</th>\n",
       "      <th>gapopen</th>\n",
       "      <th>qstart</th>\n",
       "      <th>qend</th>\n",
       "      <th>sstart</th>\n",
       "      <th>send</th>\n",
       "      <th>qframe</th>\n",
       "      <th>sframe</th>\n",
       "      <th>frames</th>\n",
       "      <th>evalue</th>\n",
       "      <th>bitscore</th>\n",
       "      <th>qseq</th>\n",
       "      <th>sseq</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NOP1</td>\n",
       "      <td>KMV65234</td>\n",
       "      <td>KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:9...</td>\n",
       "      <td>58.123</td>\n",
       "      <td>82</td>\n",
       "      <td>277</td>\n",
       "      <td>100</td>\n",
       "      <td>4</td>\n",
       "      <td>57</td>\n",
       "      <td>324</td>\n",
       "      <td>10</td>\n",
       "      <td>279</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1/1</td>\n",
       "      <td>2.610000e-112</td>\n",
       "      <td>324.0</td>\n",
       "      <td>GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RH...</td>\n",
       "      <td>GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRF...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NOP1</td>\n",
       "      <td>KMV65287</td>\n",
       "      <td>KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:1...</td>\n",
       "      <td>28.889</td>\n",
       "      <td>25</td>\n",
       "      <td>90</td>\n",
       "      <td>49</td>\n",
       "      <td>4</td>\n",
       "      <td>160</td>\n",
       "      <td>240</td>\n",
       "      <td>141</td>\n",
       "      <td>224</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1/1</td>\n",
       "      <td>1.100000e-02</td>\n",
       "      <td>33.1</td>\n",
       "      <td>DELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGR...</td>\n",
       "      <td>EDFYITP----YYAGHVLGAAMFHV--VVGDQSVVYTGDYSTTPDK...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NOP1</td>\n",
       "      <td>KMV66016</td>\n",
       "      <td>KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:1...</td>\n",
       "      <td>22.059</td>\n",
       "      <td>21</td>\n",
       "      <td>68</td>\n",
       "      <td>51</td>\n",
       "      <td>1</td>\n",
       "      <td>217</td>\n",
       "      <td>284</td>\n",
       "      <td>10</td>\n",
       "      <td>75</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1/1</td>\n",
       "      <td>9.800000e-02</td>\n",
       "      <td>28.9</td>\n",
       "      <td>IIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLK...</td>\n",
       "      <td>VTPALKERRH--TYAIIVGIIDVTFVLLQRKDGEREICSVANLHLE...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NOP1</td>\n",
       "      <td>KMV66763</td>\n",
       "      <td>KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:9...</td>\n",
       "      <td>28.289</td>\n",
       "      <td>44</td>\n",
       "      <td>152</td>\n",
       "      <td>92</td>\n",
       "      <td>7</td>\n",
       "      <td>13</td>\n",
       "      <td>156</td>\n",
       "      <td>32</td>\n",
       "      <td>174</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1/1</td>\n",
       "      <td>4.900000e+00</td>\n",
       "      <td>25.0</td>\n",
       "      <td>GGSRGGFGGRGGSRGGARGG-SRGGFGGR-GGSRGGARGGSRGGFG...</td>\n",
       "      <td>SGGRRGAGQKSQRIRGYRGGTSSGSHLGRSGGKRSDRSGCERGSFP...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NOP1</td>\n",
       "      <td>KMV65333</td>\n",
       "      <td>KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:2...</td>\n",
       "      <td>31.915</td>\n",
       "      <td>14</td>\n",
       "      <td>47</td>\n",
       "      <td>32</td>\n",
       "      <td>0</td>\n",
       "      <td>222</td>\n",
       "      <td>268</td>\n",
       "      <td>463</td>\n",
       "      <td>509</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1/1</td>\n",
       "      <td>9.400000e+00</td>\n",
       "      <td>23.9</td>\n",
       "      <td>EDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVV</td>\n",
       "      <td>EDSRPVFRFKPAIAPVQCAIGYLIHFDEFNEHILNIKKFLTDNGLVV</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  qseqid    sseqid                                             stitle  pident  \\\n",
       "0   NOP1  KMV65234  KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:9...  58.123   \n",
       "1   NOP1  KMV65287  KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:1...  28.889   \n",
       "2   NOP1  KMV66016  KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:1...  22.059   \n",
       "3   NOP1  KMV66763  KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:9...  28.289   \n",
       "4   NOP1  KMV65333  KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:2...  31.915   \n",
       "\n",
       "   qcovs  length  mismatch  gapopen  qstart  qend  sstart  send  qframe  \\\n",
       "0     82     277       100        4      57   324      10   279       1   \n",
       "1     25      90        49        4     160   240     141   224       1   \n",
       "2     21      68        51        1     217   284      10    75       1   \n",
       "3     44     152        92        7      13   156      32   174       1   \n",
       "4     14      47        32        0     222   268     463   509       1   \n",
       "\n",
       "   sframe frames         evalue  bitscore  \\\n",
       "0       1    1/1  2.610000e-112     324.0   \n",
       "1       1    1/1   1.100000e-02      33.1   \n",
       "2       1    1/1   9.800000e-02      28.9   \n",
       "3       1    1/1   4.900000e+00      25.0   \n",
       "4       1    1/1   9.400000e+00      23.9   \n",
       "\n",
       "                                                qseq  \\\n",
       "0  GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RH...   \n",
       "1  DELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGR...   \n",
       "2  IIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLK...   \n",
       "3  GGSRGGFGGRGGSRGGARGG-SRGGFGGR-GGSRGGARGGSRGGFG...   \n",
       "4    EDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVV   \n",
       "\n",
       "                                                sseq  \n",
       "0  GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRF...  \n",
       "1  EDFYITP----YYAGHVLGAAMFHV--VVGDQSVVYTGDYSTTPDK...  \n",
       "2  VTPALKERRH--TYAIIVGIIDVTFVLLQRKDGEREICSVANLHLE...  \n",
       "3  SGGRRGAGQKSQRIRGYRGGTSSGSHLGRSGGKRSDRSGCERGSFP...  \n",
       "4    EDSRPVFRFKPAIAPVQCAIGYLIHFDEFNEHILNIKKFLTDNGLVV  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dfs[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Displaying the results for the first one in the list, at index zero, shows that the first hit for Nop1p, the highly conserved protein fibrillarin, has high percent identity over 277 residues and a good bitscore. The others are not as good.\n",
    "\n",
    "(Note if you'd prefer to see the BLAST results as text this is covered after a few more cells below under 'Text result'.)\n",
    "\n",
    "Because the description in the `stitle` column gets cut off, this following trick can be used to see more of that and that column. The sequences aren't as useful but they are shown,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>qseqid</th>\n",
       "      <th>sseqid</th>\n",
       "      <th>stitle</th>\n",
       "      <th>pident</th>\n",
       "      <th>qcovs</th>\n",
       "      <th>length</th>\n",
       "      <th>mismatch</th>\n",
       "      <th>gapopen</th>\n",
       "      <th>qstart</th>\n",
       "      <th>qend</th>\n",
       "      <th>sstart</th>\n",
       "      <th>send</th>\n",
       "      <th>qframe</th>\n",
       "      <th>sframe</th>\n",
       "      <th>frames</th>\n",
       "      <th>evalue</th>\n",
       "      <th>bitscore</th>\n",
       "      <th>qseq</th>\n",
       "      <th>sseq</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NOP1</td>\n",
       "      <td>KMV65234</td>\n",
       "      <td>KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:96494:97369:1 gene:M970_100740 transcript:KMV65234 gene_biotype:protein_coding transcript_biotype:protein_coding description:fibrillarin</td>\n",
       "      <td>58.123</td>\n",
       "      <td>82</td>\n",
       "      <td>277</td>\n",
       "      <td>100</td>\n",
       "      <td>4</td>\n",
       "      <td>57</td>\n",
       "      <td>324</td>\n",
       "      <td>10</td>\n",
       "      <td>279</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1/1</td>\n",
       "      <td>2.610000e-112</td>\n",
       "      <td>324.0</td>\n",
       "      <td>GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RHAGVYIARGKEDLLVTKNMAPGESVYGEKRISVEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIMGGLDELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGRELISMAKKRPNIIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCIDSTVDAETVFAREVQKLREERIKPLEQLTLEPYERDHCIVVGRYMRSG</td>\n",
       "      <td>GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRFPGVYISRGKEDLLLTRNLVPGVSVYGEKRVAVD-------LEGMKVEYRVWNAYRSKLAAGIVCGAENIHMEPGSKVLYLGASSGTTVSHVSDIVGKDGVVYAVEFSERSGRDLINMSMKRPNIVPIIEDARYPSRYRMLVPIVDCIFSDVSQPDQTRIVALNAQYFLKEGGGVDVSIKANCVNSAVPAETVFADEVNILRKNSIRPKEQVTLEPFEKDHAMIIGRFKLSA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NOP1</td>\n",
       "      <td>KMV65287</td>\n",
       "      <td>KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:168660:170150:1 gene:M970_101280 transcript:KMV65287 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein</td>\n",
       "      <td>28.889</td>\n",
       "      <td>25</td>\n",
       "      <td>90</td>\n",
       "      <td>49</td>\n",
       "      <td>4</td>\n",
       "      <td>160</td>\n",
       "      <td>240</td>\n",
       "      <td>141</td>\n",
       "      <td>224</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1/1</td>\n",
       "      <td>1.100000e-02</td>\n",
       "      <td>33.1</td>\n",
       "      <td>DELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGREL--ISMAKKRPNII-------PIIEDARHPQKYRMLIGMVDCV</td>\n",
       "      <td>EDFYITP----YYAGHVLGAAMFHV--VVGDQSVVYTGDYSTTPDKHLGPASIKCIRPDLLITESTYGSITRDCRKVKEREFLKAVSDCV</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NOP1</td>\n",
       "      <td>KMV66016</td>\n",
       "      <td>KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:150158:150508:-1 gene:M970_061160 transcript:KMV66016 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein</td>\n",
       "      <td>22.059</td>\n",
       "      <td>21</td>\n",
       "      <td>68</td>\n",
       "      <td>51</td>\n",
       "      <td>1</td>\n",
       "      <td>217</td>\n",
       "      <td>284</td>\n",
       "      <td>10</td>\n",
       "      <td>75</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1/1</td>\n",
       "      <td>9.800000e-02</td>\n",
       "      <td>28.9</td>\n",
       "      <td>IIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCIDSTVDAET</td>\n",
       "      <td>VTPALKERRH--TYAIIVGIIDVTFVLLQRKDGEREICSVANLHLEDEAFDIKGLSAEEIGKLIPEDT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NOP1</td>\n",
       "      <td>KMV66763</td>\n",
       "      <td>KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:92624:94390:-1 gene:M970_010930 transcript:KMV66763 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein</td>\n",
       "      <td>28.289</td>\n",
       "      <td>44</td>\n",
       "      <td>152</td>\n",
       "      <td>92</td>\n",
       "      <td>7</td>\n",
       "      <td>13</td>\n",
       "      <td>156</td>\n",
       "      <td>32</td>\n",
       "      <td>174</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1/1</td>\n",
       "      <td>4.900000e+00</td>\n",
       "      <td>25.0</td>\n",
       "      <td>GGSRGGFGGRGGSRGGARGG-SRGGFGGR-GGSRGGARGGSRGGFGGRGGSRGGARGGSR-----GGRGGAAGGARGGAKVVIEPHRHAGVYI-ARGKEDLLVTKNMAPGESVYGEKRISVEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIM</td>\n",
       "      <td>SGGRRGAGQKSQRIRGYRGGTSSGSHLGRSGGKRSDRSGCERGSFPMKGEAVTGGRYELRMSEDIGVQGPTGEGGQQGERCQEE------VYVDAKERDSSNRRCSVRPANEVFGREMCRLHSSPLEDG--KDRIE-KIESMLRNDLADGIV</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NOP1</td>\n",
       "      <td>KMV65333</td>\n",
       "      <td>KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:231151:232890:1 gene:M970_101740 transcript:KMV65333 gene_biotype:protein_coding transcript_biotype:protein_coding description:glycyl-tRNA synthetase</td>\n",
       "      <td>31.915</td>\n",
       "      <td>14</td>\n",
       "      <td>47</td>\n",
       "      <td>32</td>\n",
       "      <td>0</td>\n",
       "      <td>222</td>\n",
       "      <td>268</td>\n",
       "      <td>463</td>\n",
       "      <td>509</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1/1</td>\n",
       "      <td>9.400000e+00</td>\n",
       "      <td>23.9</td>\n",
       "      <td>EDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVV</td>\n",
       "      <td>EDSRPVFRFKPAIAPVQCAIGYLIHFDEFNEHILNIKKFLTDNGLVV</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  qseqid    sseqid  \\\n",
       "0   NOP1  KMV65234   \n",
       "1   NOP1  KMV65287   \n",
       "2   NOP1  KMV66016   \n",
       "3   NOP1  KMV66763   \n",
       "4   NOP1  KMV65333   \n",
       "\n",
       "                                                                                                                                                                                               stitle  \\\n",
       "0               KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:96494:97369:1 gene:M970_100740 transcript:KMV65234 gene_biotype:protein_coding transcript_biotype:protein_coding description:fibrillarin   \n",
       "1    KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:168660:170150:1 gene:M970_101280 transcript:KMV65287 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein   \n",
       "2   KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:150158:150508:-1 gene:M970_061160 transcript:KMV66016 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein   \n",
       "3     KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:92624:94390:-1 gene:M970_010930 transcript:KMV66763 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein   \n",
       "4  KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:231151:232890:1 gene:M970_101740 transcript:KMV65333 gene_biotype:protein_coding transcript_biotype:protein_coding description:glycyl-tRNA synthetase   \n",
       "\n",
       "   pident  qcovs  length  mismatch  gapopen  qstart  qend  sstart  send  \\\n",
       "0  58.123     82     277       100        4      57   324      10   279   \n",
       "1  28.889     25      90        49        4     160   240     141   224   \n",
       "2  22.059     21      68        51        1     217   284      10    75   \n",
       "3  28.289     44     152        92        7      13   156      32   174   \n",
       "4  31.915     14      47        32        0     222   268     463   509   \n",
       "\n",
       "   qframe  sframe frames         evalue  bitscore  \\\n",
       "0       1       1    1/1  2.610000e-112     324.0   \n",
       "1       1       1    1/1   1.100000e-02      33.1   \n",
       "2       1       1    1/1   9.800000e-02      28.9   \n",
       "3       1       1    1/1   4.900000e+00      25.0   \n",
       "4       1       1    1/1   9.400000e+00      23.9   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                    qseq  \\\n",
       "0  GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RHAGVYIARGKEDLLVTKNMAPGESVYGEKRISVEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIMGGLDELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGRELISMAKKRPNIIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCIDSTVDAETVFAREVQKLREERIKPLEQLTLEPYERDHCIVVGRYMRSG   \n",
       "1                                                                                                                                                                                             DELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGREL--ISMAKKRPNII-------PIIEDARHPQKYRMLIGMVDCV   \n",
       "2                                                                                                                                                                                                                   IIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCIDSTVDAET   \n",
       "3                                                                                                                               GGSRGGFGGRGGSRGGARGG-SRGGFGGR-GGSRGGARGGSRGGFGGRGGSRGGARGGSR-----GGRGGAAGGARGGAKVVIEPHRHAGVYI-ARGKEDLLVTKNMAPGESVYGEKRISVEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIM   \n",
       "4                                                                                                                                                                                                                                        EDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVV   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                    sseq  \n",
       "0  GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRFPGVYISRGKEDLLLTRNLVPGVSVYGEKRVAVD-------LEGMKVEYRVWNAYRSKLAAGIVCGAENIHMEPGSKVLYLGASSGTTVSHVSDIVGKDGVVYAVEFSERSGRDLINMSMKRPNIVPIIEDARYPSRYRMLVPIVDCIFSDVSQPDQTRIVALNAQYFLKEGGGVDVSIKANCVNSAVPAETVFADEVNILRKNSIRPKEQVTLEPFEKDHAMIIGRFKLSA  \n",
       "1                                                                                                                                                                                             EDFYITP----YYAGHVLGAAMFHV--VVGDQSVVYTGDYSTTPDKHLGPASIKCIRPDLLITESTYGSITRDCRKVKEREFLKAVSDCV  \n",
       "2                                                                                                                                                                                                                   VTPALKERRH--TYAIIVGIIDVTFVLLQRKDGEREICSVANLHLEDEAFDIKGLSAEEIGKLIPEDT  \n",
       "3                                                                                                                               SGGRRGAGQKSQRIRGYRGGTSSGSHLGRSGGKRSDRSGCERGSFPMKGEAVTGGRYELRMSEDIGVQGPTGEGGQQGERCQEE------VYVDAKERDSSNRRCSVRPANEVFGREMCRLHSSPLEDG--KDRIE-KIESMLRNDLADGIV  \n",
       "4                                                                                                                                                                                                                                        EDSRPVFRFKPAIAPVQCAIGYLIHFDEFNEHILNIKKFLTDNGLVV  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "#trick from based on https://stackoverflow.com/a/30691921 and https://stackoverflow.com/a/25352191/8508004\n",
    "with pd.option_context('display.max_rows', None, 'display.max_columns', None,'display.max_colwidth', None):\n",
    "    display(dfs[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The individual dataframes of results can be used in some of the ways demonstrated earlier in this series of notebooks.  \n",
    "For example, it would be easy to filter based on the values in the bitscore column, like in the following cell. Note to make the filter step easier to read syntax-wise, first I assign the first dataframe from the list of datafames to a simpler moniker, of just `df`.  \n",
    "Then I specify to subset that dataframe to just cases where the values in the dataframe's 'bitscore' column exceed 100. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>qseqid</th>\n",
       "      <th>sseqid</th>\n",
       "      <th>stitle</th>\n",
       "      <th>pident</th>\n",
       "      <th>qcovs</th>\n",
       "      <th>length</th>\n",
       "      <th>mismatch</th>\n",
       "      <th>gapopen</th>\n",
       "      <th>qstart</th>\n",
       "      <th>qend</th>\n",
       "      <th>sstart</th>\n",
       "      <th>send</th>\n",
       "      <th>qframe</th>\n",
       "      <th>sframe</th>\n",
       "      <th>frames</th>\n",
       "      <th>evalue</th>\n",
       "      <th>bitscore</th>\n",
       "      <th>qseq</th>\n",
       "      <th>sseq</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NOP1</td>\n",
       "      <td>KMV65234</td>\n",
       "      <td>KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:9...</td>\n",
       "      <td>58.123</td>\n",
       "      <td>82</td>\n",
       "      <td>277</td>\n",
       "      <td>100</td>\n",
       "      <td>4</td>\n",
       "      <td>57</td>\n",
       "      <td>324</td>\n",
       "      <td>10</td>\n",
       "      <td>279</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1/1</td>\n",
       "      <td>2.610000e-112</td>\n",
       "      <td>324.0</td>\n",
       "      <td>GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RH...</td>\n",
       "      <td>GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRF...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  qseqid    sseqid                                             stitle  pident  \\\n",
       "0   NOP1  KMV65234  KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:9...  58.123   \n",
       "\n",
       "   qcovs  length  mismatch  gapopen  qstart  qend  sstart  send  qframe  \\\n",
       "0     82     277       100        4      57   324      10   279       1   \n",
       "\n",
       "   sframe frames         evalue  bitscore  \\\n",
       "0       1    1/1  2.610000e-112     324.0   \n",
       "\n",
       "                                                qseq  \\\n",
       "0  GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RH...   \n",
       "\n",
       "                                                sseq  \n",
       "0  GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRF...  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = dfs[0].copy() #make a copy assigned to `df` so next line easier to read\n",
    "df[df['bitscore'] > 100]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Text result\n",
    "\n",
    "If you are more used to viewing the text output from BLAST to assess how things worked, you can modify the code used to generate the dataframes above in order show that instead of generating dataframes. In an attempt to make it easier to scroll through to view the results by limiting the output in the window, the results will be limited to the first gene in the list, Nop1, by adding `break` so the `for` loop ends when it is hit.   \n",
    "Here is a reworked version to show the text output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "Protein sequence saved as `blast_input_seq.fa`."
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "BLASTP 2.15.0+\n",
      "\n",
      "\n",
      "Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.\n",
      "Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.\n",
      "Lipman (1997), \"Gapped BLAST and PSI-BLAST: a new generation of\n",
      "protein database search programs\", Nucleic Acids Res. 25:3389-3402.\n",
      "\n",
      "\n",
      "Reference for composition-based statistics: Alejandro A. Schaffer,\n",
      "L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri\n",
      "I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),\n",
      "\"Improving the accuracy of PSI-BLAST protein database searches with\n",
      "composition-based statistics and other refinements\", Nucleic Acids\n",
      "Res. 29:2994-3005.\n",
      "\n",
      "\n",
      "\n",
      "Database: ec.pep.fas\n",
      "           1,834 sequences; 653,245 total letters\n",
      "\n",
      "\n",
      "\n",
      "Query= NOP1 YDL014W\n",
      "\n",
      "Length=328\n",
      "                                                                      Score     E\n",
      "Sequences producing significant alignments:                          (Bits)  Value\n",
      "\n",
      "KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:96494:97369:1 gene:M...  324     3e-112\n",
      "KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:168660:170150:1 gene...  33.1    0.011 \n",
      "KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:150158:150508:-1 gen...  28.9    0.098 \n",
      "KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:92624:94390:-1 gene:...  25.0    4.9   \n",
      "KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:231151:232890:1 gene...  23.9    9.4   \n",
      "\n",
      "\n",
      ">KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:96494:97369:1 gene:M970_100740 \n",
      "transcript:KMV65234 gene_biotype:protein_coding \n",
      "transcript_biotype:protein_coding description:fibrillarin\n",
      "Length=291\n",
      "\n",
      " Score = 324 bits (830),  Expect = 3e-112, Method: Compositional matrix adjust.\n",
      " Identities = 161/277 (58%), Positives = 208/277 (75%), Gaps = 16/277 (6%)\n",
      "\n",
      "Query  57   GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RHAGVYIARGKEDLLV  107\n",
      "            GR   +GG   R G   GRG      +G        KV++EPH R  GVYI+RGKEDLL+\n",
      "Sbjct  10   GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRFPGVYISRGKEDLLL  69\n",
      "\n",
      "Query  108  TKNMAPGESVYGEKRISVEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIMGGLDELFIAPG  167\n",
      "            T+N+ PG SVYGEKR++V+       +   KVEYRVWN +RSKLAAGI+ G + + + PG\n",
      "Sbjct  70   TRNLVPGVSVYGEKRVAVD-------LEGMKVEYRVWNAYRSKLAAGIVCGAENIHMEPG  122\n",
      "\n",
      "Query  168  KKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGRELISMAKKRPNIIPIIEDARHP  227\n",
      "             KVLYLGA+SGT+VSHVSD+VG +GVVYAVEFS R GR+LI+M+ KRPNI+PIIEDAR+P\n",
      "Sbjct  123  SKVLYLGASSGTTVSHVSDIVGKDGVVYAVEFSERSGRDLINMSMKRPNIVPIIEDARYP  182\n",
      "\n",
      "Query  228  QKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCIDSTVDAETVFA  287\n",
      "             +YRML+ +VDC+F+DV+QPDQ RI+ALN+  FLK+ GGV +SIKANC++S V AETVFA\n",
      "Sbjct  183  SRYRMLVPIVDCIFSDVSQPDQTRIVALNAQYFLKEGGGVDVSIKANCVNSAVPAETVFA  242\n",
      "\n",
      "Query  288  REVQKLREERIKPLEQLTLEPYERDHCIVVGRYMRSG  324\n",
      "             EV  LR+  I+P EQ+TLEP+E+DH +++GR+  S \n",
      "Sbjct  243  DEVNILRKNSIRPKEQVTLEPFEKDHAMIIGRFKLSA  279\n",
      "\n",
      "\n",
      ">KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:168660:170150:1 \n",
      "gene:M970_101280 transcript:KMV65287 gene_biotype:protein_coding \n",
      "transcript_biotype:protein_coding description:hypothetical \n",
      "protein\n",
      "Length=496\n",
      "\n",
      " Score = 33.1 bits (74),  Expect = 0.011, Method: Compositional matrix adjust.\n",
      " Identities = 26/90 (29%), Positives = 42/90 (47%), Gaps = 15/90 (17%)\n",
      "\n",
      "Query  160  DELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGREL--ISMAKKRPNI  217\n",
      "            ++ +I P     Y G   G ++ HV  VVG + VVY  ++S  P + L   S+   RP++\n",
      "Sbjct  141  EDFYITP----YYAGHVLGAAMFHV--VVGDQSVVYTGDYSTTPDKHLGPASIKCIRPDL  194\n",
      "\n",
      "Query  218  I-------PIIEDARHPQKYRMLIGMVDCV  240\n",
      "            +        I  D R  ++   L  + DCV\n",
      "Sbjct  195  LITESTYGSITRDCRKVKEREFLKAVSDCV  224\n",
      "\n",
      "\n",
      ">KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:150158:150508:-1 \n",
      "gene:M970_061160 transcript:KMV66016 gene_biotype:protein_coding \n",
      "transcript_biotype:protein_coding description:hypothetical \n",
      "protein\n",
      "Length=106\n",
      "\n",
      " Score = 28.9 bits (63),  Expect = 0.098, Method: Composition-based stats.\n",
      " Identities = 15/68 (22%), Positives = 35/68 (51%), Gaps = 2/68 (3%)\n",
      "\n",
      "Query  217  IIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCI  276\n",
      "            + P +++ RH   Y +++G++D  F  + + D  R I   +++ L+D+   +  + A  I\n",
      "Sbjct  10   VTPALKERRH--TYAIIVGIIDVTFVLLQRKDGEREICSVANLHLEDEAFDIKGLSAEEI  67\n",
      "\n",
      "Query  277  DSTVDAET  284\n",
      "               +  +T\n",
      "Sbjct  68   GKLIPEDT  75\n",
      "\n",
      "\n",
      ">KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:92624:94390:-1 gene:M970_010930 \n",
      "transcript:KMV66763 gene_biotype:protein_coding \n",
      "transcript_biotype:protein_coding description:hypothetical \n",
      "protein\n",
      "Length=588\n",
      "\n",
      " Score = 25.0 bits (53),  Expect = 4.9, Method: Compositional matrix adjust.\n",
      " Identities = 43/152 (28%), Positives = 65/152 (43%), Gaps = 17/152 (11%)\n",
      "\n",
      "Query  13   GGSRGGFGGRGGSRGGARGG-SRGGFGGR-GGSRGGARGGSRGGFGGRGGSRGGARGGSR  70\n",
      "             G R G G +     G RGG S G   GR GG R    G  RG F  +G +  G R   R\n",
      "Sbjct  32   SGGRRGAGQKSQRIRGYRGGTSSGSHLGRSGGKRSDRSGCERGSFPMKGEAVTGGRYELR  91\n",
      "\n",
      "Query  71   -----GGRGGAAGGARGGAKVVIEPHRHAGVYI-ARGKEDLLVTKNMAPGESVYGEKRIS  124\n",
      "                 G +G    G + G +   E      VY+ A+ ++      ++ P   V+G +   \n",
      "Sbjct  92   MSEDIGVQGPTGEGGQQGERCQEE------VYVDAKERDSSNRRCSVRPANEVFGREMCR  145\n",
      "\n",
      "Query  125  VEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIM  156\n",
      "            +     EDG    ++E ++ +  R+ LA GI+\n",
      "Sbjct  146  LHSSPLEDG--KDRIE-KIESMLRNDLADGIV  174\n",
      "\n",
      "\n",
      ">KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:231151:232890:1 \n",
      "gene:M970_101740 transcript:KMV65333 gene_biotype:protein_coding \n",
      "transcript_biotype:protein_coding description:glycyl-tRNA \n",
      "synthetase\n",
      "Length=579\n",
      "\n",
      " Score = 23.9 bits (50),  Expect = 9.4, Method: Compositional matrix adjust.\n",
      " Identities = 15/47 (32%), Positives = 21/47 (45%), Gaps = 0/47 (0%)\n",
      "\n",
      "Query  222  EDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVV  268\n",
      "            ED+R   +++  I  V C    +   D+     LN   FL D G VV\n",
      "Sbjct  463  EDSRPVFRFKPAIAPVQCAIGYLIHFDEFNEHILNIKKFLTDNGLVV  509\n",
      "\n",
      "\n",
      "\n",
      "Lambda      K        H        a         alpha\n",
      "   0.319    0.141    0.423    0.792     4.96 \n",
      "\n",
      "Gapped\n",
      "Lambda      K        H        a         alpha    sigma\n",
      "   0.267   0.0410    0.140     1.90     42.6     43.6 \n",
      "\n",
      "Effective search space used: 126581391\n",
      "\n",
      "\n",
      "  Database: ec.pep.fas\n",
      "    Posted date:  Mar 6, 2024  10:04 PM\n",
      "  Number of letters in database: 653,245\n",
      "  Number of sequences in database:  1,834\n",
      "\n",
      "\n",
      "\n",
      "Matrix: BLOSUM62\n",
      "Gap Penalties: Existence: 11, Extension: 1\n",
      "Neighboring words threshold: 11\n",
      "Window for multiple hits: 40\n"
     ]
    }
   ],
   "source": [
    "records = Fasta(seq_file_to_use)\n",
    "for seq in records:\n",
    "    one_seq_to_file(str(seq), seq.long_name, \"blast_input_seq.fa\")\n",
    "    !blastp -query blast_input_seq.fa -db ec.pep.fas -task blastp\n",
    "    !rm blast_input_seq.fa\n",
    "    break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It clearly shows in the description of the best match for Nop1p that it is annotated as fibrillarin, and so it seems indeed the top hit identified is the ortholog of *S. cerevisiae* Nop1p.\n",
    "\n",
    "To make it easier to review the results, you send them to a text file that you review here or download. This modification of the code when run, will send **the search results for all the sequences** to a separate file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "Protein sequence saved as `blast_input_seq.fa`.\n",
      "\n",
      "Protein sequence saved as `blast_input_seq.fa`.\n",
      "\n",
      "Protein sequence saved as `blast_input_seq.fa`."
     ]
    }
   ],
   "source": [
    "records = Fasta(seq_file_to_use)\n",
    "!touch results.txt\n",
    "!rm results.txt # adding a deletion so if you run this cell more than once it\n",
    "# will start a fresh file and then append to it\n",
    "!touch results.txt\n",
    "for seq in records:\n",
    "    one_seq_to_file(str(seq), seq.long_name, \"blast_input_seq.fa\")\n",
    "    !blastp -query blast_input_seq.fa -db ec.pep.fas -task blastp >>results.txt\n",
    "    !rm blast_input_seq.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are using the demonstration with the sequences as it was originally writen, the following command will show the size of the `results.txt` file as 43k."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-rw-r--r-- 1 jovyan jovyan 43K Mar  6 22:04 results.txt\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "ls -lah results.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Go to the dashboard by clicking on the Jupyter icon in the upper left and then you can review it like a text file. Or download this file if you'd like to keep it, or if you think you can view it better outside of Jupyter."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-----"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}