{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Examples of running the script\n",
    "\n",
    "There are several options to run `find_sequence_element_occurrences_in_sequence` or utilize the core function.\n",
    "\n",
    "This notebook will demonstrate a few ways of using this script:\n",
    "\n",
    "- [pasting it into a notebook cell or loading it into a cell](#Running-core-function-of-the-scrip-after-loading-into-a-cell)\n",
    "- [importing the main function](#importing-the-main-function)\n",
    "\n",
    "Notably, it won't cover running the script after upload. An approach to that is illustrated [here](https://github.com/fomightez/sequencework/blob/master/circos-utilities/demo%20UCSC_chrom_sizes_2_circos_karyotype%20script.ipynb) for a different script, but that should serve as a good guide combined with getting the `USAGE` info by running it with the `--help/-h` flag.  \n",
    "\n",
    "(If you are having any problems at all doing any of this because of Python or needed dependencies, such as Bioython, this notebook was developed in the enviromenment launchable by pressing `Launch binder` badge [here](https://github.com/fomightez/qgrid-notebooks). You could always launch that environment and upload this notebook there and things should work.)\n",
    "\n",
    "**It will also demonstrate a more complex use of the core function to process multiple searches for sequence elements in a manner reminiscent of a [mini-pipeline](https://github.com/fomightez/mini-pipelines).**\n",
    "\n",
    "## Running as script file\n",
    "\n",
    "Similar to how one would run a script from the command line.\n",
    "\n",
    "Upload the script to the directory where you want to run it. Or upload it to a running Jupyter environment.\n",
    "\n",
    "A similar approach to the latter is illustrated [here](https://github.com/fomightez/sequencework/blob/master/circos-utilities/demo%20UCSC_chrom_sizes_2_circos_karyotype%20script.ipynb) for a different script but that should serve as a good guide combined with getting the `USAGE` info by running it with the `--help/-h` flag.\n",
    "\n",
    "-----\n",
    "\n",
    "## Running core function of the script after loading into a cell\n",
    "\n",
    "It can be pasted into a cell or loaded from github. Those will be demonstrated in this section of the notebook.\n",
    "\n",
    "First part of this section will cover pasting script into a cell.  \n",
    "In the next cell is the script (although it might not be the most up-to-date version, and so it would be best to get and paste the most-up-to-date version from Github or use the `%load `approach to fetch it from Github directly into a cell, as illustrated [here](https://github.com/fomightez/sequencework/blob/master/circos-utilities/demo%20UCSC_chrom_sizes_2_circos_karyotype%20script.ipynb))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: regex in /srv/conda/envs/notebook/lib/python3.7/site-packages (2021.11.10)\n",
      "Note: you may need to restart the kernel to use updated packages.\n",
      "Requirement already satisfied: openpyxl in /srv/conda/envs/notebook/lib/python3.7/site-packages (3.0.9)\n",
      "Requirement already satisfied: et-xmlfile in /srv/conda/envs/notebook/lib/python3.7/site-packages (from openpyxl) (1.1.0)\n",
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install regex\n",
    "%pip install openpyxl"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!/usr/bin/env python\n",
    "# find_sequence_element_occurrences_in_sequence.py\n",
    "__author__ = \"Wayne Decatur\" #fomightez on GitHub\n",
    "__license__ = \"MIT\"\n",
    "__version__ = \"0.1.0\"\n",
    "\n",
    "\n",
    "# find_sequence_element_occurrences_in_sequence.py by Wayne Decatur\n",
    "# ver 0.1\n",
    "#\n",
    "#*******************************************************************************\n",
    "# Verified compatible with both Python 2.7 and Python 3.6; written initially in \n",
    "# Python 3.\n",
    "#\n",
    "# PURPOSE: Takes a short sequence represented as a string and a FASTA-formatted\n",
    "# sequence (either a file or URL to a file) and makes an accounting of the \n",
    "# occurrences of that short sequence element on both strands of the main \n",
    "# sequence. Exact matches only are counted as occurrences.\n",
    "# The script is case-insensitive, meaning the case of either the element or \n",
    "# FASTA sequence provided do not matter and matches will be noted anyway.\n",
    "#\n",
    "# As written, the script expects and only processes one FASTA-formatted \n",
    "# sequence. If your, FASTA file has more than one sequence entry within it and \n",
    "# the one you want to have scanned is not the first, copy and paste it to a new \n",
    "# file and use that as the sequence to scan.\n",
    "#\n",
    "# Written to run from command line or pasted/loaded inside a Jupyter notebook \n",
    "# cell or imported. \n",
    "#\n",
    "#\n",
    "#\n",
    "# This script based on work and musings developed in \n",
    "# `Resources for writing code to do GC cluster accounting.md`\n",
    "#\n",
    "# The script at https://www.biostars.org/p/209383/ (steve's answer) served as \n",
    "# the backbone for adding convenience and user-friendly features.\n",
    "#\n",
    "#\n",
    "#\n",
    "# Dependencies beyond the mostly standard libraries/modules:\n",
    "# - Biopython\n",
    "#\n",
    "#\n",
    "# VERSION HISTORY:\n",
    "# v.0.1. basic working version\n",
    "#\n",
    "# To do:\n",
    "# - \n",
    "#\n",
    "#\n",
    "#\n",
    "# TO RUN:\n",
    "# Examples,\n",
    "# Enter on the command line of your terminal, the line\n",
    "#-----------------------------------\n",
    "# python find_sequence_element_occurrences_in_sequence.py GAATTC sequences.fasta\n",
    "\n",
    "#-OR-\n",
    "# python find_sequence_element_occurrences_in_sequence.py GAATTC sequences.fasta URL\n",
    "#-----------------------------------\n",
    "# Issue `python find_sequence_element_occurrences_in_sequence.py -h` for details.\n",
    "# \n",
    "#\n",
    "# To use this after pasting or loading into a cell in a Jupyter notebook, in\n",
    "# the next cell define critical variables and then call the main function \n",
    "# similar to below:\n",
    "# source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n",
    "# element = \"GAATTC\"\n",
    "# id_of_seq_element = \"ele1\" #Set to `None` without quotes or backticks to have defined automatically\n",
    "# find_sequence_element_occurrences_in_sequence(source, element, id_of_seq_element, return_dataframe = True)\n",
    "#\n",
    "# Something similar would need to be done if importing the script into another.\n",
    "# (`id_of_seq_scanned_hardcoded` can be assigned in a cell before calling the \n",
    "# function as well.)\n",
    "#\n",
    "# \n",
    "#\n",
    "'''\n",
    "CURRENT ACTUAL CODE FOR RUNNING/TESTING IN A NOTEBOOK WHEN LOADED OR PASTED IN \n",
    "ANOTHER CELL:\n",
    "source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n",
    "element = \"GAATTC\"\n",
    "id_of_seq_element = \"EcoRI\" #Set to `None` without quotes or backticks to have defined automatically\n",
    "df = find_sequence_element_occurrences_in_sequence(source, element, id_of_seq_element, return_dataframe = True)\n",
    "\n",
    "\n",
    "'''\n",
    "#\n",
    "#\n",
    "#*******************************************************************************\n",
    "#\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "#*******************************************************************************\n",
    "##################################\n",
    "#  USER ADJUSTABLE VALUES        #\n",
    "\n",
    "##################################\n",
    "#\n",
    "output_file_name_prefix = \"seq_account\"\n",
    "id_of_seq_scanned_hardcoded = None # replace `None` with what you want to use,\n",
    "# with flanking quotes if something appropriate is not being extracted from the\n",
    "# provided filepath/filename or URL to save as an indicator of the file scanned\n",
    "# in the output file name. \n",
    "\n",
    "\n",
    "\n",
    "limit_of_name = 17 # number of bases of the sequence element to limit to using \n",
    "# if the sequence element sequence is used to make the name for the output file\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "#\n",
    "#*******************************************************************************\n",
    "#**********************END USER ADJUSTABLE VARIABLES****************************\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "#*******************************************************************************\n",
    "#*******************************************************************************\n",
    "###DO NOT EDIT BELOW HERE - ENTER VALUES ABOVE###\n",
    "\n",
    "import sys\n",
    "import os\n",
    "import regex\n",
    "import pandas as pd\n",
    "from Bio import SeqIO\n",
    "from Bio.Seq import Seq # for reverse complement\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "###---------------------------HELPER FUNCTIONS---------------------------------###\n",
    "\n",
    "\n",
    "def get_seq_element_representation(id_of_seq_element):\n",
    "    '''\n",
    "    Takes `id_of_seq_element` and returns a string represenation for using\n",
    "    in the output file name and the dataframe.\n",
    "    '''\n",
    "    if id_of_seq_element:\n",
    "        elem_id = id_of_seq_element\n",
    "    elif len(element) > limit_of_name:\n",
    "        elem_id = element[:limit_of_name]+\"...\"\n",
    "    else:\n",
    "        elem_id = element\n",
    "    return elem_id\n",
    "\n",
    "\n",
    "\n",
    "def generate_output_file_name(id_of_seq_element,id_of_seq_scanned):\n",
    "    '''\n",
    "    Takes a file name as an argument and returns string for the name of the\n",
    "    output file. The generated name is based on a prefix that can be adjusted\n",
    "    under the section ' USER ADJUSTABLE VALUES ', plus the provided text in \n",
    "    the function call.\n",
    "    If there is no `id_of_seq_element` specified, the sequence element will be\n",
    "    used, limited to the first number of bases specified in `limit_of_name`.\n",
    "\n",
    "    Specific examples\n",
    "    =================\n",
    "    Calling function with\n",
    "        (\"elem1\",\"chrmt\")\n",
    "    returns\n",
    "        \"seq_account_elem1_chrmt.tsv\"\n",
    "\n",
    "    Calling function with\n",
    "        (None,\"chrmt\")\n",
    "    returns\n",
    "        \"seq_account_GAATTC_chrmt.tsv\"\n",
    "    if `GAATTC` happened to be the provided sequence element.\n",
    "    '''\n",
    "    elem_id = get_seq_element_representation(id_of_seq_element)\n",
    "    if elem_id.endswith(\"...\") and (id_of_seq_element == None):\n",
    "        # Because in order to use `get_seq_element_representation()` in all the \n",
    "        # places something similar is needed, I have it adding `...` if the\n",
    "        # sequence of the element exceeds the limit size. Since I don't want\n",
    "        # that in the file name I am going to remove if it seems I had added it,\n",
    "        # and I will only have possibly added it if `id_of_seq_element` is None.\n",
    "        # This extra check is mitigate chances I'll remove `...` in unlikely\n",
    "        # possibility user wanted to include it in id for sequence element. \n",
    "        elem_id = elem_id[:-3] + \"-\" #add hyphen to indicate there is more \n",
    "    return \"{prefix}_{elem_id}_{seqid}.tsv\".format(\n",
    "        prefix=output_file_name_prefix,elem_id=elem_id,\n",
    "        seqid=id_of_seq_scanned)\n",
    "\n",
    "\n",
    "def extract_id_of_seq_scanned(source):\n",
    "    '''\n",
    "    Take something like:\n",
    "    https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\n",
    "\n",
    "    -or- \n",
    "\n",
    "    /local/directory1/directory2/chrmt.fa\n",
    "\n",
    "    -or-\n",
    "\n",
    "    chrmt.fa\n",
    "\n",
    "    And return:\n",
    "    chrmt\n",
    "    '''\n",
    "    if \"/\" in source:\n",
    "        last_bit = source.split(\"/\")[-1]\n",
    "    else:\n",
    "        last_bit = source\n",
    "    if '.' in last_bit:\n",
    "        main_part_of_name, file_extension = os.path.splitext(\n",
    "        last_bit) #from http://stackoverflow.com/questions/541390/extracting-extension-from-filename-in-python\n",
    "        return main_part_of_name\n",
    "    else:\n",
    "        return last_bit\n",
    "\n",
    "\n",
    "def get_seq_from_URL(url):\n",
    "    '''\n",
    "    takes a URL and gets the sequence\n",
    "    '''\n",
    "    try:\n",
    "        from StringIO import StringIO\n",
    "    except ImportError:\n",
    "        from io import StringIO\n",
    "\n",
    "    # Getting html originally for just Python 3, adapted from \n",
    "    # https://stackoverflow.com/a/17510727/8508004 and then updated from to \n",
    "    # handle Python 2 and 3 according to same link.\n",
    "    try:\n",
    "        # For Python 3.0 and later\n",
    "        from urllib.request import urlopen\n",
    "    except ImportError:\n",
    "        # Fall back to Python 2's urllib2\n",
    "        from urllib2 import urlopen\n",
    "    html = urlopen(url)\n",
    "    fasta_iterator = SeqIO.parse(StringIO(\n",
    "        html.read().decode(encoding='UTF-8')), \"fasta\")\n",
    "    # Use of `next()` on next line to get first FASTA -formatted sequence is \n",
    "    # based on http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html\n",
    "    # I think difference from `SeqIO.read()` in this approach is that it won't\n",
    "    # give an error if more than one entry is in the html.\n",
    "    # I found I needed `StringIO()` or got issues with trying to handle long file name.\n",
    "    record = next(fasta_iterator)\n",
    "    return record.seq\n",
    "\n",
    "def get_fasta_seq(source):\n",
    "    '''\n",
    "    Takes a source URL or filepath/ file name and gets sequence if it is in\n",
    "    FASTA format.\n",
    "    It won't return anything if what is provided isn't FASTA format because\n",
    "    Biopython handles both trying to extract FASTA from URL and from file.\n",
    "    See https://stackoverflow.com/a/44294079/8508004.\n",
    "    Placing it in a function it easy to then check and provide some feedback.\n",
    "    '''\n",
    "    if source.lower().startswith(\"http\"):\n",
    "        return get_seq_from_URL(source)\n",
    "    else:\n",
    "        # Read sequence, treating source as a filepath.\n",
    "        # Use of `with` on next line based on http://biopython.org/wiki/SeqIO , \n",
    "        # under \"Sequence Input\". Otherwise, backbone based on \n",
    "        # https://www.biostars.org/p/209383/, and fact `rU` mode depecated.\n",
    "        with open(source, \"r\") as handle:\n",
    "            for record in SeqIO.parse(handle, \"fasta\"):\n",
    "                # print(record.seq) # for debugging\n",
    "                return record.seq\n",
    "\n",
    "\n",
    "def search_strand(pattern, sequence_to_scan, strand=1):\n",
    "    '''\n",
    "    take a sequence pattern (element) and find occurrences of that on the \n",
    "    provided, larger 5'-->3' sequence.\n",
    "    Assumes strand is first unless provided.\n",
    "\n",
    "    Tracks the start and end points of each occurrence, returning a list of\n",
    "    that information where each element is a tuple of the start and end points\n",
    "    along with the strand.\n",
    "\n",
    "    Works with overlapped sequences because now \n",
    "    \"regex.findall and regex.finditer support an ‘overlapped’ flag which \n",
    "    permits overlapped matches.\"\n",
    "    , see https://pypi.python.org/pypi/regex/2018.02.21\n",
    "\n",
    "    based on https://www.biostars.org/p/209383/ (specifically steve's answer)\n",
    "    '''\n",
    "    occurrences = []\n",
    "    for match in regex.finditer(\n",
    "        pattern.upper(), str(sequence_to_scan.upper()),overlapped=True):\n",
    "        if strand == 1:\n",
    "            start_pos = match.start() + 1\n",
    "            end_pos = match.end() + 1\n",
    "        else:\n",
    "            start_pos = (len(sequence_to_scan) - match.start() ) + 1\n",
    "            end_pos = (len(sequence_to_scan) - match.end() ) + 1\n",
    "        # print (start_pos, '\\t', end_pos, '\\t',strand) # for debugging\n",
    "        occurrences.append((start_pos, end_pos,strand))\n",
    "    return occurrences\n",
    "\n",
    "###--------------------------END OF HELPER FUNCTIONS---------------------------###\n",
    "###--------------------------END OF HELPER FUNCTIONS---------------------------###\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "#*******************************************************************************\n",
    "###------------------------'main' function of script---------------------------##\n",
    "def find_sequence_element_occurrences_in_sequence(\n",
    "    source, element, id_of_seq_element, return_dataframe = False):\n",
    "    '''\n",
    "    Main function of script. Scan a sequence (source) and report on occurrences \n",
    "    of a sub-sequence element in that sequence.\n",
    "\n",
    "    Returns None\n",
    "    Unless `return_dataframe = True`, and then it returns a dataframe of \n",
    "    accounting as well. That option being meant when using this script in a cell\n",
    "    in a Jupyter notebook or importing it into another script.\n",
    "    '''\n",
    "    # get the fasta_seq to scan\n",
    "    fasta_seq = get_fasta_seq(source)\n",
    "\n",
    "\n",
    "    assert fasta_seq, (\n",
    "    \"The provided source of the FASTA-formatted sequence seems invalid. Is it \"\n",
    "    \"FASTA format?\")\n",
    "\n",
    "    # With the approach in this next block, I can expose `id_of_seq_scanned` to \n",
    "    # setting for advanced use without it being required and without need to be \n",
    "    # passed into the function.\n",
    "    if id_of_seq_scanned_hardcoded:\n",
    "        id_of_seq_scanned = id_of_seq_scanned_hardcoded\n",
    "    else:\n",
    "        id_of_seq_scanned = extract_id_of_seq_scanned(source)\n",
    "        \n",
    "\n",
    "\n",
    "    #assert that element cannot be longer than fasta_seq\n",
    "    assert len(element) < len(fasta_seq), (\n",
    "    \"the FASTA sequence has to be longer than the provided sequence element.\\n\"\n",
    "    \"The provided FASTA sequence, {0}, is {1} bases;\\nthe provided sequence \"\n",
    "    \"element '{2}' is {3} bases.\".format(\n",
    "        id_of_seq_scanned,len(fasta.seq),element,len(element)))\n",
    "\n",
    "    occurrences = search_strand(element, fasta_seq) # first strand\n",
    "    occurrences += search_strand(\n",
    "        element, fasta_seq.reverse_complement(), strand=-1) #2nd strand\n",
    "    if occurrences:\n",
    "        # make it into a dataframe since provides convenient options for \n",
    "        # handling\n",
    "        df = pd.DataFrame(occurrences, columns=['start_pos','end_pos','strand'])\n",
    "        # add some useful information to the dataframe\n",
    "        df[\"seq. element\"] = get_seq_element_representation(id_of_seq_element)\n",
    "        df = df[['seq. element','start_pos','end_pos','strand']] # 'seq.element'\n",
    "        # column will show on far right otherwise\n",
    "\n",
    "        #print(updated_sites_df)#originally for debugging during development,added..\n",
    "        # Document the full set of data collected in the terminal or \n",
    "        # Jupyter notebook display in some manner. \n",
    "        # Using `df.to_string()` because more universal than `print(df)` \n",
    "        # or Jupyter's `display(df)`.\n",
    "        sys.stderr.write( \"\\nFor documenting purposes, the following lists the \"\n",
    "            \"collected occurences:\\n\")\n",
    "        #with pd.option_context('display.max_rows', None, 'display.max_columns', None):\n",
    "        #    display(df)\n",
    "        sys.stderr.write(df.to_string())\n",
    "\n",
    "        # write to tab-delimited file\n",
    "        output_file_name = generate_output_file_name(\n",
    "            id_of_seq_element,id_of_seq_scanned )\n",
    "        df.to_csv(output_file_name, sep='\\t',index = False)\n",
    "        sys.stderr.write( \"\\nThe information on the {0} occurrences of '{1}' \"\n",
    "                \"has\\nbeen saved as a file named\"\n",
    "                \" '{2}'.\".format(len(df),element,output_file_name))\n",
    "\n",
    "        if return_dataframe:\n",
    "            sys.stderr.write( \"\\n\\nReturning a dataframe with the information \"\n",
    "                \"as well.\")\n",
    "            return df\n",
    "    else:\n",
    "        sys.stderr.write( \"\\nNo occurrences of '{0}' \"\n",
    "                \"found in the provided sequence.\".format(element))\n",
    "        if return_dataframe:\n",
    "            sys.stderr.write( \"\\n\\nNo data to return in a dataframe and so \"\n",
    "                \"returning `None`.\")\n",
    "            return None\n",
    "\n",
    "\n",
    "###--------------------------END OF MAIN FUNCTION----------------------------###\n",
    "###--------------------------END OF MAIN FUNCTION----------------------------###\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "#*******************************************************************************\n",
    "###------------------------'main' section of script---------------------------##\n",
    "\n",
    "def main():\n",
    "    \"\"\" Main entry point of the script \"\"\"\n",
    "    # placing actual main action in a 'helper' script so can call that easily \n",
    "    # with a distinguishing name in Jupyter notebooks, where `main()` may get\n",
    "    # assigned multiple times depending how many scripts imported/pasted in.\n",
    "    find_sequence_element_occurrences_in_sequence(\n",
    "        source, element, id_of_seq_element)\n",
    "\n",
    "        \n",
    "\n",
    "if __name__ == \"__main__\" and '__file__' in globals():\n",
    "    \"\"\" This is executed when run from the command line \"\"\"\n",
    "    # Code with just `if __name__ == \"__main__\":` alone will be run if pasted\n",
    "    # into a notebook. The addition of ` and '__file__' in globals()` is based\n",
    "    # on https://stackoverflow.com/a/22923872/8508004\n",
    "    # See also https://stackoverflow.com/a/22424821/8508004 for an option to \n",
    "    # provide arguments when prototyping a full script in the notebook.\n",
    "    ###-----------------for parsing command line arguments-----------------------###\n",
    "    import argparse\n",
    "    parser = argparse.ArgumentParser(prog='find_sequence_element_occurrences_in_sequence.py',\n",
    "        description=\"find_sequence_element_occurrences_in_sequence.py takes \\\n",
    "        a short sequence represented as a string and a source FASTA-formatted \\\n",
    "        sequence (either a file or URL to a file) and makes an accounting of \\\n",
    "        the occurrences of that short sequence element (pattern) on both \\\n",
    "        strands of the main sequence. Exact matches only are counted as \\\n",
    "        occurrences. Matching is case-insensitive.    \\\n",
    "        **** Script by Wayne Decatur   \\\n",
    "        (fomightez @ github) ***\")\n",
    "\n",
    "    parser.add_argument(\"element\", help=\"Sequence of element to search for. \\\n",
    "        For example, to search for an EcoRI site, provided `GAATTC`, without \\\n",
    "        any quotes or backticks, in the call to the script.\\\n",
    "        \", metavar=\"ELEMENT\")\n",
    "    parser.add_argument(\"source\", help=\"Filepath or URL of FASTA-formatted \\\n",
    "        sequence to scan for occurrences of the sequence element. \\\n",
    "        \", metavar=\"SOURCE\")\n",
    "    parser.add_argument('-id', '--id_of_seq_element', action='store', type=str, \n",
    "        help=\"**OPTIONAL**Identifier \\\n",
    "        to reference sequence element. If none is provided, the sequence of \\\n",
    "        the user provided element will be used, limited to first {} bases if \\\n",
    "        exceeds that length.\".format(limit_of_name))\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "    #I would also like trigger help to display if no arguments provided because \n",
    "    # need at least one for url\n",
    "    if len(sys.argv)==1:    #from http://stackoverflow.com/questions/4042452/display-help-message-with-python-argparse-when-script-is-called-without-any-argu\n",
    "        parser.print_help()\n",
    "        sys.exit(1)\n",
    "    args = parser.parse_args()\n",
    "    element = args.element\n",
    "    source= args.source\n",
    "    id_of_seq_element= args.id_of_seq_element\n",
    "\n",
    "\n",
    "\n",
    "    main()\n",
    "\n",
    "#*******************************************************************************\n",
    "###-***********************END MAIN PORTION OF SCRIPT***********************-###\n",
    "#*******************************************************************************"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "For documenting purposes, the following lists the collected occurences:\n",
      "   seq. element  start_pos  end_pos  strand\n",
      "0         EcoRI      16724    16730       1\n",
      "1         EcoRI      17671    17677       1\n",
      "2         EcoRI      17760    17766       1\n",
      "3         EcoRI      26412    26418       1\n",
      "4         EcoRI      26639    26645       1\n",
      "5         EcoRI      29183    29189       1\n",
      "6         EcoRI      40987    40993       1\n",
      "7         EcoRI      44706    44712       1\n",
      "8         EcoRI      80961    80967       1\n",
      "9         EcoRI      81838    81844       1\n",
      "10        EcoRI      81844    81838      -1\n",
      "11        EcoRI      80967    80961      -1\n",
      "12        EcoRI      44712    44706      -1\n",
      "13        EcoRI      40993    40987      -1\n",
      "14        EcoRI      29189    29183      -1\n",
      "15        EcoRI      26645    26639      -1\n",
      "16        EcoRI      26418    26412      -1\n",
      "17        EcoRI      17766    17760      -1\n",
      "18        EcoRI      17677    17671      -1\n",
      "19        EcoRI      16730    16724      -1\n",
      "The information on the 20 occurrences of 'GAATTC' has\n",
      "been saved as a file named 'seq_account_EcoRI_chrmt.tsv'.\n",
      "\n",
      "Returning a dataframe with the information as well."
     ]
    }
   ],
   "source": [
    "source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n",
    "element = \"GAATTC\"\n",
    "id_of_seq_element = \"EcoRI\" #Set to `None` without quotes or backticks to have defined automatically\n",
    "df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>seq. element</th>\n",
       "      <th>start_pos</th>\n",
       "      <th>end_pos</th>\n",
       "      <th>strand</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>16724</td>\n",
       "      <td>16730</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>17671</td>\n",
       "      <td>17677</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>17760</td>\n",
       "      <td>17766</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>26412</td>\n",
       "      <td>26418</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>26639</td>\n",
       "      <td>26645</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>29183</td>\n",
       "      <td>29189</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>40987</td>\n",
       "      <td>40993</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>44706</td>\n",
       "      <td>44712</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>80961</td>\n",
       "      <td>80967</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>81838</td>\n",
       "      <td>81844</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>81844</td>\n",
       "      <td>81838</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>80967</td>\n",
       "      <td>80961</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>44712</td>\n",
       "      <td>44706</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>40993</td>\n",
       "      <td>40987</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>29189</td>\n",
       "      <td>29183</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>26645</td>\n",
       "      <td>26639</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>26418</td>\n",
       "      <td>26412</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>17766</td>\n",
       "      <td>17760</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>17677</td>\n",
       "      <td>17671</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>EcoRI</td>\n",
       "      <td>16730</td>\n",
       "      <td>16724</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   seq. element  start_pos  end_pos  strand\n",
       "0         EcoRI      16724    16730       1\n",
       "1         EcoRI      17671    17677       1\n",
       "2         EcoRI      17760    17766       1\n",
       "3         EcoRI      26412    26418       1\n",
       "4         EcoRI      26639    26645       1\n",
       "5         EcoRI      29183    29189       1\n",
       "6         EcoRI      40987    40993       1\n",
       "7         EcoRI      44706    44712       1\n",
       "8         EcoRI      80961    80967       1\n",
       "9         EcoRI      81838    81844       1\n",
       "10        EcoRI      81844    81838      -1\n",
       "11        EcoRI      80967    80961      -1\n",
       "12        EcoRI      44712    44706      -1\n",
       "13        EcoRI      40993    40987      -1\n",
       "14        EcoRI      29189    29183      -1\n",
       "15        EcoRI      26645    26639      -1\n",
       "16        EcoRI      26418    26412      -1\n",
       "17        EcoRI      17766    17760      -1\n",
       "18        EcoRI      17677    17671      -1\n",
       "19        EcoRI      16730    16724      -1"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "For documenting purposes, the following lists the collected occurences:\n",
      "   seq. element  start_pos  end_pos  strand\n",
      "0        GAATTC      16724    16730       1\n",
      "1        GAATTC      17671    17677       1\n",
      "2        GAATTC      17760    17766       1\n",
      "3        GAATTC      26412    26418       1\n",
      "4        GAATTC      26639    26645       1\n",
      "5        GAATTC      29183    29189       1\n",
      "6        GAATTC      40987    40993       1\n",
      "7        GAATTC      44706    44712       1\n",
      "8        GAATTC      80961    80967       1\n",
      "9        GAATTC      81838    81844       1\n",
      "10       GAATTC      81844    81838      -1\n",
      "11       GAATTC      80967    80961      -1\n",
      "12       GAATTC      44712    44706      -1\n",
      "13       GAATTC      40993    40987      -1\n",
      "14       GAATTC      29189    29183      -1\n",
      "15       GAATTC      26645    26639      -1\n",
      "16       GAATTC      26418    26412      -1\n",
      "17       GAATTC      17766    17760      -1\n",
      "18       GAATTC      17677    17671      -1\n",
      "19       GAATTC      16730    16724      -1\n",
      "The information on the 20 occurrences of 'GAATTC' has\n",
      "been saved as a file named 'seq_account_GAATTC_chrmt.tsv'.\n",
      "\n",
      "Returning a dataframe with the information as well."
     ]
    }
   ],
   "source": [
    "source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n",
    "element = \"GAATTC\"\n",
    "id_of_seq_element = None #Set to `None` without quotes or backticks to have defined automatically\n",
    "df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "For documenting purposes, the following lists the collected occurences:\n",
      "           seq. element  start_pos  end_pos  strand\n",
      "0  TTCATACTTTTTATTAA...       2705     2754       1\n",
      "The information on the 1 occurrences of 'TTCATACTTTTTATTAATATAATATAATATAATATTATTAATACTTTCT' has\n",
      "been saved as a file named 'seq_account_TTCATACTTTTTATTAA-_chrmt.tsv'.\n",
      "\n",
      "Returning a dataframe with the information as well."
     ]
    }
   ],
   "source": [
    "source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n",
    "element = \"TTCATACTTTTTATTAATATAATATAATATAATATTATTAATACTTTCT\"\n",
    "id_of_seq_element = None #Set to `None` without quotes or backticks to have defined automatically\n",
    "df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>seq. element</th>\n",
       "      <th>start_pos</th>\n",
       "      <th>end_pos</th>\n",
       "      <th>strand</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>TTCATACTTTTTATTAA...</td>\n",
       "      <td>2705</td>\n",
       "      <td>2754</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           seq. element  start_pos  end_pos  strand\n",
       "0  TTCATACTTTTTATTAA...       2705     2754       1"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "For documenting purposes, the following lists the collected occurences:\n",
      "        seq. element  start_pos  end_pos  strand\n",
      "0  TTCATACTTTTTATTAA       2705     2722       1\n",
      "The information on the 1 occurrences of 'TTCATACTTTTTATTAA' has\n",
      "been saved as a file named 'seq_account_TTCATACTTTTTATTAA_chrmt.tsv'.\n",
      "\n",
      "Returning a dataframe with the information as well."
     ]
    }
   ],
   "source": [
    "source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n",
    "element = \"TTCATACTTTTTATTAA\"\n",
    "id_of_seq_element = None #Set to `None` without quotes or backticks to have defined automatically\n",
    "df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>seq. element</th>\n",
       "      <th>start_pos</th>\n",
       "      <th>end_pos</th>\n",
       "      <th>strand</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>TTCATACTTTTTATTAA</td>\n",
       "      <td>2705</td>\n",
       "      <td>2722</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        seq. element  start_pos  end_pos  strand\n",
       "0  TTCATACTTTTTATTAA       2705     2722       1"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For loading the script into a cell from Github, see the 'Skipping pasting by loading into a cell' section of [here](https://github.com/fomightez/sequencework/blob/master/circos-utilities/demo%20UCSC_chrom_sizes_2_circos_karyotype%20script.ipynb) to understand the paradigm and then call the core function similar to the cells just above this cell."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-----\n",
    "\n",
    "## Running the script with import of the main function\n",
    "\n",
    "\n",
    "This is similar to the last section,['Running core function of the script after loading into a cell'](#Running-core-function-of-the-script-after-loading-into-a-cell), but here we take advantage of Python's import statement to do what we did by pasting or loading code into a cell and running it.\n",
    "\n",
    "First insure the script is available where you are running. Running the next command will do that here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100 18500  100 18500    0     0   240k      0 --:--:-- --:--:-- --:--:--  240k\n"
     ]
    }
   ],
   "source": [
    "!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/FindSequence/find_sequence_element_occurrences_in_sequence.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then import the main function of the script to the notebook's active computational environment via an import statement. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "from find_sequence_element_occurrences_in_sequence import find_sequence_element_occurrences_in_sequence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(As written above the command to do that looks a bit redundant; however, the first `from` part of the command below actually is referencing the `find_sequence_element_occurrences_in_sequence` script, but it doesn't need the `.py` extension because the `import` only deals with such files.)\n",
    "\n",
    "With the main function imported, it is now available to be run. As with the above method to run the script, the variables involved need to be defined before calling the function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "For documenting purposes, the following lists the collected occurences:\n",
      "   seq. element  start_pos  end_pos  strand\n",
      "0         EcoRI      16724    16730       1\n",
      "1         EcoRI      17671    17677       1\n",
      "2         EcoRI      17760    17766       1\n",
      "3         EcoRI      26412    26418       1\n",
      "4         EcoRI      26639    26645       1\n",
      "5         EcoRI      29183    29189       1\n",
      "6         EcoRI      40987    40993       1\n",
      "7         EcoRI      44706    44712       1\n",
      "8         EcoRI      80961    80967       1\n",
      "9         EcoRI      81838    81844       1\n",
      "10        EcoRI      81844    81838      -1\n",
      "11        EcoRI      80967    80961      -1\n",
      "12        EcoRI      44712    44706      -1\n",
      "13        EcoRI      40993    40987      -1\n",
      "14        EcoRI      29189    29183      -1\n",
      "15        EcoRI      26645    26639      -1\n",
      "16        EcoRI      26418    26412      -1\n",
      "17        EcoRI      17766    17760      -1\n",
      "18        EcoRI      17677    17671      -1\n",
      "19        EcoRI      16730    16724      -1\n",
      "The information on the 20 occurrences of 'GAATTC' has\n",
      "been saved as a file named 'seq_account_EcoRI_chrmt.tsv'.\n",
      "\n",
      "Returning a dataframe with the information as well."
     ]
    }
   ],
   "source": [
    "source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n",
    "element = \"GAATTC\"\n",
    "id_of_seq_element = \"EcoRI\" #Set to `None` without quotes or backticks to have defined automatically\n",
    "df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Complex use case example: process multiple searches for sequence elements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100  9726  100  9726    0     0  94427      0 --:--:-- --:--:-- --:--:-- 94427\n"
     ]
    }
   ],
   "source": [
    "!curl -O https://static-content.springer.com/esm/art%3A10.1186%2Fs12864-015-1664-4/MediaObjects/12864_2015_1664_MOESM9_ESM.xlsx"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: xlrd in /srv/conda/envs/notebook/lib/python3.7/site-packages (2.0.1)\n"
     ]
    }
   ],
   "source": [
    "!pip install xlrd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Class</th>\n",
       "      <th>Consensus</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>M1</td>\n",
       "      <td>TTCCGGGGCCCGGCCACGGGAGCCGGAACCCCGAAAGGAG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>M1'</td>\n",
       "      <td>TTCCCGCTTCGCGGGAACCCCGTAAGGAG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>M2</td>\n",
       "      <td>TCCGGCCCGCCCCGCGGGGCGGACCCCGAAGGAG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>M2'</td>\n",
       "      <td>TCCCCGCCCCGGCGGGGACCCCGAAGGAG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>M2''</td>\n",
       "      <td>TCCGGCCGAAGGAG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>M3</td>\n",
       "      <td>TGAGGGACCCCTCCCTATACTATGGGAGGGGGACCGAACCCTTTAA...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>M4</td>\n",
       "      <td>TGAACACCTTTATTTAATTATAAAGGTGTGAACCAATCCGCAAGGCAAG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>G</td>\n",
       "      <td>CCCGGTTTCTTACGAAACCGGGACCTCGGAGACGT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>V</td>\n",
       "      <td>CCCCGCGGGCGCCAATCCGGTTGTTCACCGGATTGGTCCCGCGGGG</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Class                                          Consensus\n",
       "0    M1           TTCCGGGGCCCGGCCACGGGAGCCGGAACCCCGAAAGGAG\n",
       "1   M1'                      TTCCCGCTTCGCGGGAACCCCGTAAGGAG\n",
       "2    M2                 TCCGGCCCGCCCCGCGGGGCGGACCCCGAAGGAG\n",
       "3   M2'                      TCCCCGCCCCGGCGGGGACCCCGAAGGAG\n",
       "4  M2''                                     TCCGGCCGAAGGAG\n",
       "5    M3  TGAGGGACCCCTCCCTATACTATGGGAGGGGGACCGAACCCTTTAA...\n",
       "6    M4  TGAACACCTTTATTTAATTATAAAGGTGTGAACCAATCCGCAAGGCAAG\n",
       "7     G                CCCGGTTTCTTACGAAACCGGGACCTCGGAGACGT\n",
       "8     V     CCCCGCGGGCGCCAATCCGGTTGTTCACCGGATTGGTCCCGCGGGG"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "GC_df = pd.read_excel('12864_2015_1664_MOESM9_ESM.xlsx')\n",
    "GC_df.rename(columns={\"Consensus Sequence 5' -> 3'\":'Consensus'}, inplace=True) # I was having difficulty using bracket notation and so this makes easier\n",
    "GC_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "For documenting purposes, the following lists the collected occurences:\n",
      "   seq. element  start_pos  end_pos  strand\n",
      "0            M1       2097     2137       1\n",
      "1            M1       3062     3102       1\n",
      "2            M1       8238     8278       1\n",
      "3            M1       9034     9074       1\n",
      "4            M1      11049    11089       1\n",
      "5            M1      11508    11548       1\n",
      "6            M1      34289    34329       1\n",
      "7            M1      34937    34977       1\n",
      "8            M1      46317    46357       1\n",
      "9            M1      48104    48144       1\n",
      "10           M1      50113    50153       1\n",
      "11           M1      53829    53869       1\n",
      "12           M1      62864    62904       1\n",
      "13           M1      66829    66869       1\n",
      "14           M1      69446    69486       1\n",
      "15           M1      76597    76637       1\n",
      "16           M1      80505    80545       1\n",
      "17           M1      81406    81446       1\n",
      "18           M1      83056    83096       1\n",
      "19           M1      75661    75621      -1\n",
      "20           M1      71254    71214      -1\n",
      "21           M1      57070    57030      -1\n",
      "22           M1      52372    52332      -1\n",
      "23           M1      47515    47475      -1\n",
      "24           M1       9530     9490      -1\n",
      "25           M1       2792     2752      -1\n",
      "26           M1        125       85      -1\n",
      "The information on the 27 occurrences of 'TTCCGGGGCCCGGCCACGGGAGCCGGAACCCCGAAAGGAG' has\n",
      "been saved as a file named 'seq_account_M1_chrmt.tsv'.\n",
      "\n",
      "Returning a dataframe with the information as well.\n",
      "No occurrences of 'TTCCCGCTTCGCGGGAACCCCGTAAGGAG' found in the provided sequence.\n",
      "\n",
      "No data to return in a dataframe and so returning `None`.\n",
      "For documenting purposes, the following lists the collected occurences:\n",
      "  seq. element  start_pos  end_pos  strand\n",
      "0           M2      65101    65067      -1\n",
      "The information on the 1 occurrences of 'TCCGGCCCGCCCCGCGGGGCGGACCCCGAAGGAG' has\n",
      "been saved as a file named 'seq_account_M2_chrmt.tsv'.\n",
      "\n",
      "Returning a dataframe with the information as well.\n",
      "For documenting purposes, the following lists the collected occurences:\n",
      "  seq. element  start_pos  end_pos  strand\n",
      "0          M2'       2855     2884       1\n",
      "1          M2'      34171    34200       1\n",
      "2          M2'      73206    73235       1\n",
      "3          M2'      68689    68660      -1\n",
      "4          M2'      56701    56672      -1\n",
      "5          M2'       4183     4154      -1\n",
      "The information on the 6 occurrences of 'TCCCCGCCCCGGCGGGGACCCCGAAGGAG' has\n",
      "been saved as a file named 'seq_account_M2'_chrmt.tsv'.\n",
      "\n",
      "Returning a dataframe with the information as well.\n",
      "For documenting purposes, the following lists the collected occurences:\n",
      "  seq. element  start_pos  end_pos  strand\n",
      "0         M2''      57860    57874       1\n",
      "1         M2''      84762    84748      -1\n",
      "2         M2''      72792    72778      -1\n",
      "3         M2''      65067    65053      -1\n",
      "4         M2''       8668     8654      -1\n",
      "The information on the 5 occurrences of 'TCCGGCCGAAGGAG' has\n",
      "been saved as a file named 'seq_account_M2''_chrmt.tsv'.\n",
      "\n",
      "Returning a dataframe with the information as well.\n",
      "No occurrences of 'TGAGGGACCCCTCCCTATACTATGGGAGGGGGACCGAACCCTTTAAAGAAGAG' found in the provided sequence.\n",
      "\n",
      "No data to return in a dataframe and so returning `None`.\n",
      "No occurrences of 'TGAACACCTTTATTTAATTATAAAGGTGTGAACCAATCCGCAAGGCAAG' found in the provided sequence.\n",
      "\n",
      "No data to return in a dataframe and so returning `None`.\n",
      "For documenting purposes, the following lists the collected occurences:\n",
      "  seq. element  start_pos  end_pos  strand\n",
      "0            G      30185    30220       1\n",
      "1            G      44891    44926       1\n",
      "2            G      12816    12781      -1\n",
      "3            G       4348     4313      -1\n",
      "The information on the 4 occurrences of 'CCCGGTTTCTTACGAAACCGGGACCTCGGAGACGT' has\n",
      "been saved as a file named 'seq_account_G_chrmt.tsv'.\n",
      "\n",
      "Returning a dataframe with the information as well.\n",
      "For documenting purposes, the following lists the collected occurences:\n",
      "  seq. element  start_pos  end_pos  strand\n",
      "0            V      47719    47765       1\n",
      "1            V      49089    49135       1\n",
      "The information on the 2 occurrences of 'CCCCGCGGGCGCCAATCCGGTTGTTCACCGGATTGGTCCCGCGGGG' has\n",
      "been saved as a file named 'seq_account_V_chrmt.tsv'.\n",
      "\n",
      "Returning a dataframe with the information as well."
     ]
    }
   ],
   "source": [
    "for row in GC_df.itertuples():\n",
    "    #print(row) # for debugging\n",
    "    source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n",
    "    element = row.Consensus\n",
    "    id_of_seq_element = row.Class\n",
    "    df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)\n",
    "    #print(df)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}