{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Examples of running the script\n", "\n", "There are several options to run `find_sequence_element_occurrences_in_sequence` or utilize the core function.\n", "\n", "This notebook will demonstrate a few ways of using this script:\n", "\n", "- [pasting it into a notebook cell or loading it into a cell](#Running-core-function-of-the-scrip-after-loading-into-a-cell)\n", "- [importing the main function](#importing-the-main-function)\n", "\n", "Notably, it won't cover running the script after upload. An approach to that is illustrated [here](https://github.com/fomightez/sequencework/blob/master/circos-utilities/demo%20UCSC_chrom_sizes_2_circos_karyotype%20script.ipynb) for a different script, but that should serve as a good guide combined with getting the `USAGE` info by running it with the `--help/-h` flag. \n", "\n", "(If you are having any problems at all doing any of this because of Python or needed dependencies, such as Bioython, this notebook was developed in the enviromenment launchable by pressing `Launch binder` badge [here](https://github.com/fomightez/qgrid-notebooks). You could always launch that environment and upload this notebook there and things should work.)\n", "\n", "**It will also demonstrate a more complex use of the core function to process multiple searches for sequence elements in a manner reminiscent of a [mini-pipeline](https://github.com/fomightez/mini-pipelines).**\n", "\n", "## Running as script file\n", "\n", "Similar to how one would run a script from the command line.\n", "\n", "Upload the script to the directory where you want to run it. Or upload it to a running Jupyter environment.\n", "\n", "A similar approach to the latter is illustrated [here](https://github.com/fomightez/sequencework/blob/master/circos-utilities/demo%20UCSC_chrom_sizes_2_circos_karyotype%20script.ipynb) for a different script but that should serve as a good guide combined with getting the `USAGE` info by running it with the `--help/-h` flag.\n", "\n", "-----\n", "\n", "## Running core function of the script after loading into a cell\n", "\n", "It can be pasted into a cell or loaded from github. Those will be demonstrated in this section of the notebook.\n", "\n", "First part of this section will cover pasting script into a cell. \n", "In the next cell is the script (although it might not be the most up-to-date version, and so it would be best to get and paste the most-up-to-date version from Github or use the `%load `approach to fetch it from Github directly into a cell, as illustrated [here](https://github.com/fomightez/sequencework/blob/master/circos-utilities/demo%20UCSC_chrom_sizes_2_circos_karyotype%20script.ipynb))" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: regex in /srv/conda/envs/notebook/lib/python3.7/site-packages (2021.11.10)\n", "Note: you may need to restart the kernel to use updated packages.\n", "Requirement already satisfied: openpyxl in /srv/conda/envs/notebook/lib/python3.7/site-packages (3.0.9)\n", "Requirement already satisfied: et-xmlfile in /srv/conda/envs/notebook/lib/python3.7/site-packages (from openpyxl) (1.1.0)\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install regex\n", "%pip install openpyxl" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "#!/usr/bin/env python\n", "# find_sequence_element_occurrences_in_sequence.py\n", "__author__ = \"Wayne Decatur\" #fomightez on GitHub\n", "__license__ = \"MIT\"\n", "__version__ = \"0.1.0\"\n", "\n", "\n", "# find_sequence_element_occurrences_in_sequence.py by Wayne Decatur\n", "# ver 0.1\n", "#\n", "#*******************************************************************************\n", "# Verified compatible with both Python 2.7 and Python 3.6; written initially in \n", "# Python 3.\n", "#\n", "# PURPOSE: Takes a short sequence represented as a string and a FASTA-formatted\n", "# sequence (either a file or URL to a file) and makes an accounting of the \n", "# occurrences of that short sequence element on both strands of the main \n", "# sequence. Exact matches only are counted as occurrences.\n", "# The script is case-insensitive, meaning the case of either the element or \n", "# FASTA sequence provided do not matter and matches will be noted anyway.\n", "#\n", "# As written, the script expects and only processes one FASTA-formatted \n", "# sequence. If your, FASTA file has more than one sequence entry within it and \n", "# the one you want to have scanned is not the first, copy and paste it to a new \n", "# file and use that as the sequence to scan.\n", "#\n", "# Written to run from command line or pasted/loaded inside a Jupyter notebook \n", "# cell or imported. \n", "#\n", "#\n", "#\n", "# This script based on work and musings developed in \n", "# `Resources for writing code to do GC cluster accounting.md`\n", "#\n", "# The script at https://www.biostars.org/p/209383/ (steve's answer) served as \n", "# the backbone for adding convenience and user-friendly features.\n", "#\n", "#\n", "#\n", "# Dependencies beyond the mostly standard libraries/modules:\n", "# - Biopython\n", "#\n", "#\n", "# VERSION HISTORY:\n", "# v.0.1. basic working version\n", "#\n", "# To do:\n", "# - \n", "#\n", "#\n", "#\n", "# TO RUN:\n", "# Examples,\n", "# Enter on the command line of your terminal, the line\n", "#-----------------------------------\n", "# python find_sequence_element_occurrences_in_sequence.py GAATTC sequences.fasta\n", "\n", "#-OR-\n", "# python find_sequence_element_occurrences_in_sequence.py GAATTC sequences.fasta URL\n", "#-----------------------------------\n", "# Issue `python find_sequence_element_occurrences_in_sequence.py -h` for details.\n", "# \n", "#\n", "# To use this after pasting or loading into a cell in a Jupyter notebook, in\n", "# the next cell define critical variables and then call the main function \n", "# similar to below:\n", "# source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n", "# element = \"GAATTC\"\n", "# id_of_seq_element = \"ele1\" #Set to `None` without quotes or backticks to have defined automatically\n", "# find_sequence_element_occurrences_in_sequence(source, element, id_of_seq_element, return_dataframe = True)\n", "#\n", "# Something similar would need to be done if importing the script into another.\n", "# (`id_of_seq_scanned_hardcoded` can be assigned in a cell before calling the \n", "# function as well.)\n", "#\n", "# \n", "#\n", "'''\n", "CURRENT ACTUAL CODE FOR RUNNING/TESTING IN A NOTEBOOK WHEN LOADED OR PASTED IN \n", "ANOTHER CELL:\n", "source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n", "element = \"GAATTC\"\n", "id_of_seq_element = \"EcoRI\" #Set to `None` without quotes or backticks to have defined automatically\n", "df = find_sequence_element_occurrences_in_sequence(source, element, id_of_seq_element, return_dataframe = True)\n", "\n", "\n", "'''\n", "#\n", "#\n", "#*******************************************************************************\n", "#\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "#*******************************************************************************\n", "##################################\n", "# USER ADJUSTABLE VALUES #\n", "\n", "##################################\n", "#\n", "output_file_name_prefix = \"seq_account\"\n", "id_of_seq_scanned_hardcoded = None # replace `None` with what you want to use,\n", "# with flanking quotes if something appropriate is not being extracted from the\n", "# provided filepath/filename or URL to save as an indicator of the file scanned\n", "# in the output file name. \n", "\n", "\n", "\n", "limit_of_name = 17 # number of bases of the sequence element to limit to using \n", "# if the sequence element sequence is used to make the name for the output file\n", "\n", "\n", "\n", "\n", "\n", "#\n", "#*******************************************************************************\n", "#**********************END USER ADJUSTABLE VARIABLES****************************\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "#*******************************************************************************\n", "#*******************************************************************************\n", "###DO NOT EDIT BELOW HERE - ENTER VALUES ABOVE###\n", "\n", "import sys\n", "import os\n", "import regex\n", "import pandas as pd\n", "from Bio import SeqIO\n", "from Bio.Seq import Seq # for reverse complement\n", "\n", "\n", "\n", "\n", "###---------------------------HELPER FUNCTIONS---------------------------------###\n", "\n", "\n", "def get_seq_element_representation(id_of_seq_element):\n", " '''\n", " Takes `id_of_seq_element` and returns a string represenation for using\n", " in the output file name and the dataframe.\n", " '''\n", " if id_of_seq_element:\n", " elem_id = id_of_seq_element\n", " elif len(element) > limit_of_name:\n", " elem_id = element[:limit_of_name]+\"...\"\n", " else:\n", " elem_id = element\n", " return elem_id\n", "\n", "\n", "\n", "def generate_output_file_name(id_of_seq_element,id_of_seq_scanned):\n", " '''\n", " Takes a file name as an argument and returns string for the name of the\n", " output file. The generated name is based on a prefix that can be adjusted\n", " under the section ' USER ADJUSTABLE VALUES ', plus the provided text in \n", " the function call.\n", " If there is no `id_of_seq_element` specified, the sequence element will be\n", " used, limited to the first number of bases specified in `limit_of_name`.\n", "\n", " Specific examples\n", " =================\n", " Calling function with\n", " (\"elem1\",\"chrmt\")\n", " returns\n", " \"seq_account_elem1_chrmt.tsv\"\n", "\n", " Calling function with\n", " (None,\"chrmt\")\n", " returns\n", " \"seq_account_GAATTC_chrmt.tsv\"\n", " if `GAATTC` happened to be the provided sequence element.\n", " '''\n", " elem_id = get_seq_element_representation(id_of_seq_element)\n", " if elem_id.endswith(\"...\") and (id_of_seq_element == None):\n", " # Because in order to use `get_seq_element_representation()` in all the \n", " # places something similar is needed, I have it adding `...` if the\n", " # sequence of the element exceeds the limit size. Since I don't want\n", " # that in the file name I am going to remove if it seems I had added it,\n", " # and I will only have possibly added it if `id_of_seq_element` is None.\n", " # This extra check is mitigate chances I'll remove `...` in unlikely\n", " # possibility user wanted to include it in id for sequence element. \n", " elem_id = elem_id[:-3] + \"-\" #add hyphen to indicate there is more \n", " return \"{prefix}_{elem_id}_{seqid}.tsv\".format(\n", " prefix=output_file_name_prefix,elem_id=elem_id,\n", " seqid=id_of_seq_scanned)\n", "\n", "\n", "def extract_id_of_seq_scanned(source):\n", " '''\n", " Take something like:\n", " https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\n", "\n", " -or- \n", "\n", " /local/directory1/directory2/chrmt.fa\n", "\n", " -or-\n", "\n", " chrmt.fa\n", "\n", " And return:\n", " chrmt\n", " '''\n", " if \"/\" in source:\n", " last_bit = source.split(\"/\")[-1]\n", " else:\n", " last_bit = source\n", " if '.' in last_bit:\n", " main_part_of_name, file_extension = os.path.splitext(\n", " last_bit) #from http://stackoverflow.com/questions/541390/extracting-extension-from-filename-in-python\n", " return main_part_of_name\n", " else:\n", " return last_bit\n", "\n", "\n", "def get_seq_from_URL(url):\n", " '''\n", " takes a URL and gets the sequence\n", " '''\n", " try:\n", " from StringIO import StringIO\n", " except ImportError:\n", " from io import StringIO\n", "\n", " # Getting html originally for just Python 3, adapted from \n", " # https://stackoverflow.com/a/17510727/8508004 and then updated from to \n", " # handle Python 2 and 3 according to same link.\n", " try:\n", " # For Python 3.0 and later\n", " from urllib.request import urlopen\n", " except ImportError:\n", " # Fall back to Python 2's urllib2\n", " from urllib2 import urlopen\n", " html = urlopen(url)\n", " fasta_iterator = SeqIO.parse(StringIO(\n", " html.read().decode(encoding='UTF-8')), \"fasta\")\n", " # Use of `next()` on next line to get first FASTA -formatted sequence is \n", " # based on http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html\n", " # I think difference from `SeqIO.read()` in this approach is that it won't\n", " # give an error if more than one entry is in the html.\n", " # I found I needed `StringIO()` or got issues with trying to handle long file name.\n", " record = next(fasta_iterator)\n", " return record.seq\n", "\n", "def get_fasta_seq(source):\n", " '''\n", " Takes a source URL or filepath/ file name and gets sequence if it is in\n", " FASTA format.\n", " It won't return anything if what is provided isn't FASTA format because\n", " Biopython handles both trying to extract FASTA from URL and from file.\n", " See https://stackoverflow.com/a/44294079/8508004.\n", " Placing it in a function it easy to then check and provide some feedback.\n", " '''\n", " if source.lower().startswith(\"http\"):\n", " return get_seq_from_URL(source)\n", " else:\n", " # Read sequence, treating source as a filepath.\n", " # Use of `with` on next line based on http://biopython.org/wiki/SeqIO , \n", " # under \"Sequence Input\". Otherwise, backbone based on \n", " # https://www.biostars.org/p/209383/, and fact `rU` mode depecated.\n", " with open(source, \"r\") as handle:\n", " for record in SeqIO.parse(handle, \"fasta\"):\n", " # print(record.seq) # for debugging\n", " return record.seq\n", "\n", "\n", "def search_strand(pattern, sequence_to_scan, strand=1):\n", " '''\n", " take a sequence pattern (element) and find occurrences of that on the \n", " provided, larger 5'-->3' sequence.\n", " Assumes strand is first unless provided.\n", "\n", " Tracks the start and end points of each occurrence, returning a list of\n", " that information where each element is a tuple of the start and end points\n", " along with the strand.\n", "\n", " Works with overlapped sequences because now \n", " \"regex.findall and regex.finditer support an ‘overlapped’ flag which \n", " permits overlapped matches.\"\n", " , see https://pypi.python.org/pypi/regex/2018.02.21\n", "\n", " based on https://www.biostars.org/p/209383/ (specifically steve's answer)\n", " '''\n", " occurrences = []\n", " for match in regex.finditer(\n", " pattern.upper(), str(sequence_to_scan.upper()),overlapped=True):\n", " if strand == 1:\n", " start_pos = match.start() + 1\n", " end_pos = match.end() + 1\n", " else:\n", " start_pos = (len(sequence_to_scan) - match.start() ) + 1\n", " end_pos = (len(sequence_to_scan) - match.end() ) + 1\n", " # print (start_pos, '\\t', end_pos, '\\t',strand) # for debugging\n", " occurrences.append((start_pos, end_pos,strand))\n", " return occurrences\n", "\n", "###--------------------------END OF HELPER FUNCTIONS---------------------------###\n", "###--------------------------END OF HELPER FUNCTIONS---------------------------###\n", "\n", "\n", "\n", "\n", "#*******************************************************************************\n", "###------------------------'main' function of script---------------------------##\n", "def find_sequence_element_occurrences_in_sequence(\n", " source, element, id_of_seq_element, return_dataframe = False):\n", " '''\n", " Main function of script. Scan a sequence (source) and report on occurrences \n", " of a sub-sequence element in that sequence.\n", "\n", " Returns None\n", " Unless `return_dataframe = True`, and then it returns a dataframe of \n", " accounting as well. That option being meant when using this script in a cell\n", " in a Jupyter notebook or importing it into another script.\n", " '''\n", " # get the fasta_seq to scan\n", " fasta_seq = get_fasta_seq(source)\n", "\n", "\n", " assert fasta_seq, (\n", " \"The provided source of the FASTA-formatted sequence seems invalid. Is it \"\n", " \"FASTA format?\")\n", "\n", " # With the approach in this next block, I can expose `id_of_seq_scanned` to \n", " # setting for advanced use without it being required and without need to be \n", " # passed into the function.\n", " if id_of_seq_scanned_hardcoded:\n", " id_of_seq_scanned = id_of_seq_scanned_hardcoded\n", " else:\n", " id_of_seq_scanned = extract_id_of_seq_scanned(source)\n", " \n", "\n", "\n", " #assert that element cannot be longer than fasta_seq\n", " assert len(element) < len(fasta_seq), (\n", " \"the FASTA sequence has to be longer than the provided sequence element.\\n\"\n", " \"The provided FASTA sequence, {0}, is {1} bases;\\nthe provided sequence \"\n", " \"element '{2}' is {3} bases.\".format(\n", " id_of_seq_scanned,len(fasta.seq),element,len(element)))\n", "\n", " occurrences = search_strand(element, fasta_seq) # first strand\n", " occurrences += search_strand(\n", " element, fasta_seq.reverse_complement(), strand=-1) #2nd strand\n", " if occurrences:\n", " # make it into a dataframe since provides convenient options for \n", " # handling\n", " df = pd.DataFrame(occurrences, columns=['start_pos','end_pos','strand'])\n", " # add some useful information to the dataframe\n", " df[\"seq. element\"] = get_seq_element_representation(id_of_seq_element)\n", " df = df[['seq. element','start_pos','end_pos','strand']] # 'seq.element'\n", " # column will show on far right otherwise\n", "\n", " #print(updated_sites_df)#originally for debugging during development,added..\n", " # Document the full set of data collected in the terminal or \n", " # Jupyter notebook display in some manner. \n", " # Using `df.to_string()` because more universal than `print(df)` \n", " # or Jupyter's `display(df)`.\n", " sys.stderr.write( \"\\nFor documenting purposes, the following lists the \"\n", " \"collected occurences:\\n\")\n", " #with pd.option_context('display.max_rows', None, 'display.max_columns', None):\n", " # display(df)\n", " sys.stderr.write(df.to_string())\n", "\n", " # write to tab-delimited file\n", " output_file_name = generate_output_file_name(\n", " id_of_seq_element,id_of_seq_scanned )\n", " df.to_csv(output_file_name, sep='\\t',index = False)\n", " sys.stderr.write( \"\\nThe information on the {0} occurrences of '{1}' \"\n", " \"has\\nbeen saved as a file named\"\n", " \" '{2}'.\".format(len(df),element,output_file_name))\n", "\n", " if return_dataframe:\n", " sys.stderr.write( \"\\n\\nReturning a dataframe with the information \"\n", " \"as well.\")\n", " return df\n", " else:\n", " sys.stderr.write( \"\\nNo occurrences of '{0}' \"\n", " \"found in the provided sequence.\".format(element))\n", " if return_dataframe:\n", " sys.stderr.write( \"\\n\\nNo data to return in a dataframe and so \"\n", " \"returning `None`.\")\n", " return None\n", "\n", "\n", "###--------------------------END OF MAIN FUNCTION----------------------------###\n", "###--------------------------END OF MAIN FUNCTION----------------------------###\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "#*******************************************************************************\n", "###------------------------'main' section of script---------------------------##\n", "\n", "def main():\n", " \"\"\" Main entry point of the script \"\"\"\n", " # placing actual main action in a 'helper' script so can call that easily \n", " # with a distinguishing name in Jupyter notebooks, where `main()` may get\n", " # assigned multiple times depending how many scripts imported/pasted in.\n", " find_sequence_element_occurrences_in_sequence(\n", " source, element, id_of_seq_element)\n", "\n", " \n", "\n", "if __name__ == \"__main__\" and '__file__' in globals():\n", " \"\"\" This is executed when run from the command line \"\"\"\n", " # Code with just `if __name__ == \"__main__\":` alone will be run if pasted\n", " # into a notebook. The addition of ` and '__file__' in globals()` is based\n", " # on https://stackoverflow.com/a/22923872/8508004\n", " # See also https://stackoverflow.com/a/22424821/8508004 for an option to \n", " # provide arguments when prototyping a full script in the notebook.\n", " ###-----------------for parsing command line arguments-----------------------###\n", " import argparse\n", " parser = argparse.ArgumentParser(prog='find_sequence_element_occurrences_in_sequence.py',\n", " description=\"find_sequence_element_occurrences_in_sequence.py takes \\\n", " a short sequence represented as a string and a source FASTA-formatted \\\n", " sequence (either a file or URL to a file) and makes an accounting of \\\n", " the occurrences of that short sequence element (pattern) on both \\\n", " strands of the main sequence. Exact matches only are counted as \\\n", " occurrences. Matching is case-insensitive. \\\n", " **** Script by Wayne Decatur \\\n", " (fomightez @ github) ***\")\n", "\n", " parser.add_argument(\"element\", help=\"Sequence of element to search for. \\\n", " For example, to search for an EcoRI site, provided `GAATTC`, without \\\n", " any quotes or backticks, in the call to the script.\\\n", " \", metavar=\"ELEMENT\")\n", " parser.add_argument(\"source\", help=\"Filepath or URL of FASTA-formatted \\\n", " sequence to scan for occurrences of the sequence element. \\\n", " \", metavar=\"SOURCE\")\n", " parser.add_argument('-id', '--id_of_seq_element', action='store', type=str, \n", " help=\"**OPTIONAL**Identifier \\\n", " to reference sequence element. If none is provided, the sequence of \\\n", " the user provided element will be used, limited to first {} bases if \\\n", " exceeds that length.\".format(limit_of_name))\n", "\n", "\n", "\n", "\n", "\n", " #I would also like trigger help to display if no arguments provided because \n", " # need at least one for url\n", " if len(sys.argv)==1: #from http://stackoverflow.com/questions/4042452/display-help-message-with-python-argparse-when-script-is-called-without-any-argu\n", " parser.print_help()\n", " sys.exit(1)\n", " args = parser.parse_args()\n", " element = args.element\n", " source= args.source\n", " id_of_seq_element= args.id_of_seq_element\n", "\n", "\n", "\n", " main()\n", "\n", "#*******************************************************************************\n", "###-***********************END MAIN PORTION OF SCRIPT***********************-###\n", "#*******************************************************************************" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "For documenting purposes, the following lists the collected occurences:\n", " seq. element start_pos end_pos strand\n", "0 EcoRI 16724 16730 1\n", "1 EcoRI 17671 17677 1\n", "2 EcoRI 17760 17766 1\n", "3 EcoRI 26412 26418 1\n", "4 EcoRI 26639 26645 1\n", "5 EcoRI 29183 29189 1\n", "6 EcoRI 40987 40993 1\n", "7 EcoRI 44706 44712 1\n", "8 EcoRI 80961 80967 1\n", "9 EcoRI 81838 81844 1\n", "10 EcoRI 81844 81838 -1\n", "11 EcoRI 80967 80961 -1\n", "12 EcoRI 44712 44706 -1\n", "13 EcoRI 40993 40987 -1\n", "14 EcoRI 29189 29183 -1\n", "15 EcoRI 26645 26639 -1\n", "16 EcoRI 26418 26412 -1\n", "17 EcoRI 17766 17760 -1\n", "18 EcoRI 17677 17671 -1\n", "19 EcoRI 16730 16724 -1\n", "The information on the 20 occurrences of 'GAATTC' has\n", "been saved as a file named 'seq_account_EcoRI_chrmt.tsv'.\n", "\n", "Returning a dataframe with the information as well." ] } ], "source": [ "source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n", "element = \"GAATTC\"\n", "id_of_seq_element = \"EcoRI\" #Set to `None` without quotes or backticks to have defined automatically\n", "df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
seq. elementstart_posend_posstrand
0EcoRI16724167301
1EcoRI17671176771
2EcoRI17760177661
3EcoRI26412264181
4EcoRI26639266451
5EcoRI29183291891
6EcoRI40987409931
7EcoRI44706447121
8EcoRI80961809671
9EcoRI81838818441
10EcoRI8184481838-1
11EcoRI8096780961-1
12EcoRI4471244706-1
13EcoRI4099340987-1
14EcoRI2918929183-1
15EcoRI2664526639-1
16EcoRI2641826412-1
17EcoRI1776617760-1
18EcoRI1767717671-1
19EcoRI1673016724-1
\n", "
" ], "text/plain": [ " seq. element start_pos end_pos strand\n", "0 EcoRI 16724 16730 1\n", "1 EcoRI 17671 17677 1\n", "2 EcoRI 17760 17766 1\n", "3 EcoRI 26412 26418 1\n", "4 EcoRI 26639 26645 1\n", "5 EcoRI 29183 29189 1\n", "6 EcoRI 40987 40993 1\n", "7 EcoRI 44706 44712 1\n", "8 EcoRI 80961 80967 1\n", "9 EcoRI 81838 81844 1\n", "10 EcoRI 81844 81838 -1\n", "11 EcoRI 80967 80961 -1\n", "12 EcoRI 44712 44706 -1\n", "13 EcoRI 40993 40987 -1\n", "14 EcoRI 29189 29183 -1\n", "15 EcoRI 26645 26639 -1\n", "16 EcoRI 26418 26412 -1\n", "17 EcoRI 17766 17760 -1\n", "18 EcoRI 17677 17671 -1\n", "19 EcoRI 16730 16724 -1" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "For documenting purposes, the following lists the collected occurences:\n", " seq. element start_pos end_pos strand\n", "0 GAATTC 16724 16730 1\n", "1 GAATTC 17671 17677 1\n", "2 GAATTC 17760 17766 1\n", "3 GAATTC 26412 26418 1\n", "4 GAATTC 26639 26645 1\n", "5 GAATTC 29183 29189 1\n", "6 GAATTC 40987 40993 1\n", "7 GAATTC 44706 44712 1\n", "8 GAATTC 80961 80967 1\n", "9 GAATTC 81838 81844 1\n", "10 GAATTC 81844 81838 -1\n", "11 GAATTC 80967 80961 -1\n", "12 GAATTC 44712 44706 -1\n", "13 GAATTC 40993 40987 -1\n", "14 GAATTC 29189 29183 -1\n", "15 GAATTC 26645 26639 -1\n", "16 GAATTC 26418 26412 -1\n", "17 GAATTC 17766 17760 -1\n", "18 GAATTC 17677 17671 -1\n", "19 GAATTC 16730 16724 -1\n", "The information on the 20 occurrences of 'GAATTC' has\n", "been saved as a file named 'seq_account_GAATTC_chrmt.tsv'.\n", "\n", "Returning a dataframe with the information as well." ] } ], "source": [ "source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n", "element = \"GAATTC\"\n", "id_of_seq_element = None #Set to `None` without quotes or backticks to have defined automatically\n", "df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "For documenting purposes, the following lists the collected occurences:\n", " seq. element start_pos end_pos strand\n", "0 TTCATACTTTTTATTAA... 2705 2754 1\n", "The information on the 1 occurrences of 'TTCATACTTTTTATTAATATAATATAATATAATATTATTAATACTTTCT' has\n", "been saved as a file named 'seq_account_TTCATACTTTTTATTAA-_chrmt.tsv'.\n", "\n", "Returning a dataframe with the information as well." ] } ], "source": [ "source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n", "element = \"TTCATACTTTTTATTAATATAATATAATATAATATTATTAATACTTTCT\"\n", "id_of_seq_element = None #Set to `None` without quotes or backticks to have defined automatically\n", "df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
seq. elementstart_posend_posstrand
0TTCATACTTTTTATTAA...270527541
\n", "
" ], "text/plain": [ " seq. element start_pos end_pos strand\n", "0 TTCATACTTTTTATTAA... 2705 2754 1" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "For documenting purposes, the following lists the collected occurences:\n", " seq. element start_pos end_pos strand\n", "0 TTCATACTTTTTATTAA 2705 2722 1\n", "The information on the 1 occurrences of 'TTCATACTTTTTATTAA' has\n", "been saved as a file named 'seq_account_TTCATACTTTTTATTAA_chrmt.tsv'.\n", "\n", "Returning a dataframe with the information as well." ] } ], "source": [ "source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n", "element = \"TTCATACTTTTTATTAA\"\n", "id_of_seq_element = None #Set to `None` without quotes or backticks to have defined automatically\n", "df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
seq. elementstart_posend_posstrand
0TTCATACTTTTTATTAA270527221
\n", "
" ], "text/plain": [ " seq. element start_pos end_pos strand\n", "0 TTCATACTTTTTATTAA 2705 2722 1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For loading the script into a cell from Github, see the 'Skipping pasting by loading into a cell' section of [here](https://github.com/fomightez/sequencework/blob/master/circos-utilities/demo%20UCSC_chrom_sizes_2_circos_karyotype%20script.ipynb) to understand the paradigm and then call the core function similar to the cells just above this cell." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "\n", "## Running the script with import of the main function\n", "\n", "\n", "This is similar to the last section,['Running core function of the script after loading into a cell'](#Running-core-function-of-the-script-after-loading-into-a-cell), but here we take advantage of Python's import statement to do what we did by pasting or loading code into a cell and running it.\n", "\n", "First insure the script is available where you are running. Running the next command will do that here." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 18500 100 18500 0 0 240k 0 --:--:-- --:--:-- --:--:-- 240k\n" ] } ], "source": [ "!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/FindSequence/find_sequence_element_occurrences_in_sequence.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then import the main function of the script to the notebook's active computational environment via an import statement. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from find_sequence_element_occurrences_in_sequence import find_sequence_element_occurrences_in_sequence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(As written above the command to do that looks a bit redundant; however, the first `from` part of the command below actually is referencing the `find_sequence_element_occurrences_in_sequence` script, but it doesn't need the `.py` extension because the `import` only deals with such files.)\n", "\n", "With the main function imported, it is now available to be run. As with the above method to run the script, the variables involved need to be defined before calling the function." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "For documenting purposes, the following lists the collected occurences:\n", " seq. element start_pos end_pos strand\n", "0 EcoRI 16724 16730 1\n", "1 EcoRI 17671 17677 1\n", "2 EcoRI 17760 17766 1\n", "3 EcoRI 26412 26418 1\n", "4 EcoRI 26639 26645 1\n", "5 EcoRI 29183 29189 1\n", "6 EcoRI 40987 40993 1\n", "7 EcoRI 44706 44712 1\n", "8 EcoRI 80961 80967 1\n", "9 EcoRI 81838 81844 1\n", "10 EcoRI 81844 81838 -1\n", "11 EcoRI 80967 80961 -1\n", "12 EcoRI 44712 44706 -1\n", "13 EcoRI 40993 40987 -1\n", "14 EcoRI 29189 29183 -1\n", "15 EcoRI 26645 26639 -1\n", "16 EcoRI 26418 26412 -1\n", "17 EcoRI 17766 17760 -1\n", "18 EcoRI 17677 17671 -1\n", "19 EcoRI 16730 16724 -1\n", "The information on the 20 occurrences of 'GAATTC' has\n", "been saved as a file named 'seq_account_EcoRI_chrmt.tsv'.\n", "\n", "Returning a dataframe with the information as well." ] } ], "source": [ "source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n", "element = \"GAATTC\"\n", "id_of_seq_element = \"EcoRI\" #Set to `None` without quotes or backticks to have defined automatically\n", "df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Complex use case example: process multiple searches for sequence elements" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 9726 100 9726 0 0 94427 0 --:--:-- --:--:-- --:--:-- 94427\n" ] } ], "source": [ "!curl -O https://static-content.springer.com/esm/art%3A10.1186%2Fs12864-015-1664-4/MediaObjects/12864_2015_1664_MOESM9_ESM.xlsx" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: xlrd in /srv/conda/envs/notebook/lib/python3.7/site-packages (2.0.1)\n" ] } ], "source": [ "!pip install xlrd" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ClassConsensus
0M1TTCCGGGGCCCGGCCACGGGAGCCGGAACCCCGAAAGGAG
1M1'TTCCCGCTTCGCGGGAACCCCGTAAGGAG
2M2TCCGGCCCGCCCCGCGGGGCGGACCCCGAAGGAG
3M2'TCCCCGCCCCGGCGGGGACCCCGAAGGAG
4M2''TCCGGCCGAAGGAG
5M3TGAGGGACCCCTCCCTATACTATGGGAGGGGGACCGAACCCTTTAA...
6M4TGAACACCTTTATTTAATTATAAAGGTGTGAACCAATCCGCAAGGCAAG
7GCCCGGTTTCTTACGAAACCGGGACCTCGGAGACGT
8VCCCCGCGGGCGCCAATCCGGTTGTTCACCGGATTGGTCCCGCGGGG
\n", "
" ], "text/plain": [ " Class Consensus\n", "0 M1 TTCCGGGGCCCGGCCACGGGAGCCGGAACCCCGAAAGGAG\n", "1 M1' TTCCCGCTTCGCGGGAACCCCGTAAGGAG\n", "2 M2 TCCGGCCCGCCCCGCGGGGCGGACCCCGAAGGAG\n", "3 M2' TCCCCGCCCCGGCGGGGACCCCGAAGGAG\n", "4 M2'' TCCGGCCGAAGGAG\n", "5 M3 TGAGGGACCCCTCCCTATACTATGGGAGGGGGACCGAACCCTTTAA...\n", "6 M4 TGAACACCTTTATTTAATTATAAAGGTGTGAACCAATCCGCAAGGCAAG\n", "7 G CCCGGTTTCTTACGAAACCGGGACCTCGGAGACGT\n", "8 V CCCCGCGGGCGCCAATCCGGTTGTTCACCGGATTGGTCCCGCGGGG" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "GC_df = pd.read_excel('12864_2015_1664_MOESM9_ESM.xlsx')\n", "GC_df.rename(columns={\"Consensus Sequence 5' -> 3'\":'Consensus'}, inplace=True) # I was having difficulty using bracket notation and so this makes easier\n", "GC_df" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "For documenting purposes, the following lists the collected occurences:\n", " seq. element start_pos end_pos strand\n", "0 M1 2097 2137 1\n", "1 M1 3062 3102 1\n", "2 M1 8238 8278 1\n", "3 M1 9034 9074 1\n", "4 M1 11049 11089 1\n", "5 M1 11508 11548 1\n", "6 M1 34289 34329 1\n", "7 M1 34937 34977 1\n", "8 M1 46317 46357 1\n", "9 M1 48104 48144 1\n", "10 M1 50113 50153 1\n", "11 M1 53829 53869 1\n", "12 M1 62864 62904 1\n", "13 M1 66829 66869 1\n", "14 M1 69446 69486 1\n", "15 M1 76597 76637 1\n", "16 M1 80505 80545 1\n", "17 M1 81406 81446 1\n", "18 M1 83056 83096 1\n", "19 M1 75661 75621 -1\n", "20 M1 71254 71214 -1\n", "21 M1 57070 57030 -1\n", "22 M1 52372 52332 -1\n", "23 M1 47515 47475 -1\n", "24 M1 9530 9490 -1\n", "25 M1 2792 2752 -1\n", "26 M1 125 85 -1\n", "The information on the 27 occurrences of 'TTCCGGGGCCCGGCCACGGGAGCCGGAACCCCGAAAGGAG' has\n", "been saved as a file named 'seq_account_M1_chrmt.tsv'.\n", "\n", "Returning a dataframe with the information as well.\n", "No occurrences of 'TTCCCGCTTCGCGGGAACCCCGTAAGGAG' found in the provided sequence.\n", "\n", "No data to return in a dataframe and so returning `None`.\n", "For documenting purposes, the following lists the collected occurences:\n", " seq. element start_pos end_pos strand\n", "0 M2 65101 65067 -1\n", "The information on the 1 occurrences of 'TCCGGCCCGCCCCGCGGGGCGGACCCCGAAGGAG' has\n", "been saved as a file named 'seq_account_M2_chrmt.tsv'.\n", "\n", "Returning a dataframe with the information as well.\n", "For documenting purposes, the following lists the collected occurences:\n", " seq. element start_pos end_pos strand\n", "0 M2' 2855 2884 1\n", "1 M2' 34171 34200 1\n", "2 M2' 73206 73235 1\n", "3 M2' 68689 68660 -1\n", "4 M2' 56701 56672 -1\n", "5 M2' 4183 4154 -1\n", "The information on the 6 occurrences of 'TCCCCGCCCCGGCGGGGACCCCGAAGGAG' has\n", "been saved as a file named 'seq_account_M2'_chrmt.tsv'.\n", "\n", "Returning a dataframe with the information as well.\n", "For documenting purposes, the following lists the collected occurences:\n", " seq. element start_pos end_pos strand\n", "0 M2'' 57860 57874 1\n", "1 M2'' 84762 84748 -1\n", "2 M2'' 72792 72778 -1\n", "3 M2'' 65067 65053 -1\n", "4 M2'' 8668 8654 -1\n", "The information on the 5 occurrences of 'TCCGGCCGAAGGAG' has\n", "been saved as a file named 'seq_account_M2''_chrmt.tsv'.\n", "\n", "Returning a dataframe with the information as well.\n", "No occurrences of 'TGAGGGACCCCTCCCTATACTATGGGAGGGGGACCGAACCCTTTAAAGAAGAG' found in the provided sequence.\n", "\n", "No data to return in a dataframe and so returning `None`.\n", "No occurrences of 'TGAACACCTTTATTTAATTATAAAGGTGTGAACCAATCCGCAAGGCAAG' found in the provided sequence.\n", "\n", "No data to return in a dataframe and so returning `None`.\n", "For documenting purposes, the following lists the collected occurences:\n", " seq. element start_pos end_pos strand\n", "0 G 30185 30220 1\n", "1 G 44891 44926 1\n", "2 G 12816 12781 -1\n", "3 G 4348 4313 -1\n", "The information on the 4 occurrences of 'CCCGGTTTCTTACGAAACCGGGACCTCGGAGACGT' has\n", "been saved as a file named 'seq_account_G_chrmt.tsv'.\n", "\n", "Returning a dataframe with the information as well.\n", "For documenting purposes, the following lists the collected occurences:\n", " seq. element start_pos end_pos strand\n", "0 V 47719 47765 1\n", "1 V 49089 49135 1\n", "The information on the 2 occurrences of 'CCCCGCGGGCGCCAATCCGGTTGTTCACCGGATTGGTCCCGCGGGG' has\n", "been saved as a file named 'seq_account_V_chrmt.tsv'.\n", "\n", "Returning a dataframe with the information as well." ] } ], "source": [ "for row in GC_df.itertuples():\n", " #print(row) # for debugging\n", " source = \"https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa\"\n", " element = row.Consensus\n", " id_of_seq_element = row.Class\n", " df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)\n", " #print(df)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }