{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# demonstration of `delete_seq_following_pattern_within_multiFASTA.py` script\n",
    "\n",
    "If you'd like an active Jupyter session to run this notebook, launch one by clicking [here](https://mybinder.org/v2/gh/fomightez/clausen_ribonucleotides/master), and then upload this notebook to the session that starts up.  \n",
    "Otherwise, the static version is rendered more nicely via [here](https://nbviewer.jupyter.org/github/fomightez/sequencework/blob/master/AdjustFASTA_or_FASTQ/demo%20delete_seq_following_pattern_within_multiFASTA.ipynb).\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!.</p>\n",
    "\n",
    "<p>\n",
    "    Some tips:\n",
    "    <ul>\n",
    "        <li>Code cells have boxes around them. When you hover over them a <i class=\"fa-step-forward fa\"></i> icon appears.</li>\n",
    "        <li>To run a code cell either click the <i class=\"fa-step-forward fa\"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>\n",
    "        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterix will be replaced with a number.</li>\n",
    "        <li>In most cases you'll want to start from the top of notebook and work your way down running each cell in turn. Later cells might depend on the results of earlier ones.</li>\n",
    "        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>\n",
    "    </ul>\n",
    "</p>\n",
    "</div>\n",
    "\n",
    "You'll need the current version of the script to run this notebook, and the next cell will get that. (Remember if you want to make things more reproducible when you use the script with your own data, you'll want to edit calls such as this to fetch a specific version of the script. How to do this is touched upon in the comment below [here](https://stackoverflow.com/a/48587645/8508004)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100 17560  100 17560    0     0  68862      0 --:--:-- --:--:-- --:--:-- 68862\n"
     ]
    }
   ],
   "source": [
    "!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/AdjustFASTA_or_FASTQ/delete_seq_following_pattern_within_multiFASTA.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Display Usage / Help Block"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usage: delete_seq_following_pattern_within_multiFASTA.py [-h] [-ld]\n",
      "                                                         [-os OUTPUT_SUFFIX]\n",
      "                                                         SEQUENCE_FILE\n",
      "                                                         RECORD_ID PATTERN\n",
      "\n",
      "delete_seq_following_pattern_within_multiFASTA.py takes a sequence pattern\n",
      "string, a sequence file (FASTA-format), and a record id, and deletes any\n",
      "sequence following the sequence pattern. In other words it trims the specified\n",
      "sequence, to make the first match to the pattern the new end. (The FASTA-\n",
      "formatted sequence file is assumed by default to be a multi-FASTA, i.e.,\n",
      "multiple sequences in the provided file, although it definitely doesn't have\n",
      "to be. In case it is only a single sequence, the record id becomes moot, see\n",
      "below.) Nothing will be returned; however a copy of the FASTA sequence file\n",
      "with the truncated sequence specified will be produced. **** Script by Wayne\n",
      "Decatur (fomightez @ github) ***\n",
      "\n",
      "positional arguments:\n",
      "  SEQUENCE_FILE         Name of sequence file to use as input. Must be FASTA\n",
      "                        format. Can be a multi-FASTA file, i.e., multiple\n",
      "                        sequences in FASTA format in one file.\n",
      "  RECORD_ID             Specific identifier of sequence entry in sequence file\n",
      "                        to modify. If the provided sequence file only contains\n",
      "                        one sequence, then that sequence will be altered, and\n",
      "                        whatever is provided for this parameter will be\n",
      "                        ignored. In other words, if the sequence file is not a\n",
      "                        multi-FASTA file, you don't need to determine the\n",
      "                        identifier and can instead just enter `blahblah` or\n",
      "                        any other nonsensical string in this spot.\n",
      "  PATTERN               Sequence or sequence pattern to use to locate site\n",
      "                        after which to delete in the specified sequence.\n",
      "                        Regular expressions are accepted here; however any\n",
      "                        information about case will be ignored as the provided\n",
      "                        sequence pattern and sequence will both be converted\n",
      "                        to lower case to check for a match.\n",
      "\n",
      "optional arguments:\n",
      "  -h, --help            show this help message and exit\n",
      "  -ld, --leave_dashes   Add this flag when calling the script in order to be\n",
      "                        able to use gaps (represented as dashes) in the\n",
      "                        pattern required to match. I.E., for matching with an\n",
      "                        aligned FASTA file. (***ATYPICAL.***)\n",
      "  -os OUTPUT_SUFFIX, --output_suffix OUTPUT_SUFFIX\n",
      "                        OPTIONAL: Set a suffix for including in file name of\n",
      "                        output. If none provided, '_clipped' will be used.\n"
     ]
    }
   ],
   "source": [
    "%run delete_seq_following_pattern_within_multiFASTA.py -h"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To read more about this script beyond that and what is covered below, see [here](https://github.com/fomightez/sequencework/tree/master/AdjustFASTA_or_FASTQ).\n",
    "\n",
    "-----\n",
    "\n",
    "## Basic use examples set #1: Using from the command line (or equivalent / similar)\n",
    "\n",
    "### Preparing for usage example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "#write example FASTA to file\n",
    "s = '''>evoli\n",
    "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
    ">smer\n",
    "atctgaatctgagactatatgagactgatctgatctgctctgaagc\n",
    "'''\n",
    "\n",
    "!echo \"{s}\" > sequence.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run the script"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "python delete_seq_following_pattern_within_multiFASTA.py sequence.fa smer tCtgAGact"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note** that cell above illustrates that the comparison is insensitive to case.\n",
    "\n",
    "\n",
    "In the above cell and elsewhere in this notebook, `%%bash` cell magic is used to send this to the shell to run as if on the command line. \n",
    "\n",
    "You could simply run something like `python delete_seq_following_pattern_within_multiFASTA.py sequence.fa smer tgAtct` if you are working on the command line directly. In fact, the terminal is available from the Jupyter dashboard (or from the JupyterLab launcher) and you can feel free to try running the command below in a terminal in this Jupyter session if you'd like.\n",
    "\n",
    "    python delete_seq_following_pattern_within_multiFASTA.py sequence.fa smer tCt\n",
    "\n",
    "We'll use a shorter way to send the commad to the shell in the next cell. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer CLIPPED\n",
      "atctgaatctgagact\n"
     ]
    }
   ],
   "source": [
    "!head sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(The cell above uses another Jupyter notebook/ IPython trick to send a command to the command line. Namely that anything on a line after an exclamation point `!` will be executed on the system command line. However, using that style I saw no advanced display formatting of the stderr when I tried using the exclamation point, e.g., `!python delete_seq_following_pattern_within_multiFASTA.py smer tCt` vs. using the `%%bash` cell magic. Hence, I used `%%bash` in the demo when calling the script.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although it is good practice to keep original versions of files, if you absolutely need to replace the original file, you can rename the ouput file to replace the original with a command similar to this:\n",
    "\n",
    "    !mv sequence_clipped.fa sequence.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Remember you can dispense with providing an actual record id if there is only one record.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "#write example FASTA-formatted with one sequence to file\n",
    "s = '''>evoli\n",
    "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
    "'''\n",
    "\n",
    "!echo \"{s}\" > single_sequence.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You still have to provide *something* for record identifier, but it can be any string. In the example, below `moot` is used. Completely irrelevant but the 'placeholder' makes the command have all the parts needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Single sequence with id of 'evoli' provided in the sequence file.\n",
      "It will be used to search for the provided sequence pattern\n",
      "and delete the residues after it.\n",
      "\n",
      "2 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'evoli', and saved within a modified version \n",
      "of 'single_sequence.fa' as the output file 'single_sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "python delete_seq_following_pattern_within_multiFASTA.py single_sequence.fa moot tCtgA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are used to using Jupyter notebooks, you can use `%run` instead of `python delete_seq_following_pattern_within_multiFASTA sequence.fa smer tct` to get the same result, as shown in the next call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "5 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_diff_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%run delete_seq_following_pattern_within_multiFASTA.py sequence.fa smer tct --output_suffix _diff_clipped"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer CLIPPED\n",
      "atct\n"
     ]
    }
   ],
   "source": [
    "!head sequence_diff_clipped.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "------\n",
    "\n",
    "## Basic use example set #2: Use the main function via import\n",
    "\n",
    "Very useful for when using this in a Jupyter notebook to build into a pipeline or workflow."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Prepare first by  importing the main function from the script into the notbeook environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from delete_seq_following_pattern_within_multiFASTA import delete_seq_following_pattern_within_multiFASTA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(That call will look redundant; however, it actually means *from the file* `delete_seq_following_pattern_within_multiFASTA.py`  *import the* `delete_seq_following_pattern_within_multiFASTA()` *function*.)\n",
    "\n",
    "Then call that function and provide the needed arguments in the call. The needed arguments are the `sequence file`, `record id` of the specific sequence to search for the pattern within (can be gibberish if there is only one sequence provided inside sequence file), `sequence pattern to search for`, and `number of residues` to get after the sequence.\n",
    "\n",
    "The function will produce a file as output if there is a match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'evoli', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "delete_seq_following_pattern_within_multiFASTA(\"sequence.fa\", \"evoli\", \"GATCTGGGGCGA\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli CLIPPED\n",
      "atctgatctggggcga\n",
      ">smer\n",
      "atctgaatctgagactatatgagactgatctgatctgctctgaagc\n"
     ]
    }
   ],
   "source": [
    "!head sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The equivalent to using the `--output_suffix` option on the command line can also be done when calling the function; however, the syntax is slightly different because the way functions work in Python differs than ways you use things on the command line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "3 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'evoli', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped_fun.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "delete_seq_following_pattern_within_multiFASTA(\"sequence.fa\", \"evoli\", \"GATCT\", suffix_for_saving = \"_clipped_fun\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli CLIPPED\n",
      "atctgatct\n",
      ">smer\n",
      "atctgaatctgagactatatgagactgatctgatctgctctgaagc\n"
     ]
    }
   ],
   "source": [
    "!head sequence_clipped_fun.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Remember you can dispense with providing an actual, real record id if there is only one record.*\n",
    "\n",
    "You just need to supply *something* in that spot as a 'placeholder'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Single sequence with id of 'evoli' provided in the sequence file.\n",
      "It will be used to search for the provided sequence pattern\n",
      "and delete the residues after it.\n",
      "\n",
      "3 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'evoli', and saved within a modified version \n",
      "of 'single_sequence.fa' as the output file 'single_sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "delete_seq_following_pattern_within_multiFASTA(\"single_sequence.fa\", \"evoli\", \"GATCT\", suffix_for_saving = \"_clipped\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli CLIPPED\n",
      "atctgatct\n"
     ]
    }
   ],
   "source": [
    "!head single_sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "## More advanced use examples #1: Use with regular expressions\n",
    "\n",
    "Providing sequence patterns to search for can accomodate regular expression search terms (see [Appendix 2 of Haddock and Dunn's Practical Computing for Biologists](http://practicalcomputing.org/files/PCfB_Appendices.pdf)). However, it can be tricky to input some of the symbols and special characters that regular expression search terms tend to use and get them interpreted exactly as expected. Especially in light of the many ways one can call this script or the associated function in a Jupyter notebook.\n",
    "\n",
    "I illustrate some of the things I found to work here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%run delete_seq_following_pattern_within_multiFASTA.py sequence.fa smer a{{2,}}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer CLIPPED\n",
      "atctgaa\n"
     ]
    }
   ],
   "source": [
    "!head sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That regular expression search term demonstrated above is equivalent to `a{2,}` and searches for two or more matches to `a` in a row (or `A` in row because I make comparison case insensitive beyond input expression). Note that the brackets have to be doubled up to get read in from IPython to ultimately Python as single brackets. (Single brackets got converted to parantheses for some reason.) I worked this out by testing input from command by printing what I had right before search and luckily tried what I had learned from [here](https://stackoverflow.com/a/5466478/8508004) for dealing with brackets and `.format()`.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### When using the function call, it seems no special escaping is needed.\n",
    "\n",
    "**This is probably the best route to use regular expressions.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped_viafun.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "delete_seq_following_pattern_within_multiFASTA(\"sequence.fa\", \"smer\", \"a{2,}\", suffix_for_saving = \"_clipped_viafun\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer CLIPPED\n",
      "atctgaa\n"
     ]
    }
   ],
   "source": [
    "!head sequence_clipped_viafun.fa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer\n",
      "atctgaatctgagactatatgagactgatctgatctgctctgaagc\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!head sequence.fa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "6 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_asterisk_demo.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "delete_seq_following_pattern_within_multiFASTA(\"sequence.fa\", \"smer\", \"atc*\", suffix_for_saving = \"_asterisk_demo\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer CLIPPED\n",
      "atc\n"
     ]
    }
   ],
   "source": [
    "!head sequence_asterisk_demo.fa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'colc', and saved within a modified version \n",
      "of 'sequencewn.fa' as the output file 'sequencewn_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "#write example with blocks of unknown nucleotides in FASTA to file\n",
    "s = '''>smar\n",
    "atNNctgatNNNNNNNNNNNNNNNNNNNNNNNtgatctggtctgtggcg\n",
    ">colc\n",
    "atNNctgaatctgagactatatNNNNNNNNNNNNNNtctgctctgaagc\n",
    "'''\n",
    "\n",
    "!echo \"{s}\" > sequencewn.fa\n",
    "\n",
    "delete_seq_following_pattern_within_multiFASTA(\"sequencewn.fa\", \"colc\", \"N{5,}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">smar\n",
      "atNNctgatNNNNNNNNNNNNNNNNNNNNNNNtgatctggtctgtggcg\n",
      ">colc CLIPPED\n",
      "atNNctgaatctgagactatatNNNNNNNNNNNNNN\n"
     ]
    }
   ],
   "source": [
    "!head sequencewn_clipped.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Despite that method of calling the function with the regular expression provided as an argument being the most direct and easiest way to use them, I can imagine it won't cover all cases, and so I am going to detail my additional findings in this section.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Interestingly, a different approach to escaping the brackets is necessary when using the `%%bash` cell magic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "python delete_seq_following_pattern_within_multiFASTA.py sequence.fa smer a\\{2,\\}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer CLIPPED\n",
      "atctgaa\n"
     ]
    }
   ],
   "source": [
    "!head sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Yet, if you add in quotes you can get away without escaping the brackets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "python delete_seq_following_pattern_within_multiFASTA.py sequence.fa smer \"a{2,}\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer CLIPPED\n",
      "atctgaa\n"
     ]
    }
   ],
   "source": [
    "!head sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell below shows it works when using the exclamation mark way to send commands to shell, too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "!python delete_seq_following_pattern_within_multiFASTA.py sequence.fa smer a{{2,}}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer CLIPPED\n",
      "atctgaa\n"
     ]
    }
   ],
   "source": [
    "!head sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "!python delete_seq_following_pattern_within_multiFASTA.py sequence.fa smer a\\{2,\\}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer CLIPPED\n",
      "atctgaa\n"
     ]
    }
   ],
   "source": [
    "!head sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "Adding quotes around pattern works, too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "!python delete_seq_following_pattern_within_multiFASTA.py sequence.fa smer \"a{{2,}}\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer CLIPPED\n",
      "atctgaa\n"
     ]
    }
   ],
   "source": [
    "!head sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the cells below show, other complex regular expression search terms also work with the `%run` method, and in some cases they produce the same results with or without quotes around the pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "9 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequencedot_expl.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%run delete_seq_following_pattern_within_multiFASTA.py sequence.fa smer ..... --output_suffix dot_expl"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer CLIPPED\n",
      "atctg\n"
     ]
    }
   ],
   "source": [
    "!head sequencedot_expl.fa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "9 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequencedot_explwq.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%run delete_seq_following_pattern_within_multiFASTA.py sequence.fa smer \".....\" --output_suffix dot_explwq"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer CLIPPED\n",
      "atctg\n"
     ]
    }
   ],
   "source": [
    "!head sequencedot_explwq.fa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%run delete_seq_following_pattern_within_multiFASTA.py sequence.fa smer \"a{{2,}}\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer CLIPPED\n",
      "atctgaa\n"
     ]
    }
   ],
   "source": [
    "!head sequence_clipped.fa"
   ]
  },
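  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The doubled braces in `a{{2,}}` are needed because IPython expands `{...}` in `!` and `%run` lines as a Python expression before the command runs, so a literal brace has to be written twice. Below is a minimal sketch of the effect. It uses `str.format` only as an analogy (an assumption: IPython's own expansion is not literally `str.format`, but it follows the same `{{` / `}}` escaping convention), and `example_seq` is a made-up sequence for illustration.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# What is typed after `!` or `%run` in the notebook\n",
    "typed = 'a{{2,}}'\n",
    "\n",
    "# Analogy only: str.format collapses doubled braces the same way\n",
    "received = typed.format()\n",
    "print(received)  # a{2,}\n",
    "\n",
    "# The collapsed pattern is an ordinary regular expression:\n",
    "# two or more consecutive 'a' characters\n",
    "example_seq = 'atctgaatcgaaag'  # hypothetical sequence\n",
    "runs = re.findall(received, example_seq)\n",
    "print(runs)  # ['aa', 'aaa']\n",
    "```\n",
    "\n",
    "The body of a `%%bash` cell is not expanded this way, which is consistent with the single braces used in the `%%bash` examples in this notebook."
   ]
  },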
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An asterisk in the regular expression search term appears to work with the `%run` approach if it is escaped in the same way as in the `%%bash` approach."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "6 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%run delete_seq_following_pattern_within_multiFASTA.py sequence.fa smer \\atc\\*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer CLIPPED\n",
      "atc\n"
     ]
    }
   ],
   "source": [
    "!head sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "6 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "python delete_seq_following_pattern_within_multiFASTA.py sequence.fa smer \\atc\\*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg\n",
      ">smer CLIPPED\n",
      "atc\n"
     ]
    }
   ],
   "source": [
    "!head sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "## More advanced use examples #2: Dealing with gaps\n",
    "\n",
    "The default behaviour of the script is to remove gaps represented by dashes from any sequence pattern provided. The idea is that many use cases will involve searching for sequence patterns that contain gaps because the sequence text was copied from a sequence alignment, and it seems wasteful to make the user clean the patterns ahead of time. Plus, most people will be searching sequences that don't have gaps.\n",
    "\n",
    "However, by adding the `--leave_dashes` option on the command line, or by setting the `filter_dashes` parameter to `False` when calling the main function, the user can override this default behavior and still use the script, for example with a file in aligned FASTA format. The caveat is that the gaps / dashes will then be counted as residues too.\n",
    "\n",
    "*First, show that the current example data goes from working to failing when that setting is added.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'evoli', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%run delete_seq_following_pattern_within_multiFASTA.py sequence.fa evoli GATCTGGG------GCGA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'evoli', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%run delete_seq_following_pattern_within_multiFASTA.py sequence.fa evoli \"GATCTGGG------GCGA\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "*****************DONE**************************\n",
      "***NO MATCHES FOUND. NO CHANGES MADE.*****    **** ERROR?!?!?**\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%run delete_seq_following_pattern_within_multiFASTA.py sequence.fa evoli GATCTGGG------GCGA --leave_dashes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'evoli', and saved within a modified version \n",
      "of 'sequence.fa' as the output file 'sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "delete_seq_following_pattern_within_multiFASTA(\"sequence.fa\", \"evoli\", \"GATCTGGG------GCGA\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "*****************DONE**************************\n",
      "***NO MATCHES FOUND. NO CHANGES MADE.*****    **** ERROR?!?!?**\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "delete_seq_following_pattern_within_multiFASTA(\"sequence.fa\", \"evoli\", \"GATCTGGG------GCGA\", filter_dashes=False)"
   ]
  },
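  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The contrast in the cells above can be sketched with Python's `re` module. This is a minimal illustration, not the script's actual code: the `evoli` sequence is copied from the output shown earlier in this notebook, and both the dash stripping via `str.replace` and the case-insensitive matching are assumptions about how the script behaves.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# evoli sequence as displayed earlier in this notebook (no gaps)\n",
    "evoli = 'atctgatctggggcgaaatgagactgatctgatctggtctgtggcg'\n",
    "pattern = 'GATCTGGG------GCGA'\n",
    "\n",
    "# Default behaviour (assumed): dashes are stripped from the pattern first\n",
    "filtered = pattern.replace('-', '')\n",
    "with_filtering = re.search(filtered, evoli, re.IGNORECASE)\n",
    "\n",
    "# With dashes left in, the pattern requires literal dashes in the sequence\n",
    "without_filtering = re.search(pattern, evoli, re.IGNORECASE)\n",
    "\n",
    "print(filtered)                       # GATCTGGGGCGA\n",
    "print(with_filtering is not None)     # True\n",
    "print(without_filtering is not None)  # False\n",
    "```\n",
    "\n",
    "Because the ungapped `evoli` sequence contains no dashes, the unfiltered pattern cannot match it, which mirrors the `NO MATCHES FOUND` message above."
   ]
  },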
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Next, demonstrate that the setting works on gapped data.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'gapped_sequence.fa' as the output file 'gapped_sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "#write example aligned FASTA file format\n",
    "s = '''>evoli\n",
    "atct--gatctgggggatctggg------gcgactgatctgatctggtctgtggcggcaagcgaaaaacaaa\n",
    ">smer\n",
    "atct--gaatcg----atctggg------gcgaagactgatctgatctgctctgaagc--gcgaaaaaaaaa\n",
    "'''\n",
    "\n",
    "!echo \"{s}\" > gapped_sequence.fa\n",
    "\n",
    "delete_seq_following_pattern_within_multiFASTA(\"gapped_sequence.fa\", \"smer\", \"-{5,}GCGA\", filter_dashes=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atct--gatctgggggatctggg------gcgactgatctgatctggtctgtggcggcaa\n",
      "gcgaaaaacaaa\n",
      ">smer CLIPPED\n",
      "atct--gaatcg----atctggg------gcga\n"
     ]
    }
   ],
   "source": [
    "!head gapped_sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'gapped_sequence.fa' as the output file 'gapped_sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atct--gatctgggggatctggg------gcgactgatctgatctggtctgtggcggcaa\n",
      "gcgaaaaacaaa\n",
      ">smer CLIPPED\n",
      "atct--gaatcg----atctggg------gcga\n"
     ]
    }
   ],
   "source": [
    "%run delete_seq_following_pattern_within_multiFASTA.py gapped_sequence.fa smer \"\\-\\-\\-GCGA\" --leave_dashes\n",
    "!head gapped_sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'gapped_sequence.fa' as the output file 'gapped_sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atct--gatctgggggatctggg------gcgactgatctgatctggtctgtggcggcaa\n",
      "gcgaaaaacaaa\n",
      ">smer CLIPPED\n",
      "atct--gaatcg----atctggg------gcga\n"
     ]
    }
   ],
   "source": [
    "%run delete_seq_following_pattern_within_multiFASTA.py gapped_sequence.fa smer \"\\-{{5,}}GCGA\" --leave_dashes\n",
    "!head gapped_sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
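    "When using the script on the command line, the dashes in the pattern need to be escaped with a backslash, because an argument that begins with a dash is otherwise read as an option. The next two cells demonstrate that."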
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "usage: delete_seq_following_pattern_within_multiFASTA.py [-h] [-ld]\n",
      "                                                         [-os OUTPUT_SUFFIX]\n",
      "                                                         SEQUENCE_FILE\n",
      "                                                         RECORD_ID PATTERN\n",
      "delete_seq_following_pattern_within_multiFASTA.py: error: the following arguments are required: PATTERN\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "python delete_seq_following_pattern_within_multiFASTA.py gapped_sequence.fa smer \"---GCGA\" --leave_dashes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'gapped_sequence.fa' as the output file 'gapped_sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "python delete_seq_following_pattern_within_multiFASTA.py gapped_sequence.fa smer \"\\-\\-\\-GCGA\" --leave_dashes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atct--gatctgggggatctggg------gcgactgatctgatctggtctgtggcggcaa\n",
      "gcgaaaaacaaa\n",
      ">smer CLIPPED\n",
      "atct--gaatcg----atctggg------gcga\n"
     ]
    }
   ],
   "source": [
    "!head gapped_sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'gapped_sequence.fa' as the output file 'gapped_sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "python delete_seq_following_pattern_within_multiFASTA.py gapped_sequence.fa smer \"\\-{5,}GCGA\" --leave_dashes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atct--gatctgggggatctggg------gcgactgatctgatctggtctgtggcggcaa\n",
      "gcgaaaaacaaa\n",
      ">smer CLIPPED\n",
      "atct--gaatcg----atctggg------gcga\n"
     ]
    }
   ],
   "source": [
    "!head gapped_sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thus, as with regular expression search terms in general, the wisest choice is probably to use the `delete_seq_following_pattern_within_multiFASTA()` function when dealing with a complex search pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'gapped_sequence.fa' as the output file 'gapped_sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "delete_seq_following_pattern_within_multiFASTA(\"gapped_sequence.fa\", \"smer\", \"-{5,}GCGA\", filter_dashes=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atct--gatctgggggatctggg------gcgactgatctgatctggtctgtggcggcaa\n",
      "gcgaaaaacaaa\n",
      ">smer CLIPPED\n",
      "atct--gaatcg----atctggg------gcga\n"
     ]
    }
   ],
   "source": [
    "!head gapped_sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'gapped_sequence.fa' as the output file 'gapped_sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "delete_seq_following_pattern_within_multiFASTA(\"gapped_sequence.fa\", \"smer\", \"--GCGA\", filter_dashes=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are two matches for the pattern above, but only the sequence after the first match is deleted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atct--gatctgggggatctggg------gcgactgatctgatctggtctgtggcggcaa\n",
      "gcgaaaaacaaa\n",
      ">smer CLIPPED\n",
      "atct--gaatcg----atctggg------gcga\n"
     ]
    }
   ],
   "source": [
    "!head gapped_sequence_clipped.fa"
   ]
  },
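  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first-match behaviour noted above can be sketched with Python's `re` module. This is a minimal illustration, not the script's actual code: the `smer` sequence is the gapped one written to `gapped_sequence.fa` earlier, while the case-insensitive matching and the slicing at the end of the first match are assumptions chosen to mirror the output shown.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Gapped smer sequence from gapped_sequence.fa\n",
    "smer = 'atct--gaatcg----atctggg------gcgaagactgatctgatctgctctgaagc--gcgaaaaaaaaa'\n",
    "\n",
    "# Non-overlapping matches, scanning left to right\n",
    "matches = list(re.finditer('--GCGA', smer, re.IGNORECASE))\n",
    "print(len(matches))  # 2\n",
    "\n",
    "# Keep everything up to and including the first match\n",
    "clipped = smer[:matches[0].end()]\n",
    "print(clipped)  # atct--gaatcg----atctggg------gcga\n",
    "```\n",
    "\n",
    "The second match, near the end of the sequence, falls in the portion that is discarded."
   ]
  },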
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'gapped_sequence.fa' as the output file 'gapped_sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "delete_seq_following_pattern_within_multiFASTA(\"gapped_sequence.fa\", \"smer\", \"---GCGA\", filter_dashes=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atct--gatctgggggatctggg------gcgactgatctgatctggtctgtggcggcaa\n",
      "gcgaaaaacaaa\n",
      ">smer CLIPPED\n",
      "atct--gaatcg----atctggg------gcga\n"
     ]
    }
   ],
   "source": [
    "!head gapped_sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2 matches to the sequence found in the specified sequence. The sequence\n",
      "that follows the match encountered first has been deleted.\n",
      "\n",
      "*****************DONE**************************\n",
      "Sequence after the match to the provided pattern \n",
      "deleted from 'smer', and saved within a modified version \n",
      "of 'gapped_sequence.fa' as the output file 'gapped_sequence_clipped.fa'.\n",
      "*****************DONE**************************\n"
     ]
    }
   ],
   "source": [
    "delete_seq_following_pattern_within_multiFASTA(\"gapped_sequence.fa\", \"smer\", \"-{2,}GCGA\", filter_dashes=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">evoli\n",
      "atct--gatctgggggatctggg------gcgactgatctgatctggtctgtggcggcaa\n",
      "gcgaaaaacaaa\n",
      ">smer CLIPPED\n",
      "atct--gaatcg----atctggg------gcga\n"
     ]
    }
   ],
   "source": [
    "!head gapped_sequence_clipped.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "----\n",
    "\n",
    "Enjoy!\n",
    "\n",
    "Upload your own sequence files to any running Jupyter session and adapt the commands in this notebook to search within them. Edit the notebook or copy the necessary cells to make the script work with your own data.\n",
    "\n",
    "----\n",
    "### ADVANCED DEVELOPMENT NOTE\n",
    "\n",
    "If you are editing the script (***ATYPICAL***) and importing the main function to test changes here in this Jupyter notebook, you'll need to run the following code after each edit to trigger import of the updated version of the function. Otherwise, without a restart of the kernel, the notebook environment will essentially ignore any call to import the function, because it considers that function already imported."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run this to have new code reflected in the version of the function in memory within the notebook namespace\n",
    "import importlib\n",
    "import delete_seq_following_pattern_within_multiFASTA\n",
    "importlib.reload(delete_seq_following_pattern_within_multiFASTA)\n",
    "from delete_seq_following_pattern_within_multiFASTA import delete_seq_following_pattern_within_multiFASTA\n",
    "# above approach from https://stackoverflow.com/a/11724154/8508004"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}