{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Examples of running the script\n", "\n", "There are several options to run `roughly_score_relationships_to_subject_seq_pairwise_premsa.py` or utilize the core function.\n", "\n", "This notebook will demonstrate a few ways of using this script:\n", "\n", "- [Running as script file](#Running-as-script-file)\n", "- [Running core function of the script after import](#Running-core-function-of-the-script-after-importing)\n", "\n", "Notably, it won't cover running the script after pasting or loading it into a cell. An approach to that is illustrated [here](https://nbviewer.jupyter.org/github/fomightez/sequencework/blob/master/circos-utilities/demo%20UCSC_chrom_sizes_2_circos_karyotype%20script.ipynb) and [here](https://nbviewer.jupyter.org/github/fomightez/sequencework/blob/master/FindSequence/demo%20find_sequence_element_occurrences_in_sequence%20script.ipynb) for different scripts, but those should serve as a good guides combined with what is shown here for using the main function in a notebook with import. There are a copuple of ways to get the script into the cell, namely pasting it in or loading it from github, that are covered there.\n", "\n", "(If you are having any problems at all doing any of this because of Python or needed dependencies, such as Bioython, this notebook was developed in the enviromenment launchable by pressing `Launch binder` badge [here](https://github.com/fomightez/blast-binder). You could always launch that environment and upload this notebook there and things should work.)\n", "\n", "## Running as script file\n", "\n", "Similar to how one would run a script from the command line.\n", "\n", "Upload the script to the directory where you want to run it. Or upload it to a running Jupyter environment.\n", "\n", "(For the sake of this demonstration, I am going to use `curl` to get the file from github and upload it to the 'local' environment. You of course can use whatever download and upload steps you'd like, such as using a browser and your system's graphical user interface, to place the script in the directory. 'local' is in parentheses because if running this in a Jupyter interface via the Binder system, 'local' would be inside the running enviroment.)\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 20938 100 20938 0 0 68649 0 --:--:-- --:--:-- --:--:-- 68875\n" ] } ], "source": [ "!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/roughly_score_relationships_to_subject_seq_pairwise_premsa.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That command would work on the command line without the exclamation point. The use of the exclamation point signals here to not treat it as Python code and instead target the command to the available command line shell. \n", "\n", "**THEN AFTER UPLOADED**... \n", "If running on the command line then you would enter:\n", "\n", "```\n", "python roughly_score_relationships_to_subject_seq_pairwise_premsa.py my_sequences.fa\n", "```\n", "\n", "Or something similar to that depending on your Python environment and source of data.\n", "\n", "\n", "Similarly you can do that in the Jupyter environment using either either `!python` before the script name or using the `%run` magic command. \n", "The `%run` magic command is demonstrated in the next cell. If you are in an active Jupyter environment, to run it click on the next cell and type `shift-enter` or click run on the toolbar above the notebook." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: roughly_score_relationships_to_subject_seq_pairwise_premsa.py\n", " [-h] [-bl BLOCK_LEN] [-dfo DF_OUTPUT] SEQS_FILE\n", "\n", "roughly_score_relationships_to_subject_seq_pairwise_premsa.py Takes a file of\n", "multiple sequences in FASTA format and aligns each of them in turn to the\n", "first sequence. However, if the sequence happens to be moderate- or large-\n", "sized (> 5 kb), by default it only samples part of the sequence due to memory\n", "limitations. It scores the alignments and produces a dataframe ranking the\n", "sequences from most similar to most different relative the first one in the\n", "supplied file. The dataframe is saved as a tabular text file when used on the\n", "command line. Optionally, it can also return that dataframe for use inside a\n", "Jupyter notebook. **** Script by Wayne Decatur (fomightez @ github) ***\n", "\n", "positional arguments:\n", " SEQS_FILE Name of file of sequences (all FASTA-formatted) to\n", " compare to the first in that file .\n", "\n", "optional arguments:\n", " -h, --help show this help message and exit\n", " -bl BLOCK_LEN, --block_len BLOCK_LEN\n", " **FOR ADVANCED USE.*** Allows for setting the sequence\n", " blocks compared for long sequences. The default of\n", " 9141 was worked out for uisng in a session launched\n", " form MyBinder.org 2018. You can free to make larger if\n", " you have more computational resources. For example,\n", " `-bl 15000`; however, the failing condition is just a\n", " silent(hanging) state.\n", " -dfo DF_OUTPUT, --df_output DF_OUTPUT\n", " OPTIONAL: Set file name for saving tabular text (tab-\n", " delimited) derived from the produced dataframe. If\n", " none provided, 'ranked_seqs_df.tsv' will be used. To\n", " force no table to be saved, enter `-dfo no_table`\n", " without quotes or ticks as output file (ATYPICAL).\n" ] } ], "source": [ "%run roughly_score_relationships_to_subject_seq_pairwise_premsa.py --help" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `USAGE` shown as the output from in the above cell due to running with the script with the `--help/-h` flag.\n", "\n", "For the rest of this section we are going to obtain real data and use that to provide actual arguments to call the script as the `USAGE` outlines.\n", "\n", "First, the next call will get some sequence data and concatenate it into one file. The first file in the resulting file is the one the others will be compared against." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 178 100 178 0 0 1745 0 --:--:-- --:--:-- --:--:-- 1745\n", "100 22109 100 22109 0 0 70862 0 --:--:-- --:--:-- --:--:-- 70862\n", " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 178 100 178 0 0 1745 0 --:--:-- --:--:-- --:--:-- 1745\n", "100 20363 100 20363 0 0 83114 0 --:--:-- --:--:-- --:--:-- 83114\n", " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 178 100 178 0 0 2342 0 --:--:-- --:--:-- --:--:-- 2342\n", "100 20829 100 20829 0 0 89012 0 --:--:-- --:--:-- --:--:-- 89012\n", " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 178 100 178 0 0 2342 0 --:--:-- --:--:-- --:--:-- 2342\n", "100 21373 100 21373 0 0 98493 0 --:--:-- --:--:-- --:--:-- 98493\n", " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 178 100 178 0 0 2342 0 --:--:-- --:--:-- --:--:-- 2342\n", "100 21190 100 21190 0 0 96757 0 --:--:-- --:--:-- --:--:-- 96757\n", " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 178 100 178 0 0 1390 0 --:--:-- --:--:-- --:--:-- 1390\n", "100 19885 100 19885 0 0 73376 0 --:--:-- --:--:-- --:--:-- 404k\n", " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 178 100 178 0 0 1745 0 --:--:-- --:--:-- --:--:-- 1745\n", "100 18809 100 18809 0 0 72621 0 --:--:-- --:--:-- --:--:-- 72621\n", " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 178 100 178 0 0 2342 0 --:--:-- --:--:-- --:--:-- 2342\n", "100 18692 100 18692 0 0 62306 0 --:--:-- --:--:-- --:--:-- 62306\n", " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 178 100 178 0 0 1328 0 --:--:-- --:--:-- --:--:-- 1328\n", "100 18399 100 18399 0 0 63226 0 --:--:-- --:--:-- --:--:-- 63226\n", " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 178 100 178 0 0 3016 0 --:--:-- --:--:-- --:--:-- 3016\n", "100 18509 100 18509 0 0 69322 0 --:--:-- --:--:-- --:--:-- 69322\n", " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 178 100 178 0 0 1745 0 --:--:-- --:--:-- --:--:-- 1745\n", "100 18886 100 18886 0 0 77720 0 --:--:-- --:--:-- --:--:-- 77720\n", " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 178 100 178 0 0 2342 0 --:--:-- --:--:-- --:--:-- 2342\n", "100 17877 100 17877 0 0 62947 0 --:--:-- --:--:-- --:--:-- 62947\n" ] } ], "source": [ "#make a list of the strain designations\n", "yue_et_al_strains = [\"S288C\",\"DBVPG6044\",\"DBVPG6765\",\"SK1\",\"Y12\",\n", " \"YPS128\",\"UWOPS034614\",\"CBS432\",\"N44\",\"YPS138\",\n", " \"UFRJ50816\",\"UWOPS919171\"]\n", "# Get set of sequences and edit description line to be cleaner\n", "for s in yue_et_al_strains:\n", " !curl -LO http://yjx1217.github.io/Yeast_PacBio_2016/data/Mitochondrial_Genome/{s}.mt.genome.fa.gz\n", " !gunzip -f {s}.mt.genome.fa.gz\n", " !sed -i \"1s/.*/>{s}/\" {s}.mt.genome.fa\n", "# concatenate the FASTA-formatted sequences into one FASTA file\n", "seq_files = [s+\".mt.genome.fa\" for s in yue_et_al_strains]\n", "!cat {\" \".join(seq_files)} > seq_files.fa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With some data retrieved, you are ready to run the script. Running the next cell will compare each sequence after the first one in the file in turn to the first sequence and score the similarity." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Sequences read in...\n", "Longest sequence in input detected as 85793.\n", "...calculating scores of pairwise alignments...\n", "Anticipated memory issues with long sequence and\n", "so only block of 9141 bps from the start compared.\n", "...\n", "Super long sequence detected: Comparing a 2nd block back from the\n", "sequence 'end' as well and combining scores.\n", "...summarizing scores...\n", "Results converted to a dataframe...\n", "\n", "A table of the data has been saved as a text file (tab-delimited).\n", "DATA is stored as ==> 'ranked_seqs_df.tsv'" ] } ], "source": [ "%run roughly_score_relationships_to_subject_seq_pairwise_premsa.py seq_files.fa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When run from the command line, which is essentially what the `%run` command does here, the output is saved as a file with the results as tabular text automatically.\n", "Open `ranked_seqs_df.tsv` or what you opted to name it and review the results. Those with the highest score, being most similar, will be towards the top.\n", "\n", "(Note that `DBVPG676` comes out as more different from `S228C` than expected here. This probably is just due to a sampling issue as the 'start' in these sequences happens to be the most different part as shown in the Dot Matrix View plot generated from BLAST comparision of these two sequences. When I have been careful to adjust the sequence so the 'start' sites are what the SGD reference uses, it comes out as the most similar to `S288C`. This actually highlights the fact that not having the memory to handle sampling all the sequence in large sequences can end up being misleading and to only consider this as a quick-an-dirty approach to get information about which sequences are similar and which ones are more different. Follow this up by an multiple sequence alignment ASAP.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "\n", "## Running the script with import of the main function\n", "\n", "\n", "This is similar to ['Running core function of the script after loading into a cell'](#https://nbviewer.jupyter.org/github/fomightez/sequencework/blob/master/circos-utilities/demo%20UCSC_chrom_sizes_2_circos_karyotype%20script.ipynb#Running-core-function-of-the-script-after-loading-into-a-cell), but here we take advantage of Python's import statement to do what was previously handled by pasting or loading code into a cell and running it.\n", "\n", "First insure the script is available where you are running. Running the next command will do that here." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 20938 100 20938 0 0 61946 0 --:--:-- --:--:-- --:--:-- 61764\n" ] } ], "source": [ "!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/roughly_score_relationships_to_subject_seq_pairwise_premsa.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then import the main function of the script to the notebook's active computational environment via an import statement. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from roughly_score_relationships_to_subject_seq_pairwise_premsa import roughly_score_relationships_to_subject_seq_pairwise_premsa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(As written above the command to do that looks a bit redundant; however, the first `from` part of the command below actually is referencing the `find_sequence_element_occurrences_in_sequence` script, but it doesn't need the `.py` extension because the `import` only deals with such files.)\n", "\n", "With the main function imported, it is now available to be run. We'll use the data retrieved above. (Run the third cell in the botebook if you haven't already before running the next cell.)\n", "\n", "This time, by default, though the results will also be returned as a dataframe that can be viewed directly in the notebook and subsequently utilized." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Sequences read in...\n", "Longest sequence in input detected as 85793.\n", "...calculating scores of pairwise alignments...\n", "Anticipated memory issues with long sequence and\n", "so only block of 9141 bps from the start compared.\n", "...\n", "Super long sequence detected: Comparing a 2nd block back from the\n", "sequence 'end' as well and combining scores.\n", "...summarizing scores...\n", "Results converted to a dataframe...\n", "\n", "A table of the data has been saved as a text file (tab-delimited).\n", "DATA is stored as ==> 'ranked_seqs_df.tsv'\n", "\n", "Returning a dataframe with the information as well." ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idscore_vs_S288C
3Y1214129.5
4YPS12812898.5
2SK112345.0
5UWOPS03461412197.5
6CBS43210283.0
0DBVPG60449973.0
10UWOPS9191719835.0
9UFRJ508169771.0
8YPS1388994.5
7N448834.5
1DBVPG67658527.0
\n", "
" ], "text/plain": [ " id score_vs_S288C\n", "3 Y12 14129.5\n", "4 YPS128 12898.5\n", "2 SK1 12345.0\n", "5 UWOPS034614 12197.5\n", "6 CBS432 10283.0\n", "0 DBVPG6044 9973.0\n", "10 UWOPS919171 9835.0\n", "9 UFRJ50816 9771.0\n", "8 YPS138 8994.5\n", "7 N44 8834.5\n", "1 DBVPG6765 8527.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = roughly_score_relationships_to_subject_seq_pairwise_premsa(\"seq_files.fa\")\n", "df" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
score_vs_S288C
count11.000000
mean10708.045455
std1869.949872
min8527.000000
25%9382.750000
50%9973.000000
75%12271.250000
max14129.500000
\n", "
" ], "text/plain": [ " score_vs_S288C\n", "count 11.000000\n", "mean 10708.045455\n", "std 1869.949872\n", "min 8527.000000\n", "25% 9382.750000\n", "50% 9973.000000\n", "75% 12271.250000\n", "max 14129.500000" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Troubleshooting\n", "\n", "I have found on limited computational resources that this script will just stop working silently and hang. I have made an effort to set this to work on Jupyter sessions launched from MyBinder.org. For example, I suggest going [here](https://github.com/fomightez/blast-binder), pressing `launch binder` and uploading this notebook there to run it actively. It will most likely not hang unless they have lowered the limits or things are overly busy.\n", "\n", "If you find it isn't working, I suggest you insure it is the memory issue by lowering the `block_len` (a.k.a `length of the sequence block to examine`) option to something tiny. The data it produces won't be useful but it should tell you if that is why your script is working or not. \n", "\n", "How to do that depends if you are using it in an Jupyter notebook cell or at what is equivalent to the command line. This section will demonstrate the two approaches.\n", "\n", "First on the command line, you can set the `block_len` like so:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Sequences read in...\n", "Longest sequence in input detected as 85793.\n", "...calculating scores of pairwise alignments...\n", "Anticipated memory issues with long sequence and\n", "so only block of 200 bps from the start compared.\n", "...\n", "Super long sequence detected: Comparing a 2nd block back from the\n", "sequence 'end' as well and combining scores.\n", "...summarizing scores...\n", "Results converted to a dataframe...\n", "\n", "A table of the data has been saved as a text file (tab-delimited).\n", "DATA is stored as ==> 'ranked_seqs_df.tsv'" ] } ], "source": [ "%run roughly_score_relationships_to_subject_seq_pairwise_premsa.py seq_files.fa --block_len 200" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That should run instantly and generate a file of results.\n", "\n", "Here is how to set the `block_len` when working inside the notebook." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Sequences read in...\n", "Longest sequence in input detected as 85793.\n", "...calculating scores of pairwise alignments...\n", "Anticipated memory issues with long sequence and\n", "so only block of 200 bps from the start compared.\n", "...\n", "Super long sequence detected: Comparing a 2nd block back from the\n", "sequence 'end' as well and combining scores.\n", "...summarizing scores...\n", "Results converted to a dataframe...\n", "\n", "A table of the data has been saved as a text file (tab-delimited).\n", "DATA is stored as ==> 'ranked_seqs_df.tsv'\n", "\n", "Returning a dataframe with the information as well." ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idscore_vs_S288C
4YPS128360.5
5UWOPS034614356.0
3Y12351.5
0DBVPG6044334.5
2SK1334.5
8YPS138324.0
9UFRJ50816324.0
10UWOPS919171316.0
1DBVPG6765315.0
6CBS432313.0
7N44310.0
\n", "
" ], "text/plain": [ " id score_vs_S288C\n", "4 YPS128 360.5\n", "5 UWOPS034614 356.0\n", "3 Y12 351.5\n", "0 DBVPG6044 334.5\n", "2 SK1 334.5\n", "8 YPS138 324.0\n", "9 UFRJ50816 324.0\n", "10 UWOPS919171 316.0\n", "1 DBVPG6765 315.0\n", "6 CBS432 313.0\n", "7 N44 310.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from roughly_score_relationships_to_subject_seq_pairwise_premsa import roughly_score_relationships_to_subject_seq_pairwise_premsa\n", "df = roughly_score_relationships_to_subject_seq_pairwise_premsa(\"seq_files.fa\", block_len=200)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember, the data this produces won't be informative. It is just to make sure something more fundamental isn't causing you a problem. If you get it working with a tiny value then you can attempt to use more reasonable sizes." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }