{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Overlap Graphs, from [Rosalind.info](https://www.rosalind.info)\n", "\n", "(Text copied from http://rosalind.info/problems/grph/)\n", "\n", ">**Problem**\n", ">\n", ">A graph whose nodes have all been labeled can be represented by an adjacency list, in which each row of the list contains the two node labels corresponding to a unique edge.\n", ">\n", ">A directed graph (or digraph) is a graph containing directed edges, each of which has an orientation. That is, a directed edge is represented by an arrow instead of a line segment; the starting and ending nodes of an edge form its tail and head, respectively. The directed edge with tail v\n", "and head w is represented by (v,w) (but not by (w,v)). A directed loop is a directed edge of the form (v,v).\n", ">\n", ">For a collection of strings and a positive integer k, the overlap graph for the strings is a directed graph Ok in which each string is represented by a node, and string s is connected to string t with a directed edge when there is a length k suffix of s that matches a length k prefix of t, as long as s≠t; we demand s≠t to prevent directed loops in the overlap graph (although directed cycles may be present).\n", ">\n", ">**Given**: A collection of DNA strings in FASTA format having total length at most 10 kbp.\n", ">\n", ">**Return**: The adjacency list corresponding to O3. You may return edges in any order.\n", ">\n", ">\n", ">**Sample Dataset**\n", "```\n", ">Rosalind_0498\n", "AAATAAA\n", ">Rosalind_2391\n", "AAATTTT\n", ">Rosalind_2323\n", "TTTTCCC\n", ">Rosalind_0442\n", "AAATCCC\n", ">Rosalind_5013\n", "GGGTGGG\n", "```\n", ">**Sample Output**\n", "```\n", "Rosalind_0498 Rosalind_2391\n", "Rosalind_0498 Rosalind_0442\n", "Rosalind_2391 Rosalind_2323\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## My interpretation/reasoning\n", "\n", "1. I am expected to make a _directed_ graph of DNA sequences.\n", "\n", "2. Because it is directed, there is a clear order, e.g. the _end_ (suffix) of sequence 1 should match the _beginning_ (prefix) of sequence 2.\n", "\n", "3. From the example, it looks like you have to check: for each sequence, look which other sequences have a beginning that matches the end of the current sequence.\n", "\n", "4. The length of the overlap is defined as parameter _k_. In the example k = 3. (Is this also true for the actual exercise?)\n", "\n", "Now practically, this means the code is to:\n", "\n", " - Open and read a fasta file\n", " - For each sequence, take the name and suffix, then...\n", " - match that suffix to each other sequence's prefix, and...\n", " - if it matches, print the sequence name from the suffix and the matched prefix\n", " - continue until no sequences are left." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see how that translates to code:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from Bio import SeqIO" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def look_for_overlaps(sequence_file, k=3, testing=True):\n", " \"\"\"\n", "Function that reads a sequence file to look for sequence overlaps in suffixes\n", "and prefixes of sequences. Returns pairs of sequence IDs for which the suffix\n", "of the former matches the prefix of the latter.\n", "(Also see Rosalind.info exercise http://rosalind.info/problems/grph/)\n", " \"\"\"\n", " #Start by saving all sequences in a dictionary.\n", " #This work easier than re-reading the file for each sequence.\n", " sequence_dict = {}\n", "\n", " for record in SeqIO.parse(sequence_file, \"fasta\"):\n", " sequence_dict[record.id] = record.seq\n", " \n", " for key, value in sequence_dict.items():\n", " #For each item in the dictionary\n", " # the key is the name\n", " sequence_name = key\n", " # and the sequence suffix is the last k characters from\n", " # 'value' = the sequence.\n", " suffix = value[-k:]\n", " \n", " if testing:\n", " #enable debugging prints\n", " print(\"Now looking for overlaps with suffix %s of sequence %s...\" \n", " % (suffix, sequence_name))\n", " else:\n", " pass\n", " \n", " for key, value in sequence_dict.items():\n", " #For each sequence, read the complete list of sequences\n", " other_name = key\n", " prefix = value[:k]\n", " \n", " if testing:\n", " print(\"Now checking sequence %s with prefix %s...\" %\n", " (other_name, prefix))\n", " \n", " if other_name == sequence_name:\n", " #No need to check for matches with itself, so pass\n", " pass\n", " else:\n", " #Check for overlap\n", " if prefix == suffix:\n", " #There is a match!\n", " if testing:\n", " print(\"There seems to be a match: %s (suffix of %s) = %s (prefix of %s)\"\n", " % (suffix, sequence_name, prefix, other_name))\n", " else:\n", " print(sequence_name, other_name)\n", " else:\n", " #if there is no overlap, do nothing\n", " pass\n", " \n", " return(None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's run this function with the test dataset:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Now looking for overlaps with suffix AAA of sequence Rosalind_0498...\n", "Now checking sequence Rosalind_0498 with prefix AAA...\n", "Now checking sequence Rosalind_2391 with prefix AAA...\n", "There seems to be a match: AAA (suffix of Rosalind_0498) = AAA (prefix of Rosalind_2391)\n", "Now checking sequence Rosalind_2323 with prefix TTT...\n", "Now checking sequence Rosalind_0442 with prefix AAA...\n", "There seems to be a match: AAA (suffix of Rosalind_0498) = AAA (prefix of Rosalind_0442)\n", "Now checking sequence Rosalind_5013 with prefix GGG...\n", "Now looking for overlaps with suffix TTT of sequence Rosalind_2391...\n", "Now checking sequence Rosalind_0498 with prefix AAA...\n", "Now checking sequence Rosalind_2391 with prefix AAA...\n", "Now checking sequence Rosalind_2323 with prefix TTT...\n", "There seems to be a match: TTT (suffix of Rosalind_2391) = TTT (prefix of Rosalind_2323)\n", "Now checking sequence Rosalind_0442 with prefix AAA...\n", "Now checking sequence Rosalind_5013 with prefix GGG...\n", "Now looking for overlaps with suffix CCC of sequence Rosalind_2323...\n", "Now checking sequence Rosalind_0498 with prefix AAA...\n", "Now checking sequence Rosalind_2391 with prefix AAA...\n", "Now checking sequence Rosalind_2323 with prefix TTT...\n", "Now checking sequence Rosalind_0442 with prefix AAA...\n", "Now checking sequence Rosalind_5013 with prefix GGG...\n", "Now looking for overlaps with suffix CCC of sequence Rosalind_0442...\n", "Now checking sequence Rosalind_0498 with prefix AAA...\n", "Now checking sequence Rosalind_2391 with prefix AAA...\n", "Now checking sequence Rosalind_2323 with prefix TTT...\n", "Now checking sequence Rosalind_0442 with prefix AAA...\n", "Now checking sequence Rosalind_5013 with prefix GGG...\n", "Now looking for overlaps with suffix GGG of sequence Rosalind_5013...\n", "Now checking sequence Rosalind_0498 with prefix AAA...\n", "Now checking sequence Rosalind_2391 with prefix AAA...\n", "Now checking sequence Rosalind_2323 with prefix TTT...\n", "Now checking sequence Rosalind_0442 with prefix AAA...\n", "Now checking sequence Rosalind_5013 with prefix GGG...\n" ] } ], "source": [ "look_for_overlaps(\"data/Example_overlap_graphs.fasta\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And how does that look when not using debugging prints?" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rosalind_0498 Rosalind_2391\n", "Rosalind_0498 Rosalind_0442\n", "Rosalind_2391 Rosalind_2323\n" ] } ], "source": [ "look_for_overlaps(\"data/Example_overlap_graphs.fasta\", k=3, testing=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That looks fine! So now the code seems to work. At least with the test dataset... Let's try what happens with the real dataset:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rosalind_1040 Rosalind_5677\n", "Rosalind_1040 Rosalind_4711\n", "Rosalind_6532 Rosalind_5400\n", "Rosalind_1652 Rosalind_5442\n", "Rosalind_6440 Rosalind_5677\n", "Rosalind_6440 Rosalind_4711\n", "Rosalind_5771 Rosalind_8550\n", "Rosalind_9673 Rosalind_7260\n", "Rosalind_9673 Rosalind_0232\n", "Rosalind_9673 Rosalind_3971\n", "Rosalind_1214 Rosalind_8526\n", "Rosalind_5697 Rosalind_3085\n", "Rosalind_5697 Rosalind_2525\n", "Rosalind_8572 Rosalind_1040\n", "Rosalind_8572 Rosalind_1449\n", "Rosalind_9157 Rosalind_9564\n", "Rosalind_9157 Rosalind_4016\n", "Rosalind_3810 Rosalind_9673\n", "Rosalind_3810 Rosalind_4526\n", "Rosalind_8526 Rosalind_1652\n", "Rosalind_8526 Rosalind_5697\n", "Rosalind_4110 Rosalind_9009\n", "Rosalind_2297 Rosalind_1040\n", "Rosalind_2297 Rosalind_1449\n", "Rosalind_9564 Rosalind_3810\n", "Rosalind_9564 Rosalind_8991\n", "Rosalind_7260 Rosalind_9158\n", "Rosalind_7260 Rosalind_6075\n", "Rosalind_7296 Rosalind_7051\n", "Rosalind_7296 Rosalind_7177\n", "Rosalind_2027 Rosalind_8830\n", "Rosalind_2027 Rosalind_4607\n", "Rosalind_2027 Rosalind_2021\n", "Rosalind_3309 Rosalind_1214\n", "Rosalind_3309 Rosalind_2553\n", "Rosalind_1449 Rosalind_4402\n", "Rosalind_2940 Rosalind_8550\n", "Rosalind_5750 Rosalind_5572\n", "Rosalind_7245 Rosalind_1040\n", "Rosalind_7245 Rosalind_1449\n", "Rosalind_4189 Rosalind_5409\n", "Rosalind_3446 Rosalind_7245\n", "Rosalind_1778 Rosalind_3694\n", "Rosalind_8323 Rosalind_3053\n", "Rosalind_8830 Rosalind_5400\n", "Rosalind_2068 Rosalind_1778\n", "Rosalind_2068 Rosalind_3694\n", "Rosalind_0825 Rosalind_3832\n", "Rosalind_0825 Rosalind_9385\n", "Rosalind_6234 Rosalind_9564\n", "Rosalind_6234 Rosalind_4016\n", "Rosalind_5677 Rosalind_7051\n", "Rosalind_5677 Rosalind_7177\n", "Rosalind_4711 Rosalind_8572\n", "Rosalind_4711 Rosalind_8323\n", "Rosalind_4711 Rosalind_0944\n", "Rosalind_7610 Rosalind_1040\n", "Rosalind_7610 Rosalind_1449\n", "Rosalind_7361 Rosalind_4242\n", "Rosalind_4607 Rosalind_0659\n", "Rosalind_4607 Rosalind_6143\n", "Rosalind_9024 Rosalind_2297\n", "Rosalind_9024 Rosalind_7296\n", "Rosalind_9024 Rosalind_0290\n", "Rosalind_1141 Rosalind_5572\n", "Rosalind_0025 Rosalind_4110\n", "Rosalind_0025 Rosalind_6234\n", "Rosalind_0025 Rosalind_4770\n", "Rosalind_0232 Rosalind_5677\n", "Rosalind_0232 Rosalind_4711\n", "Rosalind_3832 Rosalind_6775\n", "Rosalind_3832 Rosalind_7789\n", "Rosalind_9385 Rosalind_3810\n", "Rosalind_9385 Rosalind_8991\n", "Rosalind_0659 Rosalind_4790\n", "Rosalind_6283 Rosalind_1040\n", "Rosalind_6283 Rosalind_1449\n", "Rosalind_6167 Rosalind_2027\n", "Rosalind_6167 Rosalind_8830\n", "Rosalind_6167 Rosalind_4607\n", "Rosalind_6167 Rosalind_2021\n", "Rosalind_6155 Rosalind_2068\n", "Rosalind_6155 Rosalind_2196\n", "Rosalind_0944 Rosalind_5400\n", "Rosalind_4759 Rosalind_3309\n", "Rosalind_4759 Rosalind_8174\n", "Rosalind_4662 Rosalind_4402\n", "Rosalind_8004 Rosalind_9395\n", "Rosalind_8004 Rosalind_5157\n", "Rosalind_2021 Rosalind_9673\n", "Rosalind_2021 Rosalind_4526\n", "Rosalind_7051 Rosalind_3810\n", "Rosalind_7051 Rosalind_8991\n", "Rosalind_4402 Rosalind_3832\n", "Rosalind_4402 Rosalind_9385\n", "Rosalind_5442 Rosalind_8526\n", "Rosalind_6775 Rosalind_9395\n", "Rosalind_6775 Rosalind_5157\n", "Rosalind_2196 Rosalind_7260\n", "Rosalind_2196 Rosalind_0232\n", "Rosalind_2196 Rosalind_3971\n", "Rosalind_9395 Rosalind_5409\n", "Rosalind_3085 Rosalind_0659\n", "Rosalind_3085 Rosalind_6143\n", "Rosalind_4526 Rosalind_8550\n", "Rosalind_6384 Rosalind_4790\n", "Rosalind_4242 Rosalind_0507\n", "Rosalind_4242 Rosalind_9328\n", "Rosalind_2553 Rosalind_0507\n", "Rosalind_2553 Rosalind_9328\n", "Rosalind_5123 Rosalind_7610\n", "Rosalind_9328 Rosalind_5677\n", "Rosalind_9328 Rosalind_4711\n", "Rosalind_0290 Rosalind_0507\n", "Rosalind_0290 Rosalind_9328\n", "Rosalind_2678 Rosalind_1487\n", "Rosalind_8174 Rosalind_5400\n", "Rosalind_8550 Rosalind_3832\n", "Rosalind_8550 Rosalind_9385\n", "Rosalind_5157 Rosalind_3053\n", "Rosalind_2525 Rosalind_4625\n", "Rosalind_1487 Rosalind_5400\n", "Rosalind_7715 Rosalind_4759\n", "Rosalind_7715 Rosalind_8004\n", "Rosalind_7886 Rosalind_1778\n", "Rosalind_7886 Rosalind_3694\n", "Rosalind_8991 Rosalind_8526\n", "Rosalind_6143 Rosalind_4402\n", "Rosalind_7789 Rosalind_9564\n", "Rosalind_7789 Rosalind_4016\n", "Rosalind_5572 Rosalind_1214\n", "Rosalind_5572 Rosalind_2553\n", "Rosalind_0412 Rosalind_9024\n", "Rosalind_0412 Rosalind_0042\n", "Rosalind_4016 Rosalind_7610\n", "Rosalind_5409 Rosalind_6440\n", "Rosalind_3694 Rosalind_8572\n", "Rosalind_3694 Rosalind_8323\n", "Rosalind_3694 Rosalind_0944\n", "Rosalind_3971 Rosalind_5677\n", "Rosalind_3971 Rosalind_4711\n", "Rosalind_4790 Rosalind_3810\n", "Rosalind_4790 Rosalind_8991\n", "Rosalind_0405 Rosalind_4110\n", "Rosalind_0405 Rosalind_6234\n", "Rosalind_0405 Rosalind_4770\n", "Rosalind_5400 Rosalind_0507\n", "Rosalind_5400 Rosalind_9328\n", "Rosalind_6075 Rosalind_2297\n", "Rosalind_6075 Rosalind_7296\n", "Rosalind_6075 Rosalind_0290\n", "Rosalind_1674 Rosalind_3810\n", "Rosalind_1674 Rosalind_8991\n" ] } ], "source": [ "look_for_overlaps(\"data/rosalind_grph.txt\", k=3, testing=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Success!!\n", "\n", "I solved the problem. Now let's save this notebook and commit it to git." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.1" } }, "nbformat": 4, "nbformat_minor": 4 }