{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Enumerating k-mers lexicographically, from [Rosalind.info](https://www.rosalind.info)\n", "\n", "(Specific exercise at: http://rosalind.info/problems/lexf/)\n", "\n", "This exercise sounds very much like one I've done earlier: \n", "Enumerating gene orders (PERM).\n", "\n", "It seems like the only differences are:\n", "\n", " - the other exercise used numbers, this one uses _letters_\n", " - the other exercise required the number of possible combinations with the answer, this exercise wants _only the combinations_\n", " \n", "So now I should be able to do basically the same thing as in [PERM](PERM-Enumerating_gene_orders.ipynb)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import itertools #this library is going to do all the heavy lifting" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def read_alphabet_and_length(input_file):\n", " \"\"\"\n", "Given a path to a text file with space-separated letters on the first line,\n", "and a number on the second line,\n", "return the 'alphabet' as list and the number as 'length'.\n", " \"\"\"\n", " first_line = True #use a trick to separate the first and second line\n", " \n", " with open(input_file, \"r\") as read_file:\n", " for line in read_file:\n", " if first_line:\n", " alphabet = line.split()\n", " first_line = False\n", " else:\n", " length = int(line.strip())\n", " \n", " return alphabet, length" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Example alphabet: ['A', 'C', 'G', 'T']\n", "Example length: 2\n" ] } ], "source": [ "example_file = \"data/Example_enumerating_k-mers_lexicographically.txt\"\n", "\n", "(example_alphabet, example_length) = read_alphabet_and_length(example_file)\n", "\n", "print(\"Example alphabet: %s\\nExample length: %i\" %(\n", "example_alphabet, example_length)\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alright, that works. Now let's see if I can uses the `itertools.permutations()` function to make some combinations." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('A', 'C'), ('A', 'G'), ('A', 'T'), ('C', 'A'), ('C', 'G'), ('C', 'T'), ('G', 'A'), ('G', 'C'), ('G', 'T'), ('T', 'A'), ('T', 'C'), ('T', 'G')]\n" ] } ], "source": [ "print(list(itertools.permutations(example_alphabet, example_length)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Right, that seems fine. Now I need to convert this list of tuples into separate lines of strings..." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AC\n", "AG\n", "AT\n", "CA\n", "CG\n", "CT\n", "GA\n", "GC\n", "GT\n", "TA\n", "TC\n", "TG\n" ] } ], "source": [ "for letter_combination in list(itertools.permutations(example_alphabet, example_length)):\n", " print(\"\".join(list(letter_combination)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And with that I'm practically done, right?\n", "\n", "Hold on..., this exercise is again not entirely clear about how the output should be ordered...\n", "\n", "First it says the provided 'alphabet' will be ordered (which implies it may be any hypothetical order). \n", "Then it says the output should be ordered alphabetically?\n", "\n", "I will for now just go with whatever is provided and see how that works out..." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ABCD\n", "ABDC\n", "ACBD\n", "ACDB\n", "ADBC\n", "ADCB\n", "BACD\n", "BADC\n", "BCAD\n", "BCDA\n", "BDAC\n", "BDCA\n", "CABD\n", "CADB\n", "CBAD\n", "CBDA\n", "CDAB\n", "CDBA\n", "DABC\n", "DACB\n", "DBAC\n", "DBCA\n", "DCAB\n", "DCBA\n" ] } ], "source": [ "test_file = \"data/rosalind_lexf.txt\"\n", "\n", "(test_alphabet, test_length) = read_alphabet_and_length(test_file)\n", "\n", "for string in list(itertools.permutations(test_alphabet, test_length)):\n", " print(\"\".join(list(string)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hm... So this was wrong.\n", "\n", "Let's check what the input was:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test alphabet: ['A', 'B', 'C', 'D']\n", "Test length: 4\n" ] } ], "source": [ "print(\"Test alphabet: %s\\nTest length: %i\" %(\n", "test_alphabet, test_length)\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That looks right... What may have gone wrong then?\n", "\n", "Oh yes, I see now that my example went wrong too. The answer I got is too short. \n", "I miss all the repeats! Now how can I include those, too...?" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AAAA\n", "AAAB\n", "AAAC\n", "AAAD\n", "AABB\n", "AABC\n", "AABD\n", "AACC\n", "AACD\n", "AADD\n", "ABBB\n", "ABBC\n", "ABBD\n", "ABCC\n", "ABCD\n", "ABDD\n", "ACCC\n", "ACCD\n", "ACDD\n", "ADDD\n", "BBBB\n", "BBBC\n", "BBBD\n", "BBCC\n", "BBCD\n", "BBDD\n", "BCCC\n", "BCCD\n", "BCDD\n", "BDDD\n", "CCCC\n", "CCCD\n", "CCDD\n", "CDDD\n", "DDDD\n" ] } ], "source": [ "for string in list(itertools.combinations_with_replacement(test_alphabet, test_length)):\n", " print(\"\".join(list(string)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So I used the wrong function, this one looks more like it!\n", "\n", "Let's try the exercise again." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AA\n", "AB\n", "AC\n", "AD\n", "AE\n", "AF\n", "AG\n", "AH\n", "AI\n", "AJ\n", "BB\n", "BC\n", "BD\n", "BE\n", "BF\n", "BG\n", "BH\n", "BI\n", "BJ\n", "CC\n", "CD\n", "CE\n", "CF\n", "CG\n", "CH\n", "CI\n", "CJ\n", "DD\n", "DE\n", "DF\n", "DG\n", "DH\n", "DI\n", "DJ\n", "EE\n", "EF\n", "EG\n", "EH\n", "EI\n", "EJ\n", "FF\n", "FG\n", "FH\n", "FI\n", "FJ\n", "GG\n", "GH\n", "GI\n", "GJ\n", "HH\n", "HI\n", "HJ\n", "II\n", "IJ\n", "JJ\n" ] } ], "source": [ "second_test_file = \"data/rosalind_lexf2.txt\"\n", "\n", "(second_test_alphabet, second_test_length) = read_alphabet_and_length(second_test_file)\n", "\n", "for string in list(itertools.combinations_with_replacement(\n", " second_test_alphabet, second_test_length)):\n", " print(\"\".join(list(string)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh. This is still wrong.\n", "\n", "I thought this would be easy, but it turns out a bit more complicated..." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Second test alphabet: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']\n", "Second test length: 2\n" ] } ], "source": [ "print(\"Second test alphabet: %s\\nSecond test length: %i\" %(\n", "second_test_alphabet, second_test_length)\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh, I see... When the 'A' has been used, it is no longer used in combinations with 'B', and so on.\n", "\n", "So I need even more repeats than this function does by default.\n", "\n", "(Check [this page](https://docs.python.org/3/library/itertools.html#itertools.combinations_with_replacement) for more info on the functions I tried so far.)\n", "\n", "I will get back to this exercise and try it again later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Update: it looks like I need the `product()` function from itertools! (Also see: https://docs.python.org/2/library/itertools.html#itertools.product)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AA\n", "AB\n", "AC\n", "AD\n", "AE\n", "AF\n", "AG\n", "AH\n", "AI\n", "AJ\n", "BA\n", "BB\n", "BC\n", "BD\n", "BE\n", "BF\n", "BG\n", "BH\n", "BI\n", "BJ\n", "CA\n", "CB\n", "CC\n", "CD\n", "CE\n", "CF\n", "CG\n", "CH\n", "CI\n", "CJ\n", "DA\n", "DB\n", "DC\n", "DD\n", "DE\n", "DF\n", "DG\n", "DH\n", "DI\n", "DJ\n", "EA\n", "EB\n", "EC\n", "ED\n", "EE\n", "EF\n", "EG\n", "EH\n", "EI\n", "EJ\n", "FA\n", "FB\n", "FC\n", "FD\n", "FE\n", "FF\n", "FG\n", "FH\n", "FI\n", "FJ\n", "GA\n", "GB\n", "GC\n", "GD\n", "GE\n", "GF\n", "GG\n", "GH\n", "GI\n", "GJ\n", "HA\n", "HB\n", "HC\n", "HD\n", "HE\n", "HF\n", "HG\n", "HH\n", "HI\n", "HJ\n", "IA\n", "IB\n", "IC\n", "ID\n", "IE\n", "IF\n", "IG\n", "IH\n", "II\n", "IJ\n", "JA\n", "JB\n", "JC\n", "JD\n", "JE\n", "JF\n", "JG\n", "JH\n", "JI\n", "JJ\n" ] } ], "source": [ "for string in list(itertools.product(\n", " second_test_alphabet, repeat=second_test_length)):\n", " print(\"\".join(list(string)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So now that I think I know how to do it, let's try it the third time:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AAAA\n", "AAAB\n", "AAAC\n", "AAAD\n", "AAAE\n", "AABA\n", "AABB\n", "AABC\n", "AABD\n", "AABE\n", "AACA\n", "AACB\n", "AACC\n", "AACD\n", "AACE\n", "AADA\n", "AADB\n", "AADC\n", "AADD\n", "AADE\n", "AAEA\n", "AAEB\n", "AAEC\n", "AAED\n", "AAEE\n", "ABAA\n", "ABAB\n", "ABAC\n", "ABAD\n", "ABAE\n", "ABBA\n", "ABBB\n", "ABBC\n", "ABBD\n", "ABBE\n", "ABCA\n", "ABCB\n", "ABCC\n", "ABCD\n", "ABCE\n", "ABDA\n", "ABDB\n", "ABDC\n", "ABDD\n", "ABDE\n", "ABEA\n", "ABEB\n", "ABEC\n", "ABED\n", "ABEE\n", "ACAA\n", "ACAB\n", "ACAC\n", "ACAD\n", "ACAE\n", "ACBA\n", "ACBB\n", "ACBC\n", "ACBD\n", "ACBE\n", "ACCA\n", "ACCB\n", "ACCC\n", "ACCD\n", "ACCE\n", "ACDA\n", "ACDB\n", "ACDC\n", "ACDD\n", "ACDE\n", "ACEA\n", "ACEB\n", "ACEC\n", "ACED\n", "ACEE\n", "ADAA\n", "ADAB\n", "ADAC\n", "ADAD\n", "ADAE\n", "ADBA\n", "ADBB\n", "ADBC\n", "ADBD\n", "ADBE\n", "ADCA\n", "ADCB\n", "ADCC\n", "ADCD\n", "ADCE\n", "ADDA\n", "ADDB\n", "ADDC\n", "ADDD\n", "ADDE\n", "ADEA\n", "ADEB\n", "ADEC\n", "ADED\n", "ADEE\n", "AEAA\n", "AEAB\n", "AEAC\n", "AEAD\n", "AEAE\n", "AEBA\n", "AEBB\n", "AEBC\n", "AEBD\n", "AEBE\n", "AECA\n", "AECB\n", "AECC\n", "AECD\n", "AECE\n", "AEDA\n", "AEDB\n", "AEDC\n", "AEDD\n", "AEDE\n", "AEEA\n", "AEEB\n", "AEEC\n", "AEED\n", "AEEE\n", "BAAA\n", "BAAB\n", "BAAC\n", "BAAD\n", "BAAE\n", "BABA\n", "BABB\n", "BABC\n", "BABD\n", "BABE\n", "BACA\n", "BACB\n", "BACC\n", "BACD\n", "BACE\n", "BADA\n", "BADB\n", "BADC\n", "BADD\n", "BADE\n", "BAEA\n", "BAEB\n", "BAEC\n", "BAED\n", "BAEE\n", "BBAA\n", "BBAB\n", "BBAC\n", "BBAD\n", "BBAE\n", "BBBA\n", "BBBB\n", "BBBC\n", "BBBD\n", "BBBE\n", "BBCA\n", "BBCB\n", "BBCC\n", "BBCD\n", "BBCE\n", "BBDA\n", "BBDB\n", "BBDC\n", "BBDD\n", "BBDE\n", "BBEA\n", "BBEB\n", "BBEC\n", "BBED\n", "BBEE\n", "BCAA\n", "BCAB\n", "BCAC\n", "BCAD\n", "BCAE\n", "BCBA\n", "BCBB\n", "BCBC\n", "BCBD\n", "BCBE\n", "BCCA\n", "BCCB\n", "BCCC\n", "BCCD\n", "BCCE\n", "BCDA\n", "BCDB\n", "BCDC\n", "BCDD\n", "BCDE\n", "BCEA\n", "BCEB\n", "BCEC\n", "BCED\n", "BCEE\n", "BDAA\n", "BDAB\n", "BDAC\n", "BDAD\n", "BDAE\n", "BDBA\n", "BDBB\n", "BDBC\n", "BDBD\n", "BDBE\n", "BDCA\n", "BDCB\n", "BDCC\n", "BDCD\n", "BDCE\n", "BDDA\n", "BDDB\n", "BDDC\n", "BDDD\n", "BDDE\n", "BDEA\n", "BDEB\n", "BDEC\n", "BDED\n", "BDEE\n", "BEAA\n", "BEAB\n", "BEAC\n", "BEAD\n", "BEAE\n", "BEBA\n", "BEBB\n", "BEBC\n", "BEBD\n", "BEBE\n", "BECA\n", "BECB\n", "BECC\n", "BECD\n", "BECE\n", "BEDA\n", "BEDB\n", "BEDC\n", "BEDD\n", "BEDE\n", "BEEA\n", "BEEB\n", "BEEC\n", "BEED\n", "BEEE\n", "CAAA\n", "CAAB\n", "CAAC\n", "CAAD\n", "CAAE\n", "CABA\n", "CABB\n", "CABC\n", "CABD\n", "CABE\n", "CACA\n", "CACB\n", "CACC\n", "CACD\n", "CACE\n", "CADA\n", "CADB\n", "CADC\n", "CADD\n", "CADE\n", "CAEA\n", "CAEB\n", "CAEC\n", "CAED\n", "CAEE\n", "CBAA\n", "CBAB\n", "CBAC\n", "CBAD\n", "CBAE\n", "CBBA\n", "CBBB\n", "CBBC\n", "CBBD\n", "CBBE\n", "CBCA\n", "CBCB\n", "CBCC\n", "CBCD\n", "CBCE\n", "CBDA\n", "CBDB\n", "CBDC\n", "CBDD\n", "CBDE\n", "CBEA\n", "CBEB\n", "CBEC\n", "CBED\n", "CBEE\n", "CCAA\n", "CCAB\n", "CCAC\n", "CCAD\n", "CCAE\n", "CCBA\n", "CCBB\n", "CCBC\n", "CCBD\n", "CCBE\n", "CCCA\n", "CCCB\n", "CCCC\n", "CCCD\n", "CCCE\n", "CCDA\n", "CCDB\n", "CCDC\n", "CCDD\n", "CCDE\n", "CCEA\n", "CCEB\n", "CCEC\n", "CCED\n", "CCEE\n", "CDAA\n", "CDAB\n", "CDAC\n", "CDAD\n", "CDAE\n", "CDBA\n", "CDBB\n", "CDBC\n", "CDBD\n", "CDBE\n", "CDCA\n", "CDCB\n", "CDCC\n", "CDCD\n", "CDCE\n", "CDDA\n", "CDDB\n", "CDDC\n", "CDDD\n", "CDDE\n", "CDEA\n", "CDEB\n", "CDEC\n", "CDED\n", "CDEE\n", "CEAA\n", "CEAB\n", "CEAC\n", "CEAD\n", "CEAE\n", "CEBA\n", "CEBB\n", "CEBC\n", "CEBD\n", "CEBE\n", "CECA\n", "CECB\n", "CECC\n", "CECD\n", "CECE\n", "CEDA\n", "CEDB\n", "CEDC\n", "CEDD\n", "CEDE\n", "CEEA\n", "CEEB\n", "CEEC\n", "CEED\n", "CEEE\n", "DAAA\n", "DAAB\n", "DAAC\n", "DAAD\n", "DAAE\n", "DABA\n", "DABB\n", "DABC\n", "DABD\n", "DABE\n", "DACA\n", "DACB\n", "DACC\n", "DACD\n", "DACE\n", "DADA\n", "DADB\n", "DADC\n", "DADD\n", "DADE\n", "DAEA\n", "DAEB\n", "DAEC\n", "DAED\n", "DAEE\n", "DBAA\n", "DBAB\n", "DBAC\n", "DBAD\n", "DBAE\n", "DBBA\n", "DBBB\n", "DBBC\n", "DBBD\n", "DBBE\n", "DBCA\n", "DBCB\n", "DBCC\n", "DBCD\n", "DBCE\n", "DBDA\n", "DBDB\n", "DBDC\n", "DBDD\n", "DBDE\n", "DBEA\n", "DBEB\n", "DBEC\n", "DBED\n", "DBEE\n", "DCAA\n", "DCAB\n", "DCAC\n", "DCAD\n", "DCAE\n", "DCBA\n", "DCBB\n", "DCBC\n", "DCBD\n", "DCBE\n", "DCCA\n", "DCCB\n", "DCCC\n", "DCCD\n", "DCCE\n", "DCDA\n", "DCDB\n", "DCDC\n", "DCDD\n", "DCDE\n", "DCEA\n", "DCEB\n", "DCEC\n", "DCED\n", "DCEE\n", "DDAA\n", "DDAB\n", "DDAC\n", "DDAD\n", "DDAE\n", "DDBA\n", "DDBB\n", "DDBC\n", "DDBD\n", "DDBE\n", "DDCA\n", "DDCB\n", "DDCC\n", "DDCD\n", "DDCE\n", "DDDA\n", "DDDB\n", "DDDC\n", "DDDD\n", "DDDE\n", "DDEA\n", "DDEB\n", "DDEC\n", "DDED\n", "DDEE\n", "DEAA\n", "DEAB\n", "DEAC\n", "DEAD\n", "DEAE\n", "DEBA\n", "DEBB\n", "DEBC\n", "DEBD\n", "DEBE\n", "DECA\n", "DECB\n", "DECC\n", "DECD\n", "DECE\n", "DEDA\n", "DEDB\n", "DEDC\n", "DEDD\n", "DEDE\n", "DEEA\n", "DEEB\n", "DEEC\n", "DEED\n", "DEEE\n", "EAAA\n", "EAAB\n", "EAAC\n", "EAAD\n", "EAAE\n", "EABA\n", "EABB\n", "EABC\n", "EABD\n", "EABE\n", "EACA\n", "EACB\n", "EACC\n", "EACD\n", "EACE\n", "EADA\n", "EADB\n", "EADC\n", "EADD\n", "EADE\n", "EAEA\n", "EAEB\n", "EAEC\n", "EAED\n", "EAEE\n", "EBAA\n", "EBAB\n", "EBAC\n", "EBAD\n", "EBAE\n", "EBBA\n", "EBBB\n", "EBBC\n", "EBBD\n", "EBBE\n", "EBCA\n", "EBCB\n", "EBCC\n", "EBCD\n", "EBCE\n", "EBDA\n", "EBDB\n", "EBDC\n", "EBDD\n", "EBDE\n", "EBEA\n", "EBEB\n", "EBEC\n", "EBED\n", "EBEE\n", "ECAA\n", "ECAB\n", "ECAC\n", "ECAD\n", "ECAE\n", "ECBA\n", "ECBB\n", "ECBC\n", "ECBD\n", "ECBE\n", "ECCA\n", "ECCB\n", "ECCC\n", "ECCD\n", "ECCE\n", "ECDA\n", "ECDB\n", "ECDC\n", "ECDD\n", "ECDE\n", "ECEA\n", "ECEB\n", "ECEC\n", "ECED\n", "ECEE\n", "EDAA\n", "EDAB\n", "EDAC\n", "EDAD\n", "EDAE\n", "EDBA\n", "EDBB\n", "EDBC\n", "EDBD\n", "EDBE\n", "EDCA\n", "EDCB\n", "EDCC\n", "EDCD\n", "EDCE\n", "EDDA\n", "EDDB\n", "EDDC\n", "EDDD\n", "EDDE\n", "EDEA\n", "EDEB\n", "EDEC\n", "EDED\n", "EDEE\n", "EEAA\n", "EEAB\n", "EEAC\n", "EEAD\n", "EEAE\n", "EEBA\n", "EEBB\n", "EEBC\n", "EEBD\n", "EEBE\n", "EECA\n", "EECB\n", "EECC\n", "EECD\n", "EECE\n", "EEDA\n", "EEDB\n", "EEDC\n", "EEDD\n", "EEDE\n", "EEEA\n", "EEEB\n", "EEEC\n", "EEED\n", "EEEE\n" ] } ], "source": [ "third_test_file = \"data/rosalind_lexf3.txt\"\n", "\n", "(third_test_alphabet, third_test_length) = read_alphabet_and_length(third_test_file)\n", "\n", "for string in list(itertools.product(\n", " third_test_alphabet, repeat=third_test_length)):\n", " print(\"\".join(list(string)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Success!!\n", "\n", "That did the trick. I was using the wrong function before..." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.1" } }, "nbformat": 4, "nbformat_minor": 4 }