{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

cs1001.py , Tel Aviv University, Spring 2018/19

\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recitation 10\n", "\n", "We discussed iterators, generators, and the Karp-Rabin algorithm for string matching " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Takeaways:\n", "- A generator function is a function that contains the yield command and returns a genertor object. \n", "\n", "\n", "- The Karp-Rabin algorithm is a probabilistic string matching algorithm (find all occurences of a string of length $m$ in a string of length $n$) running in linear time $O(m+n)$ (as opposed to the naive solution which has running time $O(nm)$).\n", "- Make sure you read our KR summary.\n", "- Make sure you understand the way the algorithm works, and in particular the \"rolling hash mechanism\", that is, how to compute the fingerprint of the next substring in $O(1)$ time, given the fingerprint of the current substring.\n", "- Make sure you understand the \"arithmetization\" step used by the algorithm." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code for printing several outputs in one cell (not part of the recitation):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from IPython.core.interactiveshell import InteractiveShell\n", "InteractiveShell.ast_node_interactivity = \"all\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Iterators and Generators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Iterators" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "str_iterator" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "l = \"123\" #[1,2,3]\n", "li = iter(l)\n", "type(li)\n", "li2 = iter(l)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "next(li)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "next(li)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "z is 3\n" ] } ], "source": [ "z = next(li)\n", "print(\"z is\", z)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "ename": "StopIteration", "evalue": "", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mStopIteration\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mli\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mStopIteration\u001b[0m: " ] } ], "source": [ "next(li)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1'" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "next(li2)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n", "3\n" ] } ], "source": [ "for elem in li2:\n", " print(elem)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generators" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "def countdown_gen():\n", " ''' calling it creates an iterator for the values 5,4,3,2,1,'launch' '''\n", " yield 5\n", " yield 4\n", " yield 3\n", " yield 2\n", " yield 1\n", " yield 'launch'" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cd = countdown_gen()\n", "cd #generator object (private case of iterator)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "next(cd) " ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "next(cd) " ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3\n", "2\n", "1\n", "launch\n" ] } ], "source": [ "for e in cd:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With generators we can have infinite iterations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def countdown_infinite():\n", " i = 0\n", " while True:\n", " yield i\n", " i += 1\n", " \n", "[i for i in range(10)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Solving this exam question about generators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Section (b)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(1, 0),\n", " (2, 0),\n", " (2, 1),\n", " (3, 0),\n", " (3, 1),\n", " (3, 2),\n", " (4, 0),\n", " (4, 1),\n", " (4, 2),\n", " (4, 3)]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def SomePairs():\n", " i=1\n", " while True:\n", " for j in range(i):\n", " yield(i,j)\n", " i=i+1\n", "\n", "gen = SomePairs()\n", "[next(gen) for _ in range(10)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Section (c)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0, 1),\n", " (0, 2),\n", " (1, 2),\n", " (0, 3),\n", " (1, 3),\n", " (2, 3),\n", " (0, 4),\n", " (1, 4),\n", " (2, 4),\n", " (3, 4)]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def RevGen(PairsGen):\n", " pairs = PairsGen()\n", " while True:\n", " pair = next(pairs)\n", " yield(pair[1],pair[0])\n", "\n", "gen = RevGen(SomePairs)\n", "[next(gen) for _ in range(10)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Section (d1)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1, 0)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(0, 1)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(2, 0)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(0, 2)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(2, 1)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(1, 2)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(3, 0)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(0, 3)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(3, 1)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(1, 3)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def UnionGenerators(gen1, gen2):\n", " while True:\n", " yield next(gen1)\n", " yield next(gen2)\n", " \n", "ug = UnionGenerators(SomePairs(), RevGen(SomePairs))\n", "\n", "for _ in range(10):\n", " next(ug)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Section (d2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def EqPairs():\n", " i=0\n", " while True:\n", " yield (i,i)\n", " i=i+1\n", " \n", "def AllPairs():\n", " return UnionGenerators(SomePairs(), UnionGenerators(EqPairs(), RevGen(SomePairs)))\n", "\n", "gen = AllPairs()\n", "[next(gen) for _ in range(10)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The string-matching problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given a string $text$ of length $n$, and a short(er) string $pattern$ of length $m$ ($m\\leq n$), report all occurrances of $pattern$ in $text$.\n", "\n", "Example:\n", "\n", "$text = $\"abracadabra\", $pattern = $\"abr\"\n", "\n", "The requested output should be $[0,7]$, since $pattern$ appears in $text$ in indices $0,7$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Karp-Rabin Algorithm\n", "\n", "- The algorithm works as follows:\n", " - An initial \"fingerprint\" is computed for the pattern string and the first substring of length $m$ in the text\n", " - If both fingerprints are equal, we assume a match\n", " - Then, a \"rolling hash\" mechanism computes the next fingerprint in $O(1)$ time given the current fingerprint in the text\n", " - At each stage again a comparison is made and equal fingerprints are considered a match" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "def is_prime(N, show_witness=False):\n", " \"\"\" probabilistic test for N's compositeness \"\"\"\n", " for i in range(0,100):\n", " a = random.randint(1,N-1) # a is a random integer in [1..N-1]\n", " if pow(a,N-1,N) != 1:\n", " if show_witness: # caller wishes to see a witness\n", " print(N,\"is composite\",\"\\n\",a,\"is a witness, i=\",i+1)\n", " return False\n", " return True\n", "\n", "\n", "def find_prime(n):\n", " \"\"\" find random n-bit long prime \"\"\"\n", " while(True):\n", " candidate = random.randrange(2**(n-1),2**n)\n", " if is_prime(candidate):\n", " return candidate\n", "\n", "\n", "def fingerprint(string, basis, r):\n", " \"\"\" used to computes karp-rabin fingerprint of the pattern\n", " and of the first substring in the text\n", " employs Horner method (modulo r) \"\"\"\n", " s = 0\n", " for ch in string:\n", " s = (s*basis + ord(ch)) % r # Horner\n", " return s\n", "\n", "\n", "\n", "\n", "def substring_fingerprints(string, m, basis, r):\n", " \"\"\" return a list of all fingerprints of size m windows in string \"\"\"\n", " fp_list = []\n", " b_power = pow(basis, m-1, r)\n", "\n", " # compute first fingerprint\n", " fp_list.append(fingerprint(string[0:m], basis, r))\n", "\n", " # compute f_list[s], based on f_list[s-1]\n", " for s in range(1,len(string)-m+1): # O(n-m-1), each iteration O(1)\n", " next_fingerprint = \\\n", " ((fp_list[s-1] - ord(string[s-1])*b_power) * basis \\\n", " + ord(string[s+m-1])) % r\n", " fp_list.append(next_fingerprint)\n", "\n", " return fp_list\n", "\n", "def find_matches_KR(text, pattern, r=None):\n", " \"\"\" find all shifts of occurances of pattern in text \"\"\"\n", " if len(pattern) > len(text):\n", " return [] #no matches\n", "\n", " basis = 2**16 # assume 16 bits are enough for all chars\n", " if r == None: \n", " r = find_prime(32) # randomly pick a 32 bit long prime using \n", " # good old find_prime from an earlier lecture\n", "\n", " p = fingerprint(pattern, basis, r)\n", " fp_list = substring_fingerprints(text, len(pattern), basis, r)\n", "\n", " matches = [shift for shift in range(len(fp_list)) if fp_list[shift] == p]\n", "\n", " return matches" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "text = \"abracadabra\"\n", "pattern = \"abr\"" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2149219973" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r = find_prime(32)\n", "base = 2**16\n", "\n", "r" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1818795565" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fingerprint(\"abr\", base, r)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "416618250354" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "1818795565" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base = 2**16\n", "arit = ord(\"a\")*(base**2) + ord(\"b\")*(base**1) + ord(\"r\")*(base**0)\n", "arit\n", "fp = arit%r\n", "fp" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1818795565,\n", " 1816371474,\n", " 1759694964,\n", " 1818861084,\n", " 1811784715,\n", " 1818926620,\n", " 1808312063,\n", " 1818795565,\n", " 1816371474]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "substring_fingerprints(text, 3, base, r)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 7]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "find_matches_KR(text, pattern, r)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Safe version\n", "Makes sure no false positives occur. In the worst case, when all $n-m$ possible substrings are indeed matches, behaves as the naive solution in terms of time complexity." ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "def find_matches_KR_safe(text, pattern, r=None):\n", " \"\"\" a safe version of KR\n", " checks every suspect for a match \"\"\"\n", " if len(pattern) > len(text):\n", " return [] #no matches\n", "\n", " basis = 2**16 # assume 16 bits are enough for all chars\n", " if r == None: \n", " r = find_prime(32) # randomly pick a 32 bit long prime using \n", " # good old find_prime from an earlier lecture\n", "\n", " p = fingerprint(pattern, basis, r)\n", " fp_list = substring_fingerprints(text, len(pattern), basis, r)\n", "\n", " matches = [shift for shift in range(len(fp_list)) \\\n", " if fp_list[shift] == p and \\\n", " text[shift:shift+len(pattern)]==pattern]\n", "\n", " return matches\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Competition between versions on single char string.\n", "This is the worst-case scenario for the safe version.\n", "Changing $m$ has a greater effect on the safe version than on the standard KR." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text = 'a'* 1000000\n", "\n", "pattern = 'a'* 100\n", "find_matches_KR 1.4179207529959967\n", "find_matches_KR_safe 1.6589124150050338\n", "\n", "pattern = 'a'* 1000\n", "find_matches_KR 1.3420768009964377\n", "find_matches_KR_safe 1.775463817990385\n", "\n", "pattern = 'a'* 10000\n", "find_matches_KR 1.2947049719950883\n", "find_matches_KR_safe 2.638186254000175\n", "\n", "pattern = 'a'* 100000\n", "find_matches_KR 1.4077826730062952\n", "find_matches_KR_safe 13.829727393997018\n", "\n" ] } ], "source": [ "import time\n", "\n", "text = \"a\"*1000000\n", "print(\"text = 'a'*\",len(text))\n", "print()\n", "for pattern in [\"a\"*100, \"a\"*1000, \"a\"*10000, \"a\"*100000]:\n", " print(\"pattern = 'a'*\",len(pattern))\n", " r = find_prime(32)\n", " for f in [find_matches_KR, find_matches_KR_safe]:\n", " t0=time.perf_counter()\n", " res=f(text, pattern, r)\n", " t1=time.perf_counter()\n", " print (f.__name__, t1-t0 )\n", " print(\"\") #newline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Competition between versions on random strings. \n", "\n", "Note that the standard and safe versions of KR has similar running times. Moreover, as $m$ increases, the running time slightly decreases since there are less candidates to consider." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "random str of len n= 1000000 , random pattern of length m= 1000\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR 1.4785060709982645\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR_safe 1.3623063810082385\n", "random str of len n= 1000000 , random pattern of length m= 10000\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR 1.383506228987244\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR_safe 1.3038542309950572\n", "random str of len n= 1000000 , random pattern of length m= 100000\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR 1.25419370700547\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR_safe 1.6594200499966973\n" ] } ], "source": [ "import random\n", "def gen_str(size):\n", " ''' Generate a random lowercase English string of length size'''\n", " s=\"\"\n", " for i in range(size):\n", " s+=random.choice(\"abcdefghijklmnopqrstuvwxyz\")\n", " return s\n", "\n", "\n", "n=1000000\n", "m=1000\n", "text = gen_str(n)\n", "pattern = gen_str(m)\n", "print(\"random str of len n=\", n, \" , random pattern of length m=\",m)\n", "r = find_prime(32)\n", "for f in [find_matches_KR, find_matches_KR_safe]:\n", " t0=time.perf_counter()\n", " f(text, pattern, r)\n", " t1=time.perf_counter()\n", " print (f.__name__, t1-t0)\n", " \n", "\n", "m=10000\n", "pattern = gen_str(m)\n", "print(\"random str of len n=\", n, \" , random pattern of length m=\",m)\n", "r = find_prime(32)\n", "for f in [find_matches_KR, find_matches_KR_safe]:\n", " t0=time.perf_counter()\n", " f(text, pattern, r)\n", " t1=time.perf_counter()\n", " print (f.__name__, t1-t0)\n", " \n", "m=100000\n", "pattern = gen_str(m)\n", "print(\"random str of len n=\", n, \" , random pattern of length m=\",m)\n", "r = find_prime(32)\n", "for f in [find_matches_KR, find_matches_KR_safe]:\n", " t0=time.perf_counter()\n", " f(text, pattern, r)\n", " t1=time.perf_counter()\n", " print (f.__name__, t1-t0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Choice of $r$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By setting $r$ to be a power of the base we will obtain more false-positives. Specifically, for $r=base$, we will get many more false positives. Why is that?\n", "\n", "Note that the computation of the fingerprint is of the form $fp = (x \\cdot base + y) \\mod r$ for some $x,y$ (where $y$ is the arithmetization of the last character in the current substring). \n", "\n", "In the case where $base = r$ this is equivalent to $(x \\cdot r + y) \\mod r = y \\mod r$, i.e., the fingerprint takes into account only the last character of the substring!\n", "\n", "Clearly, choosing $r$ which is a multiple of $base$ is not a good idea. This may also serve as an intuition for why we choose $r$ to be prime - say for example we know that all the fingerprints we compute are even, then if our $r$ is a multiple of $2$, we will always have $fp \\mod r$ an even number, and thus we will only utilize half of our range!" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2, 4, 6, 9]" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[2, 4, 6, 9]" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[6]" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Many false positives\n", "find_matches_KR(\"abracadabra\", \"da\", r=2**16)\n", "\n", "# Same result here since the \"x\" will be cancelled out \n", "find_matches_KR(\"abracadabra\", \"xa\", r=2**16)\n", "\n", "# The safe check still works, of course\n", "find_matches_KR_safe(\"abracadabra\", \"da\", r=2**16)" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "97" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6553697" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "97" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "97" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6488161" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "97" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fingerprint(\"da\", 2**16, r=2**16)\n", "ord(\"d\")*(2**16)**1 + ord(\"a\")\n", "ord(\"a\")\n", "\n", "fingerprint(\"ca\", 2**16, r=2**16)\n", "ord(\"c\")*(2**16)**1 + ord(\"a\")\n", "(ord(\"c\")*(2**16)**1 + ord(\"a\") )%2**16" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "97" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "420913348705" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "420913348705" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "True" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "97" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "425208316001" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "True" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base = 2**16\n", "r = 2**16\n", "fingerprint(\"bda\", base, r)\n", "ord(\"b\")*(base**2) + ord(\"d\")*(base**1) + ord(\"a\")\n", "(ord(\"b\")*base + ord(\"d\"))*base + ord(\"a\")\n", "((ord(\"b\")*base + ord(\"d\"))*base + ord(\"a\"))%r == ord(\"a\")%r\n", "\n", "\n", "fingerprint(\"cda\", base, r)\n", "(ord(\"c\")*base + ord(\"d\"))*base + ord(\"a\")\n", "((ord(\"c\")*base + ord(\"d\"))*base + ord(\"a\"))%r == ord(\"a\")%r" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When $r = base^k$ for some $k$ we have a similar behaviour, though this time it means we only take the last $k$ characters of the rolling hash into account." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 7]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "find_matches_KR(\"Humpty Dumpty\", \"Humpty\", r=2**(16*5))" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2158299737877522940025" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "2158299737877522940025" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "2158299737877522940025" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fingerprint(\"Humpty\", 2**16, r=2**(16*5))\n", "fingerprint(\"Dumpty\", 2**16, r=2**(16*5))\n", "fingerprint(\"Xumpty\", 2**16, r=2**(16*5))" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2158299737877522940025,\n", " 2010726629729956855840,\n", " 2066067987872461357124,\n", " 2139856371159933386869,\n", " 2232065040410175930477,\n", " 590314951159640293488,\n", " 1254411530052683432052,\n", " 2158299737877522940025]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "substring_fingerprints(\"Humpty Dumpty\", 6, 2**16, r=2**(16*5))" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[0]" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Why don't we return the last index?\n", "find_matches_KR(\"Humpty Dumpty\", \"Humpty\", r=2**(16*6))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 2 in this exercise\n", "\n", "#### (We did not see this in class, but you should make sure you understand this exercise)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def fingerprint(string, basis=2**16, r=2**32-3):\n", " \"\"\" used to computes karp-rabin fingerprint of the pattern\n", " employs Horner method (modulo r) \"\"\"\n", " partial_sum = 0\n", " for x in string:\n", " partial_sum = (partial_sum*basis + ord(x)) % r\n", " return partial_sum\n", "\n", "def slide(prev_fp, prev_char, next_char, b_power, basis=2**16, r=2**32-3):\n", " new_fp=((prev_fp - ord(prev_char)*b_power)*basis + ord(next_char)) % r\n", " return new_fp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Section (a)\n", "\n", "Build a generator which, given a string text and a length parameter generates all KR fingerprints of desired length in text.\n", "Instructions:\n", "* Make use of both $fingerprint$ and $slide$\n", "* The generator can call $fingerprint$ only once\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def kr_gen(text, length, basis=2**16, r=2**32-3):\n", " fp = fingerprint(text[:length])\n", " yield fp\n", " b_power = pow(basis, length - 1, r)\n", " for s in range(1, len(text) - length + 1):\n", " fp = slide(fp, text[s - 1], text[s - 1 + length], b_power)\n", " yield fp" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6422933" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "7471495" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gen = kr_gen(\"abracadabra\", 3)\n", "next(gen)\n", "next(gen)\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6357433" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6488452" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6357389" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6553988" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6357390" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6422933" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "7471495" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" }, { "ename": "StopIteration", "evalue": "", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mStopIteration\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[0;32m 6\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 7\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 8\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 9\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 10\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mStopIteration\u001b[0m: " ] } ], "source": [ "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Section (b)\n", "\n", "Build a generator which, given two strings $text1, text2$ and a length parameter $\\ell$ generates all pairs of indices $(i,j)$ such that $text1[i:i+\\ell] == text2[j:j+\\ell]$.\n", "\n", "Instructions:\n", "* Use $kr\\_gen$ from the previous section\n", "* The generator must work in space smaller than the size of the strings. In particular, at no point can you save the array of fingerprints of any entire string" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def generate_shared_substrings(text1, text2, length):\n", " g1 = kr_gen(text1, length)\n", " i1 = 0\n", " for fp1 in g1:\n", " g2 = kr_gen(text2, length)\n", " i2 = 0\n", " for fp2 in g2:\n", " if fp1 == fp2:\n", " yield(i1, i2)\n", " i2 += 1\n", " i1 += 1" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2, 1)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(3, 2)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g = generate_shared_substrings(\"abcdef\",\"xcdefx\",3)\n", "next(g)\n", "next(g)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "scrolled": true }, "outputs": [ { "ename": "StopIteration", "evalue": "", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mStopIteration\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mg\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mStopIteration\u001b[0m: " ] } ], "source": [ "next(g)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 1 }