{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

cs1001.py , Tel Aviv University, Fall 2017-2018

\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recitation 11\n", "\n", "We discussed the \"string matching\" problem and the Karp-Rabin(KR) algorithm for solving it. Then we solved a question about KR using generators." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Takeaways:\n", "- Make sure you read our summary.\n", "- A naive solution for the string-matching problem has $O(m(n-m))$ time complexity.\n", "- By allowing \"false-positives\" (with small probability), we can obtain a linear time solution for the string-matching problem.\n", "- Make sure you understand the way the algorithm works, and in particular the \"rolling hash mechanism\", that is, how to compute the fingerprint of the next substring in $O(1)$ time, given the fingerprint of the current substring.\n", "- Make sure you understand the \"aritmetization\" step used by the algorithm.\n", "- Make sure you understand the question we solved." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code for printing several outputs in one cell (not part of the recitation):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from IPython.core.interactiveshell import InteractiveShell\n", "InteractiveShell.ast_node_interactivity = \"all\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The string-matching problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given a string $text$ of length $n$, and a short(er) string $pattern$ of length $m$ ($m\\leq n$), report all occurrances of $pattern$ in $text$.\n", "\n", "Example:\n", "\n", "$text = $\"abracadabra\", $pattern = $\"abr\"\n", "\n", "The requested output should be $[0,7]$, since $pattern$ appears in $text$ in indices $0,7$.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Karp-Rabin Algorithm" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def fingerprint(text, basis=2**16, r=2**32-3):\n", " \"\"\" used to compute karp-rabin fingerprint of the pattern\n", " employs Horner method (modulo r) \"\"\"\n", " partial_sum = 0\n", " for ch in text:\n", " partial_sum =(partial_sum*basis + ord(ch)) % r\n", " return partial_sum\n", "\n", "def text_fingerprint(text, m, basis=2**16, r=2**32-3):\n", " \"\"\" computes karp-rabin fingerprint of the text \"\"\"\n", " f=[]\n", " b_power = pow(basis, m-1, r)\n", " list.append(f, fingerprint(text[0:m], basis, r))\n", " # f[0] equals first text fingerprint \n", " for s in range(1, len(text)-m+1):\n", " new_fingerprint = ( (f[s-1] - ord(text[s-1])*b_power)*basis\n", " +ord(text[s+m-1]) ) % r\n", " # compute f[s], based on f[s-1]\n", " list.append(f,new_fingerprint)# append f[s] to existing f \n", " return f\n", "\n", "def find_matches_KR(pattern, text, basis=2**16, r=2**32-3):\n", " \"\"\" find all occurances of pattern in text\n", " using coin flipping Karp-Rabin algorithm \"\"\"\n", " if len(pattern) > len(text):\n", " return []\n", " p = fingerprint(pattern, basis, r)\n", " f = text_fingerprint(text, len(pattern), basis, r)\n", " matches = [s for (s,f_s) in enumerate(f) if f_s == p]\n", " # list comprehension \n", " return matches" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "text = \"abracadabra\"\n", "pattern = \"abr\"\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6422933" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fingerprint(\"abr\")" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "416618250354" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6422933" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base = 2**16\n", "arit = ord(\"a\")*(base**2) + ord(\"b\")*(base**1) + ord(\"r\")*(base**0)\n", "arit\n", "r = 2**32 - 3\n", "fp = arit%r\n", "fp" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[6422933,\n", " 7471495,\n", " 6357433,\n", " 6488452,\n", " 6357389,\n", " 6553988,\n", " 6357390,\n", " 6422933,\n", " 7471495]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_fingerprint(text, 3)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 7]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "find_matches_KR(pattern, text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Safe version\n", "Makes sure no false positives occur. In the worst case, when all $n-m$ possible substrings are indeed matches, behaves as the naive solution in terms of time complexity." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def find_matches_KR_safe(pattern, text, basis=2**16, r=2**32-3):\n", " \"\"\" a safe version of KR\n", " checks every suspect for a match \"\"\"\n", "\n", " if len(pattern) > len(text):\n", " return []\n", " p = fingerprint(pattern, basis, r)\n", " f = text_fingerprint(text, len(pattern), basis, r)\n", " matches = [s for (s,f_s) in enumerate(f) if f_s == p \\\n", " and text[s:s+len(pattern)]==pattern]\n", " #note that python performs \"cleaver evaluation\" of the 'and' statement\n", " return matches" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Competition between versions on single char string.\n", "This is the worst-case scenario for the safe version.\n", "Changing $m$ has a greater effect on the safe version than on the standard KR." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text = 'a'* 100000\n", "pattern = 'a'* 100\n", "find_matches_KR_safe 0.17724202722919818\n", "find_matches_KR 0.1276535441773806\n", "\n", "pattern = 'a'* 1000\n", "find_matches_KR_safe 0.18844752517457397\n", "find_matches_KR 0.12176520046895167\n", "\n", "pattern = 'a'* 10000\n", "find_matches_KR_safe 0.27398147088196856\n", "find_matches_KR 0.13507782308521266\n", "\n" ] } ], "source": [ "import time\n", "\n", "text = \"a\"*100000\n", "print(\"text = 'a'*\",len(text))\n", "for pattern in [\"a\"*100, \"a\"*1000, \"a\"*10000]:\n", "#for pattern in [\"a\"*30000, \"a\"*40000, \"a\"*50000, \"a\"*60000, \"a\"*70000]: #max in m=n/2\n", " print(\"pattern = 'a'*\",len(pattern))\n", " for f in [find_matches_KR_safe, find_matches_KR]:\n", " t0=time.clock()\n", " res=f(pattern, text)\n", " t1=time.clock()\n", " print (f.__name__, t1-t0)\n", " print(\"\") #newline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Competition between versions on random strings. \n", "\n", "Note that the standard and safe versions of KR has similar running times. Moreover, as $m$ increases, the running time slightly decreases since there are less candidates to consider." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "random str of len n= 100000 , random pattern of length m= 1000\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR 0.17595477853319608\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR_safe 0.1846639377035899\n", "random str of len n= 100000 , random pattern of length m= 10000\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR 0.1361657278466737\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR_safe 0.17290264515031595\n" ] } ], "source": [ "import random\n", "def gen_str(size):\n", " ''' Generate a random lowercase English string of length size'''\n", " s=\"\"\n", " for i in range(size):\n", " s+=random.choice(\"abcdefghijklmnopqrstuvwxyz\")\n", " return s\n", "\n", "\n", "n=100000\n", "m=1000\n", "text = gen_str(n)\n", "pattern = gen_str(m)\n", "print(\"random str of len n=\", n, \" , random pattern of length m=\",m)\n", "for f in [find_matches_KR, find_matches_KR_safe]:\n", " t0=time.clock()\n", " f(pattern, text)\n", " t1=time.clock()\n", " print (f.__name__, t1-t0)\n", " \n", "n=100000\n", "m=10000\n", "text = gen_str(n)\n", "pattern = gen_str(m)\n", "print(\"random str of len n=\", n, \" , random pattern of length m=\",m)\n", "for f in [find_matches_KR, find_matches_KR_safe]:\n", " t0=time.clock()\n", " f(pattern, text)\n", " t1=time.clock()\n", " print (f.__name__, t1-t0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Choice of $r$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By setting $r$ to be a power of the base, say $r=base$, we will obtain more false-positives. This may serve as an intuition for choosing a prime $r$." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2, 4, 6, 9]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[6]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "find_matches_KR(\"da\", \"abracadabra\", basis=2**16, r=2**16)\n", "find_matches_KR_safe(\"da\", \"abracadabra\", basis=2**16, r=2**16)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "97" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6553697" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "97" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "97" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6488161" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "97" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fingerprint(\"da\", 2**16, r=2**16)\n", "ord(\"d\")*(2**16)**1 + ord(\"a\")\n", "ord(\"a\")\n", "\n", "fingerprint(\"ca\", 2**16, r=2**16)\n", "ord(\"c\")*(2**16)**1 + ord(\"a\")\n", "(ord(\"c\")*(2**16)**1 + ord(\"a\") )%2**16" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "97" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "420913348705" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "420913348705" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "True" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "97" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "425208316001" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "True" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base = 2**16\n", "r = 2**16\n", "fingerprint(\"bda\", base, r)\n", "ord(\"b\")*(base**2) + ord(\"d\")*(base**1) + ord(\"a\")\n", "(ord(\"b\")*base + ord(\"d\"))*base + ord(\"a\")\n", "((ord(\"b\")*base + ord(\"d\"))*base + ord(\"a\"))%r == ord(\"a\")%r\n", "\n", "\n", "fingerprint(\"cda\", base, r)\n", "(ord(\"c\")*base + ord(\"d\"))*base + ord(\"a\")\n", "((ord(\"c\")*base + ord(\"d\"))*base + ord(\"a\"))%r == ord(\"a\")%r" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 7]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "find_matches_KR(\"Humpty\", \"Humpty Dumpty\", r=2**(16*5))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2158299737877522940025" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "2158299737877522940025" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fingerprint(\"Humpty\", r=2**(16*5))\n", "fingerprint(\"Dumpty\", r=2**(16*5))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text_fingerprint(\"Humpty Dumpty\", 6, r=2**(16*5))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "find_matches_KR(\"Humpty\", \"Humpty Dumpty\", r=2**(16*6))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Alternative fingerprint" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def fingerprint_sum(string, r=2**32-3):\n", " ''' a different kind of fingerprint: sum of all ords '''\n", " partial_sum=0\n", " for x in string:\n", " partial_sum = (partial_sum + ord(x)) % r\n", " return partial_sum\n", "\n", "def text_fingerprint_sum(string, length, r=2**32-3):\n", " \"\"\" used to compute a simple sum fingerprint of the text \"\"\"\n", " f =[]\n", " list.append(f, fingerprint_sum(string[0:length], r))\n", " for s in range (1, len(string)-length+1):\n", " new_fingerprint = (f[s-1] - ord(string[s-1]) + ord(string[s+length-1])) % r\n", " # compute f[s], based on f[s -1]\n", " list.append(f, new_fingerprint) # append f[s] to existing f\n", " return f" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[309, 309, 310, 293, 296, 294, 295, 309, 309]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_fingerprint_sum(\"abracadabra\", 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solving question 2 in this exercise" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def fingerprint(string, basis=2**16, r=2**32-3):\n", " \"\"\" used to computes karp-rabin fingerprint of the pattern\n", " employs Horner method (modulo r) \"\"\"\n", " partial_sum=0\n", " for x in string:\n", " partial_sum=(partial_sum*basis+ord(x)) % r\n", " return partial_sum\n", "\n", "def slide(prev_fp,prev_char,next_char,b_power,basis=2**16,r=2**32-3):\n", " new_fp=((prev_fp-ord(prev_char)*b_power)*basis+ord(next_char)) % r\n", " return new_fp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Section (a)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "def kr_gen(text,length,basis=2**16,r=2**32-3):\n", " fp = fingerprint(text[:length])\n", " yield fp\n", " b_power = pow(basis,length-1,r)\n", " for s in range(1,len(text)-length+1):\n", " fp = slide(fp,text[s-1],text[s-1+length],b_power)\n", " yield fp" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6422933" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "7471495" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gen = kr_gen(\"abracadabra\", 3)\n", "next(gen)\n", "next(gen)\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6357433" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6488452" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6357389" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6553988" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6357390" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6422933" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "7471495" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "ename": "StopIteration", "evalue": "", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mStopIteration\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 6\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 7\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 8\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 9\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 10\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mStopIteration\u001b[0m: " ] } ], "source": [ "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Section (b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def generate_shared_substrings(text1,text2,length):\n", " g1 = kr_gen(text1, length)\n", " i1 = 0\n", " for fp1 in g1:\n", " g2 = kr_gen(text2, length)\n", " i2 = 0\n", " for fp2 in g2:\n", " if fp1 == fp2:\n", " yield(i1, i2)\n", " i2 += 1\n", " i1 += 1" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2, 1)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(3, 2)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g = generate_shared_substrings(\"abcdef\",\"xcdefx\",3)\n", "next(g)\n", "next(g)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "ename": "StopIteration", "evalue": "", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mStopIteration\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mg\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mStopIteration\u001b[0m: " ] } ], "source": [ "next(g)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 1 }