{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1><center>cs1001.py , Tel Aviv University, Spring 2018/19</center></h1>\n",
    "<img src=\"http://www.pngall.com/wp-content/uploads/2016/05/Python-Logo-PNG-Image-180x180.png\" width=50/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Recitation 10\n",
    "\n",
    "We discussed iterators, generators, and the Karp-Rabin algorithm for string matching "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Takeaways:\n",
    "- A generator function is a function that contains the yield command and returns a genertor object. \n",
    "\n",
    "\n",
    "- The Karp-Rabin algorithm is a probabilistic string matching algorithm (find all occurences of a string of length $m$ in a string of length $n$) running in linear time $O(m+n)$ (as opposed to the naive solution which has running time $O(nm)$).\n",
    "- Make sure you read our KR <a href=\"http://tau-cs1001-py.wdfiles.com/local--files/recitation-logs-2017b/KR-summary_new.pdf\">summary.</a>\n",
    "- Make sure you understand the way the algorithm works, and in particular the \"rolling hash mechanism\", that is, how to compute the fingerprint of the next substring in $O(1)$ time, given the fingerprint of the current substring.\n",
    "- Make sure you understand the \"arithmetization\" step used by the algorithm."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Code for printing several outputs in one cell (not part of the recitation):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.core.interactiveshell import InteractiveShell\n",
    "InteractiveShell.ast_node_interactivity = \"all\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Iterators and Generators"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Iterators"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "str_iterator"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "l = \"123\"  #[1,2,3]\n",
    "li = iter(l)\n",
    "type(li)\n",
    "li2 = iter(l)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'1'"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "next(li)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'2'"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "next(li)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z is 3\n"
     ]
    }
   ],
   "source": [
    "z = next(li)\n",
    "print(\"z is\", z)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "ename": "StopIteration",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mStopIteration\u001b[0m                             Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-28-c9b8845252db>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mli\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;31mStopIteration\u001b[0m: "
     ]
    }
   ],
   "source": [
    "next(li)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'1'"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "next(li2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2\n",
      "3\n"
     ]
    }
   ],
   "source": [
    "for elem in li2:\n",
    "    print(elem)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generators"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "def countdown_gen():\n",
    "    ''' calling it creates an iterator for the values 5,4,3,2,1,'launch' '''\n",
    "    yield 5\n",
    "    yield 4\n",
    "    yield 3\n",
    "    yield 2\n",
    "    yield 1\n",
    "    yield 'launch'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<generator object countdown_gen at 0x000002623F0CFE58>"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cd = countdown_gen()\n",
    "cd #generator object (private case of iterator)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "next(cd) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "next(cd) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3\n",
      "2\n",
      "1\n",
      "launch\n"
     ]
    }
   ],
   "source": [
    "for e in cd:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With generators we can have infinite iterations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def countdown_infinite():\n",
    "    i = 0\n",
    "    while True:\n",
    "        yield i\n",
    "        i += 1\n",
    "        \n",
    "[i for i in range(10)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Solving this <a href=\"http://tau-cs1001-py.wdfiles.com/local--files/recitation-logs-2017a/a10_geneators_exam_q.pdf\"> exam question about generators</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Section (b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(1, 0),\n",
       " (2, 0),\n",
       " (2, 1),\n",
       " (3, 0),\n",
       " (3, 1),\n",
       " (3, 2),\n",
       " (4, 0),\n",
       " (4, 1),\n",
       " (4, 2),\n",
       " (4, 3)]"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def SomePairs():\n",
    "    i=1\n",
    "    while True:\n",
    "        for j in range(i):\n",
    "            yield(i,j)\n",
    "        i=i+1\n",
    "\n",
    "gen = SomePairs()\n",
    "[next(gen) for _ in range(10)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Section (c)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0, 1),\n",
       " (0, 2),\n",
       " (1, 2),\n",
       " (0, 3),\n",
       " (1, 3),\n",
       " (2, 3),\n",
       " (0, 4),\n",
       " (1, 4),\n",
       " (2, 4),\n",
       " (3, 4)]"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def RevGen(PairsGen):\n",
    "    pairs = PairsGen()\n",
    "    while True:\n",
    "        pair = next(pairs)\n",
    "        yield(pair[1],pair[0])\n",
    "\n",
    "gen = RevGen(SomePairs)\n",
    "[next(gen) for _ in range(10)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Section (d1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1, 0)"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "(0, 1)"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "(2, 0)"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "(0, 2)"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "(2, 1)"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "(1, 2)"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "(3, 0)"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "(0, 3)"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "(3, 1)"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "(1, 3)"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def UnionGenerators(gen1, gen2):\n",
    "    while True:\n",
    "        yield next(gen1)\n",
    "        yield next(gen2)\n",
    "        \n",
    "ug = UnionGenerators(SomePairs(), RevGen(SomePairs))\n",
    "\n",
    "for _ in range(10):\n",
    "    next(ug)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Section (d2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def EqPairs():\n",
    "    i=0\n",
    "    while True:\n",
    "        yield (i,i)\n",
    "        i=i+1\n",
    "        \n",
    "def AllPairs():\n",
    "    return UnionGenerators(SomePairs(), UnionGenerators(EqPairs(), RevGen(SomePairs)))\n",
    "\n",
    "gen = AllPairs()\n",
    "[next(gen) for _ in range(10)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The string-matching problem"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given a string $text$ of length $n$, and a short(er) string $pattern$ of length $m$ ($m\\leq n$), report all occurrances of $pattern$ in $text$.\n",
    "\n",
    "Example:\n",
    "\n",
    "$text = $\"abracadabra\",  $pattern = $\"abr\"\n",
    "\n",
    "The requested output should be $[0,7]$, since $pattern$ appears in $text$ in indices $0,7$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Karp-Rabin Algorithm\n",
    "\n",
    "- The algorithm works as follows:\n",
    "    - An initial \"fingerprint\" is computed for the pattern string and the first substring of length $m$ in the text\n",
    "    - If both fingerprints are equal, we assume a match\n",
    "    - Then, a \"rolling hash\" mechanism computes the next fingerprint in $O(1)$ time given the current fingerprint in the text\n",
    "    - At each stage again a comparison is made and equal fingerprints are considered a match"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "def is_prime(N, show_witness=False):\n",
    "    \"\"\" probabilistic test for N's compositeness \"\"\"\n",
    "    for i in range(0,100):\n",
    "        a = random.randint(1,N-1) # a is a random integer in [1..N-1]\n",
    "        if pow(a,N-1,N) != 1:\n",
    "            if show_witness:  # caller wishes to see a witness\n",
    "                print(N,\"is composite\",\"\\n\",a,\"is a witness, i=\",i+1)\n",
    "            return False\n",
    "    return True\n",
    "\n",
    "\n",
    "def find_prime(n):\n",
    "    \"\"\" find random n-bit long prime \"\"\"\n",
    "    while(True):\n",
    "        candidate = random.randrange(2**(n-1),2**n)\n",
    "        if is_prime(candidate):\n",
    "            return candidate\n",
    "\n",
    "\n",
    "def fingerprint(string, basis, r):\n",
    "    \"\"\" used to computes karp-rabin fingerprint of the pattern\n",
    "        and of the first substring in the text\n",
    "        employs Horner method (modulo r) \"\"\"\n",
    "    s = 0\n",
    "    for ch in string:\n",
    "        s = (s*basis + ord(ch)) % r # Horner\n",
    "    return s\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "def substring_fingerprints(string, m, basis, r):\n",
    "    \"\"\" return a list of all fingerprints of size m windows in string \"\"\"\n",
    "    fp_list = []\n",
    "    b_power = pow(basis, m-1, r)\n",
    "\n",
    "    # compute first fingerprint\n",
    "    fp_list.append(fingerprint(string[0:m], basis, r))\n",
    "\n",
    "    # compute f_list[s], based on f_list[s-1]\n",
    "    for s in range(1,len(string)-m+1): # O(n-m-1), each iteration O(1)\n",
    "        next_fingerprint = \\\n",
    "                ((fp_list[s-1] - ord(string[s-1])*b_power) * basis \\\n",
    "                              + ord(string[s+m-1])) % r\n",
    "        fp_list.append(next_fingerprint)\n",
    "\n",
    "    return fp_list\n",
    "\n",
    "def find_matches_KR(text, pattern, r=None):\n",
    "    \"\"\" find all shifts of occurances of pattern in text \"\"\"\n",
    "    if len(pattern) > len(text):\n",
    "        return [] #no matches\n",
    "\n",
    "    basis = 2**16 # assume 16 bits are enough for all chars\n",
    "    if r == None: \n",
    "        r = find_prime(32) # randomly pick a 32 bit long prime using \n",
    "                           # good old find_prime from an earlier lecture\n",
    "\n",
    "    p = fingerprint(pattern, basis, r)\n",
    "    fp_list = substring_fingerprints(text, len(pattern), basis, r)\n",
    "\n",
    "    matches = [shift for shift in range(len(fp_list)) if fp_list[shift] == p]\n",
    "\n",
    "    return matches"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"abracadabra\"\n",
    "pattern = \"abr\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2149219973"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r = find_prime(32)\n",
    "base = 2**16\n",
    "\n",
    "r"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1818795565"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fingerprint(\"abr\", base, r)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "416618250354"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "1818795565"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "base = 2**16\n",
    "arit = ord(\"a\")*(base**2) + ord(\"b\")*(base**1) + ord(\"r\")*(base**0)\n",
    "arit\n",
    "fp = arit%r\n",
    "fp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1818795565,\n",
       " 1816371474,\n",
       " 1759694964,\n",
       " 1818861084,\n",
       " 1811784715,\n",
       " 1818926620,\n",
       " 1808312063,\n",
       " 1818795565,\n",
       " 1816371474]"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "substring_fingerprints(text, 3, base, r)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0, 7]"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "find_matches_KR(text, pattern, r)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Safe version\n",
    "Makes sure no false positives occur. In the worst case, when all $n-m$ possible substrings are indeed matches, behaves as the naive solution in terms of time complexity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_matches_KR_safe(text, pattern, r=None):\n",
    "    \"\"\" a safe version of KR\n",
    "        checks every suspect for a match \"\"\"\n",
    "    if len(pattern) > len(text):\n",
    "        return [] #no matches\n",
    "\n",
    "    basis = 2**16 # assume 16 bits are enough for all chars\n",
    "    if r == None: \n",
    "        r = find_prime(32) # randomly pick a 32 bit long prime using \n",
    "                           # good old find_prime from an earlier lecture\n",
    "\n",
    "    p = fingerprint(pattern, basis, r)\n",
    "    fp_list = substring_fingerprints(text, len(pattern), basis, r)\n",
    "\n",
    "    matches = [shift for shift in range(len(fp_list)) \\\n",
    "               if fp_list[shift] == p and \\\n",
    "                  text[shift:shift+len(pattern)]==pattern]\n",
    "\n",
    "    return matches\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Competition between versions on single char string.\n",
    "This is the worst-case scenario for the safe version.\n",
    "Changing $m$ has a greater effect on the safe version than on the standard KR."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "text = 'a'* 1000000\n",
      "\n",
      "pattern = 'a'* 100\n",
      "find_matches_KR 1.4179207529959967\n",
      "find_matches_KR_safe 1.6589124150050338\n",
      "\n",
      "pattern = 'a'* 1000\n",
      "find_matches_KR 1.3420768009964377\n",
      "find_matches_KR_safe 1.775463817990385\n",
      "\n",
      "pattern = 'a'* 10000\n",
      "find_matches_KR 1.2947049719950883\n",
      "find_matches_KR_safe 2.638186254000175\n",
      "\n",
      "pattern = 'a'* 100000\n",
      "find_matches_KR 1.4077826730062952\n",
      "find_matches_KR_safe 13.829727393997018\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "\n",
    "text = \"a\"*1000000\n",
    "print(\"text = 'a'*\",len(text))\n",
    "print()\n",
    "for pattern in [\"a\"*100, \"a\"*1000, \"a\"*10000, \"a\"*100000]:\n",
    "    print(\"pattern = 'a'*\",len(pattern))\n",
    "    r = find_prime(32)\n",
    "    for f in [find_matches_KR, find_matches_KR_safe]:\n",
    "        t0=time.perf_counter()\n",
    "        res=f(text, pattern, r)\n",
    "        t1=time.perf_counter()\n",
    "        print (f.__name__, t1-t0 )\n",
    "    print(\"\") #newline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Competition between versions on random strings. \n",
    "\n",
    "Note that the standard and safe versions of KR has similar running times. Moreover, as $m$ increases, the running time slightly decreases since there are less candidates to consider."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "random str of len n= 1000000  , random pattern of length m= 1000\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "find_matches_KR 1.4785060709982645\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "find_matches_KR_safe 1.3623063810082385\n",
      "random str of len n= 1000000  , random pattern of length m= 10000\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "find_matches_KR 1.383506228987244\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "find_matches_KR_safe 1.3038542309950572\n",
      "random str of len n= 1000000  , random pattern of length m= 100000\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "find_matches_KR 1.25419370700547\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "find_matches_KR_safe 1.6594200499966973\n"
     ]
    }
   ],
   "source": [
    "import random\n",
    "def gen_str(size):\n",
    "    ''' Generate a random lowercase English string of length size'''\n",
    "    s=\"\"\n",
    "    for i in range(size):\n",
    "        s+=random.choice(\"abcdefghijklmnopqrstuvwxyz\")\n",
    "    return s\n",
    "\n",
    "\n",
    "n=1000000\n",
    "m=1000\n",
    "text = gen_str(n)\n",
    "pattern = gen_str(m)\n",
    "print(\"random str of len n=\", n, \" , random pattern of length m=\",m)\n",
    "r = find_prime(32)\n",
    "for f in [find_matches_KR, find_matches_KR_safe]:\n",
    "    t0=time.perf_counter()\n",
    "    f(text, pattern, r)\n",
    "    t1=time.perf_counter()\n",
    "    print (f.__name__, t1-t0)\n",
    "    \n",
    "\n",
    "m=10000\n",
    "pattern = gen_str(m)\n",
    "print(\"random str of len n=\", n, \" , random pattern of length m=\",m)\n",
    "r = find_prime(32)\n",
    "for f in [find_matches_KR, find_matches_KR_safe]:\n",
    "    t0=time.perf_counter()\n",
    "    f(text, pattern, r)\n",
    "    t1=time.perf_counter()\n",
    "    print (f.__name__, t1-t0)\n",
    "    \n",
    "m=100000\n",
    "pattern = gen_str(m)\n",
    "print(\"random str of len n=\", n, \" , random pattern of length m=\",m)\n",
    "r = find_prime(32)\n",
    "for f in [find_matches_KR, find_matches_KR_safe]:\n",
    "    t0=time.perf_counter()\n",
    "    f(text, pattern, r)\n",
    "    t1=time.perf_counter()\n",
    "    print (f.__name__, t1-t0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Choice of $r$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By setting $r$ to be a power of the base we will obtain more false-positives. Specifically, for $r=base$, we will get many more false positives. Why is that?\n",
    "\n",
    "Note that the computation of the fingerprint is of the form $fp = (x \\cdot base + y) \\mod r$ for some $x,y$ (where $y$ is the arithmetization of the last character in the current substring). \n",
    "\n",
    "In the case where $base = r$ this is equivalent to $(x \\cdot r + y) \\mod r = y \\mod r$, i.e., the fingerprint takes into account only the last character of the substring!\n",
    "\n",
    "Clearly, choosing $r$ which is a multiple of $base$ is not a good idea. This may also serve as an intuition for why we choose $r$ to be prime - say for example we know that all the fingerprints we compute are even, then if our $r$ is a multiple of $2$, we will always have $fp \\mod r$ an even number, and thus we will only utilize half of our range!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[2, 4, 6, 9]"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "[2, 4, 6, 9]"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "[6]"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Many false positives\n",
    "find_matches_KR(\"abracadabra\", \"da\", r=2**16)\n",
    "\n",
    "# Same result here since the \"x\" will be cancelled out \n",
    "find_matches_KR(\"abracadabra\", \"xa\", r=2**16)\n",
    "\n",
    "# The safe check still works, of course\n",
    "find_matches_KR_safe(\"abracadabra\", \"da\", r=2**16)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "97"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "6553697"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "97"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "97"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "6488161"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "97"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fingerprint(\"da\", 2**16, r=2**16)\n",
    "ord(\"d\")*(2**16)**1 + ord(\"a\")\n",
    "ord(\"a\")\n",
    "\n",
    "fingerprint(\"ca\", 2**16, r=2**16)\n",
    "ord(\"c\")*(2**16)**1 + ord(\"a\")\n",
    "(ord(\"c\")*(2**16)**1 + ord(\"a\") )%2**16"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "97"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "420913348705"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "420913348705"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "97"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "425208316001"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "base = 2**16\n",
    "r = 2**16\n",
    "fingerprint(\"bda\", base, r)\n",
    "ord(\"b\")*(base**2) + ord(\"d\")*(base**1) + ord(\"a\")\n",
    "(ord(\"b\")*base + ord(\"d\"))*base + ord(\"a\")\n",
    "((ord(\"b\")*base + ord(\"d\"))*base + ord(\"a\"))%r == ord(\"a\")%r\n",
    "\n",
    "\n",
    "fingerprint(\"cda\", base, r)\n",
    "(ord(\"c\")*base + ord(\"d\"))*base + ord(\"a\")\n",
    "((ord(\"c\")*base + ord(\"d\"))*base + ord(\"a\"))%r == ord(\"a\")%r"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When $r = base^k$ for some $k$ we have a similar behaviour, though this time it means we only take the last $k$ characters of the rolling hash into account."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0, 7]"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "find_matches_KR(\"Humpty Dumpty\", \"Humpty\", r=2**(16*5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2158299737877522940025"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "2158299737877522940025"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "2158299737877522940025"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fingerprint(\"Humpty\", 2**16, r=2**(16*5))\n",
    "fingerprint(\"Dumpty\", 2**16, r=2**(16*5))\n",
    "fingerprint(\"Xumpty\", 2**16, r=2**(16*5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[2158299737877522940025,\n",
       " 2010726629729956855840,\n",
       " 2066067987872461357124,\n",
       " 2139856371159933386869,\n",
       " 2232065040410175930477,\n",
       " 590314951159640293488,\n",
       " 1254411530052683432052,\n",
       " 2158299737877522940025]"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "substring_fingerprints(\"Humpty Dumpty\", 6, 2**16, r=2**(16*5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0]"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Why don't we return the last index?\n",
    "find_matches_KR(\"Humpty Dumpty\", \"Humpty\", r=2**(16*6))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 2  in <a href=\"http://tau-cs1001-py.wdfiles.com/local--files/home-assignments-2017b/HW6_2017b.pdf\"> this exercise</a>\n",
    "\n",
    "#### (We did not see this in class, but you should make sure you understand this exercise)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fingerprint(string, basis=2**16, r=2**32-3):\n",
    "    \"\"\" used to computes karp-rabin fingerprint of the pattern\n",
    "    employs Horner method (modulo r) \"\"\"\n",
    "    partial_sum = 0\n",
    "    for x in string:\n",
    "        partial_sum = (partial_sum*basis + ord(x)) % r\n",
    "    return partial_sum\n",
    "\n",
    "def slide(prev_fp, prev_char, next_char, b_power, basis=2**16, r=2**32-3):\n",
    "    new_fp=((prev_fp - ord(prev_char)*b_power)*basis + ord(next_char)) % r\n",
    "    return new_fp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Section (a)\n",
    "\n",
    "Build a generator which, given a string text and a length parameter generates all KR fingerprints of desired length in text.\n",
    "Instructions:\n",
    "* Make use of both $fingerprint$ and $slide$\n",
    "* The generator can call $fingerprint$ only once\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "def kr_gen(text, length, basis=2**16, r=2**32-3):\n",
    "    fp = fingerprint(text[:length])\n",
    "    yield fp\n",
    "    b_power = pow(basis, length - 1, r)\n",
    "    for s in range(1, len(text) - length + 1):\n",
    "        fp = slide(fp, text[s - 1], text[s - 1 + length], b_power)\n",
    "        yield fp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6422933"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "7471495"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gen = kr_gen(\"abracadabra\", 3)\n",
    "next(gen)\n",
    "next(gen)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6357433"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "6488452"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "6357389"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "6553988"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "6357390"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "6422933"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "7471495"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "ename": "StopIteration",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mStopIteration\u001b[0m                             Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-23-6fd1aa460113>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[0;32m      6\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      7\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 8\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m      9\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     10\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mStopIteration\u001b[0m: "
     ]
    }
   ],
   "source": [
    "next(gen)\n",
    "next(gen)\n",
    "next(gen)\n",
    "next(gen)\n",
    "next(gen)\n",
    "next(gen)\n",
    "next(gen)\n",
    "next(gen)\n",
    "next(gen)\n",
    "next(gen)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Section (b)\n",
    "\n",
    "Build a generator which, given two strings $text1, text2$ and a length parameter $\\ell$ generates all pairs of indices $(i,j)$ such that $text1[i:i+\\ell] == text2[j:j+\\ell]$.\n",
    "\n",
    "Instructions:\n",
    "* Use $kr\\_gen$ from the previous section\n",
    "* The generator must work in space smaller than the size of the strings. In particular, at no point can you save the array of fingerprints of any entire string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_shared_substrings(text1, text2, length):\n",
    "    g1 = kr_gen(text1, length)\n",
    "    i1 = 0\n",
    "    for fp1 in g1:\n",
    "        g2 = kr_gen(text2, length)\n",
    "        i2 = 0\n",
    "        for fp2 in g2:\n",
    "            if fp1 == fp2:\n",
    "                yield(i1, i2)\n",
    "            i2 += 1\n",
    "        i1 += 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2, 1)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "(3, 2)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "g = generate_shared_substrings(\"abcdef\",\"xcdefx\",3)\n",
    "next(g)\n",
    "next(g)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "ename": "StopIteration",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mStopIteration\u001b[0m                             Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-51-e734f8aca5ac>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mg\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;31mStopIteration\u001b[0m: "
     ]
    }
   ],
   "source": [
    "next(g)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}