{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

cs1001.py , Tel Aviv University, Fall 2018/19

\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recitation 10\n", "\n", "We discussed the Karp-Rabin algorithm for string matching and Huffman coding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Takeaways:\n", "- The Karp-Rabin algorithm is a probabilistic string matching algorithm (find all occurrences of a string of length $m$ in a string of length $n$) running in linear time $O(m+n)$ (as opposed to the naive solution which has running time $O(nm)$).\n", "- Make sure you read our KR summary.\n", "- Make sure you understand the way the algorithm works, and in particular the \"rolling hash mechanism\", that is, how to compute the fingerprint of the next substring in $O(1)$ time, given the fingerprint of the current substring.\n", "- Make sure you understand the \"arithmetization\" step used by the algorithm.\n", "\n", "- Huffman Coding is a lossless compression technique. Given a corpus, a character frequency analysis is performed and a prefix free tree is built according to the analysis\n", "- Given a text, the characters are encoded using the tree\n", "- Given an encoding, decoding is performed in a \"greedy\" manner (for this reason a prefix free encoding is crucial)\n", "- While Huffman coding is optimal, if the character frequency of the corpus differs from that of the encoded text, the compression can be inefficient" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code for printing several outputs in one cell (not part of the recitation):" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from IPython.core.interactiveshell import InteractiveShell\n", "InteractiveShell.ast_node_interactivity = \"all\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The string-matching problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given a string $text$ of length $n$, and a short(er) string $pattern$ of length $m$ ($m\\leq n$), report all occurrences of $pattern$ in $text$.\n", "\n", "Example:\n", "\n", "$text = $\"abracadabra\", $pattern = $\"abr\"\n", "\n", "The requested output should be $[0,7]$, since $pattern$ appears in $text$ in indices $0,7$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Karp-Rabin Algorithm\n", "\n", "- The algorithm works as follows:\n", " - An initial \"fingerprint\" is computed for the pattern string and the first substring of length $m$ in the text\n", " - If both fingerprints are equal, we assume a match\n", " - Then, a \"rolling hash\" mechanism computes the next fingerprint in $O(1)$ time given the current fingerprint in the text\n", " - At each stage again a comparison is made and equal fingerprints are considered a match" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def fingerprint(text, basis=2**16, r=2**32-3):\n", " \"\"\" used to compute karp-rabin fingerprint of the pattern\n", " employs Horner method (modulo r) \"\"\"\n", " partial_sum = 0\n", " for ch in text:\n", " partial_sum =(partial_sum*basis + ord(ch)) % r\n", " return partial_sum\n", "\n", "def text_fingerprint(text, m, basis=2**16, r=2**32-3):\n", " \"\"\" computes karp-rabin fingerprint of the text \"\"\"\n", " f=[]\n", " b_power = pow(basis, m-1, r)\n", " list.append(f, fingerprint(text[0:m], basis, r))\n", " # f[0] equals first text fingerprint \n", " for s in range(1, len(text)-m+1):\n", " new_fingerprint = ( (f[s-1] - ord(text[s-1])*b_power)*basis\n", " +ord(text[s+m-1]) ) % r\n", " # compute f[s], based on f[s-1]\n", " list.append(f,new_fingerprint)# append f[s] to existing f \n", " return f\n", "\n", "def find_matches_KR(pattern, text, basis=2**16, r=2**32-3):\n", " \"\"\" find all occurrences of pattern in text\n", " using coin flipping Karp-Rabin algorithm \"\"\"\n", " if len(pattern) > len(text):\n", " return []\n", " p = fingerprint(pattern, basis, r)\n", " f = text_fingerprint(text, len(pattern), basis, r)\n", " matches = [s for (s,f_s) in enumerate(f) if f_s == p]\n", " # list comprehension \n", " return matches" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "text = \"abracadabra\"\n", "pattern = \"abr\"" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6422933" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fingerprint(\"abr\")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "416618250354" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6422933" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base = 2**16\n", "arit = ord(\"a\")*(base**2) + ord(\"b\")*(base**1) + ord(\"r\")*(base**0)\n", "arit\n", "r = 2**32 - 3\n", "fp = arit%r\n", "fp" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[6422933,\n", " 7471495,\n", " 6357433,\n", " 6488452,\n", " 6357389,\n", " 6553988,\n", " 6357390,\n", " 6422933,\n", " 7471495]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_fingerprint(text, 3)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 7]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "find_matches_KR(pattern, text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Safe version\n", "Makes sure no false positives occur. In the worst case, when all $n-m$ possible substrings are indeed matches, behaves as the naive solution in terms of time complexity." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "def find_matches_KR_safe(pattern, text, basis=2**16, r=2**32-3):\n", " \"\"\" a safe version of KR\n", " checks every suspect for a match \"\"\"\n", "\n", " if len(pattern) > len(text):\n", " return []\n", " p = fingerprint(pattern, basis, r)\n", " f = text_fingerprint(text, len(pattern), basis, r)\n", " matches = [s for (s,f_s) in enumerate(f) if f_s == p \\\n", " and text[s:s+len(pattern)]==pattern]\n", " #note that python performs \"short circuit evaluation\" of the 'and' statement\n", " return matches" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Competition between versions on single char string.\n", "This is the worst-case scenario for the safe version.\n", "\n", "How will changing the pattern length $m$ affect both the safe version and the standard KR version?" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text = 'a'* 1000000\n", "pattern = 'a'* 100\n", "find_matches_KR 2.527005435999996\n", "find_matches_KR_safe 3.1151218930000084\n", "\n", "pattern = 'a'* 1000\n", "find_matches_KR 2.022127041999994\n", "find_matches_KR_safe 3.558386521000017\n", "\n", "pattern = 'a'* 10000\n", "find_matches_KR 2.499810314000001\n", "find_matches_KR_safe 6.067419519999987\n", "\n", "pattern = 'a'* 100000\n", "find_matches_KR 2.4022465289999957\n", "find_matches_KR_safe 29.330972100999986\n", "\n" ] } ], "source": [ "import time\n", "\n", "text = \"a\"*1000000\n", "print(\"text = 'a'*\",len(text))\n", "for pattern in [\"a\"*100, \"a\"*1000, \"a\"*10000, \"a\"*100000]:\n", " print(\"pattern = 'a'*\",len(pattern))\n", " for f in [find_matches_KR, find_matches_KR_safe]:\n", " t0=time.perf_counter()\n", " res=f(pattern, text)\n", " t1=time.perf_counter()\n", " print (f.__name__, t1-t0)\n", " print(\"\") #newline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Competition between versions on random strings. \n", "\n", "How do the regular/safe versions differ when we look for a random pattern in a random string?\n", "\n", "Furthermore, how does the running time vary when $m$, the pattern length, increases?" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "random str of len n= 1000000 , random pattern of length m= 1000\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR 2.527568842999983\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR_safe 2.52221031900001\n", "random str of len n= 1000000 , random pattern of length m= 10000\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR 2.490760865000027\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR_safe 2.4914142309999647\n", "random str of len n= 1000000 , random pattern of length m= 100000\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR 2.41860688700001\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "find_matches_KR_safe 2.451781514000004\n" ] } ], "source": [ "import random\n", "def gen_str(size):\n", " ''' Generate a random lowercase English string of length size'''\n", " s=\"\"\n", " for i in range(size):\n", " s+=random.choice(\"abcdefghijklmnopqrstuvwxyz\")\n", " return s\n", "\n", "\n", "n=1000000\n", "m=1000\n", "text = gen_str(n)\n", "pattern = gen_str(m)\n", "print(\"random str of len n=\", n, \" , random pattern of length m=\",m)\n", "for f in [find_matches_KR, find_matches_KR_safe]:\n", " t0=time.perf_counter()\n", " f(pattern, text)\n", " t1=time.perf_counter()\n", " print (f.__name__, t1-t0)\n", " \n", "\n", "m=10000\n", "pattern = gen_str(m)\n", "print(\"random str of len n=\", n, \" , random pattern of length m=\",m)\n", "for f in [find_matches_KR, find_matches_KR_safe]:\n", " t0=time.perf_counter()\n", " f(pattern, text)\n", " t1=time.perf_counter()\n", " print (f.__name__, t1-t0)\n", " \n", "m=100000\n", "pattern = gen_str(m)\n", "print(\"random str of len n=\", n, \" , random pattern of length m=\",m)\n", "for f in [find_matches_KR, find_matches_KR_safe]:\n", " t0=time.perf_counter()\n", " f(pattern, text)\n", " t1=time.perf_counter()\n", " print (f.__name__, t1-t0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Choice of $r$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By setting $r$ to be a power of the base we will obtain more false-positives. Specifically, for $r=base$, we will get many more false positives. Why is that?\n", "\n", "Note that the computation of the fingerprint is of the form $fp = (x \\cdot base + y) \\mod r$ for some $x,y$ (where $y$ is the arithmetization of the last character in the current substring). \n", "\n", "In the case where $base = r$ this is equivalent to $(x \\cdot r + y) \\mod r = y \\mod r$, i.e., the fingerprint takes into account only the last character of the substring!\n", "\n", "Clearly, choosing $r$ which is a multiple of $base$ is not a good idea. This may also serve as an intuition for why we choose $r$ to be prime - say for example we know that all the fingerprints we compute are even, then if our $r$ is a multiple of $2$, we will always have $fp \\mod r$ an even number, and thus we will only utilize half of our range!" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2, 4, 6, 9]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[2, 4, 6, 9]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[6]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Many false positives\n", "find_matches_KR(\"da\", \"abracadabra\", basis=2**16, r=2**16)\n", "\n", "# Same result here since the \"x\" will be cancelled out \n", "find_matches_KR(\"xa\", \"abracadabra\", basis=2**16, r=2**16)\n", "\n", "# The safe check still works, of course\n", "find_matches_KR_safe(\"da\", \"abracadabra\", basis=2**16, r=2**16)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "97" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6553697" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "97" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "97" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6488161" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "97" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fingerprint(\"da\", 2**16, r=2**16)\n", "ord(\"d\")*(2**16)**1 + ord(\"a\")\n", "ord(\"a\")\n", "\n", "fingerprint(\"ca\", 2**16, r=2**16)\n", "ord(\"c\")*(2**16)**1 + ord(\"a\")\n", "(ord(\"c\")*(2**16)**1 + ord(\"a\") )%(2**16)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "97" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "420913348705" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "420913348705" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "True" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "97" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "425208316001" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "True" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base = 2**16\n", "r = 2**16\n", "fingerprint(\"bda\", base, r)\n", "ord(\"b\")*(base**2) + ord(\"d\")*(base**1) + ord(\"a\")\n", "(ord(\"b\")*base + ord(\"d\"))*base + ord(\"a\")\n", "((ord(\"b\")*base + ord(\"d\"))*base + ord(\"a\"))%r == ord(\"a\")%r\n", "\n", "\n", "fingerprint(\"cda\", base, r)\n", "(ord(\"c\")*base + ord(\"d\"))*base + ord(\"a\")\n", "((ord(\"c\")*base + ord(\"d\"))*base + ord(\"a\"))%r == ord(\"a\")%r" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When $r = base^k$ for some $k$ we have a similar behaviour, though this time it means we only take the last $k$ characters of the rolling hash into account." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 7]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "find_matches_KR(\"Humpty\", \"Humpty Dumpty\", r=2**(16*5))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "False" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base = 2**16\n", "r = 2**(16*5)\n", "\n", "fingerprint(\"Humpty\", r=2**(16*5)) == fingerprint(\"Dumpty\", r=2**(16*5))\n", "fingerprint(\"Humpty\", r=2**(16*5)) == fingerprint(\"Xumpty\", r=2**(16*5))\n", "fingerprint(\"Humpty\", r=2**(16*5)) == fingerprint(\"Hxmpty\", r=2**(16*5))" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2158299737877522940025,\n", " 2010726629729956855840,\n", " 2066067987872461357124,\n", " 2139856371159933386869,\n", " 2232065040410175930477,\n", " 590314951159640293488,\n", " 1254411530052683432052,\n", " 2158299737877522940025]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_fingerprint(\"Humpty Dumpty\", 6, r=2**(16*5))" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[0]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Why don't we return the last index?\n", "find_matches_KR(\"Humpty\", \"Humpty Dumpty\", r=2**(16*6))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 2 in this exercise" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "def fingerprint(string, basis=2**16, r=2**32-3):\n", " \"\"\" used to computes karp-rabin fingerprint of the pattern\n", " employs Horner method (modulo r) \"\"\"\n", " partial_sum = 0\n", " for x in string:\n", " partial_sum = (partial_sum*basis + ord(x)) % r\n", " return partial_sum\n", "\n", "def slide(prev_fp, prev_char, next_char, b_power, basis=2**16, r=2**32-3):\n", " new_fp=((prev_fp - ord(prev_char)*b_power)*basis + ord(next_char)) % r\n", " return new_fp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Section (a)\n", "\n", "Build a generator which, given a string text and a length parameter generates all KR fingerprints of desired length in text.\n", "Instructions:\n", "* Make use of both $fingerprint$ and $slide$\n", "* The generator can call $fingerprint$ only once\n" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "def kr_gen(text, length, basis=2**16, r=2**32-3):\n", " fp = fingerprint(text[:length])\n", " yield fp\n", " b_power = pow(basis, length - 1, r)\n", " for s in range(1, len(text) - length + 1):\n", " fp = slide(fp, text[s - 1], text[s - 1 + length], b_power)\n", " yield fp" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6422933" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "7471495" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gen = kr_gen(\"abracadabra\", 3)\n", "next(gen)\n", "next(gen)\n" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6357433" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6488452" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6357389" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6553988" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6357390" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "6422933" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "7471495" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" }, { "ename": "StopIteration", "evalue": "", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mStopIteration\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[0;32m 6\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 7\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 8\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 9\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 10\u001b[0m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mStopIteration\u001b[0m: " ] } ], "source": [ "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)\n", "next(gen)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Section (b)\n", "\n", "Build a generator which, given two strings $text1, text2$ and a length parameter $\\ell$ generates all pairs of indices $(i,j)$ such that $text1[i:i+\\ell] == text2[j:j+\\ell]$.\n", "\n", "Instructions:\n", "* Use $kr\\_gen$ from the previous section\n", "* The generator must work in space smaller than the size of the strings. In particular, at no point can you save the array of fingerprints of any entire string" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "def generate_shared_substrings(text1, text2, length):\n", " g1 = kr_gen(text1, length)\n", " i1 = 0\n", " for fp1 in g1:\n", " g2 = kr_gen(text2, length)\n", " i2 = 0\n", " for fp2 in g2:\n", " if fp1 == fp2:\n", " yield(i1, i2)\n", " i2 += 1\n", " i1 += 1" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2, 1)" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(3, 2)" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g = generate_shared_substrings(\"abcdef\",\"xcdefx\",3)\n", "next(g)\n", "next(g)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "scrolled": true }, "outputs": [ { "ename": "StopIteration", "evalue": "", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mStopIteration\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mg\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mStopIteration\u001b[0m: " ] } ], "source": [ "next(g)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Huffman Coding\n", "\n", "Generates a variable-length, prefix-free code, based on the character frequencies in corpus string.\n", "Notice that there are several degrees of freedom that could lead to different tree structures." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def ascii2bit_stream(text):\n", " \"\"\" Translates ASCII text to binary reprersentation using\n", " 7 bits per character. Assume only ASCII chars \"\"\"\n", " return \"\".join([bin(ord(c))[2:].zfill(7) for c in text])\n", "\n", "\n", "\n", "########################################################\n", "#### HUFFMAN CODE\n", "########################################################\n", "\n", "def char_count(text):\n", " \"\"\" Counts the number of each character in text.\n", " Returns a dictionary, with keys being the observed characters,\n", " values being the counts \"\"\"\n", " d = {}\n", " for ch in text:\n", " if ch in d:\n", " d[ch] += 1\n", " else:\n", " d[ch] = 1\n", " return d\n", "\n", "\n", "def build_huffman_tree(char_count_dict):\n", " \"\"\" Recieves dictionary with char:count entries\n", " Generates a LIST structure representing\n", " the binary Huffman encoding tree \"\"\"\n", " queue = [(c,cnt) for (c,cnt) in char_count_dict.items()]\n", "\n", " while len(queue) > 1:\n", " #print(queue)\n", " # combine two smallest elements\n", " A, cntA = extract_min(queue) # smallest in queue\n", " B, cntB = extract_min(queue) # next smallest\n", " chars = [A,B]\n", " weight = cntA + cntB # combined weight\n", " queue.append((chars, weight)) # insert combined node\n", "\n", " # only root node left\n", " #print(\"final queue:\", queue)\n", " root, weight_trash = extract_min(queue) # weight_trash unused\n", " return root #a LIST representing the tree structure\n", "\n", "\n", "def extract_min(queue): \n", " \"\"\" queue is a list of 2-tuples (x,y).\n", " remove and return the tuple with minimal y \"\"\" \n", " min_pair = min(queue, key = lambda pair: pair[1])\n", " queue.remove(min_pair)\n", " return min_pair\n", "\n", "\n", "\n", "def generate_code(huff_tree, prefix=\"\"):\n", " \"\"\" Receives a Huffman tree with embedded encoding,\n", " and a prefix of encodings.\n", " returns a dictionary where characters are\n", " keys and associated binary strings are values.\"\"\"\n", " if isinstance(huff_tree, str): # a leaf\n", " return {huff_tree: prefix}\n", " else:\n", " lchild, rchild = huff_tree[0], huff_tree[1]\n", " codebook = {}\n", "\n", " codebook.update(generate_code(lchild, prefix+'0'))\n", " codebook.update(generate_code(rchild, prefix+'1'))\n", " # oh, the beauty of recursion...\n", " return codebook\n", "\n", " \n", "def compress(text, encoding_dict):\n", " \"\"\" compress text using encoding dictionary \"\"\"\n", " assert isinstance(text, str)\n", " return \"\".join(encoding_dict[ch] for ch in text)\n", "\n", "\n", "def reverse_dict(d):\n", " \"\"\" build the \"reverse\" of encoding dictionary \"\"\"\n", " return {y:x for (x,y) in d.items()}\n", "\n", "\n", "def decompress(bits, decoding_dict):\n", " prefix = \"\"\n", " result = []\n", " for bit in bits:\n", " prefix += bit\n", " if prefix in decoding_dict:\n", " result.append(decoding_dict[prefix])\n", " prefix = \"\" #restart\n", " assert prefix == \"\" # must finish last codeword\n", " return \"\".join(result) # converts list of chars to a string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The Huffman Encoding Process" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "s = \"live and let live\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Frequency dictionary" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'l': 3, 'i': 2, 'v': 2, 'e': 3, ' ': 3, 'a': 1, 'n': 1, 'd': 1, 't': 1}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "count_dict = char_count(s)\n", "count_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build a Huffman tree:\n", "\n", "The function returns a list which represents a binary tree as follows:\n", "* A string in the list is a leaf\n", "* Otherwise, a list of size two represents two sub-trees: the left subtree and the right subtree\n", "\n", "The set of indices that lead to a leaf (char) correspond to the path leading to the leaf, and therefore to the encoding of the character in the final Huffman code" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[' ', ['i', 'v']], [[['a', 'n'], ['d', 't']], ['l', 'e']]]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[' ', ['i', 'v']]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[[['a', 'n'], ['d', 't']], ['l', 'e']]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "' '" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "'v'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "huff_tree = build_huffman_tree(count_dict)\n", "\n", "# A list with two elements\n", "huff_tree\n", "\n", "# The left subtree of the root\n", "huff_tree[0]\n", "\n", "# The right subtree of the root\n", "huff_tree[1]\n", "\n", "# Leaves correspond to characters in the final code\n", "huff_tree[0][0]\n", "huff_tree[0][1][1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate binary representation for each char in the corpus based on the tree" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{' ': '00',\n", " 'i': '010',\n", " 'v': '011',\n", " 'a': '1000',\n", " 'n': '1001',\n", " 'd': '1010',\n", " 't': '1011',\n", " 'l': '110',\n", " 'e': '111'}" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d = generate_code(huff_tree)\n", "d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compressing some text using our code" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "text1 = \"tell\"" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1011111110110'" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "13" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compress(text1, d)\n", "len(compress(text1, d))#4+3+3+3" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "28" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unicode_conversion = ascii2bit_stream(\"tell\") #when we convert every character to 7 bits\n", "len(unicode_conversion)#4*7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Decoding" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'00': ' ',\n", " '010': 'i',\n", " '011': 'v',\n", " '1000': 'a',\n", " '1001': 'n',\n", " '1010': 'd',\n", " '1011': 't',\n", " '110': 'l',\n", " '111': 'e'}" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "e = reverse_dict(d)\n", "e" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'tell'" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "decompress('1011111110110', e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling missing characters" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\x00\\x01\\x02\\x03\\x04\\x05\\x06\\x07\\x08\\t\\n\\x0b\\x0c\\r\\x0e\\x0f\\x10\\x11\\x12\\x13\\x14\\x15\\x16\\x17\\x18\\x19\\x1a\\x1b\\x1c\\x1d\\x1e\\x1f !\"#$%&\\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\\x7f'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "asci = str.join(\"\",[chr(c) for c in range(128)])\n", "asci" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing the compression ratio" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "def compression_ratio(text, corpus):\n", " d = generate_code(build_huffman_tree(char_count(corpus)))\n", " len_compress = len(compress(text, d))\n", " len_ascii = len(ascii2bit_stream(text)) #len(text)*7\n", " return len_compress/len_ascii" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.14285714285714285" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "0.14285714285714285" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compression_ratio(\"ab\", \"ab\")\n", "1/7" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.2857142857142857" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "0.2857142857142857" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compression_ratio(\"ab\", \"abcd\")\n", "2/7" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.1428571428571428" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "1.1428571428571428" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compression_ratio(\"hello\", \"a\"*100 + asci)\n", "8/7" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "1.0" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "1.1428571428571428" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compression_ratio(\"hello\", \"a\"*10 + asci)\n", "compression_ratio(\"hello\", \"a\"*40 + asci)\n", "compression_ratio(\"hello\", \"a\"*80 + asci)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happened here? Where is the precise cutoff between a 1 and 8/7 ratio?\n", "\n", "If we have 65 occurrences of \"a\", then we are left with 127 occurences of characters that are not \"a\".\n", "\n", "Eventually, all 127 non-\"a\" chars will be grouped in a tree of depth 7. Since the tree is built by pairing items of minimal weight it will not be a full binary tree but slightly skewed - one character (the last character in the dictionary) will have an encoding of length 6 and all others an encoding of length 7 (this can easily be shown by induction).\n", "\n", "At this point the \"a\" node will be attached to the tree - \"a\" will get an encoding of length 1 and all other characters will get an encoding of length 8 (or 7 in case of the last letter in the dictionary). Since \"a\" is not in \"hello\" the compression ratio will be 8/7.\n", "\n", "When \"a\" appears less than 65 times the compression ratio depends (among other things) on the order in which the dictionary is accessed. Specifically, in our implementation this will result in a 1.0 compression ratio for \"hello\" (i.e. - all chars in \"hello\" have an encoding of length 7) but not, for example, for \"12345\"." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "1.1428571428571428" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "1.1428571428571428" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "1.0" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c_cnt = char_count(\"a\"*64 + asci)\n", "d = build_huffman_tree(c_cnt)\n", "#d[0]\n", "#d[1]\n", "code = generate_code(d)\n", "#code\n", "\n", "# Since all chars in \"hello\" have an encoding of length 7 the ratio will be 1.0\n", "compression_ratio(\"hello\", \"a\"*63 + asci)\n", "\n", "# There are however characters with an encoding of length 8 resulting in a 8/7 ratio\n", "compression_ratio(\"12345\", \"a\"*63 + asci)\n", "\n", "# If \"a\" appears more than 64 times all characters (apart from \"a\" and chr(127)) will have an encdoing of length 8\n", "compression_ratio(\"hello\", \"a\"*64 + asci)\n", "\n", "# chr(127) is the only character with an encoding length of 7\n", "compression_ratio(chr(127)*5, \"a\"*64 + asci)\n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compression_ratio(\"hello\", \"a\"*4 + asci)\n", "\n", "d = generate_code(build_huffman_tree(char_count(\"a\"*4 + asci)))\n" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'11110'" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[('\\x00', '11111010'),\n", " ('\\x01', '11111011'),\n", " ('\\x02', '11111100'),\n", " ('\\x03', '11111101'),\n", " ('\\x04', '11111110'),\n", " ('\\x05', '11111111')]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d[\"a\"]\n", "\n", "[(key,d[key]) for key in d.keys() if len(d[key]) > 7]\n" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0'" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "'11101001'" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "126" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "1" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d = generate_code(build_huffman_tree(char_count(\"a\"*100 + asci)))\n", "d['a']\n", "d['h']\n", "len([(key,d[key]) for key in d.keys() if len(d[key]) > 7])\n", "len([(key,d[key]) for key in d.keys() if 1 < len(d[key]) <= 7])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 1 }