{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

cs1001.py , Tel Aviv University, Fall 2017-2018

\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recitation 10\n", "\n", "We discussed hash tables and iterators + generators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Takeaways:\n", "- Hash tables can be useful for many algorithms, including memoization. \n", "- Make sure you understand the complexity analysis for hash tables (see the links below).\n", "- Generators function is a function that contains the yield command and returns a genertor object. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code for printing several outputs in one cell (not part of the recitation):" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from IPython.core.interactiveshell import InteractiveShell\n", "InteractiveShell.ast_node_interactivity = \"all\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hash" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We wish to have a data structure that implements the operations: insert, search and delete in **expected** $O(1)$ time. \n", "\n", "Summarizing the insert and search complexity of the data structures that we have seen already:\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A detailed summary on the complexity of insert/search operations using hash tables can be found here . Make sure you read it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: \n", "given a string $st$ of length $n$ and a small integer $\\ell$, write a function that checks whether there is a substring in $st$ of length $\\ell$ that appears more than once.\n", "\n", "\n", "Make sure you read the following summary that includes a detailed explanation on the experiments." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Naive solution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The complexity is $O(\\ell(n-\\ell)^2)$. \n", "There $O((n-\\ell)^2)$ iterations (make sure you undersand why) and in each iteration we perform operations in $O(\\ell)$ time." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "True" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "False" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def repeat_naive(st, l): \n", " for i in range(len(st)-l+1):\n", " for j in range(i+1,len(st)-l+1):\n", " if st[i:i+l]==st[j:j+l]:\n", " return True\n", " return False\n", "\n", "repeat_naive(\"hello\", 1)\n", "repeat_naive(\"hello\"*10, 45)\n", "repeat_naive(\"hello\"*10, 46)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A function that generates a random string of a given size" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "False" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "False" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import random\n", "def gen_str(size, alphabet = \"abcdefghijklmnopqrstuvwxyz\"):\n", " ''' Generate a random string of length size over alphabet '''\n", " s=\"\"\n", " for i in range(size):\n", " s += random.choice(alphabet)\n", " return s\n", "rndstr = gen_str(1000)\n", "repeat_naive(rndstr, 3)\n", "repeat_naive(rndstr, 10)\n", "\n", "rndstr1 = gen_str(10000)\n", "repeat_naive(rndstr1, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The class Hashtable from the lectures" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Hashtable:\n", " def __init__(self, m, hash_func=hash):\n", " \"\"\" initial hash table, m empty entries \"\"\"\n", " ##bogus initialization #1:\n", " #self.table = [[]*m]\n", " ##bogus initialization #2:\n", " #empty=[]\n", " #self.table = [empty for i in range(m)]\n", " \n", " self.table = [ [] for i in range(m)]\n", " self.hash_mod = lambda x: hash_func(x) % m # using python hash function\n", "\n", " def __repr__(self):\n", " L = [self.table[i] for i in range(len(self.table))]\n", " return \"\".join([str(i) + \" \" + str(L[i]) + \"\\n\" for i in range(len(self.table))])\n", " \n", " def find(self, item):\n", " \"\"\" returns True if item in hashtable, False otherwise \"\"\"\n", " i = self.hash_mod(item)\n", " return item in self.table[i]\n", " #if item in self.table[i]:\n", " # return True\n", " #else:\n", " # return False\n", "\n", " def insert(self, item):\n", " \"\"\" insert an item into table \"\"\"\n", " i = self.hash_mod(item)\n", " if item not in self.table[i]:\n", " self.table[i].append(item)\n", " \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Solution using the class Hashtable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The expected (average) complexity is: $O(\\ell(n-\\ell))$\n", "\n", "Creating the table takes $O(n-\\ell)$ time, and there are $O(n-\\ell)$ iterations, each taking expected $O(\\ell)$ time.\n", "\n", "\n", "\n", "The worst case complexity is: $O(\\ell(n-\\ell)^2)$\n", "\n", "Creating the table takes $O(n-\\ell)$ time, and the time for executing the loop is\n", "$\\ell\\cdot\\sum_{i=0}^{n-\\ell}{i}= O(\\ell(n-\\ell)^2)$ \n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def repeat_hash1(st, l):\n", " m=len(st)-l+1\n", " htable = Hashtable(m)\n", " for i in range(len(st)-l+1):\n", " if htable.find(st[i:i+l])==False:\n", " htable.insert(st[i:i+l])\n", " else:\n", " return True\n", " return False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which of Python's naitive DS fits the solution?\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Solution using Python's set implementation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def repeat_hash2(st, l):\n", " htable = set() #Python sets use hash functions for fast lookup\n", " for i in range(len(st)-l+1):\n", " if st[i:i+l] not in htable:\n", " htable.add(st[i:i+l])\n", " else: return True\n", " return False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Competition between the 3 solutions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a random string of size $n=1000$ and for $l=10$ the running time of repeat_hash2 is the smallest, while the one for repeat_naive is the largest.\n", "\n", "When increasing $n$ to 2000, the running time of repeat_naive increases by ~4, while the running time of repeat_hash1, repeat_hash2 increases by ~2." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "str_len= 1000 repeating substring len= 10\n", "repeat_naive 0.17578042042896413 found? False\n", "repeat_hash1 0.0020885685707980883 found? False\n", "repeat_hash2 0.0005123723312863149 found? False\n" ] } ], "source": [ "import time\n", "str_len=1000\n", "st=gen_str(str_len)\n", "l=10\n", "print(\"str_len=\",str_len, \"repeating substring len=\",l)\n", "for f in [repeat_naive,repeat_hash1,repeat_hash2]:\n", " t0=time.clock()\n", " res=f(st, l)\n", " t1=time.clock()\n", " print(f.__name__, t1-t0, \"found?\",res)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "str_len= 2000 repeating substring len= 10\n", "repeat_naive 0.7582052599900635 found? False\n", "repeat_hash1 0.0047206939198076725 found? False\n", "repeat_hash2 0.0010855345899472013 found? False\n" ] } ], "source": [ "str_len=2000\n", "st=gen_str(str_len)\n", "l=10\n", "print(\"str_len=\",str_len, \"repeating substring len=\",l)\n", "for f in [repeat_naive,repeat_hash1,repeat_hash2]:\n", " t0=time.clock()\n", " res=f(st, l)\n", " t1=time.clock()\n", " print(f.__name__, t1-t0, \"found?\",res)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When $st$ is \"a\"$*1000$, repeat_hash1 is the slowest, since it spends time on creating an empty table of size 991.\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "str_len= 2000 repeating substring len= 10\n", "repeat_naive 1.1842188541777432e-05 found? True\n", "repeat_hash1 0.0003122392372461036 found? True\n", "repeat_hash2 8.684277418069541e-06 found? True\n" ] } ], "source": [ "st=\"a\"*1000\n", "l=10\n", "print(\"str_len=\",str_len, \"repeating substring len=\",l)\n", "for f in [repeat_naive,repeat_hash1,repeat_hash2]:\n", " t0=time.clock()\n", " res=f(st, l)\n", " t1=time.clock()\n", " print(f.__name__, t1-t0, \"found?\",res)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The effect of table size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our solution, with control over the table size" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "def repeat_hash1_var_size(st, l, m=0):\n", " if m==0: #default hash table size is ~number of substrings to be inserted\n", " m=len(st)-l+1\n", " htable = Hashtable(m)\n", " for i in range(len(st)-l+1):\n", " if htable.find(st[i:i+l])==False:\n", " htable.insert(st[i:i+l])\n", " else:\n", " return True\n", " return False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparing different table sizes" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "str_len= 1000 repeating substring len= 10\n", "0.0392379310214892 found? False table size= 1\n", "0.00573320165858604 found? False table size= 10\n", "0.0035834484006045386 found? False table size= 100\n", "0.0035980537650175393 found? False table size= 1000\n", "0.003550684981746599 found? False table size= 1500\n", "0.0036312119191279635 found? False table size= 10000\n", "0.045343372359639034 found? False table size= 100000\n" ] } ], "source": [ "import time\n", "str_len=1000\n", "st=gen_str(str_len)\n", "l=10\n", "print(\"str_len=\",str_len, \"repeating substring len=\",l)\n", "for m in [1, 10, 100, 1000, 1500, 10000, 100000]:\n", " t0=time.clock()\n", " res=repeat_hash1_var_size(st, l, m)\n", " t1=time.clock()\n", " print(t1-t0, \"found?\",res, \"table size=\",m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Iterators and Generators" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "list_iterator" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "l = [1,2,3]\n", "li = iter(l)\n", "type(li)\n", "li2 = iter(l)\n", "\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "2" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "z is 3\n" ] }, { "data": { "text/plain": [ "1" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "before loop\n", "2\n", "3\n" ] } ], "source": [ "next(li)\n", "next(li)\n", "z = next(li)\n", "print(\"z is\", z)\n", "next(li2)\n", "#next(li)\n", "print(\"before loop\")\n", "for elem in li2:\n", " print(elem)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "ename": "StopIteration", "evalue": "", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mStopIteration\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mli\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mStopIteration\u001b[0m: " ] } ], "source": [ "next(li)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A function which produces a list of all positive even numbers up to $n$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def Evens_list(n):\n", " ''' a list of evens up to n '''\n", " return [num for num in range(n) if num%2==0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A generator function (includes a \"yield\" statement) that returns a generator that generates all positive even numbers up to $n$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def Evens_gen(n):\n", " ''' returns a generator of evens up to n '''\n", " for i in range(n):\n", " print(\"current i:\", i)\n", " if i%2 == 0:\n", " print(\"before yield\")\n", " yield i\n", " print(\"after yield\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "g = Evens_gen(10)\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "current i: 0\n", "before yield\n" ] }, { "data": { "text/plain": [ "0" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "next(g)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "after yield\n", "current i: 1\n", "current i: 2\n", "before yield\n" ] }, { "data": { "text/plain": [ "2" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "next(g)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A generator function which produces the **infinite** sequence of all positive even numbers" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def All_evens_gen():\n", " i=0\n", " while True:\n", " yield i\n", " i+=2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Solving this exam question about generators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Section (b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def SomePairs():\n", " i=0\n", " while True:\n", " for j in range(i):\n", " yield(i,j)\n", " i=i+1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Section (c)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def RevGen(PairsGen):\n", " pairs = PairsGen()\n", " while True:\n", " pair = next(pairs)\n", " yield(pair[1],pair[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Section (d1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def UnionGenerators(gen1, gen2):\n", " while True:\n", " yield next(gen1)\n", " yield next(gen2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Section (d2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def EqPairs():\n", " i=0\n", " while True:\n", " yield (i,i)\n", " i=i+1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def AllPairs():\n", " return UnionGenerators(SomePairs(),\n", " UnionGenerators(EqPairs(),\n", " RevGen(SomePairs)))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 1 }