{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

cs1001.py, Tel Aviv University, Spring 2019

\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Recitation 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We discussed Hash Tables and Nim\n", "\n", "#### Takeaways:\n", "- Important properties of Hash Tables:\n", " - Hash tables can be useful for many algorithms, including memoization. \n", " - Insert and find operation run in $O(1)$ on average, but $O(n)$ in the worst case (where $n$ is the number of elements in the table)\n", " - Make sure you understand the complexity analysis for hash tables (see the links below)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code for printing several outputs in one cell (not part of the recitation):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from IPython.core.interactiveshell import InteractiveShell\n", "InteractiveShell.ast_node_interactivity = \"all\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hash Tables\n", "\n", "We wish to have a data structure that implements the operations: insert, search and delete in **expected** $O(1)$ time. \n", "\n", "Summarizing the insert and search complexity of the data structures that we have seen already:\n", "\n", "| implementation | insert | search | delete |\n", "|-------------------------------|--------------------------|----------|---------------------|\n", "| Python list | O(1) always at the end | O(n) | O(n) |\n", "| Python ordered list | O(n) | O(log n) | O(n) |\n", "| Linked list | O(1) always at the start | O(n) | O(1) given the node before the one to delete |\n", "| Sorted linked list | O(n) | O(n) | O(1) given the node before the one to delete |\n", "| Unbalanced Binary Search Tree | O(n) | O(n) | O(n) |\n", "| Balanced Binary Search Tree | O(log n) | O(log n) | O(log n) |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please read the following summary on the various data structures mentioned in class.\n", "\n", "A detailed summary on the complexity of insert/search operations using hash tables can be found here. Make sure you read it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: \n", "Given a string $st$ of length $n$ and a small integer $\\ell$, write a function that checks whether there is a substring in $st$ of length $\\ell$ that appears more than once.\n", "\n", "#### Solution \\#1: Naive\n", "The complexity is $O(\\ell(n-\\ell)^2)$. \n", "There $O((n-\\ell)^2)$ iterations (make sure you undersand why) and in each iteration we perform operations in $O(\\ell)$ time." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "True" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "False" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def repeat_naive(st, l): \n", " for i in range(len(st)-l+1):\n", " for j in range(i+1,len(st)-l+1):\n", " if st[i:i+l]==st[j:j+l]:\n", " return True\n", " return False\n", "\n", "repeat_naive(\"hello\", 1)\n", "repeat_naive(\"hello\"*10, 45)\n", "repeat_naive(\"hello\"*10, 46)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's test our algorithm with by generating a random string of a given size." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dlotznwapuossuxedpcsvzoufxhqvfonloryhpgsyrzyolkfmfxpwnmeizpdykeckfmgaeaadsgnyzqsxcfhpwouweerlqdsguvygjueogzluetlacuhaimdosiqewcwdmkisskuaqpyoebiinexfxffikzodqwkdofiqcbxhxcvflpylluotyjqndguigmgainmytznmirbeynyebitpcmaaaghaovddmmlcymdgqgjsaxkxrxvqpsvwuoxzddvwbixwdxnxxlmqfwmbozipvfpudijxttjgvkyziljouovunoylhuzlbmcgiiiqnftfxuzuktqsewcucoilfoigtixrlynmlnoxignmulzcajelhzwupblwwtbzgjbpyrsidazfpknzgvqqaajgnbwcbzmeefzkwaytkespouzqrureuwxppxdpyahgqziqsrherqiilcpwrdmtoaeefpabbvvzewfcvnrjifahqbjatyswtylywrdqvuemvhmdrbtvokopegwaljkpeafnysdbcswmavrpjkixrdxoenaebztyiiplhdxizpbyecjfcyirksttntxiihbtyzmsepctluedamzaqcnyvhazbycuygqxstxnhsvjbbgoupkqoocpuepnmnzkecqklxcbmmzlndjbafiqgihqfjzgnkmpzuhishmhytzpyqrtnppnwfnxisinqngbumlrtozmmnuzkbbjacforiuqbgewvtrctdpydsylivhorxjzkuvafvggysstzgjgprfhieebrlhvhurnfeybshvpiuvtpvtyykpkthraeeydqxvlladhcnsofrlaljylkxfqzrdeptbrgumynhdspzfabvgmzscnfmydswhxnmehzyziumeutuxxnshvovyfwovscdrfcqelivceiwujagevnaviolrlypxsplxtmdlkakvhntmwhvxzteabgzfyctntapasbatvqogxripapfdzlgphrno\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "False" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import random\n", "def gen_str(size, alphabet = \"abcdefghijklmnopqrstuvwxyz\"):\n", " ''' Generate a random string of length size over alphabet '''\n", " s=\"\"\n", " for i in range(size):\n", " s += random.choice(alphabet)\n", " return s\n", "rndstr = gen_str(1000)\n", "print(rndstr)\n", "repeat_naive(rndstr, 3)\n", "repeat_naive(rndstr, 10)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For bigger $n$ and $\\ell$, this could be quite slow:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "False" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rndstr = gen_str(10000)\n", "repeat_naive(rndstr, 3)\n", "repeat_naive(rndstr, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The class Hashtable from the lectures" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "class Hashtable:\n", " def __init__(self, m, hash_func=hash):\n", " \"\"\" initial hash table, m empty entries \"\"\"\n", " ##bogus initialization #1:\n", " #self.table = [[]*m]\n", " ##bogus initialization #2:\n", " #empty=[]\n", " #self.table = [empty for i in range(m)]\n", " \n", " self.table = [ [] for i in range(m)]\n", " self.hash_mod = lambda x: hash_func(x) % m # using python hash function\n", "\n", " def __repr__(self):\n", " L = [self.table[i] for i in range(len(self.table))]\n", " return \"\".join([str(i) + \" \" + str(L[i]) + \"\\n\" for i in range(len(self.table))])\n", " \n", " def find(self, item):\n", " \"\"\" returns True if item in hashtable, False otherwise \"\"\"\n", " i = self.hash_mod(item)\n", " return item in self.table[i]\n", " #if item in self.table[i]:\n", " # return True\n", " #else:\n", " # return False\n", "\n", " def insert(self, item):\n", " \"\"\" insert an item into table \"\"\"\n", " i = self.hash_mod(item)\n", " if item not in self.table[i]:\n", " self.table[i].append(item)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 
Solution \\#2: using the class Hashtable" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def repeat_hash1(st, l):\n", "    m=len(st)-l+1\n", "    htable = Hashtable(m)\n", "    for i in range(len(st)-l+1):\n", "        if htable.find(st[i:i+l])==False:\n", "            htable.insert(st[i:i+l])\n", "        else:\n", "            return True\n", "    return False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The expected (average) complexity is: $O(\\ell(n-\\ell))$\n", "\n", "Creating the table takes $O(n-\\ell)$ time, and there are $O(n-\\ell)$ iterations, each taking expected $O(\\ell)$ time.\n", "\n", "The worst case complexity is: $O(\\ell(n-\\ell)^2)$\n", "\n", "Creating the table takes $O(n-\\ell)$ time, and the time for executing the loop is\n", "$\\ell\\cdot\\sum_{i=0}^{n-\\ell}{i}= O(\\ell(n-\\ell)^2)$\n", "\n", "Which native Python data structure is most suitable for this problem?\n", "\n", "#### Solution \\#3: using Python's set implementation\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def repeat_hash2(st, l):\n", "    htable = set() # Python sets use hash functions for fast lookup\n", "    for i in range(len(st)-l+1):\n", "        if st[i:i+l] not in htable:\n", "            htable.add(st[i:i+l])\n", "        else: return True\n", "    return False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Competition between the 3 solutions" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "repeat_naive 0.42494059999989986 found? False\n", "repeat_hash1 0.0042048000000249885 found? False\n", "repeat_hash2 0.001372700000047189 found? False\n" ] } ], "source": [ "import time\n", "str_len=1000\n", "st=gen_str(str_len)\n", "l=10\n", "for f in [repeat_naive,repeat_hash1,repeat_hash2]:\n", "    t0=time.perf_counter()\n", "    res=f(st, l)\n", "    t1=time.perf_counter()\n", "    print(f.__name__, t1-t0, \"found?\",res)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "repeat_naive 1.633473399999957 found? False\n", "repeat_hash1 0.009137300000020332 found? False\n", "repeat_hash2 0.0028928000001542387 found? False\n" ] } ], "source": [ "str_len=2000\n", "st=gen_str(str_len)\n", "l=10\n", "for f in [repeat_naive,repeat_hash1,repeat_hash2]:\n", "    t0=time.perf_counter()\n", "    res=f(st, l)\n", "    t1=time.perf_counter()\n", "    print(f.__name__, t1-t0, \"found?\",res)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a random string of size $n=1000$ and for $l=10$, the running time of repeat_hash2 is the smallest, while that of repeat_naive is the largest.\n", "\n", "When increasing $n$ to 2000, the running time of repeat_naive increases by a factor of ~4, while the running times of repeat_hash1 and repeat_hash2 increase by a factor of ~2.\n", "\n", "#### Time spent on creating the table\n", "When $st$ is \"a\"*1000, repeat_hash1 is the slowest, since it spends time on creating an empty table of size $n-\\ell+1 = 991$." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "repeat_naive 6.799999937356915e-06 found? True\n", "repeat_hash1 0.00018980000004376052 found? True\n", "repeat_hash2 5.799999826194835e-06 found? 
True\n" ] } ], "source": [ "st=\"a\"*1000\n", "l=10\n", "for f in [repeat_naive,repeat_hash1,repeat_hash2]:\n", " t0=time.perf_counter()\n", " res=f(st, l)\n", " t1=time.perf_counter()\n", " print(f.__name__, t1-t0, \"found?\",res)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The effect of table size\n", "\n", "The second solution, with control over the table size" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def repeat_hash1_var_size(st, l, m=0):\n", " if m==0: #default hash table size is ~number of substrings to be inserted\n", " m=len(st)-l+1\n", " htable = Hashtable(m)\n", " for i in range(len(st)-l+1):\n", " if htable.find(st[i:i+l])==False:\n", " htable.insert(st[i:i+l])\n", " else:\n", " return True\n", " return False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Comparing tables sizes" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "str_len= 1000 repeating substring len= 10\n", "0.044579099999964455 found? False table size= 1\n", "0.00932720000014342 found? False table size= 10\n", "0.004570000000057917 found? False table size= 100\n", "0.0057033999999021034 found? False table size= 1000\n", "0.004408700000112731 found? False table size= 1500\n", "0.006572400000095513 found? False table size= 10000\n", "0.04230780000011691 found? False table size= 100000\n" ] } ], "source": [ "import time\n", "str_len=1000\n", "st=gen_str(str_len)\n", "l=10\n", "print(\"str_len=\",str_len, \"repeating substring len=\",l)\n", "for m in [1, 10, 100, 1000, 1500, 10000, 100000]:\n", " t0=time.perf_counter()\n", " res=repeat_hash1_var_size(st, l, m)\n", " t1=time.perf_counter()\n", " print(t1-t0, \"found?\",res, \"table size=\",m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary\n", "Make sure you read the following summary that includes a detailed explanation on the experiments." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Nim\n", "\n", "In HW4, we were asked to give a recursive solution to the game of Nim. As a reminder, in the game of Nim we are given $n$ stacks, each containing some non-zero amount of coins. There are two players in the game, at each turn the player whose turn it is picks a stack which has some $k > 0$ coins and removes $1 \\leq i \\leq k$ coins from the stack. The game ends when all stacks are empty and the winner is the last player to remove coins from the last stack.\n", "\n", "Our goal is to implement the function $can\\_win\\_nim$ which gets a list $stack$ of positive integers and returns $True$ iff the first player has a winning strategy in the game.\n", "\n", "Let's try and find a recursive solution for our problem:\n", "* As a base case, if all stacks are empty, then the player currently playing loses and therefore we simply return $False$\n", "* Otherwise, there are some \"active\" stacks. I.e., stacks with a positive amount of coins in them. \n", "\n", "We now recall the reasoning we used when solving Munch: in order to see if we can win the game, we can ask \"is there a legal move we can make such that our opponent will lose?\" If there exists such a move, we will play it. 
If no such move exists, this is the same as saying that for any legal move we can make, our opponent has a winning strategy, and therefore we will lose.\n", "\n", "This gives rise to a simple recursive solution:\n", "* Go over all stacks.\n", "* For each \"active\" stack with $i > 0$ coins:\n", "    * For each $1 \\leq j \\leq i$, recursively check if the opponent loses if we remove $j$ coins from the stack.\n", "    * If the opponent loses, this is a move that leads to a winning strategy and therefore we return $True$.\n", "* If all possible moves in all non-zero stacks lead to the opponent winning, we lose and return $False$." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "False" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "True" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def can_win_nim(stacks):\n", "\n", "    #if len([s for s in stacks if s > 0]) == 0:\n", "    if not any(stacks):\n", "        return False\n", "\n", "    for i in range(len(stacks)):\n", "        for j in range(1, stacks[i]+1):\n", "            stacks[i] -= j  # try removing j coins from stack i\n", "            can_other_player_win = can_win_nim(stacks)\n", "            stacks[i] += j  # undo the move\n", "            if not can_other_player_win:\n", "                return True\n", "\n", "    return False\n", "\n", "can_win_nim([1])\n", "can_win_nim([2, 2])\n", "can_win_nim([2, 3, 1, 4])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Memoization\n", "\n", "In order to memoize this code, we need to decide on the key for memoization. A natural candidate is the stacks themselves. We can convert our list into a tuple (so it will be immutable, and therefore hashable) and use this tuple as a key.\n", "\n", "The implementation is straightforward:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def can_win_nim_mem(stacks):\n", "    return can_win_nim_mem_inner(stacks, {})\n", "\n", "def can_win_nim_mem_inner(stacks, mem): \n", "    mem_key = tuple(stacks)\n", "    if mem_key in mem:\n", "        return mem[mem_key]\n", "\n", "    if not any(stacks):\n", "        return False\n", "\n", "    for i in range(len(stacks)):\n", "        for j in range(1, stacks[i]+1):\n", "            stacks[i] -= j\n", "            can_other_player_win = can_win_nim_mem_inner(stacks, mem)\n", "\n", "            stacks[i] += j\n", "            if not can_other_player_win:\n", "                mem[mem_key] = True\n", "                return True\n", "\n", "    mem[mem_key] = False\n", "    return False\n", "\n", "can_win_nim_mem([3 for i in range(10)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's mention a slight improvement: we first make an almost trivial observation, namely that given an input $stacks$, the result of the function does not depend on the order of the entries in it.\n", "\n", "For example, if $stacks = [2, 3, 1, 4]$ and the function returns $True$ on $stacks$, we know that it will also return $True$ on $[1, 2, 3, 4], [4, 3, 2, 1]$ and so forth. Since our goal is to do as little work as possible, we would like to make sure that once we've computed the function on $stacks$, we will not compute it again on any permutation of the list. How can we do this?\n", "\n", "One possible solution is that once we know the value of the function on $stacks$, we can enter every permutation of it into our dictionary.
This would work, but it would be prohibitive, since merely generating all permutations of $stacks$ can take a long time (if $stacks$ has 10 unique entries, there are over 3 million possible permutations of it!).\n", "\n", "Can you think of a better way of avoiding extra work? Say we have computed the result for $stacks$; we can enter it into the dictionary using a tuple of the sorted list as the key. Sorting the list takes $O(n \\log n)$ time, which is not too bad. Are we done? There's one more important thing to remember: when we want to check whether a given list is already in our dictionary, we now need to sort it first and look up the sorted key.\n", "\n", "The easiest way not to get confused by all of this is to simply compute the key at the beginning of the function execution and use it throughout the function.\n", "\n", "Here is the implementation:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def can_win_nim_mem_sorted(stacks):\n", "    return can_win_nim_mem_inner_sorted(stacks, {})\n", "\n", "def can_win_nim_mem_inner_sorted(stacks, mem):\n", "    # This is the only change!\n", "    # mem_key = tuple(stacks)\n", "    mem_key = tuple(sorted(stacks))\n", "    if mem_key in mem:\n", "        return mem[mem_key]\n", "\n", "    if not any(stacks):\n", "        return False\n", "\n", "    #if len([s for s in stacks if s > 0]) == 1:\n", "    #    mem[mem_key] = True\n", "    #    return True\n", "\n", "    for i in range(len(stacks)):\n", "        for j in range(1, stacks[i]+1):\n", "            stacks[i] -= j\n", "            can_other_player_win = can_win_nim_mem_inner_sorted(stacks, mem)\n", "\n", "            stacks[i] += j\n", "            if not can_other_player_win:\n", "                mem[mem_key] = True\n", "                return True\n", "\n", "    mem[mem_key] = False\n", "    return False\n", "\n", "can_win_nim_mem_sorted([3 for i in range(30)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How much did we gain? Our naive memoization started to struggle already for a list of length 10, whereas our improved implementation solved the problem for a list of length 30 almost instantaneously!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 1 }