\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Recitation 9"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We discussed two data structures - Binary Search Trees and Hash Tables.\n",
"\n",
"#### Takeaways:\n",
"- Important properties of Binary Search Trees:\n",
" - Insert and find take $O(h)$ time where $h$ is the height of the tree.\n",
" - When a tree containing $n$ nodes is balanced, $h = O(\\log{n})$, in the worst case (totally unbalanced), $h=O(n)$\n",
" - Many methods in this class are implemented using recursion.\n",
"- Important properties of Hash Tables:\n",
" - Hash tables can be useful for many algorithms, including memoization. \n",
" - Insert and find operation run in $O(1)$ on average, but $O(n)$ in the worst case (where $n$ is the number of elements in the table)\n",
" - Make sure you understand the complexity analysis for hash tables (see the links below)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Code for printing several outputs in one cell (not part of the recitation):"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"from IPython.core.interactiveshell import InteractiveShell\n",
"InteractiveShell.ast_node_interactivity = \"all\""
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"## This file contains functions for the representation of binary trees.\n",
"## used in class Binary_search_tree's __repr__\n",
"## Written by a former student in the course - thanks to Amitai Cohen\n",
"## No need to fully understand this code\n",
"\n",
"def printree(t, bykey = True):\n",
" \"\"\"Print a textual representation of t\n",
" bykey=True: show keys instead of values\"\"\"\n",
" #for row in trepr(t, bykey):\n",
" # print(row)\n",
" return trepr(t, bykey)\n",
"\n",
"def trepr(t, bykey = False):\n",
" \"\"\"Return a list of textual representations of the levels in t\n",
" bykey=True: show keys instead of values\"\"\"\n",
" if t==None:\n",
" return [\"#\"]\n",
"\n",
" thistr = str(t.key) if bykey else str(t.val)\n",
"\n",
" return conc(trepr(t.left,bykey), thistr, trepr(t.right,bykey))\n",
"\n",
"def conc(left,root,right):\n",
" \"\"\"Return a concatenation of textual represantations of\n",
" a root node, its left node, and its right node\n",
" root is a string, and left and right are lists of strings\"\"\"\n",
" \n",
" lwid = len(left[-1])\n",
" rwid = len(right[-1])\n",
" rootwid = len(root)\n",
" \n",
" result = [(lwid+1)*\" \" + root + (rwid+1)*\" \"]\n",
" \n",
" ls = leftspace(left[0])\n",
" rs = rightspace(right[0])\n",
" result.append(ls*\" \" + (lwid-ls)*\"_\" + \"/\" + rootwid*\" \" + \"|\" + rs*\"_\" + (rwid-rs)*\" \")\n",
" \n",
" for i in range(max(len(left),len(right))):\n",
" row = \"\"\n",
" if i node.key:\n",
" if node.right == None:\n",
" node.right = Tree_node(key, val)\n",
" else:\n",
" insert_rec(node.right, key, val)\n",
" return\n",
" \n",
" if self.root == None: #empty tree\n",
" self.root = Tree_node(key, val)\n",
" else:\n",
" insert_rec(self.root, key, val)\n",
"\n",
"\n",
" def minimum(self):\n",
" ''' return node with minimal key '''\n",
" if self.root == None:\n",
" return None\n",
" node = self.root\n",
" left = node.left\n",
" while left != None:\n",
" node = left\n",
" left = node.left\n",
" return node\n",
"\n",
"\n",
" def depth(self):\n",
" ''' return depth of tree, uses recursion'''\n",
" def depth_rec(node):\n",
" if node == None:\n",
" return -1\n",
" else:\n",
" return 1 + max(depth_rec(node.left), depth_rec(node.right))\n",
"\n",
" return depth_rec(self.root)\n",
"\n",
"\n",
" def size(self):\n",
" ''' return number of nodes in tree, uses recursion '''\n",
" def size_rec(node):\n",
" if node == None:\n",
" return 0\n",
" else:\n",
" return 1 + size_rec(node.left) + size_rec(node.right)\n",
"\n",
" return size_rec(self.root)\n",
" \n",
" def inorder(self):\n",
" '''prints the keys of the tree in a sorted order'''\n",
" def inorder_rec(node):\n",
" if node == None:\n",
" return\n",
" inorder_rec(node.left)\n",
" print(node.key)\n",
" inorder_rec(node.right)\n",
" \n",
" inorder_rec(self.root)\n",
"\n",
"\n",
"t = Binary_search_tree()\n",
"t.insert(4,'a')\n",
"t.insert(2,'a')\n",
"t.insert(6,'a')\n",
"t.insert(1,'a')\n",
"t.insert(3,'a')\n",
"t.insert(5,'a')\n",
"t.insert(7,'a')\n",
"t\n",
"t.inorder()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Claim: An inorder traversal of a binary search tree prints the keys in ascending order.\n",
"\n",
"Proof, by complete induction on the size of tree:\n",
"* Base - for $n = 1$ the claim is trivial\n",
"* Assume the claim holds for all $i \\leq n$\n",
"* For a tree of size $n+1$ with root $v$ consider the following:\n",
" * Both the right and the left subtree of $v$ have size at most $n$\n",
" * By induction, all keys in $v.left$ are printed in ascending order (and they are all smaller than $v.key$)\n",
" * Next, $v.key$ is printed\n",
" * Finally, by induction, all keys in $v.right$ are printed in ascending order (and they are all larger than $v.key$)\n",
" * Thus, the keys of the tree (which has size $n+1$) are printed in ascending order\n",
"\n",
"\n",
"More information on complete induction: https://en.wikipedia.org/wiki/Mathematical_induction#Complete_(strong)_induction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Question: Assume $N = 2^n - 1$ for some $n$, find a method of inserting the numbers $[1,...,N]$ to a BST such that it is completely balanced (i.e. - it is a complete tree).\n",
"\n",
"Solution: As we usually do with trees, we give a recursive solution. Our input will be a node and a first and last index to enter into the tree rooted in the node. We start with the root, $first = 1, last = 2^n - 1$\n",
"\n",
"- Base case: If $first = last$ then we simply need to create a root with no sons labeled $first$ \n",
"- Otherwise, we find the mid point $mid = (first + last - 1 ) // 2$. We recursively insert $first,...,mid$ into the left son of the node, $mid + 1$ into the current node and $mid + 2, ..., last$ into the right son of the node.\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[' 8 ',\n",
" ' ______________/ |________________ ',\n",
" ' 4 12 ',\n",
" ' ______/ |______ _______/ |_______ ',\n",
" ' 2 6 10 14 ',\n",
" ' __/ |__ __/ |__ __/ |__ __/ |__ ',\n",
" ' 1 3 5 7 9 11 13 15 ',\n",
" ' / | / | / | / | / | / | / | / | ',\n",
" '# # # # # # # # # # # # # # # #']"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"class Tree_node():\n",
" def __init__(self, key, val):\n",
" self.key = key\n",
" self.val = val\n",
" self.left = None\n",
" self.right = None\n",
"\n",
" def __repr__(self):\n",
" return \"(\" + str(self.key) + \":\" + str(self.val) + \")\"\n",
" \n",
" \n",
" \n",
"class Binary_search_tree():\n",
"\n",
" def __init__(self):\n",
" self.root = None\n",
"\n",
"\n",
" def __repr__(self): #no need to understand the implementation of this one\n",
" out = \"\"\n",
" for row in printree(self.root): #need printree.py file\n",
" out = out + row + \"\\n\"\n",
" return out\n",
"\n",
"\n",
" def lookup(self, key):\n",
" ''' return node with key, uses recursion '''\n",
"\n",
" def lookup_rec(node, key):\n",
" if node == None:\n",
" return None\n",
" elif key == node.key:\n",
" return node\n",
" elif key < node.key:\n",
" return lookup_rec(node.left, key)\n",
" else:\n",
" return lookup_rec(node.right, key)\n",
"\n",
" return lookup_rec(self.root, key)\n",
"\n",
"\n",
"\n",
" def insert(self, key, val):\n",
" ''' insert node with key,val into tree, uses recursion '''\n",
"\n",
" def insert_rec(node, key, val):\n",
" if key == node.key:\n",
" node.val = val # update the val for this key\n",
" elif key < node.key:\n",
" if node.left == None:\n",
" node.left = Tree_node(key, val)\n",
" else:\n",
" insert_rec(node.left, key, val)\n",
" else: #key > node.key:\n",
" if node.right == None:\n",
" node.right = Tree_node(key, val)\n",
" else:\n",
" insert_rec(node.right, key, val)\n",
" return\n",
" \n",
" if self.root == None: #empty tree\n",
" self.root = Tree_node(key, val)\n",
" else:\n",
" insert_rec(self.root, key, val)\n",
"\n",
"\n",
" def minimum(self):\n",
" ''' return node with minimal key '''\n",
" if self.root == None:\n",
" return None\n",
" node = self.root\n",
" left = node.left\n",
" while left != None:\n",
" node = left\n",
" left = node.left\n",
" return node\n",
"\n",
"\n",
" def depth(self):\n",
" ''' return depth of tree, uses recursion'''\n",
" def depth_rec(node):\n",
" if node == None:\n",
" return -1\n",
" else:\n",
" return 1 + max(depth_rec(node.left), depth_rec(node.right))\n",
"\n",
" return depth_rec(self.root)\n",
"\n",
"\n",
" def size(self):\n",
" ''' return number of nodes in tree, uses recursion '''\n",
" def size_rec(node):\n",
" if node == None:\n",
" return 0\n",
" else:\n",
" return 1 + size_rec(node.left) + size_rec(node.right)\n",
"\n",
" return size_rec(self.root)\n",
" \n",
" def inorder(self):\n",
" '''prints the keys of the tree in a sorted order'''\n",
" def inorder_rec(node):\n",
" if node == None:\n",
" return\n",
" inorder_rec(node.left)\n",
" print(node.key)\n",
" inorder_rec(node.right)\n",
" \n",
" inorder_rec(self.root)\n",
" \n",
" def insert_balanced(self, n):\n",
" '''inserts the numbers 1, ... , 2**n - 1 into a full BST'''\n",
" def insert_balanced_rec(node, first, last):\n",
" if first == last:\n",
" node.key = first\n",
" node.val = first\n",
" else:\n",
" mid = (first + last - 1) // 2\n",
" node.key = mid + 1\n",
" node.val = mid + 1\n",
" node.left = Tree_node(0,0)\n",
" insert_balanced_rec(node.left, first, mid)\n",
" node.right = Tree_node(0,0)\n",
" insert_balanced_rec(node.right, mid + 2, last)\n",
" \n",
" self.root = Tree_node(0,0)\n",
" insert_balanced_rec(self.root, 1, 2**n - 1)\n",
" \n",
" \n",
"\n",
"\n",
"t = Binary_search_tree()\n",
"t.insert(4,'a')\n",
"t.insert(2,'a')\n",
"t.insert(6,'a')\n",
"t.insert(1,'a')\n",
"t.insert(3,'a')\n",
"t.insert(5,'a')\n",
"t.insert(7,'a')\n",
"\n",
"#t.inorder()\n",
"\n",
"t = Binary_search_tree()\n",
"t.insert_balanced(4)\n",
"printree(t.root)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hash Tables\n",
"\n",
"We wish to have a data structure that implements the operations: insert, search and delete in **expected** $O(1)$ time. \n",
"\n",
"Summarizing the insert and search complexity of the data structures that we have seen already:\n",
"\n",
"| implementation | insert | search | delete |\n",
"|-------------------------------|--------------------------|----------|---------------------|\n",
"| Python list | O(1) always at the end | O(n) | O(n) |\n",
"| Python ordered list | O(n) | O(log n) | O(n) |\n",
"| Linked list | O(1) always at the start | O(n) | O(1) given the node before the one to delete |\n",
"| Sorted linked list | O(n) | O(n) | O(1) given the node before the one to delete |\n",
"| Unbalanced Binary Search Tree | O(n) | O(n) | O(n) |\n",
"| Balanced Binary Search Tree | O(log n) | O(log n) | O(log n) |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please read the following summary on the various data structures mentioned in class.\n",
"\n",
"A detailed summary on the complexity of insert/search operations using hash tables can be found here. Make sure you read it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: \n",
"Given a string $st$ of length $n$ and a small integer $\\ell$, write a function that checks whether there is a substring in $st$ of length $\\ell$ that appears more than once.\n",
"\n",
"#### Solution \\#1: Naive\n",
"The complexity is $O(\\ell(n-\\ell)^2)$. \n",
"There $O((n-\\ell)^2)$ iterations (make sure you undersand why) and in each iteration we perform operations in $O(\\ell)$ time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def repeat_naive(st, l): \n",
" for i in range(len(st)-l+1):\n",
" for j in range(i+1,len(st)-l+1):\n",
" if st[i:i+l]==st[j:j+l]:\n",
" return True\n",
" return False\n",
"\n",
"repeat_naive(\"hello\", 1)\n",
"repeat_naive(\"hello\"*10, 45)\n",
"repeat_naive(\"hello\"*10, 46)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's test our algorithm with by generating a random string of a given size."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"def gen_str(size, alphabet = \"abcdefghijklmnopqrstuvwxyz\"):\n",
" ''' Generate a random string of length size over alphabet '''\n",
" s=\"\"\n",
" for i in range(size):\n",
" s += random.choice(alphabet)\n",
" return s\n",
"rndstr = gen_str(1000)\n",
"print(rndstr)\n",
"repeat_naive(rndstr, 3)\n",
"repeat_naive(rndstr, 10)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For bigger $n$ and $\\ell$, this could be quite slow:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rndstr = gen_str(10000)\n",
"repeat_naive(rndstr, 3)\n",
"repeat_naive(rndstr, 10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The class Hashtable from the lectures"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class Hashtable:\n",
" def __init__(self, m, hash_func=hash):\n",
" \"\"\" initial hash table, m empty entries \"\"\"\n",
" ##bogus initialization #1:\n",
" #self.table = [[]*m]\n",
" ##bogus initialization #2:\n",
" #empty=[]\n",
" #self.table = [empty for i in range(m)]\n",
" \n",
" self.table = [ [] for i in range(m)]\n",
" self.hash_mod = lambda x: hash_func(x) % m # using python hash function\n",
"\n",
" def __repr__(self):\n",
" L = [self.table[i] for i in range(len(self.table))]\n",
" return \"\".join([str(i) + \" \" + str(L[i]) + \"\\n\" for i in range(len(self.table))])\n",
" \n",
" def find(self, item):\n",
" \"\"\" returns True if item in hashtable, False otherwise \"\"\"\n",
" i = self.hash_mod(item)\n",
" return item in self.table[i]\n",
" #if item in self.table[i]:\n",
" # return True\n",
" #else:\n",
" # return False\n",
"\n",
" def insert(self, item):\n",
" \"\"\" insert an item into table \"\"\"\n",
" i = self.hash_mod(item)\n",
" if item not in self.table[i]:\n",
" self.table[i].append(item)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Solution \\#2: using the class Hashtable"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def repeat_hash1(st, l):\n",
" m=len(st)-l+1\n",
" htable = Hashtable(m)\n",
" for i in range(len(st)-l+1):\n",
" if htable.find(st[i:i+l])==False:\n",
" htable.insert(st[i:i+l])\n",
" else:\n",
" return True\n",
" return False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The expected (average) complexity is: $O(\\ell(n-\\ell))$\n",
"\n",
"Creating the table takes $O(n-\\ell)$ time, and there are $O(n-\\ell)$ iterations, each taking expected $O(\\ell)$ time.\n",
"\n",
"\n",
"\n",
"The worst case complexity is: $O(\\ell(n-\\ell)^2)$\n",
"\n",
"Creating the table takes $O(n-\\ell)$ time, and the time for executing the loop is\n",
"$\\ell\\cdot\\sum_{i=0}^{n-\\ell}{i}= O(\\ell(n-\\ell)^2)$\n",
"\n",
"Which native Python data structure is most suitable for this problem?\n",
"\n",
"#### Solution #3: using Python's set implementation\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def repeat_hash2(st, l):\n",
" htable = set() #Python sets use hash functions for fast lookup\n",
" for i in range(len(st)-l+1):\n",
" if st[i:i+l] not in htable:\n",
" htable.add(st[i:i+l])\n",
" else: return True\n",
" return False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Competition between the 3 solutions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"str_len=1000\n",
"st=gen_str(str_len)\n",
"l=10\n",
"for f in [repeat_naive,repeat_hash1,repeat_hash2]:\n",
" t0=time.perf_counter()\n",
" res=f(st, l)\n",
" t1=time.perf_counter()\n",
" print(f.__name__, t1-t0, \"found?\",res)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"str_len=2000\n",
"st=gen_str(str_len)\n",
"l=10\n",
"for f in [repeat_naive,repeat_hash1,repeat_hash2]:\n",
" t0=time.perf_counter()\n",
" res=f(st, l)\n",
" t1=time.perf_counter()\n",
" print(f.__name__, t1-t0, \"found?\",res)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a random string of size $n=1000$ and for $l=10$ the running time of repeat_hash2 is the smallest, while the one for repeat_naive is the largest.\n",
"\n",
"When increasing $n$ to 2000, the running time of repeat_naive increases by ~4, while the running time of repeat_hash1, repeat_hash2 increases by ~2.\n",
"\n",
"#### Time spent on creating the table\n",
"When $st$ is \"a\"$*1000$, repeat_hash1 is the slowest, since it spends time on creating an empty table of size 991."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"st=\"a\"*1000\n",
"l=10\n",
"for f in [repeat_naive,repeat_hash1,repeat_hash2]:\n",
" t0=time.perf_counter()\n",
" res=f(st, l)\n",
" t1=time.perf_counter()\n",
" print(f.__name__, t1-t0, \"found?\",res)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The effect of table size\n",
"\n",
"The second solution, with control over the table size"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def repeat_hash1_var_size(st, l, m=0):\n",
" if m==0: #default hash table size is ~number of substrings to be inserted\n",
" m=len(st)-l+1\n",
" htable = Hashtable(m)\n",
" for i in range(len(st)-l+1):\n",
" if htable.find(st[i:i+l])==False:\n",
" htable.insert(st[i:i+l])\n",
" else:\n",
" return True\n",
" return False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Comparing tables sizes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"str_len=1000\n",
"st=gen_str(str_len)\n",
"l=10\n",
"print(\"str_len=\",str_len, \"repeating substring len=\",l)\n",
"for m in [1, 10, 100, 1000, 1500, 10000, 100000]:\n",
" t0=time.perf_counter()\n",
" res=repeat_hash1_var_size(st, l, m)\n",
" t1=time.perf_counter()\n",
" print(t1-t0, \"found?\",res, \"table size=\",m)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary\n",
"Make sure you read the following summary that includes a detailed explanation on the experiments."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}