{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# CS1001.py\n",
      "\n",
      "## Extended Introduction to Computer Science with Python, Tel-Aviv University, Spring 2013\n",
      "\n",
      "# Recitation 8 - 2-6.5.2013\n",
      "\n",
      "## Last update: 6.5.2013"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Homework 3\n",
      "\n",
      "## Question 4b"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def f2(lst):\n",
      "    i = 1\n",
      "    while i < len(lst):\n",
      "        j = 1\n",
      "        while j < i:\n",
      "            print(lst[i], lst[j])\n",
      "            j += 5\n",
      "        i *= 2"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The inner loop does $O(i/5)$ iterations, because *j* is bounded by *i* and does steps of *5*. \n",
      "\n",
      "The value of *i* after *k* iterations of the outer loop is $2^k$, and this outer loop goes on untill $2^k=n$, or $k = \\log_{2}{n})$.\n",
      "\n",
      "So the total complexity is:\n",
      "\n",
      "$$\n",
      "O(\\sum_{k=1}^{\\log_{2}{n}}{\\frac{2^k}{5}}) = O(\\frac{1}{5}\\sum_{k=1}^{\\log_{2}{n}}{2^k}) = O(\\frac{2^{\\log_{2}{n}}-1}{2-1}) = O(2^{\\log_{2}{n}}) = O(n)\n",
      "$$"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Question 4c"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def f3(lst):\n",
      "    i = len(lst)\n",
      "    while i > 0:\n",
      "        for j in range(i):\n",
      "            for k in range(j, 10**5):\n",
      "                print(i)\n",
      "        i -= 2"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The the most inner loop does an $O(1)$ operation and at most $10^5$ iterations, so the inner loop's complexity is $O(1)$. This is because:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "list(range(5,2))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "[]"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The middle loop does at most *n* iterations at the first time, then *n-2* etc.:\n",
      "$$\n",
      "n + n-2 + n-4 + ... + 1 = O(n^2)\n",
      "$$\n",
      "\n",
      "And therefore the total complexity is $O(n^2)$."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Hashing\n",
      "\n",
      "We want a dynamic dictionary with insert, remove, search. \n",
      "\n",
      "The keys are from a very large set *U* but the number of distinct elements expected is *n*, and $n<<|U|$.\n",
      "\n",
      "We will use a table of size *m*, where $m\\approx n$.\n",
      "\n",
      "For example, *U* is the set of 8-digit ID numbers, so $|U| = 10^8$. \n",
      "\n",
      "*n* is the size of the Israeli population, so $m\\approx n \\approx 10^7$.\n",
      "\n",
      "## Hash function\n",
      "\n",
      "We need a function $h: U \\to {0,1,...,m-1}$. \n",
      "\n",
      "1. Collisons are unavoidable (*pigeon hole principle*), so we want *h* to distribute the values uniformly on ${0,1,...,m-1}$, and we can then avoid collisons using lists, coockoo hashing, etc. \n",
      "1. We want *h* to be deterministic (*not* random). That means that for a given input the output is always the same - *h(x)=h(x)*.\n",
      "1. We need *h* to be easily computable for all inputs.\n",
      "\n",
      "## Hash table"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "class MyHashtable:\n",
      "    def __init__(self, m, hash_func=hash):\n",
      "        \"\"\" initial hash table, m empty entries \"\"\"\n",
      "        self.data = [ [] for i in range(m)]\n",
      "        self.hash_mod = lambda x: hash_func(x) % m\n",
      "    \n",
      "    def find(self, item):\n",
      "        \"\"\" returns True if item in hashtable, False otherwise  \"\"\"\n",
      "        i = self.hash_mod(item)      \n",
      "        try:\n",
      "            self.data[i].index(item)\n",
      "            return True\n",
      "        except ValueError:\n",
      "            return False\n",
      "\n",
      "    def insert(self, item):\n",
      "        \"\"\" insert an item into table \"\"\"\n",
      "        i = self.hash_mod(item)\n",
      "        try:\n",
      "            self.data[i].index(item)\n",
      "        except ValueError:\n",
      "            self.data[i].append(item)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ht = MyHashtable(100)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ht.insert(\"yoav\")\n",
      "ht.insert(\"amir\")\n",
      "ht.insert(\"amiram\")\n",
      "ht.insert(\"haim\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ht.find(\"yoav\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 4,
       "text": [
        "True"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ht.find(\"inon\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 5,
       "text": [
        "False"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Exercise \u2013 repeating substring\n",
      "\n",
      "Input: We are given a string *st* of length *n*, and a small integer *k* (*k=O(1)* with respect to *n*).\n",
      "\n",
      "Output: Is there a substring of *st* of length *k* which repeats itself (more than once)?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Naive solution"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def repeat_naive(st, k):\n",
      "    for i in range(len(st) - k + 1):\n",
      "        for j in range(i + 1, len(st) - k + 1):\n",
      "            if st[i : i + k] == st[j : j + k]:\n",
      "                return True\n",
      "    return False"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "song = '''So, so you think you can tell \n",
      "Heaven from Hell, \n",
      "Blue sky's from pain. \n",
      "Can you tell a green field \n",
      "From a cold steel rail? \n",
      "A smile from a veil? \n",
      "Do you think you can tell? \n",
      "\n",
      "And did they get you to trade \n",
      "Your heroes for ghosts? \n",
      "Hot ashes for trees? \n",
      "Hot air for a cool breeze? \n",
      "Cold comfort for change? \n",
      "And did you exchange \n",
      "A walk on part in the war \n",
      "For a lead role in a cage? \n",
      "\n",
      "How I wish, how I wish you were here. \n",
      "We're just two lost souls \n",
      "Swimming in a fish bowl, \n",
      "Year after year, \n",
      "Running over the same old ground. \n",
      "And how we found\n",
      "The same old fears. \n",
      "Wish you were here.'''"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(repeat_naive(song, 5))\n",
      "%timeit -n 10 repeat_naive(song, 5)\n",
      "print(repeat_naive(song, 30))\n",
      "%timeit -n 10 repeat_naive(song, 30)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "True\n",
        "10 loops, best of 3: 2.07 ms per loop"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "False"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "10 loops, best of 3: 137 ms per loop"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This is fast but it's a short string.\n",
      "\n",
      "Let's try it with a larger string - [\"The Adventures of Tom Sawyer by Mark Twain\"](http://www.gutenberg.org/cache/epub/74/pg74.txt) from the [Gutenberg Project](http://www.gutenberg.org/). \n",
      "\n",
      "We use the builtin [urllib.request.urlopen](http://docs.python.org/3.0/library/urllib.request.html) to get the book from the web - this function returns a file-like object. The text is read as bytes and we have to tell python to decode it, usually using `utf-8` code which is a type of unicode. \n",
      "This decoding will allow us to read text in non-latin languages (like Hebrew, Arabic and Russian)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import urllib.request\n",
      "with urllib.request.urlopen(\"http://www.gutenberg.org/cache/epub/74/pg74.txt\") as r:\n",
      "    book = r.read().decode('utf-8')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 16
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(book[:book.index('\\n\\r')])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\ufeff\r\n",
        "The Project Gutenberg EBook of The Adventures of Tom Sawyer, Complete by\r\n",
        "Mark Twain (Samuel Clemens)\r\n"
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Looking for a short repeat is quick:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(repeat_naive(book, 5))\n",
      "%timeit -n 3 repeat_naive(book, 5)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "True\n",
        "3 loops, best of 3: 709 ms per loop"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "But searching for a longer repeat takes a very long time:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(repeat_naive(book, 30))\n",
      "%timeit -n 1 repeat_naive(book, 30)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "True\n",
        "1 loops, best of 3: 10.9 s per loop"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 110
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Worst case complexity\n",
      "\n",
      "The worst case happens when there is no such repeat, which is likely when *k* is large.\n",
      "\n",
      "The number of iterations is $O((n-k)^2$. Each iteration we compare a substring of length *k*, which is $O(k)$.\n",
      "So the total complexity is $O(k(n-k)^2)$. Because we consider $k = O(1)\\;$ then we get $O(n^2)$.\n",
      "\n",
      "In *With you were here* we had *n=589*, so the number of character comparisons was 346,921.\n",
      "\n",
      "But with *The Adventures of Tom Sawyer*, we had *n=421869* and roghly 177,973,453,161 character comparisons."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Hash solution"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def repeat_hash1(st, k, m=0):\n",
      "    if m == 0: # default hash table size is ~number of substrings to be inserted\n",
      "        m = len(st) - k\n",
      "    htable = MyHashtable(m)\n",
      "    for i in range(len(st) - k + 1):\n",
      "        if htable.find(st[i : i + k]):\n",
      "            return True\n",
      "        else: \n",
      "            htable.insert(st[i : i + k])\n",
      "    return False"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%timeit -n 10 repeat_hash1(song, 5)\n",
      "%timeit -n 10 repeat_hash1(song, 30)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "10 loops, best of 3: 226 us per loop\n",
        "10 loops, best of 3: 7.38 ms per loop"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%timeit -n 3 repeat_hash1(book, 5)\n",
      "%timeit -n 1 repeat_hash1(book, 30)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "3 loops, best of 3: 102 ms per loop\n",
        "1 loops, best of 3: 96.3 ms per loop"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(repeat_hash1(book, 1000))\n",
      "%timeit -n 1 repeat_hash1(book, 1000)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "False\n",
        "1 loops, best of 3: 15.5 s per loop"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Average case complexity\n",
      "\n",
      "Creating hashtable - $O(m)=O(n-k)$.\n",
      "Number of iterations - $O(n-k)$.\n",
      "Each iteration we search a list of length $O(1)$ on average, so overall $O(k)$.\n",
      "Total complexity on average $O(k(n-k))=O(n)$.\n",
      "\n",
      "The worst case is not as good - all elements get the same hash and enter the same list, so the complexity is $O(k(n-k)^2)$. \n",
      "But the probability for this is *very* low."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Python builtin solution\n",
      "\n",
      "Which of python's data structures is appropriate?\n",
      "\n",
      "- `list` - mutable and ordered\n",
      "- `string`, `tuple` - immutable and ordered\n",
      "- `dict` - mutable and unordered with unique immutable keys\n",
      "- `set` - mutable and unordered with unique immutable values\n",
      "\n",
      "`set` is just the right structure - we need a structure we can insert items into and check if items are already in it."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def repeat_hash2(st, k, m=0):\n",
      "    htable = set() # Python sets use hash functions for fast lookup\n",
      "    for i in range(len(st) - k + 1):\n",
      "        if st[i : i + k] not in htable:\n",
      "            htable.add(st[i : i + k])\n",
      "        else: \n",
      "            return True\n",
      "    return False"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%timeit -n 10 repeat_hash2(song, 5)\n",
      "%timeit -n 10 repeat_hash2(song, 30)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "10 loops, best of 3: 17.7 us per loop\n",
        "10 loops, best of 3: 718 us per loop\n"
       ]
      }
     ],
     "prompt_number": 122
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%timeit -n 10 repeat_hash2(book, 5)\n",
      "%timeit -n 10 repeat_hash2(book, 30)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "10 loops, best of 3: 49.1 us per loop\n",
        "10 loops, best of 3: 458 us per loop\n"
       ]
      }
     ],
     "prompt_number": 123
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%timeit -n 1 repeat_hash2(book, 1000)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1 loops, best of 3: 3.44 s per loop\n"
       ]
      }
     ],
     "prompt_number": 125
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### DNA sequences\n",
      "An interesting use for this kind of a `repeat` function is to find a recurring DNA motif in a DNA string.\n",
      "In this case the alphabet is much smaller - 4 letters: A, G, C, T - so the chance for repitiotion is higher and the worst case happens less frequently.\n",
      "But, [DNA strings are very long](https://en.wikipedia.org/wiki/Genome#Genome_size).\n",
      "\n",
      "Lets look at the [`lacZ`](https://en.wikipedia.org/wiki/Lacz) gene sequence (from <http://berry.engin.umich.edu/gene2oligo/lacZ.txt>):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import urllib"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with urllib.request.urlopen(\"http://berry.engin.umich.edu/gene2oligo/lacZ.txt\") as r:\n",
      "    lacZ = r.read().decode('utf-8')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(lacZ)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Name: LacZ\n",
        "Sequence:\n",
        "ACCATGATTACGGATTCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACCCTGGCGTTACCCAAC\n",
        "TTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGTAATAGCGAAGAGGCCCGCACCGATCGCCC\n",
        "TTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGCGCTTTGCCTGGTTTCCGGCACCAGAAGCGGTGCCG\n",
        "GAAAGCTGGCTGGAGTGCGATCTTCCTGAGGCCGATACTGTCGTCGTCCCCTCAAACTGGCAGATGCACG\n",
        "GTTACGATGCGCCCATCTACACCAACGTAACCTATCCCATTACGGTCAATCCGCCGTTTGTTCCCACGGA\n",
        "GAATCCGACGGGTTGTTACTCGCTCACATTTAATGTTGATGAAAGCTGGCTACAGGAAGGCCAGACGCGA\n",
        "ATTATTTTTGATGGCGTTAACTCGGCGTTTCATCTGTGGTGCAACGGGCGCTGGGTCGGTTACGGCCAGG\n",
        "ACAGTCGTTTGCCGTCTGAATTTGACCTGAGCGCATTTTTACGCGCCGGAGAAAACCGCCTCGCGGTGAT\n",
        "GGTGCTGCGTTGGAGTGACGGCAGTTATCTGGAAGATCAGGATATGTGGCGGATGAGCGGCATTTTCCGT\n",
        "GACGTCTCGTTGCTGCATAAACCGACTACACAAATCAGCGATTTCCATGTTGCCACTCGCTTTAATGATG\n",
        "ATTTCAGCCGCGCTGTACTGGAGGCTGAAGTTCAGATGTGCGGCGAGTTGCGTGACTACCTACGGGTAAC\n",
        "AGTTTCTTTATGGCAGGGTGAAACGCAGGTCGCCAGCGGCACCGCGCCTTTCGGCGGTGAAATTATCGAT\n",
        "GAGCGTGGTGGTTATGCCGATCGCGTCACACTACGTCTGAACGTCGAAAACCCGAAACTGTGGAGCGCCG\n",
        "AAATCCCGAATCTCTATCGTGCGGTGGTTGAACTGCACACCGCCGACGGCACGCTGATTGAAGCAGAAGC\n",
        "CTGCGATGTCGGTTTCCGCGAGGTGCGGATTGAAAATGGTCTGCTGCTGCTGAACGGCAAGCCGTTGCTG\n",
        "ATTCGAGGCGTTAACCGTCACGAGCATCATCCTCTGCATGGTCAGGTCATGGATGAGCAGACGATGGTGC\n",
        "AGGATATCCTGCTGATGAAGCAGAACAACTTTAACGCCGTGCGCTGTTCGCATTATCCGAACCATCCGCT\n",
        "GTGGTACACGCTGTGCGACCGCTACGGCCTGTATGTGGTGGATGAAGCCAATATTGAAACCCACGGCATG\n",
        "GTGCCAATGAATCGTCTGACCGATGATCCGCGCTGGCTACCGGCGATGAGCGAACGCGTAACGCGAATGG\n",
        "TGCAGCGCGATCGTAATCACCCGAGTGTGATCATCTGGTCGCTGGGGAATGAATCAGGCCACGGCGCTAA\n",
        "TCACGACGCGCTGTATCGCTGGATCAAATCTGTCGATCCTTCCCGCCCGGTGCAGTATGAAGGCGGCGGA\n",
        "GCCGACACCACGGCCACCGATATTATTTGCCCGATGTACGCGCGCGTGGATGAAGACCAGCCCTTCCCGG\n",
        "CTGTGCCGAAATGGTCCATCAAAAAATGGCTTTCGCTACCTGGAGAGACGCGCCCGCTGATCCTTTGCGA\n",
        "ATACGCCCACGCGATGGGTAACAGTCTTGGCGGTTTCGCTAAATACTGGCAGGCGTTTCGTCAGTATCCC\n",
        "CGTTTACAGGGCGGCTTCGTCTGGGACTGGGTGGATCAGTCGCTGATTAAATATGATGAAAACGGCAACC\n",
        "CGTGGTCGGCTTACGGCGGTGATTTTGGCGATACGCCGAACGATCGCCAGTTCTGTATGAACGGTCTGGT\n",
        "CTTTGCCGACCGCACGCCGCATCCAGCGCTGACGGAAGCAAAACACCAGCAGCAGTTTTTCCAGTTCCGT\n",
        "TTATCCGGGCAAACCATCGAAGTGACCAGCGAATACCTGTTCCGTCATAGCGATAACGAGCTCCTGCACT\n",
        "GGATGGTGGCGCTGGATGGTAAGCCGCTGGCAAGCGGTGAAGTGCCTCTGGATGTCGCTCCACAAGGTAA\n",
        "ACAGTTGATTGAACTGCCTGAACTACCGCAGCCGGAGAGCGCCGGGCAACTCTGGCTCACAGTACGCGTA\n",
        "GTGCAACCGAACGCGACCGCATGGTCAGAAGCCGGGCACATCAGCGCCTGGCAGCAGTGGCGTCTGGCGG\n",
        "AAAACCTCAGTGTGACGCTCCCCGCCGCGTCCCACGCCATCCCGCATCTGACCACCAGCGAAATGGATTT\n",
        "TTGCATCGAGCTGGGTAATAAGCGTTGGCAATTTAACCGCCAGTCAGGCTTTCTTTCACAGATGTGGATT\n",
        "GGCGATAAAAAACAACTGCTGACGCCGCTGCGCGATCAGTTCACCCGTGCACCGCTGGATAACGACATTG\n",
        "GCGTAAGTGAAGCGACCCGCATTGACCCTAACGCCTGGGTCGAACGCTGGAAGGCGGCGGGCCATTACCA\n",
        "GGCCGAAGCAGCGTTGTTGCAGTGCACGGCAGATACACTTGCTGATGCGGTGCTGATTACGACCGCTCAC\n",
        "GCGTGGCAGCATCAGGGGAAAACCTTATTTATCAGCCGGAAAACCTACCGGATTGATGGTAGTGGTCAAA\n",
        "TGGCGATTACCGTTGATGTTGAAGTGGCGAGCGATACACCGCATCCGGCGCGGATTGGCCTGAACTGCCA\n",
        "GCTGGCGCAGGTAGCAGAGCGGGTAAACTGGCTCGGATTAGGGCCGCAAGAAAACTATCCCGACCGCCTT\n",
        "ACTGCCGCCTGTTTTGACCGCTGGGATCTGCCATTGTCAGACATGTATACCCCGTACGTCTTCCCGAGCG\n",
        "AAAACGGTCTGCGCTGCGGGACGCGCGAATTGAATTATGGCCCACACCAGTGGCGCGGCGACTTCCAGTT\n",
        "CAACATCAGCCGCTACAGTCAACAGCAACTGATGGAAACCAGCCATCGCCATCTGCTGCACGCGGAAGAA\n",
        "GGCACATGGCTGAATATCGACGGTTTCCATATGGGGATTGGTGGCGACGACTCCTGGAGCCCGTCAGTAT\n",
        "CGGCGGAATTCCAGCTGAGCGCCGGTCGCTACCATTACCAGTTGGTCTGGTGTCAAAAATAATAATAAcg\n",
        "gctgccgt\n",
        "\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We need to remove the first two lines, remove newlines and change lowercase letters to uppercase: "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "lacZ = str.join('',lacZ.split('\\n')[2:]).upper()\n",
      "print(len(lacZ))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "3088\n"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we can use the repeat functions to check for repeating subsequences:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(repeat_hash2(lacZ, 10))\n",
      "print(repeat_hash2(lacZ, 12))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "True\n",
        "False\n"
       ]
      }
     ],
     "prompt_number": 149
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%timeit -n 10 repeat_hash1(lacZ,12)\n",
      "%timeit -n 10 repeat_hash2(lacZ,12)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "10 loops, best of 3: 34.1 ms per loop\n",
        "10 loops, best of 3: 3.75 ms per loop"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Fin\n",
      "This notebook is part of the [Extended introduction to computer science](http://tau-cs1001-py.wikidot.com/) course at Tel-Aviv University.\n",
      "\n",
      "The notebook was written using Python 3.2 and IPython 0.13.1.\n",
      "\n",
      "The code is available at <https://raw.github.com/yoavram/CS1001.py/master/recitation8.ipynb>.\n",
      "\n",
      "The notebook can be viewed online at <http://nbviewer.ipython.org/urls/raw.github.com/yoavram/CS1001.py/master/recitation8.ipynb>.\n",
      "\n",
      "This work is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/)."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}