{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Memoization with Fibonacci\n", "\n", "We discuss *what* vs *how* programming and resolve performance flaws with *memoization*.\n", "\n", "We use the classic Fibonacci sequence as an initial example\n", "\n", " 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...\n", " \n", "We finish with an example using memoization to minimize repeated web requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fib\n", "\n", "The Fibonacci sequence is often defined recursively as follows\n", "\n", " / 0 if i is 0\n", " fib(i) = | 1 if i is 1\n", " \\ fib(i - 1) + fib(i - 2) otherwise\n", "\n", "Because Python supports recursion we can translate this mathematical definition cleanly into code" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def fib_what(i):\n", " if i == 0:\n", " return 0\n", " elif i == 1:\n", " return 1\n", " else:\n", " return fib_what(i -1) + fib_what(i - 2)\n", "\n", "map(fib_what, range(11))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 1, "text": [ "[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common *algorithm* for *how* to compute the $i^{th}$ Fibonacci number efficiently is to keep a running pair of numbers, summing them together to obtain the next" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def fib_how(i):\n", " a, b = 0, 1\n", " for _ in range(i):\n", " a, b = b, a + b\n", " return a\n", "\n", "map(fib_how, range(11))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "These two implementations differ in two ways. \n", "\n", "**Style: What v. How**: \n", "\n", "* The first function is similar to a mathematical definition, defining *what* is to be computed without clearly specifying about *how*. The implementation of this algorithm depends on how the Python runtime handles recursive calls.\n", "\n", "* The second function provides a clear recipe for *how* to compute the answer; it is not immediately clear from the function definition *what* is being computed however.\n", "\n", "*What* solutions tend to be more conceptually clear than *how* solutions. \n", "\n", "**Performance**\n", "\n", "Unfortunately the *what* solution is painfully slow. This is common with *what* code." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit fib_what(35)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 loops, best of 3: 4.1 s per loop\n" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit fib_how(35)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "100000 loops, best of 3: 2.14 \u00b5s per loop\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the units. The what solution is two million times slower on this input. It only gets worse.\n", "\n", "The *what* solution requires exponentially more time than the *how* solution. This is because it repeatedly calls `fib` on the same values over and over again. \n", "\n", "For example:\n", "\n", "* `fib(5)` calls `fib(4)` and `fib(3)`\n", "* `fib(4)` calls `fib(3)` and `fib(2)`\n", "* `fib(3)` calls `fib(2)` and `fib(1)`\n", "* The second `fib(3)` also calls `fib(2)` and `fib(1)`\n", "* All three `fib(2)`s call `fib(1)` and `fib(0)`\n", "* We now have five redundant calls to `fib(1)`. \n", "\n", "For low inputs like `5` this is merely comical. For large values it rapidly becomes tragic." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Caching\n", "\n", "The *what* solution can be saved by caching intermediate values" ] }, { "cell_type": "code", "collapsed": false, "input": [ "cache = dict()\n", "\n", "def fib(i):\n", " if i in cache: # Have we seen this input before?\n", " return cache[i] # if so then return the cached result\n", " \n", " if i == 0:\n", " return 0\n", " if i == 1:\n", " return 1\n", " else:\n", " result = fib(i -1) + fib(i - 2)\n", " cache[i] = result # Remember that fib(i) == result\n", " return result\n", "\n", "%timeit cache.clear(); fib(35)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "100000 loops, best of 3: 16.8 \u00b5s per loop\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "By caching results to expensive functions we decrease computation time but increase memory usage. This can often be a helpful technique." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Memoization\n", "\n", "Historically functional languages favor *what* over *how* implementations. \n", "\n", "Consider our friend, `map`. The `map` function defines *what* should be done while hiding exactly *how* it is done. We are free to implement `map` in a variety of interesting ways. Functional abstractions take computational control away from programmers and deliver it to language and infrastructure designers.\n", "\n", "Because functional languages favor *what* solutions they often run into issues like the repeated computation we saw in the Fibonacci example. Rather than explicitly cache every function we make we turn caching into a higher order function. \n", "\n", "This function is called `memoize`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from toolz import memoize\n", "\n", "def fib(i):\n", " if i == 0:\n", " return 0\n", " elif i == 1:\n", " return 1\n", " else:\n", " return fib(i -1) + fib(i - 2)\n", "\n", "cache = dict()\n", "fib = memoize(fib, cache)\n", "\n", "%timeit cache.clear(); fib(35)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "10000 loops, best of 3: 33.4 \u00b5s per loop\n" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above example we explicitly create a cache so that we can clear it before each timing. Usually we just call memoize with a single argument like so\n", "\n", " fib = memoize(fib)\n", " \n", "and the cache dictionary is generated automatically.\n", "\n", "Alternatively we may prefer to use the Python decorator syntax\n", "\n", "```\n", "@memoize\n", "def fib(i):\n", " if i == 0:\n", " return 0\n", " if i == 1:\n", " return 1\n", " else:\n", " return fib(i -1) + fib(i - 2)\n", "```\n", "\n", "Both of these are equivalent.\n", "\n", "Memoization provides a purely *how* optimization so that we can continue to shamelessly write mathematically clear *what* code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Avoid Explicit State with Memoization\n", "\n", "The Fibonacci example is a canonical use case of memoization. Sadly we are rarely called upon to compute Fibonacci numbers in real life. \n", "\n", "Here we play with a slightly more realistic example. We want to sort a set of webpages by the number of distinct words in their body.\n", "\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# We'll use the Wikipedia articles for the fifty United States\n", "\n", "with open('data/states.txt') as f:\n", " urls = ['http://en.wikipedia.org/wiki/' + state for state in f.read().strip().split()]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 51 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Our task is to sort these URLs by the number of distinct words present in the text of the website. We'll do this by using the `sorted` function's keyword `cmp` and `urlopen` from `urllib`. \n", "\n", "We'll go through an implementation that explicitly caches results for performance. You'll simplify this implementation by adding `memoize` appropriately." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Naive Solution\n", "\n", "# We can get the raw HTML from a URL with `urlopen`\n", "# Depending on your internet connection this might take a little while\n", "import urllib2\n", "opener = urllib2.build_opener()\n", "opener.addheaders = [('User-agent', 'Mozilla/5.0')] # Trick Wikipedia into thinking that we're not a robot\n", "urlopen = opener.open\n", "\n", "\n", "def count_distinct_words(url):\n", " \"\"\" Count distinct words of a URL\n", " \n", " >>> count_distinct_words('http://www.example.com')\n", " 86\n", " \"\"\"\n", " text = urlopen(url).read()\n", " return len(set(text.split()))\n", "\n", " \n", "def cmp_url(a, b):\n", " \"\"\" A comparator function for URLs\n", " \n", " >>> cmp_url('http://en.wikipedia.org/', 'http://www.example.org')\n", " 1\n", " \"\"\"\n", " return count_distinct_words(a) - count_distinct_words(b)\n", "\n", "\n", "# sorted(urls, cmp=cmp_url) # We could do this but it would be slow" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 48 }, { "cell_type": "code", "collapsed": false, "input": [ "# Efficient Solution\n", "\n", "# To increase performance and reduce the extent to which we hammer the Wikipedia servers \n", "# we'll precompute all of the counts ahead of time in one pass \n", "# and store them in a global dictionary\n", "\n", "word_counts = dict()\n", "for url in urls:\n", " word_counts[url] = count_distinct_words(url)\n", "\n", "\n", "def cmp_url_cached(a, b):\n", " \"\"\" A Cached version of ``cmp_url`` \"\"\"\n", " return cmp(word_counts[a], word_counts[b])\n", " \n", "\n", "# sorted(urls cmp=cmp_url_cached)" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "KeyboardInterrupt", "evalue": "", "output_type": "pyerr", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[0mword_counts\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mdict\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 6\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0murl\u001b[0m \u001b[1;32min\u001b[0m \u001b[0murls\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 7\u001b[1;33m \u001b[0mword_counts\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcount_distinct_words\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 8\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 9\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m\u001b[0m in \u001b[0;36mcount_distinct_words\u001b[1;34m(url)\u001b[0m\n\u001b[0;32m 13\u001b[0m \u001b[1;36m86\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 14\u001b[0m \"\"\"\n\u001b[1;32m---> 15\u001b[1;33m \u001b[0mtext\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0murlopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 16\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mset\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtext\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 17\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/home/mrocklin/Software/anaconda/lib/python2.7/socket.pyc\u001b[0m in \u001b[0;36mread\u001b[1;34m(self, size)\u001b[0m\n\u001b[0;32m 349\u001b[0m \u001b[1;32mwhile\u001b[0m \u001b[0mTrue\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 350\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 351\u001b[1;33m \u001b[0mdata\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_sock\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrecv\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mrbufsize\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 352\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0merror\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 353\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m==\u001b[0m \u001b[0mEINTR\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/home/mrocklin/Software/anaconda/lib/python2.7/httplib.pyc\u001b[0m in \u001b[0;36mread\u001b[1;34m(self, amt)\u001b[0m\n\u001b[0;32m 565\u001b[0m \u001b[1;31m# connection, and the user is reading more bytes than will be provided\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 566\u001b[0m \u001b[1;31m# (for example, reading in 1k chunks)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 567\u001b[1;33m \u001b[0ms\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mamt\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 568\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0ms\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 569\u001b[0m \u001b[1;31m# Ideally, we would raise IncompleteRead if the content-length\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/home/mrocklin/Software/anaconda/lib/python2.7/socket.pyc\u001b[0m in \u001b[0;36mread\u001b[1;34m(self, size)\u001b[0m\n\u001b[0;32m 378\u001b[0m \u001b[1;31m# fragmentation issues on many platforms.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 379\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 380\u001b[1;33m \u001b[0mdata\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_sock\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrecv\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mleft\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 381\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0merror\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 382\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m==\u001b[0m \u001b[0mEINTR\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mKeyboardInterrupt\u001b[0m: " ] } ], "prompt_number": 47 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task\n", "\n", "Your task is simple. Use memoize on our naive solution to achieve the same performance of the efficient solution." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## End\n", "\n", "Establishing state is often required for performance. Memoization captures a common tactic for higher performance code, particularly when repeated expensive operations must be avoided. The `memoize` higher order function wraps this functionality into a simple function call or decorator." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }