{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Memoization with Fibonacci\n",
      "\n",
      "We discuss *what* vs *how* programming and resolve performance flaws with *memoization*.\n",
      "\n",
      "We use the classic Fibonacci sequence as an initial example\n",
      "\n",
      "    0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...\n",
      "    \n",
      "We finish with an example using memoization to minimize repeated web requests"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Fib\n",
      "\n",
      "The Fibonacci sequence is often defined recursively as follows\n",
      "\n",
      "              /            0              if i is 0\n",
      "    fib(i) =  |            1              if i is 1\n",
      "              \\ fib(i - 1) + fib(i - 2)   otherwise\n",
      "\n",
      "Because Python supports recursion we can translate this mathematical definition cleanly into code"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def fib_what(i):\n",
      "    if i == 0:\n",
      "        return 0\n",
      "    elif i == 1:\n",
      "        return 1\n",
      "    else:\n",
      "        return fib_what(i -1) + fib_what(i - 2)\n",
      "\n",
      "map(fib_what, range(11))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 1,
       "text": [
        "[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A common *algorithm* for *how* to compute the $i^{th}$ Fibonacci number efficiently is to keep a running pair of numbers, summing them together to obtain the next"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def fib_how(i):\n",
      "    a, b = 0, 1\n",
      "    for _ in range(i):\n",
      "        a, b = b, a + b\n",
      "    return a\n",
      "\n",
      "map(fib_how, range(11))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 2,
       "text": [
        "[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "These two implementations differ in two ways.  \n",
      "\n",
      "**Style: What v. How**:  \n",
      "\n",
      "*   The first function is similar to a mathematical definition, defining *what* is to be computed without clearly specifying about *how*.  The implementation of this algorithm depends on how the Python runtime handles recursive calls.\n",
      "\n",
      "*   The second function provides a clear recipe for *how* to compute the answer; it is not immediately clear from the function definition *what* is being computed however.\n",
      "\n",
      "*What* solutions tend to be more conceptually clear than *how* solutions.  \n",
      "\n",
      "**Performance**\n",
      "\n",
      "Unfortunately the *what* solution is painfully slow.  This is common with *what* code."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%timeit fib_what(35)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1 loops, best of 3: 4.1 s per loop\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%timeit fib_how(35)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "100000 loops, best of 3: 2.14 \u00b5s per loop\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note the units.  The what solution is two million times slower on this input.  It only gets worse.\n",
      "\n",
      "The *what* solution requires exponentially more time than the *how* solution.  This is because it repeatedly calls `fib` on the same values over and over again.  \n",
      "\n",
      "For example:\n",
      "\n",
      "*   `fib(5)` calls `fib(4)` and `fib(3)`\n",
      "*   `fib(4)` calls `fib(3)` and `fib(2)`\n",
      "*   `fib(3)` calls `fib(2)` and `fib(1)`\n",
      "*   The second `fib(3)` also calls `fib(2)` and `fib(1)`\n",
      "*   All three `fib(2)`s call `fib(1)` and `fib(0)`\n",
      "*   We now have five redundant calls to `fib(1)`. \n",
      "\n",
      "For low inputs like `5` this is merely comical.  For large values it rapidly becomes tragic."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Caching\n",
      "\n",
      "The *what* solution can be saved by caching intermediate values"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cache = dict()\n",
      "\n",
      "def fib(i):\n",
      "    if i in cache:                # Have we seen this input before?\n",
      "        return cache[i]           # if so then return the cached result\n",
      "    \n",
      "    if i == 0:\n",
      "        return 0\n",
      "    if i == 1:\n",
      "        return 1\n",
      "    else:\n",
      "        result = fib(i -1) + fib(i - 2)\n",
      "        cache[i] = result         # Remember that fib(i) == result\n",
      "        return result\n",
      "\n",
      "%timeit cache.clear(); fib(35)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "100000 loops, best of 3: 16.8 \u00b5s per loop\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "By caching results to expensive functions we decrease computation time but increase memory usage.  This can often be a helpful technique."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Memoization\n",
      "\n",
      "Historically functional languages favor *what* over *how* implementations.  \n",
      "\n",
      "Consider our friend, `map`.  The `map` function defines *what* should be done while hiding exactly *how* it is done.  We are free to implement `map` in a variety of interesting ways.  Functional abstractions take computational control away from programmers and deliver it to language and infrastructure designers.\n",
      "\n",
      "Because functional languages favor *what* solutions they often run into issues like the repeated computation we saw in the Fibonacci example.  Rather than explicitly cache every function we make we turn caching into a higher order function.  \n",
      "\n",
      "This function is called `memoize`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from toolz import memoize\n",
      "\n",
      "def fib(i):\n",
      "    if i == 0:\n",
      "        return 0\n",
      "    elif i == 1:\n",
      "        return 1\n",
      "    else:\n",
      "        return fib(i -1) + fib(i - 2)\n",
      "\n",
      "cache = dict()\n",
      "fib = memoize(fib, cache)\n",
      "\n",
      "%timeit cache.clear(); fib(35)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "10000 loops, best of 3: 33.4 \u00b5s per loop\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the above example we explicitly create a cache so that we can clear it before each timing.  Usually we just call memoize with a single argument like so\n",
      "\n",
      "    fib = memoize(fib)\n",
      "    \n",
      "and the cache dictionary is generated automatically.\n",
      "\n",
      "Alternatively we may prefer to use the Python decorator syntax\n",
      "\n",
      "```\n",
      "@memoize\n",
      "def fib(i):\n",
      "    if i == 0:\n",
      "        return 0\n",
      "    if i == 1:\n",
      "        return 1\n",
      "    else:\n",
      "        return fib(i -1) + fib(i - 2)\n",
      "```\n",
      "\n",
      "Both of these are equivalent.\n",
      "\n",
      "Memoization provides a purely *how* optimization so that we can continue to shamelessly write mathematically clear *what* code."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Avoid Explicit State with Memoization\n",
      "\n",
      "The Fibonacci example is a canonical use case of memoization.  Sadly we are rarely called upon to compute Fibonacci numbers in real life.  \n",
      "\n",
      "Here we play with a slightly more realistic example.  We want to sort a set of webpages by the number of distinct words in their body.\n",
      "\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# We'll use the Wikipedia articles for the fifty United States\n",
      "\n",
      "with open('data/states.txt') as f:\n",
      "    urls = ['http://en.wikipedia.org/wiki/' + state for state in f.read().strip().split()]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 51
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Exercise\n",
      "\n",
      "Our task is to sort these URLs by the number of distinct words present in the text of the website.  We'll do this by using the `sorted` function's keyword `cmp` and `urlopen` from `urllib`. \n",
      "\n",
      "We'll go through an implementation that explicitly caches results for performance.  You'll simplify this implementation by adding `memoize` appropriately."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Naive Solution\n",
      "\n",
      "# We can get the raw HTML from a URL with `urlopen`\n",
      "# Depending on your internet connection this might take a little while\n",
      "import urllib2\n",
      "opener = urllib2.build_opener()\n",
      "opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # Trick Wikipedia into thinking that we're not a robot\n",
      "urlopen = opener.open\n",
      "\n",
      "\n",
      "def count_distinct_words(url):\n",
      "    \"\"\" Count distinct words of a URL\n",
      "    \n",
      "    >>> count_distinct_words('http://www.example.com')\n",
      "    86\n",
      "    \"\"\"\n",
      "    text = urlopen(url).read()\n",
      "    return len(set(text.split()))\n",
      "\n",
      "            \n",
      "def cmp_url(a, b):\n",
      "    \"\"\" A comparator function for URLs\n",
      "    \n",
      "    >>> cmp_url('http://en.wikipedia.org/', 'http://www.example.org')\n",
      "    1\n",
      "    \"\"\"\n",
      "    return count_distinct_words(a) - count_distinct_words(b)\n",
      "\n",
      "\n",
      "# sorted(urls, cmp=cmp_url)  # We could do this but it would be slow"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 48
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Efficient Solution\n",
      "\n",
      "# To increase performance and reduce the extent to which we hammer the Wikipedia servers \n",
      "# we'll precompute all of the counts ahead of time in one pass \n",
      "# and store them in a global dictionary\n",
      "\n",
      "word_counts = dict()\n",
      "for url in urls:\n",
      "    word_counts[url] = count_distinct_words(url)\n",
      "\n",
      "\n",
      "def cmp_url_cached(a, b):\n",
      "    \"\"\" A Cached version of ``cmp_url`` \"\"\"\n",
      "    return cmp(word_counts[a], word_counts[b])\n",
      "          \n",
      "\n",
      "# sorted(urls cmp=cmp_url_cached)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "ename": "KeyboardInterrupt",
       "evalue": "",
       "output_type": "pyerr",
       "traceback": [
        "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mKeyboardInterrupt\u001b[0m                         Traceback (most recent call last)",
        "\u001b[1;32m<ipython-input-47-1b7e62009eaa>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m      5\u001b[0m \u001b[0mword_counts\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mdict\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      6\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0murl\u001b[0m \u001b[1;32min\u001b[0m \u001b[0murls\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 7\u001b[1;33m     \u001b[0mword_counts\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcount_distinct_words\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m      8\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      9\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
        "\u001b[1;32m<ipython-input-46-375f431f0072>\u001b[0m in \u001b[0;36mcount_distinct_words\u001b[1;34m(url)\u001b[0m\n\u001b[0;32m     13\u001b[0m     \u001b[1;36m86\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     14\u001b[0m     \"\"\"\n\u001b[1;32m---> 15\u001b[1;33m     \u001b[0mtext\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0murlopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     16\u001b[0m     \u001b[1;32mreturn\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mset\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtext\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     17\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
        "\u001b[1;32m/home/mrocklin/Software/anaconda/lib/python2.7/socket.pyc\u001b[0m in \u001b[0;36mread\u001b[1;34m(self, size)\u001b[0m\n\u001b[0;32m    349\u001b[0m             \u001b[1;32mwhile\u001b[0m \u001b[0mTrue\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    350\u001b[0m                 \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 351\u001b[1;33m                     \u001b[0mdata\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_sock\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrecv\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mrbufsize\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    352\u001b[0m                 \u001b[1;32mexcept\u001b[0m \u001b[0merror\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    353\u001b[0m                     \u001b[1;32mif\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m==\u001b[0m \u001b[0mEINTR\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
        "\u001b[1;32m/home/mrocklin/Software/anaconda/lib/python2.7/httplib.pyc\u001b[0m in \u001b[0;36mread\u001b[1;34m(self, amt)\u001b[0m\n\u001b[0;32m    565\u001b[0m         \u001b[1;31m# connection, and the user is reading more bytes than will be provided\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    566\u001b[0m         \u001b[1;31m# (for example, reading in 1k chunks)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 567\u001b[1;33m         \u001b[0ms\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mamt\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    568\u001b[0m         \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0ms\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    569\u001b[0m             \u001b[1;31m# Ideally, we would raise IncompleteRead if the content-length\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
        "\u001b[1;32m/home/mrocklin/Software/anaconda/lib/python2.7/socket.pyc\u001b[0m in \u001b[0;36mread\u001b[1;34m(self, size)\u001b[0m\n\u001b[0;32m    378\u001b[0m                 \u001b[1;31m# fragmentation issues on many platforms.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    379\u001b[0m                 \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 380\u001b[1;33m                     \u001b[0mdata\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_sock\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrecv\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mleft\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    381\u001b[0m                 \u001b[1;32mexcept\u001b[0m \u001b[0merror\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    382\u001b[0m                     \u001b[1;32mif\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m==\u001b[0m \u001b[0mEINTR\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
        "\u001b[1;31mKeyboardInterrupt\u001b[0m: "
       ]
      }
     ],
     "prompt_number": 47
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Task\n",
      "\n",
      "Your task is simple.  Use memoize on our naive solution to achieve the same performance of the efficient solution."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## End\n",
      "\n",
      "Establishing state is often required for performance.  Memoization captures a common tactic for higher performance code, particularly when repeated expensive operations must be avoided.  The `memoize` higher order function wraps this functionality into a simple function call or decorator."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}