{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Laziness\n", "\n", "*Efficiency and scale through work avoidance*" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# We build a somewhat large list of integers\n", "\n", "integers = range(10000000)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's 10,000,000 numbers. Not huge, but not small either. We need to be careful that we don't do anything too costly on these integers. \n", "\n", "Testing for primality is costly. Lets do that!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sympy # don't worry if you don't have this\n", "from toolz import filter\n", "primes = filter(sympy.isprime, integers)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: *I thought testing primality was moderately expensive. Why did that return so quickly?*\n", "\n", "A: The `toolz.filter` is lazy. No actual work was performed." ] }, { "cell_type": "code", "collapsed": false, "input": [ "primes # not actually a list of primes" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And yet we can still treat the variable `primes` like a list of prime numbers. Lets print the first few" ] }, { "cell_type": "code", "collapsed": false, "input": [ "next(primes)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "2" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "next(primes)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "3" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "next(primes)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ "5" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "# Note that primes is not a list like integers. We can't index into it\n", "\n", "primes[1000] # 1000th prime please" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "'itertools.ifilter' object has no attribute '__getitem__'", "output_type": "pyerr", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;31m# Note that primes is not a list like integers. We can't index into it\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mprimes\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m1000\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;31m# 1000th prime please\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mTypeError\u001b[0m: 'itertools.ifilter' object has no attribute '__getitem__'" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "# But we *can* still iterate over it\n", "\n", "for p in primes:\n", " print p\n", " if p > 100: # stop us from going through the whole list\n", " break" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "7\n", "11\n", "13\n", "17\n", "19\n", "23\n", "29\n", "31\n", "37\n", "41\n", "43\n", "47\n", "53\n", "59\n", "61\n", "67\n", "71\n", "73\n", "79\n", "83\n", "89\n", "97\n", "101\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the first few primes 2, 3, 5, aren't included in this list. Items in lazy iterators are consumed as you use them, never to be seen again. If you really want to store the iterator, then call the `list` constructor on it" ] }, { "cell_type": "code", "collapsed": false, "input": [ "primes = list(primes) # fully evaluate the lazy iterator, store it in a list" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Infinite iterators\n", "\n", "Laziness lets us talk about doing work without actually doing it. This separation from reality opens up new possibilities. For example, here are all of the fibonacci numbers" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def fib():\n", " \"\"\" A generator for all Fibonacci numbers \"\"\"\n", " a, b = 0, 1\n", " while True:\n", " yield a\n", " a, b = b, a + b\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "fibs = fib() # make a lazy iterator, fibs\n", "print next(fibs)\n", "print next(fibs)\n", "print next(fibs)\n", "print next(fibs)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0\n", "1\n", "1\n", "2\n" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Toolz functions\n", "\n", "Toolz provides functions to interact with lazy iterators." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from toolz import first, second, nth, drop, take" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "first(fib())" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "0" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "second(fib())" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "1" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "nth(10, fib())" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "55" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "take(20, fib()) # Yet another iterator" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "list(take(20, fib())) # But this one is finite, so we can expand it with list" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "[0,\n", " 1,\n", " 1,\n", " 2,\n", " 3,\n", " 5,\n", " 8,\n", " 13,\n", " 21,\n", " 34,\n", " 55,\n", " 89,\n", " 144,\n", " 233,\n", " 377,\n", " 610,\n", " 987,\n", " 1597,\n", " 2584,\n", " 4181]" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "later_fibs = drop(100, fib())" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "next(later_fibs)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "354224848179261915075L" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "# These all depend on itertools.islice\n", "from itertools import islice\n", "list(islice(fib(), 10, 20, 1)) # fib()[10:20:1]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "[55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181]" ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: Applying functions to text\n", "\n", "Python's `file` object is a lazy iterator. When I open a large text file Python doesn't actually read in all of the text at once" ] }, { "cell_type": "code", "collapsed": false, "input": [ "book = open('data/tale-of-two-cities.txt')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each element of the book is a line of text. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "print 1, next(book)\n", "print 2, next(book)\n", "print 3, next(book)\n", "print 4, next(book)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 \ufeffThe Project Gutenberg EBook of A Tale of Two Cities, by Charles Dickens\r\n", "\n", "2 \r\n", "\n", "3 This eBook is for the use of anyone anywhere at no cost and with\r\n", "\n", "4 almost no restrictions whatsoever. You may copy it, give it away or\r\n", "\n" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the Gutenberg project header. It is roughly 112 lines long. Lets start the book over and drop the header." ] }, { "cell_type": "code", "collapsed": false, "input": [ "book = open('data/tale-of-two-cities.txt')\n", "book = drop(112, book)\n", "next(book)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ "'It was the best of times,\\r\\n'" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "That looks familiar. Lets strip the `'\\r\\n'` from all of the lines though" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from toolz import map # toolz.map is lazy too" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "book = map(str.strip, book)\n", "next(book)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 24, "text": [ "'it was the worst of times,'" ] } ], "prompt_number": 24 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Semantically, it is as if we applied `str.strip` to every line in the book\n", "\n", "In reality we've done no work.\n", "\n", "Instead `map` returned a new lazy iterator that draws an element from the original `book`, applies `str.strip` to it, and then yields the result. It only does this when we ask it to." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can chain lazy operations on top of one another. After stripping each line we'll focus on those lines that contain the word \"Mr.\" or \"Miss\" or \"Mrs.\"" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def good_line(line):\n", " return \"Mr.\" in line or \"Miss\" in line or \"Mrs.\" in line\n", "\n", "book = filter(good_line, book)\n", "next(book)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 25, "text": [ "'as at this. Mrs. Southcott had recently attained her five-and-twentieth'" ] } ], "prompt_number": 25 }, { "cell_type": "code", "collapsed": false, "input": [ "# Lets take 10 lines from this iterator and display them\n", "\n", "for line in take(10, book):\n", " print 1, line\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 \"Mr. Jarvis Lorry.\"\n", "1 \"Yes, Mr. Lorry.\"\n", "1 \"I know this messenger, guard,\" said Mr. Lorry, getting down into the\n", "1 like a larger dog-kennel. Mr. Lorry, the passenger, shaking himself out\n", "1 Mr. Lorry dropped off to sleep. The arrival of his breakfast roused him,\n", "1 time to-day. She may ask for Mr. Jarvis Lorry, or she may only ask for a\n", "1 When Mr. Lorry had finished his breakfast, he went out for a stroll on\n", "1 again charged with mist and vapour, Mr. Lorry's thoughts seemed to cloud\n", "1 Mr. Lorry had been idle a long time, and had just poured out his last\n", "1 In a very few minutes the waiter came in to announce that Miss Manette\n" ] } ], "prompt_number": 26 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Motivation\n", "\n", "There are two main reasons for laziness\n", "\n", "1. Avoid useless work: Oftentimes we find what we need quickly, near the beginning of a dataset. Computing on the entire dataset is wasteful\n", "2. Memory: Only a small amount of the dataset needs to stay in memory at any one point. This allows us to stream very large datasets.\n", "\n", "In each case we stay close to traditional Python syntax. We often write lazy code without realizing it." ] } ], "metadata": {} } ] }