{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Map, Filter, Reduce, and Groupby\n", "\n", "This notebook shows off the canonical higher-order functions." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "def square(x):\n", " return x ** 2\n", "\n", "def iseven(n):\n", " return n % 2 == 0\n", "\n", "def add(x, y):\n", " return x + y\n", "\n", "def mul(x, y):\n", " return x * y\n", "\n", "def lesser(x, y):\n", " if x < y:\n", " return x\n", " else:\n", " return y\n", "\n", "def greater(x, y):\n", " if x > y:\n", " return x\n", " else:\n", " return y" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Map" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# map works like this\n", "map(square, data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "# In this way it's like numpy's broadcasting operators\n", "import numpy as np\n", "X = np.arange(1, 11)\n", "X**2" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "array([ 1, 4, 9, 16, 25, 36, 49, 64, 81, 100])" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "But `map` is pure Python and so \n", "\n", "* is slower\n", "* can handle general functions like fibonacci" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def fib(i):\n", " if i in (0, 1):\n", " return i\n", " else:\n", " return fib(i - 1) + fib(i - 2)\n", " \n", "map(fib, data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "# Normally we would perform this in the following way\n", "result = []\n", "for item in data:\n", " result.append(fib(item))\n", " \n", "result" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ "[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "# looking at the function above gives us a good pattern for how to define `map`\n", "# We just abstract out the function `fib` for a user input\n", "\n", "# `map` is easy to define\n", "def map(fn, sequence):\n", " result = []\n", " for item in sequence:\n", " result.append(fn(item))\n", " return result" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Little known fact, object methods are perfectly valid functions" ] }, { "cell_type": "code", "collapsed": false, "input": [ "map(str.upper, ['Alice', 'Bob', 'Charlie'])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "['ALICE', 'BOB', 'CHARLIE']" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `map` function is so important that it was given its own syntax, the *list comprehension*" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[fib(i) for i in data]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "[name.upper() for name in ['Alice', 'Bob', 'Charlie']]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "['ALICE', 'BOB', 'CHARLIE']" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filter\n", "\n", "The `filter` higher order function filters a dataset by a predicate. \n", "\n", "A predicate is a function that returns `True` or `False`. The `filter` function returns a new list of only those elements for which the predicate holds true." ] }, { "cell_type": "code", "collapsed": false, "input": [ "filter(iseven, data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "[2, 4, 6, 8, 10]" ] } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "from sympy import isprime # Only works if you have the sympy math library installed\n", "filter(isprime, data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "[2, 3, 5, 7]" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "def filter(predicate, sequence):\n", " result = []\n", " for item in sequence:\n", " if predicate(item):\n", " result.append(item)\n", " return result" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reduce" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reduce is the little sibling of `map` and `filter`. Reduce is much less popular and often scolded as being difficult to understand. \n", "\n", "Despite its social problems `reduce` is quite powerful and, once you write `reduce` once you'll understand how it works. More importantly you will learn how to identify reduction operations and how to pair them with binary operators. Reductions are common in data analytics, particularly when reducing large datasets into synopses.\n", "\n", "To show `reduce` we'll first implement two common reductions, `sum` and `min`. We've written them suggestively with the binary operators `add` and `lesser` to highlight their similar structure. Pick out the parts of the following two functions that differ from each other." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def sum(sequence):\n", " result = 0\n", " for item in sequence:\n", " # reult = result + item\n", " result = add(result, item)\n", " return result" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "def min(sequence):\n", " result = 99999999999999 # a really big number\n", " for item in sequence:\n", " # result = result if result < item else item\n", " result = lesser(result, item)\n", " return result" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Now fill in the blanks below to complete the definition of `product`, a function that multiplies the elements of the sequence together. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "def product(sequence):\n", " result = ?\n", " for item in sequence:\n", " result = ?(result, item)\n", " return result\n", "\n", "assert product([2, 3, 10]) == 60" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "SyntaxError", "evalue": "invalid syntax (, line 2)", "output_type": "pyerr", "traceback": [ "\u001b[1;36m File \u001b[1;32m\"\"\u001b[1;36m, line \u001b[1;32m2\u001b[0m\n\u001b[1;33m result = ?\u001b[0m\n\u001b[1;37m ^\u001b[0m\n\u001b[1;31mSyntaxError\u001b[0m\u001b[1;31m:\u001b[0m invalid syntax\n" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Write `reduce`.\n", "\n", "Start by copying the pattern of the above three functions. The parts that differ between the three are your inputs. Traditionally the arguments of reduce are ordered so that the examples below operate well." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def reduce(...):\n", " ..." ], "language": "python", "metadata": {}, "outputs": [ { "ename": "SyntaxError", "evalue": "invalid syntax (, line 1)", "output_type": "pyerr", "traceback": [ "\u001b[1;36m File \u001b[1;32m\"\"\u001b[1;36m, line \u001b[1;32m1\u001b[0m\n\u001b[1;33m def reduce(...):\u001b[0m\n\u001b[1;37m ^\u001b[0m\n\u001b[1;31mSyntaxError\u001b[0m\u001b[1;31m:\u001b[0m invalid syntax\n" ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "reduce(add, data, 0)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "55" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "reduce(mul, data, 1)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "3628800" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "\n", "reduce(lesser, data, 10000000)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ "1" ] } ], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "reduce(greater, data, -100000000)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "10" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lambda" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We started this notebook with lots of little definitions like \n", "\n", "```\n", "def add(x, y):\n", " return x + y\n", "```\n", "\n", "These one-line functions sometimes seem a little silly. We use the `lambda` keyword to create small functions on the fly. The above definition could be expressed as follows\n", "\n", "```\n", "add = lambda x, y: x + y\n", "```\n", "\n", "The expression `lambda x, y: x + y` is a value, just like `3` or `Alice`. Just like literal ints and strings, Lambda expressions can be used on-the-fly without storing them in variables." ] }, { "cell_type": "code", "collapsed": false, "input": [ "reduce(add, data, 0)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ "55" ] } ], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "reduce(lambda x, y: x + y, data, 0) # Define `add` on the fly" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "55" ] } ], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Additionally we can use `lambda` to quickly specify functions as specializations of more general ones. In the following we quickly define sum, min, and max" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sum = lambda data: reduce(add, data, 0)\n", "min = lambda data: reduce(lesser, data, 99999999999)\n", "max = lambda data: reduce(greater, data, -999999999999)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "sum(data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 25, "text": [ "55" ] } ], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an exercise make `product` using `lambda`, `reduce`, and `mul`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "product = ...\n", "assert product([2, 3, 10]) == 60" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "SyntaxError", "evalue": "invalid syntax (, line 1)", "output_type": "pyerr", "traceback": [ "\u001b[1;36m File \u001b[1;32m\"\"\u001b[1;36m, line \u001b[1;32m1\u001b[0m\n\u001b[1;33m product = ...\u001b[0m\n\u001b[1;37m ^\u001b[0m\n\u001b[1;31mSyntaxError\u001b[0m\u001b[1;31m:\u001b[0m invalid syntax\n" ] } ], "prompt_number": 26 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Groupby" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Groupby can be seen as a more powerful version of `filter`. Rather than give you one subset of the data it divides the data into all relevant subsets." ] }, { "cell_type": "code", "collapsed": false, "input": [ "filter(iseven, data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 27, "text": [ "[2, 4, 6, 8, 10]" ] } ], "prompt_number": 27 }, { "cell_type": "code", "collapsed": false, "input": [ "from toolz import groupby\n", "groupby(iseven, data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 29, "text": [ "{False: [1, 3, 5, 7, 9], True: [2, 4, 6, 8, 10]}" ] } ], "prompt_number": 29 }, { "cell_type": "code", "collapsed": false, "input": [ "groupby(isprime, data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 30, "text": [ "{False: [1, 4, 6, 8, 9, 10], True: [2, 3, 5, 7]}" ] } ], "prompt_number": 30 }, { "cell_type": "markdown", "metadata": {}, "source": [ "But `groupby` is not restricted to predicates (functions which return `True` or `False`)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "groupby(lambda n: n % 3, data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 31, "text": [ "{0: [3, 6, 9], 1: [1, 4, 7, 10], 2: [2, 5, 8]}" ] } ], "prompt_number": 31 }, { "cell_type": "code", "collapsed": false, "input": [ "groupby(len, ['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank'])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 32, "text": [ "{3: ['Bob', 'Dan'], 5: ['Alice', 'Edith', 'Frank'], 7: ['Charlie']}" ] } ], "prompt_number": 32 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Amazingly `groupby` is not significantly more costly than `filter` in the common case. It computes these groups in a single pass through the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Integrative example\n", "\n", "Lets bring it all together on a tiny dataset" ] }, { "cell_type": "code", "collapsed": false, "input": [ "likes = \"\"\"Alice likes Chocolate\n", "Bob likes Chocolate\n", "Bob likes Apples\n", "Charlie likes Apples\n", "Alice likes Peanut Butter\n", "Charlie likes Peanut Butter\"\"\"" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 33 }, { "cell_type": "code", "collapsed": false, "input": [ "tuples = map(lambda s: s.split(' likes '), likes.split('\\n'))\n", "tuples" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 34, "text": [ "[['Alice', 'Chocolate'],\n", " ['Bob', 'Chocolate'],\n", " ['Bob', 'Apples'],\n", " ['Charlie', 'Apples'],\n", " ['Alice', 'Peanut Butter'],\n", " ['Charlie', 'Peanut Butter']]" ] } ], "prompt_number": 34 }, { "cell_type": "code", "collapsed": false, "input": [ "groups = groupby(lambda x: x[0], tuples)\n", "groups" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 35, "text": [ "{'Alice': [['Alice', 'Chocolate'], ['Alice', 'Peanut Butter']],\n", " 'Bob': [['Bob', 'Chocolate'], ['Bob', 'Apples']],\n", " 'Charlie': [['Charlie', 'Apples'], ['Charlie', 'Peanut Butter']]}" ] } ], "prompt_number": 35 }, { "cell_type": "code", "collapsed": false, "input": [ "from toolz import valmap, first, second\n", "valmap(lambda L: map(second, L), groups)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 36, "text": [ "{'Alice': ['Chocolate', 'Peanut Butter'],\n", " 'Bob': ['Chocolate', 'Apples'],\n", " 'Charlie': ['Apples', 'Peanut Butter']}" ] } ], "prompt_number": 36 }, { "cell_type": "code", "collapsed": false, "input": [ "valmap(lambda L: map(first, L), groupby(lambda x: x[1], tuples))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 37, "text": [ "{'Apples': ['Bob', 'Charlie'],\n", " 'Chocolate': ['Alice', 'Bob'],\n", " 'Peanut Butter': ['Alice', 'Charlie']}" ] } ], "prompt_number": 37 }, { "cell_type": "code", "collapsed": false, "input": [ "from toolz.curried import map, valmap, groupby, first, second, get, curry, compose, pipe\n", "tmap = curry(compose(tuple, map))\n", "pipe(tuples, groupby(second), valmap(tmap(first)))\n", "valmap(tmap(first), groupby(second, tuples))\n", "\n", "f = compose(valmap(tmap(first)), groupby(second))\n", "f(tuples)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 39, "text": [ "{'Apples': ('Bob', 'Charlie'),\n", " 'Chocolate': ('Alice', 'Bob'),\n", " 'Peanut Butter': ('Alice', 'Charlie')}" ] } ], "prompt_number": 39 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }