{ "metadata": { "name": "", "signature": "sha256:671c3304216a623dbcdc39ba184d4aae7ce429f7c26ddb86caa9b8b0e13f32c6" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python Data Structures: Beyond the Basics\n", "\n", "There are a number of features, idioms, and data types for manipulating Python data structures that are worth being familiar with. Much of the material that follows in this course assumes this familiarity. An overview of some of these concepts is below. The goal of this tutorial is not necessarily for you to be able to use these patterns fluently, but to be able to recognize them in use as the course proceeds, and to start to be able to reason about which built-in Python data structures and functions are the best match for the problems and tasks you'll encounter.\n", "\n", "I begin by going through the concepts in a technical manner, and end with an example that combines many of the concepts, using data from the Open Weather API.\n", "\n", "(Some of this may be review from Foundations, but I want to make sure we have common ground on these items in particular.)\n", "\n", "## List comprehensions\n", "\n", "A very common task in both data analysis and computer programming is applying some operation to every item in a list (e.g., scaling the numbers in a list by a fixed factor), or creating a copy of a list with only those items that match a particular criterion (e.g., eliminating values that fall below a certain threshold). 
Python has a succinct syntax, called a *list comprehension*, which allows you to easily write expressions that transform and filter lists.\n", "\n", "A list comprehension has a few parts:\n", "\n", "- a *source list*, or the list whose values will be transformed or filtered;\n", "- a *predicate expression*, to be evaluated for every item in the list; \n", "- (optionally) a *membership expression* that determines whether or not an item in the source list will be included in the result of evaluating the list comprehension, based on whether the expression evaluates to `True` or `False`; and\n", "- a *temporary variable name* by which each value from the source list will be known in the predicate expression and membership expression.\n", "\n", "These parts are arranged like so:\n", "\n", "> `[` *predicate expression* `for` *temporary variable name* `in` *source list* `if` *membership expression* `]`\n", "\n", "The words `for`, `in`, and `if` are a part of the syntax of the expression. They don't mean anything in particular (and in fact, they do completely different things in other parts of the Python language). You just have to spell them right and put them in the right place in order for the list comprehension to work.\n", "\n", "Here's an example, returning the squares of integers zero up to ten:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print [x * x for x in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]\n" ] } ], "prompt_number": 166 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the example above, `x*x` is the predicate expression; `x` is the temporary variable name; and `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]` is the source list. There's no membership expression in this example, so we omit it (and the word `if`).\n", "\n", "There's nothing special about the variable `x`; it's just a name that we chose. 
We could easily choose any other temporary variable name, as long as we use it in the predicate expression as well. Below, I use the name of one of my cats as the temporary variable name, and the expression evaluates the same way it did with `x`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print [shumai * shumai for shumai in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]\n" ] } ], "prompt_number": 167 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The expression in the list comprehension can be any expression, even just the temporary variable itself, in which case the list comprehension will simply evaluate to a copy of the original list: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "print [x for x in range(10)]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n" ] } ], "prompt_number": 168 }, { "cell_type": "markdown", "metadata": {}, "source": [ "You don't technically even need to use the temporary variable in the predicate expression:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print [42 for x in range(10)]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[42, 42, 42, 42, 42, 42, 42, 42, 42, 42]\n" ] } ], "prompt_number": 169 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###The membership expression\n", "\n", "As indicated above, you can include an expression at the end of the list comprehension to determine whether or not the item in the source list will be evaluated and included in the resulting list. 
Here's one way, for example, to square only those values from the source list that are greater than or equal to five:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print [x*x for x in range(10) if x >= 5]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[25, 36, 49, 64, 81]\n" ] } ], "prompt_number": 170 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Randomness\n", "\n", "Random numbers are useful for a number of reasons, from testing to statistics to cryptography. Python has a built-in module `random` which allows you to do things with random numbers. I'm going to introduce just a few of the most useful functions from `random`.\n", "\n", "In order to use `random`, you need to `import` it first. Once you do, you can call `random.randrange()` to generate a random integer from zero up to (but not including) the specified number:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import random\n", "print random.randrange(100)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "13\n" ] } ], "prompt_number": 171 }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use `random.randrange()` to, for example, simulate a number of dice rolls. 
Here are 100 random rolls of a d6 (a six-sided die):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print [random.randrange(6)+1 for i in range(100)]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[6, 6, 5, 4, 4, 3, 2, 4, 4, 2, 6, 5, 6, 1, 2, 3, 2, 3, 2, 4, 4, 4, 1, 1, 2, 6, 4, 5, 4, 6, 3, 3, 4, 2, 1, 3, 4, 4, 5, 6, 5, 2, 1, 2, 1, 5, 3, 2, 4, 6, 4, 5, 1, 1, 2, 5, 6, 4, 4, 4, 1, 4, 4, 2, 5, 2, 2, 6, 4, 6, 1, 3, 4, 2, 6, 1, 3, 1, 6, 5, 5, 1, 4, 5, 1, 6, 3, 2, 5, 2, 3, 5, 5, 1, 3, 2, 3, 2, 1, 2]\n" ] } ], "prompt_number": 172 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `random` module also has a number of functions for getting random items from lists. The first is `random.choice()`, which simply returns a random item from a list:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "flavors = [\"vanilla\", \"chocolate\", \"red velvet\", \"durian\", \"cinnamon\", \"~mystery~\"]\n", "print random.choice(flavors)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "chocolate\n" ] } ], "prompt_number": 173 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `random.sample()` function randomly samples a specified number of items from a list (guaranteeing that the same item won't be drawn twice):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print random.sample(flavors, 2)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['~mystery~', 'durian']\n" ] } ], "prompt_number": 174 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, the `random.shuffle()` function rearranges the items of a list in random order, modifying the list in place:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print flavors\n", "random.shuffle(flavors)\n", "print flavors" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ 
"['vanilla', 'chocolate', 'red velvet', 'durian', 'cinnamon', '~mystery~']\n", "['cinnamon', '~mystery~', 'red velvet', 'durian', 'vanilla', 'chocolate']\n" ] } ], "prompt_number": 175 }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are just the most useful (in my opinion) functions from `random`; the module has many other helpful functions to (e.g.) generate random numbers with particular distributions. [Read more here](https://docs.python.org/2/library/random.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tuples\n", "\n", "Tuples (rhymes with \"supple\") are data structures very similar to lists. You can create a tuple using parentheses (instead of square brackets, as you would with a list):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "t = (\"alpha\", \"beta\", \"gamma\", \"delta\")\n", "print t" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "('alpha', 'beta', 'gamma', 'delta')\n" ] } ], "prompt_number": 176 }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can access the values in a tuple in the same way as you access the values in a list: using square bracket indexing syntax. Tuples support slice syntax and negative indexes, just like lists:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "t[-2]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 177, "text": [ "'gamma'" ] } ], "prompt_number": 177 }, { "cell_type": "code", "collapsed": false, "input": [ "t[1:3]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 178, "text": [ "('beta', 'gamma')" ] } ], "prompt_number": 178 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The difference between a list and a tuple is that *the values in a tuple can't be changed after the tuple is created*. 
This means, for example, that attempting to `.append()` a value to a tuple will fail:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "t.append(\"epsilon\")" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "AttributeError", "evalue": "'tuple' object has no attribute 'append'", "output_type": "pyerr", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"epsilon\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m: 'tuple' object has no attribute 'append'" ] } ], "prompt_number": 179 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Likewise, assigning to an index of a tuple will fail:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "t[2] = \"bravo\"" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "'tuple' object does not support item assignment", "output_type": "pyerr", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mt\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m\"bravo\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mTypeError\u001b[0m: 'tuple' object does not support item assignment" ] } ], "prompt_number": 180 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Why tuples? Why now?\n", "\n", "\"So,\" you think to yourself. \"Tuples are just like... broken lists. 
That's strange and a little unreasonable. Why even have them in your programming language?\" That's a fair question, and answering it requires a bit of knowledge of how Python works with these two kinds of values (lists and tuples) behind the scenes.\n", "\n", "Essentially, tuples are *faster* and *smaller* than lists. Because lists can be modified, potentially becoming larger after they're initialized, Python has to allocate more memory than is strictly necessary whenever you create a list value. If your list grows beyond what Python has already allocated, Python has to allocate *more* memory. Allocating memory, copying values into memory, and then freeing memory when it's no longer needed are all (perhaps surprisingly) slow processes---slower, at least, than using data already loaded into memory when your program begins.\n", "\n", "Because a tuple can't grow or shrink after it's created, Python knows exactly how much memory to allocate when you create a tuple in your program. That means less wasted memory, and less wasted time allocating and deallocating memory. The cost of this decreased resource footprint is less versatility.\n", "\n", "Tuples are often called an *immutable* data type. \"Immutable\" in this context simply means that a tuple can't be changed after it's created.\n", "\n", "### Tuples in the standard library\n", "\n", "Because tuples are faster, they're often the data type that gets returned from methods and functions in Python's built-in library. 
For example, the `.items()` method of the dictionary object returns a list of tuples (rather than, as you might otherwise expect, a list of lists):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "moon_counts = {'mercury': 0, 'venus': 0, 'earth': 1, 'mars': 2}\n", "moon_counts.items()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 181, "text": [ "[('mercury', 0), ('earth', 1), ('venus', 0), ('mars', 2)]" ] } ], "prompt_number": 181 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `tuple()` function takes a list and returns a new tuple containing the same items:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "tuple([1, 2, 3, 4, 5])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 182, "text": [ "(1, 2, 3, 4, 5)" ] } ], "prompt_number": 182 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to initialize a new list with the data in a tuple, you can pass the tuple to the `list()` function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "list((1, 2, 3, 4, 5))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 183, "text": [ "[1, 2, 3, 4, 5]" ] } ], "prompt_number": 183 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sets\n", "\n", "Sets are another list-like data structure. 
Like tuples, sets have limitations compared to lists, but are very useful in particular circumstances.\n", "\n", "You can create a set like this, by passing a list to the `set()` function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "s = set([\"alpha\", \"beta\", \"gamma\", \"delta\", \"epsilon\"])\n", "print type(s)\n", "print s" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "<type 'set'>\n", "set(['epsilon', 'alpha', 'beta', 'gamma', 'delta'])\n" ] } ], "prompt_number": 184 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Python 2.7 and later, you can also create a set using curly brackets (`{` and `}`) with a comma-separated sequence of values:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "s = {\"alpha\", \"beta\", \"gamma\", \"delta\", \"epsilon\"}\n", "print type(s)\n", "print s" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "<type 'set'>\n", "set(['epsilon', 'beta', 'alpha', 'gamma', 'delta'])\n" ] } ], "prompt_number": 185 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sets, like lists, can be iterated over in a `for` loop and serve as the source value in a list comprehension:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for item in s:\n", " print item" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "epsilon\n", "beta\n", "alpha\n", "gamma\n", "delta\n" ] } ], "prompt_number": 186 }, { "cell_type": "code", "collapsed": false, "input": [ "[item[0] for item in s]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 187, "text": [ "['e', 'b', 'a', 'g', 'd']" ] } ], "prompt_number": 187 }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can add an item to a set using the set object's `.add()` method:" ] }, { "cell_type": "code", "collapsed": false, 
"input": [ "s.add(\"omega\")\n", "print s" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "set(['epsilon', 'beta', 'delta', 'alpha', 'omega', 'gamma'])\n" ] } ], "prompt_number": 188 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And you can check to see if a particular value is present in a set using the `in` operator:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\"beta\" in s" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 189, "text": [ "True" ] } ], "prompt_number": 189 }, { "cell_type": "code", "collapsed": false, "input": [ "\"emoji\" in s" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 190, "text": [ "False" ] } ], "prompt_number": 190 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Sets can't contain duplicates\n", "\n", "So now you're asking \"okay, so... it's a list. You can put things in it and add things to it and check if things are in it. Big deal. Why am I even listening to this. I'm going to check Facebook. Ah, sweet, sweet Facebook.\" But wait! Sets are different from lists in several useful (and/or strange) ways. One useful property of the set is that once a value is in a set, any further attempts to add that value to the set will be ignored. That is: a set can't contain the same value twice. 
You can exploit this property of sets in order to remove duplicates from a list:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "source_list = [\"it\", \"is\", \"what\", \"it\", \"is\"]\n", "without_duplicates = set(source_list)\n", "print without_duplicates" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "set(['is', 'it', 'what'])\n" ] } ], "prompt_number": 191 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Sets are faster (for certain applications) than lists\n", "\n", "Another useful property of sets is that the `in` operator is *much faster* when operating on sets than it is on lists, especially when the number of items you're working with is very large. The following code illustrates the speed difference by creating a list of 9999 values, and a set of the same values, then performing a thousand random `in` checks against both:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import time, random\n", "\n", "values = range(9999)\n", "values_set = set(values)\n", "\n", "start_list = time.clock()\n", "for i in range(1000):\n", " random.randrange(99999) in values\n", "end_list = time.clock()\n", "\n", "start_set = time.clock()\n", "for i in range(1000):\n", " random.randrange(99999) in values_set\n", "end_set = time.clock()\n", "\n", "print \"1000 random checks on list: \", end_list - start_list, \"seconds\"\n", "print \"1000 random checks on set: \", end_set - start_set, \"seconds\"" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1000 random checks on list: 0.19032 seconds\n", "1000 random checks on set: 0.00268700000001 seconds\n" ] } ], "prompt_number": 192 }, { "cell_type": "markdown", "metadata": {}, "source": [ "(The `time.clock()` function returns the current time in seconds; subtracting one time from another tells us roughly how much time has passed between the two calls.) 
Depending on your computer, using the `set` instead of the list will be 10 to 100 times faster. In this example, the actual clock time difference is minuscule---about two tenths of a second versus about three thousandths of a second---but when you're working with billions or trillions of items instead of tens of thousands, that performance difference can really add up.\n", "\n", "(The reasons for this performance difference are outside the scope of this tutorial. Suffice it to say that the `in` operator must potentially check *every item* in a list to see if the first operand matches---meaning that as the list grows larger, the operation gets slower. With a set, the `in` operator typically needs to perform only about *one* check, regardless of how large the data structure is.)\n", "\n", "The tradeoff for the set's speed is that it's less memory-efficient than a list. Here we'll use the `sys.getsizeof()` function to get a rough estimate of the size (in bytes) of both objects:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "print \"size of list: \", sys.getsizeof(values)\n", "print \"size of set: \", sys.getsizeof(values_set)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "size of list: 80064\n", "size of set: 524520\n" ] } ], "prompt_number": 193 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sets are unordered\n", "\n", "Another important difference between sets and lists is that sets are *unordered*. You may have noticed this in the examples above: the order of the items added to a set is not necessarily the same as the order of the items when you get them back out. (This is similar to how keys in Python dictionaries are unordered.)\n", "\n", "Sets and lists are similar, but not interchangeable. Use lists when it's important to know the order of a particular sequence; use sets when it's important to be able to quickly check to see if a particular item is in the sequence." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The `dict()` function\n", "\n", "The `dict()` function creates a new dictionary. You can create an empty dictionary by calling this function with no parameters:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "t = dict() # same as t = {}\n", "print type(t)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "<type 'dict'>\n" ] } ], "prompt_number": 194 }, { "cell_type": "markdown", "metadata": {}, "source": [ "But the `dict()` function can also be used to initialize a new dictionary from a *list of tuples*. Here's what that usage looks like:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "items = [(\"a\", 1), (\"b\", 2), (\"c\", 3)]\n", "t = dict(items)\n", "print t" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "{'a': 1, 'c': 3, 'b': 2}\n" ] } ], "prompt_number": 195 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This might not seem immediately useful, but as we'll see below, the `dict()` function can be used to quickly make a dictionary out of sequential data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dictionary comprehensions\n", "\n", "A very common task in Python is to take some kind of sequential data and then turn it into a dictionary. Say, for example, that we wanted to take a list of strings and then create a dictionary mapping the strings to their lengths. 
Here's how to do that with a `for` loop:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "us_presidents = [\"carter\", \"reagan\", \"bush\", \"clinton\", \"bush\", \"obama\"]\n", "prez_lengths = {}\n", "for item in us_presidents:\n", " prez_lengths[item] = len(item)\n", "print prez_lengths" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "{'clinton': 7, 'bush': 4, 'reagan': 6, 'carter': 6, 'obama': 5}\n" ] } ], "prompt_number": 196 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This task is so common that it's often written as a single expression. There are several ways to do this; the first is by passing the result of a list comprehension to the `dict()` function: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "prez_length_tuples = [(item, len(item)) for item in us_presidents]\n", "print \"our list of tuples: \", prez_length_tuples\n", "prez_lengths = dict(prez_length_tuples)\n", "print \"resulting dictionary: \", prez_lengths" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "our list of tuples: [('carter', 6), ('reagan', 6), ('bush', 4), ('clinton', 7), ('bush', 4), ('obama', 5)]\n", "resulting dictionary: {'clinton': 7, 'bush': 4, 'reagan': 6, 'carter': 6, 'obama': 5}\n" ] } ], "prompt_number": 197 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The example above is a little bit complicated! The tricky part is the list comprehension. The *source list* of the comprehension is our list of presidential names; the predicate expression is a *tuple* with two items: the name itself, and the length of the name. We then pass the resulting list of tuples to the `dict()` function, which evaluates to the desired dictionary. 
This bit of code can be rewritten as one expression:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "prez_lengths = dict([(item, len(item)) for item in us_presidents])\n", "print prez_lengths" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "{'clinton': 7, 'bush': 4, 'reagan': 6, 'carter': 6, 'obama': 5}\n" ] } ], "prompt_number": 198 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you're a beginner Python programmer, it might be a while before you can formulate these expressions on your own. But it's important to be able to recognize this idiom when you see it in other people's code.\n", "\n", "Python 3 introduced a new syntax specifically for creating dictionaries in this manner; the syntax has subsequently been backported to Python 2.7. The expression above can be rewritten like so:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "prez_lengths = {item: len(item) for item in us_presidents}\n", "print prez_lengths" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "{'clinton': 7, 'bush': 4, 'reagan': 6, 'carter': 6, 'obama': 5}\n" ] } ], "prompt_number": 199 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This syntax is called a *dictionary comprehension* (by analogy with \"list comprehension\"). A dictionary comprehension is like a list comprehension, except the \"predicate expression\" is not an expression proper, but a key/value pair separated by a colon. We'll see another example of this syntax below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Combining lists with `zip`\n", "\n", "As you can see by now, the \"list of tuples\" is a very common configuration for data in Python. The `zip()` function allows you to create a list of tuples that combines values from two separate lists. 
For example, imagine that you've retrieved the names of certain US states from one source, and the estimated population for those states from a different source. You know that the data is in the same order in both sources, and you'd like to combine the two into one list (perhaps to eventually create a dictionary for easy population lookups). The `zip()` function does just this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "state_names = [\"alabama\", \"alaska\", \"arizona\", \"arkansas\", \"california\"]\n", "state_pop = [4849377, 736732, 6731484, 2966369, 38802500]\n", "combo = zip(state_names, state_pop)\n", "print combo" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[('alabama', 4849377), ('alaska', 736732), ('arizona', 6731484), ('arkansas', 2966369), ('california', 38802500)]\n" ] } ], "prompt_number": 200 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the `zip()` function takes two lists as parameters and returns a *list of tuples* with the respective items from both lists. You could then (for example) pass the result of `zip()` to the `dict()` function, to create a dictionary mapping state names to state populations:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "state_pop_lookup = dict(zip(state_names, state_pop))\n", "print state_pop_lookup" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "{'california': 38802500, 'alabama': 4849377, 'arizona': 6731484, 'arkansas': 2966369, 'alaska': 736732}\n" ] } ], "prompt_number": 201 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Enumerating lists\n", "\n", "Let's say you want to iterate through a list, prepending each item in the list with its index: a numbered list. 
One way to do this is to write a `for` loop to print out the items of a list with their index, keeping track of the current index in a separate variable:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "elements = [\"hydrogen\", \"helium\", \"lithium\", \"beryllium\", \"boron\"]\n", "index = 0\n", "for item in elements:\n", " print index, item\n", " index += 1" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0 hydrogen\n", "1 helium\n", "2 lithium\n", "3 beryllium\n", "4 boron\n" ] } ], "prompt_number": 202 }, { "cell_type": "markdown", "metadata": {}, "source": [ "That whole `index` variable thing, though---kind of ugly and non-Pythonic. What if we used the `zip()` function instead to create a list of tuples, where the first item of each tuple is the index of the item in the source list, and the second item is the item itself? That might look like this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# the range() function returns a list from 0 up to (but not including) the specified value\n", "enumerated_elements = zip(range(len(elements)), elements)\n", "print \"enumerated list: \", enumerated_elements\n", "\n", "# now, iterate over each tuple in the enumerated list...\n", "for index_item_tuple in enumerated_elements:\n", " print index_item_tuple[0], index_item_tuple[1]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "enumerated list: [(0, 'hydrogen'), (1, 'helium'), (2, 'lithium'), (3, 'beryllium'), (4, 'boron')]\n", "0 hydrogen\n", "1 helium\n", "2 lithium\n", "3 beryllium\n", "4 boron\n" ] } ], "prompt_number": 203 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `zip()` function here takes two lists: the first is a list returned from `range()` that has numbers from zero up to (but not including) the number of items in the `elements` list (i.e., `[0, 1, 2, 3, 4]`). The second is the `elements` list itself. 
The call to `zip()` evaluates to the list shown above: a list of 2-tuples with index/item pairs.\n", "\n", "The `for` loop above is a little awkward: the temporary loop variable `index_item_tuple` has the value of each tuple in the `enumerated_elements` list in turn, so we need to use square brackets to get the values from the tuple. It turns out there's an easier, more Pythonic way to do this, using a feature called \"tuple unpacking\":" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for index, item in enumerated_elements:\n", " print index, item" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0 hydrogen\n", "1 helium\n", "2 lithium\n", "3 beryllium\n", "4 boron\n" ] } ], "prompt_number": 204 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you know that each item of a list is a tuple, you can write a series of comma-separated temporary variables between the `for` and the `in` of a `for` loop. Python will assign the first element of each tuple to the first variable listed, the second element of each tuple to the second variable listed, etc. This `for` loop accomplishes the same thing as the previous one, but it's much cleaner.\n", "\n", "Lists of index/value 2-tuples are needed fairly frequently in Python. So frequently that there's a built-in function for constructing such lists. That function is called `enumerate()`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# this code:\n", "print \"with zip/range/len:\"\n", "for index, item in zip(range(len(elements)), elements):\n", " print index, item\n", "\n", "print \"\\nwith enumerate:\"\n", "# ... 
can also be written like this:\n", "for index, item in enumerate(elements):\n", " print index, item" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "with zip/range/len:\n", "0 hydrogen\n", "1 helium\n", "2 lithium\n", "3 beryllium\n", "4 boron\n", "\n", "with enumerate:\n", "0 hydrogen\n", "1 helium\n", "2 lithium\n", "3 beryllium\n", "4 boron\n" ] } ], "prompt_number": 205 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Functions as values\n", "\n", "As you're aware, you can define a function in Python using the `def` keyword and an indented code block. For example, here's a function `first()` which returns the first item of a list or string:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def first(t):\n", " return t[0]\n", "\n", "first(\"all of these wonderful characters\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 206, "text": [ "'a'" ] } ], "prompt_number": 206 }, { "cell_type": "markdown", "metadata": {}, "source": [ "What you may not know is that functions are themselves *values*, just like an integer or a floating-point number or a string or a list. Once a function has been defined, the name of a function is a variable that *contains* that value. You can ask Python to print that value out, just like you can ask Python to print the value of a list or string:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print first" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 207 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Printing a function isn't very useful, of course; you just get a string with information about where in memory Python is storing the function's code. But you can do other interesting things. 
For example, you can create a new variable that points to that function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "grab_index_zero = first\n", "grab_index_zero(\"all of these wonderful characters\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 208, "text": [ "'a'" ] } ], "prompt_number": 208 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, we created a new variable `grab_index_zero` and assigned to it the value `first`. Now `grab_index_zero` can be called as a function, just like we can call `first` as a function! This works even for built-in functions like `len()`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "how_many_things_are_in = len\n", "how_many_things_are_in([\"hi\", \"there\", \"how\", \"are\", \"you\"])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 209, "text": [ "5" ] } ], "prompt_number": 209 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Passing function values to functions\n", "\n", "Importantly, because Python functions are values, we can *pass them as parameters* to other Python functions. To illustrate, let's write a function `say_hello` which accomplishes a simple task: it prints out a random greeting." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import random\n", "def say_hello():\n", " greetz = [\"hey\", \"howdy\", \"hello\", \"greetings\", \"yo\", \"hi\"]\n", " print random.choice(greetz) + \"!\"\n", "say_hello()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "greetings!\n" ] } ], "prompt_number": 210 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's write a function called `thrice`. 
This function takes *another* function as a parameter, and calls that function three times:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def thrice(func):\n", " for i in range(3):\n", " func()\n", "\n", "# let's try it out...\n", "thrice(say_hello)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "hey!\n", "greetings!\n", "hi!\n" ] } ], "prompt_number": 211 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `thrice` function has limited utility, of course (if for no other reason than the fact that most Python functions *return* values, instead of just printing them out). But it at least illustrates the concept.\n", "\n", "### Map and filter\n", "\n", "There are two built-in Python functions that I want to mention here, called `map()` and `filter()`. These are functions that operate on lists and take other functions as parameters.\n", "\n", "The `map()` function takes two parameters: a function and a list (or other sequence). 
It returns a new list, which contains the result of calling the given function on every item of the list:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def first(t):\n", " return t[0]\n", "\n", "elements = [\"hydrogen\", \"helium\", \"lithium\", \"beryllium\", \"boron\"]\n", "\n", "# a new list containing the first character of each string\n", "map(first, elements)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 212, "text": [ "['h', 'h', 'l', 'b', 'b']" ] } ], "prompt_number": 212 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `map()` call above is essentially the same thing as this list comprehension:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[first(item) for item in elements]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 213, "text": [ "['h', 'h', 'l', 'b', 'b']" ] } ], "prompt_number": 213 }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's no real reason to choose one idiom over the other (`map()` vs. list comprehension), and you'll often see Python programmers switch between the two. But it's important to be able to recognize that these two bits of code do the same thing." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `filter()` function takes a function and a list (or other sequence), and returns a new list containing only those items from the source list that, when passed as a parameter to the given function, evaluate to `True`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def greater_than_ten(num):\n", " return num > 10\n", "\n", "numbers = [-10, 17, 4, 94, 2, 0, 10]\n", "filter(greater_than_ten, numbers)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 214, "text": [ "[17, 94]" ] } ], "prompt_number": 214 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, this call to `filter()` can be re-written as a list comprehension:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[item for item in numbers if greater_than_ten(item)]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 215, "text": [ "[17, 94]" ] } ], "prompt_number": 215 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lambda functions\n", "\n", "With functions like `map()` and `filter()`, it's very common to write quick, one-off functions simply for the purposes of processing a list. The `greater_than_ten` and `first` functions above are great examples of this: these are tiny functions that only have a `return` statement in them.\n", "\n", "Writing functions like this is *so* common, in fact, that there's a little shorthand for writing them. This shorthand allows you to define a function all in one line, without having to type out the `def` and the `return`. 
It looks like this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# the regular way\n", "def first(t):\n", " return t[0]\n", "\n", "# the \"shorthand\" way\n", "first = lambda t: t[0]\n", "\n", "# test it out!\n", "first(\"cheese\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 216, "text": [ "'c'" ] } ], "prompt_number": 216 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shorthand method is called a \"lambda function.\" (Why \"lambda\"? For secret programming reasons that you can investigate on your own. It's a ~mystery~.) A lambda function is essentially an alternate syntax for defining a function. Schematically, it looks like this:\n", "\n", "> lambda vars: expression\n", "\n", "... where \"lambda\" is the `lambda` keyword, `vars` is a comma-separated list of temporary variable names for parameters passed to the function, and `expression` is the expression that describes what the function will evaluate to.\n", "\n", "Here's a lambda function that takes two parameters:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# squish combines the first item of its first parameter with the last item of its second parameter\n", "squish = lambda one, two: one[0] + two[-1]\n", "squish(\"hi\", \"there\")\n", "\n", "# you could also write \"squish\" the longhand way, like this:\n", "#def squish(one, two):\n", "# return one[0] + two[-1]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 217, "text": [ "'he'" ] } ], "prompt_number": 217 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lambda functions have a serious limitation, though: a lambda function can consist of *only one expression*---you can't have an entire block of statements like you can with a regular function.\n", "\n", "The real utility of this alternate syntax comes from the fact that you can define a lambda function *in-line*: you 
don't have to assign a lambda function to a variable before you use it. So, for example, you can write this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "elements = [\"hydrogen\", \"helium\", \"lithium\", \"beryllium\", \"boron\"]\n", "map(lambda x: x[0], elements)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 218, "text": [ "['h', 'h', 'l', 'b', 'b']" ] } ], "prompt_number": 218 }, { "cell_type": "markdown", "metadata": {}, "source": [ "... or this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "numbers = [-10, 17, 4, 94, 2, 0, 10]\n", "filter(lambda x: x > 10, numbers)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 219, "text": [ "[17, 94]" ] } ], "prompt_number": 219 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sorting\n", "\n", "You can sort a list in two ways. The first is to use the list object's `.sort()` method, which sorts the list in-place:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "elements = [\"hydrogen\", \"helium\", \"lithium\", \"beryllium\", \"boron\"]\n", "elements.sort()\n", "print elements" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['beryllium', 'boron', 'helium', 'hydrogen', 'lithium']\n" ] } ], "prompt_number": 220 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second is to use Python's built-in `sorted()` function, which evaluates to a copy of the list with its elements in order, while leaving the original list unchanged:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "elements = [\"hydrogen\", \"helium\", \"lithium\", \"beryllium\", \"boron\"]\n", "print sorted(elements)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['beryllium', 'boron', 'helium', 'hydrogen', 'lithium']\n" ] } ], 
"prompt_number": 221 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both the `.sort()` method and the `sorted()` function take an optional keyword parameter `reverse` which, if set to `True`, causes the sorting to happen in reverse order:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# with .sort()\n", "numbers = [52, 54, 108, 13, 7, 2]\n", "numbers.sort(reverse=True)\n", "print numbers" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[108, 54, 52, 13, 7, 2]\n" ] } ], "prompt_number": 222 }, { "cell_type": "code", "collapsed": false, "input": [ "# with sorted()\n", "numbers = [52, 54, 108, 13, 7, 2]\n", "sorted(numbers, reverse=True)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 223, "text": [ "[108, 54, 52, 13, 7, 2]" ] } ], "prompt_number": 223 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is all well and good, but what if we want to sort by some criterion other than plain numerical or alphabetical order? Say, for example, we had the following list of tuples describing state populations, and we wanted to sort the list by population. 
Just using `.sort()` or `sorted()` doesn't return the desired result:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "states = [\n", " ('Alabama', 4849377),\n", " ('Alaska', 736732),\n", " ('Arizona', 6731484),\n", " ('Arkansas', 2966369),\n", " ('California', 38802500)\n", "]\n", "# doesn't sort based on population!\n", "sorted(states)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 224, "text": [ "[('Alabama', 4849377),\n", " ('Alaska', 736732),\n", " ('Arizona', 6731484),\n", " ('Arkansas', 2966369),\n", " ('California', 38802500)]" ] } ], "prompt_number": 224 }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we need is some way to tell Python which part of the data to look at when performing the sort. Python provides a way to do this with the `key` parameter, which you can pass to either `.sort()` or `sorted()`. The value passed to the `key` parameter should be a function. When sorting the list, Python will evaluate this function for each item in the list, and will decide how that item should be sorted based on the value returned from the function.\n", "\n", "So, to perform the task outlined above (sorting the list by the second item in each tuple), we could do something like this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def get_second(t):\n", " return t[1]\n", "\n", "sorted(states, key=get_second)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 225, "text": [ "[('Alaska', 736732),\n", " ('Arkansas', 2966369),\n", " ('Alabama', 4849377),\n", " ('Arizona', 6731484),\n", " ('California', 38802500)]" ] } ], "prompt_number": 225 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we specified the `key` parameter, Python calls the `get_second` function for each item in the list. 
The result of this function (i.e., the second value in the tuple) is then used when sorting the list. We can rewrite this more succinctly using a lambda function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sorted(states, key=lambda t: t[1])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 226, "text": [ "[('Alaska', 736732),\n", " ('Arkansas', 2966369),\n", " ('Alabama', 4849377),\n", " ('Arizona', 6731484),\n", " ('California', 38802500)]" ] } ], "prompt_number": 226 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The expression `lambda t: t[1]` is just a shorter way of writing the function `get_second` above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \"Sorting\" dictionaries by value\n", "\n", "It's common to use Python dictionaries to count things---say, for example, how often words are repeated in a given source text. You'll often end up with a dictionary that looks something like this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "word_counts = {'it': 123, 'was': 48, 'the': 423, 'best': 7, 'worst': 13, 'of': 350, 'times': 2}" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 227 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have data like this, it's only natural to want to see, e.g., what the most common word is and what the least common word is. It should be simple enough to do this, right? Just pass the dictionary to the `sorted()` function!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sorted(word_counts)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 228, "text": [ "['best', 'it', 'of', 'the', 'times', 'was', 'worst']" ] } ], "prompt_number": 228 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmm. That didn't work. It looks like Python is sorting the dictionary... in alphabetical order? 
Which is weird. Actually, what's happening is that when you pass a dictionary to `sorted()`, Python implicitly assumes you meant to sort just the *keys* of the dictionary---and, in this case, it sorts them in alphabetical order, because we haven't specified an alternative order!\n", "\n", "Maybe it would help to step back and remember that dictionaries are an inherently *unordered* data type. So sorting a dictionary doesn't make any sense! What we need is some way to turn a dictionary *into* a sortable data type, like a list. The `.items()` method of the dictionary object does just this: it evaluates to a list of tuples containing the key-value pairs from the dictionary." ] }, { "cell_type": "code", "collapsed": false, "input": [ "word_counts.items()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 229, "text": [ "[('of', 350),\n", " ('it', 123),\n", " ('times', 2),\n", " ('worst', 13),\n", " ('the', 423),\n", " ('was', 48),\n", " ('best', 7)]" ] } ], "prompt_number": 229 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hey now, this is looking familiar! A list of tuples! We just finished learning how to sort lists of tuples by particular members of the tuples. We just need to use the `sorted()` function and specify a `key` parameter that is a function returning the second value from the tuple! Like so:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sorted(word_counts.items(), key=lambda x: x[1])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 230, "text": [ "[('times', 2),\n", " ('best', 7),\n", " ('worst', 13),\n", " ('was', 48),\n", " ('it', 123),\n", " ('of', 350),\n", " ('the', 423)]" ] } ], "prompt_number": 230 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We did it! 
This expression evaluates to a list of tuples from the `word_counts` dictionary, ordered by the *value* for each key in the original dictionary. The least common word (\"times\") is the first item in the list. We can use the `reverse` parameter of `sorted()` to order from most common to least common instead:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sorted(word_counts.items(), key=lambda x: x[1], reverse=True)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 231, "text": [ "[('the', 423),\n", " ('of', 350),\n", " ('it', 123),\n", " ('was', 48),\n", " ('worst', 13),\n", " ('best', 7),\n", " ('times', 2)]" ] } ], "prompt_number": 231 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beautiful. Simply stunning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Putting it all together: an API example\n", "\n", "Let's combine a number of these concepts to get useful information from a web API. For this section, we'll use the [Open Weather Map API](http://openweathermap.org/api). The Open Weather Map API provides several different kinds of data about the weather.\n", "\n", "Let's say that we want to pick which of the next five days will be the best day for our outdoor Data Journalism Picnic. We want to choose the day based on the day's weather---not rainy, not too hot. (I'm assuming you're running this code in the summer. If you're planning an outdoor picnic in some other season, you may want to change your criteria.) For this purpose, we can use the API endpoint that returns a daily weather forecast for a particular city. [You can read more about how to use this endpoint here](http://openweathermap.org/forecast16). 
Here's a query that gets the daily forecast for the next five days in New York City:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import urllib\n", "import json\n", "\n", "query_url = \"http://api.openweathermap.org/data/2.5/forecast/daily?id=5128581&cnt=5&units=imperial\"\n", "resp = urllib.urlopen(query_url).read()\n", "data = json.loads(resp)\n", "data" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 232, "text": [ "{u'city': {u'coord': {u'lat': 40.714272, u'lon': -74.005966},\n", " u'country': u'US',\n", " u'id': 5128581,\n", " u'name': u'New York',\n", " u'population': 0},\n", " u'cnt': 5,\n", " u'cod': u'200',\n", " u'list': [{u'clouds': 68,\n", " u'deg': 177,\n", " u'dt': 1436893200,\n", " u'humidity': 60,\n", " u'pressure': 1000.31,\n", " u'rain': 1.4,\n", " u'speed': 6.09,\n", " u'temp': {u'day': 80.47,\n", " u'eve': 75.45,\n", " u'max': 80.47,\n", " u'min': 71.26,\n", " u'morn': 77.47,\n", " u'night': 71.26},\n", " u'weather': [{u'description': u'light rain',\n", " u'icon': u'10d',\n", " u'id': 500,\n", " u'main': u'Rain'}]},\n", " {u'clouds': 92,\n", " u'deg': 3,\n", " u'dt': 1436979600,\n", " u'humidity': 77,\n", " u'pressure': 998.18,\n", " u'rain': 0.58,\n", " u'speed': 3.93,\n", " u'temp': {u'day': 79.99,\n", " u'eve': 70.9,\n", " u'max': 79.99,\n", " u'min': 65.43,\n", " u'morn': 71.76,\n", " u'night': 65.43},\n", " u'weather': [{u'description': u'light rain',\n", " u'icon': u'10d',\n", " u'id': 500,\n", " u'main': u'Rain'}]},\n", " {u'clouds': 0,\n", " u'deg': 13,\n", " u'dt': 1437066000,\n", " u'humidity': 51,\n", " u'pressure': 1010.92,\n", " u'speed': 4.45,\n", " u'temp': {u'day': 74.95,\n", " u'eve': 70.14,\n", " u'max': 76.12,\n", " u'min': 59.11,\n", " u'morn': 63.81,\n", " u'night': 59.11},\n", " u'weather': [{u'description': u'sky is clear',\n", " u'icon': u'01d',\n", " u'id': 800,\n", " u'main': u'Clear'}]},\n", " {u'clouds': 8,\n", " u'deg': 179,\n", " u'dt': 
1437152400,\n", " u'humidity': 52,\n", " u'pressure': 1013.8,\n", " u'speed': 4.13,\n", " u'temp': {u'day': 79.12,\n", " u'eve': 72.52,\n", " u'max': 79.12,\n", " u'min': 62.98,\n", " u'morn': 66.36,\n", " u'night': 62.98},\n", " u'weather': [{u'description': u'sky is clear',\n", " u'icon': u'02d',\n", " u'id': 800,\n", " u'main': u'Clear'}]},\n", " {u'clouds': 4,\n", " u'deg': 204,\n", " u'dt': 1437238800,\n", " u'humidity': 0,\n", " u'pressure': 1010.92,\n", " u'rain': 9.19,\n", " u'speed': 6.63,\n", " u'temp': {u'day': 82.94,\n", " u'eve': 77.65,\n", " u'max': 82.94,\n", " u'min': 72.09,\n", " u'morn': 72.09,\n", " u'night': 74.84},\n", " u'weather': [{u'description': u'moderate rain',\n", " u'icon': u'10d',\n", " u'id': 501,\n", " u'main': u'Rain'}]}],\n", " u'message': 0.0306}" ] } ], "prompt_number": 232 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The part of this data structure that we're interested in is the `list` attribute of the top-level dictionary, which contains a list of dictionaries describing the weather on a particular day. 
We'll create a variable `days` which points at just this information:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "days = data['list']\n", "# each item in \"days\" is a dictionary with weather information for that day\n", "days[0]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 233, "text": [ "{u'clouds': 68,\n", " u'deg': 177,\n", " u'dt': 1436893200,\n", " u'humidity': 60,\n", " u'pressure': 1000.31,\n", " u'rain': 1.4,\n", " u'speed': 6.09,\n", " u'temp': {u'day': 80.47,\n", " u'eve': 75.45,\n", " u'max': 80.47,\n", " u'min': 71.26,\n", " u'morn': 77.47,\n", " u'night': 71.26},\n", " u'weather': [{u'description': u'light rain',\n", " u'icon': u'10d',\n", " u'id': 500,\n", " u'main': u'Rain'}]}" ] } ], "prompt_number": 233 }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Consult the API documentation](http://openweathermap.org/forecast16) to learn what each key/value pair in the dictionary represents. There's a lot of data here, and we're not necessarily interested in all of it. The data items that we *are* interested in are:\n", "\n", "* `dt` is a UNIX timestamp that indicates which day the forecast applies to\n", "* `humidity` indicates the day's humidity\n", "* `temp['min']` and `temp['max']` are the day's minimum and maximum temperatures, respectively\n", "* `weather[0]['description']` has a brief description of the day's weather\n", "\n", "To simplify the task of analyzing this data, let's create a new list of dictionaries that only has the parts of the list-item dictionaries that we're most interested in. 
We'll do this in a `for` loop:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import datetime\n", "\n", "def timestamp_to_date(dt):\n", " return datetime.datetime.fromtimestamp(dt).date().isoformat()\n", "\n", "cleaned = list()\n", "for item in days:\n", " new_item = {\n", " 'date': timestamp_to_date(item['dt']),\n", " 'max_temp': item['temp']['max'],\n", " 'min_temp': item['temp']['min'],\n", " 'humidity': item['humidity'],\n", " 'description': item['weather'][0]['description']\n", " }\n", " cleaned.append(new_item)\n", "cleaned" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 234, "text": [ "[{'date': '2015-07-14',\n", " 'description': u'light rain',\n", " 'humidity': 60,\n", " 'max_temp': 80.47,\n", " 'min_temp': 71.26},\n", " {'date': '2015-07-15',\n", " 'description': u'light rain',\n", " 'humidity': 77,\n", " 'max_temp': 79.99,\n", " 'min_temp': 65.43},\n", " {'date': '2015-07-16',\n", " 'description': u'sky is clear',\n", " 'humidity': 51,\n", " 'max_temp': 76.12,\n", " 'min_temp': 59.11},\n", " {'date': '2015-07-17',\n", " 'description': u'sky is clear',\n", " 'humidity': 52,\n", " 'max_temp': 79.12,\n", " 'min_temp': 62.98},\n", " {'date': '2015-07-18',\n", " 'description': u'moderate rain',\n", " 'humidity': 0,\n", " 'max_temp': 82.94,\n", " 'min_temp': 72.09}]" ] } ], "prompt_number": 234 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Much better! Now we have just the data that we're interested in.\n", "\n", "> SIDEBAR: UNIX timestamps? \"Datetime\"? What is all this nonsense? This is really a subject for its own tutorial, but here's the brief. Computers usually internally represent time as a number. One of the most common ways of indicating time is to count how many seconds have passed since the [\"UNIX epoch\"](https://en.wikipedia.org/wiki/Unix_time), or January 1st, 1970 at 12:00am UTC. The \"timestamp\" that the Open Weather API returns is such an integer. 
In order to convert this integer into a readable date, I used Python's built-in [`datetime`](https://docs.python.org/2/library/datetime.html) module. The code in `timestamp_to_date()` looks convoluted, but basically boils down to \"convert this weird UNIX timestamp into a readable representation of the date it corresponds to.\"\n", "\n", "With our revised, cleaned data, we can start doing some fun tricks. First, let's sort the days by their high temperature, in ascending order:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sorted(cleaned, key=lambda x: x['max_temp'])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 235, "text": [ "[{'date': '2015-07-16',\n", " 'description': u'sky is clear',\n", " 'humidity': 51,\n", " 'max_temp': 76.12,\n", " 'min_temp': 59.11},\n", " {'date': '2015-07-17',\n", " 'description': u'sky is clear',\n", " 'humidity': 52,\n", " 'max_temp': 79.12,\n", " 'min_temp': 62.98},\n", " {'date': '2015-07-15',\n", " 'description': u'light rain',\n", " 'humidity': 77,\n", " 'max_temp': 79.99,\n", " 'min_temp': 65.43},\n", " {'date': '2015-07-14',\n", " 'description': u'light rain',\n", " 'humidity': 60,\n", " 'max_temp': 80.47,\n", " 'min_temp': 71.26},\n", " {'date': '2015-07-18',\n", " 'description': u'moderate rain',\n", " 'humidity': 0,\n", " 'max_temp': 82.94,\n", " 'min_temp': 72.09}]" ] } ], "prompt_number": 235 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get the date of the coolest day by grabbing the first item from the sorted list and accessing its `date` key, like so:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "by_temp = sorted(cleaned, key=lambda x: x['max_temp'])\n", "by_temp[0]['date']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 236, "text": [ "'2015-07-16'" ] } ], "prompt_number": 236 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or 
as one expression:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sorted(cleaned, key=lambda x: x['max_temp'])[0]['date']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 237, "text": [ "'2015-07-16'" ] } ], "prompt_number": 237 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's filter the list so that it contains only the days with no rain in the forecast. You might accomplish this like so:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "filter(lambda x: \"rain\" not in x['description'], cleaned)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 238, "text": [ "[{'date': '2015-07-16',\n", " 'description': u'sky is clear',\n", " 'humidity': 51,\n", " 'max_temp': 76.12,\n", " 'min_temp': 59.11},\n", " {'date': '2015-07-17',\n", " 'description': u'sky is clear',\n", " 'humidity': 52,\n", " 'max_temp': 79.12,\n", " 'min_temp': 62.98}]" ] } ], "prompt_number": 238 }, { "cell_type": "markdown", "metadata": {}, "source": [ "You could also write this as a list comprehension, of course:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[day for day in cleaned if \"rain\" not in day['description']]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 239, "text": [ "[{'date': '2015-07-16',\n", " 'description': u'sky is clear',\n", " 'humidity': 51,\n", " 'max_temp': 76.12,\n", " 'min_temp': 59.11},\n", " {'date': '2015-07-17',\n", " 'description': u'sky is clear',\n", " 'humidity': 52,\n", " 'max_temp': 79.12,\n", " 'min_temp': 62.98}]" ] } ], "prompt_number": 239 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we wanted an overview of all the kinds of weather that we might experience in the given forecast range, we could generate a set of all unique descriptions:" ] }, { "cell_type": "code", "collapsed": false, "input": [ 
"set([day['description'] for day in cleaned])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 240, "text": [ "{u'light rain', u'moderate rain', u'sky is clear'}" ] } ], "prompt_number": 240 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, we could make a dictionary that maps dates to humidity:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "dict([(day['date'], day['humidity']) for day in cleaned])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 241, "text": [ "{'2015-07-14': 60,\n", " '2015-07-15': 77,\n", " '2015-07-16': 51,\n", " '2015-07-17': 52,\n", " '2015-07-18': 0}" ] } ], "prompt_number": 241 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above could also be written as a dictionary comprehension:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "date_humidity = {day['date']: day['humidity'] for day in cleaned}\n", "date_humidity" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 242, "text": [ "{'2015-07-14': 60,\n", " '2015-07-15': 77,\n", " '2015-07-16': 51,\n", " '2015-07-17': 52,\n", " '2015-07-18': 0}" ] } ], "prompt_number": 242 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a dictionary, we can easily look up the humidity for a given date:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "date_humidity['2015-07-17']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 243, "text": [ "52" ] } ], "prompt_number": 243 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "I hope this has been a helpful overview! 
Here's some further reading:\n", "\n", "* [Official Python tutorial](https://docs.python.org/2/tutorial/datastructures.html) on data structures (like lists, sets and dictionaries)\n", "* [An introduction to list comprehensions in Python](http://carlgroner.me/Python/2011/11/09/An-Introduction-to-List-Comprehensions-in-Python.html)\n", "* [Tuples](http://openbookproject.net/thinkcs/python/english3e/tuples.html) from [How to Think Like a Computer Scientist](http://openbookproject.net/thinkcs/python/english3e/index.html)\n", "* [Lambda, filter, map, reduce](http://www.python-course.eu/lambda.php)\n", "* [Zip, map and lambda](https://bradmontgomery.net/blog/2013/04/01/pythons-zip-map-and-lambda/)" ] } ], "metadata": {} } ] }