{ "metadata": { "name": "MongoDB" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": "#MongoDB\n\nIt's about time for us to stop talking about data and start talking about databases. In particular, we're going to talk about database software called [MongoDB](http://www.mongodb.org/). MongoDB is very versatile database software, used widely for a number of applications. It's especially good for operating on large amounts of data and data that needs to be used at scale (i.e., by many users simultaneously).\n\nThe purpose of this tutorial is to show you the basics of working with MongoDB. We'll cover how to insert documents into a MongoDB database and how to get lists of documents back from the database that match a particular set of criteria. This is barely scratching the surface of MongoDB's potential! But hopefully by the end of this session you'll have enough literacy in how MongoDB works to explore on your own its more advanced features and applications.\n\n##Why use a database?\n\nIt seems like we've been doing fine so far in class just working with CSV files and web APIs. So what exactly is a database, and why might we need one? For our purpose, we can define a \"database\" as a piece of software whose main purpose is to make it possible for us to store data somewhere, then later retrieve it, usually in a way that pays attention to the structure of the data itself. We haven't used \"databases\" per se in class, opting instead to download data from CSVs and web APIs straight into our Python programs, then discard our local version of the data when we're done with it. There are several reasons we might want to put this data into a database instead:\n\n* Persistence. With the programs we've written so far in our notebooks, we download data, process it into a form that we like, draw conclusions from it and then... it disappears forever, once we close the notebook. To get that data again, we have to *download* and *process* it again, from scratch. This is fine for small amounts of data, but with larger amounts it can be very time-consuming. Having separate database software allows us to store our data in a way that persists from one notebook session to the next. Very convenient.\n\n* Sharing. Another problem with downloading and processing data on demand is that it's difficult for us to share the result of our data processing with other people. The data exists in our IPython notebook and nowhere else---there's no easy way to let someone else access it. A database like MongoDB, on the other hand, can be used by many people simultaneously. It's also easy to create a \"dump\" of a MongoDB database's contents and send it to a colleague, who can then reconstruct the data on their own server with a minimum amount of hassle.\n\n* Performance. Many databases, like MongoDB, boast features (like indexing, aggregation, and map-reduce) that can make accessing and processing data very fast, faster than we could do in Python on our own.\n\n##Client/server\n\nMongoDB is \"client/server\" software, which means that the software itself runs on a server somewhere, and various clients on other computers can access it. The clients each talk to the server over the network, with a particular protocol unique to MongoDB. (Most database software works like this, but there are some exceptions, like SQLite, which work with files stored locally on your machine.)\n\nWe're going to write our \"client\" software in IPython Notebook, using a library called `pymongo`. The `pymongo` library gives us an easy way to write Python code that opens a network connection to the server, sends it commands using the MongoDB protocol, and interprets the results that come back.\n\nThe `pymongo` library is pre-installed on your EC2 instance, but you can install it (using pip) on other machines like so:\n\n pip install pymongo\n\nAs a quick note---in this class, the \"server\" software (MongoDB itself) and the \"client\" software (the Python code running in your notebook) both live on the same machine (i.e., your EC2 server). When you see the word `localhost` below, that's what it means---`localhost` is a special word in Internet talk that means \"connect to the same server that I'm running on.\" Other than that word, though, everything you'll learn here applies to connecting to MongoDB on non-localhost servers.\n\n##How MongoDB is structured\n\nMongoDB is a \"document-based\" database. MongoDB \"documents\" are essentially Python dictionaries: a lists of key/value pairs that describe some particular thing. Documents are stored in a structure called a \"collection,\" which is essentially like a list of dictionaries in Python. Most of the work we do in MongoDB will be adding documents to a collection, and then asking that collection to return documents that match particular criteria.\n\nCollections themselves are grouped into \"databases,\" and each MongoDB server can support multiple databases.\n\n##Connect to MongoDB and insert a document\n\nOkay enough prefatory material, let's get to the meat. First, we'll import `pymongo` and call its `Connection` function, which returns a new object that represents our network connection to the server. We'll pass the string `localhost` as the first argument, which tells Python to connect to the MongoDB server running on your own EC2 machine." }, { "cell_type": "code", "collapsed": false, "input": "import pymongo\n\nconn = pymongo.Connection(\"localhost\")\nprint type(conn)", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "\n" } ], "prompt_number": 126 }, { "cell_type": "markdown", "metadata": {}, "source": "One thing you can do with a `Connection` object is call its `.database_names()` method, which returns a list of all databases on the server." }, { "cell_type": "code", "collapsed": false, "input": "conn.database_names()", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 128, "text": "[u'local']" } ], "prompt_number": 128 }, { "cell_type": "markdown", "metadata": {}, "source": "You should see only one database right now---`local`. The `local` database is for MongoDB's internal use, so we won't mess with it. Instead, we'll use the `Connection` object to get another object that represents a new database, like so:" }, { "cell_type": "code", "collapsed": false, "input": "db = conn['lede_program']\nprint type(db)", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "\n" } ], "prompt_number": 129 }, { "cell_type": "markdown", "metadata": {}, "source": "Note: We haven't done anything at this point to explicitly create the `lede_program` database! MongoDB automatically creates databases when you first use them.\n\nThis `Database` object supports several interesting methods, among them `.collection_names()`, which shows all of the collections in this database:" }, { "cell_type": "code", "collapsed": false, "input": "db.collection_names()", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 130, "text": "[]" } ], "prompt_number": 130 }, { "cell_type": "markdown", "metadata": {}, "source": "It's an empty list right now (except maybe for a `system.indexes` collection, which is for internal MongoDB use and which you can ignore for now), because we haven't made any collections yet! Using the `Database` object as a dictionary, we can get an object representing a collection:" }, { "cell_type": "code", "collapsed": false, "input": "collection = db['kittens']\nprint type(collection)", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "\n" } ], "prompt_number": 131 }, { "cell_type": "markdown", "metadata": {}, "source": "Now we're in business. Let's insert our first document into the collection, using the collection's `.insert()` method. In between the parentheses of the `.insert()` method, we need to supply an expression that evaluates to a Python dictionary. PyMongo will convert this dictionary into a MongoDB document, and then add that document to the collection. Calling the `.insert()` method evaluates to a MongoDB `ObjectId` object, which contains a randomly generated number that uniquely identifies the record that we just added." }, { "cell_type": "code", "collapsed": false, "input": "collection.insert({\"name\": \"Fluffy\", \"favorite_color\": \"chartreuse\", \"lbs\": 9.5})", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 132, "text": "ObjectId('53b2a26e2735fe2db55a9871')" } ], "prompt_number": 132 }, { "cell_type": "markdown", "metadata": {}, "source": "Let's insert a few more records!" }, { "cell_type": "code", "collapsed": false, "input": "collection.insert({\"name\": \"Monsieur Whiskeurs\", \"favorite_color\": \"cerulean\", \"lbs\": 10.8})\ncollection.insert({\"name\": \"Grandpa Pants\", \"favorite_color\": \"mauve\", \"lbs\": 14.1})\ncollection.insert({\"name\": \"Susan B. Meownthony\", \"favorite_color\": \"cerulean\", \"lbs\": 9.0})", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 133, "text": "ObjectId('53b2a2702735fe2db55a9874')" } ], "prompt_number": 133 }, { "cell_type": "markdown", "metadata": {}, "source": "##Finding a document\n\nOf course, inserting documents on its own is not very useful. We'd like to be able to retrieve them later. To do so, we can use the `.find_one()` method of a collection object. Between the parentheses of the `.find_one()` call, we give a Python dictionary that tells MongoDB which documents to return. The `.find_one()` evaluates to the document that has an exact match for whichever key/value pairs are specified in the dictionary. To demonstrate:" }, { "cell_type": "code", "collapsed": false, "input": "collection.find_one({\"name\": \"Monsieur Whiskeurs\"})", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 134, "text": "{u'_id': ObjectId('53b2a2702735fe2db55a9872'),\n u'favorite_color': u'cerulean',\n u'lbs': 10.8,\n u'name': u'Monsieur Whiskeurs'}" } ], "prompt_number": 134 }, { "cell_type": "markdown", "metadata": {}, "source": "If more than one document had the value `Monsieur Whiskeurs` for the key `name`, MongoDB would have returned only the first matching document. If no documents match, this happens:" }, { "cell_type": "code", "collapsed": false, "input": "val = collection.find_one({\"name\": \"Big Shoes\"})\nprint val", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "None\n" } ], "prompt_number": 135 }, { "cell_type": "markdown", "metadata": {}, "source": "... the method evaluates to `None`.\n\nYou may have noticed the key `_id` in the document above. We didn't specify that key when we created the document, so where did it come from? It turns out that unless we specify the `_id` key manually, MongoDB will add it automatically and give it a randomly generated and unique `ObjectId` object as a value.\n\nLet's do that `.find_one()` call again and see what else we can do with it." }, { "cell_type": "code", "collapsed": false, "input": "doc = collection.find_one({\"name\": \"Monsieur Whiskeurs\"})\nprint type(doc)\nprint doc['favorite_color']", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "\ncerulean\n" } ], "prompt_number": 136 }, { "cell_type": "markdown", "metadata": {}, "source": "As you can see, the value returned from `.find_one()` is just a Python dictionary. We can use it in any of the ways we usually use Python dictionaries---by getting a value for one of its keys, for example.\n\n> EXERCISE: Use the `.find_one()` method to print out the `favorite_color` value for our kitten named `Grandpa Pants`.\n\n##Finding more than one document\n\nThe collection object has a method `.find()` that allows you to access every document in the collection. It doesn't return a list, but a weird thing called a `Cursor`. To get data from a cursor, you either have to use it in a `for` loop like this:" }, { "cell_type": "code", "collapsed": false, "input": "for doc in collection.find():\n print doc", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "{u'favorite_color': u'chartreuse', u'_id': ObjectId('53b2a26e2735fe2db55a9871'), u'name': u'Fluffy', u'lbs': 9.5}\n{u'favorite_color': u'cerulean', u'_id': ObjectId('53b2a2702735fe2db55a9872'), u'name': u'Monsieur Whiskeurs', u'lbs': 10.8}\n{u'favorite_color': u'mauve', u'_id': ObjectId('53b2a2702735fe2db55a9873'), u'name': u'Grandpa Pants', u'lbs': 14.1}\n{u'favorite_color': u'cerulean', u'_id': ObjectId('53b2a2702735fe2db55a9874'), u'name': u'Susan B. Meownthony', u'lbs': 9.0}\n" } ], "prompt_number": 137 }, { "cell_type": "markdown", "metadata": {}, "source": "... or explicitly convert it to a list, with the `list()` function:" }, { "cell_type": "code", "collapsed": false, "input": "documents = list(collection.find())\ndocuments", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 138, "text": "[{u'_id': ObjectId('53b2a26e2735fe2db55a9871'),\n u'favorite_color': u'chartreuse',\n u'lbs': 9.5,\n u'name': u'Fluffy'},\n {u'_id': ObjectId('53b2a2702735fe2db55a9872'),\n u'favorite_color': u'cerulean',\n u'lbs': 10.8,\n u'name': u'Monsieur Whiskeurs'},\n {u'_id': ObjectId('53b2a2702735fe2db55a9873'),\n u'favorite_color': u'mauve',\n u'lbs': 14.1,\n u'name': u'Grandpa Pants'},\n {u'_id': ObjectId('53b2a2702735fe2db55a9874'),\n u'favorite_color': u'cerulean',\n u'lbs': 9.0,\n u'name': u'Susan B. Meownthony'}]" } ], "prompt_number": 138 }, { "cell_type": "markdown", "metadata": {}, "source": "We can also pass a dictionary to `.find()` to tell MongoDB to only return a subset of documents, namely, only those documents that match the key/value pairs in the dictionary we put in the parentheses. For example, to fetch only those kittens whose `favorite_color` is `cerulean`:" }, { "cell_type": "code", "collapsed": false, "input": "cerulean_lovers = list(collection.find({'favorite_color': 'cerulean'}))\ncerulean_lovers", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 139, "text": "[{u'_id': ObjectId('53b2a2702735fe2db55a9872'),\n u'favorite_color': u'cerulean',\n u'lbs': 10.8,\n u'name': u'Monsieur Whiskeurs'},\n {u'_id': ObjectId('53b2a2702735fe2db55a9874'),\n u'favorite_color': u'cerulean',\n u'lbs': 9.0,\n u'name': u'Susan B. Meownthony'}]" } ], "prompt_number": 139 }, { "cell_type": "markdown", "metadata": {}, "source": "> EXERCISE: Write a list comprehension that evaluates to a list of the names of kittens whose favorite color is cerulean.\n\n##Simple aggregations\n\nYou can ask MongoDB how many documents are in a collection with the collection's `.count()` method:" }, { "cell_type": "code", "collapsed": false, "input": "collection.count()", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 140, "text": "4" } ], "prompt_number": 140 }, { "cell_type": "markdown", "metadata": {}, "source": "It's also easy to get a list of distinct values there are for a particular field, using the `distinct` method:" }, { "cell_type": "code", "collapsed": false, "input": "collection.distinct(\"favorite_color\")", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 141, "text": "[u'chartreuse', u'cerulean', u'mauve']" } ], "prompt_number": 141 }, { "cell_type": "markdown", "metadata": {}, "source": "##Removing documents\n\nYou can remove a single document from a collection with the `.remove()` method, passing in a dictionary that describes which documents you want to remove. For example, to `.remove()` documents where the `name` key has the value `Fluffy`:" }, { "cell_type": "code", "collapsed": false, "input": "collection.remove({'name': 'Fluffy'})\nlist(collection.find())", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": "[{u'_id': ObjectId('53b07f092735fe2d9e61325d'),\n u'favorite_color': u'cerulean',\n u'name': u'Monsieur Whiskeurs'},\n {u'_id': ObjectId('53b07f092735fe2d9e61325e'),\n u'favorite_color': u'mauve',\n u'name': u'Grandpa Pants'},\n {u'_id': ObjectId('53b07f092735fe2d9e61325f'),\n u'favorite_color': u'cerulean',\n u'name': u'Susan B. Meownthony'}]" } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": "You can see that `Fluffy` has now gone missing. You can also easily remove *all* documents from a collection, using the `.remove()` method without any parameters. *WARNING*: Don't run this cell unless you want to remove everything you've inserted so far!" }, { "cell_type": "code", "collapsed": false, "input": "collection.remove()\nlist(collection.find())", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": "[]" } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": "##More sophisticated queries\n\nWe can be a bit more specific about which documents we want from the collection using MongoDB *query selectors*. Query selectors take the form of dictionaries that we pass to the `.find()` method. Keys in this dictionary should be the field that you want to match against, and the value for such a key should be *another* dictionary, that has as its key a MongoDB query operator (listed below), and as its value the number to go with the operator. Here's an example, to make it more clear, searching our collection of kittens for documents where the `lbs` field is greater than `10`:" }, { "cell_type": "code", "collapsed": false, "input": "list(collection.find({'lbs': {'$gt': 10}}))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 142, "text": "[{u'_id': ObjectId('53b2a2702735fe2db55a9872'),\n u'favorite_color': u'cerulean',\n u'lbs': 10.8,\n u'name': u'Monsieur Whiskeurs'},\n {u'_id': ObjectId('53b2a2702735fe2db55a9873'),\n u'favorite_color': u'mauve',\n u'lbs': 14.1,\n u'name': u'Grandpa Pants'}]" } ], "prompt_number": 142 }, { "cell_type": "markdown", "metadata": {}, "source": "Other operators that are supported ([full list here](http://docs.mongodb.org/manual/reference/operator/query/#query-selectors)):\n\n* `$gt` greater than\n* `$gte` greater than or equal\n* `$lt` less than\n* `$lte` less than or equal\n* `$ne` *not* equal to\n\nYou can combine more than one operator for a particular field, in which case MongoDB will find documents that match *all* criteria:" }, { "cell_type": "code", "collapsed": false, "input": "list(collection.find({'lbs': {'$gt': 9, '$lt': 10.8}}))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 143, "text": "[{u'_id': ObjectId('53b2a26e2735fe2db55a9871'),\n u'favorite_color': u'chartreuse',\n u'lbs': 9.5,\n u'name': u'Fluffy'}]" } ], "prompt_number": 143 }, { "cell_type": "markdown", "metadata": {}, "source": "You can also include conditions for more than one field in the dictionary, in which case MongoDB will find documents that match those criteria for each respective field:" }, { "cell_type": "code", "collapsed": false, "input": "list(collection.find({'favorite_color': 'cerulean'}))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 144, "text": "[{u'_id': ObjectId('53b2a2702735fe2db55a9872'),\n u'favorite_color': u'cerulean',\n u'lbs': 10.8,\n u'name': u'Monsieur Whiskeurs'},\n {u'_id': ObjectId('53b2a2702735fe2db55a9874'),\n u'favorite_color': u'cerulean',\n u'lbs': 9.0,\n u'name': u'Susan B. Meownthony'}]" } ], "prompt_number": 144 }, { "cell_type": "code", "collapsed": false, "input": "list(collection.find({'favorite_color': 'cerulean', 'name': {'$ne': 'Monsieur Whiskeurs'}}))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 145, "text": "[{u'_id': ObjectId('53b2a2702735fe2db55a9874'),\n u'favorite_color': u'cerulean',\n u'lbs': 9.0,\n u'name': u'Susan B. Meownthony'}]" } ], "prompt_number": 145 }, { "cell_type": "markdown", "metadata": {}, "source": "###Regular expression searches\n\nAnother valuable search criterion that MongoDB supports is `$regex`, which will return documents that match a regular expression for a particular field. For example, to find all kittens whose name ends with the letter `y`:" }, { "cell_type": "code", "collapsed": false, "input": "list(collection.find({'name': {'$regex': 'y$'}}))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 146, "text": "[{u'_id': ObjectId('53b2a26e2735fe2db55a9871'),\n u'favorite_color': u'chartreuse',\n u'lbs': 9.5,\n u'name': u'Fluffy'},\n {u'_id': ObjectId('53b2a2702735fe2db55a9874'),\n u'favorite_color': u'cerulean',\n u'lbs': 9.0,\n u'name': u'Susan B. Meownthony'}]" } ], "prompt_number": 146 }, { "cell_type": "markdown", "metadata": {}, "source": "> EXERCISE: Write a call to `.find()` that returns all kittens whose favorite color begins with the letter `c`." }, { "cell_type": "markdown", "metadata": {}, "source": "##Sorting and limiting\n\nResults from `.find()` aren't returned in a particular order. You may find it helpful for this reason to sort the results. You can specify a sort order for results from the `.find()` method by tacking on a `.sort()` call to the end. It looks like this:" }, { "cell_type": "code", "collapsed": false, "input": "list(collection.find().sort('lbs'))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 147, "text": "[{u'_id': ObjectId('53b2a2702735fe2db55a9874'),\n u'favorite_color': u'cerulean',\n u'lbs': 9.0,\n u'name': u'Susan B. Meownthony'},\n {u'_id': ObjectId('53b2a26e2735fe2db55a9871'),\n u'favorite_color': u'chartreuse',\n u'lbs': 9.5,\n u'name': u'Fluffy'},\n {u'_id': ObjectId('53b2a2702735fe2db55a9872'),\n u'favorite_color': u'cerulean',\n u'lbs': 10.8,\n u'name': u'Monsieur Whiskeurs'},\n {u'_id': ObjectId('53b2a2702735fe2db55a9873'),\n u'favorite_color': u'mauve',\n u'lbs': 14.1,\n u'name': u'Grandpa Pants'}]" } ], "prompt_number": 147 }, { "cell_type": "markdown", "metadata": {}, "source": "The parameter you pass to `.sort()` specifies which field the documents should be sorted by. Specifying descending order is a bit more tricky:" }, { "cell_type": "code", "collapsed": false, "input": "list(collection.find().sort('lbs', -1))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 148, "text": "[{u'_id': ObjectId('53b2a2702735fe2db55a9873'),\n u'favorite_color': u'mauve',\n u'lbs': 14.1,\n u'name': u'Grandpa Pants'},\n {u'_id': ObjectId('53b2a2702735fe2db55a9872'),\n u'favorite_color': u'cerulean',\n u'lbs': 10.8,\n u'name': u'Monsieur Whiskeurs'},\n {u'_id': ObjectId('53b2a26e2735fe2db55a9871'),\n u'favorite_color': u'chartreuse',\n u'lbs': 9.5,\n u'name': u'Fluffy'},\n {u'_id': ObjectId('53b2a2702735fe2db55a9874'),\n u'favorite_color': u'cerulean',\n u'lbs': 9.0,\n u'name': u'Susan B. Meownthony'}]" } ], "prompt_number": 148 }, { "cell_type": "markdown", "metadata": {}, "source": "(The `-1` means 'in reverse order'.) The `.sort()` method works even if you've specified query selectors in the call to `.find()`:" }, { "cell_type": "code", "collapsed": false, "input": "list(collection.find({'lbs': {'$gt': 9.0}}).sort('name'))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 149, "text": "[{u'_id': ObjectId('53b2a26e2735fe2db55a9871'),\n u'favorite_color': u'chartreuse',\n u'lbs': 9.5,\n u'name': u'Fluffy'},\n {u'_id': ObjectId('53b2a2702735fe2db55a9873'),\n u'favorite_color': u'mauve',\n u'lbs': 14.1,\n u'name': u'Grandpa Pants'},\n {u'_id': ObjectId('53b2a2702735fe2db55a9872'),\n u'favorite_color': u'cerulean',\n u'lbs': 10.8,\n u'name': u'Monsieur Whiskeurs'}]" } ], "prompt_number": 149 }, { "cell_type": "markdown", "metadata": {}, "source": "You can also limit the number of results returned from `.find()` using the `.limit()` method, which, like `.sort()`, gets tacked on to the end of `.find()`. To return only two kittens:" }, { "cell_type": "code", "collapsed": false, "input": "list(collection.find().limit(2))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 150, "text": "[{u'_id': ObjectId('53b2a26e2735fe2db55a9871'),\n u'favorite_color': u'chartreuse',\n u'lbs': 9.5,\n u'name': u'Fluffy'},\n {u'_id': ObjectId('53b2a2702735fe2db55a9872'),\n u'favorite_color': u'cerulean',\n u'lbs': 10.8,\n u'name': u'Monsieur Whiskeurs'}]" } ], "prompt_number": 150 }, { "cell_type": "markdown", "metadata": {}, "source": "Search for all kittens weighing less than 10 pounds, limit to one result:" }, { "cell_type": "code", "collapsed": false, "input": "list(collection.find({'lbs': {'$lt': 10}}).limit(1))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 151, "text": "[{u'_id': ObjectId('53b2a26e2735fe2db55a9871'),\n u'favorite_color': u'chartreuse',\n u'lbs': 9.5,\n u'name': u'Fluffy'}]" } ], "prompt_number": 151 }, { "cell_type": "markdown", "metadata": {}, "source": "You can put a `.limit()` after a `.sort()` to get only the first few results from a sorted list of documents. So, for example, to get only the heaviest cat:" }, { "cell_type": "code", "collapsed": false, "input": "list(collection.find().sort(\"lbs\", -1).limit(1))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 152, "text": "[{u'_id': ObjectId('53b2a2702735fe2db55a9873'),\n u'favorite_color': u'mauve',\n u'lbs': 14.1,\n u'name': u'Grandpa Pants'}]" } ], "prompt_number": 152 }, { "cell_type": "markdown", "metadata": {}, "source": "##Only get certain fields\n\nIf we want our result to only include certain key/value pairs from the document, we can provide a second argument to the `find` method. This argument should be a dictionary whose keys are the fields we want included, and whose values are all `1`. For example, to find all kittens whose favorite color is `cerulean`, but only return their names, we could do this:" }, { "cell_type": "code", "collapsed": false, "input": "list(collection.find({\"favorite_color\": \"cerulean\"}, {\"name\": 1}))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 153, "text": "[{u'_id': ObjectId('53b2a2702735fe2db55a9872'),\n u'name': u'Monsieur Whiskeurs'},\n {u'_id': ObjectId('53b2a2702735fe2db55a9874'),\n u'name': u'Susan B. Meownthony'}]" } ], "prompt_number": 153 }, { "cell_type": "markdown", "metadata": {}, "source": "The `_id` field is always included by default. If we want to get rid of it, we can include the `_id` key in our list of fields, giving it a `0` (instead of a `1`):" }, { "cell_type": "code", "collapsed": false, "input": "list(collection.find({\"favorite_color\": \"cerulean\"}, {\"name\": 1, \"_id\": 0}))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 154, "text": "[{u'name': u'Monsieur Whiskeurs'}, {u'name': u'Susan B. Meownthony'}]" } ], "prompt_number": 154 }, { "cell_type": "markdown", "metadata": {}, "source": "##Let's get real\n\nI want to take you through a real-world example of consuming data from a source, putting it into MongoDB, then querying MongoDB to find interesting stuff in that data. Specifically, we're going to fetch a big ol' CSV of historic data about congress members from the \"Bulk Data\" section of [govtrack.us](https://www.govtrack.us/developers/data). [Here's the file](https://www.govtrack.us/data/congress-legislators/legislators-historic.csv), which contains a row for every member of Congress in the history of the United States (who isn't currently a sitting member).\n\nFirst, let's retrieve the file to our EC2 machines, as we've done in the past with CSV files:" }, { "cell_type": "code", "collapsed": false, "input": "import urllib\nurllib.urlretrieve(\"https://www.govtrack.us/data/congress-legislators/legislators-historic.csv\",\n \"legislators-historic.csv\")", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 93, "text": "('legislators-historic.csv', )" } ], "prompt_number": 93 }, { "cell_type": "markdown", "metadata": {}, "source": "Let's play with the `csv` library's `DictReader` class to see what the data looks like." }, { "cell_type": "code", "collapsed": false, "input": "import csv\nrows = csv.DictReader(open(\"legislators-historic.csv\"))\nall_rows = list(rows)\nall_rows[0]", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 155, "text": "{'address': '',\n 'ballotpedia_id': '',\n 'bioguide_id': 'B000226',\n 'birthday': '1745-04-02',\n 'contact_form': '',\n 'cspan_id': '',\n 'facebook': '',\n 'facebook_id': '',\n 'first_name': 'Richard',\n 'gender': 'M',\n 'govtrack_id': '401222',\n 'icpsr_id': '507',\n 'last_name': 'Bassett',\n 'lis_id': '',\n 'opensecrets_id': '',\n 'party': 'Anti-Administration',\n 'phone': '',\n 'rss_url': '',\n 'state': 'DE',\n 'thomas_id': '',\n 'twitter': '',\n 'type': 'sen',\n 'url': '',\n 'votesmart_id': '',\n 'washington_post_id': '',\n 'wikipedia_id': '',\n 'youtube': '',\n 'youtube_id': ''}" } ], "prompt_number": 155 }, { "cell_type": "markdown", "metadata": {}, "source": "What we seem to have here is a dictionary that describes a member of congress. This happens to be one [Richard Bassett](http://en.wikipedia.org/wiki/Richard_Bassett_(politician)), born in 1745. So that's pretty cool! I don't really know what most of these fields mean, but we'll take some guesses at them later.\n\nSo how about putting all those rows into MongoDB? Here's how it would go. It's pretty simple! I'm going to create a separate collection in our database for these legislators, called `legislators`." }, { "cell_type": "code", "collapsed": false, "input": "legislators_coll = db['legislators']", "language": "python", "metadata": {}, "outputs": [], "prompt_number": 156 }, { "cell_type": "markdown", "metadata": {}, "source": "Now, I'm going to loop through the rows of the table and just... insert each dictionary from `DictReader` straight into MongoDB. Easy!" }, { "cell_type": "code", "collapsed": false, "input": "for row in all_rows:\n legislators_coll.insert(row)", "language": "python", "metadata": {}, "outputs": [], "prompt_number": 157 }, { "cell_type": "markdown", "metadata": {}, "source": "At this point, the number of documents in the database should match the number of rows in the CSV file. Let's make sure." }, { "cell_type": "code", "collapsed": false, "input": "len(all_rows) == legislators_coll.count()", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 158, "text": "True" } ], "prompt_number": 158 }, { "cell_type": "markdown", "metadata": {}, "source": "And how many exactly is that?" }, { "cell_type": "code", "collapsed": false, "input": "legislators_coll.count()", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 159, "text": "11741" } ], "prompt_number": 159 }, { "cell_type": "markdown", "metadata": {}, "source": "Eleven thousand legislators. Not exactly \"big data,\" I admit, but hopefully you can still see the benefit of having this data in one place without having to re-download and parse the data each time we want to use it.\n\n###Meet the press\n\nLet's do some queries on our data now! For example, let's make a list of all legislators who are women." }, { "cell_type": "code", "collapsed": false, "input": "legislators_coll.find({\"gender\": \"F\"}).count()", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 160, "text": "64" } ], "prompt_number": 160 }, { "cell_type": "markdown", "metadata": {}, "source": "How about a list of legislators who are women, whose party is not `Democrat`?" }, { "cell_type": "code", "collapsed": false, "input": "legislators_coll.find({\"gender\": \"F\", \"party\": {\"$ne\": \"Democrat\"}}).count()", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 161, "text": "30" } ], "prompt_number": 161 }, { "cell_type": "markdown", "metadata": {}, "source": "Let's make a list of these women, including their names, states, and birthdays:" }, { "cell_type": "code", "collapsed": false, "input": "list(legislators_coll.find(\n {\"gender\": \"F\", \"party\": {\"$ne\": \"Democrat\"}},\n {\"first_name\": 1, \"last_name\": 1, \"state\": 1, \"birthday\": 1, \"_id\": 0}))", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 162, "text": "[{u'birthday': u'1958-06-05',\n u'first_name': u'Enid',\n u'last_name': u'Greene Waldholtz',\n u'state': u'UT'},\n {u'birthday': u'1938-01-27',\n u'first_name': u'Helen',\n u'last_name': u'Chenoweth-Hage',\n u'state': u'ID'},\n {u'birthday': u'1931-02-12',\n u'first_name': u'Constance',\n u'last_name': u'Morella',\n u'state': u'MD'},\n {u'birthday': u'1929-09-19',\n u'first_name': u'Marge',\n u'last_name': u'Roukema',\n u'state': u'NJ'},\n {u'birthday': u'1936-07-29',\n u'first_name': u'Elizabeth',\n u'last_name': u'Dole',\n u'state': u'NC'},\n {u'birthday': u'1941-07-29',\n u'first_name': u'Jennifer',\n u'last_name': u'Dunn',\n u'state': u'WA'},\n {u'birthday': u'1957-04-05',\n u'first_name': u'Katherine',\n u'last_name': u'Harris',\n u'state': u'FL'},\n {u'birthday': u'1962-04-04',\n u'first_name': u'Melissa',\n u'last_name': u'Hart',\n u'state': u'PA'},\n {u'birthday': u'1935-01-05',\n u'first_name': u'Nancy',\n u'last_name': u'Johnson',\n u'state': u'CT'},\n {u'birthday': u'1936-09-26',\n u'first_name': u'Sue',\n u'last_name': u'Kelly',\n u'state': u'NY'},\n {u'birthday': u'1948-01-22',\n u'first_name': u'Anne',\n u'last_name': u'Northup',\n u'state': u'KY'},\n {u'birthday': u'1953-06-22',\n u'first_name': u'Shelley',\n u'last_name': u'Sekula-Gibbs',\n u'state': u'TX'},\n {u'birthday': u'1946-11-30',\n u'first_name': u'Barbara',\n u'last_name': u'Cubin',\n u'state': u'WY'},\n {u'birthday': u'1950-06-29',\n u'first_name': u'Jo Ann',\n u'last_name': u'Davis',\n u'state': u'VA'},\n {u'birthday': u'1949-11-20',\n u'first_name': u'Thelma',\n u'last_name': u'Drake',\n u'state': u'VA'},\n {u'birthday': u'1949-01-27',\n u'first_name': u'Marilyn',\n u'last_name': u'Musgrave',\n u'state': u'CO'},\n {u'birthday': u'1951-07-29',\n u'first_name': u'Deborah',\n u'last_name': u'Pryce',\n u'state': u'OH'},\n {u'birthday': u'1960-12-30',\n u'first_name': u'Heather',\n u'last_name': u'Wilson',\n u'state': u'NM'},\n {u'birthday': u'1943-10-05',\n u'first_name': u'Virginia',\n u'last_name': u'Brown-Waite',\n u'state': u'FL'},\n {u'birthday': u'1954-12-09',\n u'first_name': u'Mary',\n u'last_name': u'Fallin',\n u'state': u'OK'},\n {u'birthday': u'1943-07-22',\n u'first_name': u'Kay',\n u'last_name': u'Hutchison',\n u'state': u'TX'},\n {u'birthday': u'1947-02-21',\n u'first_name': u'Olympia',\n u'last_name': u'Snowe',\n u'state': u'ME'},\n {u'birthday': u'1956-12-14',\n u'first_name': u'Sandy',\n u'last_name': u'Adams',\n u'state': u'FL'},\n {u'birthday': u'1937-08-15',\n u'first_name': u'Judy',\n u'last_name': u'Biggert',\n u'state': u'IL'},\n {u'birthday': u'1961-10-24',\n u'first_name': u'Mary',\n u'last_name': u'Bono Mack',\n u'state': u'CA'},\n {u'birthday': u'1951-05-08',\n u'first_name': u'Ann Marie',\n u'last_name': u'Buerkle',\n u'state': u'NY'},\n {u'birthday': u'1959-12-14',\n u'first_name': u'Nan',\n u'last_name': u'Hayworth',\n u'state': u'NY'},\n {u'birthday': u'1941-08-01',\n u'first_name': u'Sue',\n u'last_name': u'Myrick',\n u'state': u'NC'},\n {u'birthday': u'1951-11-29',\n u'first_name': u'Jean',\n u'last_name': u'Schmidt',\n u'state': u'OH'},\n {u'birthday': u'1950-09-16',\n u'first_name': u'Jo Ann',\n u'last_name': u'Emerson',\n u'state': u'MO'}]" } ], "prompt_number": 162 }, { "cell_type": "markdown", "metadata": {}, "source": "How about the youngest five Republican legislators, as determined by their birthday?" }, { "cell_type": "code", "collapsed": false, "input": "list(legislators_coll.find(\n {'party': 'Republican'},\n {'first_name': 1, 'last_name': 1, 'birthday': 1, 'state': 1, '_id': 0}\n ).sort(\"birthday\", -1).limit(5)\n)", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 163, "text": "[{u'birthday': u'1976-11-03',\n u'first_name': u'Ben',\n u'last_name': u'Quayle',\n u'state': u'AZ'},\n {u'birthday': u'1976-04-20',\n u'first_name': u'Trey',\n u'last_name': u'Radel',\n u'state': u'FL'},\n {u'birthday': u'1974-07-31',\n u'first_name': u'Adam',\n u'last_name': u'Putnam',\n u'state': u'FL'},\n {u'birthday': u'1971-06-10',\n u'first_name': u'Bobby',\n u'last_name': u'Jindal',\n u'state': u'LA'},\n {u'birthday': u'1970-12-23',\n u'first_name': u'Jeff',\n u'last_name': u'Landry',\n u'state': u'LA'}]" } ], "prompt_number": 163 }, { "cell_type": "markdown", "metadata": {}, "source": "How about a list of all distinct parties (witness the varieties of American democracy)?" }, { "cell_type": "code", "collapsed": false, "input": "legislators_coll.distinct(\"party\")", "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 164, "text": "[u'Anti-Administration',\n u'',\n u'Pro-Administration',\n u'Republican',\n u'Federalist',\n u'Democratic Republican',\n u'Unknown',\n u'Adams',\n u'Jackson',\n u'Jackson Republican',\n u'Crawford Republican',\n u'Whig',\n u'Anti-Jacksonian',\n u'Adams Democrat',\n u'Nullifier',\n u'Anti Masonic',\n u'Anti Jacksonian',\n u'Jacksonian',\n u'Democrat',\n u'Anti Jackson',\n u'Union Democrat',\n u'Conservative',\n u'Ind. Democrat',\n u'Law and Order',\n u'American',\n u'Liberty',\n u'Free Soil',\n u'Independent',\n u'Ind. Republican-Democrat',\n u'Ind. Whig',\n u'Unionist',\n u'States Rights',\n u'Anti-Lecompton Democrat',\n u'Constitutional Unionist',\n u'Independent Democrat',\n u'Unconditional Unionist',\n u'Conservative Republican',\n u'Ind. Republican',\n u'Liberal Republican',\n u'National Greenbacker',\n u'Readjuster Democrat',\n u'Readjuster',\n u'Union',\n u'Union Labor',\n u'Populist',\n u'Silver Republican',\n u'Free Silver',\n u'Democratic and Union Labor',\n u'Progressive Republican',\n u'Progressive',\n u'Prohibitionist',\n u'Socialist',\n u'Farmer-Labor',\n u'Nonpartisan',\n u'Coalitionist',\n u'Popular Democrat',\n u'American Labor',\n u'New Progressive',\n u'Republican-Conservative',\n u'Democrat-Liberal',\n u'Democrat/Republican',\n u'Democrat Farmer Labor']" } ], "prompt_number": 164 }, { "cell_type": "markdown", "metadata": {}, "source": "> EXERCISE: Investigate [MongoDB's `$nin` operator](http://docs.mongodb.org/manual/reference/operator/query/nin/) to write a MongoDB query that returns a list of the names, states, and parties of all legislators whose party is neither Republican nor Democrat." }, { "cell_type": "markdown", "metadata": {}, "source": "##Where to go next\n\nGreat work---you've learned the basics. Where to go next?\n\n* The [PyMongo tutorial](http://api.mongodb.org/python/current/tutorial.html) covers a lot of the same material that we've covered here, but it's always nice to have a different perspective on this.\n* [Here's another tutorial](http://blog.pythonisito.com/2012/01/moving-along-with-pymongo.html) that looks pretty good.\n* If you want to be able to modify your data after you've imported it, you may want to learn how to [update documents](http://docs.mongodb.org/manual/tutorial/modify-documents/) once they're already in the database.\n* Once you're working with a sufficiently large amount of data, you'll be able to speed up your queries significantly using [indexes](http://docs.mongodb.org/manual/core/indexes-introduction/).\n* We didn't cover some of MongoDB's most powerful features in this tutorial, including its [aggregation](http://docs.mongodb.org/manual/core/aggregation-introduction/) and [map-reduce](http://docs.mongodb.org/manual/core/map-reduce/) features. Both are very handy, but difficult to teach in a Python class, because they often require a bit of JavaScript (!). [Here's a good overview](http://blog.safaribooksonline.com/2013/06/21/aggregation-in-mongodb/) of MongoDB's aggregation framework.\n\nHave fun!" } ], "metadata": {} } ] }