{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MongoDB Tutorial - Monday 6 March 2017" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In what follows, we assume that you have installed MongoDB according to the [instructions online](https://docs.mongodb.org/manual/installation/) and started the mongo daemon with the following command.\n", "\n", "> mongod\n", "\n", "Note that you might have to create a folder `/data/db` with appropriate access rights before the daemon starts successfully.\n", "\n", "We also assume that you use Python 3 and have the [pymongo driver](http://api.mongodb.org/python/current/installation.html) installed.\n", "\n", "** Note ** To run the notebook yourself, [install Jupyter](http://jupyter.readthedocs.org/en/latest/install.html), [download](https://raw.githubusercontent.com/mmathioudakis/moderndb/master/2017/mongodb.tutorial.ipynb) the notebook, and [open it](http://jupyter.readthedocs.org/en/latest/running.html) with Jupyter.\n", "\n", "** Note ** This notebook might be updated later. Major updates will be listed at its bottom." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pymongo as pm\n", "client = pm.MongoClient()\n", "client.drop_database(\"tutorial\")\n", "\n", "import bson.son as son" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## \"Hello World!\" : Databases, Collections, Documents\n", "\n", "*Relational databases* contain *tables* that contain *records*.\n", "\n", "A **MongoDB database** contains **collections** that contain **documents**." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# start a client\n", "client = pm.MongoClient() \n", "\n", "# connect to a database\n", "db = client.tutorial\n", "\n", "# get a collection\n", "coll = db.test_collection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documents follow the [JSON](http://json.org/) format and MongoDB stores them in a binary version of it ([BSON](http://bsonspec.org/)).\n", "\n", "\n", "\n", "\n", "Below you see examples of JSON documents.\n", "\n", "** JSON example 0 **\n", "```\n", "{}\n", "```\n", "\n", "** JSON example 1 **\n", "```\n", "{\n", "\"name\" : \"Michael\",\n", "\"age\": 32,\n", "\"grades\": [71, 85, 90, 34]\n", "}\n", "```\n", "\n", "** JSON example 2 **\n", "\n", "```\n", "{\n", " \"first name\": \"Michael\",\n", " \"last name\": \"Mathioudakis\",\n", " \"age\": 32,\n", " \"grades\": {\n", " \"ModernDB\": 69,\n", " \"Data Mining\": 71,\n", " \"Machine Learning\": 95\n", " },\n", " \"graduated\": true,\n", " \"previous schools\": [\"NTUA\", \"UofT\"]\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Python, JSON documents are represented as dictionaries.\n", "The examples from above are therefore represented as follows." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "example_0 = {}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "example_1 = {\"name\": \"Michael\", \"age\": 32, \"grades\": [71, 85, 90, 34]}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "example_2 = \\\n", "{\"first name\": \"Michael\",\n", " \"last name\": \"Mathioudakis\",\n", " \"age\": 32,\n", " \"grades\": {\n", " \"ModernDB\": 69,\n", " \"Data Mining\": 71,\n", " \"Machine Learning\": 95\n", " },\n", " \"graduated\": True,\n", " \"previous schools\": [\"NTUA\", \"UofT\"]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we can also use native Python objects, like the `datetime` object below, to specify values." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import datetime\n", "example_3 = {\"name\": \"Modern Database Systems\",\n", " \"start\": datetime.datetime(2016, 1, 12),\n", " \"end\": datetime.datetime(2016, 3, 26),\n", " \"tags\": [\"rdbms\", \"mongodb\", \"spark\"]}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inserting and finding documents\n", "\n", "Our collection `coll` is currently empty. Let's add one document to it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "coll.insert_one(example_0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we call the collection's function `find()`, we get back a cursor." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "coll.find()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the cursor to iterate over all documents in the collection." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for doc in coll.find():\n", " print(doc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the empty document we inserted is not *really* empty, but associated with an \"\\_id\" key, added by MongoDB.\n", "\n", "Let's try another one." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "coll.insert_one(example_1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for doc in coll.find():\n", " print(doc)\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how MongoDB added an \"\\_id\" for the new document, as well.\n", "Let's insert more documents." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "coll.insert_many([example_2, example_3])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for doc in coll.find():\n", " print(doc)\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the document we insert do not follow a schema?\n", "\n", "Let us now find documents that match a condition -- let's say we want to find documents that have a field \"name\" with value \"Michael\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "query_result = coll.find({\"name\": \"Michael\"})\n", "for doc in query_result:\n", " print(doc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Projecting fields\n", "\n", "We can use **find()** not only to retrieve documents that match a condition, but also to project only those fields that we are interested in. \n", "\n", "For example, to suppress the \"\\_id\" field from appearing in the results, we can provide a second argument to __find()__, as follows." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "query_result = coll.find({\"name\": \"Michael\"}, {\"_id\": 0})\n", "for doc in query_result:\n", " print(doc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we're interested in keeping only some of the rest of the fields -- let's say, only \"grades\"?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "query_result = coll.find({\"name\": \"Michael\"}, {\"_id\": 0, \"grades\": 1})\n", "for doc in query_result:\n", " print(doc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading a larger dataset\n", "\n", "Download file [primer-dataset.json](https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json), store it in the same folder as this notebook, and load it into mongodb by running the command below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%bash\n", "mongoimport --db tutorial --collection restaurants --drop --file primer-dataset.json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can import the dataset by running the same command on a terminal.\n", "> mongoimport --db moderndb --collection restaurants --drop --file dataset.json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset contains documents that look like the one below.\n", "\n", "** Restaurant Example **\n", "\n", "```\n", "{\n", " \"address\": {\n", " \"building\": \"1007\",\n", " \"coord\": [ -73.856077, 40.848447 ],\n", " \"street\": \"Morris Park Ave\",\n", " \"zipcode\": \"10462\"\n", " },\n", " \"borough\": \"Bronx\",\n", " \"cuisine\": \"Bakery\",\n", " \"grades\": [\n", " { \"date\": { \"$date\": 1393804800000 }, \"grade\": \"A\", \"score\": 2 },\n", " { \"date\": { \"$date\": 1378857600000 }, \"grade\": \"A\", \"score\": 6 },\n", " { \"date\": { \"$date\": 1358985600000 }, \"grade\": \"A\", \"score\": 10 },\n", " { \"date\": { \"$date\": 1322006400000 }, \"grade\": \"A\", \"score\": 9 },\n", " { \"date\": { \"$date\": 1299715200000 }, \"grade\": \"B\", \"score\": 14 }\n", " ],\n", " \"name\": \"Morris Park Bake Shop\",\n", " \"restaurant_id\": \"30075445\"\n", "}\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "restaurants = db.restaurants # our new collection" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# how many restaurants?\n", "restaurants.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Querying the Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# retrieve a cursor over all documents in the collection\n", "cursor = restaurants.find()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# define printing function\n", "def print_my_docs(cursor, num):\n", " for i in range(num): # print only up to num next documents from cursor\n", " try:\n", " print(next(cursor))\n", " print()\n", " except:\n", " break" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# let's print a few documents\n", "print_my_docs(cursor, 3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "next(cursor) # get one more document" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specify equality conditions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# top-level field\n", "cursor = restaurants.find({\"borough\": \"Manhattan\"})\n", "\n", "print_my_docs(cursor, 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# nested field (in embedded document)\n", "cursor = restaurants.find({\"address.zipcode\": \"10075\"})\n", "\n", "print_my_docs(cursor, 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# query by field in array\n", "cursor = restaurants.find({\"grades.grade\": \"B\"})" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# print one document from the query result\n", "next(cursor)['grades']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# exact array match\n", "cursor = restaurants.find({\"address.coord\": [-73.98513559999999, 40.7676919]})" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(cursor, 10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specify Range Conditions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cursor = restaurants.find({\"grades.score\": {\"$gt\": 30}})" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cursor = restaurants.find({\"grades.score\": {\"$lt\": 10}})" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "next(cursor)[\"grades\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple Conditions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# logical AND\n", "cursor = restaurants.find({\"cuisine\": \"Italian\", \"address.zipcode\": \"10075\"})" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "next(cursor)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# logical OR\n", "cursor = restaurants.find({\"$or\": [{\"cuisine\": \"Italian\"},\n", " {\"address.zipcode\": \"10075\"}]})" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(cursor, 3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# logical AND, differently\n", "cursor = restaurants.find({\"$and\": [{\"cuisine\": \"Italian\"},\n", " {\"address.zipcode\": \"10075\"}]})" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "next(cursor)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sorting" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cursor = restaurants.find()\n", "\n", "# to sort, specify list of sorting criteria,\n", "# each criterion given as a tuple\n", "# (field_name, sort_order)\n", "# here we have only one\n", "sorted_cursor = cursor.sort([(\"borough\", pm.ASCENDING)]) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(cursor, 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "another_sorted_cursor = restaurants.find().sort([(\"borough\", pm.ASCENDING),\n", " (\"address.zipcode\", pm.DESCENDING)])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(another_sorted_cursor, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregation\n", "\n", "Aggregation happens in stages." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Group Documents by a Field and Calculate Count\n", "cursor = restaurants.aggregate(\n", " [\n", " {\"$group\": {\"_id\": \"$borough\", \"count\": {\"$sum\": 1}}}\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(cursor, 10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Filter and Group Documents\n", "cursor = restaurants.aggregate(\n", " [\n", " {\"$match\": {\"borough\": \"Queens\", \"cuisine\": \"Brazilian\"}},\n", " {\"$group\": {\"_id\": \"$address.zipcode\", \"count\": {\"$sum\": 1}}}\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(cursor, 10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Filter and Group and then Filter Again documents\n", "cursor = restaurants.aggregate(\n", " [\n", " {\"$match\": {\"borough\": \"Manhattan\", \"cuisine\": \"American\"}},\n", " {\"$group\": {\"_id\": \"$address.zipcode\", \"count\": {\"$sum\": 1}}},\n", " {\"$match\": {\"count\": {\"$gt\": 1}}}\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(cursor, 10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Filter and Group and then Filter Again and then Sort Documents\n", "cursor = restaurants.aggregate(\n", " [\n", " {\"$match\": {\"borough\": \"Manhattan\", \"cuisine\": \"American\"}},\n", " {\"$group\": {\"_id\": \"$address.zipcode\", \"count\": {\"$sum\": 1}}},\n", " {\"$match\": {\"count\": {\"$gt\": 1}}},\n", " {\"$sort\": {\"count\": -1, \"_id\": -1}}\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(cursor, 10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Same but sort by multiple fields\n", "# Filter and Group and then Filter Again and then Sort Documents\n", "cursor = restaurants.aggregate(\n", " [\n", " {\"$match\": {\"borough\": \"Manhattan\", \"cuisine\": \"American\"}},\n", " {\"$group\": {\"_id\": \"$address.zipcode\", \"count\": {\"$sum\": 1}}},\n", " {\"$match\": {\"count\": {\"$gt\": 1}}},\n", " {\"$sort\": son.SON([(\"count\", -1), (\"_id\", 1)])} # order matters!!\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(cursor, 10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# what will this do?\n", "cursor = restaurants.aggregate(\n", " [\n", " {\"$group\": {\"_id\": None, \"count\": {\"$sum\": 1}} }\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(cursor, 10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# projection\n", "# what will this do?\n", "cursor = restaurants.aggregate(\n", " [\n", " {\"$group\": {\"_id\": \"$address.zipcode\", \"count\": {\"$sum\": 1}}},\n", " {\"$project\": {\"_id\": 0, \"count\": 1}}\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(cursor, 10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# what will this do?\n", "cursor = restaurants.aggregate(\n", " [\n", " {\"$group\": {\"_id\": {\"cuisine\": \"$cuisine\"}, \"count\": {\"$sum\": 1}}},\n", " {\"$sort\": {\"count\": -1}}\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(cursor, 5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# what will this do?\n", "cursor = restaurants.aggregate(\n", " [\n", " {\"$group\": {\"_id\": {\"zip\": \"$address.zipcode\"}, \"count\": {\"$sum\": 1}}},\n", " {\"$sort\": {\"count\": -1}}\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(cursor, 5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# what will this do?\n", "cursor = restaurants.aggregate(\n", " [\n", " {\"$group\": {\"_id\": {\"cuisine\": \"$cuisine\", \"zip\": \"$address.zipcode\"}, \"count\": {\"$sum\": 1}}},\n", " {\"$sort\": {\"count\": -1}}\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(cursor, 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Limiting the number of results" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# what will this do?\n", "\n", "cursor = restaurants.aggregate(\n", " [\n", " {\"$group\": {\"_id\": {\"cuisine\": \"$cuisine\", \"zip\": \"$address.zipcode\"}, \"count\": {\"$sum\": 1}}},\n", " {\"$sort\": {\"count\": -1}},\n", " {\"$limit\": 10} # See comment under \"In-class questions\"\n", " ]\n", ")\n", "\n", "for doc in cursor:\n", " print(doc[\"_id\"][\"cuisine\"], doc[\"_id\"][\"zip\"], doc[\"count\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Storing the result as a collection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use operator [\\$out](https://docs.mongodb.org/manual/reference/operator/aggregation/out/) in a final stage to store the result of a query into a new collection. The following example selects restaurants from Manhattan and stores them in their own collection in the same database." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "restaurants.aggregate(\n", " [\n", " {\"$match\": {\"borough\": \"Manhattan\"}},\n", " {\"$out\": \"manhattan\"}\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SQL to Aggregation\n", "\n", "Here we explore the correspondence between SQL queries and the aggregation framework." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** SQL query **\n", "```\n", "SELECT COUNT(*) AS count\n", "FROM restaurants\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cursor = restaurants.aggregate(\n", " [\n", " {\"$group\": {\"_id\": None, \"count\": {\"$sum\": 1}} }\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** SQL query **\n", "```\n", "SELECT borough, cuisine, COUNT(*) as count\n", "FROM restaurants\n", "GROUP BY borough, cuisine\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cursor = restaurants.aggregate(\n", " [\n", " {\"$group\": {\"_id\": {\"borough\": \"$borough\", \"cuisine\": \"$cuisine\"}, \"count\": {\"$sum\": 1}}}\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** SQL query **\n", "```\n", "SELECT borough, cuisine, COUNT(*) as count\n", "FROM restaurants\n", "GROUP BY borough, cuisine\n", "HAVING COUNT(*) > 3\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cursor = restaurants.aggregate(\n", " [\n", " {\"$group\": {\"_id\": {\"borough\": \"$borough\", \"cuisine\": \"$cuisine\"}, \"count\": {\"$sum\": 1}}},\n", " {\"$match\": {\"count\": {\"$gt\": 3}}}\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** SQL Query **\n", "```\n", "SELECT zipcode, cuisine, COUNT(*) as count\n", "FROM restaurants\n", "WHERE borough = \"Manhattan\"\n", "GROUP BY zipcode, cuisine\n", "HAVING COUNT(*) > 3\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cursor = restaurants.aggregate(\n", " [\n", " {\"$match\": {\"borough\": \"Manhattan\"}},\n", " {\"$group\": {\"_id\": {\"zipcode\": \"$address.zipcode\", \"cuisine\": \"$cuisine\"}, \"count\": {\"$sum\": 1}}},\n", " {\"$match\": {\"count\": {\"$gt\": 3}}}\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(cursor, 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** SQL Query **\n", "```\n", "SELECT zipcode, cuisine, COUNT(*) as count\n", "FROM restaurants\n", "WHERE borough = \"Manhattan\"\n", "GROUP BY zipcode, cuisine\n", "HAVING COUNT(*) > 3\n", "ORDER BY count\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cursor = restaurants.aggregate(\n", " [\n", " {\"$match\": {\"borough\": \"Manhattan\"}},\n", " {\"$group\": {\"_id\": {\"zipcode\": \"$address.zipcode\", \"cuisine\": \"$cuisine\"}, \"count\": {\"$sum\": 1}}},\n", " {\"$match\": {\"count\": {\"$gt\": 3}}},\n", " {\"$sort\": {\"count\": 1}}\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using secondary memory (disk)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cursor = restaurants.aggregate(\n", " [\n", " {\"$match\": {\"borough\": \"Manhattan\"}},\n", " {\"$group\": {\"_id\": {\"zipcode\": \"$address.zipcode\", \"cuisine\": \"$cuisine\"}, \"count\": {\"$sum\": 1}}},\n", " {\"$match\": {\"count\": {\"$gt\": 3}}},\n", " {\"$sort\": {\"count\": 1}}\n", " ],\n", " allowDiskUse = True # this can be useful when data does not fit in memory, e.g., to perform external sorting\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indexing\n", "\n", "MongoDb automatically creates an index on the `_id` field upon creating a collection.\n", "We can use `create_index()` to create index on one or more fields of a collection." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Single-field index" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# note that the argument is a list of tuples\n", "# [(: ), ...]\n", "# here, we specify only one such tuple for one field\n", "restaurants.create_index([(\"borough\", pm.ASCENDING)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The index is created only if it does not already exist." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compound index" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# compound index (more than one indexed fields)\n", "restaurants.create_index([\n", " (\"cuisine\", pm.ASCENDING),\n", " (\"address.zipcode\", pm.DESCENDING)\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting indexes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "restaurants.drop_index('borough_1') # drop this index" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "restaurants.drop_index('cuisine_1_address.zipcode_-1') # drop that index" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "restaurants.drop_indexes() # drop all indexes!!1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multi-key index\n", "\n", "An index for a fields with array value." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "restaurants.find_one()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "restaurants.create_index([(\"address.coord\", 1)])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "restaurants.create_index([(\"grades.score\", 1)])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "restaurants.create_index([(\"grades.grade\", 1), (\"grades.score\", 1)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following will not work!\n", "We cannot _currently_ have compound multi-key indexes." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "restaurants.create_index([(\"address.coord\", 1), (\"grades.score\", 1)]) # NOPE!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrieving the execution plan\n", "\n", "We can retrieve the execution plan for a **find()** query by calling the [explain()](https://docs.mongodb.org/manual/reference/method/cursor.explain/) function on the result cursor. We demonstrate this in the following example." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "restaurants.drop_indexes() # we drop all indexes first -- use this with care!\n", "restaurants.create_index([(\"borough\", pm.ASCENDING)]) # build an index on field \"borough\", in ascending order\n", "my_cursor = restaurants.find({\"borough\": \"brooklyn\"}) # submit query to find restaurants from specific borough\n", "my_cursor.explain()[\"queryPlanner\"][\"winningPlan\"] # ask mongodb to explain execution plan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we see in this example, MongoDB makes use of an index (as indicated by keyword \"IXSCAN\") -- and particularly the index ('borough_1') we constructed to execute the query.\n", "What if we had not built this index?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "restaurants.drop_indexes() # we drop all indexes first -- use this with care!\n", "my_cursor = restaurants.find({\"borough\": \"brooklyn\"}) # submit query to find restaurants from specific borough\n", "my_cursor.explain()[\"queryPlanner\"][\"winningPlan\"] # ask mongodb to explain execution plan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In that case, MongoDB simply performs a scan over the collection (as indicated by keyword \"COLLSCAN\")." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joins\n", "\n", "Until very recently, MongoDB did not support joins.\n", "It was up to the user to implement a join if needed -- as in the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for a in restaurants.find({\"borough\": \"Manhattan\"}).limit(7):\n", " for b in restaurants.find({\"borough\": \"Bronx\"}).limit(5):\n", " if a[\"cuisine\"] == b[\"cuisine\"]:\n", " print(a[\"cuisine\"], a[\"address\"][\"zipcode\"], b[\"address\"][\"zipcode\"])\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Joins with \\$lookup\n", "\n", "This is a new aggregation stage that implements *left outer equi-joins*.\n", "\n", "\"A [left outer equi-join](https://www.mongodb.com/blog/post/joins-and-other-aggregation-enhancements-coming-in-mongodb-3-2-part-1-of-3-introduction) produces a result set that contains data for all documents from the left table (collection) together with data from the right table (collection) for documents where there is a match with documents from the left table (collection).\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# create first collection\n", "orders_docs = [{ \"_id\" : 1, \"item\" : \"abc\", \"price\" : 12, \"quantity\" : 2 },\n", "{ \"_id\" : 2, \"item\" : \"jkl\", \"price\" : 20, \"quantity\" : 1 },\n", "{ \"_id\" : 3 }]\n", "orders = db.orders\n", "orders.drop()\n", "orders.insert_many(orders_docs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# create second collection\n", "inventory_docs = [\n", " { \"_id\" : 1, \"item\" : \"abc\", \"description\": \"product 1\", \"instock\" : 120 },\n", "{ \"_id\" : 2, \"item\" : \"def\", \"description\": \"product 2\", \"instock\" : 80 },\n", "{ \"_id\" : 3, \"item\" : \"ijk\", \"description\": \"product 3\", \"instock\" : 60 },\n", "{ \"_id\" : 4, \"item\" : \"jkl\", \"description\": \"product 4\", \"instock\" : 70 },\n", "{ \"_id\" : 5, \"item\": None, \"description\": \"Incomplete\" },\n", "{ \"_id\" : 6 }\n", "]\n", "inventory = db.inventory\n", "inventory.drop()\n", "inventory.insert_many(inventory_docs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "result = orders.aggregate([ # \"orders\" is the outer collection\n", " {\n", " \"$lookup\":\n", " {\n", " \"from\": \"inventory\", # the inner collection \n", " \"localField\": \"item\", # the join field of the outer collection\n", " \"foreignField\": \"item\", # the join field of the outer collection\n", " \"as\": \"inventory_docs\" # name of field with array of joined inner docs\n", " }\n", " }\n", "])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(result, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Questions from tutorial sessions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question: How do we query for documents with an array field, all the elements of which satisfy a condition?\n", "\n", "Two approaches (if you can think of a different approach, please let me know):\n", "* Use the [**\\$not**](https://docs.mongodb.org/manual/reference/operator/query/not/#op._S_not) operators: form a query to express that \"there is no element in the array that does not satisfy the condition\".\n", "* In aggregation, combine an [**\\$unwind**](https://docs.mongodb.org/manual/reference/operator/aggregation/unwind/) stage with a [**$group**](https://docs.mongodb.org/manual/reference/operator/aggregation/group/) stage.\n", "\n", "To provide an example, let's say we want to __find restaurants with 'A' grades only__.\n", "Below we show how we can use each of the aforementioned approaches." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### First approach: using $not" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# using the $not operator\n", "# \"find restaurants that contain no grades that are not equal to A\"\n", "cursor = restaurants.find({\"grades.grade\": {\"$exists\": True}, \"grades\": {\"$not\": {\"$elemMatch\": {\"grade\": {\"$ne\": \"A\"}}}}})" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print_my_docs(cursor, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Note on the semantics of the \\$not operator\n", "\n", "The operator selects documents that _do not match_ the specified condition on the specified field. These documents include ones that _do not contain_ the field.\n", "\n", "To demonstrate this, consider the following simple example of a collection." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# simple example of a collection\n", "mycoll = db.mycoll\n", "mycoll.drop()\n", "\n", "# insert three documents\n", "mycoll.insert_one({\"grades\": [7, 7]})\n", "mycoll.insert_one({\"grades\": [7, 3]})\n", "mycoll.insert_one({\"grades\": [3, 3]})\n", "mycoll.insert_one({\"grades\": []})\n", "mycoll.insert_one({})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result of the following query contains documents that do not contain the \"grades\" field." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# find documents that have no \"grades\" elements that are not equal to \"A\"\n", "mycursor = mycoll.find({\"grades\": {\"$not\": {\"$elemMatch\": {\"$ne\": 7}}}})\n", "print_my_docs(mycursor, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can remove such documents from the result as a post-processing step. (**Exercise**: how?)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Second approach: aggregation pipeline" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# using aggregation\n", "mycursor = restaurants.aggregate(\n", " [\n", " # unwind the grades array\n", " {\"$unwind\": \"$grades\"}, #now each document contains one \"grades\" value\n", " \n", " # group by document \"_id\" and count:\n", " # (i) the total number of documents in each group as `count`\n", " # -- this is the same as the number of elements in the original array\n", " # (ii) the number of documents that satisfy the condition (grade = \"A\") as `num_satisfied`\n", " {\"$group\": {\"_id\": \"$_id\", \"count\": {\"$sum\": 1}, \"num_satisfied\": {\"$sum\": {\"$cond\": [{\"$eq\": [\"$grades.grade\", \"A\"]}, 1, 0]}}}},\n", " \n", " # create a field (named `same`) that is 1 if (count = num_satisfied) and 0 otherwise\n", " {\"$project\": {\"_id\": 1, \"same_count\": {\"$cond\": [{\"$eq\": [\"$count\", \"$num_satisfied\"]} , 1, 0]}}},\n", " \n", " # keep only the document ids for which (same = 1)\n", " {\"$match\": {\"same_count\": 1}}\n", " ]\n", ")\n", "\n", "print_my_docs(mycursor, 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question: Does MongoDB optimize the stages of an aggregation pipeline?\n", "\n", "The question was asked in relation to the \"limit\" query we saw above (\"Limiting the number of results\").\n", "\n", "Indeed, MongoDB does optimize the execution of the aggregation pipeline, as explained [here](https://docs.mongodb.org/manual/core/aggregation-pipeline-optimization/). In relation to the aforementioned query, see, in particular, the part on [sort+limit coalescence](https://docs.mongodb.org/manual/core/aggregation-pipeline-optimization/#sort-limit-coalescence)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Credits and references\n", "\n", "We used and consulted material from:\n", "* the offficial [PyMongo tutorial](https://docs.mongodb.org/getting-started/python/) as well as this shorter [one](http://api.mongodb.org/python/current/tutorial.html),\n", "* the [JSON](http://json.org/) and [BSON](http://bsonspec.org) documentation, as well as [SON](http://api.mongodb.org/python/current/api/bson/son.html#bson.son.SON),\n", "* these [posts](https://www.mongodb.com/blog/post/joins-and-other-aggregation-enhancements-coming-in-mongodb-3-2-part-1-of-3-introduction) on the MongoDB blog about the new (v.3.2) left outer equi-join functionality,\n", "* this [StackOverflow thread](http://stackoverflow.com/questions/18123300/mongo-array-query-only-find-where-all-elements-match)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }