{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Database Programming\n", "\n", "Due to the sheer size of many modern datasets, statistical applications can quickly outstrip the capabilities of unstructured data stores, such as spreadsheets and text files. So, it is useful to be able to work with databases for both the retrieval of information when it is required for analysis and for the storage of processed data and analytic outputs.\n", "\n", "Relational database management systems (RDBMS) are used to store user-defined records in large tables. Each table has columns (also known as **fields**) which describe the data, and rows (also known as **records**) which contain the data.\n", " \n", "Unlike when we are using a spreadsheet, where we put formulas into cells to calculate values, when we are using a database, we send **queries** to the database management system to manipulate the database for us.\n", "The database manager can process complex query commands that combine data from multiple tables to generate reports and data summaries. \n", "\n", "> Every database manager—Oracle,\n", "> IBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite—stores\n", "> data in a different way,\n", "> so a database created with one cannot be used directly by another.\n", "> However,\n", "> every database manager can import and export data in a variety of formats,\n", "> so it *is* possible to move information from one to another.\n", "\n", "Writing Python code to access databases is made easier by the presence of the **Python Database API** (commonly referred to as the DB-API). This API fits nicely into existing Python code and allows programmers to easily store and retrieve data from databases.\n", "\n", "The Python DB-API specifies a standard way to connect to databases and issue commands to them, and provides a common starting point for anyone writing code to interact with databases. There are several Python modules that extend the DB-API with different levels of abstraction. They range from simple result set wrappers to full blown object relational mappers. The simple wrappers typically wrap the results of database operations in more Pythonic data structures like dictionaries, while object relational mappers graft object interfaces onto the database, and model tables as Python classes. This allows Python programmers to (largely) distance themselves from writing SQL statements.\n", "\n", "\n", "## Client-server model\n", "\n", "Databases are typically based on a client-server model. Just a web server provides content to multiple website visitors that request and render pages, databases return query results to one or more clients.\n", "\n", "The model consists of one server instance and many client instances.\n", "\n", "![rdbms](http://d.pr/i/7ft0+)\n", "\n", "Server:\n", "\n", "* accepts connections from clients\n", "* processes requests and provide results\n", "\n", "Client:\n", "\n", "* connects to server\n", "* makes requests and receives results from server\n", "\n", "Popular database servers include: Oracle, MySQL, Microsoft SQL, SQLite\n", "\n", "The database interaction paradigm consists of multiple steps:\n", "\n", "1. Connect to database server\n", "2. Issue request\n", "3. Fetch result\n", "4. Repeat 2,3 as necessary\n", "5. Disconnect from server\n", "\n", "## The SQL Interface\n", "\n", "Databases can be accessed from Python through the interface described by the Python DB-API. Different brands of database will have slightly different implementations, and may not completely conform to the DB-API, but most are quite similar.\n", "\n", "Irrespective of the database system you are using, there are three core concepts to database access in Python:\n", "\n", "+ **Connection objects**\n", ": This is a connection to a database, which provides specific implementation details and access control.\n", "\n", "+ **Cursor objects**\n", ": A cursor is a means for traversing the database, and governs the execution of SQL statements and the return of results.\n", "\n", "+ **Query results**\n", ": The output from database queries in Python is always a sequence of sequences, representing database tables of rows.\n", "\n", "Database queries are sent via the cursor as strings of SQL. Thus, productive use of databases requires some mastery of the SQL language, and we will teach the basics of SQL here.\n", "\n", "## SQLite\n", "\n", "A good entry point is the **SQLite** database; SQLite is a public-domain software package that provides a relational database management system. \n", "\n", "SQLite is defined by the following features:\n", "\n", "+ **Serverless**\n", ": SQLite does not require a separate server process or system to operate. The SQLite library accesses its storage files directly.\n", "+ **Zero Configuration**\n", ": No server means no setup. Creating an SQLite database instance is as easy as opening a file.\n", "+ **Cross-Platform**\n", ": The entire database instance resides in a single cross-platform file, requiring no administration.\n", "+ **Self-Contained**\n", ": A single library contains the entire database system, which integrates directly into a host application.\n", "+ **Small Runtime Footprint**\n", ": The default build is less than a megabyte of code and requires only a few megabytes of memory. \n", "+ **Transactional**\n", ": SQLite transactions are fully ACID-compliant, allowing safe access from multiple processes or threads.\n", "\n", "Unlike most RDBMS products, SQLite does not have a client/server architecture. Most large-scale database systems have a large server package that makes up the database engine. The database server often consists of multiple processes that work in concert to manage client connections, file I/O, caches, query optimization, and query processing. A database instance typically consists of a large number of files organized into one or more directory trees on the server filesystem. In order to access the database, all of the files must be present and correct. This can make it somewhat difficult to move or reliably back up a database instance.\n", "All of these components require resources and support from the host computer. \n", "\n", "![sqlite arch](http://d.pr/i/moUV+)\n", "\n", "In contrast, SQLite has no separate server. The entire database engine is integrated into whatever application needs to access a database. The only shared resource among applications is the single database file as it sits on disk. If you need to move or back up the database, you can simply copy the file. Unlike a traditional RDBMS server that requires advanced multitasking and high-performance inter-process communication, SQLite requires little more than the ability to read and write to some type of storage.\n", "\n", "SQLite doesn't make sense for scenarios where there are a high number of transactions, a very large quantity of data, or multiple users requiring access control. Its suitability depends on the degree to which trading off simplicity and portability with security and robustness makes sense.\n", "\n", "The Python Standard Library includes a module called `sqlite3` intended for working with this database. This module is a SQL interface compliant with the DB-API 2.0 specification." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running SQLite\n", "\n", "The sqlite3 command line tool accepts SQL commands from an interactive prompt and passes those commands to the SQLite core for processing. To get started, just run the `sqlite3` command; If you provide a filename (such as our sample database `survey.db`), sqlite3 will open (or create) that file. If no filename is given, sqlite3 will automatically open an unnamed temporary database. \n", "\n", "![terminal](http://d.pr/i/XolC+)\n", "\n", "The `sqlite>` prompt means sqlite3 is ready to accept commands, similar to `>>>` in Python. We can issue some basic expressions:\n", "\n", "![interaction](http://d.pr/i/1iOvG+)\n", "\n", "The first line is a simple SQL expression that returns a sum and an integer. All SQL commands in the command line tool must end with semicolons, which indicate the command is complete. We will explore the SELECT statement fully later on.\n", "\n", "In addition to processing SQL statements, there is a series of shell-specific commands. These are sometimes referred to as **dot-commands** because they start with a period. Dot-commands control the shell’s output formatting, and also provide a number of utility features. \n", "\n", "In the above example, `.databases` lists the names and files of attached databases, while `.tables` lists the tables in the attached databases. Finally, we exited with `.quit`.\n", "\n", "\n", "### Running SQLite from Python\n", "\n", "Relational databases are more useful to us when integrated with our analytic tools. The SQLite3 can be integrated with Python using the `sqlite3` module which was written by Gerhard Haring. It provides an SQL interface compliant with the DB-API 2.0 specification. You do not need to install this module separately because it is part of Python's core set of modules.\n", "\n", "The `connect` function is used to establish a connection to a new or existing database. Since we're using SQLite, all we need to specify is the name of the database file. Other systems may require us to provide a username and password as well.\n", "\n", "Let's create a new database." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sqlite3\n", "\n", "con = sqlite3.connect('microbiome.db')\n", "con" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to communicate with the database, we need to create a **cursor** object. A database cursor is a control structure that enables traversal over the records in a database. Cursors facilitate subsequent processing in conjunction with the traversal, such as retrieval, addition and removal of database records. Just like the cursor in an editor, its role is to keep track of where we are in the database." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cur = con.cursor() " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cur" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cursor's `execute` method provides the means for passing queries to the database. Let's create a table." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cur.execute('create table samples (Taxon text, Patient integer, NEC integer, Tissue integer, Stool integer)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's our job to make sure that SQL is properly formatted; if it isn't, or if something goes wrong when it is being executed, the database will report an error." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "con.commit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `commit` function completes the **transaction**. \n", "\n", "Now to add some data to the empty table. We will now pass an INSERT query to the database, with some new data." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cur.execute('insert into samples values (\\'Bacteroidetes\\',14,1,102,33)')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "con.commit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, it would be painstaking to construct a new query for each new observation that we intend to store in SQLite. Fortunately, data can be added as a set; for example, here are some microbiome sample from high-throughput RNA 16S sequencing." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "samples = [('Firmicutes',1,0,136,4182),\n", "('Firmicutes',2,1,1174,703),\n", "('Firmicutes',3,0,408,3946),\n", "('Firmicutes',4,1,831,8605),\n", "('Firmicutes',5,0,693,50),\n", "('Firmicutes',6,1,718,717),\n", "('Proteobacteria',8,1,2651,767),\n", "('Proteobacteria',9,0,1195,76),\n", "('Proteobacteria',10,1,6857,795),\n", "('Proteobacteria',11,0,483,666),\n", "('Proteobacteria',12,1,2950,3994),\n", "('Actinobacteria',11,0,42,75),\n", "('Actinobacteria',12,1,109,59),\n", "('Actinobacteria',13,0,51,183),\n", "('Actinobacteria',14,1,310,204),\n", "('Bacteroidetes',1,0,67,0),\n", "('Bacteroidetes',2,1,0,0),\n", "('Bacteroidetes',3,0,85,5),\n", "('Bacteroidetes',4,1,143,7)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `executemany` function will allow us to add them as a group, iterating over the list of tuples." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cur.executemany('insert into samples values (?,?,?,?,?)', samples)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at what is in the table now. For this we will execute a SELECT command. This returns the cursor, which contains an iterable data structure." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = cur.execute('select * from samples')\n", "results" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(u'Bacteroidetes', 14, 1, 102, 33),\n", " (u'Firmicutes', 1, 0, 136, 4182),\n", " (u'Firmicutes', 2, 1, 1174, 703),\n", " (u'Firmicutes', 3, 0, 408, 3946),\n", " (u'Firmicutes', 4, 1, 831, 8605),\n", " (u'Firmicutes', 5, 0, 693, 50),\n", " (u'Firmicutes', 6, 1, 718, 717),\n", " (u'Proteobacteria', 8, 1, 2651, 767),\n", " (u'Proteobacteria', 9, 0, 1195, 76),\n", " (u'Proteobacteria', 10, 1, 6857, 795),\n", " (u'Proteobacteria', 11, 0, 483, 666),\n", " (u'Proteobacteria', 12, 1, 2950, 3994),\n", " (u'Actinobacteria', 11, 0, 42, 75),\n", " (u'Actinobacteria', 12, 1, 109, 59),\n", " (u'Actinobacteria', 13, 0, 51, 183),\n", " (u'Actinobacteria', 14, 1, 310, 204),\n", " (u'Bacteroidetes', 1, 0, 67, 0),\n", " (u'Bacteroidetes', 2, 1, 0, 0),\n", " (u'Bacteroidetes', 3, 0, 85, 5),\n", " (u'Bacteroidetes', 4, 1, 143, 7)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.fetchall()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for row in results:\n", " print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we need to close our cursor and our connection, since the database can only keep a limited number of these open at one time. Since establishing a connection takes time, though, we shouldn't open a connection, do one operation, then close the connection, only to reopen it a few microseconds later to do another operation. Instead, it's normal to create one connection that stays open for the lifetime of the program." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cur.close()\n", "con.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### IPython SQL Magic\n", "\n", "Catherine Devlin has created a SQL %magic function for IPython that allows for more direct querying of SQLite and other databases. \n", "\n", "Go ahead and install the extension using pip:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading/unpacking ipython-sql\n", " Downloading ipython-sql-0.3.4.tar.gz\n", " Running setup.py (path:/private/var/folders/z_/kgsm76k94s7_jhnkfydfnd380000gn/T/pip_build_fonnescj/ipython-sql/setup.py) egg_info for package ipython-sql\n", " \n", "Downloading/unpacking prettytable (from ipython-sql)\n", " Downloading prettytable-0.7.2.tar.bz2\n", " Running setup.py (path:/private/var/folders/z_/kgsm76k94s7_jhnkfydfnd380000gn/T/pip_build_fonnescj/prettytable/setup.py) egg_info for package prettytable\n", " \n", "Requirement already satisfied (use --upgrade to upgrade): ipython>=1.0 in /Users/fonnescj/anaconda/lib/python2.7/site-packages (from ipython-sql)\n", "Requirement already satisfied (use --upgrade to upgrade): sqlalchemy>=0.6.7 in /Users/fonnescj/anaconda/lib/python2.7/site-packages (from ipython-sql)\n", "Downloading/unpacking sqlparse (from ipython-sql)\n", " Downloading sqlparse-0.1.13.tar.gz (54kB): 54kB downloaded\n", " Running setup.py (path:/private/var/folders/z_/kgsm76k94s7_jhnkfydfnd380000gn/T/pip_build_fonnescj/sqlparse/setup.py) egg_info for package sqlparse\n", " \n", "Requirement already satisfied (use --upgrade to upgrade): six in /Users/fonnescj/anaconda/lib/python2.7/site-packages (from ipython-sql)\n", "Installing collected packages: ipython-sql, prettytable, sqlparse\n", " Running setup.py install for ipython-sql\n", " \n", " Running setup.py install for prettytable\n", " \n", " Running setup.py install for sqlparse\n", " changing mode of build/scripts-2.7/sqlformat from 644 to 755\n", " \n", " changing mode of /Users/fonnescj/anaconda/bin/sqlformat to 755\n", "Successfully installed ipython-sql prettytable sqlparse\n", "Cleaning up...\n" ] } ], "source": [ "!pip install ipython-sql" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After loading the extension in the current session, it is easy to connect to databases and execute queries, without having to use the `sqlite3` module at all." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%load_ext sql" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "u'Connected: None@microbiome.db'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql sqlite:///microbiome.db" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 rows affected.\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "INSERT INTO samples VALUES ('Actinobacteria',7,0,260,58);" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] } ], "source": [ "writer_results = %sql select * from samples where NEC==1;" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TaxonPatientNECTissueStool
Bacteroidetes14110233
" ], "text/plain": [ "[(u'Bacteroidetes', 14, 1, 102, 33)]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "writer_results" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TaxonPatientNECTissueStool
0 Bacteroidetes 14 1 102 33
\n", "
" ], "text/plain": [ " Taxon Patient NEC Tissue Stool\n", "0 Bacteroidetes 14 1 102 33" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "writer_results.DataFrame()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This functionality is better for interactive computing than the `sqlite3` mdoule, so we will be using it for most of the remainder of the section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the late 1920s and early 1930s,\n", "William Dyer,\n", "Frank Pabodie,\n", "and Valentina Roerich led expeditions to the\n", "[Pole of Inaccessibility](http://en.wikipedia.org/wiki/Pole_of_inaccessibility)\n", "in the South Pacific,\n", "and then onward to Antarctica.\n", "Two years ago,\n", "their expeditions were found in a storage locker at Miskatonic University.\n", "We have scanned and OCR'd the data they contain,\n", "and we now want to store that information\n", "in a way that will make search and analysis easy.\n", "\n", "To load this database, we pass its URI to the SQL magic:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "u'Connected: None@../data/survey.db'" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql sqlite:///../data/survey.db" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "
\n", "Person: people who took readings.\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
ident personal family
dyer William Dyer
pb Frank Pabodie
lake Anderson Lake
roe Valentina Roerich
danforth Frank Danforth
\n", "\n", "Site: locations where readings were taken.\n", "\n", "\n", " \n", " \n", " \n", " \n", "
name lat long
DR-1 -49.85 -128.57
DR-3 -47.15 -126.72
MSK-4 -48.87 -123.4
\n", "\n", "Visited: when readings were taken at specific sites.\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ident site dated
619 DR-1 1927-02-08
622 DR-1 1927-02-10
734 DR-3 1939-01-07
735 DR-3 1930-01-12
751 DR-3 1930-02-26
752 DR-3  
837 MSK-4 1932-01-14
844 DR-1 1932-03-22
\n", "
\n", "Survey: the actual readings.\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
taken person quant reading
619 dyer rad 9.82
619 dyer sal 0.13
622 dyer rad 7.8
622 dyer sal 0.09
734 pb rad 8.41
734 lake sal 0.05
734 pb temp -21.5
735 pb rad 7.22
735   sal 0.06
735   temp -26.0
751 pb rad 4.35
751 pb temp -18.5
751 lake sal 0.1
752 lake rad 2.19
752 lake sal 0.09
752 lake temp -16.0
752 roe sal 41.6
837 lake rad 1.46
837 lake sal 0.21
837 roe sal 22.5
844 roe rad 11.25
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that three entries—one in the `Visited` table,\n", "and two in the `Survey` table—are shown in red\n", "because they don't contain any actual data:\n", "we'll return to these missing values.\n", "For now,\n", "let's write an SQL query that displays scientists' names.\n", "We do this using the SQL command `select`,\n", "giving it the names of the columns we want and the table we want them from.\n", "Our query and its output look like this:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
familypersonal
DyerWilliam
PabodieFrank
LakeAnderson
RoerichValentina
DanforthFrank
" ], "text/plain": [ "[(u'Dyer', u'William'),\n", " (u'Pabodie', u'Frank'),\n", " (u'Lake', u'Anderson'),\n", " (u'Roerich', u'Valentina'),\n", " (u'Danforth', u'Frank')]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql select family, personal from Person;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The semi-colon at the end of the query\n", "tells the database manager that the query is complete and ready to run.\n", "We have written our commands and column names in lower case,\n", "and the table name in Title Case,\n", "but we don't have to:\n", "as the example below shows,\n", "SQL is [case insensitive](../../gloss.html#case-insensitive)." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
familypersonal
DyerWilliam
PabodieFrank
LakeAnderson
RoerichValentina
DanforthFrank
" ], "text/plain": [ "[(u'Dyer', u'William'),\n", " (u'Pabodie', u'Frank'),\n", " (u'Lake', u'Anderson'),\n", " (u'Roerich', u'Valentina'),\n", " (u'Danforth', u'Frank')]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SeLeCt FaMiLy, PeRsOnAl FrOm PeRsOn;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many people format queries as:\n", "\n", "~~~\n", "SELECT family, personal FROM person;\n", "~~~\n", "\n", "or as:\n", "\n", "~~~\n", "select Family, Personal from PERSON;\n", "~~~\n", "\n", "Whatever casing convention you choose,\n", "please be consistent:\n", "complex queries are hard enough to read without the extra cognitive load of random capitalization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Going back to our query,\n", "it's important to understand that\n", "the rows and columns in a database table aren't actually stored in any particular order.\n", "They will always be *displayed* in some order,\n", "but we can control that in various ways.\n", "For example,\n", "we could swap the columns in the output by writing our query as:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
personalfamily
WilliamDyer
FrankPabodie
AndersonLake
ValentinaRoerich
FrankDanforth
" ], "text/plain": [ "[(u'William', u'Dyer'),\n", " (u'Frank', u'Pabodie'),\n", " (u'Anderson', u'Lake'),\n", " (u'Valentina', u'Roerich'),\n", " (u'Frank', u'Danforth')]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select personal, family from Person;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or even repeat columns:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identident_1ident_2
dyerdyerdyer
pbpbpb
lakelakelake
roeroeroe
danforthdanforthdanforth
" ], "text/plain": [ "[(u'dyer', u'dyer', u'dyer'),\n", " (u'pb', u'pb', u'pb'),\n", " (u'lake', u'lake', u'lake'),\n", " (u'roe', u'roe', u'roe'),\n", " (u'danforth', u'danforth', u'danforth')]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select ident, ident, ident from Person;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a shortcut,\n", "we can select all of the columns in a table using `*`:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identpersonalfamily
dyerWilliamDyer
pbFrankPabodie
lakeAndersonLake
roeValentinaRoerich
danforthFrankDanforth
" ], "text/plain": [ "[(u'dyer', u'William', u'Dyer'),\n", " (u'pb', u'Frank', u'Pabodie'),\n", " (u'lake', u'Anderson', u'Lake'),\n", " (u'roe', u'Valentina', u'Roerich'),\n", " (u'danforth', u'Frank', u'Danforth')]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Person;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Write a query that selects only site names from the `Site` table." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name
DR-1
DR-3
MSK-4
" ], "text/plain": [ "[(u'DR-1',), (u'DR-3',), (u'MSK-4',)]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select name from site;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sorting and Removing Duplicates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data is often redundant,\n", "so queries often return redundant information.\n", "For example,\n", "if we select the quantitites that have been measured\n", "from the `survey` table,\n", "we get this:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
quant
rad
sal
rad
sal
rad
sal
temp
rad
sal
temp
rad
temp
sal
rad
sal
temp
sal
rad
sal
sal
rad
" ], "text/plain": [ "[(u'rad',),\n", " (u'sal',),\n", " (u'rad',),\n", " (u'sal',),\n", " (u'rad',),\n", " (u'sal',),\n", " (u'temp',),\n", " (u'rad',),\n", " (u'sal',),\n", " (u'temp',),\n", " (u'rad',),\n", " (u'temp',),\n", " (u'sal',),\n", " (u'rad',),\n", " (u'sal',),\n", " (u'temp',),\n", " (u'sal',),\n", " (u'rad',),\n", " (u'sal',),\n", " (u'sal',),\n", " (u'rad',)]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select quant from Survey;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can eliminate the redundant output\n", "to make the result more readable\n", "by adding the `distinct` keyword\n", "to our query:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
quant
rad
sal
temp
" ], "text/plain": [ "[(u'rad',), (u'sal',), (u'temp',)]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select distinct quant from Survey;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we select more than one column—for example,\n", "both the survey site ID and the quantity measured—then\n", "the distinct pairs of values are returned:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
takenquant
619rad
619sal
622rad
622sal
734rad
734sal
734temp
735rad
735sal
735temp
751rad
751temp
751sal
752rad
752sal
752temp
837rad
837sal
844rad
" ], "text/plain": [ "[(619, u'rad'),\n", " (619, u'sal'),\n", " (622, u'rad'),\n", " (622, u'sal'),\n", " (734, u'rad'),\n", " (734, u'sal'),\n", " (734, u'temp'),\n", " (735, u'rad'),\n", " (735, u'sal'),\n", " (735, u'temp'),\n", " (751, u'rad'),\n", " (751, u'temp'),\n", " (751, u'sal'),\n", " (752, u'rad'),\n", " (752, u'sal'),\n", " (752, u'temp'),\n", " (837, u'rad'),\n", " (837, u'sal'),\n", " (844, u'rad')]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select distinct taken, quant from Survey;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice in both cases that duplicates are removed\n", "even if they didn't appear to be adjacent in the database.\n", "Again,\n", "it's important to remember that rows aren't actually ordered:\n", "they're just displayed that way." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Write a query that selects distinct dates from the `Site` table." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dated
1927-02-08
1927-02-10
1939-01-07
1930-01-12
1930-02-26
None
1932-01-14
1932-03-22
" ], "text/plain": [ "[(u'1927-02-08',),\n", " (u'1927-02-10',),\n", " (u'1939-01-07',),\n", " (u'1930-01-12',),\n", " (u'1930-02-26',),\n", " (None,),\n", " (u'1932-01-14',),\n", " (u'1932-03-22',)]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select dated from visited; -- here is a comment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we mentioned earlier,\n", "database records are not stored in any particular order.\n", "This means that query results aren't necessarily sorted,\n", "and even if they are,\n", "we often want to sort them in a different way,\n", "*e.g.*, by the name of the project instead of by the name of the scientist.\n", "We can do this in SQL by adding an `order by` clause to our query:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identpersonalfamily
danforthFrankDanforth
dyerWilliamDyer
lakeAndersonLake
pbFrankPabodie
roeValentinaRoerich
" ], "text/plain": [ "[(u'danforth', u'Frank', u'Danforth'),\n", " (u'dyer', u'William', u'Dyer'),\n", " (u'lake', u'Anderson', u'Lake'),\n", " (u'pb', u'Frank', u'Pabodie'),\n", " (u'roe', u'Valentina', u'Roerich')]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Person order by ident;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default,\n", "results are sorted in ascending order\n", "(*i.e.*,\n", "from least to greatest).\n", "We can sort in the opposite order using `desc` (for \"descending\"):" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identpersonalfamily
roeValentinaRoerich
pbFrankPabodie
lakeAndersonLake
dyerWilliamDyer
danforthFrankDanforth
" ], "text/plain": [ "[(u'roe', u'Valentina', u'Roerich'),\n", " (u'pb', u'Frank', u'Pabodie'),\n", " (u'lake', u'Anderson', u'Lake'),\n", " (u'dyer', u'William', u'Dyer'),\n", " (u'danforth', u'Frank', u'Danforth')]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from person order by ident desc;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(And if we want to make it clear that we're sorting in ascending order,\n", "we can use `asc` instead of `desc`.)\n", " \n", "We can also sort on several fields at once.\n", "For example,\n", "this query sorts results first in ascending order by `taken`,\n", "and then in descending order by `person`\n", "within each group of equal `taken` values:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
takenperson
619dyer
619dyer
622dyer
622dyer
734pb
734pb
734lake
735pb
735None
735None
751pb
751pb
751lake
752roe
752lake
752lake
752lake
837roe
837lake
837lake
844roe
" ], "text/plain": [ "[(619, u'dyer'),\n", " (619, u'dyer'),\n", " (622, u'dyer'),\n", " (622, u'dyer'),\n", " (734, u'pb'),\n", " (734, u'pb'),\n", " (734, u'lake'),\n", " (735, u'pb'),\n", " (735, None),\n", " (735, None),\n", " (751, u'pb'),\n", " (751, u'pb'),\n", " (751, u'lake'),\n", " (752, u'roe'),\n", " (752, u'lake'),\n", " (752, u'lake'),\n", " (752, u'lake'),\n", " (837, u'roe'),\n", " (837, u'lake'),\n", " (837, u'lake'),\n", " (844, u'roe')]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select taken, person from Survey order by taken asc, person desc;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is easier to understand if we also remove duplicates:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
takenperson
619dyer
622dyer
734pb
734lake
735pb
735None
751pb
751lake
752roe
752lake
837roe
837lake
844roe
" ], "text/plain": [ "[(619, u'dyer'),\n", " (622, u'dyer'),\n", " (734, u'pb'),\n", " (734, u'lake'),\n", " (735, u'pb'),\n", " (735, None),\n", " (751, u'pb'),\n", " (751, u'lake'),\n", " (752, u'roe'),\n", " (752, u'lake'),\n", " (837, u'roe'),\n", " (837, u'lake'),\n", " (844, u'roe')]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select distinct taken, person from Survey order by taken asc, person desc;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercises\n", "\n", "1. Write a query that returns the distinct dates in the `Visited` table.\n", "\n", "2. Write a query that displays the full names of the scientists in the `Person` table, ordered by family name." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
personalfamily
FrankDanforth
WilliamDyer
AndersonLake
FrankPabodie
ValentinaRoerich
" ], "text/plain": [ "[(u'Frank', u'Danforth'),\n", " (u'William', u'Dyer'),\n", " (u'Anderson', u'Lake'),\n", " (u'Frank', u'Pabodie'),\n", " (u'Valentina', u'Roerich')]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select personal, family from Person order by family;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the most powerful features of a database is\n", "the ability to **filter** data,\n", "*i.e.*,\n", "to select only those records that match certain criteria.\n", "For example,\n", "suppose we want to see when a particular site was visited.\n", "We can select these records from the `Visited` table\n", "by using a `where` clause in our query:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identsitedated
619DR-11927-02-08
622DR-11927-02-10
844DR-11932-03-22
" ], "text/plain": [ "[(619, u'DR-1', u'1927-02-08'),\n", " (622, u'DR-1', u'1927-02-10'),\n", " (844, u'DR-1', u'1932-03-22')]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Visited where site='DR-1';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The database manager executes this query in two stages.\n", "First,\n", "it checks at each row in the `Visited` table\n", "to see which ones satisfy the `where`.\n", "It then uses the column names following the `select` keyword\n", "to determine what columns to display." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This processing order means that\n", "we can filter records using `where`\n", "based on values in columns that aren't then displayed:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ident
619
622
844
" ], "text/plain": [ "[(619,), (622,), (844,)]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select ident from Visited where site='DR-1';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"SQL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We can use many other Boolean operators to filter our data.\n", "\n", "For example, we can ask for all information from the DR-1 site collected since 1930:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identsitedated
844DR-11932-03-22
" ], "text/plain": [ "[(844, u'DR-1', u'1932-03-22')]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Visited where (site='DR-1') and (dated>='1930-00-00');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(The parentheses around the individual tests aren't strictly required,\n", "but they help make the query easier to read.)\n", "\n", "> Most database managers have a special data type for dates.\n", "> In fact, many have two:\n", "> one for dates,\n", "> such as \"May 31, 1971\",\n", "> and one for durations,\n", "> such as \"31 days\".\n", "> SQLite doesn't:\n", "> instead,\n", "> it stores dates as either text\n", "> (in the ISO-8601 standard format \"YYYY-MM-DD HH:MM:SS.SSSS\"),\n", "> real numbers\n", "> (the number of days since November 24, 4714 BCE),\n", "> or integers\n", "> (the number of seconds since midnight, January 1, 1970)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to find out what measurements were taken by either Lake or Roerich,\n", "we can combine the tests on their names using `or`:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
takenpersonquantreading
734lakesal0.05
751lakesal0.1
752lakerad2.19
752lakesal0.09
752laketemp-16.0
752roesal41.6
837lakerad1.46
837lakesal0.21
837roesal22.5
844roerad11.25
" ], "text/plain": [ "[(734, u'lake', u'sal', 0.05),\n", " (751, u'lake', u'sal', 0.1),\n", " (752, u'lake', u'rad', 2.19),\n", " (752, u'lake', u'sal', 0.09),\n", " (752, u'lake', u'temp', -16.0),\n", " (752, u'roe', u'sal', 41.6),\n", " (837, u'lake', u'rad', 1.46),\n", " (837, u'lake', u'sal', 0.21),\n", " (837, u'roe', u'sal', 22.5),\n", " (844, u'roe', u'rad', 11.25)]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Survey where person='lake' or person='roe';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively,\n", "we can use `in` to see if a value is in a specific set:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
takenpersonquantreading
734lakesal0.05
751lakesal0.1
752lakerad2.19
752lakesal0.09
752laketemp-16.0
752roesal41.6
837lakerad1.46
837lakesal0.21
837roesal22.5
844roerad11.25
" ], "text/plain": [ "[(734, u'lake', u'sal', 0.05),\n", " (751, u'lake', u'sal', 0.1),\n", " (752, u'lake', u'rad', 2.19),\n", " (752, u'lake', u'sal', 0.09),\n", " (752, u'lake', u'temp', -16.0),\n", " (752, u'roe', u'sal', 41.6),\n", " (837, u'lake', u'rad', 1.46),\n", " (837, u'lake', u'sal', 0.21),\n", " (837, u'roe', u'sal', 22.5),\n", " (844, u'roe', u'rad', 11.25)]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Survey where person in ('lake', 'roe');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can combine `and` with `or`,\n", "but we need to be careful about which operator is executed first.\n", "If we *don't* use parentheses,\n", "we get this:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
takenpersonquantreading
734lakesal0.05
751lakesal0.1
752lakesal0.09
752roesal41.6
837lakesal0.21
837roesal22.5
844roerad11.25
" ], "text/plain": [ "[(734, u'lake', u'sal', 0.05),\n", " (751, u'lake', u'sal', 0.1),\n", " (752, u'lake', u'sal', 0.09),\n", " (752, u'roe', u'sal', 41.6),\n", " (837, u'lake', u'sal', 0.21),\n", " (837, u'roe', u'sal', 22.5),\n", " (844, u'roe', u'rad', 11.25)]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Survey where quant='sal' and person='lake' or person='roe';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "which is salinity measurements by Lake,\n", "and *any* measurement by Roerich.\n", "We probably want this instead:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
takenpersonquantreading
734lakesal0.05
751lakesal0.1
752lakesal0.09
752roesal41.6
837lakesal0.21
837roesal22.5
" ], "text/plain": [ "[(734, u'lake', u'sal', 0.05),\n", " (751, u'lake', u'sal', 0.1),\n", " (752, u'lake', u'sal', 0.09),\n", " (752, u'roe', u'sal', 41.6),\n", " (837, u'lake', u'sal', 0.21),\n", " (837, u'roe', u'sal', 22.5)]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Survey where quant='sal' and (person='lake' or person='roe');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally,\n", "we can use `distinct` with `where`\n", "to give a second level of filtering:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
personquant
lakesal
lakerad
laketemp
roesal
roerad
" ], "text/plain": [ "[(u'lake', u'sal'),\n", " (u'lake', u'rad'),\n", " (u'lake', u'temp'),\n", " (u'roe', u'sal'),\n", " (u'roe', u'rad')]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select distinct person, quant from Survey where person='lake' or person='roe';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But remember:\n", "`distinct` is applied to the values displayed in the chosen columns,\n", "not to the entire rows as they are being processed.\n", "\n", "> What we have just done is how most people \"grow\" their SQL queries.\n", "> We started with something simple that did part of what we wanted,\n", "> then added more clauses one by one,\n", "> testing their effects as we went.\n", "> This is a good strategy—in fact,\n", "> for complex queries it's often the *only* strategy—but\n", "> it depends on quick turnaround,\n", "> and on us recognizing the right answer when we get it.\n", "> \n", "> The best way to achieve quick turnaround is often\n", "> to put a subset of data in a temporary database\n", "> and run our queries against that,\n", "> or to fill a small database with synthesized records.\n", "> For example,\n", "> instead of trying our queries against an actual database of 20 million records,\n", "> we could run it against a sample of ten thousand,\n", "> or write a small program to generate ten thousand random (but plausible) records\n", "> and use that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also filter results using pattern-matching. The SQL test `*column-name* like *pattern*` is true if the value in the named column matches the pattern given; the character '%' can be used any number of times in the pattern to mean \"match zero or more characters\".\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
Expression Value
'a' like 'a' True
'a' like '%a' True
'b' like '%a' False
'alpha' like 'a%' True
'alpha' like 'a%p%' True
\n", " \n", "The expression `*column-name* not like *pattern*` inverts the test. \n", "\n", "For example, we can write a query that finds all the records in `Visited` that *aren't* from sites labelled 'DR-something'." ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identsitedated
837MSK-41932-01-14
" ], "text/plain": [ "[(837, u'MSK-4', u'1932-01-14')]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Visited where site not like 'DR-%';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example\n", "\n", "Normalized salinity readings are supposed to be between 0.0 and 1.0. Write a query that selects all records from `Survey` with salinity values outside this range." ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
takenpersonquantreading
752roesal41.6
837roesal22.5
" ], "text/plain": [ "[(752, u'roe', u'sal', 41.6), (837, u'roe', u'sal', 22.5)]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Survey where quant='sal' and (reading>1 or reading<0);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculating New Values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After carefully re-reading the expedition logs,\n", "we realize that the radiation measurements they report\n", "may need to be corrected upward by 5%.\n", "Rather than modifying the stored data,\n", "we can do this calculation on the fly\n", "as part of our query:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
05 * reading
10.311
8.19
8.8305
7.581
4.5675
2.2995
1.533
11.8125
" ], "text/plain": [ "[(10.311,),\n", " (8.19,),\n", " (8.8305,),\n", " (7.581,),\n", " (4.5675,),\n", " (2.2995,),\n", " (1.533,),\n", " (11.8125,)]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select 1.05 * reading from Survey where quant='rad';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we run the query,\n", "the expression `1.05 * reading` is evaluated for each row.\n", "Expressions can use any of the fields,\n", "all of usual arithmetic operators,\n", "and a variety of common functions.\n", "(Exactly which ones depends on which database manager is being used.)\n", "For example,\n", "we can convert temperature readings from Fahrenheit to Celsius\n", "and round to two decimal places:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
takenround(5*(reading-32)/9, 2)
734-29.72
735-32.22
751-28.06
752-26.67
" ], "text/plain": [ "[(734, -29.72), (735, -32.22), (751, -28.06), (752, -26.67)]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select taken, round(5*(reading-32)/9, 2) from Survey where quant='temp';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also combine values from different fields,\n", "for example by using the string concatenation operator `||`:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
personal || ' ' || family
William Dyer
Frank Pabodie
Anderson Lake
Valentina Roerich
Frank Danforth
" ], "text/plain": [ "[(u'William Dyer',),\n", " (u'Frank Pabodie',),\n", " (u'Anderson Lake',),\n", " (u'Valentina Roerich',),\n", " (u'Frank Danforth',)]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select personal || ' ' || family from Person;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `union` operator combines the results of two queries:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identpersonalfamily
dyerWilliamDyer
roeValentinaRoerich
" ], "text/plain": [ "[(u'dyer', u'William', u'Dyer'), (u'roe', u'Valentina', u'Roerich')]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Person where ident='dyer' union select * from Person where ident='roe';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a more complicated example, let's assume that Roerich mistakenly recorded all her measurements as percentages, rather than decimal values. We can use `union` to create a consolidated list of salinity measurements in which Roerich's (and only Roerich's) have been corrected." ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
taken
6190.13
6220.09
7340.05
7510.1
7520.09
7520.416
8370.21
8370.225
" ], "text/plain": [ "[(619, 0.13),\n", " (622, 0.09),\n", " (734, 0.05),\n", " (751, 0.1),\n", " (752, 0.09),\n", " (752, 0.41600000000000004),\n", " (837, 0.21),\n", " (837, 0.225)]" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select taken, reading/100. from Survey where person='roe' and quant='sal' \n", "union select taken, reading from Survey where not person='roe' and quant='sal';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example\n", "\n", "The site identifiers in the `Visited` table have two parts separated by a '-':" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
site
DR-1
DR-3
MSK-4
" ], "text/plain": [ "[(u'DR-1',), (u'DR-3',), (u'MSK-4',)]" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select distinct site from Visited;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some major site identifiers are two letters long and some are three.\n", "The \"in string\" function `instr(X, Y)`\n", "returns the 1-based index of the first occurrence of string Y in string X,\n", "or 0 if Y does not exist in X.\n", "The substring function `substr(X, I, Z)`\n", "returns the substring of X starting at index I that is Z characters long.\n", "Use these two functions to produce a list of unique major site identifiers.\n", "(For this data,\n", "the list should contain only \"DR\" and \"MSK\")." ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
substr(site, 0, instr(site, '-'))
DR
MSK
" ], "text/plain": [ "[(u'DR',), (u'MSK',)]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select distinct substr(site, 0, instr(site, '-')) from visited;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Real-world data is rarely complete.\n", "Databases represent missing values using special value called `null`.\n", "`null` is not zero, `False`, or the empty string;\n", "it is a one-of-a-kind value that means \"nothing here\".\n", "Dealing with `null` requires a few special tricks\n", "and some careful thinking.\n", "\n", "To start,\n", "let's have a look at the `Visited` table.\n", "There are eight records,\n", "but #752 doesn't have a date—or rather,\n", "its date is null:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identsitedated
619DR-11927-02-08
622DR-11927-02-10
734DR-31939-01-07
735DR-31930-01-12
751DR-31930-02-26
752DR-3None
837MSK-41932-01-14
844DR-11932-03-22
" ], "text/plain": [ "[(619, u'DR-1', u'1927-02-08'),\n", " (622, u'DR-1', u'1927-02-10'),\n", " (734, u'DR-3', u'1939-01-07'),\n", " (735, u'DR-3', u'1930-01-12'),\n", " (751, u'DR-3', u'1930-02-26'),\n", " (752, u'DR-3', None),\n", " (837, u'MSK-4', u'1932-01-14'),\n", " (844, u'DR-1', u'1932-03-22')]" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Visited;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Null doesn't behave like other values.\n", "If we select the records that come before 1930:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identsitedated
619DR-11927-02-08
622DR-11927-02-10
" ], "text/plain": [ "[(619, u'DR-1', u'1927-02-08'), (622, u'DR-1', u'1927-02-10')]" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Visited where dated<'1930-00-00';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "we get two results,\n", "and if we select the ones that come during or after 1930:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identsitedated
734DR-31939-01-07
735DR-31930-01-12
751DR-31930-02-26
837MSK-41932-01-14
844DR-11932-03-22
" ], "text/plain": [ "[(734, u'DR-3', u'1939-01-07'),\n", " (735, u'DR-3', u'1930-01-12'),\n", " (751, u'DR-3', u'1930-02-26'),\n", " (837, u'MSK-4', u'1932-01-14'),\n", " (844, u'DR-1', u'1932-03-22')]" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Visited where dated>='1930-00-00';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "we get five,\n", "but record #752 isn't in either set of results.\n", "The reason is that\n", "`null<'1930-00-00'`\n", "is neither true nor false:\n", "null means, \"We don't know,\"\n", "and if we don't know the value on the left side of a comparison,\n", "we don't know whether the comparison is true or false.\n", "Since databases represent \"don't know\" as null,\n", "the value of `null<'1930-00-00'`\n", "is actually `null`.\n", "`null>='1930-00-00'` is also null\n", "because we can't answer to that question either.\n", "And since the only records kept by a `where`\n", "are those for which the test is true,\n", "record #752 isn't included in either set of results.\n", "\n", "Comparisons aren't the only operations that behave this way with nulls.\n", "`1+null` is `null`,\n", "`5*null` is `null`,\n", "`log(null)` is `null`,\n", "and so on.\n", "In particular,\n", "comparing things to null with = and != produces null:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", "
identsitedated
" ], "text/plain": [ "[]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Visited where dated=NULL;" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", "
identsitedated
" ], "text/plain": [ "[]" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Visited where dated!=NULL;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To check whether a value is `null` or not,\n", "we must use a special test `is null`:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identsitedated
752DR-3None
" ], "text/plain": [ "[(752, u'DR-3', None)]" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Visited where dated is NULL;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or its inverse `is not null`:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identsitedated
619DR-11927-02-08
622DR-11927-02-10
734DR-31939-01-07
735DR-31930-01-12
751DR-31930-02-26
837MSK-41932-01-14
844DR-11932-03-22
" ], "text/plain": [ "[(619, u'DR-1', u'1927-02-08'),\n", " (622, u'DR-1', u'1927-02-10'),\n", " (734, u'DR-3', u'1939-01-07'),\n", " (735, u'DR-3', u'1930-01-12'),\n", " (751, u'DR-3', u'1930-02-26'),\n", " (837, u'MSK-4', u'1932-01-14'),\n", " (844, u'DR-1', u'1932-03-22')]" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Visited where dated is not NULL;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Null values cause headaches wherever they appear.\n", "For example,\n", "suppose we want to find all the salinity measurements\n", "that weren't taken by Dyer.\n", "It's natural to write the query like this:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
takenpersonquantreading
619dyersal0.13
622dyersal0.09
752roesal41.6
837roesal22.5
" ], "text/plain": [ "[(619, u'dyer', u'sal', 0.13),\n", " (622, u'dyer', u'sal', 0.09),\n", " (752, u'roe', u'sal', 41.6),\n", " (837, u'roe', u'sal', 22.5)]" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Survey where quant='sal' and person!='lake';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "but this query filters omits the records\n", "where we don't know who took the measurement.\n", "Once again,\n", "the reason is that when `person` is `null`,\n", "the `!=` comparison produces `null`,\n", "so the record isn't kept in our results.\n", "If we want to keep these records\n", "we need to add an explicit check:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
takenpersonquantreading
619dyersal0.13
622dyersal0.09
735Nonesal0.06
752roesal41.6
837roesal22.5
" ], "text/plain": [ "[(619, u'dyer', u'sal', 0.13),\n", " (622, u'dyer', u'sal', 0.09),\n", " (735, None, u'sal', 0.06),\n", " (752, u'roe', u'sal', 41.6),\n", " (837, u'roe', u'sal', 22.5)]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Survey where quant='sal' and (person!='lake' or person is null);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We still have to decide whether this is the right thing to do or not.\n", "If we want to be absolutely sure that\n", "we aren't including any measurements by Lake in our results,\n", "we need to exclude all the records for which we don't know who did the work.\n", "\n", "Note also that the following does not work as you might expect:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identsitedated
619DR-11927-02-08
" ], "text/plain": [ "[(619, u'DR-1', u'1927-02-08')]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Visited where dated in ('1927-02-08', null);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Write a query that sorts the records in `Visited` by date, omitting entries for which the date is not known (i.e., is null)." ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identsitedated
619DR-11927-02-08
622DR-11927-02-10
735DR-31930-01-12
751DR-31930-02-26
837MSK-41932-01-14
844DR-11932-03-22
734DR-31939-01-07
" ], "text/plain": [ "[(619, u'DR-1', u'1927-02-08'),\n", " (622, u'DR-1', u'1927-02-10'),\n", " (735, u'DR-3', u'1930-01-12'),\n", " (751, u'DR-3', u'1930-02-26'),\n", " (837, u'MSK-4', u'1932-01-14'),\n", " (844, u'DR-1', u'1932-03-22'),\n", " (734, u'DR-3', u'1939-01-07')]" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from visited where dated is not null order by dated;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now want to calculate **ranges** and **averages** for our data.\n", "We know how to select all of the dates from the `Visited` table:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dated
1927-02-08
1927-02-10
1939-01-07
1930-01-12
1930-02-26
None
1932-01-14
1932-03-22
" ], "text/plain": [ "[(u'1927-02-08',),\n", " (u'1927-02-10',),\n", " (u'1939-01-07',),\n", " (u'1930-01-12',),\n", " (u'1930-02-26',),\n", " (None,),\n", " (u'1932-01-14',),\n", " (u'1932-03-22',)]" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select dated from Visited;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "but to combine them, we must use an aggregation function such as `min` or `max`.\n", "Each of these functions takes a set of records as input, and produces a single record as output:" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
min(dated)
1927-02-08
" ], "text/plain": [ "[(u'1927-02-08',)]" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select min(dated) from Visited;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"SQL" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
max(dated)
1939-01-07
" ], "text/plain": [ "[(u'1939-01-07',)]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select max(dated) from Visited;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`min` and `max` are just two of\n", "the aggregation functions built into SQL.\n", "Three others are `avg`,\n", "`count`,\n", "and `sum`:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
avg(reading)
7.20333333333
" ], "text/plain": [ "[(7.203333333333335,)]" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select avg(reading) from Survey where quant='sal';" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
count(reading)
9
" ], "text/plain": [ "[(9,)]" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select count(reading) from Survey where quant='sal';" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
sum(reading)
64.83
" ], "text/plain": [ "[(64.83000000000001,)]" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select sum(reading) from Survey where quant='sal';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We used `count(reading)` here,\n", "but we could just as easily have counted `quant`\n", "or any other field in the table,\n", "or even used `count(*)`,\n", "since the function doesn't care about the values themselves,\n", "just how many values there are.\n", "\n", "SQL lets us do several aggregations at once.\n", "We can,\n", "for example,\n", "find the range of sensible salinity measurements:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
min(reading)max(reading)
0.050.21
" ], "text/plain": [ "[(0.05, 0.21)]" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select min(reading), max(reading) from Survey where quant='sal' and reading<=1.0;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also combine aggregated results with raw results,\n", "although the output might surprise you:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
personcount(*)
lake7
" ], "text/plain": [ "[(u'lake', 7)]" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select person, count(*) from Survey where quant='sal' and reading<=1.0;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why does Lake's name appear rather than Roerich's or Dyer's?\n", "The answer is that when it has to aggregate a field,\n", "but isn't told how to,\n", "the database manager chooses an actual value from the input set.\n", "It might use the first one processed,\n", "the last one,\n", "or something else entirely.\n", "\n", "Another important fact is that when there are no values to aggregate,\n", "aggregation's result is \"don't know\"\n", "rather than zero or some other arbitrary value:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
personmax(reading)sum(reading)
NoneNoneNone
" ], "text/plain": [ "[(None, None, None)]" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select person, max(reading), sum(reading) from Survey where quant='missing';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One final important feature of aggregation functions is that\n", "they are inconsistent with the rest of SQL in a very useful way.\n", "If we add two values,\n", "and one of them is null,\n", "the result is null.\n", "By extension,\n", "if we use `sum` to add all the values in a set,\n", "and any of those values are null,\n", "the result should also be null.\n", "It's much more useful,\n", "though,\n", "for aggregation functions to ignore null values\n", "and only combine those that are non-null.\n", "This behavior lets us write our queries as:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
min(dated)
1927-02-08
" ], "text/plain": [ "[(u'1927-02-08',)]" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select min(dated) from Visited;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "instead of always having to filter explicitly:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
min(dated)
1927-02-08
" ], "text/plain": [ "[(u'1927-02-08',)]" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select min(dated) from Visited where dated is not null;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Aggregating all records at once doesn't always make sense.\n", "For example, suppose we suspect that there is a systematic bias in the data,\n", "and that some scientists' radiation readings are higher than others.\n", "\n", "We know that this doesn't work:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
personcount(reading)round(avg(reading), 2)
roe86.56
" ], "text/plain": [ "[(u'roe', 8, 6.56)]" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select person, count(reading), round(avg(reading), 2)\n", "from Survey\n", "where quant='rad';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "because the database manager selects a single arbitrary scientist's name\n", "rather than aggregating separately for each scientist.\n", "Since there are only five scientists,\n", "she could write five queries of the form:" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
personcount(reading)round(avg(reading), 2)
dyer28.81
" ], "text/plain": [ "[(u'dyer', 2, 8.81)]" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select person, count(reading), round(avg(reading), 2)\n", "from Survey\n", "where quant='rad'\n", "and person='dyer';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "but this would be tedious,\n", "and if she ever had a data set with fifty or five hundred scientists,\n", "the chances of her getting all of those queries right is small.\n", "\n", "What we need to do is\n", "tell the database manager to aggregate the hours for each scientist separately\n", "using a `group by` clause (recall a similar function in Pandas):" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
personcount(reading)round(avg(reading), 2)
dyer28.81
lake21.82
pb36.66
roe111.25
" ], "text/plain": [ "[(u'dyer', 2, 8.81), (u'lake', 2, 1.82), (u'pb', 3, 6.66), (u'roe', 1, 11.25)]" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select person, count(reading), round(avg(reading), 2)\n", "from Survey\n", "where quant='rad'\n", "group by person;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`group by` does exactly what its name implies:\n", "groups all the records with the same value for the specified field together\n", "so that aggregation can process each batch separately.\n", "Since all the records in each batch have the same value for `person`,\n", "it no longer matters that the database manager\n", "is picking an arbitrary one to display\n", "alongside the aggregated `reading` values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just as we can sort by multiple criteria at once,\n", "we can also group by multiple criteria.\n", "To get the average reading by scientist and quantity measured,\n", "for example,\n", "we just add another field to the `group by` clause:" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
personquantcount(reading)round(avg(reading), 2)
Nonesal10.06
Nonetemp1-26.0
dyerrad28.81
dyersal20.11
lakerad21.82
lakesal40.11
laketemp1-16.0
pbrad36.66
pbtemp2-20.0
roerad111.25
roesal232.05
" ], "text/plain": [ "[(None, u'sal', 1, 0.06),\n", " (None, u'temp', 1, -26.0),\n", " (u'dyer', u'rad', 2, 8.81),\n", " (u'dyer', u'sal', 2, 0.11),\n", " (u'lake', u'rad', 2, 1.82),\n", " (u'lake', u'sal', 4, 0.11),\n", " (u'lake', u'temp', 1, -16.0),\n", " (u'pb', u'rad', 3, 6.66),\n", " (u'pb', u'temp', 2, -20.0),\n", " (u'roe', u'rad', 1, 11.25),\n", " (u'roe', u'sal', 2, 32.05)]" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select person, quant, count(reading), round(avg(reading), 2)\n", "from Survey\n", "group by person, quant;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we have added `person` to the list of fields displayed,\n", "since the results wouldn't make much sense otherwise.\n", "\n", "Let's go one step further and remove all the entries\n", "where we don't know who took the measurement:" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
personquantcount(reading)round(avg(reading), 2)
dyerrad28.81
dyersal20.11
lakerad21.82
lakesal40.11
laketemp1-16.0
pbrad36.66
pbtemp2-20.0
roerad111.25
roesal232.05
" ], "text/plain": [ "[(u'dyer', u'rad', 2, 8.81),\n", " (u'dyer', u'sal', 2, 0.11),\n", " (u'lake', u'rad', 2, 1.82),\n", " (u'lake', u'sal', 4, 0.11),\n", " (u'lake', u'temp', 1, -16.0),\n", " (u'pb', u'rad', 3, 6.66),\n", " (u'pb', u'temp', 2, -20.0),\n", " (u'roe', u'rad', 1, 11.25),\n", " (u'roe', u'sal', 2, 32.05)]" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select person, quant, count(reading), round(avg(reading), 2)\n", "from Survey\n", "where person is not null\n", "group by person, quant\n", "order by person, quant;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The order of operations for this query is as follows:\n", "\n", "1. selected records from the `Survey` table\n", " where the `person` field was not null;\n", "\n", "2. grouped those records into subsets\n", " so that the `person` and `quant` values in each subset\n", " were the same;\n", "\n", "3. ordered those subsets first by `person`,\n", " and then within each sub-group by `quant`;\n", " and\n", "\n", "4. counted the number of records in each subset,\n", " calculated the average `reading` in each,\n", " and chose a `person` and `quant` value from each\n", " (it doesn't matter which ones,\n", " since they're all equal)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose now we wanted to *center* some of the readings, by subtracting the mean. One might expect the following would do the trick:" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
reading - avg(reading)
4.6875
" ], "text/plain": [ "[(4.6875,)]" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select reading - avg(reading) from Survey where quant='rad';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, you cannot mix aggregated and non-aggregated values in the same query. The (somewhat awkward) solution is to calculate the average in a separate query, and use that value in a query of the non-aggregated values." ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
reading - (select avg(reading) from Survey where quant='rad')
3.2575
1.2375
1.8475
0.6575
-2.2125
-4.3725
-5.1025
4.6875
" ], "text/plain": [ "[(3.2575000000000003,),\n", " (1.2374999999999998,),\n", " (1.8475000000000001,),\n", " (0.6574999999999998,),\n", " (-2.2125000000000004,),\n", " (-4.3725000000000005,),\n", " (-5.1025,),\n", " (4.6875,)]" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select reading - (select avg(reading) from Survey where quant='rad') \n", "from Survey where quant='rad';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "The function `group_concat(field, separator)` concatenates all the values in a field using the specified separator character (or ',' if the separator isn't specified). Use this to produce a one-line list of scientists' names, such as:\n", "\n", "~~~\n", "William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth\n", "~~~\n", "\n", "Can you find a way to order the list by surname?" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
group_concat(personal || ' ' || family, ', ')
Frank Danforth, William Dyer, Anderson Lake, Frank Pabodie, Valentina Roerich
" ], "text/plain": [ "[(u'Frank Danforth, William Dyer, Anderson Lake, Frank Pabodie, Valentina Roerich',)]" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select group_concat(personal || ' ' || family, ', ') from (select * from person order by family);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Combining Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we wanted to submit the data to a web site that aggregates historical meteorological data,\n", "which needs the data formated as: latitude, longitude, date, quantity, and reading.\n", "However, the latitudes and longitudes are in the `Site` table, while the dates of measurements are in the `Visited` table, and the readings themselves are in the `Survey` table. These fields need to be sensibly combined.\n", "\n", "\n", "The SQL command to do this is `join`." ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namelatlongidentsitedated
DR-1-49.85-128.57619DR-11927-02-08
DR-1-49.85-128.57622DR-11927-02-10
DR-1-49.85-128.57734DR-31939-01-07
DR-1-49.85-128.57735DR-31930-01-12
DR-1-49.85-128.57751DR-31930-02-26
DR-1-49.85-128.57752DR-3None
DR-1-49.85-128.57837MSK-41932-01-14
DR-1-49.85-128.57844DR-11932-03-22
DR-3-47.15-126.72619DR-11927-02-08
DR-3-47.15-126.72622DR-11927-02-10
DR-3-47.15-126.72734DR-31939-01-07
DR-3-47.15-126.72735DR-31930-01-12
DR-3-47.15-126.72751DR-31930-02-26
DR-3-47.15-126.72752DR-3None
DR-3-47.15-126.72837MSK-41932-01-14
DR-3-47.15-126.72844DR-11932-03-22
MSK-4-48.87-123.4619DR-11927-02-08
MSK-4-48.87-123.4622DR-11927-02-10
MSK-4-48.87-123.4734DR-31939-01-07
MSK-4-48.87-123.4735DR-31930-01-12
MSK-4-48.87-123.4751DR-31930-02-26
MSK-4-48.87-123.4752DR-3None
MSK-4-48.87-123.4837MSK-41932-01-14
MSK-4-48.87-123.4844DR-11932-03-22
" ], "text/plain": [ "[(u'DR-1', -49.85, -128.57, 619, u'DR-1', u'1927-02-08'),\n", " (u'DR-1', -49.85, -128.57, 622, u'DR-1', u'1927-02-10'),\n", " (u'DR-1', -49.85, -128.57, 734, u'DR-3', u'1939-01-07'),\n", " (u'DR-1', -49.85, -128.57, 735, u'DR-3', u'1930-01-12'),\n", " (u'DR-1', -49.85, -128.57, 751, u'DR-3', u'1930-02-26'),\n", " (u'DR-1', -49.85, -128.57, 752, u'DR-3', None),\n", " (u'DR-1', -49.85, -128.57, 837, u'MSK-4', u'1932-01-14'),\n", " (u'DR-1', -49.85, -128.57, 844, u'DR-1', u'1932-03-22'),\n", " (u'DR-3', -47.15, -126.72, 619, u'DR-1', u'1927-02-08'),\n", " (u'DR-3', -47.15, -126.72, 622, u'DR-1', u'1927-02-10'),\n", " (u'DR-3', -47.15, -126.72, 734, u'DR-3', u'1939-01-07'),\n", " (u'DR-3', -47.15, -126.72, 735, u'DR-3', u'1930-01-12'),\n", " (u'DR-3', -47.15, -126.72, 751, u'DR-3', u'1930-02-26'),\n", " (u'DR-3', -47.15, -126.72, 752, u'DR-3', None),\n", " (u'DR-3', -47.15, -126.72, 837, u'MSK-4', u'1932-01-14'),\n", " (u'DR-3', -47.15, -126.72, 844, u'DR-1', u'1932-03-22'),\n", " (u'MSK-4', -48.87, -123.4, 619, u'DR-1', u'1927-02-08'),\n", " (u'MSK-4', -48.87, -123.4, 622, u'DR-1', u'1927-02-10'),\n", " (u'MSK-4', -48.87, -123.4, 734, u'DR-3', u'1939-01-07'),\n", " (u'MSK-4', -48.87, -123.4, 735, u'DR-3', u'1930-01-12'),\n", " (u'MSK-4', -48.87, -123.4, 751, u'DR-3', u'1930-02-26'),\n", " (u'MSK-4', -48.87, -123.4, 752, u'DR-3', None),\n", " (u'MSK-4', -48.87, -123.4, 837, u'MSK-4', u'1932-01-14'),\n", " (u'MSK-4', -48.87, -123.4, 844, u'DR-1', u'1932-03-22')]" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Site join Visited;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`join` creates the **cross product** of two tables, that is, it joins each record of one with each record of the other to give all possible combinations. Since there are three records in `Site` and eight in `Visited`, the join's output has 24 records. And since each table has three fields, the output has six fields.\n", " \n", "What the join *hasn't* done is figure out if the records being joined have anything to do with each other. It has no way of knowing whether they do or not until we tell it how. To do that, we add a clause specifying that we're only interested in combinations that have the same site name:" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namelatlongidentsitedated
DR-1-49.85-128.57619DR-11927-02-08
DR-1-49.85-128.57622DR-11927-02-10
DR-1-49.85-128.57844DR-11932-03-22
DR-3-47.15-126.72734DR-31939-01-07
DR-3-47.15-126.72735DR-31930-01-12
DR-3-47.15-126.72751DR-31930-02-26
DR-3-47.15-126.72752DR-3None
MSK-4-48.87-123.4837MSK-41932-01-14
" ], "text/plain": [ "[(u'DR-1', -49.85, -128.57, 619, u'DR-1', u'1927-02-08'),\n", " (u'DR-1', -49.85, -128.57, 622, u'DR-1', u'1927-02-10'),\n", " (u'DR-1', -49.85, -128.57, 844, u'DR-1', u'1932-03-22'),\n", " (u'DR-3', -47.15, -126.72, 734, u'DR-3', u'1939-01-07'),\n", " (u'DR-3', -47.15, -126.72, 735, u'DR-3', u'1930-01-12'),\n", " (u'DR-3', -47.15, -126.72, 751, u'DR-3', u'1930-02-26'),\n", " (u'DR-3', -47.15, -126.72, 752, u'DR-3', None),\n", " (u'MSK-4', -48.87, -123.4, 837, u'MSK-4', u'1932-01-14')]" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from Site join Visited on Site.name=Visited.site;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`on` does the same job as `where`: it only keeps records that pass some test. (The difference between the two is that `on` filters records as they're being created, while `where` waits until the join is done and then does the filtering.) Once we add this to our query, the database manager throws away records that combined information about two different sites, leaving us with just the ones we want.\n", " \n", "Notice that we used `.` to specify field names in the output of the join. We do this because tables can have fields with the same name, and we need to be specific which ones we're talking about. For example, if we joined the `person` and `visited` tables, the result would inherit a field called `ident` from each of the original tables.\n", "\n", "We can now use the same dotted notation to select the three columns we actually want out of our join:" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
latlongdated
-49.85-128.571927-02-08
-49.85-128.571927-02-10
-49.85-128.571932-03-22
-47.15-126.72None
-47.15-126.721930-01-12
-47.15-126.721930-02-26
-47.15-126.721939-01-07
-48.87-123.41932-01-14
" ], "text/plain": [ "[(-49.85, -128.57, u'1927-02-08'),\n", " (-49.85, -128.57, u'1927-02-10'),\n", " (-49.85, -128.57, u'1932-03-22'),\n", " (-47.15, -126.72, None),\n", " (-47.15, -126.72, u'1930-01-12'),\n", " (-47.15, -126.72, u'1930-02-26'),\n", " (-47.15, -126.72, u'1939-01-07'),\n", " (-48.87, -123.4, u'1932-01-14')]" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select Site.lat, Site.long, Visited.dated\n", "from Site join Visited\n", "on Site.name=Visited.site;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If joining two tables is good,\n", "joining many tables must be better.\n", "In fact,\n", "we can join any number of tables\n", "simply by adding more `join` clauses to our query,\n", "and more `on` tests to filter out combinations of records\n", "that don't make sense:" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
latlongdatedquantreading
-49.85-128.571927-02-08rad9.82
-49.85-128.571927-02-08sal0.13
-49.85-128.571927-02-10rad7.8
-49.85-128.571927-02-10sal0.09
-47.15-126.721939-01-07rad8.41
-47.15-126.721939-01-07sal0.05
-47.15-126.721939-01-07temp-21.5
-47.15-126.721930-01-12rad7.22
-47.15-126.721930-01-12sal0.06
-47.15-126.721930-01-12temp-26.0
-47.15-126.721930-02-26rad4.35
-47.15-126.721930-02-26sal0.1
-47.15-126.721930-02-26temp-18.5
-48.87-123.41932-01-14rad1.46
-48.87-123.41932-01-14sal0.21
-48.87-123.41932-01-14sal22.5
-49.85-128.571932-03-22rad11.25
" ], "text/plain": [ "[(-49.85, -128.57, u'1927-02-08', u'rad', 9.82),\n", " (-49.85, -128.57, u'1927-02-08', u'sal', 0.13),\n", " (-49.85, -128.57, u'1927-02-10', u'rad', 7.8),\n", " (-49.85, -128.57, u'1927-02-10', u'sal', 0.09),\n", " (-47.15, -126.72, u'1939-01-07', u'rad', 8.41),\n", " (-47.15, -126.72, u'1939-01-07', u'sal', 0.05),\n", " (-47.15, -126.72, u'1939-01-07', u'temp', -21.5),\n", " (-47.15, -126.72, u'1930-01-12', u'rad', 7.22),\n", " (-47.15, -126.72, u'1930-01-12', u'sal', 0.06),\n", " (-47.15, -126.72, u'1930-01-12', u'temp', -26.0),\n", " (-47.15, -126.72, u'1930-02-26', u'rad', 4.35),\n", " (-47.15, -126.72, u'1930-02-26', u'sal', 0.1),\n", " (-47.15, -126.72, u'1930-02-26', u'temp', -18.5),\n", " (-48.87, -123.4, u'1932-01-14', u'rad', 1.46),\n", " (-48.87, -123.4, u'1932-01-14', u'sal', 0.21),\n", " (-48.87, -123.4, u'1932-01-14', u'sal', 22.5),\n", " (-49.85, -128.57, u'1932-03-22', u'rad', 11.25)]" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select Site.lat, Site.long, Visited.dated, Survey.quant, Survey.reading\n", "from Site join Visited join Survey\n", "on Site.name=Visited.site\n", "and Visited.ident=Survey.taken\n", "and Visited.dated is not null;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can tell which records from `Site`, `Visited`, and `Survey` correspond with each other because those tables contain primary keys and foreign keys. \n", "\n", "+ **primary key** \n", ": a value, or combination of values, that uniquely identifies each record in a table. \n", "+ **foreign key** \n", ": a value (or combination of values) from one table that identifies a unique record in another table. \n", "\n", "In other words, a foreign key is the primary key of one table that appears in some other table. In our database, `Person.ident` is the primary key in the `Person` table, while `Survey.person` is a foreign key relating the `Survey` table's entries to entries in `Person`.\n", "\n", "Most database designers believe that every table should have a well-defined primary key. They also believe that this key should be separate from the data itself, so that if we ever need to change the data, we only need to make one change in one place. One easy way to do this is to create an arbitrary, unique ID for each record as we add it to the database. This is actually very common: those IDs have names like \"student numbers\" and \"patient numbers\", and they almost always turn out to have originally been a unique record identifier in some database system or other. As the query below demonstrates, SQLite automatically numbers records as they're added to tables, and we can use those record numbers in queries:" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rowididentpersonalfamily
1dyerWilliamDyer
2pbFrankPabodie
3lakeAndersonLake
4roeValentinaRoerich
5danforthFrankDanforth
" ], "text/plain": [ "[(1, u'dyer', u'William', u'Dyer'),\n", " (2, u'pb', u'Frank', u'Pabodie'),\n", " (3, u'lake', u'Anderson', u'Lake'),\n", " (4, u'roe', u'Valentina', u'Roerich'),\n", " (5, u'danforth', u'Frank', u'Danforth')]" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select rowid, * from Person;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Hygiene" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have seen how joins work, we can see why the relational model is so useful and how best to use it. \n", "\n", "The first rule is that every value should be **atomic**, i.e., not contain parts that we might want to work with separately. We store personal and family names in separate columns instead of putting the entire name in one column so that we don't have to use substring operations to get the name's components. More importantly, we store the two parts of the name separately because splitting on spaces is unreliable: just think of a name like \"Eloise St. Cyr\" or \"Jan Mikkel Steubart\".\n", "\n", "The second rule is that every record should have a **unique primary key**. This can be a serial number that has no intrinsic meaning, one of the values in the record (like the `ident` field in the `Person` table), or even a combination of values: the triple `(taken, person, quant)` from the `Survey` table uniquely identifies every measurement.\n", "\n", "The third rule is that there should be **no redundant information**. For example, we could get rid of the `Site` table and rewrite the `Visited` table like this:\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
619 -49.85 -128.57 1927-02-08
622 -49.85 -128.57 1927-02-10
734 -47.15 -126.72 1939-01-07
735 -47.15 -126.72 1930-01-12
751 -47.15 -126.72 1930-02-26
752 -47.15 -126.72 null
837 -48.87 -123.40 1932-01-14
844 -49.85 -128.57 1932-03-22
\n", "\n", "In fact, we could use a single table that recorded all the information about each reading in each row, just as a spreadsheet would. The problem is that it's very hard to keep data organized this way consistent: if we realize that the date of a particular visit to a particular site is wrong, we have to change multiple records in the database. What's worse, we may have to guess which records to change, since other sites may also have been visited on that date.\n", "\n", "The fourth rule is that the **units** for every value should be stored explicitly. Our database doesn't do this, and that's a problem: Roerich's salinity measurements are several orders of magnitude larger than anyone else's, but we don't know if that means she was using parts per million instead of parts per thousand, or whether there actually was a saline anomaly at that site in 1932.\n", "\n", "Stepping back, data and the tools used to store it have a symbiotic relationship: we use tables and joins because it's efficient, provided our data is organized a certain way, but organize our data that way because we have tools to manipulate it efficiently if it's in a certain form. As anthropologists say, *the tool shapes the hand that shapes the tool*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example\n", "\n", "Write a query that lists all radiation readings from the DR-1 site." ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating and Modifying Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following statements create the four tables in our survey database:\n", "\n", "~~~sql\n", "create table Person(ident text, personal text, family text);\n", "create table Site(name text, lat real, long real);\n", "create table Visited(ident integer, site text, dated text);\n", "create table Survey(taken integer, person text, quant real, reading real);\n", "~~~\n", "\n", "We can get rid of one of our tables using:\n", "\n", "~~~sql\n", "drop table Survey;\n", "~~~\n", "\n", "Be very careful when doing this: most databases have some support for undoing changes, but it's better not to have to rely on it.\n", " \n", "Different database systems support different data types for table columns, but most provide the following:\n", "\n", "\n", " \n", " \n", " \n", " \n", "
integer a signed integer
real a floating point number
text a character string
blob a \"binary large object\", such as an image
\n", "\n", "Most databases also support Booleans and date/time values; SQLite uses the integers 0 and 1 for the former, and there are several ways to represent dates, as discussed earlier. An increasing number of databases also support geographic data types, such as latitude and longitude. Keeping track of what particular systems do or do not offer, and what names they give different data types, is an unending portability headache.\n", " \n", "When we create a table, we can specify several kinds of constraints on its columns. For example, a better definition for the `Survey` table would be:\n", "\n", "~~~sql\n", "create table Survey(\n", " taken integer not null, -- where reading taken\n", " person text, -- may not know who took it\n", " quant real not null, -- the quantity measured\n", " reading real not null, -- the actual reading\n", " primary key(taken, quant),\n", " foreign key(taken) references Visited(ident),\n", " foreign key(person) references Person(ident)\n", ");\n", "~~~\n", "\n", "Once again, exactly what constraints are avialable and what they're called depends on which database manager we are using." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding, Removing, and Updating Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once tables have been created,\n", "we can add, change, and remove records using our other set of commands,\n", "`insert`, `update`, and `delete`.\n", "\n", "The simplest form of `insert` statement lists values in order:\n", "\n", "~~~sql\n", "insert into Site values('DR-1', -49.85, -128.57);\n", "insert into Site values('DR-3', -47.15, -126.72);\n", "insert into Site values('MSK-4', -48.87, -123.40);\n", "~~~\n", "\n", "We can also insert values into one table directly from another:" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(OperationalError) table JustLatLong already exists u'create table JustLatLong(name text, lat text, long text);' ()\n" ] } ], "source": [ "%%sql\n", "create table JustLatLong(name text, lat text, long text);\n", "insert into JustLatLong select name, lat, long from site;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Modifying existing records is done using the `update` statement. To do this we tell the database which table we want to update, what we want to change the values to for any or all of the fields, and under what conditions we should update the values.\n", "\n", "For example, if we made a mistake when entering the lat and long values of the last `insert` statement above:" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 rows affected.\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "update JustLatLong set lat=-47.87, long=-122.40 where name='MSK-4'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Be care to not forget the `where` clause or the update statement will modify *all* of the records in the database!\n", "\n", "Deleting records can be a bit trickier, because we have to ensure that the database remains internally consistent. If all we care about is a single table, we can use the `delete` command with a `where` clause that matches the records we want to discard. For example, once we realize that Frank Danforth didn't take any measurements, we can remove him from the `Person` table like this:\n", "\n", "~~~sql\n", "delete from Person where ident = \"danforth\";\n", "~~~\n", "\n", "But what if we removed Anderson Lake instead? Our `Survey` table would still contain seven records of measurements he'd taken, but that's never supposed to happen: `Survey.person` is a foreign key into the `Person` table, and all our queries assume there will be a row in the latter matching every value in the former.\n", " \n", "This problem is called **referential integrity**; we need to ensure that all references between tables can always be resolved correctly. One way to do this is to delete all the records that use `'lake'` as a foreign key before deleting the record that uses it as a primary key.\n", "\n", "> Many applications use a hybrid storage model\n", "> instead of putting everything into a database:\n", "> the actual data (such as images) are stored in files,\n", "> while the database stores the files' names,\n", "> their modification dates,\n", "> the region of the sky they cover,\n", "> their spectral characteristics,\n", "> and so on.\n", "> This is also how most music player software is built:\n", "> the database inside the application keeps track of the MP3 files,\n", "> but the files themselves live on disk." ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "drop table JustLatLong;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "One of our colleagues has sent us a CSV file containing temperature readings by Robert Olmstead, which is formatted like this:\n", "\n", "~~~\n", "Taken,Temp\n", "619,-21.5\n", "622,-15.5\n", "~~~\n", "\n", "Write some Python code that reads this in and prints out the SQL `insert` statements needed\n", "to add these records to the survey database.\n", "\n", "(Note: you will need to add an entry for Olmstead to the `Person` table.\n", "If you are testing your program repeatedly, you may want to investigate SQL's `insert or replace` command.)\n" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Programming with Databases" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's return to the `sqlite3` module. Here's a short Python program that selects latitudes and longitudes\n", "from `survey.db`:" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(-49.85, -128.57)\n", "(-47.15, -126.72)\n", "(-48.87, -123.4)\n" ] } ], "source": [ "connection = sqlite3.connect(\"../data/survey.db\")\n", "cursor = connection.cursor()\n", "cursor.execute(\"select site.lat, site.long from site;\")\n", "results = cursor.fetchall()\n", "for r in results:\n", " print(r)\n", "cursor.close()\n", "connection.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Queries in real applications will often depend on values provided by users.\n", "For example,\n", "this function takes a user's ID as a parameter and returns their name:" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('full name for dyer:', u'William Dyer')\n" ] } ], "source": [ "def get_name(database_file, person_ident):\n", " query = \"select personal || ' ' || family from Person where ident='\" + person_ident + \"';\"\n", "\n", " connection = sqlite3.connect(database_file)\n", " cursor = connection.cursor()\n", " cursor.execute(query)\n", " results = cursor.fetchall()\n", " cursor.close()\n", " connection.close()\n", "\n", " return results[0][0]\n", "\n", "print(\"full name for dyer:\", get_name('../data/survey.db', 'dyer'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use string concatenation on the first line of this function to construct a query containing the user ID we have been given. This seems simple enough, but what happens if someone gives us this string as input?\n", "\n", "~~~sql\n", "dyer'; drop table Survey; select '\n", "~~~\n", "\n", "It looks like there's garbage after the name of the project, but it is very carefully chosen garbage. If we insert this string into our query, the result is:\n", "\n", "~~~sql\n", "select personal || ' ' || family from Person where ident='dyer'; drop tale Survey; select '';\n", "~~~\n", "\n", "If we execute this, it will erase one of the tables in our database.\n", " \n", "This is called an **SQL injection attack**, and it has been used to attack thousands of programs over the years. In particular, many web sites that take data from users insert values directly into queries without checking them carefully first.\n", " \n", "Since a villain might try to smuggle commands into our queries in many different ways, the safest way to deal with this threat is to replace characters like quotes with their escaped equivalents, so that we can safely put whatever the user gives us inside a string. We can do this by using a **prepared statement** instead of formatting our statements as strings. Here's what our example program looks like if we do this:" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('full name for dyer:', u'William Dyer')\n" ] } ], "source": [ "def get_name(database_file, person_ident):\n", " query = \"select personal || ' ' || family from Person where ident=?;\"\n", "\n", " connection = sqlite3.connect(database_file)\n", " cursor = connection.cursor()\n", " cursor.execute(query, [person_ident])\n", " results = cursor.fetchall()\n", " cursor.close()\n", " connection.close()\n", "\n", " return results[0][0]\n", "\n", "print(\"full name for dyer:\", get_name('../data/survey.db', 'dyer'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The key changes are in the query string and the `execute` call.\n", "Instead of formatting the query ourselves,\n", "we put question marks in the query template where we want to insert values.\n", "When we call `execute`,\n", "we provide a list\n", "that contains as many values as there are question marks in the query.\n", "The library matches values to question marks in order,\n", "and translates any special characters in the values\n", "into their escaped equivalents\n", "so that they are safe to use." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SQLite in PyMC\n", "\n", "PyMC uses SQLite as an optional backend for storing MCMC traces. This is a good idea when you are collecting a lot of simulation output, or you want your output to be permanent and portable. PyMC stores traces in memory by default, meaning that they disappear when the current session ends.\n", "\n", "The SQLite support resides in a subclass of PyMC's `Database` class that provides a common interface for all backend types. Instantiating the SQLite `Database` object intitializes (or opens) a database, and provides methods for committing and closing.\n", "\n", "```python\n", "class Database(base.Database):\n", "\n", " \"\"\"SQLite database.\n", " \"\"\"\n", "\n", " def __init__(self, dbname, dbmode='a'):\n", " \"\"\"Open or create an SQL database.\n", "\n", " :Parameters:\n", " dbname : string\n", " The name of the database file.\n", " dbmode : {'a', 'w'}\n", " File mode. Use `a` to append values, and `w` to overwrite\n", " an existing file.\n", " \"\"\"\n", " self.__name__ = 'sqlite'\n", " self.dbname = dbname\n", " self.__Trace__ = Trace\n", "\n", " self.trace_names = []\n", " # A list of sequences of names of the objects to tally.\n", " self._traces = {} # A dictionary of the Trace objects.\n", "\n", " if os.path.exists(dbname) and dbmode == 'w':\n", " os.remove(dbname)\n", "\n", " self.DB = sqlite3.connect(dbname, check_same_thread=False)\n", " self.cur = self.DB.cursor()\n", "\n", " existing_tables = get_table_list(self.cur)\n", " if existing_tables:\n", " # Get number of existing chains\n", " self.cur.execute(\n", " 'SELECT MAX(trace) FROM [%s]' %\n", " existing_tables[0])\n", " self.chains = self.cur.fetchall()[0][0] + 1\n", " self.trace_names = self.chains * [existing_tables, ]\n", " else:\n", " self.chains = 0\n", "\n", " def commit(self):\n", " \"\"\"Commit updates to database\"\"\"\n", " self.DB.commit()\n", "\n", " def close(self, *args, **kwds):\n", " \"\"\"Close database.\"\"\"\n", " self.cur.close()\n", " self.commit()\n", " self.DB.close()\n", "```\n", "\n", "The second class is the `Trace`, which represents a single MCMC trace, and interacts with the `Database` to execute queries.\n", "\n", "```python\n", "class Trace(base.Trace):\n", "\n", " \"\"\"SQLite Trace class.\"\"\"\n", "\n", " def _initialize(self, chain, length):\n", " \"\"\"Create an SQL table.\n", " \"\"\"\n", "\n", " if self._getfunc is None:\n", " self._getfunc = self.db.model._funs_to_tally[self.name]\n", "\n", " # Determine size\n", " try:\n", " self._shape = np.shape(self._getfunc())\n", " except TypeError:\n", " self._shape = None\n", "\n", " self._vstr = ', '.join(var_str(self._shape))\n", "\n", " # If the table already exists, exit now.\n", " if chain != 0:\n", " return\n", "\n", " # Create the variable name strings.\n", " vstr = ', '.join(v + ' FLOAT' for v in var_str(self._shape))\n", " query = \"\"\"CREATE TABLE IF NOT EXISTS [%s]\n", " (recid INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,\n", " trace int(5), %s)\"\"\" % (self.name, vstr)\n", " self.db.cur.execute(query)\n", "\n", " def tally(self, chain):\n", " \"\"\"Adds current value to trace.\"\"\"\n", "\n", " try:\n", " valstring = ', '.join(\n", " ['%f' %\n", " x for x in np.ravel(self._getfunc())])\n", " except:\n", " valstring = str(self._getfunc())\n", "\n", " # Add value to database\n", " query = \"INSERT INTO [%s] (recid, trace, %s) values (NULL, %s, %s)\" % \\\n", " (self.name, self._vstr, chain, valstring)\n", " self.db.cur.execute(query)\n", "\n", " def gettrace(self, burn=0, thin=1, chain=-1, slicing=None):\n", " \"\"\"Return the trace (last by default).\n", "\n", " Input:\n", " - burn (int): The number of transient steps to skip.\n", " - thin (int): Keep one in thin.\n", " - chain (int): The index of the chain to fetch. If None, return all\n", " chains. By default, the last chain is returned.\n", " - slicing: A slice, overriding burn and thin assignement.\n", " \"\"\"\n", " if not slicing:\n", " slicing = slice(burn, None, thin)\n", "\n", " # If chain is None, get the data from all chains.\n", " if chain is None:\n", " self.db.cur.execute('SELECT * FROM [%s]' % self.name)\n", " trace = self.db.cur.fetchall()\n", " else:\n", " # Deal with negative chains (starting from the end)\n", " if chain < 0:\n", " chain = range(self.db.chains)[chain]\n", " self.db.cur.execute(\n", " 'SELECT * FROM [%s] WHERE trace=%s' %\n", " (self.name, chain))\n", " trace = self.db.cur.fetchall()\n", " trace = np.array(trace)[:, 2:]\n", " if len(self._shape) > 1:\n", " trace = trace.reshape(-1, *self._shape)\n", " return squeeze(trace[slicing])\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise\n", "\n", "Write SQL code to import the tables from the course project data into an SQLite database." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "Kreibich JA. [Using SQLite](http://shop.oreilly.com/product/9780596521196.do). O'Reilly Media. 2010:1–528.\n", "\n", "[Using Databases and SQL](http://software-carpentry.org/v5/novice/sql/index.html). Software Carpentry.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.core.display import HTML\n", "def css_styling():\n", " styles = open(\"styles/custom.css\", \"r\").read()\n", " return HTML(styles)\n", "css_styling()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.2" } }, "nbformat": 4, "nbformat_minor": 0 }