{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Extract The Data\n", "\n", "We've learned how to create a database, insert data and then update or delete it.\n", "\n", "We've also seen how to view the content of a table using pandas.\n", "\n", "But we've only seen how to display the entire contents of a single table. What if we only want a subset of the records? What if we want to extract a set based on records in more than one table?\n", "\n", "It's time to learn about the SELECT statement.\n", "\n", "First, let's connect to our database, empty both tables and create records for three flights.\n", "\n", "We'll need to enable foreign keys again as this setting applies to the connection:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namecountry_codelatitudelongitude
0hab1GB51.507220.00000
1hab2GB53.46667-2.23333
2hab3US40.76480-73.98080
\n", "
" ], "text/plain": [ " name country_code latitude longitude\n", "0 hab1 GB 51.50722 0.00000\n", "1 hab2 GB 53.46667 -2.23333\n", "2 hab3 US 40.76480 -73.98080" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sqlalchemy as sa\n", "import pandas as pd\n", "engine = sa.create_engine('sqlite:///flight.db')\n", "connection = engine.connect()\n", "connection.execute(\"PRAGMA foreign_keys=on\")\n", "connection.execute(\"DELETE FROM flights\")\n", "\n", "sql = \"\"\"\n", " INSERT INTO flights(name, country_code, latitude, longitude)\n", " VALUES\n", " ('hab1', 'GB', 51.50722, 0),\n", " ('hab2', 'GB', 53.46667, -2.23333),\n", " ('hab3', 'US', 40.7648, -73.9808)\n", "\"\"\"\n", "connection.execute(sql)\n", "pd.read_sql('flights', connection)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then generate 20 randomised readings for each flight:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from collections import namedtuple\n", "from random import uniform, randint\n", "\n", "Reading = namedtuple('Reading', 'flight, ts, temp, pressure, humidity')\n", "\n", "readings = [\n", " Reading(\n", " flight=flight,\n", " ts=f'2015-01-01 09:{str(i+1).zfill(2)}:00',\n", " temp=round(uniform(23, 27), 1),\n", " pressure=randint(1020, 1025),\n", " humidity=randint(30, 50))\n", " for flight in ['hab1', 'hab2', 'hab3']\n", " for i in range(20)\n", "]\n", "\n", "sql = \"\"\"\n", " INSERT INTO readings\n", " (flight, ts, temp, pressure, humidity)\n", " VALUES\n", " (?, ?, ?, ?, ?)\n", "\"\"\"\n", "\n", "for reading in readings:\n", " values = (reading.flight, reading.ts, reading.temp, reading.pressure, reading.humidity)\n", " connection.execute(sql, values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's extract and display the readings for flight hab2.\n", "\n", "We've seen previously that we can pass a table name to `read_sql`. Now, we'll see that we can also pass a SELECT statement:\n", "\n", "```sql\n", "SELECT flight, ts, temp, pressure, humidity\n", "FROM readings\n", "WHERE flight = 'hab2'\n", "```\n", "\n", "The SELECT clause defines the columns we want to see.\n", "\n", "The FROM clause defines the table(s) we want to query.\n", "\n", "We've seen the WHERE clause used previously in INSERT, UPDATE and DELETE statements. It applies a filter for us to define the records we want returned.\n", "\n", "Let's run that query:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
flighttstemppressurehumidity
0hab22015-01-01 09:01:0025.9102340
1hab22015-01-01 09:02:0025.4102540
2hab22015-01-01 09:03:0024.2102549
3hab22015-01-01 09:04:0024.8102448
4hab22015-01-01 09:05:0025.4102531
5hab22015-01-01 09:06:0024.8102342
6hab22015-01-01 09:07:0024.1102248
7hab22015-01-01 09:08:0025.3102442
8hab22015-01-01 09:09:0025.3102534
9hab22015-01-01 09:10:0024.2102133
10hab22015-01-01 09:11:0025.4102244
11hab22015-01-01 09:12:0023.5102234
12hab22015-01-01 09:13:0024.5102037
13hab22015-01-01 09:14:0026.0102233
14hab22015-01-01 09:15:0025.4102341
15hab22015-01-01 09:16:0023.2102539
16hab22015-01-01 09:17:0024.1102036
17hab22015-01-01 09:18:0026.8102349
18hab22015-01-01 09:19:0023.8102338
19hab22015-01-01 09:20:0025.1102244
\n", "
" ], "text/plain": [ " flight ts temp pressure humidity\n", "0 hab2 2015-01-01 09:01:00 25.9 1023 40\n", "1 hab2 2015-01-01 09:02:00 25.4 1025 40\n", "2 hab2 2015-01-01 09:03:00 24.2 1025 49\n", "3 hab2 2015-01-01 09:04:00 24.8 1024 48\n", "4 hab2 2015-01-01 09:05:00 25.4 1025 31\n", "5 hab2 2015-01-01 09:06:00 24.8 1023 42\n", "6 hab2 2015-01-01 09:07:00 24.1 1022 48\n", "7 hab2 2015-01-01 09:08:00 25.3 1024 42\n", "8 hab2 2015-01-01 09:09:00 25.3 1025 34\n", "9 hab2 2015-01-01 09:10:00 24.2 1021 33\n", "10 hab2 2015-01-01 09:11:00 25.4 1022 44\n", "11 hab2 2015-01-01 09:12:00 23.5 1022 34\n", "12 hab2 2015-01-01 09:13:00 24.5 1020 37\n", "13 hab2 2015-01-01 09:14:00 26.0 1022 33\n", "14 hab2 2015-01-01 09:15:00 25.4 1023 41\n", "15 hab2 2015-01-01 09:16:00 23.2 1025 39\n", "16 hab2 2015-01-01 09:17:00 24.1 1020 36\n", "17 hab2 2015-01-01 09:18:00 26.8 1023 49\n", "18 hab2 2015-01-01 09:19:00 23.8 1023 38\n", "19 hab2 2015-01-01 09:20:00 25.1 1022 44" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sql = \"\"\"\n", " SELECT flight, ts, temp, pressure, humidity\n", " FROM readings\n", " WHERE flight = 'hab2'\n", "\"\"\"\n", "pd.read_sql(sql, connection)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try another example where we might be interested only in the temperature of readings taken before 09:15:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
temp
025.9
125.4
224.2
324.8
425.4
524.8
624.1
725.3
825.3
924.2
1025.4
1123.5
1224.5
1326.0
\n", "
" ], "text/plain": [ " temp\n", "0 25.9\n", "1 25.4\n", "2 24.2\n", "3 24.8\n", "4 25.4\n", "5 24.8\n", "6 24.1\n", "7 25.3\n", "8 25.3\n", "9 24.2\n", "10 25.4\n", "11 23.5\n", "12 24.5\n", "13 26.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sql = \"\"\"\n", " SELECT temp\n", " FROM readings\n", " WHERE\n", " flight = 'hab2'\n", " AND ts < '2015-01-01 09:15:00'\n", "\"\"\"\n", "pd.read_sql(sql, connection)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But what if we wanted to see the first few readings for any flights that took place in the UK?\n", "\n", "We'll use a JOIN clause to join the readings and flights tables together\n", "\n", "```sql\n", "SELECT readings.*\n", "FROM readings JOIN flights\n", " ON readings.flight = flights.name\n", "WHERE \n", " flights.country_code = 'GB'\n", " AND readings.ts < '2015-01-01 09:05:00'\n", "```\n", "\n", "Here, we're telling the database engine that, for any record in the readings table, it can use the value in the 'flight' column to find a record in the flights table by matching on the 'name' column.\n", "\n", "It's common in SQL to use aliases for table names in queries using joins:\n", "\n", "```sql\n", "SELECT r.*\n", "FROM readings AS r JOIN flights AS f\n", " ON r.flight = f.name\n", "WHERE \n", " f.country_code = 'GB'\n", " AND r.ts < '2015-01-01 09:05:00'\n", "```\n", "\n", "We can also include fields from both tables in the select clause:\n", "\n", "```sql\n", "SELECT r.flight, r.ts, r.temp, r.pressure, f.longitude, f.latitude\n", "FROM readings AS r JOIN flights AS f\n", " ON r.flight = f.name\n", "WHERE \n", " f.country_code = 'GB'\n", " AND r.ts < '2015-01-01 09:05:00'\n", "```\n", "\n", "Let's try running that last example:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
flighttstemppressurelongitudelatitude
0hab12015-01-01 09:01:0024.710240.0000051.50722
1hab12015-01-01 09:02:0023.010240.0000051.50722
2hab12015-01-01 09:03:0024.110250.0000051.50722
3hab12015-01-01 09:04:0026.510200.0000051.50722
4hab22015-01-01 09:01:0025.91023-2.2333353.46667
5hab22015-01-01 09:02:0025.41025-2.2333353.46667
6hab22015-01-01 09:03:0024.21025-2.2333353.46667
7hab22015-01-01 09:04:0024.81024-2.2333353.46667
\n", "
" ], "text/plain": [ " flight ts temp pressure longitude latitude\n", "0 hab1 2015-01-01 09:01:00 24.7 1024 0.00000 51.50722\n", "1 hab1 2015-01-01 09:02:00 23.0 1024 0.00000 51.50722\n", "2 hab1 2015-01-01 09:03:00 24.1 1025 0.00000 51.50722\n", "3 hab1 2015-01-01 09:04:00 26.5 1020 0.00000 51.50722\n", "4 hab2 2015-01-01 09:01:00 25.9 1023 -2.23333 53.46667\n", "5 hab2 2015-01-01 09:02:00 25.4 1025 -2.23333 53.46667\n", "6 hab2 2015-01-01 09:03:00 24.2 1025 -2.23333 53.46667\n", "7 hab2 2015-01-01 09:04:00 24.8 1024 -2.23333 53.46667" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sql = \"\"\"\n", " SELECT r.flight, r.ts, r.temp, r.pressure, f.longitude, f.latitude\n", " FROM readings AS r JOIN flights AS f\n", " ON r.flight = f.name\n", " WHERE \n", " f.country_code = 'GB'\n", " AND r.ts < '2015-01-01 09:05:00'\n", "\"\"\"\n", "pd.read_sql(sql, connection)" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:sql_python_tutorial]", "language": "python", "name": "conda-env-sql_python_tutorial-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }