{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import sqlite3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 9.1 Reading data from SQL databases"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So far we've only talked about reading data from CSV files. That's a pretty common way to store data, but there are many others! Pandas can read from HTML, JSON, SQL, Excel (!!!), HDF5, Stata, and a few other things. In this chapter we'll talk about reading data from SQL databases.\n",
"\n",
"You can read data from a SQL database using the `pd.read_sql` function. `read_sql` automatically uses the column names from your query as the dataframe's column names.\n",
"\n",
"`read_sql` takes 2 arguments: a `SELECT` statement, and a database connection object. This is great because it means you can read from *any* kind of SQL database -- it doesn't matter if it's MySQL, SQLite, PostgreSQL, or something else.\n",
"\n",
"This example reads from a SQLite database, but any other database would work the same way."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" id date_time temp\n",
"0 1 2012-01-01 00:00:00 -1.8\n",
"1 2 2012-01-01 01:00:00 -1.8\n",
"2 3 2012-01-01 02:00:00 -1.8"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"con = sqlite3.connect(\"../data/weather_2012.sqlite\")\n",
"df = pd.read_sql(\"SELECT * from weather_2012 LIMIT 3\", con)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`read_sql` doesn't automatically set the primary key (`id`) to be the index of the dataframe. You can make it do that by adding an `index_col` argument to `read_sql`. \n",
"\n",
"If you've used `read_csv` a lot, you may have seen that it has an `index_col` argument as well. This one behaves the same way."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" date_time temp\n",
"id \n",
"1 2012-01-01 00:00:00 -1.8\n",
"2 2012-01-01 01:00:00 -1.8\n",
"3 2012-01-01 02:00:00 -1.8"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_sql(\"SELECT * from weather_2012 LIMIT 3\", con, index_col='id')\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want your dataframe to be indexed by more than one column, you can give a list of columns to `index_col`:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" temp\n",
"id date_time \n",
"1 2012-01-01 00:00:00 -1.8\n",
"2 2012-01-01 01:00:00 -1.8\n",
"3 2012-01-01 02:00:00 -1.8"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_sql(\"SELECT * from weather_2012 LIMIT 3\", con, \n",
" index_col=['id', 'date_time'])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 9.2 Writing to a SQLite database"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dataframes have a `to_sql` method which creates a database table from a dataframe. With a plain `sqlite3` connection like the one below, this only works for SQLite databases; for other databases you'd pass an SQLAlchemy engine instead. Let's use it to move our 2012 weather data into SQL.\n",
"\n",
"The machinery behind this lives in `pd.io.sql`. There are a ton of useful functions for reading and writing various kinds of data in `pd.io`, and it's worth spending some time exploring them. ([see the documentation!](http://pandas.pydata.org/pandas-docs/stable/io.html))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/lib64/python2.7/site-packages/pandas/io/sql.py:1309: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.\n",
" warnings.warn(_SAFE_NAMES_WARNING)\n"
]
}
],
"source": [
"weather_df = pd.read_csv('../data/weather_2012.csv')\n",
"con = sqlite3.connect(\"../data/test_db.sqlite\")\n",
"con.execute(\"DROP TABLE IF EXISTS weather_2012\")\n",
"weather_df.to_sql(\"weather_2012\", con)"
]
},
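{
"cell_type": "markdown",
"metadata": {},
"source": [
"By the way: instead of dropping the table by hand first, you can pass an `if_exists` argument to `to_sql` -- `if_exists=\"replace\"` drops and recreates the table for you. (A quick sketch; check your pandas version's documentation for the exact options.)"
]
},
{
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": [
"weather_df.to_sql(\"weather_2012\", con, if_exists=\"replace\")"
]
},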
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now read from the `weather_2012` table in `test_db.sqlite`, and we see that we get the same data back:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" index Date/Time Temp (C) Dew Point Temp (C) Rel Hum (%) \\\n",
"0 0 2012-01-01 00:00:00 -1.8 -3.9 86 \n",
"1 1 2012-01-01 01:00:00 -1.8 -3.7 87 \n",
"2 2 2012-01-01 02:00:00 -1.8 -3.4 89 \n",
"\n",
" Wind Spd (km/h) Visibility (km) Stn Press (kPa) Weather \n",
"0 4 8 101.24 Fog \n",
"1 4 8 101.24 Fog \n",
"2 7 4 101.26 Freezing Drizzle,Fog "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"con = sqlite3.connect(\"../data/test_db.sqlite\")\n",
"df = pd.read_sql(\"SELECT * from weather_2012 LIMIT 3\", con)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The nice thing about having your data in a database is that you can do arbitrary SQL queries. This is especially cool if you're more familiar with SQL than with pandas. Here's an example of sorting by the `Weather` column:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" index Date/Time Temp (C) Dew Point Temp (C) Rel Hum (%) \\\n",
"0 67 2012-01-03 19:00:00 -16.9 -24.8 50 \n",
"1 114 2012-01-05 18:00:00 -7.1 -14.4 56 \n",
"2 115 2012-01-05 19:00:00 -9.2 -15.4 61 \n",
"\n",
" Wind Spd (km/h) Visibility (km) Stn Press (kPa) Weather \n",
"0 24 25 101.74 Clear \n",
"1 11 25 100.71 Clear \n",
"2 7 25 100.80 Clear "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"con = sqlite3.connect(\"../data/test_db.sqlite\")\n",
"df = pd.read_sql(\"SELECT * from weather_2012 ORDER BY Weather LIMIT 3\", con)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you have a PostgreSQL database or MySQL database, reading from it works exactly the same way as reading from a SQLite database. You create a connection using `psycopg2.connect()` or `MySQLdb.connect()`, and then use\n",
"\n",
"`pd.read_sql(\"SELECT whatever from your_table\", con)`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 9.3 Connecting to other kinds of databases"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To connect to a MySQL database:\n",
"\n",
"*Note: For these to work, you will need a running MySQL / PostgreSQL database, with the correct host, database name, etc.*"
]
},
{
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": [
"import MySQLdb\n",
"con = MySQLdb.connect(host=\"localhost\", db=\"test\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To connect to a PostgreSQL database:"
]
},
{
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": [
"import psycopg2\n",
"con = psycopg2.connect(host=\"localhost\")"
]
},
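{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you've got a connection, reading works exactly like it did with SQLite. Here's a sketch -- the database and table names are made up, so substitute your own:"
]
},
{
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": [
"import psycopg2\n",
"con = psycopg2.connect(host=\"localhost\", dbname=\"test\")\n",
"df = pd.read_sql(\"SELECT * from your_table LIMIT 3\", con)"
]
},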
{
"cell_type": "markdown",
"metadata": {},
"source": [
"