{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import sqlite3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 9.1 Reading data from SQL databases" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far we've only talked about reading data from CSV files. That's a pretty common way to store data, but there are many others! Pandas can read from HTML, JSON, SQL, Excel (!!!), HDF5, Stata, and a few other things. In this chapter we'll talk about reading data from SQL databases.\n", "\n", "You can read data from a SQL database using the `pd.read_sql` function. `read_sql` will automatically convert SQL column names to DataFrame column names.\n", "\n", "`read_sql` takes 2 arguments: a `SELECT` statement, and a database connection object. This is great because it means you can read from *any* kind of SQL database -- it doesn't matter if it's MySQL, SQLite, PostgreSQL, or something else.\n", "\n", "This example reads from a SQLite database, but any other database would work the same way." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddate_timetemp
0 1 2012-01-01 00:00:00-1.8
1 2 2012-01-01 01:00:00-1.8
2 3 2012-01-01 02:00:00-1.8
\n", "
" ], "text/plain": [ " id date_time temp\n", "0 1 2012-01-01 00:00:00 -1.8\n", "1 2 2012-01-01 01:00:00 -1.8\n", "2 3 2012-01-01 02:00:00 -1.8" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "con = sqlite3.connect(\"../data/weather_2012.sqlite\")\n", "df = pd.read_sql(\"SELECT * from weather_2012 LIMIT 3\", con)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`read_sql` doesn't automatically set the primary key (`id`) to be the index of the dataframe. You can make it do that by adding an `index_col` argument to `read_sql`. \n", "\n", "If you've used `read_csv` a lot, you may have seen that it has an `index_col` argument as well. This one behaves the same way." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_timetemp
id
1 2012-01-01 00:00:00-1.8
2 2012-01-01 01:00:00-1.8
3 2012-01-01 02:00:00-1.8
\n", "
" ], "text/plain": [ " date_time temp\n", "id \n", "1 2012-01-01 00:00:00 -1.8\n", "2 2012-01-01 01:00:00 -1.8\n", "3 2012-01-01 02:00:00 -1.8" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_sql(\"SELECT * from weather_2012 LIMIT 3\", con, index_col='id')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want your dataframe to be indexed by more than one column, you can give a list of columns to `index_col`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
temp
iddate_time
12012-01-01 00:00:00-1.8
22012-01-01 01:00:00-1.8
32012-01-01 02:00:00-1.8
\n", "
" ], "text/plain": [ " temp\n", "id date_time \n", "1 2012-01-01 00:00:00 -1.8\n", "2 2012-01-01 01:00:00 -1.8\n", "3 2012-01-01 02:00:00 -1.8" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_sql(\"SELECT * from weather_2012 LIMIT 3\", con, \n", " index_col=['id', 'date_time'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 9.2 Writing to a SQLite database" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas has a `write_frame` function which creates a database table from a dataframe. Right now this only works for SQLite databases. Let's use it to move our 2012 weather data into SQL.\n", "\n", "You'll notice that this function is in `pd.io.sql`. There are a ton of useful functions for reading and writing various kind of data in `pd.io`, and it's worth spending some time exploring them. ([see the documentation!](http://pandas.pydata.org/pandas-docs/stable/io.html))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/lib64/python2.7/site-packages/pandas/io/sql.py:1309: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.\n", " warnings.warn(_SAFE_NAMES_WARNING)\n" ] } ], "source": [ "weather_df = pd.read_csv('../data/weather_2012.csv')\n", "con = sqlite3.connect(\"../data/test_db.sqlite\")\n", "con.execute(\"DROP TABLE IF EXISTS weather_2012\")\n", "weather_df.to_sql(\"weather_2012\", con)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now read from the `weather_2012` table in `test_db.sqlite`, and we see that we get the same data back:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
indexDate/TimeTemp (C)Dew Point Temp (C)Rel Hum (%)Wind Spd (km/h)Visibility (km)Stn Press (kPa)Weather
0 0 2012-01-01 00:00:00-1.8-3.9 86 4 8 101.24 Fog
1 1 2012-01-01 01:00:00-1.8-3.7 87 4 8 101.24 Fog
2 2 2012-01-01 02:00:00-1.8-3.4 89 7 4 101.26 Freezing Drizzle,Fog
\n", "
" ], "text/plain": [ " index Date/Time Temp (C) Dew Point Temp (C) Rel Hum (%) \\\n", "0 0 2012-01-01 00:00:00 -1.8 -3.9 86 \n", "1 1 2012-01-01 01:00:00 -1.8 -3.7 87 \n", "2 2 2012-01-01 02:00:00 -1.8 -3.4 89 \n", "\n", " Wind Spd (km/h) Visibility (km) Stn Press (kPa) Weather \n", "0 4 8 101.24 Fog \n", "1 4 8 101.24 Fog \n", "2 7 4 101.26 Freezing Drizzle,Fog " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "con = sqlite3.connect(\"../data/test_db.sqlite\")\n", "df = pd.read_sql(\"SELECT * from weather_2012 LIMIT 3\", con)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The nice thing about having your data in a database is that you can do arbitrary SQL queries. This is cool especially if you're more familiar with SQL. Here's an example of sorting by the Weather column:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
indexDate/TimeTemp (C)Dew Point Temp (C)Rel Hum (%)Wind Spd (km/h)Visibility (km)Stn Press (kPa)Weather
0 67 2012-01-03 19:00:00-16.9-24.8 50 24 25 101.74 Clear
1 114 2012-01-05 18:00:00 -7.1-14.4 56 11 25 100.71 Clear
2 115 2012-01-05 19:00:00 -9.2-15.4 61 7 25 100.80 Clear
\n", "
" ], "text/plain": [ " index Date/Time Temp (C) Dew Point Temp (C) Rel Hum (%) \\\n", "0 67 2012-01-03 19:00:00 -16.9 -24.8 50 \n", "1 114 2012-01-05 18:00:00 -7.1 -14.4 56 \n", "2 115 2012-01-05 19:00:00 -9.2 -15.4 61 \n", "\n", " Wind Spd (km/h) Visibility (km) Stn Press (kPa) Weather \n", "0 24 25 101.74 Clear \n", "1 11 25 100.71 Clear \n", "2 7 25 100.80 Clear " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "con = sqlite3.connect(\"../data/test_db.sqlite\")\n", "df = pd.read_sql(\"SELECT * from weather_2012 ORDER BY Weather LIMIT 3\", con)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have a PostgreSQL database or MySQL database, reading from it works exactly the same way as reading from a SQLite database. You create a connection using `psycopg2.connect()` or `MySQLdb.connect()`, and then use\n", "\n", "`pd.read_sql(\"SELECT whatever from your_table\", con)`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 9.3 Connecting to other kinds of database" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To connect to a MySQL database:\n", "\n", "*Note: For these to work, you will need a working MySQL / PostgreSQL database, with the correct localhost, database name, etc.*" ] }, { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [ "import MySQLdb\n", "con = MySQLdb.connect(host=\"localhost\", db=\"test\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To connect to a PostgreSQL database:" ] }, { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [ "import psycopg2\n", "con = psycopg2.connect(host=\"localhost\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "