{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#From CSV to SQL: A Journey\n", "\n", "You're sold on SQL as a powerful language for querying data, and on PostgreSQL as a powerful engine for storing and sharing that data. But how do you get data *in* there? That's a fine question. In this tutorial, I'm going to show you the steps necessary to import a CSV into PostgreSQL.\n", "\n", "We're going to use the [MovieLens review data](http://grouplens.org/datasets/movielens/) that you worked with in a previous tutorial. If you don't have the data handy anymore, [download it here]( http://files.grouplens.org/datasets/movielens/ml-100k.zip). For convenience, unzip the file in the same directory as this notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Creating a table\n", "\n", "PostgreSQL has a built in command---`\\copy`---that can import data from a CSV file. However, that command can only import data into an *existing table*. In order to make the command work, we first need to create tables to contain the data.\n", "\n", "Our eventual goal is to create tables for the users, reviews, and movies in the MovieLens data set, then use the `\\copy` command to import the data into those tables. First, though, let's do a smaller example, just to acclimatize ourselves to the basics.\n", "\n", "###Types\n", "\n", "The task of creating a table consists of figuring out which columns we need, and then figuring out the correct data type for each column. The kinds of types supported by SQL (and PostgreSQL in particular) are *different* from the data types that we're used to working with in Python, though there are some obvious analogues between the types supported by both. 
[Here's a list of all the types supported in PostgreSQL](http://www.postgresql.org/docs/9.4/static/datatype.html).\n", "\n", "The data types you're most likely to encounter in SQL are the following:\n", "\n", "| type | description |\n", "| ---- | ----------- |\n", "| `int` | an integer value (exact range varies depending on database) |\n", "| `numeric(precision, scale)` | an exact number with a user-specified number of digits (see below) |\n", "| `varchar(n)` | a string of characters with a given maximum length `n` |\n", "| `text` | a string with unlimited length |\n", "| `timestamp` | holds a date and time (e.g., 2015-08-01T12:34:56) |\n", "| `date` | holds a date only (e.g., 2015-08-01) |\n", "| `boolean` | holds true or false |\n", "\n", "Let's say that we're creating a table to store information about the [widgets](http://static.decontextualize.com/widgets2015.html) from a previous homework assignment. Looking at the data, we can make the following determinations:\n", "\n", "* The table should contain one row per widget.\n", "* The table needs fields for the part number, the widget name, the widget price, and the quantity of widgets on hand.\n", "* The part number and widget names are both strings. The quantity of widgets is an integer. The widget price is a number, but it has a decimal point in it, so we can't just use an integer for that column in the table.\n", "\n", "That's enough information to create the necessary table. The command to create a table is, unsurprisingly enough, `CREATE TABLE`, and the syntax looks like this:\n", "\n", "```\n", "CREATE TABLE table_name (\n", "    field1 datatype1,\n", "    field2 datatype2,\n", "    field3 datatype3...\n", ")\n", "```\n", "\n", "... replacing `table_name` with the desired name of the table, replacing `field1`, `field2`, etc. with the names of the desired fields, and replacing `datatype1`, `datatype2`, etc. 
with the names of the desired data types for the corresponding fields.\n", "\n", "Okay, now we know enough to create a table for our widgets. Let's create a database to hold the table, first of all. Open `psql` and enter the following commands:\n", "\n", "```\n", "CREATE DATABASE widgetdb;\n", "\\c widgetdb\n", "```\n", "\n", "Once you've created the database and connected to it, you can run the create table command for our widget data. Here's what it looks like:\n", "\n", "```\n", "CREATE TABLE widget (\n", "    part_no varchar(20),\n", "    name text,\n", "    price numeric(10, 2),\n", "    quantity int\n", ");\n", "```\n", "\n", "Copy that text and paste it into `psql` and hit Enter. You should see a confirmation message (`CREATE TABLE`); if you don't, check again and make sure that you got the syntax right.\n", "\n", "The `\\d` command should now display the table that you just created:\n", "\n", "```\n", "widgetdb=# \\d\n", "         List of relations\n", " Schema |  Name  | Type  |  Owner  \n", "--------+--------+-------+---------\n", " public | widget | table | allison\n", "(1 row)\n", "\n", "widgetdb=# \\d widget\n", "            Table \"public.widget\"\n", "  Column  |         Type          | Modifiers \n", "----------+-----------------------+-----------\n", " part_no  | character varying(20) | \n", " name     | text                  | \n", " price    | numeric(10,2)         | \n", " quantity | integer               | \n", "```\n", "\n", "#### Numeric? What's that?\n", "\n", "Computers are notoriously imprecise when handling floating-point numbers ([see here](https://en.wikipedia.org/wiki/Floating_point#Accuracy_problems) for more details). This is problematic in many situations, especially when you're using a floating-point number to represent money. (Recall the rounding-error subplots from *Superman III* and *Office Space*.) The `numeric` type sidesteps this problem with a different internal representation of numbers that isn't subject to these errors. 
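\n", "\n", "You can see both the problem and the fix from Python, whose standard-library `decimal` module stores numbers in the same general way `numeric` does (exact decimal digits rather than binary fractions):\n", "\n", "```python\n", "from decimal import Decimal\n", "\n", "# Binary floating point can't represent 0.1 exactly, so sums drift:\n", "print(repr(0.1 + 0.2))   # 0.30000000000000004, not 0.3!\n", "print(0.1 + 0.2 == 0.3)  # False\n", "\n", "# Exact decimal arithmetic, like SQL's numeric type:\n", "print(Decimal('0.10') + Decimal('0.20'))                     # 0.30\n", "print(Decimal('0.10') + Decimal('0.20') == Decimal('0.30'))  # True\n", "```\n", "\n", "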
When specifying a numeric field, you need to give the type two parameters: the total number of digits the field can store (the *precision*), and how many of those digits can fall after the decimal point (the *scale*). (e.g., `numeric(10, 2)` specifies a numeric field that stores ten digits, two of them after the decimal point, so it can hold a number as high as 99999999.99---definitely sufficient for our widget database.)\n", "\n", "### Inserting data\n", "\n", "Ah, a fresh new table with no mistakes in it yet. Or data, for that matter. Now that we have a table, how do we get data *into* it? With the `INSERT` command!\n", "\n", "The `INSERT` command allows you to add data one row at a time to the table. The syntax looks like this:\n", "\n", "```\n", "INSERT INTO table_name (col1, col2, col3, ...) VALUES (val1, val2, val3, ...)\n", "```\n", "\n", "... replacing `table_name` with the name of the table you want to insert values into, `col1` (etc.) with the columns you want to insert data into, and `val1` (etc.) with the corresponding data. For example, here's the first widget listed on the widget page:\n", "\n", "| Part No. | Name | Price | Quantity in warehouse |\n", "| -------- | ---- | ----- | --------------------- |\n", "| C1-9476 | Skinner Widget | $2.70 | 512 |\n", "\n", "To insert the data from this row, issue the following command:\n", "\n", "```\n", "INSERT INTO widget (part_no, name, price, quantity) VALUES ('C1-9476', 'Skinner Widget', 2.70, 512);\n", "```\n", "\n", "You'll get a confirmation message (i.e., `INSERT 0 1`) that tells you that the insert was successful. (If you didn't get that message, check the syntax and try again.)\n", "\n", "Confirm that the insert was successful by issuing a `SELECT` statement:\n", "\n", "```\n", "widgetdb=# SELECT * FROM widget;\n", " part_no |      name      | price | quantity \n", "---------+----------------+-------+----------\n", " C1-9476 | Skinner Widget |  2.70 |      512\n", "(1 row)\n", "```\n", "\n", "Hey perfect! 
It worked.\n", "\n", "## Importing CSVs\n", "\n", "You could continue on in this way, issuing `INSERT` statements for each widget that you want to add to the table. This is kind of tedious, obviously. In real life, you'd probably get the data into your database by *importing* it from another source. One method of doing this would be to programmatically issue `INSERT` statements in Python (say, looping over a list of dictionaries from an API).\n", "\n", "Another method is to use PostgreSQL's `\\copy` command, which takes a CSV file and inserts its contents, row by row, into an existing table.\n", "\n", "But there has to be an *existing* table. So before we can use the `\\copy` command, we need to create the table first. Let's try it out with the `u.user` file from the MovieLens data set.\n", "\n", "Here are the first several lines from that file:\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1|24|M|technician|85711\r\n", "2|53|F|other|94043\r\n", "3|23|M|writer|32067\r\n", "4|24|M|technician|43537\r\n", "5|33|F|other|15213\r\n", "6|42|M|executive|98101\r\n", "7|57|M|administrator|91344\r\n", "8|36|M|administrator|05201\r\n", "9|29|M|student|01002\r\n", "10|53|M|lawyer|90703\r\n" ] } ], "source": [ "!head -10 ml-100k/u.user" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Referring back to the `README` file, the meanings of these fields are:\n", "\n", "* user id\n", "* age\n", "* gender\n", "* occupation\n", "* zip code\n", "\n", "Here are the data types I would pick for these fields:\n", "\n", "* user id: `int`\n", "* age: `int`\n", "* gender: `varchar(1)`\n", "* occupation: `varchar(80)`\n", "* zip code: `varchar(10)` (in case there are any extended ZIP codes?)\n", "\n", "And the `CREATE TABLE` statement:\n", "\n", "```\n", "CREATE TABLE uuser (\n", "    user_id int,\n", "    age int,\n", "    gender varchar(1),\n", "    occupation varchar(80),\n", "    
zip_code varchar(10)\n", ");\n", "```\n", "\n", "(Note that I called the table `uuser`---you can't just use the name `user`, as it's a reserved word in SQL.)\n", "\n", "Okay cool. Now that the table exists, we can use the `\\copy` command to import the actual data. Here's what the command looks like:\n", "\n", "```\n", "\\copy table_name from path delimiter ',' csv\n", "```\n", "\n", "The words `from` and `delimiter` and `csv` are part of the command. The parts you should change are `table_name` (replace this with the name of the table you want to import into), `path` (replace this with the path to the CSV file on your hard drive) and `','` (replace the comma with whatever character separates the items on each line of the file).\n", "\n", "So, for example, to import the data from `u.user`:\n", "\n", "```\n", "\\copy uuser from /Users/allison/Dropbox/projects/lede/pandas-notes/ml-100k/u.user delimiter '|' csv\n", "```\n", "\n", "Make sure to replace the path with the path to your actual file. If everything's worked out correctly, you should see a confirmation (`COPY 943`, indicating that 943 rows were copied). If not, check the syntax and try again.\n", "\n", "You can confirm that the copy worked by issuing a `SELECT` statement, like:\n", "\n", "```\n", "widgetdb=# SELECT * FROM uuser WHERE gender = 'F' AND occupation = 'engineer';\n", " user_id | age | gender | occupation | zip_code \n", "---------+-----+--------+------------+----------\n", "     786 |  36 | F      | engineer   | 01754\n", "     827 |  23 | F      | engineer   | 80228\n", "(2 rows)\n", "```\n", "\n", "### Repeating the process!\n", "\n", "Let's repeat the process with the `u.data` file. 
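\n", "\n", "(A quick aside before we do: earlier I mentioned that you could also load rows by issuing `INSERT` statements programmatically from Python. Here's a minimal sketch of that pattern. To keep the sketch runnable anywhere, it uses Python's built-in `sqlite3` module as a stand-in database; with PostgreSQL you'd use a driver like `psycopg2`, whose cursors offer the same `execute`/`executemany` interface, except with `%s` placeholders instead of `?`.)\n", "\n", "```python\n", "import sqlite3\n", "\n", "# A throwaway in-memory database standing in for PostgreSQL\n", "conn = sqlite3.connect(':memory:')\n", "conn.execute('CREATE TABLE widget (part_no text, name text, price real, quantity int)')\n", "\n", "rows = [\n", "    ('C1-9476', 'Skinner Widget', 2.70, 512),\n", "    # ... one tuple per row, e.g. looped over from a CSV or an API ...\n", "]\n", "\n", "# executemany issues one parameterized INSERT per tuple\n", "conn.executemany('INSERT INTO widget (part_no, name, price, quantity) '\n", "                 'VALUES (?, ?, ?, ?)', rows)\n", "conn.commit()\n", "print(conn.execute('SELECT count(*) FROM widget').fetchone()[0])  # 1\n", "```\n", "\n", "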
Here's what the data looks like:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "196\t242\t3\t881250949\r\n", "186\t302\t3\t891717742\r\n", "22\t377\t1\t878887116\r\n", "244\t51\t2\t880606923\r\n", "166\t346\t1\t886397596\r\n", "298\t474\t4\t884182806\r\n", "115\t265\t2\t881171488\r\n", "253\t465\t5\t891628467\r\n", "305\t451\t3\t886324817\r\n", "6\t86\t3\t883603013\r\n" ] } ], "source": [ "!head -10 ml-100k/u.data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to the README, here's what the columns mean:\n", "\n", "```\n", "user id | item id | rating | timestamp\n", "```\n", "\n", "My best guess at a table schema:\n", "\n", "```\n", "CREATE TABLE udata (\n", "    user_id int,\n", "    item_id int,\n", "    rating int,\n", "    timestamp int\n", ");\n", "```\n", "\n", "(The timestamps in this file are Unix timestamps---seconds since January 1st, 1970---so a plain `int` is enough to hold them.)\n", "\n", "And the `\\copy` command:\n", "\n", "```\n", "\\copy udata from /Users/allison/Dropbox/projects/lede/pandas-notes/ml-100k/u.data delimiter E'\\t' csv\n", "```\n", "\n", "(Why the `E` in front of the `'\\t'`? The `E` makes the quoted text an *escape* string, which tells PostgreSQL to interpret the `\\t` inside it as a tab character rather than as a literal backslash followed by a `t`. I had to google it. This is computers we're talking about here, there's always one damn thing or another getting in the way of your doing what you want to do.)\n", "\n", "If everything goes according to plan, you'll see a confirmation (`COPY 100000`). 
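\n", "\n", "If you want to double-check that count against the file itself, Python's `csv` module can read the same tab-delimited data. Here's a little sketch (the helper name is my own; point it at wherever your copy of `u.data` lives):\n", "\n", "```python\n", "import csv\n", "\n", "def ratings_summary(path):\n", "    # Row count and mean rating for a u.data-style tab-delimited file\n", "    with open(path) as f:\n", "        rows = list(csv.reader(f, delimiter='\\t'))\n", "    ratings = [int(row[2]) for row in rows]  # third column is the rating\n", "    return len(rows), float(sum(ratings)) / len(ratings)\n", "\n", "# ratings_summary('ml-100k/u.data') should report 100000 rows,\n", "# matching the COPY 100000 confirmation above\n", "```\n", "\n", "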
You can test the data with a `SELECT` statement like so:\n", "\n", "```\n", "widgetdb=# SELECT user_id, avg(rating), count(rating)\n", "    FROM udata GROUP BY user_id ORDER BY avg(rating) DESC LIMIT 10;\n", " user_id |        avg         | count \n", "---------+--------------------+-------\n", "     849 | 4.8695652173913043 |    23\n", "     688 | 4.8333333333333333 |    24\n", "     507 | 4.7241379310344828 |    58\n", "     628 | 4.7037037037037037 |    27\n", "     928 | 4.6875000000000000 |    32\n", "     118 | 4.6619718309859155 |    71\n", "     907 | 4.5714285714285714 |   147\n", "     686 | 4.5633802816901408 |    71\n", "     427 | 4.5483870967741935 |    31\n", "     565 | 4.5428571428571429 |    35\n", "(10 rows)\n", "```\n", "\n", "This query shows the users with the highest average ratings (along with the number of ratings). Let's join on the `uuser` table to get each user's age and occupation as well:\n", "\n", "```\n", "widgetdb=# SELECT udata.user_id, uuser.occupation, uuser.age, \n", "widgetdb-#     avg(udata.rating), count(udata.rating)\n", "widgetdb-#     FROM udata JOIN uuser ON udata.user_id = uuser.user_id\n", "widgetdb-#     GROUP BY udata.user_id, uuser.occupation, uuser.age\n", "widgetdb-#     ORDER BY avg(udata.rating) DESC\n", "widgetdb-#     LIMIT 10;\n", " user_id |  occupation   | age |        avg         | count \n", "---------+---------------+-----+--------------------+-------\n", "     849 | student       |  15 | 4.8695652173913043 |    23\n", "     688 | administrator |  37 | 4.8333333333333333 |    24\n", "     507 | writer        |  18 | 4.7241379310344828 |    58\n", "     628 | none          |  13 | 4.7037037037037037 |    27\n", "     928 | student       |  21 | 4.6875000000000000 |    32\n", "     118 | administrator |  21 | 4.6619718309859155 |    71\n", "     907 | other         |  25 | 4.5714285714285714 |   147\n", "     686 | educator      |  32 | 4.5633802816901408 |    71\n", "     427 | doctor        |  51 | 4.5483870967741935 |    31\n", "     565 | student       |  40 | 4.5428571428571429 |    35\n", "(10 rows)\n", "```\n", "\n", "Another example: average ratings, grouped by occupation:\n", "\n", "```\n", "widgetdb=# select uuser.occupation, avg(udata.rating) from 
uuser join udata on uuser.user_id = udata.user_id group by uuser.occupation order by avg(udata.rating) desc;\n", "  occupation   |        avg         \n", "---------------+--------------------\n", " none          | 3.7791342952275250\n", " lawyer        | 3.7353159851301115\n", " doctor        | 3.6888888888888889\n", " educator      | 3.6706206312221987\n", " artist        | 3.6533795493934142\n", " administrator | 3.6356464768017115\n", " scientist     | 3.6112730806608358\n", " salesman      | 3.5829439252336449\n", " programmer    | 3.5682604794257147\n", " librarian     | 3.5607813388962640\n", " other         | 3.5523773797242802\n", " engineer      | 3.5414067278287462\n", " technician    | 3.5322304620650314\n", " student       | 3.5151432345038029\n", " marketing     | 3.4856410256410256\n", " retired       | 3.4667495338719702\n", " entertainment | 3.4410501193317422\n", " writer        | 3.3757225433526012\n", " executive     | 3.3491037320011754\n", " homemaker     | 3.3010033444816054\n", " healthcare    | 2.8962196861626248\n", "(21 rows)\n", "```\n", "\n", "> NOTE: Importing the `u.item` file is left as an exercise for the reader. But think of all the fun stuff you could do!\n", "\n", "### If something goes wrong\n", "\n", "Here are some situations you might encounter and some strategies for fixing them.\n", "\n", "If you messed up your table schema, *or* if you made your table in the wrong database, use the `DROP TABLE` command to drop the table:\n", "\n", "```\n", "DROP TABLE your_table;\n", "```\n", "\n", "... and then create the table again.\n", "\n", "If you messed up the data import (e.g., you used the wrong file, or the wrong delimiter), you can delete all of the rows from a table using:\n", "\n", "```\n", "DELETE FROM your_table;\n", "```\n", "\n", "(`TRUNCATE your_table;` does the same thing, and is much faster on big tables.) After you've deleted the rows, you can run the `\\copy` command again.\n", "\n", "## Further notes\n", "\n", "* The tables we created in this tutorial work fine, but relational databases can be very *slow* when you're working with millions of records or complicated joins. 
You can alleviate this problem with [indexes](http://www.postgresql.org/docs/9.4/static/indexes-intro.html).\n", "* We've talked about inserting data, querying data, and (to some extent) deleting data. There's one other common relational database operation that we *haven't* talked about, which is *updating* existing data. If you're interested in this particular task, read up on the [`UPDATE` statement](http://www.postgresql.org/docs/9.4/static/sql-update.html).\n", "* You can also create a new table based on a query of an existing table. See [the CREATE TABLE AS statement](http://www.postgresql.org/docs/9.4/static/sql-createtableas.html).\n", "* Pandas has [a fair amount of SQL interoperability](http://pandas.pydata.org/pandas-docs/stable/io.html#io-sql), including the ability to export Pandas DataFrames as SQL tables. Handy!\n", "* [csvkit](http://csvkit.readthedocs.org/en/latest/index.html) also has what looks to be a very extensible [SQL importer for CSV files](http://csvkit.readthedocs.org/en/latest/tutorial/3_power_tools.html#csvsql-and-sql2csv-ultimate-power)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }