{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Joining DataFrames in Pandas\n", "\n", "In previous labs, we've explored the power tables as a data management abstraction, in particular with the Pandas DataFrame object.\n", "Tables let us select rows and columns of interest, group data, and measure aggregates.\n", "\n", "But what happens when we have more than one table?\n", "Traditional relational databases usually contain many tables.\n", "Moreover, when integrating multiple data sets, we necessarily need tools to combine them.\n", "\n", "In this lab, we will use Panda's take on the database **join** operation to see how tables can be linked together.\n", "Specifically, we're going to perform a \"fuzzy join\" based on string edit-distance as another approach to finding duplicate records." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "### Data\n", "\n", "Today we'll be using a small data set of restaurants.\n", "Download the data from [here](https://raw.github.com/amplab/datascience-sp14/master/lab4/data/restaurants.csv).\n", "Put the data file, \"restaurants.csv\", in the same directory as this notebook.\n", "\n", "### Edit Distance\n", "\n", "We're going to be using a string-similarity python library to compute \"edit distance\".\n", "Install it on your VM by running the following:\n", "\n", "`sudo apt-get install python-levenshtein`\n", "\n", "**NOTE**: You may also need to run `sudo apt-get update`.\n", "\n", "To test that it works, the following should run OK:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import Levenshtein as L" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joins\n", "\n", "A **join** is a way to connect rows in two different data tables based on some criteria.\n", "Suppose the university has a database for student records with two tables in it: *Students* and *Grades*.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "\n", "Students = pd.DataFrame({'student_id': [1, 2], 'name': ['Alice', 'Bob']})\n", "Students" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", " | name | \n", "student_id | \n", "
---|---|---|
0 | \n", "Alice | \n", "1 | \n", "
1 | \n", "Bob | \n", "2 | \n", "
2 rows \u00d7 2 columns
\n", "\n", " | class_id | \n", "grade | \n", "student_id | \n", "
---|---|---|---|
0 | \n", "1 | \n", "A | \n", "1 | \n", "
1 | \n", "2 | \n", "C | \n", "1 | \n", "
2 | \n", "1 | \n", "B | \n", "2 | \n", "
3 | \n", "3 | \n", "B | \n", "2 | \n", "
4 rows \u00d7 3 columns
\n", "\n", " | name | \n", "student_id | \n", "class_id | \n", "grade | \n", "
---|---|---|---|---|
0 | \n", "Alice | \n", "1 | \n", "1 | \n", "A | \n", "
1 | \n", "Alice | \n", "1 | \n", "2 | \n", "C | \n", "
2 | \n", "Bob | \n", "2 | \n", "1 | \n", "B | \n", "
3 | \n", "Bob | \n", "2 | \n", "3 | \n", "B | \n", "
4 rows \u00d7 4 columns
\n", "\n", " | name | \n", "student_id | \n", "class_id | \n", "grade | \n", "title | \n", "
---|---|---|---|---|---|
0 | \n", "Alice | \n", "1 | \n", "1 | \n", "A | \n", "Math | \n", "
1 | \n", "Bob | \n", "2 | \n", "1 | \n", "B | \n", "Math | \n", "
2 | \n", "Alice | \n", "1 | \n", "2 | \n", "C | \n", "English | \n", "
3 | \n", "Bob | \n", "2 | \n", "3 | \n", "B | \n", "Spanish | \n", "
4 rows \u00d7 5 columns
\n", "\n", " | id | \n", "cluster | \n", "name | \n", "city | \n", "
---|---|---|---|---|
0 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "
1 | \n", "781 | \n", "675 | \n", "103 west | \n", "atlanta | \n", "
2 | \n", "279 | \n", "172 | \n", "20 mott | \n", "new york | \n", "
3 | \n", "43 | \n", "23 | \n", "21 club | \n", "new york | \n", "
4 | \n", "44 | \n", "23 | \n", "21 club | \n", "new york city | \n", "
5 | \n", "280 | \n", "173 | \n", "9 jones street | \n", "new york | \n", "
6 | \n", "486 | \n", "379 | \n", "abbey | \n", "atlanta | \n", "
7 | \n", "145 | \n", "74 | \n", "abruzzi | \n", "atlanta | \n", "
8 | \n", "146 | \n", "74 | \n", "abruzzi | \n", "atlanta | \n", "
9 | \n", "561 | \n", "454 | \n", "acquarello | \n", "san francisco | \n", "
10 rows \u00d7 4 columns
\n", "\n", " | id_x | \n", "cluster | \n", "name_x | \n", "city_x | \n", "id_y | \n", "name_y | \n", "city_y | \n", "
---|---|---|---|---|---|---|---|
4 | \n", "43 | \n", "23 | \n", "21 club | \n", "new york | \n", "44 | \n", "21 club | \n", "new york city | \n", "
5 | \n", "44 | \n", "23 | \n", "21 club | \n", "new york city | \n", "43 | \n", "21 club | \n", "new york | \n", "
10 | \n", "145 | \n", "74 | \n", "abruzzi | \n", "atlanta | \n", "146 | \n", "abruzzi | \n", "atlanta | \n", "
11 | \n", "146 | \n", "74 | \n", "abruzzi | \n", "atlanta | \n", "145 | \n", "abruzzi | \n", "atlanta | \n", "
20 | \n", "184 | \n", "94 | \n", "alain rondelli | \n", "san francisco | \n", "185 | \n", "alain rondelli | \n", "san francisco | \n", "
21 | \n", "185 | \n", "94 | \n", "alain rondelli | \n", "san francisco | \n", "184 | \n", "alain rondelli | \n", "san francisco | \n", "
36 | \n", "186 | \n", "95 | \n", "aqua | \n", "san francisco | \n", "187 | \n", "aqua | \n", "san francisco | \n", "
37 | \n", "187 | \n", "95 | \n", "aqua | \n", "san francisco | \n", "186 | \n", "aqua | \n", "san francisco | \n", "
40 | \n", "45 | \n", "24 | \n", "aquavit | \n", "new york | \n", "46 | \n", "aquavit | \n", "new york city | \n", "
41 | \n", "46 | \n", "24 | \n", "aquavit | \n", "new york city | \n", "45 | \n", "aquavit | \n", "new york | \n", "
10 rows \u00d7 7 columns
\n", "\n", " | id_x | \n", "cluster | \n", "name_x | \n", "city_x | \n", "id_y | \n", "name_y | \n", "city_y | \n", "
---|---|---|---|---|---|---|---|
4 | \n", "43 | \n", "23 | \n", "21 club | \n", "new york | \n", "44 | \n", "21 club | \n", "new york city | \n", "
10 | \n", "145 | \n", "74 | \n", "abruzzi | \n", "atlanta | \n", "146 | \n", "abruzzi | \n", "atlanta | \n", "
20 | \n", "184 | \n", "94 | \n", "alain rondelli | \n", "san francisco | \n", "185 | \n", "alain rondelli | \n", "san francisco | \n", "
36 | \n", "186 | \n", "95 | \n", "aqua | \n", "san francisco | \n", "187 | \n", "aqua | \n", "san francisco | \n", "
40 | \n", "45 | \n", "24 | \n", "aquavit | \n", "new york | \n", "46 | \n", "aquavit | \n", "new york city | \n", "
46 | \n", "1 | \n", "0 | \n", "arnie morton's of chicago | \n", "los angeles | \n", "2 | \n", "arnie morton's of chicago | \n", "los angeles | \n", "
51 | \n", "3 | \n", "1 | \n", "art's delicatessen | \n", "studio city | \n", "4 | \n", "art's deli | \n", "studio city | \n", "
58 | \n", "47 | \n", "25 | \n", "aureole | \n", "new york | \n", "48 | \n", "aureole | \n", "new york city | \n", "
62 | \n", "147 | \n", "75 | \n", "bacchanalia | \n", "atlanta | \n", "148 | \n", "bacchanalia | \n", "atlanta | \n", "
79 | \n", "5 | \n", "2 | \n", "hotel bel-air | \n", "bel air | \n", "6 | \n", "bel-air hotel | \n", "bel air | \n", "
10 rows \u00d7 7 columns
\n", "\n", " | id_x | \n", "cluster_x | \n", "name_x | \n", "city_x | \n", "id_y | \n", "cluster_y | \n", "name_y | \n", "city_y | \n", "
---|---|---|---|---|---|---|---|---|
0 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "
1 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "781 | \n", "675 | \n", "103 west | \n", "atlanta | \n", "
2 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "279 | \n", "172 | \n", "20 mott | \n", "new york | \n", "
3 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "43 | \n", "23 | \n", "21 club | \n", "new york | \n", "
4 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "44 | \n", "23 | \n", "21 club | \n", "new york city | \n", "
5 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "280 | \n", "173 | \n", "9 jones street | \n", "new york | \n", "
6 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "486 | \n", "379 | \n", "abbey | \n", "atlanta | \n", "
7 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "145 | \n", "74 | \n", "abruzzi | \n", "atlanta | \n", "
8 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "146 | \n", "74 | \n", "abruzzi | \n", "atlanta | \n", "
9 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "561 | \n", "454 | \n", "acquarello | \n", "san francisco | \n", "
10 rows \u00d7 8 columns
\n", "\n", " | id_x | \n", "cluster_x | \n", "name_x | \n", "city_x | \n", "id_y | \n", "cluster_y | \n", "name_y | \n", "city_y | \n", "
---|---|---|---|---|---|---|---|---|
1 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "781 | \n", "675 | \n", "103 west | \n", "atlanta | \n", "
9 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "561 | \n", "454 | \n", "acquarello | \n", "san francisco | \n", "
12 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "708 | \n", "602 | \n", "afghan kebab house | \n", "new york city | \n", "
20 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "782 | \n", "676 | \n", "alon's at the terrace | \n", "atlanta | \n", "
24 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "762 | \n", "656 | \n", "andre's french restaurant | \n", "las vegas | \n", "
28 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "640 | \n", "534 | \n", "apple pan the | \n", "west la | \n", "
33 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "709 | \n", "603 | \n", "arcadia | \n", "new york city | \n", "
40 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "641 | \n", "535 | \n", "asahi ramen | \n", "west la | \n", "
47 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "642 | \n", "536 | \n", "baja fresh | \n", "westlake village | \n", "
48 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "783 | \n", "677 | \n", "baker's cajun cafe | \n", "atlanta | \n", "
10 rows \u00d7 8 columns
\n", "\n", " | id_x | \n", "cluster_x | \n", "name_x | \n", "city_x | \n", "id_y | \n", "cluster_y | \n", "name_y | \n", "city_y | \n", "distance | \n", "ratio | \n", "
---|---|---|---|---|---|---|---|---|---|---|
1 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "781 | \n", "675 | \n", "103 west | \n", "atlanta | \n", "8 | \n", "0.166667 | \n", "
9 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "561 | \n", "454 | \n", "acquarello | \n", "san francisco | \n", "10 | \n", "0.000000 | \n", "
12 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "708 | \n", "602 | \n", "afghan kebab house | \n", "new york city | \n", "18 | \n", "0.000000 | \n", "
20 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "782 | \n", "676 | \n", "alon's at the terrace | \n", "atlanta | \n", "21 | \n", "0.000000 | \n", "
24 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "762 | \n", "656 | \n", "andre's french restaurant | \n", "las vegas | \n", "25 | \n", "0.000000 | \n", "
28 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "640 | \n", "534 | \n", "apple pan the | \n", "west la | \n", "14 | \n", "0.000000 | \n", "
33 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "709 | \n", "603 | \n", "arcadia | \n", "new york city | \n", "7 | \n", "0.000000 | \n", "
40 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "641 | \n", "535 | \n", "asahi ramen | \n", "west la | \n", "11 | \n", "0.000000 | \n", "
47 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "642 | \n", "536 | \n", "baja fresh | \n", "westlake village | \n", "10 | \n", "0.000000 | \n", "
48 | \n", "560 | \n", "453 | \n", "2223 | \n", "san francisco | \n", "783 | \n", "677 | \n", "baker's cajun cafe | \n", "atlanta | \n", "18 | \n", "0.000000 | \n", "
10 rows \u00d7 10 columns
\n", "