{ "metadata": { "name": "", "signature": "sha256:e6abee32eeddadd6f71a99481ddd1c6aaa82ac25f60c00d949e9a1987425a06d" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Lab 3 Tabular data processing with Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lab exercise we will look at how to work with data stored in a tabular form and perform exploratory data analysis on it. We will be using the Python Data Analysis Library (aka Pandas) to do this.\n", "\n", "At the end of the exercise, remember to fill out the response form <a href=\"https://bcourses.berkeley.edu/courses/1377158/quizzes/2045084\">here</a>." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Introducting Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two main data structures that Pandas supports are Series and DataFrames. Series are one-dimensional data structures that are a collection of any numpy data type. DataFrames on the other hand are two dimensional data structures which resemble a database table or say an Excel spreadsheet. In this lab we will primarily be using DataFrames and will look at operations that we can perform using them.\n", "\n", "Before you start, download <a href=\"https://raw.github.com/biddata/datascience/master/F15/lab3/data/wc_day6_1_sample.tar.bz2\">this file</a> of world cup data into the same directory as this notebook (e.g. /home/datascience/lab3) and untar it.\n", "\n", "\n", "Also Download this restaurant dataset from <a href=\"https://raw.github.com/biddata/datascience/master/F15/lab3/data/restaurants.csv\">here</a>. Put the data file, \"restaurants.csv\", in the same directory as this notebook." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pylab import *\n", "%matplotlib inline\n", "import pandas as pd\n", "df = pd.DataFrame( { 'a' : [1, 2, 3, 4], 'b': [ 'w', 'x', 'y', 'z'] })" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you need clarifications about the Pandas API you can type the function name followed by ? to get inline help. For example to get help with the above call run:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.DataFrame?" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to see the same in a browser lookup the function in the <a href=\"http://pandas.pydata.org/pandas-docs/stable/api.html\">API documentation</a>" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "DataFrame Basics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The simplest way to see what is in a DataFrame is to just print it to the console. For example to see the DataFrame we created before you can just type df and see something like" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows that we have two columnns 'a' and 'b' and four rows in our DataFrame.\n", "\n", "However large DataFrames cannot be printed to the console and we have higher level commands to inspect its contents. To get information on the schema of the DataFrames, we can use the info function" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df.info()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see the first few rows you can use <code>head</code> and to see the last few rows you can use <code>tail</code>. This is similar to the UNIX-command line tools (Remember Lab 1 !?)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df.head(2)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To print any range of rows from the DataFrame you can use array-like indexing of row ids. As you might have noticed rows are numbered from 0 in Pandas, so to get the middle two rows we can use the range 1:3" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df[1:3]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, Pandas also has a useful function describe that summarizes the contents of numerical columns in a DataFrame. For example in df we can see the mean, standard deviation etc. for the column a by running describe." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having worked our way through the basics, lets now see how we can use Pandas for data analysis. To do this part of the lab we will reuse the World Cup soccer logs from Lab 1. However this time the input data has been sampled and formatted as a csv file that you will load first." ] }, { "cell_type": "code", "collapsed": false, "input": [ "log_df = pd.read_csv(\"wc_day6_1_sample.csv\",\n", " names=['ClientID', 'Date', 'Time', 'URL', 'ResponseCode', 'Size'],\n", " na_values=['-'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The names argument tells Pandas what the column names are in our file and na_values indicates what character is used for missing values in our dataset. Use the commands from the previous section to explore the dataset and its summary statistics.\n", "\n", "> TODO:\n", "\n", " How many rows are present in log_df ?\n", " What are the URLs between rows 85 and 87 inclusive ?\n" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "SQL-like operators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we will look at operators in Pandas that allow us to perform SQL-like queries on the dataset." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A SQL statement typically selects a subset of rows from a table that match a given criteria. This is known as the Selection (Links to an external site.) operator in Relational Algebra. Similarly we can perform selections in Pandas using boolean indexing.\n", "\n", "Boolean indexing refers to a technique where you can use a list of boolean values to filter a DataFrame. For example lets say we only want entries from '01/May/1998'. To do this we can create a boolean list like" ] }, { "cell_type": "code", "collapsed": false, "input": [ "is_may1st = log_df['Date'] == '01/May/1998'\n", "is_may1st.head(2)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can filter our DataFrame by passing it the boolean list." ] }, { "cell_type": "code", "collapsed": false, "input": [ "may1_df = log_df[is_may1st]\n", "may1_df.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can directly do this by passing in the boolean clause to the DataFrame" ] }, { "cell_type": "code", "collapsed": false, "input": [ "may1_df = log_df[log_df['Date'] == '01/May/1998']\n", "may1_df.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Projection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While selection is used for filtering rows, <a href=\"http://en.wikipedia.org/wiki/Projection_%28relational_algebra%29\">projection</a> is the relational algebra operator used to select columns. To do this with Pandas we just need to pass in a list of columns that we wish to select. For example to only keep the 'URL' and 'ResponseCode' column we would run" ] }, { "cell_type": "code", "collapsed": false, "input": [ "url_codes = log_df[['URL', 'ResponseCode']]\n", "url_codes.head(5)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Grouping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas also allows you to group the DataFrame by values in any column. For example to group requests by 'ResponseCode' you can run" ] }, { "cell_type": "code", "collapsed": false, "input": [ "grouped = log_df.groupby('ResponseCode')\n", "grouped" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see from the output above, grouped is not a DataFrame but an object of type DataFrameGroupBy. This just means that it contains a number of groups and each group is in turn a DataFrame. To see this try" ] }, { "cell_type": "code", "collapsed": false, "input": [ "grouped.ngroups\n", "grouped.groups.keys()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "grouped.get_group(200).head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also group by multiple columns by just passing the a list of column names. For example to group by both date and response code you can run" ] }, { "cell_type": "code", "collapsed": false, "input": [ "multi_grouped = log_df.groupby(['ResponseCode', 'Date'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas also has useful commands (Links to an external site.) to print various statistics about elements in each group.\n", "\n", "1. grouped.describe() prints summary statistics for numeric columns in each group\n", "2. grouped.size() prints the number of elements in each group\n", "3. Similarly grouped.sum(), grouped.mean() and grouped.median() print the sum, mean and median values for numeric columns in each group\n" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Plotting data in a DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To plot a Series or a DataFrame you can just call plot() on the object and for a histogram just call hist()" ] }, { "cell_type": "code", "collapsed": false, "input": [ "rand_df = pd.DataFrame({'a' : randn(100)})\n", "rand_df.plot()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "rand_df.hist()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Joins in Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joins\n", "\n", "A **join** is a way to connect rows in two different data tables based on some criteria.\n", "Suppose the university has a database for student records with two tables in it: *Students* and *Grades*.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "\n", "Students = pd.DataFrame({'student_id': [1, 2], 'name': ['Alice', 'Bob']})\n", "Students" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "Grades = pd.DataFrame({'student_id': [1, 1, 2, 2], 'class_id': [1, 2, 1, 3], 'grade': ['A', 'C', 'B', 'B']})\n", "Grades" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say we want to know all of Bob's grades.\n", "Then, we can look up Bob's student ID in the Students table, and with the ID, look up his grades in the Grades table.\n", "Joins naturally express this process: when two tables share a common type of column (student ID in this case), we can join the tables together to get a complete view.\n", "\n", "In Pandas, we can use the **merge** method to perform a join.\n", "Pass the two tables to join as the first arguments, then the \"on\" parameter is set to the join column name." ] }, { "cell_type": "code", "collapsed": false, "input": [ "Student_Grades = pd.merge(Students, Grades, on='student_id')\n", "Student_Grades" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> TODO\n", "\n", "1. Use **merge** to join Student_Grades with the Classes table below, and find out what class Alice got an A in." ] }, { "cell_type": "code", "collapsed": false, "input": [ "Classes = pd.DataFrame({'class_id': [1, 2, 3], 'title': ['Math', 'English', 'Spanish']})" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Joining the Restaurant Data\n", "\n", "Now let's load the restaurant data that we will be analyzing:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "resto = pd.read_csv('restaurants.csv')\n", "resto.info()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "resto[:10]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The restaurant data has four columns.\n", "**id** is a unique ID field (unique for each row), **name** is the name of the restaurant, and **city** is where it is located.\n", "The fourth column, **cluster**, is a \"gold standard\" column.\n", "If two records have the same **cluster**, that means they are both about the same restaurant.\n", "\n", "The type of join we made above between Students and Grades, where we link records with equal values in a common column, is called an *equijoin*.\n", "Equijoins may join on more than one column, too (both value have to match).\n", "\n", "Let's use an equijoin to find pairs of duplicate restaurant records.\n", "We join the data to itself, on the **cluster** column.\n", "\n", "> Note: a join between a table and itself is called a *self-join*.\n", "\n", "The result (\"clusters\" below) has a lot of extra records in it.\n", "For example, since we're joining a table to itself, every record matches itself.\n", "We can filter on IDs to get rid of these extra join results.\n", "Note that when Pandas joins two tables that have columns with the same name, it appends \"_x\" and \"_y\" to the names to distinguish them." ] }, { "cell_type": "code", "collapsed": false, "input": [ "clusters = pd.merge(resto, resto, on='cluster')\n", "clusters = clusters[clusters.id_x != clusters.id_y]\n", "clusters[:10]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> TODO\n", "\n", "1. There are still extra records in *clusters*, above. If records *A* and *B* match each other, then we will get both (*A*, *B*) and (*B*, *A*) in the output.\n", "Filter *clusters* so that we only keep one instance of each matching pair (HINT: use the IDs again).\n" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> TODO: How many rows are there in your filtered table?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optional: Fuzzy Joins\n", "\n", "Do this section if you have time. There are no lab responses for it however. \n", "\n", "Sometimes an equijoin isn't good enough.\n", "\n", "Say you want to match up records that are *almost* equal in a column.\n", "Or where a *function* of a columns is equal.\n", "Or maybe you don't care about equality: maybe \"less than\" or \"greater than or equal to\" is what you want.\n", "These cases call for a more general join than equijoin.\n", "\n", "We are going to make one of these joins between the restaurants data and itself.\n", "Specifically, we want to match up pairs of records whose restaurant names are *almost* the same.\n", "We call this a **fuzzy join**.\n", "\n", "To do a fuzzy join in Pandas we need to go about it in a few steps:\n", "\n", "1. Join every record in the first table with every record in the second table. This is called the **Cartesian product** of the tables, and it's simply a list of all possible pairs of records.\n", "2. Add a column to the Cartesian product that measures how \"similar\" each pair of records is. This is our **join criterion**.\n", "3. Filter the Cartesian product based on when the join criterion is \"similar enough.\"\n", "\n", "> SQL Aside: In SQL, all of joins are supported in about the same way as equijoins are.\n", "> Essentially, you write a boolean expression on columns from the join-tables, and whenever that expression is true, you join the records together.\n", "> This is very similar to writing an **if** statement in python or Java.\n", "\n", "Let's do an example to get the hang of it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Edit Distance\n", "\n", "We're going to be using a string-similarity python library to compute \"edit distance\".\n", "\n", "To test that it works, the following should run OK:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import Levenshtein as L" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. Join every record in the first table with every record in the second table.\n", "\n", "We use a \"dummy\" column to compute the Cartesian product of the data with itself.\n", "**dummy** takes the same value for every record, so we can do an equijoin and get back all pairs." ] }, { "cell_type": "code", "collapsed": false, "input": [ "resto['dummy'] = 0\n", "prod = pd.merge(resto, resto, on='dummy')\n", "\n", "# Clean up\n", "del prod['dummy']\n", "del resto['dummy']\n", "\n", "# Show that prod is the size of \"resto\" squared:\n", "print len(prod), len(resto)**2" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "prod[:10]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Like we did with *clusters* remove \"extra\" record pairs, e.g., ones with the same ID." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Add a column to the Cartesian product that measures how \"similar\" each pair of records is.\n", "\n", "In the homework assignment, we used a string similarity metric called *cosine similarity* which measured how many \"tokens\" two strings shared in common.\n", "Now, we're going to use an alternative measure of string similarity called *edit-distance*.\n", "[Edit-distance](http://en.wikipedia.org/wiki/Edit_distance) counts the number of simple changes you have to make to a string to turn it into another string.\n", "\n", "Import the edit distance library:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import Levenshtein as L\n", "\n", "L.distance('Hello, World!', 'Hallo, World!')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we add a computed column, named **distance**, that measures the edit distance between the names of two restaurants:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# This takes a minute or two to run\n", "prod['distance'] = prod.apply(lambda r: L.distance(r['name_x'], r['name_y']), axis=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3. Filter the Cartesian product based on when the join criterion is \"similar enough.\"\n", "\n", "Now we complete the join by filtering out pairs or records that aren't equal enough for our liking.\n", "As in the first homework assignment, we can only figure out how similar is \"similar enough\" by trying out some different options.\n", "Let's try maximum edit-distance from 0 to 10 and compute precision and recall." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import pylab\n", "\n", "def accuracy(max_distance):\n", " similar = prod[prod.distance < max_distance]\n", " correct = float(sum(similar.cluster_x == similar.cluster_y))\n", " precision = correct / len(similar)\n", " recall = correct / len(clusters)\n", " return (precision, recall)\n", "\n", "thresholds = range(1, 11)\n", "p = []\n", "r = []\n", "\n", "for t in thresholds:\n", " acc = accuracy(t)\n", " p.append(acc[0])\n", " r.append(acc[1])\n", "\n", "pylab.plot(thresholds, p)\n", "pylab.plot(thresholds, r)\n", "pylab.legend(['precision', 'recall'], loc='upper left')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_1._ Another common way to visualize the tradeoff between precision and recall is to plot them directly against each other.\n", "Create a scatterplot with precision on one axis and recall on the other.\n", "Where are \"good\" points on the plot, and where are \"bad\" ones.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, remember to fill out the response form <a href=\"https://bcourses.berkeley.edu/courses/1377158/quizzes/2045084\">here</a> !" ] } ], "metadata": {} } ] }