{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Wrangling in Pandas\n", "\n", "This session draws primarily on Chapter 7 in Python for Data Analysis. It covers methods that are used heavily in 'data wrangling', which refers to the data manipulation that is often needed to transform raw data into a form that is useful for analysis. We'll stick to the data and examples used in the book for most of this session, since the examples are clearer on the tiny datasets. After that we will work through some of these methods again using real data.\n", "\n", "Key methods covered include:\n", "\n", "* Merging and Concatenating\n", "* Reshaping data\n", "* Data transformations\n", "* Categorization\n", "* Detecting and Filtering Outliers\n", "* Creating Dummy Variables\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Merging\n", "\n", "Merging two datasets is a very common operation in preparing data for analysis. It generally means adding columns from one table to colums from another, where the value of some key, or merge field, matches.\n", "\n", "Let's begin by creating two simple DataFrames to be merged." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})\n", "df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})\n", "print(df1)\n", "print(df2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a many to one merge. The join field is implicit, based on what columns it finds in common between the two dataframes. Note that they share some values of the key field (a, b), but do not share key values c and d. What do you expect to happen when we merge them? The result contains the values from both inputs where they both have a value of the merge field, which is 'key' in this example. The default behavior is that the key value has to be in both inputs to be kept. In set terms it would be an intersection of the two sets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pd.merge(df1,df2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the same merge, but making the join field explicit.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pd.merge(df1,df2, on='key')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#what if there are more than one value of key in both dataframes? This is a many-to-many merge.\n", "df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})\n", "df3 = pd.DataFrame({'key': ['a', 'b', 'b', 'd'],'data2': range(4)})\n", "print(df1)\n", "print(df3)\n", "pd.merge(df1,df3, on='key')\n", "#This produces a cartesian product of the number of occurrences of each key value in both dataframes:\n", "# (b shows up 3 times in df1 and 2 times in df3, so we get 6 occurrences in the result of the merge)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# There are several types of joins: left, right, inner, and outer. Let's compare them.\n", "# How does a 'left' join compare to our initial join? Note that it keeps the result if it shows up in df1,\n", "# regardless of whether it also shows up in df2. It fills in a value of NaN for the missing value from df2.\n", "pd.merge(df1,df3, on='key', how='left')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#How does a 'right' join compare? Same idea, but this time it keeps a result if it shows up in df2, regardless\n", "# of whether it also shows up in df1.\n", "pd.merge(df1,df3, on='key', how='right')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#How does an 'inner' join compare?\n", "pd.merge(df1,df3, on='key', how='inner')\n", "# seems to be the default argument..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#How does an 'outer' join compare? If inner joins are like an intersection of two sets, outer joins are unions.\n", "pd.merge(df1,df3, on='key', how='outer')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#What if the join fields have different names? No problem - just specify the names.\n", "df4 = pd.DataFrame({'key_1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})\n", "df5 = pd.DataFrame({'key_2': ['a', 'b', 'b', 'd'],'data2': range(4)})\n", "pd.merge(df4,df5, left_on='key_1', right_on='key_2')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Here is an example that uses a combination of a data column and an index to merge two dataframes.\n", "df4 = pd.DataFrame({'key_1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})\n", "df5 = pd.DataFrame({'data2': [4,6,8,10]}, index=['a','b','c','d'])\n", "pd.merge(df4,df5, left_on='key_1', right_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Concatenating" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Concatenating can append rows, or columns, depending on which axis you use. Default is 0\n", "s1 = pd.Series([0, 1], index=['a', 'b'])\n", "s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])\n", "s3 = pd.Series([5, 6], index=['f', 'g'])\n", "pd.concat([s1, s2, s3])\n", "# Since we are concatenating series on axis 0, this creates a longer series, appending each of the three series" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# What if we concatenate on axis 1?\n", "pd.concat([s1, s2, s3], axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Outer join is the default:\n", "pd.concat([s1, s2, s3], axis=1, join='outer')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# What would an inner join produce?\n", "pd.concat([s1, s2, s3], axis=1, join='inner')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We need some overlapping values to have the inner join produe non-empty results\n", "s4 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])\n", "s5 = pd.Series([1, 2, 3], index=['d', 'e', 'f'])\n", "s6 = pd.Series([7, 8, 9, 10], index=['d', 'e', 'f', 'g'])\n", "pd.concat([s4, s5, s6], axis=1, join='outer')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Here is the inner join \n", "pd.concat([s4, s5, s6], axis=1, join='inner')\n", "# Note that it contains only entries that overlap in all three series." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reshaping with Hierarchical Indexing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = pd.DataFrame(np.arange(6).reshape((2, 3)),\n", " index=pd.Index(['Ohio', 'Colorado'], name='state'),\n", " columns=pd.Index(['one', 'two', 'three'], name='number'))\n", "data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Stack pivots the columns into rows, producing a Series with a hierarchical index:\n", "result = data.stack()\n", "result" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Unstack reverses this process:\n", "result.unstack()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See also the related pivot method" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Transformations" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Start with a dataframe containing some duplicate values\n", "data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,'k2': [1, 1, 2, 3, 3, 4, 99]})\n", "data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# How to see which rows contain duplicate values\n", "data.duplicated()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# How to remove duplicate values\n", "data.drop_duplicates()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#If 99 is a code for missing data, we could replace any such values with NaNs\n", "data['k2'].replace(99,np.nan)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorization (binning)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Let's look at how to create categories of data using ranges to bin the data using cut\n", "ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]\n", "bins = [18, 25, 35, 60, 100]\n", "cats = pd.cut(ages, bins)\n", "type(cats)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cats.categories" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cats.codes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pd.value_counts(cats)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Consistent with mathematical notation for intervals, a parenthesis means that the side is open while the \n", "#square bracket means it is closed (inclusive). Which side is closed can be changed by passing right=False:\n", "cats = pd.cut(ages, bins, right=False)\n", "print(ages)\n", "print(pd.value_counts(cats))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Removing Outliers" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Start by creating a dataframe with 4 columns of 1,000 random numbers\n", "# We'll use a fixed seed for the random number generator to get repeatable results\n", "np.random.seed(12345)\n", "data = pd.DataFrame(np.random.randn(1000, 4))\n", "data.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# This identifies any values in column 3 with absolute values > 3\n", "col = data[3]\n", "col[np.abs(col) > 3]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# This identifies all the rows with any column containing absolute values > 3\n", "data[(np.abs(data) > 3).any(1)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Now we can cap the values at -3 to 3 using this:\n", "data[np.abs(data) > 3] = np.sign(data) * 3\n", "data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing Dummy Variables" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],'data1': range(6)})\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# This generates dummy variables for each value of key\n", "# Dummy variables are useful in statistical modeling, to have 0/1 indicator\n", "# variables for the presence of some condition\n", "pd.get_dummies(df['key'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# This generates dummy variables for each value of key and appends these to the dataframe\n", "dummies = pd.get_dummies(df['key'], prefix='key')\n", "df_with_dummy = df[['data1']].join(dummies)\n", "df_with_dummy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that we used join instead of merge. The join method is very similar to merge, but uses indexes to merge, by default. From the documentation:\n", "\n", "http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging\n", "merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.\n", "\n", "The related DataFrame.join method, uses merge internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for merge). If you are joining on index, you may wish to use DataFrame.join to save yourself some typing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reviewing our earlier application of Data Wrangling to Craigslist Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# import libraries and read in the csv file\n", "import re as re, pandas as pd, numpy as np, requests, json\n", "df = pd.read_csv('bay.csv')\n", "print(df[:5])\n", "\n", "# clean price and neighborhood\n", "df.price = df.price.str.strip('$').astype('float64')\n", "df.neighborhood = df.neighborhood.str.strip().str.strip('(').str.strip(')')\n", "\n", "# break out the date into month day year columns\n", "df['month'] = df['date'].str.split().str[0]\n", "df['day'] = df['date'].str.split().str[1].astype('int32')\n", "df['year'] = df['date'].str.split().str[2].astype('int32')\n", "del df['date']\n", "\n", "def clean_br(value):\n", " if isinstance(value, str):\n", " end = value.find('br')\n", " if end == -1:\n", " return None\n", " else:\n", " start = value.find('/') + 2\n", " return int(value[start:end])\n", "df['bedrooms'] = df['bedrooms'].map(clean_br)\n", "\n", "def clean_sqft(value):\n", " if isinstance(value, str):\n", " end = value.find('ft')\n", " if end == -1:\n", " return None\n", " else:\n", " if value.find('br') == -1:\n", " start = value.find('/') + 2\n", " else:\n", " start = value.find('-') + 2\n", " return int(value[start:end])\n", "df['sqft'] = df['sqft'].map(clean_sqft)\n", "\n", "\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's do some wrangling on this dataset:\n", "1. Find outliers in rent, say below 200 or above 10,000\n", "1. Analyze the data without missing data\n", "1. Create a dataset that removes the outliers" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df['price'].dropna().describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df['price'][(df['price'] < 200)].dropna().describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df['price'][(df['price'] > 10000)].dropna().describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Let's get a quantile value at the 99 percentile to see the value that the top one percent of our records exceed\n", "df['price'].dropna().quantile(.99)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "filtered = df[(df['price'] < 10000) & (df['price'] > 200)]\n", "filtered.dropna().describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## OK, now on your own: \n", "1. Filter out records with more than 4 bedrooms\n", "2. Create dummy variables for each bedroom count (e.g. bed_1 would have 1 for rows with 1 bedroom, 0 for others), and merge them with the dataframe\n", "3. Filter sqft < 500 and > 3000\n", "4. Create a set of 5 bins for price and do counts of how many records are in each category" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }