{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Joining (Merging) DataFrames"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the [MovieLens 100k data](http://grouplens.org/datasets/movielens/), let's create two DataFrames:\n",
"\n",
"- **movies**: shows information about movies, namely a unique **movie_id** and its **title**\n",
"- **ratings**: shows the **rating** that a particular **user_id** gave to a particular **movie_id** at a particular **timestamp**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Movies"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" movie_id | \n",
" title | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" GoldenEye (1995) | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" Four Rooms (1995) | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" Get Shorty (1995) | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" Copycat (1995) | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" movie_id title\n",
"0 1 Toy Story (1995)\n",
"1 2 GoldenEye (1995)\n",
"2 3 Four Rooms (1995)\n",
"3 4 Get Shorty (1995)\n",
"4 5 Copycat (1995)"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"movie_url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.item'\n",
"movie_cols = ['movie_id', 'title']\n",
"movies = pd.read_table(movie_url, sep='|', header=None, names=movie_cols, usecols=[0, 1])\n",
"movies.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ratings"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" movie_id | \n",
" rating | \n",
" timestamp | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 196 | \n",
" 242 | \n",
" 3 | \n",
" 881250949 | \n",
"
\n",
" \n",
" 1 | \n",
" 186 | \n",
" 302 | \n",
" 3 | \n",
" 891717742 | \n",
"
\n",
" \n",
" 2 | \n",
" 22 | \n",
" 377 | \n",
" 1 | \n",
" 878887116 | \n",
"
\n",
" \n",
" 3 | \n",
" 244 | \n",
" 51 | \n",
" 2 | \n",
" 880606923 | \n",
"
\n",
" \n",
" 4 | \n",
" 166 | \n",
" 346 | \n",
" 1 | \n",
" 886397596 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id movie_id rating timestamp\n",
"0 196 242 3 881250949\n",
"1 186 302 3 891717742\n",
"2 22 377 1 878887116\n",
"3 244 51 2 880606923\n",
"4 166 346 1 886397596"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rating_url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.data'\n",
"rating_cols = ['user_id', 'movie_id', 'rating', 'timestamp']\n",
"ratings = pd.read_table(rating_url, sep='\\t', header=None, names=rating_cols)\n",
"ratings.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's pretend that you want to examine the ratings DataFrame, but you want to know the **title** of each movie rather than its **movie_id**. The best way to accomplish this objective is by \"joining\" (or \"merging\") the DataFrames using the Pandas `merge` function:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" movie_id | \n",
" title | \n",
" user_id | \n",
" rating | \n",
" timestamp | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 308 | \n",
" 4 | \n",
" 887736532 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 287 | \n",
" 5 | \n",
" 875334088 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 148 | \n",
" 4 | \n",
" 877019411 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 280 | \n",
" 4 | \n",
" 891700426 | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 66 | \n",
" 3 | \n",
" 883601324 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" movie_id title user_id rating timestamp\n",
"0 1 Toy Story (1995) 308 4 887736532\n",
"1 1 Toy Story (1995) 287 5 875334088\n",
"2 1 Toy Story (1995) 148 4 877019411\n",
"3 1 Toy Story (1995) 280 4 891700426\n",
"4 1 Toy Story (1995) 66 3 883601324"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movie_ratings = pd.merge(movies, ratings)\n",
"movie_ratings.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's what just happened:\n",
"\n",
"- Pandas noticed that movies and ratings had one column in common, namely **movie_id**. This is the \"key\" on which the DataFrames will be joined.\n",
"- The first **movie_id** in movies is 1. Thus, Pandas looked through every row in the ratings DataFrame, searching for a movie_id of 1. Every time it found such a row, it recorded the **user_id**, **rating**, and **timestamp** listed in that row. In this case, it found 452 matching rows.\n",
"- The second **movie_id** in movies is 2. Again, Pandas did a search of ratings and found 131 matching rows.\n",
"- This process was repeated for all of the remaining rows in movies.\n",
"\n",
"At the end of the process, the movie_ratings DataFrame is created, which contains the two columns from movies (**movie_id** and **title**) and the three other colums from ratings (**user_id**, **rating**, and **timestamp**).\n",
"\n",
"- **movie_id** 1 and its **title** are listed 452 times, next to the **user_id**, **rating**, and **timestamp** for each of the 452 matching ratings.\n",
"- **movie_id** 2 and its **title** are listed 131 times, next to the **user_id**, **rating**, and **timestamp** for each of the 131 matching ratings.\n",
"- And so on, for every movie in the dataset."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1682, 2)\n",
"(100000, 4)\n",
"(100000, 5)\n"
]
}
],
"source": [
"print movies.shape\n",
"print ratings.shape\n",
"print movie_ratings.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice the shapes of the three DataFrames:\n",
"\n",
"- There are 1682 rows in the movies DataFrame.\n",
"- There are 100000 rows in the ratings DataFrame.\n",
"- The `merge` function resulted in a movie_ratings DataFrame with 100000 rows, because every row from ratings matched a row from movies.\n",
"- The movie_ratings DataFrame has 5 columns, namely the 2 columns from movies, plus the 4 columns from ratings, minus the 1 column in common.\n",
"\n",
"By default, the `merge` function joins the DataFrames using all column names that are in common (**movie_id**, in this case). The [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) explains how you can override this behavior."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Four Types of Joins"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are actually four types of joins supported by the Pandas `merge` function. Here's how they are described by the documentation:\n",
"\n",
"- **inner:** use intersection of keys from both frames (SQL: inner join)\n",
"- **outer:** use union of keys from both frames (SQL: full outer join)\n",
"- **left:** use only keys from left frame (SQL: left outer join)\n",
"- **right:** use only keys from right frame (SQL: right outer join)\n",
"\n",
"The default is the \"inner join\", which was used when creating the movie_ratings DataFrame.\n",
"\n",
"It's easiest to understand the different types by looking at some simple examples:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example DataFrames A and B"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" num | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" green | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" yellow | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" red | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" color num\n",
"0 green 1\n",
"1 yellow 2\n",
"2 red 3"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"A = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num':[1, 2, 3]})\n",
"A"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" green | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" yellow | \n",
" M | \n",
"
\n",
" \n",
" 2 | \n",
" pink | \n",
" L | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" color size\n",
"0 green S\n",
"1 yellow M\n",
"2 pink L"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"B = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size':['S', 'M', 'L']})\n",
"B"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Inner join\n",
"\n",
"Only include observations found in both A and B:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" num | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" green | \n",
" 1 | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" yellow | \n",
" 2 | \n",
" M | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" color num size\n",
"0 green 1 S\n",
"1 yellow 2 M"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(A, B, how='inner')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Outer join\n",
"\n",
"Include observations found in either A or B:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" num | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" green | \n",
" 1 | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" yellow | \n",
" 2 | \n",
" M | \n",
"
\n",
" \n",
" 2 | \n",
" red | \n",
" 3 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" pink | \n",
" NaN | \n",
" L | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" color num size\n",
"0 green 1 S\n",
"1 yellow 2 M\n",
"2 red 3 NaN\n",
"3 pink NaN L"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(A, B, how='outer')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Left join\n",
"\n",
"Include all observations found in A:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" num | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" green | \n",
" 1 | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" yellow | \n",
" 2 | \n",
" M | \n",
"
\n",
" \n",
" 2 | \n",
" red | \n",
" 3 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" color num size\n",
"0 green 1 S\n",
"1 yellow 2 M\n",
"2 red 3 NaN"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(A, B, how='left')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Right join\n",
"\n",
"Include all observations found in B:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" num | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" green | \n",
" 1 | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" yellow | \n",
" 2 | \n",
" M | \n",
"
\n",
" \n",
" 2 | \n",
" pink | \n",
" NaN | \n",
" L | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" color num size\n",
"0 green 1 S\n",
"1 yellow 2 M\n",
"2 pink NaN L"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(A, B, how='right')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}