{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Joining (Merging) DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the [MovieLens 100k data](http://grouplens.org/datasets/movielens/), let's create two DataFrames:\n", "\n", "- **movies**: shows information about movies, namely a unique **movie_id** and its **title**\n", "- **ratings**: shows the **rating** that a particular **user_id** gave to a particular **movie_id** at a particular **timestamp**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Movies" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
movie_idtitle
01Toy Story (1995)
12GoldenEye (1995)
23Four Rooms (1995)
34Get Shorty (1995)
45Copycat (1995)
\n", "
" ], "text/plain": [ " movie_id title\n", "0 1 Toy Story (1995)\n", "1 2 GoldenEye (1995)\n", "2 3 Four Rooms (1995)\n", "3 4 Get Shorty (1995)\n", "4 5 Copycat (1995)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "movie_url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.item'\n", "movie_cols = ['movie_id', 'title']\n", "movies = pd.read_table(movie_url, sep='|', header=None, names=movie_cols, usecols=[0, 1])\n", "movies.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ratings" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
user_idmovie_idratingtimestamp
01962423881250949
11863023891717742
2223771878887116
3244512880606923
41663461886397596
\n", "
" ], "text/plain": [ " user_id movie_id rating timestamp\n", "0 196 242 3 881250949\n", "1 186 302 3 891717742\n", "2 22 377 1 878887116\n", "3 244 51 2 880606923\n", "4 166 346 1 886397596" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rating_url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.data'\n", "rating_cols = ['user_id', 'movie_id', 'rating', 'timestamp']\n", "ratings = pd.read_table(rating_url, sep='\\t', header=None, names=rating_cols)\n", "ratings.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's pretend that you want to examine the ratings DataFrame, but you want to know the **title** of each movie rather than its **movie_id**. The best way to accomplish this objective is by \"joining\" (or \"merging\") the DataFrames using the Pandas `merge` function:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
movie_idtitleuser_idratingtimestamp
01Toy Story (1995)3084887736532
11Toy Story (1995)2875875334088
21Toy Story (1995)1484877019411
31Toy Story (1995)2804891700426
41Toy Story (1995)663883601324
\n", "
" ], "text/plain": [ "   movie_id             title  user_id  rating  timestamp\n", "0         1  Toy Story (1995)      308       4  887736532\n", "1         1  Toy Story (1995)      287       5  875334088\n", "2         1  Toy Story (1995)      148       4  877019411\n", "3         1  Toy Story (1995)      280       4  891700426\n", "4         1  Toy Story (1995)       66       3  883601324" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_ratings = pd.merge(movies, ratings)\n", "movie_ratings.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's what just happened:\n", "\n", "- Pandas noticed that movies and ratings had one column in common, namely **movie_id**. This is the \"key\" on which the DataFrames will be joined.\n", "- The first **movie_id** in movies is 1. Thus, Pandas looked through every row in the ratings DataFrame, searching for a movie_id of 1. Every time it found such a row, it recorded the **user_id**, **rating**, and **timestamp** listed in that row. In this case, it found 452 matching rows.\n", "- The second **movie_id** in movies is 2. Again, Pandas searched ratings and found 131 matching rows.\n", "- This process was repeated for all of the remaining rows in movies.\n", "\n", "At the end of the process, the movie_ratings DataFrame is created, which contains the two columns from movies (**movie_id** and **title**) and the three other columns from ratings (**user_id**, **rating**, and **timestamp**).\n", "\n", "- **movie_id** 1 and its **title** are listed 452 times, next to the **user_id**, **rating**, and **timestamp** for each of the 452 matching ratings.\n", "- **movie_id** 2 and its **title** are listed 131 times, next to the **user_id**, **rating**, and **timestamp** for each of the 131 matching ratings.\n", "- And so on, for every movie in the dataset." 
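] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can check the matching-row counts described above directly against the ratings DataFrame (a quick sanity check, reusing the `ratings` DataFrame created earlier):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# count the rows in ratings that match each movie_id\n", "# (should be 452 for movie_id 1 and 131 for movie_id 2)\n", "ratings.movie_id.value_counts()[[1, 2]]"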
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1682, 2)\n", "(100000, 4)\n", "(100000, 5)\n" ] } ], "source": [ "print(movies.shape)\n", "print(ratings.shape)\n", "print(movie_ratings.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the shapes of the three DataFrames:\n", "\n", "- There are 1682 rows in the movies DataFrame.\n", "- There are 100000 rows in the ratings DataFrame.\n", "- The `merge` function resulted in a movie_ratings DataFrame with 100000 rows, because every row from ratings matched a row from movies.\n", "- The movie_ratings DataFrame has 5 columns, namely the 2 columns from movies, plus the 4 columns from ratings, minus the 1 column in common.\n", "\n", "By default, the `merge` function joins the DataFrames using all column names that are in common (**movie_id**, in this case). The [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) explains how you can override this behavior." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Four Types of Joins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are actually four types of joins supported by the Pandas `merge` function. 
Here's how they are described by the documentation:\n", "\n", "- **inner:** use intersection of keys from both frames (SQL: inner join)\n", "- **outer:** use union of keys from both frames (SQL: full outer join)\n", "- **left:** use only keys from left frame (SQL: left outer join)\n", "- **right:** use only keys from right frame (SQL: right outer join)\n", "\n", "The default is the \"inner join\", which was used when creating the movie_ratings DataFrame.\n", "\n", "It's easiest to understand the different types by looking at some simple examples:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example DataFrames A and B" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colornum
0green1
1yellow2
2red3
\n", "
" ], "text/plain": [ " color num\n", "0 green 1\n", "1 yellow 2\n", "2 red 3" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num':[1, 2, 3]})\n", "A" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorsize
0greenS
1yellowM
2pinkL
\n", "
" ], "text/plain": [ " color size\n", "0 green S\n", "1 yellow M\n", "2 pink L" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size':['S', 'M', 'L']})\n", "B" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inner join\n", "\n", "Only include observations found in both A and B:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colornumsize
0green1S
1yellow2M
\n", "
" ], "text/plain": [ " color num size\n", "0 green 1 S\n", "1 yellow 2 M" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(A, B, how='inner')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Outer join\n", "\n", "Include observations found in either A or B:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colornumsize
0green1S
1yellow2M
2red3NaN
3pinkNaNL
\n", "
" ], "text/plain": [ " color num size\n", "0 green 1 S\n", "1 yellow 2 M\n", "2 red 3 NaN\n", "3 pink NaN L" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(A, B, how='outer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Left join\n", "\n", "Include all observations found in A:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colornumsize
0green1S
1yellow2M
2red3NaN
\n", "
" ], "text/plain": [ " color num size\n", "0 green 1 S\n", "1 yellow 2 M\n", "2 red 3 NaN" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(A, B, how='left')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Right join\n", "\n", "Include all observations found in B:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colornumsize
0green1S
1yellow2M
2pinkNaNL
\n", "
" ], "text/plain": [ " color num size\n", "0 green 1 S\n", "1 yellow 2 M\n", "2 pink NaN L" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(A, B, how='right')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }