{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How do I merge DataFrames in pandas? ([video](https://www.youtube.com/watch?v=iYWKfUOtGaw&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=32))\n",
"\n",
"- [My pandas video series](https://www.dataschool.io/easier-data-analysis-with-pandas/)\n",
"- [GitHub repository](https://github.com/justmarkham/pandas-videos)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of contents\n",
"\n",
"1. Selecting a Function\n",
"2. Joining (Merging) DataFrames\n",
"3. What if...?\n",
"4. Four Types of Joins"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 1: Selecting a Function"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Taken from [Merging DataFrames with pandas](https://www.datacamp.com/courses/merging-dataframes-with-pandas?tap_a=5644-dce66f&tap_s=280411-a25fc8) (DataCamp course):\n",
"\n",
"- `df1.append(df2)`: stacking vertically\n",
"- `pd.concat([df1, df2])`:\n",
" - stacking many horizontally or vertically\n",
" - simple inner/outer joins on Indexes\n",
"- `df1.join(df2)`: inner/outer/left/right joins on Indexes\n",
"- `pd.merge(df1, df2)`: many joins on multiple columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 2: Joining (Merging) DataFrames"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the [MovieLens 100k data](http://grouplens.org/datasets/movielens/), let's create two DataFrames:\n",
"\n",
"- **movies**: shows information about movies, namely a unique **movie_id** and its **title**\n",
"- **ratings**: shows the **rating** that a particular **user_id** gave to a particular **movie_id** at a particular **timestamp**"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Movies"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" movie_id | \n",
" title | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" GoldenEye (1995) | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" Four Rooms (1995) | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" Get Shorty (1995) | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" Copycat (1995) | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" movie_id title\n",
"0 1 Toy Story (1995)\n",
"1 2 GoldenEye (1995)\n",
"2 3 Four Rooms (1995)\n",
"3 4 Get Shorty (1995)\n",
"4 5 Copycat (1995)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movie_cols = ['movie_id', 'title']\n",
"movies = pd.read_table('data/u.item', sep='|', header=None, names=movie_cols, usecols=[0, 1])\n",
"movies.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1682, 2)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies.shape"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1682"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies.movie_id.nunique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ratings"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" movie_id | \n",
" rating | \n",
" timestamp | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 196 | \n",
" 242 | \n",
" 3 | \n",
" 881250949 | \n",
"
\n",
" \n",
" 1 | \n",
" 186 | \n",
" 302 | \n",
" 3 | \n",
" 891717742 | \n",
"
\n",
" \n",
" 2 | \n",
" 22 | \n",
" 377 | \n",
" 1 | \n",
" 878887116 | \n",
"
\n",
" \n",
" 3 | \n",
" 244 | \n",
" 51 | \n",
" 2 | \n",
" 880606923 | \n",
"
\n",
" \n",
" 4 | \n",
" 166 | \n",
" 346 | \n",
" 1 | \n",
" 886397596 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id movie_id rating timestamp\n",
"0 196 242 3 881250949\n",
"1 186 302 3 891717742\n",
"2 22 377 1 878887116\n",
"3 244 51 2 880606923\n",
"4 166 346 1 886397596"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rating_cols = ['user_id', 'movie_id', 'rating', 'timestamp']\n",
"ratings = pd.read_table('data/u.data', sep='\\t', header=None, names=rating_cols)\n",
"ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(100000, 4)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings.shape"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1682"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings.movie_id.nunique()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" movie_id | \n",
" rating | \n",
" timestamp | \n",
"
\n",
" \n",
" \n",
" \n",
" 24 | \n",
" 308 | \n",
" 1 | \n",
" 4 | \n",
" 887736532 | \n",
"
\n",
" \n",
" 454 | \n",
" 287 | \n",
" 1 | \n",
" 5 | \n",
" 875334088 | \n",
"
\n",
" \n",
" 957 | \n",
" 148 | \n",
" 1 | \n",
" 4 | \n",
" 877019411 | \n",
"
\n",
" \n",
" 971 | \n",
" 280 | \n",
" 1 | \n",
" 4 | \n",
" 891700426 | \n",
"
\n",
" \n",
" 1324 | \n",
" 66 | \n",
" 1 | \n",
" 3 | \n",
" 883601324 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id movie_id rating timestamp\n",
"24 308 1 4 887736532\n",
"454 287 1 5 875334088\n",
"957 148 1 4 877019411\n",
"971 280 1 4 891700426\n",
"1324 66 1 3 883601324"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings.loc[ratings.movie_id == 1, :].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Merging Movies and Ratings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's pretend that you want to examine the ratings DataFrame, but you want to know the **title** of each movie rather than its **movie_id**. The best way to accomplish this objective is by \"joining\" (or \"merging\") the DataFrames using the Pandas `merge` function:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['movie_id', 'title'], dtype='object')"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies.columns"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['user_id', 'movie_id', 'rating', 'timestamp'], dtype='object')"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings.columns"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['movie_id', 'title', 'user_id', 'rating', 'timestamp'], dtype='object')"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movie_ratings = pd.merge(movies, ratings)\n",
"movie_ratings.columns"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" movie_id | \n",
" title | \n",
" user_id | \n",
" rating | \n",
" timestamp | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 308 | \n",
" 4 | \n",
" 887736532 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 287 | \n",
" 5 | \n",
" 875334088 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 148 | \n",
" 4 | \n",
" 877019411 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 280 | \n",
" 4 | \n",
" 891700426 | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 66 | \n",
" 3 | \n",
" 883601324 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" movie_id title user_id rating timestamp\n",
"0 1 Toy Story (1995) 308 4 887736532\n",
"1 1 Toy Story (1995) 287 5 875334088\n",
"2 1 Toy Story (1995) 148 4 877019411\n",
"3 1 Toy Story (1995) 280 4 891700426\n",
"4 1 Toy Story (1995) 66 3 883601324"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movie_ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(100000, 5)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movie_ratings.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's what just happened:\n",
"\n",
"- Pandas noticed that movies and ratings had one column in common, namely **movie_id**. This is the \"key\" on which the DataFrames will be joined.\n",
"- The first **movie_id** in movies is 1. Thus, Pandas looked through every row in the ratings DataFrame, searching for a movie_id of 1. Every time it found such a row, it recorded the **user_id**, **rating**, and **timestamp** listed in that row. In this case, it found 452 matching rows.\n",
"- The second **movie_id** in movies is 2. Again, Pandas did a search of ratings and found 131 matching rows.\n",
"- This process was repeated for all of the remaining rows in movies.\n",
"\n",
"At the end of the process, the movie_ratings DataFrame is created, which contains the two columns from movies (**movie_id** and **title**) and the three other colums from ratings (**user_id**, **rating**, and **timestamp**).\n",
"\n",
"- **movie_id** 1 and its **title** are listed 452 times, next to the **user_id**, **rating**, and **timestamp** for each of the 452 matching ratings.\n",
"- **movie_id** 2 and its **title** are listed 131 times, next to the **user_id**, **rating**, and **timestamp** for each of the 131 matching ratings.\n",
"- And so on, for every movie in the dataset."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1682, 2)\n",
"(100000, 4)\n",
"(100000, 5)\n"
]
}
],
"source": [
"print(movies.shape)\n",
"print(ratings.shape)\n",
"print(movie_ratings.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice the shapes of the three DataFrames:\n",
"\n",
"- There are 1682 rows in the movies DataFrame.\n",
"- There are 100000 rows in the ratings DataFrame.\n",
"- The `merge` function resulted in a movie_ratings DataFrame with 100000 rows, because every row from ratings matched a row from movies.\n",
"- The movie_ratings DataFrame has 5 columns, namely the 2 columns from movies, plus the 4 columns from ratings, minus the 1 column in common.\n",
"\n",
"By default, the `merge` function joins the DataFrames using all column names that are in common (**movie_id**, in this case). The [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge.html) explains how you can override this behavior."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 3: What if...?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What if the columns you want to join on don't have the same name?"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['m_id', 'title'], dtype='object')"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies.columns = ['m_id', 'title']\n",
"movies.columns"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['user_id', 'movie_id', 'rating', 'timestamp'], dtype='object')"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings.columns"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" m_id | \n",
" title | \n",
" user_id | \n",
" movie_id | \n",
" rating | \n",
" timestamp | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 308 | \n",
" 1 | \n",
" 4 | \n",
" 887736532 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 287 | \n",
" 1 | \n",
" 5 | \n",
" 875334088 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 148 | \n",
" 1 | \n",
" 4 | \n",
" 877019411 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 280 | \n",
" 1 | \n",
" 4 | \n",
" 891700426 | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 66 | \n",
" 1 | \n",
" 3 | \n",
" 883601324 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" m_id title user_id movie_id rating timestamp\n",
"0 1 Toy Story (1995) 308 1 4 887736532\n",
"1 1 Toy Story (1995) 287 1 5 875334088\n",
"2 1 Toy Story (1995) 148 1 4 877019411\n",
"3 1 Toy Story (1995) 280 1 4 891700426\n",
"4 1 Toy Story (1995) 66 1 3 883601324"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(movies, ratings, left_on='m_id', right_on='movie_id').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What if you want to join on one index?"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" title | \n",
"
\n",
" \n",
" m_id | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" Toy Story (1995) | \n",
"
\n",
" \n",
" 2 | \n",
" GoldenEye (1995) | \n",
"
\n",
" \n",
" 3 | \n",
" Four Rooms (1995) | \n",
"
\n",
" \n",
" 4 | \n",
" Get Shorty (1995) | \n",
"
\n",
" \n",
" 5 | \n",
" Copycat (1995) | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" title\n",
"m_id \n",
"1 Toy Story (1995)\n",
"2 GoldenEye (1995)\n",
"3 Four Rooms (1995)\n",
"4 Get Shorty (1995)\n",
"5 Copycat (1995)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies = movies.set_index('m_id')\n",
"movies.head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" title | \n",
" user_id | \n",
" movie_id | \n",
" rating | \n",
" timestamp | \n",
"
\n",
" \n",
" \n",
" \n",
" 24 | \n",
" Toy Story (1995) | \n",
" 308 | \n",
" 1 | \n",
" 4 | \n",
" 887736532 | \n",
"
\n",
" \n",
" 454 | \n",
" Toy Story (1995) | \n",
" 287 | \n",
" 1 | \n",
" 5 | \n",
" 875334088 | \n",
"
\n",
" \n",
" 957 | \n",
" Toy Story (1995) | \n",
" 148 | \n",
" 1 | \n",
" 4 | \n",
" 877019411 | \n",
"
\n",
" \n",
" 971 | \n",
" Toy Story (1995) | \n",
" 280 | \n",
" 1 | \n",
" 4 | \n",
" 891700426 | \n",
"
\n",
" \n",
" 1324 | \n",
" Toy Story (1995) | \n",
" 66 | \n",
" 1 | \n",
" 3 | \n",
" 883601324 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" title user_id movie_id rating timestamp\n",
"24 Toy Story (1995) 308 1 4 887736532\n",
"454 Toy Story (1995) 287 1 5 875334088\n",
"957 Toy Story (1995) 148 1 4 877019411\n",
"971 Toy Story (1995) 280 1 4 891700426\n",
"1324 Toy Story (1995) 66 1 3 883601324"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(movies, ratings, left_index=True, right_on='movie_id').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What if you want to join on two indexes?"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" rating | \n",
" timestamp | \n",
"
\n",
" \n",
" movie_id | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 242 | \n",
" 196 | \n",
" 3 | \n",
" 881250949 | \n",
"
\n",
" \n",
" 302 | \n",
" 186 | \n",
" 3 | \n",
" 891717742 | \n",
"
\n",
" \n",
" 377 | \n",
" 22 | \n",
" 1 | \n",
" 878887116 | \n",
"
\n",
" \n",
" 51 | \n",
" 244 | \n",
" 2 | \n",
" 880606923 | \n",
"
\n",
" \n",
" 346 | \n",
" 166 | \n",
" 1 | \n",
" 886397596 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id rating timestamp\n",
"movie_id \n",
"242 196 3 881250949\n",
"302 186 3 891717742\n",
"377 22 1 878887116\n",
"51 244 2 880606923\n",
"346 166 1 886397596"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings = ratings.set_index('movie_id')\n",
"ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" title | \n",
" user_id | \n",
" rating | \n",
" timestamp | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 308 | \n",
" 4 | \n",
" 887736532 | \n",
"
\n",
" \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 287 | \n",
" 5 | \n",
" 875334088 | \n",
"
\n",
" \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 148 | \n",
" 4 | \n",
" 877019411 | \n",
"
\n",
" \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 280 | \n",
" 4 | \n",
" 891700426 | \n",
"
\n",
" \n",
" 1 | \n",
" Toy Story (1995) | \n",
" 66 | \n",
" 3 | \n",
" 883601324 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" title user_id rating timestamp\n",
"1 Toy Story (1995) 308 4 887736532\n",
"1 Toy Story (1995) 287 5 875334088\n",
"1 Toy Story (1995) 148 4 877019411\n",
"1 Toy Story (1995) 280 4 891700426\n",
"1 Toy Story (1995) 66 3 883601324"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(movies, ratings, left_index=True, right_index=True).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 4: Four Types of Joins"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are actually four types of joins supported by the Pandas `merge` function. Here's how they are described by the documentation:\n",
"\n",
"- **inner:** use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys\n",
"- **outer:** use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically\n",
"- **left:** use only keys from left frame, similar to a SQL left outer join; preserve key order\n",
"- **right:** use only keys from right frame, similar to a SQL right outer join; preserve key order\n",
"\n",
"The default is the \"inner join\", which was used when creating the movie_ratings DataFrame.\n",
"\n",
"It's easiest to understand the different types by looking at some simple examples:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example DataFrames A and B"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" num | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" green | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" yellow | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" red | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" color num\n",
"0 green 1\n",
"1 yellow 2\n",
"2 red 3"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"A = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num':[1, 2, 3]})\n",
"A"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" green | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" yellow | \n",
" M | \n",
"
\n",
" \n",
" 2 | \n",
" pink | \n",
" L | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" color size\n",
"0 green S\n",
"1 yellow M\n",
"2 pink L"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"B = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size':['S', 'M', 'L']})\n",
"B"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inner join\n",
"\n",
"Only include observations found in both A and B:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" num | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" green | \n",
" 1 | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" yellow | \n",
" 2 | \n",
" M | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" color num size\n",
"0 green 1 S\n",
"1 yellow 2 M"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(A, B, how='inner')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Outer join\n",
"\n",
"Include observations found in either A or B:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" num | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" green | \n",
" 1.0 | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" yellow | \n",
" 2.0 | \n",
" M | \n",
"
\n",
" \n",
" 2 | \n",
" red | \n",
" 3.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" pink | \n",
" NaN | \n",
" L | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" color num size\n",
"0 green 1.0 S\n",
"1 yellow 2.0 M\n",
"2 red 3.0 NaN\n",
"3 pink NaN L"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(A, B, how='outer')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Left join\n",
"\n",
"Include all observations found in A:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" num | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" green | \n",
" 1 | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" yellow | \n",
" 2 | \n",
" M | \n",
"
\n",
" \n",
" 2 | \n",
" red | \n",
" 3 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" color num size\n",
"0 green 1 S\n",
"1 yellow 2 M\n",
"2 red 3 NaN"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(A, B, how='left')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Right join\n",
"\n",
"Include all observations found in B:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" num | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" green | \n",
" 1.0 | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" yellow | \n",
" 2.0 | \n",
" M | \n",
"
\n",
" \n",
" 2 | \n",
" pink | \n",
" NaN | \n",
" L | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" color num size\n",
"0 green 1.0 S\n",
"1 yellow 2.0 M\n",
"2 pink NaN L"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(A, B, how='right')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
}