{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Joining Data\n", "\n", "> It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in.\n", ">\n", "> \\- Garrett Grolemund, Master Instructor, RStudio" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Applied Review" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Functions vs. Methods" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* There are two types of operations in Python: **functions** and **methods**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* **Functions** are standalone operations from a module -- `print()` is a function:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello\n" ] } ], "source": [ "print(\"Hello\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "* **Methods** are operations that are encapsulated within Python objects -- `DataFrame.head()` is a method:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
0201311517.05152.0830.081911.0UA1545N14228EWRIAH227.014005152013-01-01 05:00:00
1201311533.05294.0850.083020.0UA1714N24211LGAIAH227.014165292013-01-01 05:00:00
2201311542.05402.0923.085033.0AA1141N619AAJFKMIA160.010895402013-01-01 05:00:00
3201311544.0545-1.01004.01022-18.0B6725N804JBJFKBQN183.015765452013-01-01 05:00:00
4201311554.0600-6.0812.0837-25.0DL461N668DNLGAATL116.0762602013-01-01 06:00:00
\n", "
" ], "text/plain": [ " year month day dep_time sched_dep_time dep_delay arr_time \\\n", "0 2013 1 1 517.0 515 2.0 830.0 \n", "1 2013 1 1 533.0 529 4.0 850.0 \n", "2 2013 1 1 542.0 540 2.0 923.0 \n", "3 2013 1 1 544.0 545 -1.0 1004.0 \n", "4 2013 1 1 554.0 600 -6.0 812.0 \n", "\n", " sched_arr_time arr_delay carrier flight tailnum origin dest air_time \\\n", "0 819 11.0 UA 1545 N14228 EWR IAH 227.0 \n", "1 830 20.0 UA 1714 N24211 LGA IAH 227.0 \n", "2 850 33.0 AA 1141 N619AA JFK MIA 160.0 \n", "3 1022 -18.0 B6 725 N804JB JFK BQN 183.0 \n", "4 837 -25.0 DL 461 N668DN LGA ATL 116.0 \n", "\n", " distance hour minute time_hour \n", "0 1400 5 15 2013-01-01 05:00:00 \n", "1 1416 5 29 2013-01-01 05:00:00 \n", "2 1089 5 40 2013-01-01 05:00:00 \n", "3 1576 5 45 2013-01-01 05:00:00 \n", "4 762 6 0 2013-01-01 06:00:00 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "flights_df = pd.read_csv('../data/flights.csv')\n", "flights_df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### DataFrame Structure" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* Each DataFrame variable is a **Series** and can be accessed with bracket subsetting notation: `DataFrame['SeriesName']`" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* The DataFrame has an **Index** that is visible on the far left side" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## General Model" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Combining Data" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* We frequently want to use more than one table at once, so we need to combine them in some way" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* Because tables are two-dimensional, we 
can combine them **vertically** and **horizontally**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Combining data **vertically** is known as **appending**/**unioning**/**concatenating**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Combining data **horizontally** is known as **joining**/**merging**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Appending Data Vertically" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "center", "tags": [] }, "source": [ "* When we combine data **vertically**, we are stacking tables on top of one another:\n", "\n", "
\n", "\"combine-vertically.png\"\n", "
" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "center", "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "

Note

\n", "

This is particularly useful when all columns are the same between the two tables.

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Joining Data Horizontally" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* When we combine data **horizontally**, we are attaching the tables at their sides:" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "![combine-horizontally.png](images/combine-horizontally.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "* Note that the rows do not need to be in the same order to join/merge two tables:" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "![combine-horizontally-unordered.png](images/combine-horizontally-unordered.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "* The joining occurs by matching on a **key column**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "![combine-horizontally-key.png](images/combine-horizontally-key.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Combining DataFrames" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Appending DataFrames" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* When we combine DataFrames vertically, we want to stack two DataFrames on top of one another" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Let's start by creating two DataFrames with the same variables:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "cell_style": "split", "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xy
01a
12b
\n", "
" ], "text/plain": [ " x y\n", "0 1 a\n", "1 2 b" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_1 = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})\n", "df_1" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "cell_style": "split", "scrolled": true, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xy
03c
14d
\n", "
" ], "text/plain": [ " x y\n", "0 3 c\n", "1 4 d" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2 = pd.DataFrame({'x': [3, 4], 'y': ['c', 'd']})\n", "df_2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can stack `df_1` and `df_2` on top of one another using the `concat()` function from `pandas` with a list:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xy
01a
12b
03c
14d
\n", "
" ], "text/plain": [ " x y\n", "0 1 a\n", "1 2 b\n", "0 3 c\n", "1 4 d" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df_1, df_2])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "

Question

\n", "

Does anything about this result seem weird?

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The **Index** is repeating..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can add the `ignore_index = True` to make the Index reset:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xy
01a
12b
23c
34d
\n", "
" ], "text/plain": [ " x y\n", "0 1 a\n", "1 2 b\n", "2 3 c\n", "3 4 d" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df_1, df_2], ignore_index = True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can also use the `DataFrame.reset_index()` method:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xy
01a
12b
23c
34d
\n", "
" ], "text/plain": [ " x y\n", "0 1 a\n", "1 2 b\n", "2 3 c\n", "3 4 d" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df_1, df_2]).reset_index(drop = True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "
\n", "

Caution!

\n", "

pd.concat() should only be used to vertically combine DataFrames when we know that their schemas are consistent.

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Joining DataFrames" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Joining DataFrames may be one of the most important skills to learn in Python" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "* As a reminder, joining DataFrames is the horizontal combining of two DataFrames on some **key column**:\n", "\n", "![combine-horizontally-key.png](images/combine-horizontally-key.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "* We have `flights_df`, but we need another DataFrame to join to `flights_df` that has a common **key column**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* As an example, assume we want to know which airline carried each flight in `flights_df`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "airlines_df = pd.read_csv('../data/airlines.csv')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "* Now let's examine the columns/variables in our two DataFrames using the `DataFrame.columns` attribute:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "Index(['year', 'month', 'day', 'dep_time', 'sched_dep_time', 'dep_delay',\n", " 'arr_time', 'sched_arr_time', 'arr_delay', 'carrier', 'flight',\n", " 'tailnum', 'origin', 'dest', 'air_time', 'distance', 'hour', 'minute',\n", " 'time_hour'],\n", " dtype='object')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "flights_df.columns" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { 
"slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "Index(['carrier', 'name'], dtype='object')" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airlines_df.columns" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "

Question

\n", "

Which column should be our key column?

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "The `carrier` column is our key because it's in both DataFrames.\n", "\n", "
\n", "

Tip!

\n", "

There is an intersection() method on the columns Index that makes it easy to find the common columns between two DataFrames.

\n", "
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['carrier'], dtype='object')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "flights_df.columns.intersection(airlines_df.columns)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can join/merge the DataFrames together using the `merge()` function:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hourname
0201311517.05152.0830.081911.0UA1545N14228EWRIAH227.014005152013-01-01 05:00:00United Air Lines Inc.
1201311533.05294.0850.083020.0UA1714N24211LGAIAH227.014165292013-01-01 05:00:00United Air Lines Inc.
\n", "
" ], "text/plain": [ " year month day dep_time sched_dep_time dep_delay arr_time \\\n", "0 2013 1 1 517.0 515 2.0 830.0 \n", "1 2013 1 1 533.0 529 4.0 850.0 \n", "\n", " sched_arr_time arr_delay carrier flight tailnum origin dest air_time \\\n", "0 819 11.0 UA 1545 N14228 EWR IAH 227.0 \n", "1 830 20.0 UA 1714 N24211 LGA IAH 227.0 \n", "\n", " distance hour minute time_hour name \n", "0 1400 5 15 2013-01-01 05:00:00 United Air Lines Inc. \n", "1 1416 5 29 2013-01-01 05:00:00 United Air Lines Inc. " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(flights_df, airlines_df, on = 'carrier').head(2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* This joined `flights_df` and `airlines_df` together to attach the `name` from `airlines_df` to each flight" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Your Turn\n", "\n", "1\\. Import `planes.csv` into a DataFrame named `planes_df`\n", "\n", "2\\. Identify the common column names between the two DataFrames.\n", "\n", "3\\. Fill in the blanks below to join the `flights_df` to `planes_df`:\n", "\n", "```python\n", "pd._____(flights_df, planes_df, on = '_____')\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Join Types" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Inner Joins" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "center", "slideshow": { "slide_type": "fragment" } }, "source": [ "All of our joins have been **inner joins**:\n", "\n", "
\n", "\"inner-join.png\"\n", "
" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "center", "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "

Note

\n", "

Inner joins only keep rows where the key is in both tables.

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Left Joins" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "center", "slideshow": { "slide_type": "fragment" } }, "source": [ "Sometimes we only want to include data that is **in the left table** regardless of whether it's in the right table:\n", "\n", "
\n", "\"left-outer-join.png\"\n", "
" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "center", "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "

Note

\n", "

Left outer joins, or simply left joins, keep rows where the key is in the left table.

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Right Joins" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "center", "slideshow": { "slide_type": "fragment" } }, "source": [ "Sometimes we only want to include data that is **in the right table** regardless of whether it's in the left table:\n", "\n", "
\n", "\"right-outer-join.png\"\n", "
" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "center", "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "

Note

\n", "

Right outer joins, or simply right joins, keep rows where the key is in the right table.

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Outer Joins" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "center", "slideshow": { "slide_type": "fragment" } }, "source": [ "Sometimes we want to include **all rows** in either the left table or the right table:\n", "\n", "
\n", "\"full-outer-join.png\"\n", "
" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "center", "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "

Note

\n", "

Full outer joins, or simply outer joins, keep all rows from both tables.

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Applying Different Join Types" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can apply these different join types using the `how` parameter of the `merge()` function:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hourname
0201311517.05152.0830.081911.0UA1545N14228EWRIAH227.014005152013-01-01 05:00:00United Air Lines Inc.
1201311533.05294.0850.083020.0UA1714N24211LGAIAH227.014165292013-01-01 05:00:00United Air Lines Inc.
2201311554.0558-4.0740.072812.0UA1696N39463EWRORD150.07195582013-01-01 05:00:00United Air Lines Inc.
\n", "
" ], "text/plain": [ " year month day dep_time sched_dep_time dep_delay arr_time \\\n", "0 2013 1 1 517.0 515 2.0 830.0 \n", "1 2013 1 1 533.0 529 4.0 850.0 \n", "2 2013 1 1 554.0 558 -4.0 740.0 \n", "\n", " sched_arr_time arr_delay carrier flight tailnum origin dest air_time \\\n", "0 819 11.0 UA 1545 N14228 EWR IAH 227.0 \n", "1 830 20.0 UA 1714 N24211 LGA IAH 227.0 \n", "2 728 12.0 UA 1696 N39463 EWR ORD 150.0 \n", "\n", " distance hour minute time_hour name \n", "0 1400 5 15 2013-01-01 05:00:00 United Air Lines Inc. \n", "1 1416 5 29 2013-01-01 05:00:00 United Air Lines Inc. \n", "2 719 5 58 2013-01-01 05:00:00 United Air Lines Inc. " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(flights_df, airlines_df, on = 'carrier', how = 'inner').head(3)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "While `how = 'inner'` is the default, we can also use `'left'`, `'right'`, and `'outer'`:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hourname
0201311517.05152.0830.081911.0UA1545N14228EWRIAH227.014005152013-01-01 05:00:00United Air Lines Inc.
1201311533.05294.0850.083020.0UA1714N24211LGAIAH227.014165292013-01-01 05:00:00United Air Lines Inc.
2201311554.0558-4.0740.072812.0UA1696N39463EWRORD150.07195582013-01-01 05:00:00United Air Lines Inc.
\n", "
" ], "text/plain": [ " year month day dep_time sched_dep_time dep_delay arr_time \\\n", "0 2013 1 1 517.0 515 2.0 830.0 \n", "1 2013 1 1 533.0 529 4.0 850.0 \n", "2 2013 1 1 554.0 558 -4.0 740.0 \n", "\n", " sched_arr_time arr_delay carrier flight tailnum origin dest air_time \\\n", "0 819 11.0 UA 1545 N14228 EWR IAH 227.0 \n", "1 830 20.0 UA 1714 N24211 LGA IAH 227.0 \n", "2 728 12.0 UA 1696 N39463 EWR ORD 150.0 \n", "\n", " distance hour minute time_hour name \n", "0 1400 5 15 2013-01-01 05:00:00 United Air Lines Inc. \n", "1 1416 5 29 2013-01-01 05:00:00 United Air Lines Inc. \n", "2 719 5 58 2013-01-01 05:00:00 United Air Lines Inc. " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(flights_df, airlines_df, on = 'carrier', how = 'outer').head(3)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Your Turn\n", "\n", "1. What type of join includes the rows where the key is in the left table regardless of whether the key is in the right table?\n", "\n", "2. Join the `flights_df` to `planes_df` and keep all rows from both tables." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Questions\n", "\n", "Are there questions before we move on?" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "rise": { "autolaunch": true, "transition": "none" } }, "nbformat": 4, "nbformat_minor": 4 }