{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Joining Data\n", "\n", "> It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in.\n", ">\n", "> \\- Garrett Grolemund, Master Instructor, RStudio" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Applied Review" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Functions vs. Methods" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* There are two types of operations in Python: **functions** and **methods**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* **Functions** are standalone operations from a module -- `print()` is a function:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello\n" ] } ], "source": [ "print(\"Hello\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "* **Methods** are operations that are encapsulated within Python objects -- `DataFrame.head()` is a method:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", " | year | \n", "month | \n", "day | \n", "dep_time | \n", "sched_dep_time | \n", "dep_delay | \n", "arr_time | \n", "sched_arr_time | \n", "arr_delay | \n", "carrier | \n", "flight | \n", "tailnum | \n", "origin | \n", "dest | \n", "air_time | \n", "distance | \n", "hour | \n", "minute | \n", "time_hour | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "2013 | \n", "1 | \n", "1 | \n", "517.0 | \n", "515 | \n", "2.0 | \n", "830.0 | \n", "819 | \n", "11.0 | \n", "UA | \n", "1545 | \n", "N14228 | \n", "EWR | \n", "IAH | \n", "227.0 | \n", "1400 | \n", "5 | \n", "15 | \n", "2013-01-01 05:00:00 | \n", "
1 | \n", "2013 | \n", "1 | \n", "1 | \n", "533.0 | \n", "529 | \n", "4.0 | \n", "850.0 | \n", "830 | \n", "20.0 | \n", "UA | \n", "1714 | \n", "N24211 | \n", "LGA | \n", "IAH | \n", "227.0 | \n", "1416 | \n", "5 | \n", "29 | \n", "2013-01-01 05:00:00 | \n", "
2 | \n", "2013 | \n", "1 | \n", "1 | \n", "542.0 | \n", "540 | \n", "2.0 | \n", "923.0 | \n", "850 | \n", "33.0 | \n", "AA | \n", "1141 | \n", "N619AA | \n", "JFK | \n", "MIA | \n", "160.0 | \n", "1089 | \n", "5 | \n", "40 | \n", "2013-01-01 05:00:00 | \n", "
3 | \n", "2013 | \n", "1 | \n", "1 | \n", "544.0 | \n", "545 | \n", "-1.0 | \n", "1004.0 | \n", "1022 | \n", "-18.0 | \n", "B6 | \n", "725 | \n", "N804JB | \n", "JFK | \n", "BQN | \n", "183.0 | \n", "1576 | \n", "5 | \n", "45 | \n", "2013-01-01 05:00:00 | \n", "
4 | \n", "2013 | \n", "1 | \n", "1 | \n", "554.0 | \n", "600 | \n", "-6.0 | \n", "812.0 | \n", "837 | \n", "-25.0 | \n", "DL | \n", "461 | \n", "N668DN | \n", "LGA | \n", "ATL | \n", "116.0 | \n", "762 | \n", "6 | \n", "0 | \n", "2013-01-01 06:00:00 | \n", "
Note
\n", "This is particularly useful when all columns are the same between the two tables.
\n", "\n", " | x | \n", "y | \n", "
---|---|---|
0 | \n", "1 | \n", "a | \n", "
1 | \n", "2 | \n", "b | \n", "
\n", " | x | \n", "y | \n", "
---|---|---|
0 | \n", "3 | \n", "c | \n", "
1 | \n", "4 | \n", "d | \n", "
\n", " | x | \n", "y | \n", "
---|---|---|
0 | \n", "1 | \n", "a | \n", "
1 | \n", "2 | \n", "b | \n", "
0 | \n", "3 | \n", "c | \n", "
1 | \n", "4 | \n", "d | \n", "
Question
\n", "Does anything about this result seem weird?
\n", "\n", " | x | \n", "y | \n", "
---|---|---|
0 | \n", "1 | \n", "a | \n", "
1 | \n", "2 | \n", "b | \n", "
2 | \n", "3 | \n", "c | \n", "
3 | \n", "4 | \n", "d | \n", "
\n", " | x | \n", "y | \n", "
---|---|---|
0 | \n", "1 | \n", "a | \n", "
1 | \n", "2 | \n", "b | \n", "
2 | \n", "3 | \n", "c | \n", "
3 | \n", "4 | \n", "d | \n", "
Caution!
\n", "Using pd.concat() to vertically combine dataframes should only be used when we know that the DataFrames' schemas are consistent.
\n", "Question
\n", "Which column should be our key column?
\n", "Tip!
\n", "There is an intersection() method that makes it easy to find common columns between two DataFrames.
\n", "\n", " | year | \n", "month | \n", "day | \n", "dep_time | \n", "sched_dep_time | \n", "dep_delay | \n", "arr_time | \n", "sched_arr_time | \n", "arr_delay | \n", "carrier | \n", "flight | \n", "tailnum | \n", "origin | \n", "dest | \n", "air_time | \n", "distance | \n", "hour | \n", "minute | \n", "time_hour | \n", "name | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "2013 | \n", "1 | \n", "1 | \n", "517.0 | \n", "515 | \n", "2.0 | \n", "830.0 | \n", "819 | \n", "11.0 | \n", "UA | \n", "1545 | \n", "N14228 | \n", "EWR | \n", "IAH | \n", "227.0 | \n", "1400 | \n", "5 | \n", "15 | \n", "2013-01-01 05:00:00 | \n", "United Air Lines Inc. | \n", "
1 | \n", "2013 | \n", "1 | \n", "1 | \n", "533.0 | \n", "529 | \n", "4.0 | \n", "850.0 | \n", "830 | \n", "20.0 | \n", "UA | \n", "1714 | \n", "N24211 | \n", "LGA | \n", "IAH | \n", "227.0 | \n", "1416 | \n", "5 | \n", "29 | \n", "2013-01-01 05:00:00 | \n", "United Air Lines Inc. | \n", "
Note
\n", "Inner joins only keep rows where the key is in both tables.
\n", "Note
\n", "Left outer joins, or simply left joins, keep rows where the key is in the left table.
\n", "Note
\n", "Right outer joins, or simply right joins, keep rows where the key is in the right table.
\n", "Note
\n", "Full outer joins keep all rows in both tables.
\n", "\n", " | year | \n", "month | \n", "day | \n", "dep_time | \n", "sched_dep_time | \n", "dep_delay | \n", "arr_time | \n", "sched_arr_time | \n", "arr_delay | \n", "carrier | \n", "flight | \n", "tailnum | \n", "origin | \n", "dest | \n", "air_time | \n", "distance | \n", "hour | \n", "minute | \n", "time_hour | \n", "name | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "2013 | \n", "1 | \n", "1 | \n", "517.0 | \n", "515 | \n", "2.0 | \n", "830.0 | \n", "819 | \n", "11.0 | \n", "UA | \n", "1545 | \n", "N14228 | \n", "EWR | \n", "IAH | \n", "227.0 | \n", "1400 | \n", "5 | \n", "15 | \n", "2013-01-01 05:00:00 | \n", "United Air Lines Inc. | \n", "
1 | \n", "2013 | \n", "1 | \n", "1 | \n", "533.0 | \n", "529 | \n", "4.0 | \n", "850.0 | \n", "830 | \n", "20.0 | \n", "UA | \n", "1714 | \n", "N24211 | \n", "LGA | \n", "IAH | \n", "227.0 | \n", "1416 | \n", "5 | \n", "29 | \n", "2013-01-01 05:00:00 | \n", "United Air Lines Inc. | \n", "
2 | \n", "2013 | \n", "1 | \n", "1 | \n", "554.0 | \n", "558 | \n", "-4.0 | \n", "740.0 | \n", "728 | \n", "12.0 | \n", "UA | \n", "1696 | \n", "N39463 | \n", "EWR | \n", "ORD | \n", "150.0 | \n", "719 | \n", "5 | \n", "58 | \n", "2013-01-01 05:00:00 | \n", "United Air Lines Inc. | \n", "
\n", " | year | \n", "month | \n", "day | \n", "dep_time | \n", "sched_dep_time | \n", "dep_delay | \n", "arr_time | \n", "sched_arr_time | \n", "arr_delay | \n", "carrier | \n", "flight | \n", "tailnum | \n", "origin | \n", "dest | \n", "air_time | \n", "distance | \n", "hour | \n", "minute | \n", "time_hour | \n", "name | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "2013 | \n", "1 | \n", "1 | \n", "517.0 | \n", "515 | \n", "2.0 | \n", "830.0 | \n", "819 | \n", "11.0 | \n", "UA | \n", "1545 | \n", "N14228 | \n", "EWR | \n", "IAH | \n", "227.0 | \n", "1400 | \n", "5 | \n", "15 | \n", "2013-01-01 05:00:00 | \n", "United Air Lines Inc. | \n", "
1 | \n", "2013 | \n", "1 | \n", "1 | \n", "533.0 | \n", "529 | \n", "4.0 | \n", "850.0 | \n", "830 | \n", "20.0 | \n", "UA | \n", "1714 | \n", "N24211 | \n", "LGA | \n", "IAH | \n", "227.0 | \n", "1416 | \n", "5 | \n", "29 | \n", "2013-01-01 05:00:00 | \n", "United Air Lines Inc. | \n", "
2 | \n", "2013 | \n", "1 | \n", "1 | \n", "554.0 | \n", "558 | \n", "-4.0 | \n", "740.0 | \n", "728 | \n", "12.0 | \n", "UA | \n", "1696 | \n", "N39463 | \n", "EWR | \n", "ORD | \n", "150.0 | \n", "719 | \n", "5 | \n", "58 | \n", "2013-01-01 05:00:00 | \n", "United Air Lines Inc. | \n", "