{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Selecting and Filtering" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "> “Every second of every day, our senses bring in way too much data than we can possibly process in our brains.”\n", ">\n", "> \\- Peter Diamandis, Founder of the X-Prize for human-AI collaboration" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Applied Review" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Data Structures and the DataFrame Class" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* Data is frequently represented inside a **DataFrame** - a class from the `pandas` library that is similar to a *table* or *dataset*" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* Each DataFrame object has rows and columns" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* The DataFrame class has methods (built-in operations) for common tasks and attributes (stored data) of common information" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* Other structures exist, too - lists, dicts, tuples, etc." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Importing Data" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* Tabular data can be imported into DataFrames using the `pd.read_csv()` function - there are parameters for different options" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "planes_df = pd.read_csv('../data/planes.csv')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Other data formats like JSON (key-value pairs) and Pickle (native Python) can be imported using the `with` statement and respective functions:\n", " * JSON files use the `load()` function from the `json` library\n", " * Pickle files use the `load()` function from the `pickle` library" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## General Model" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Subsetting Dimensions" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* We don't always want all of the data in a DataFrame, so we need to take subsets of the DataFrame." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* In general, **subsetting** is extracting a small portion of a DataFrame -- making the DataFrame smaller." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* Since the DataFrame is two-dimensional, there are two dimensions on which to subset." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Dimension 1:** We may only want to consider certain *variables*." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "For example, we may only care about the `year` and `engines` variables:\n", "\n", "![](./images/selecting_columns.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" }, "tags": [] }, "source": [ "We call this **selecting** columns/variables -- this is similar to SQL's `SELECT` or R's dplyr package's `select()`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Dimension 2:** We may only want to consider certain *cases*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "For example, we may only care about the cases where the manufacturer is Embraer.\n", "\n", "![](./images/selecting_rows.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We call this **filtering** or **slicing** -- this is similar to SQL's `WHERE` or R's dplyr package's `filter()` or `slice()`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "And we can combine these two options to subset in both dimensions -- the `year` and `engines` variables where the manufacturer is Embraer:\n", "\n", "![](./images/selecting_rows_columns.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Subsetting into a New DataFrame" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In the previous example, we want to do two things using `planes_df`:\n", "\n", " 1. **select** the `year` and `engines` variables\n", " 2. **filter** to cases where the manufacturer is Embraer" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "But we also want to return a new DataFrame -- not just highlight certain cells." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "In other words, we want to turn this:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "-" }, "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tailnumyeartypemanufacturermodelenginesseatsspeedengine
0N101562004.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
1N102UW1998.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
2N103US1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
3N104UW1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
4N105752002.0Fixed wing multi engineEMBRAEREMB-145LR255NaNTurbo-fan
\n", "
" ], "text/plain": [ " tailnum year type manufacturer model \\\n", "0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n", "1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n", "\n", " engines seats speed engine \n", "0 2 55 NaN Turbo-fan \n", "1 2 182 NaN Turbo-fan \n", "2 2 182 NaN Turbo-fan \n", "3 2 182 NaN Turbo-fan \n", "4 2 55 NaN Turbo-fan " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes_df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Into this:\n", "\n", "![](./images/subsetting_result.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "So we really have a third need: return the resulting DataFrame so we can continue our analysis:\n", "\n", " 1. **select** the `year` and `engines` variables\n", " 2. **filter** to cases where the manufacturer is Embraer\n", " 3. Return a DataFrame to continue the analysis" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Subsetting Variables" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Recall that the subsetting of variables/columns is called **selecting** variables/columns." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In a simple example, we can select a single variable using bracket subsetting notation:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "0 2004.0\n", "1 1998.0\n", "2 1999.0\n", "3 1999.0\n", "4 2002.0\n", "Name: year, dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes_df['year'].head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Notice the `head()` method also works on `planes_df['year']` to return the first five elements." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "

Question

\n", "

What is the data type of planes_df['year']?

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This returns `pandas.core.series.Series`, referred to simply as a \"Series\", rather than a DataFrame." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(planes_df['year'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This is okay -- the Series is a popular data structure in Python. Recall:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* A Series is a one-dimensional data structure -- this is similar to a Python `list`" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* Note that all objects in a Series must be of the same type" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* Each DataFrame can be thought of as a list of equal-length Series (plus an Index)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This visual of a Series and DataFrame may be helpful:\n", "\n", "
\n", "\"dataframe-series.png\"\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Series can be useful for other things, and we will cover that later -- but for now, we are interested in *returning a DataFrame* rather than a series." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can select a single variable and return a DataFrame by still using bracket subsetting notation, but this time we will pass a `list` of variables names:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
year
02004.0
11998.0
21999.0
31999.0
42002.0
\n", "
" ], "text/plain": [ " year\n", "0 2004.0\n", "1 1998.0\n", "2 1999.0\n", "3 1999.0\n", "4 2002.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes_df[['year']].head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "And we can see that we've returned a DataFrame:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(planes_df[['year']].head())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "

Question

\n", "

What's another advantage of this passing a list?

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Passing a list into the bracket subsetting notation allows us to select multiple variables at once:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearengines
02004.02
11998.02
21999.02
31999.02
42002.02
\n", "
" ], "text/plain": [ " year engines\n", "0 2004.0 2\n", "1 1998.0 2\n", "2 1999.0 2\n", "3 1999.0 2\n", "4 2002.0 2" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes_df[['year', 'engines']].head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "In another example, assume we are interested in the `model` of plane, number of `seats` and `engine` type:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
modelseatsengine
0EMB-145XR55Turbo-fan
1A320-214182Turbo-fan
2A320-214182Turbo-fan
3A320-214182Turbo-fan
4EMB-145LR55Turbo-fan
\n", "
" ], "text/plain": [ " model seats engine\n", "0 EMB-145XR 55 Turbo-fan\n", "1 A320-214 182 Turbo-fan\n", "2 A320-214 182 Turbo-fan\n", "3 A320-214 182 Turbo-fan\n", "4 EMB-145LR 55 Turbo-fan" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes_df[['model', 'seats', 'engine']].head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Your Turn\n", "\n", "1\\. ______ is a common term for subsetting DataFrame variables.\n", "\n", "2\\. What type of object is a DataFrame column?\n", "\n", "3\\. What will be returned by the following code?\n", "\n", " ```python\n", " planes_df['type', 'model']\n", " ```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Subsetting Cases" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "When we subset cases/records/rows we primarily use two names: **slicing** and **filtering**, but *these are not the same*:\n", "\n", " * **slicing**, similar to row **indexing**, subsets cases by the value of the Index\n", " * **filtering** subsets cases using a conditional test" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Slicing Cases" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Remember that all DataFrames have an Index:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tailnumyeartypemanufacturermodelenginesseatsspeedengine
0N101562004.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
1N102UW1998.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
2N103US1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
3N104UW1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
4N105752002.0Fixed wing multi engineEMBRAEREMB-145LR255NaNTurbo-fan
\n", "
" ], "text/plain": [ " tailnum year type manufacturer model \\\n", "0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n", "1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n", "\n", " engines seats speed engine \n", "0 2 55 NaN Turbo-fan \n", "1 2 182 NaN Turbo-fan \n", "2 2 182 NaN Turbo-fan \n", "3 2 182 NaN Turbo-fan \n", "4 2 55 NaN Turbo-fan " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes_df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can **slice** cases/rows using the values in the Index and bracket subsetting notation. It's common practice to use `.loc` to slice cases/rows:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tailnumyeartypemanufacturermodelenginesseatsspeedengine
0N101562004.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
1N102UW1998.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
2N103US1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
3N104UW1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
4N105752002.0Fixed wing multi engineEMBRAEREMB-145LR255NaNTurbo-fan
5N105UW1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
\n", "
" ], "text/plain": [ " tailnum year type manufacturer model \\\n", "0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n", "1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n", "5 N105UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "\n", " engines seats speed engine \n", "0 2 55 NaN Turbo-fan \n", "1 2 182 NaN Turbo-fan \n", "2 2 182 NaN Turbo-fan \n", "3 2 182 NaN Turbo-fan \n", "4 2 55 NaN Turbo-fan \n", "5 2 182 NaN Turbo-fan " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes_df.loc[0:5]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "

Note

\n", "

Note that since this is not \"indexing\", the last element is inclusive.

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can also pass a `list` of Index values:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tailnumyeartypemanufacturermodelenginesseatsspeedengine
0N101562004.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
2N103US1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
4N105752002.0Fixed wing multi engineEMBRAEREMB-145LR255NaNTurbo-fan
6N107US1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
8N109UW1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
\n", "
" ], "text/plain": [ " tailnum year type manufacturer model \\\n", "0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n", "2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n", "6 N107US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "8 N109UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "\n", " engines seats speed engine \n", "0 2 55 NaN Turbo-fan \n", "2 2 182 NaN Turbo-fan \n", "4 2 55 NaN Turbo-fan \n", "6 2 182 NaN Turbo-fan \n", "8 2 182 NaN Turbo-fan " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes_df.loc[[0, 2, 4, 6, 8]]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Filtering Cases" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can **filter** cases/rows using a logical sequence equal in length to the number of rows in the DataFrame." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Continuing our example, assume we want to determine whether each case's `manufacturer` is Embraer. We can use the `manufacturer` Series and a logical equivalency test to find the result for each row:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 False\n", "3 False\n", "4 True\n", " ... \n", "3317 False\n", "3318 False\n", "3319 False\n", "3320 False\n", "3321 False\n", "Name: manufacturer, Length: 3322, dtype: bool" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes_df['manufacturer'] == 'EMBRAER'" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can use this resulting logical sequence to test **filter** cases -- rows that are `True` will be returned while those that are `False` will be removed:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tailnumyeartypemanufacturermodelenginesseatsspeedengine
0N101562004.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
4N105752002.0Fixed wing multi engineEMBRAEREMB-145LR255NaNTurbo-fan
10N111062002.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
11N111072002.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
12N111092002.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
\n", "
" ], "text/plain": [ " tailnum year type manufacturer model engines \\\n", "0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR 2 \n", "10 N11106 2002.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "11 N11107 2002.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "12 N11109 2002.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "\n", " seats speed engine \n", "0 55 NaN Turbo-fan \n", "4 55 NaN Turbo-fan \n", "10 55 NaN Turbo-fan \n", "11 55 NaN Turbo-fan \n", "12 55 NaN Turbo-fan " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes_df[planes_df['manufacturer'] == 'EMBRAER'].head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This also works with `.loc`:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tailnumyeartypemanufacturermodelenginesseatsspeedengine
0N101562004.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
4N105752002.0Fixed wing multi engineEMBRAEREMB-145LR255NaNTurbo-fan
10N111062002.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
11N111072002.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
12N111092002.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
\n", "
" ], "text/plain": [ " tailnum year type manufacturer model engines \\\n", "0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR 2 \n", "10 N11106 2002.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "11 N11107 2002.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "12 N11109 2002.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "\n", " seats speed engine \n", "0 55 NaN Turbo-fan \n", "4 55 NaN Turbo-fan \n", "10 55 NaN Turbo-fan \n", "11 55 NaN Turbo-fan \n", "12 55 NaN Turbo-fan " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes_df.loc[planes_df['manufacturer'] == 'EMBRAER'].head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Any conditional test can be used to **filter** DataFrame rows:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tailnumyeartypemanufacturermodelenginesseatsspeedengine
0N101562004.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
15N111212003.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
16N111272003.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
17N111372003.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
18N111402003.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
\n", "
" ], "text/plain": [ " tailnum year type manufacturer model engines \\\n", "0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "15 N11121 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "16 N11127 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "17 N11137 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "18 N11140 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "\n", " seats speed engine \n", "0 55 NaN Turbo-fan \n", "15 55 NaN Turbo-fan \n", "16 55 NaN Turbo-fan \n", "17 55 NaN Turbo-fan \n", "18 55 NaN Turbo-fan " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes_df.loc[planes_df['year'] > 2002].head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "And multiple conditional tests can be combined using logical operators:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tailnumyeartypemanufacturermodelenginesseatsspeedengine
15N111212003.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
16N111272003.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
17N111372003.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
18N111402003.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
19N111502003.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
\n", "
" ], "text/plain": [ " tailnum year type manufacturer model engines \\\n", "15 N11121 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "16 N11127 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "17 N11137 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "18 N11140 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "19 N11150 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n", "\n", " seats speed engine \n", "15 55 NaN Turbo-fan \n", "16 55 NaN Turbo-fan \n", "17 55 NaN Turbo-fan \n", "18 55 NaN Turbo-fan \n", "19 55 NaN Turbo-fan " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes_df.loc[(planes_df['year'] > 2002) & (planes_df['year'] < 2004)].head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "

Note

\n", "

Note that each condition is wrapped in parentheses -- this is required.

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Your Turn\n", "\n", "1. What's the difference between **slicing** cases and **filtering** cases?\n", "2. Fill in the blanks to fix the following code to find planes that have more than three engines:\n", "\n", " ```python\n", " planes_df.loc[______['______'] > 3]\n", " ```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecting Variables and Filtering Cases" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "If we want to select variables and filter cases at the same time, we have a few options:\n", "\n", "1. Sequential operations\n", "2. Simultaneous operations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Sequential Operations\n", "\n", "We can use what we've previously learned to select variables and filter cases in multiple steps:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearengines
02004.02
42002.02
102002.02
112002.02
122002.02
\n", "
" ], "text/plain": [ " year engines\n", "0 2004.0 2\n", "4 2002.0 2\n", "10 2002.0 2\n", "11 2002.0 2\n", "12 2002.0 2" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes_df_filtered = planes_df.loc[planes_df['manufacturer'] == 'EMBRAER']\n", "planes_df_filtered_and_selected = planes_df_filtered[['year', 'engines']]\n", "planes_df_filtered_and_selected.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This is a good way to learn how to select and filter independently, and it also reads very clearly." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Simultaneous Operations\n", "\n", "However, we can also do both selecting and filtering in a single step with `.loc`:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearengines
02004.02
42002.02
102002.02
112002.02
122002.02
\n", "
" ], "text/plain": [ " year engines\n", "0 2004.0 2\n", "4 2002.0 2\n", "10 2002.0 2\n", "11 2002.0 2\n", "12 2002.0 2" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes_df.loc[planes_df['manufacturer'] == 'EMBRAER', ['year', 'engines']].head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This option is more succinct and also reduces programming time." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Your Turn\n", "\n", "Subset `planes_df` to only include planes made by Boeing and the `seats` and `model` variables." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Questions\n", "\n", "Are there any questions before we move on?" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "rise": { "autolaunch": true, "transition": "none" } }, "nbformat": 4, "nbformat_minor": 4 }