{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Selecting and Filtering"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"> “Every second of every day, our senses bring in way too much data than we can possibly process in our brains.”\n",
">\n",
"> \\- Peter Diamandis, Founder of the X-Prize for human-AI collaboration"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Applied Review"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Data Structures and the DataFrame Class"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"* Data is frequently represented inside a **DataFrame** - a class from the `pandas` library that is similar to a *table* or *dataset*"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"* Each DataFrame object has rows and columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"* The DataFrame class has methods (built-in operations) for common tasks and attributes (stored data) of common information"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"* Other structures exist, too - lists, dicts, tuples, etc."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Importing Data"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"* Tabular data can be imported into DataFrames using the `pd.read_csv()` function - there are parameters for different options"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import pandas as pd\n",
"planes_df = pd.read_csv('../data/planes.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Other data formats like JSON (key-value pairs) and Pickle (native Python) can be imported using the `with` statement and respective functions:\n",
" * JSON files use the `load()` function from the `json` library\n",
" * Pickle files use the `load()` function from the `pickle` library"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## General Model"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Subsetting Dimensions"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"* We don't always want all of the data in a DataFrame, so we need to take subsets of the DataFrame."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"* In general, **subsetting** is extracting a small portion of a DataFrame -- making the DataFrame smaller."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"* Since the DataFrame is two-dimensional, there are two dimensions on which to subset."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"**Dimension 1:** We may only want to consider certain *variables*."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"For example, we may only care about the `year` and `engines` variables:\n",
"\n",
"![](./images/selecting_columns.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"We call this **selecting** columns/variables -- this is similar to SQL's `SELECT` or R's dplyr package's `select()`."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"**Dimension 2:** We may only want to consider certain *cases*."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"For example, we may only care about the cases where the manufacturer is Embraer.\n",
"\n",
"![](./images/selecting_rows.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"We call this **filtering** or **slicing** -- this is similar to SQL's `WHERE` or R's dplyr package's `filter()` or `slice()`."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"And we can combine these two options to subset in both dimensions -- the `year` and `engines` variables where the manufacturer is Embraer:\n",
"\n",
"![](./images/selecting_rows_columns.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Subsetting into a New DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"In the previous example, we want to do two things using `planes_df`:\n",
"\n",
" 1. **select** the `year` and `engines` variables\n",
" 2. **filter** to cases where the manufacturer is Embraer"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"But we also want to return a new DataFrame -- not just highlight certain cells."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"In other words, we want to turn this:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "-"
},
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" tailnum | \n",
" year | \n",
" type | \n",
" manufacturer | \n",
" model | \n",
" engines | \n",
" seats | \n",
" speed | \n",
" engine | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" N10156 | \n",
" 2004.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 1 | \n",
" N102UW | \n",
" 1998.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 2 | \n",
" N103US | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 3 | \n",
" N104UW | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 4 | \n",
" N10575 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145LR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" tailnum year type manufacturer model \\\n",
"0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n",
"1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n",
"\n",
" engines seats speed engine \n",
"0 2 55 NaN Turbo-fan \n",
"1 2 182 NaN Turbo-fan \n",
"2 2 182 NaN Turbo-fan \n",
"3 2 182 NaN Turbo-fan \n",
"4 2 55 NaN Turbo-fan "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Into this:\n",
"\n",
"![](./images/subsetting_result.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"So we really have a third need: return the resulting DataFrame so we can continue our analysis:\n",
"\n",
" 1. **select** the `year` and `engines` variables\n",
" 2. **filter** to cases where the manufacturer is Embraer\n",
" 3. Return a DataFrame to continue the analysis"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Subsetting Variables"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Recall that the subsetting of variables/columns is called **selecting** variables/columns."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"In a simple example, we can select a single variable using bracket subsetting notation:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 2004.0\n",
"1 1998.0\n",
"2 1999.0\n",
"3 1999.0\n",
"4 2002.0\n",
"Name: year, dtype: float64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes_df['year'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Notice the `head()` method also works on `planes_df['year']` to return the first five elements."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"
Question
\n",
"
What is the data type of planes_df['year']?
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"This returns `pandas.core.series.Series`, referred to simply as a \"Series\", rather than a DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(planes_df['year'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"This is okay -- the Series is a popular data structure in Python. Recall:"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* A Series is a one-dimensional data structure -- this is similar to a Python `list`"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"* Note that all objects in a Series must be of the same type"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"* Each DataFrame can be thought of as a list of equal-length Series (plus an Index)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"This visual of a Series and DataFrame may be helpful:\n",
"\n",
"\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Series can be useful for other things, and we will cover that later -- but for now, we are interested in *returning a DataFrame* rather than a series."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We can select a single variable and return a DataFrame by still using bracket subsetting notation, but this time we will pass a `list` of variables names:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2004.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1998.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1999.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1999.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 2002.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year\n",
"0 2004.0\n",
"1 1998.0\n",
"2 1999.0\n",
"3 1999.0\n",
"4 2002.0"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes_df[['year']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"And we can see that we've returned a DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(planes_df[['year']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"
Question
\n",
"
What's another advantage of this passing a list?
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Passing a list into the bracket subsetting notation allows us to select multiple variables at once:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" engines | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2004.0 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 1998.0 | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 1999.0 | \n",
" 2 | \n",
"
\n",
" \n",
" 3 | \n",
" 1999.0 | \n",
" 2 | \n",
"
\n",
" \n",
" 4 | \n",
" 2002.0 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year engines\n",
"0 2004.0 2\n",
"1 1998.0 2\n",
"2 1999.0 2\n",
"3 1999.0 2\n",
"4 2002.0 2"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes_df[['year', 'engines']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"In another example, assume we are interested in the `model` of plane, number of `seats` and `engine` type:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" model | \n",
" seats | \n",
" engine | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" EMB-145XR | \n",
" 55 | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 1 | \n",
" A320-214 | \n",
" 182 | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 2 | \n",
" A320-214 | \n",
" 182 | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 3 | \n",
" A320-214 | \n",
" 182 | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 4 | \n",
" EMB-145LR | \n",
" 55 | \n",
" Turbo-fan | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" model seats engine\n",
"0 EMB-145XR 55 Turbo-fan\n",
"1 A320-214 182 Turbo-fan\n",
"2 A320-214 182 Turbo-fan\n",
"3 A320-214 182 Turbo-fan\n",
"4 EMB-145LR 55 Turbo-fan"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes_df[['model', 'seats', 'engine']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Your Turn\n",
"\n",
"1\\. ______ is a common term for subsetting DataFrame variables.\n",
"\n",
"2\\. What type of object is a DataFrame column?\n",
"\n",
"3\\. What will be returned by the following code?\n",
"\n",
" ```python\n",
" planes_df['type', 'model']\n",
" ```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Subsetting Cases"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"When we subset cases/records/rows we primarily use two names: **slicing** and **filtering**, but *these are not the same*:\n",
"\n",
" * **slicing**, similar to row **indexing**, subsets cases by the value of the Index\n",
" * **filtering** subsets cases using a conditional test"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Slicing Cases"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Remember that all DataFrames have an Index:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" tailnum | \n",
" year | \n",
" type | \n",
" manufacturer | \n",
" model | \n",
" engines | \n",
" seats | \n",
" speed | \n",
" engine | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" N10156 | \n",
" 2004.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 1 | \n",
" N102UW | \n",
" 1998.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 2 | \n",
" N103US | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 3 | \n",
" N104UW | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 4 | \n",
" N10575 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145LR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" tailnum year type manufacturer model \\\n",
"0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n",
"1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n",
"\n",
" engines seats speed engine \n",
"0 2 55 NaN Turbo-fan \n",
"1 2 182 NaN Turbo-fan \n",
"2 2 182 NaN Turbo-fan \n",
"3 2 182 NaN Turbo-fan \n",
"4 2 55 NaN Turbo-fan "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We can **slice** cases/rows using the values in the Index and bracket subsetting notation. It's common practice to use `.loc` to slice cases/rows:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" tailnum | \n",
" year | \n",
" type | \n",
" manufacturer | \n",
" model | \n",
" engines | \n",
" seats | \n",
" speed | \n",
" engine | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" N10156 | \n",
" 2004.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 1 | \n",
" N102UW | \n",
" 1998.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 2 | \n",
" N103US | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 3 | \n",
" N104UW | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 4 | \n",
" N10575 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145LR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 5 | \n",
" N105UW | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" tailnum year type manufacturer model \\\n",
"0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n",
"1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n",
"5 N105UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"\n",
" engines seats speed engine \n",
"0 2 55 NaN Turbo-fan \n",
"1 2 182 NaN Turbo-fan \n",
"2 2 182 NaN Turbo-fan \n",
"3 2 182 NaN Turbo-fan \n",
"4 2 55 NaN Turbo-fan \n",
"5 2 182 NaN Turbo-fan "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes_df.loc[0:5]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"
Note
\n",
"
Note that since this is not \"indexing\", the last element is inclusive.
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We can also pass a `list` of Index values:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" tailnum | \n",
" year | \n",
" type | \n",
" manufacturer | \n",
" model | \n",
" engines | \n",
" seats | \n",
" speed | \n",
" engine | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" N10156 | \n",
" 2004.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 2 | \n",
" N103US | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 4 | \n",
" N10575 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145LR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 6 | \n",
" N107US | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 8 | \n",
" N109UW | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" tailnum year type manufacturer model \\\n",
"0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n",
"2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n",
"6 N107US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"8 N109UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"\n",
" engines seats speed engine \n",
"0 2 55 NaN Turbo-fan \n",
"2 2 182 NaN Turbo-fan \n",
"4 2 55 NaN Turbo-fan \n",
"6 2 182 NaN Turbo-fan \n",
"8 2 182 NaN Turbo-fan "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes_df.loc[[0, 2, 4, 6, 8]]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Filtering Cases"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"We can **filter** cases/rows using a logical sequence equal in length to the number of rows in the DataFrame."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Continuing our example, assume we want to determine whether each case's `manufacturer` is Embraer. We can use the `manufacturer` Series and a logical equivalency test to find the result for each row:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 True\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 True\n",
" ... \n",
"3317 False\n",
"3318 False\n",
"3319 False\n",
"3320 False\n",
"3321 False\n",
"Name: manufacturer, Length: 3322, dtype: bool"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes_df['manufacturer'] == 'EMBRAER'"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We can use this resulting logical sequence to test **filter** cases -- rows that are `True` will be returned while those that are `False` will be removed:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" tailnum | \n",
" year | \n",
" type | \n",
" manufacturer | \n",
" model | \n",
" engines | \n",
" seats | \n",
" speed | \n",
" engine | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" N10156 | \n",
" 2004.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 4 | \n",
" N10575 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145LR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 10 | \n",
" N11106 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 11 | \n",
" N11107 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 12 | \n",
" N11109 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" tailnum year type manufacturer model engines \\\n",
"0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR 2 \n",
"10 N11106 2002.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"11 N11107 2002.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"12 N11109 2002.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"\n",
" seats speed engine \n",
"0 55 NaN Turbo-fan \n",
"4 55 NaN Turbo-fan \n",
"10 55 NaN Turbo-fan \n",
"11 55 NaN Turbo-fan \n",
"12 55 NaN Turbo-fan "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes_df[planes_df['manufacturer'] == 'EMBRAER'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"This also works with `.loc`:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" tailnum | \n",
" year | \n",
" type | \n",
" manufacturer | \n",
" model | \n",
" engines | \n",
" seats | \n",
" speed | \n",
" engine | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" N10156 | \n",
" 2004.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 4 | \n",
" N10575 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145LR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 10 | \n",
" N11106 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 11 | \n",
" N11107 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 12 | \n",
" N11109 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" tailnum year type manufacturer model engines \\\n",
"0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR 2 \n",
"10 N11106 2002.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"11 N11107 2002.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"12 N11109 2002.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"\n",
" seats speed engine \n",
"0 55 NaN Turbo-fan \n",
"4 55 NaN Turbo-fan \n",
"10 55 NaN Turbo-fan \n",
"11 55 NaN Turbo-fan \n",
"12 55 NaN Turbo-fan "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes_df.loc[planes_df['manufacturer'] == 'EMBRAER'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Any conditional test can be used to **filter** DataFrame rows:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" tailnum | \n",
" year | \n",
" type | \n",
" manufacturer | \n",
" model | \n",
" engines | \n",
" seats | \n",
" speed | \n",
" engine | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" N10156 | \n",
" 2004.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 15 | \n",
" N11121 | \n",
" 2003.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 16 | \n",
" N11127 | \n",
" 2003.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 17 | \n",
" N11137 | \n",
" 2003.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 18 | \n",
" N11140 | \n",
" 2003.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" tailnum year type manufacturer model engines \\\n",
"0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"15 N11121 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"16 N11127 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"17 N11137 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"18 N11140 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"\n",
" seats speed engine \n",
"0 55 NaN Turbo-fan \n",
"15 55 NaN Turbo-fan \n",
"16 55 NaN Turbo-fan \n",
"17 55 NaN Turbo-fan \n",
"18 55 NaN Turbo-fan "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes_df.loc[planes_df['year'] > 2002].head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"And multiple conditional tests can be combined using logical operators:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" tailnum | \n",
" year | \n",
" type | \n",
" manufacturer | \n",
" model | \n",
" engines | \n",
" seats | \n",
" speed | \n",
" engine | \n",
"
\n",
" \n",
" \n",
" \n",
" 15 | \n",
" N11121 | \n",
" 2003.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 16 | \n",
" N11127 | \n",
" 2003.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 17 | \n",
" N11137 | \n",
" 2003.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 18 | \n",
" N11140 | \n",
" 2003.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 19 | \n",
" N11150 | \n",
" 2003.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" tailnum year type manufacturer model engines \\\n",
"15 N11121 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"16 N11127 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"17 N11137 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"18 N11140 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"19 N11150 2003.0 Fixed wing multi engine EMBRAER EMB-145XR 2 \n",
"\n",
" seats speed engine \n",
"15 55 NaN Turbo-fan \n",
"16 55 NaN Turbo-fan \n",
"17 55 NaN Turbo-fan \n",
"18 55 NaN Turbo-fan \n",
"19 55 NaN Turbo-fan "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes_df.loc[(planes_df['year'] > 2002) & (planes_df['year'] < 2004)].head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"
Note
\n",
"
Note that each condition is wrapped in parentheses -- this is required.
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Your Turn\n",
"\n",
"1. What's the difference between **slicing** cases and **filtering** cases?\n",
"2. Fill in the blanks to fix the following code to find planes that have more than three engines:\n",
"\n",
" ```python\n",
" planes_df.loc[______['______'] > 3]\n",
" ```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Selecting Variables and Filtering Cases"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"If we want to select variables and filter cases at the same time, we have a few options:\n",
"\n",
"1. Sequential operations\n",
"2. Simultaneous operations"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Sequential Operations\n",
"\n",
"We can use what we've previously learned to select variables and filter cases in multiple steps:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" engines | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2004.0 | \n",
" 2 | \n",
"
\n",
" \n",
" 4 | \n",
" 2002.0 | \n",
" 2 | \n",
"
\n",
" \n",
" 10 | \n",
" 2002.0 | \n",
" 2 | \n",
"
\n",
" \n",
" 11 | \n",
" 2002.0 | \n",
" 2 | \n",
"
\n",
" \n",
" 12 | \n",
" 2002.0 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year engines\n",
"0 2004.0 2\n",
"4 2002.0 2\n",
"10 2002.0 2\n",
"11 2002.0 2\n",
"12 2002.0 2"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes_df_filtered = planes_df.loc[planes_df['manufacturer'] == 'EMBRAER']\n",
"planes_df_filtered_and_selected = planes_df_filtered[['year', 'engines']]\n",
"planes_df_filtered_and_selected.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"This is a good way to learn how to select and filter independently, and it also reads very clearly."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Simultaneous Operations\n",
"\n",
"However, we can also do both selecting and filtering in a single step with `.loc`:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" engines | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2004.0 | \n",
" 2 | \n",
"
\n",
" \n",
" 4 | \n",
" 2002.0 | \n",
" 2 | \n",
"
\n",
" \n",
" 10 | \n",
" 2002.0 | \n",
" 2 | \n",
"
\n",
" \n",
" 11 | \n",
" 2002.0 | \n",
" 2 | \n",
"
\n",
" \n",
" 12 | \n",
" 2002.0 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year engines\n",
"0 2004.0 2\n",
"4 2002.0 2\n",
"10 2002.0 2\n",
"11 2002.0 2\n",
"12 2002.0 2"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes_df.loc[planes_df['manufacturer'] == 'EMBRAER', ['year', 'engines']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"This option is more succinct and also reduces programming time."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Your Turn\n",
"\n",
"Subset `planes_df` to only include planes made by Boeing and the `seats` and `model` variables."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Questions\n",
"\n",
"Are there any questions before we move on?"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
},
"rise": {
"autolaunch": true,
"transition": "none"
}
},
"nbformat": 4,
"nbformat_minor": 4
}