{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Manipulating and Creating Columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"> During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time.\n",
">\n",
"> \\- Wes McKinney, the creator of Pandas, in his book *Python for Data Analysis*"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Applied Review"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Data Structures and DataFrames\n",
"- We use **DataFrames** to represent tables in Python.\n",
"- Python also supports other data structures for storing information that isn't tabular. Examples include lists and dictionaries.\n",
"- DataFrames have many **methods**, or functions that access their internal data. Some examples we saw were `describe()` and `set_index()`.\n",
"- DataFrames are composed of **Series**, 1-D data structures (like a vector). DataFrame columns can be thought of as Series."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Importing Data\n",
"- Python can read in data from CSVs, JSON files, and pickle files with just a few lines of code."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Selecting and Filtering Data\n",
"- Python's pandas library supports limiting rows (via *filtering* and *slicing*), as well as *selecting* columns.\n",
"- All of these operations use the bracket operators, but row syntax includes the `.loc` *accessor*."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Calculations Using Columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"It's common to want to modify a column of a DataFrame, or sometimes even to create a new column.\n",
"Let's take a look at our planes data again."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" tailnum | \n",
" year | \n",
" type | \n",
" manufacturer | \n",
" model | \n",
" engines | \n",
" seats | \n",
" speed | \n",
" engine | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" N10156 | \n",
" 2004.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 1 | \n",
" N102UW | \n",
" 1998.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 2 | \n",
" N103US | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 3 | \n",
" N104UW | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
" 4 | \n",
" N10575 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145LR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" tailnum year type manufacturer model \\\n",
"0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n",
"1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n",
"\n",
" engines seats speed engine \n",
"0 2 55 NaN Turbo-fan \n",
"1 2 182 NaN Turbo-fan \n",
"2 2 182 NaN Turbo-fan \n",
"3 2 182 NaN Turbo-fan \n",
"4 2 55 NaN Turbo-fan "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"planes = pd.read_csv('../data/planes.csv')\n",
"planes.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Suppose we wanted to know the total capacity of each plane, including the crew.\n",
"We have data on how many seats each plane has (in the `seats` column), but that only includes paying passengers.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 55\n",
"1 182\n",
"2 182\n",
"3 182\n",
"4 55\n",
"Name: seats, dtype: int64"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"seats = planes['seats']\n",
"seats.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"For simplicity, let's say a full flight crew is always 5 people.\n",
"Series objects allow us to perform addition with the regular `+` syntax –- in this case, `seats + 5`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 60\n",
"1 187\n",
"2 187\n",
"3 187\n",
"4 60\n",
"Name: seats, dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"capacity = seats + 5\n",
"capacity.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"\n",
"
Question
\n",
"
What happens if you switch the order? (i.e. 5 + seats)? Does this make sense?
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"So we've create a new series, `capacity`, with the total capacity of the plane.\n",
"Right now it's totally separate from our original `planes` DataFrame, but we can make it a column of `planes` using the assignment syntax with the column reference syntax.\n",
"```python\n",
"df['new_column_name'] = new_column_series\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" tailnum | \n",
" year | \n",
" type | \n",
" manufacturer | \n",
" model | \n",
" engines | \n",
" seats | \n",
" speed | \n",
" engine | \n",
" capacity | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" N10156 | \n",
" 2004.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 60 | \n",
"
\n",
" \n",
" 1 | \n",
" N102UW | \n",
" 1998.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 187 | \n",
"
\n",
" \n",
" 2 | \n",
" N103US | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 187 | \n",
"
\n",
" \n",
" 3 | \n",
" N104UW | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 187 | \n",
"
\n",
" \n",
" 4 | \n",
" N10575 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145LR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 60 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" tailnum year type manufacturer model \\\n",
"0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n",
"1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n",
"\n",
" engines seats speed engine capacity \n",
"0 2 55 NaN Turbo-fan 60 \n",
"1 2 182 NaN Turbo-fan 187 \n",
"2 2 182 NaN Turbo-fan 187 \n",
"3 2 182 NaN Turbo-fan 187 \n",
"4 2 55 NaN Turbo-fan 60 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes['capacity'] = capacity\n",
"planes.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Note that `planes` now has a \"capacity\" column at the end."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"\n",
"
Note
\n",
"
Also note that in the code above, the column name goes in quotes within the bracket syntax, while the values that will become the column -- the Series we're using -- are on the right side of the statement, without any brackets or quotes.
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"This sequence of operations can be expressed as a single line:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Create a capacity column filled with the values in the seats column added with 5.\n",
"planes['capacity'] = planes['seats'] + 5"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"From a mathematical perspective, what we're doing here is adding a *scalar* -- a single value -- to a *vector* -- a series of values (aka a `Series`).\n",
"Other vector-scalar math is supported as well."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 43\n",
"1 170\n",
"2 170\n",
"3 170\n",
"4 43\n",
"Name: seats, dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Subtraction\n",
"(planes['seats'] - 12).head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 550\n",
"1 1820\n",
"2 1820\n",
"3 1820\n",
"4 550\n",
"Name: seats, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Multiplication\n",
"(planes['seats'] * 10).head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 3025\n",
"1 33124\n",
"2 33124\n",
"3 33124\n",
"4 3025\n",
"Name: seats, dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Exponentiation\n",
"(planes['seats'] ** 2).head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Your Turn\n",
"\n",
"Create a new column named \"first_class_seats\" that is 1/5 of the total seats on the plane. You will have some results with fractional seats; don't worry about this."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Overwriting Columns\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"What if we discovered a systematic error in our data?\n",
"Perhaps we find out that the \"engines\" column is only the number of engines *per wing* -- so the total number of engines is actually double the value in that column."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"We could create a new column, \"real_engine_count\" or \"total_engines\".\n",
"But we're not going to need the original \"engines\" column, and leaving it could cause confusion for others looking at our data."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A better solution would be to replace the original column with the new, recalculated, values.\n",
"We can do so using the same syntax as for creating a new column."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" tailnum | \n",
" year | \n",
" type | \n",
" manufacturer | \n",
" model | \n",
" engines | \n",
" seats | \n",
" speed | \n",
" engine | \n",
" capacity | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" N10156 | \n",
" 2004.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 60 | \n",
"
\n",
" \n",
" 1 | \n",
" N102UW | \n",
" 1998.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 187 | \n",
"
\n",
" \n",
" 2 | \n",
" N103US | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 187 | \n",
"
\n",
" \n",
" 3 | \n",
" N104UW | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 2 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 187 | \n",
"
\n",
" \n",
" 4 | \n",
" N10575 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145LR | \n",
" 2 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 60 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" tailnum year type manufacturer model \\\n",
"0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n",
"1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n",
"\n",
" engines seats speed engine capacity \n",
"0 2 55 NaN Turbo-fan 60 \n",
"1 2 182 NaN Turbo-fan 187 \n",
"2 2 182 NaN Turbo-fan 187 \n",
"3 2 182 NaN Turbo-fan 187 \n",
"4 2 55 NaN Turbo-fan 60 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes.head()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Multiply the engines column by 2, and then overwrite the original data.\n",
"planes['engines'] = planes['engines'] * 2"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" tailnum | \n",
" year | \n",
" type | \n",
" manufacturer | \n",
" model | \n",
" engines | \n",
" seats | \n",
" speed | \n",
" engine | \n",
" capacity | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" N10156 | \n",
" 2004.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 4 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 60 | \n",
"
\n",
" \n",
" 1 | \n",
" N102UW | \n",
" 1998.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 4 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 187 | \n",
"
\n",
" \n",
" 2 | \n",
" N103US | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 4 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 187 | \n",
"
\n",
" \n",
" 3 | \n",
" N104UW | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 4 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 187 | \n",
"
\n",
" \n",
" 4 | \n",
" N10575 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145LR | \n",
" 4 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 60 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" tailnum year type manufacturer model \\\n",
"0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n",
"1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n",
"\n",
" engines seats speed engine capacity \n",
"0 4 55 NaN Turbo-fan 60 \n",
"1 4 182 NaN Turbo-fan 187 \n",
"2 4 182 NaN Turbo-fan 187 \n",
"3 4 182 NaN Turbo-fan 187 \n",
"4 4 55 NaN Turbo-fan 60 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Calculating Values Based on Multiple Columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"So far we've only seen vector-scalar math.\n",
"But vector-vector math is supported as well."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Let's look at a toy example of creating a column that contains the number of seats per engine."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 13.75\n",
"1 45.50\n",
"2 45.50\n",
"3 45.50\n",
"4 13.75\n",
"dtype: float64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"seats_per_engine = planes['seats'] / planes['engines']\n",
"seats_per_engine.head()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" tailnum | \n",
" year | \n",
" type | \n",
" manufacturer | \n",
" model | \n",
" engines | \n",
" seats | \n",
" speed | \n",
" engine | \n",
" capacity | \n",
" seats_per_engine | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" N10156 | \n",
" 2004.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 4 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 60 | \n",
" 13.75 | \n",
"
\n",
" \n",
" 1 | \n",
" N102UW | \n",
" 1998.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 4 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 187 | \n",
" 45.50 | \n",
"
\n",
" \n",
" 2 | \n",
" N103US | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 4 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 187 | \n",
" 45.50 | \n",
"
\n",
" \n",
" 3 | \n",
" N104UW | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 4 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 187 | \n",
" 45.50 | \n",
"
\n",
" \n",
" 4 | \n",
" N10575 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145LR | \n",
" 4 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 60 | \n",
" 13.75 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" tailnum year type manufacturer model \\\n",
"0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n",
"1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n",
"\n",
" engines seats speed engine capacity seats_per_engine \n",
"0 4 55 NaN Turbo-fan 60 13.75 \n",
"1 4 182 NaN Turbo-fan 187 45.50 \n",
"2 4 182 NaN Turbo-fan 187 45.50 \n",
"3 4 182 NaN Turbo-fan 187 45.50 \n",
"4 4 55 NaN Turbo-fan 60 13.75 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes['seats_per_engine'] = seats_per_engine\n",
"planes.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"You can combine vector-vector and vector-scalar calculations in arbitrarily complex ways."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" tailnum | \n",
" year | \n",
" type | \n",
" manufacturer | \n",
" model | \n",
" engines | \n",
" seats | \n",
" speed | \n",
" engine | \n",
" capacity | \n",
" seats_per_engine | \n",
" nonsense | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" N10156 | \n",
" 2004.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145XR | \n",
" 4 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 60 | \n",
" 13.75 | \n",
" 8110.0 | \n",
"
\n",
" \n",
" 1 | \n",
" N102UW | \n",
" 1998.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 4 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 187 | \n",
" 45.50 | \n",
" 8213.0 | \n",
"
\n",
" \n",
" 2 | \n",
" N103US | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 4 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 187 | \n",
" 45.50 | \n",
" 8217.0 | \n",
"
\n",
" \n",
" 3 | \n",
" N104UW | \n",
" 1999.0 | \n",
" Fixed wing multi engine | \n",
" AIRBUS INDUSTRIE | \n",
" A320-214 | \n",
" 4 | \n",
" 182 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 187 | \n",
" 45.50 | \n",
" 8217.0 | \n",
"
\n",
" \n",
" 4 | \n",
" N10575 | \n",
" 2002.0 | \n",
" Fixed wing multi engine | \n",
" EMBRAER | \n",
" EMB-145LR | \n",
" 4 | \n",
" 55 | \n",
" NaN | \n",
" Turbo-fan | \n",
" 60 | \n",
" 13.75 | \n",
" 8102.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" tailnum year type manufacturer model \\\n",
"0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n",
"1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n",
"4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n",
"\n",
" engines seats speed engine capacity seats_per_engine nonsense \n",
"0 4 55 NaN Turbo-fan 60 13.75 8110.0 \n",
"1 4 182 NaN Turbo-fan 187 45.50 8213.0 \n",
"2 4 182 NaN Turbo-fan 187 45.50 8217.0 \n",
"3 4 182 NaN Turbo-fan 187 45.50 8217.0 \n",
"4 4 55 NaN Turbo-fan 60 13.75 8102.0 "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"planes['nonsense'] = (planes['year'] + 12) * planes['engines'] + planes['seats'] - 9\n",
"planes.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Your Turn\n",
"\n",
"Create a new column in the planes DataFrame, \"technology_index\", that is calculated with the formula:\n",
"\n",
"`technology_index = (year-1900) / 4 + engines * 2`\n",
"\n",
"\n",
"
Note
\n",
"
Remember order of operations!
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Non-numeric Column Operations"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"For simplicity, we started with mathematical operations.\n",
"However, pandas supports string operations as well."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We can use `+` to concatenate strings, with both vectors and scalars."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 Tailnum is N10156 and Model is EMB-145XR\n",
"1 Tailnum is N102UW and Model is A320-214\n",
"2 Tailnum is N103US and Model is A320-214\n",
"3 Tailnum is N104UW and Model is A320-214\n",
"4 Tailnum is N10575 and Model is EMB-145LR\n",
"dtype: object"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"summary = 'Tailnum is ' + planes['tailnum'] + ' and Model is ' + planes['model']\n",
"summary.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"More complex string operations are possible using methods available through the `.str` *accessor*.\n",
"We won't cover them in detail, so refer to the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) if you're interested."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 embraer\n",
"1 airbus industrie\n",
"2 airbus industrie\n",
"3 airbus industrie\n",
"4 embraer\n",
"Name: manufacturer, dtype: object"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Make the manufacturer field lowercase.\n",
"lowercase_mfctr = planes['manufacturer'].str.lower()\n",
"lowercase_mfctr.head()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 6\n",
"1 6\n",
"2 6\n",
"3 6\n",
"4 6\n",
"Name: tailnum, dtype: int64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the length of the tail number.\n",
"tailnum_len = planes['tailnum'].str.len()\n",
"tailnum_len.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## More Complex Column Manipulation"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Replacing Values"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"One fairly common situation in data wrangling is needing to convert one set of values to another, where there is a one-to-one correspondence between the values currently in the column and the new values that should replace them.\n",
"This operation can be described as \"mapping one set of values to another\"."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Let's look at an example of this."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" carrier | \n",
" name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 9E | \n",
" Endeavor Air Inc. | \n",
"
\n",
" \n",
" 1 | \n",
" AA | \n",
" American Airlines Inc. | \n",
"
\n",
" \n",
" 2 | \n",
" AS | \n",
" Alaska Airlines Inc. | \n",
"
\n",
" \n",
" 3 | \n",
" B6 | \n",
" JetBlue Airways | \n",
"
\n",
" \n",
" 4 | \n",
" DL | \n",
" Delta Air Lines Inc. | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" carrier name\n",
"0 9E Endeavor Air Inc.\n",
"1 AA American Airlines Inc.\n",
"2 AS Alaska Airlines Inc.\n",
"3 B6 JetBlue Airways\n",
"4 DL Delta Air Lines Inc."
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airlines = pd.read_csv('../data/airlines.csv')\n",
"# Keep just the first 5 rows for this example.\n",
"airlines = airlines.iloc[:5]\n",
"airlines"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Suppose we learn that there is a mistake in the carrier codes and they should be updated.\n",
"- 9E should be PE\n",
"- B6 should be BB\n",
"- The other codes should stay as they are."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"We can express this *mapping* of old values to new values using a Python dictionary."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Only specify the values we want to replace; don't include the ones that should stay the same.\n",
"value_mapping = {'9E': 'PE',\n",
" 'B6': 'BB'}"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Pandas provides a handy method on Series, `.replace`, that accepts this value mapping and updates the Series accordingly.\n",
"We can use it to create a new column, \"updated_carrier\", with the proper carrier code values."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" carrier | \n",
" name | \n",
" updated_carrier | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 9E | \n",
" Endeavor Air Inc. | \n",
" PE | \n",
"
\n",
" \n",
" 1 | \n",
" AA | \n",
" American Airlines Inc. | \n",
" AA | \n",
"
\n",
" \n",
" 2 | \n",
" AS | \n",
" Alaska Airlines Inc. | \n",
" AS | \n",
"
\n",
" \n",
" 3 | \n",
" B6 | \n",
" JetBlue Airways | \n",
" BB | \n",
"
\n",
" \n",
" 4 | \n",
" DL | \n",
" Delta Air Lines Inc. | \n",
" DL | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" carrier name updated_carrier\n",
"0 9E Endeavor Air Inc. PE\n",
"1 AA American Airlines Inc. AA\n",
"2 AS Alaska Airlines Inc. AS\n",
"3 B6 JetBlue Airways BB\n",
"4 DL Delta Air Lines Inc. DL"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airlines['updated_carrier'] = airlines['carrier'].replace(value_mapping)\n",
"airlines"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"If you are a SQL user, this workflow may look familiar to you;\n",
"it's quite similar to a `CASE WHEN` statement in SQL."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Your Turn\n",
"\n",
"1. Open the weather CSV (path: `../data/weather.csv`) and store it in a variable called `weather`.\n",
"2. Take a look at the \"month\" column. Observe that its values are numeric, not strings. How do you think these values relate to months of the year?\n",
"3. Create a mapping from each number to the corresponding month name, as a dictionary. For example, one of the keys would be `5` and its value would be `May`. Store it in a variable called `month_mapping`.\n",
"4. Use the `.replace` method to overwrite the current month column with the month names as strings, using your newly created mapping."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### The `apply` Method and Beyond"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"If you can think of a way to express a new column as a combination of other columns and constants, it can be created using Python and Pandas.\n",
"However, column calculations beyond the above are outside the scope of this training."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"If you wish to learn more, take a look at the [`DataFrame.apply` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Questions\n",
"\n",
"Are there any questions before we move on?"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
},
"rise": {
"autolaunch": true,
"transition": "none"
}
},
"nbformat": 4,
"nbformat_minor": 4
}