{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Manipulating and Creating Columns" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "> During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time.\n", ">\n", "> \\- Wes McKinney, the creator of Pandas, in his book *Python for Data Analysis*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Applied Review" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Data Structures and DataFrames\n", "- We use **DataFrames** to represent tables in Python.\n", "- Python also supports other data structures for storing information that isn't tabular. Examples include lists and dictionaries.\n", "- DataFrames have many **methods**, or functions that access their internal data. Some examples we saw were `describe()` and `set_index()`.\n", "- DataFrames are composed of **Series**, 1-D data structures (like a vector). DataFrame columns can be thought of as Series." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Importing Data\n", "- Python can read in data from CSVs, JSON files, and pickle files with just a few lines of code." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecting and Filtering Data\n", "- Python's pandas library supports limiting rows (via *filtering* and *slicing*), as well as *selecting* columns.\n", "- All of these operations use the bracket operators, but row syntax includes the `.loc` *accessor*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Calculations Using Columns" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "It's common to want to modify a column of a DataFrame, or sometimes even to create a new column.\n", "Let's take a look at our planes data again." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tailnumyeartypemanufacturermodelenginesseatsspeedengine
0N101562004.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan
1N102UW1998.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
2N103US1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
3N104UW1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan
4N105752002.0Fixed wing multi engineEMBRAEREMB-145LR255NaNTurbo-fan
\n", "
" ], "text/plain": [ " tailnum year type manufacturer model \\\n", "0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n", "1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n", "\n", " engines seats speed engine \n", "0 2 55 NaN Turbo-fan \n", "1 2 182 NaN Turbo-fan \n", "2 2 182 NaN Turbo-fan \n", "3 2 182 NaN Turbo-fan \n", "4 2 55 NaN Turbo-fan " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "planes = pd.read_csv('../data/planes.csv')\n", "planes.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Suppose we wanted to know the total capacity of each plane, including the crew.\n", "We have data on how many seats each plane has (in the `seats` column), but that only includes paying passengers.\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 55\n", "1 182\n", "2 182\n", "3 182\n", "4 55\n", "Name: seats, dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seats = planes['seats']\n", "seats.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "For simplicity, let's say a full flight crew is always 5 people.\n", "Series objects allow us to perform addition with the regular `+` syntax –- in this case, `seats + 5`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 60\n", "1 187\n", "2 187\n", "3 187\n", "4 60\n", "Name: seats, dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "capacity = seats + 5\n", "capacity.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "

Question

\n", "

What happens if you switch the order? (i.e. 5 + seats)? Does this make sense?

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "So we've create a new series, `capacity`, with the total capacity of the plane.\n", "Right now it's totally separate from our original `planes` DataFrame, but we can make it a column of `planes` using the assignment syntax with the column reference syntax.\n", "```python\n", "df['new_column_name'] = new_column_series\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tailnumyeartypemanufacturermodelenginesseatsspeedenginecapacity
0N101562004.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan60
1N102UW1998.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan187
2N103US1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan187
3N104UW1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan187
4N105752002.0Fixed wing multi engineEMBRAEREMB-145LR255NaNTurbo-fan60
\n", "
" ], "text/plain": [ " tailnum year type manufacturer model \\\n", "0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n", "1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n", "\n", " engines seats speed engine capacity \n", "0 2 55 NaN Turbo-fan 60 \n", "1 2 182 NaN Turbo-fan 187 \n", "2 2 182 NaN Turbo-fan 187 \n", "3 2 182 NaN Turbo-fan 187 \n", "4 2 55 NaN Turbo-fan 60 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes['capacity'] = capacity\n", "planes.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Note that `planes` now has a \"capacity\" column at the end." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "

Note

\n", "

Also note that in the code above, the column name goes in quotes within the bracket syntax, while the values that will become the column -- the Series we're using -- are on the right side of the statement, without any brackets or quotes.

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This sequence of operations can be expressed as a single line:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Create a capacity column filled with the values in the seats column added with 5.\n", "planes['capacity'] = planes['seats'] + 5" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "From a mathematical perspective, what we're doing here is adding a *scalar* -- a single value -- to a *vector* -- a series of values (aka a `Series`).\n", "Other vector-scalar math is supported as well." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "0 43\n", "1 170\n", "2 170\n", "3 170\n", "4 43\n", "Name: seats, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Subtraction\n", "(planes['seats'] - 12).head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 550\n", "1 1820\n", "2 1820\n", "3 1820\n", "4 550\n", "Name: seats, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Multiplication\n", "(planes['seats'] * 10).head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 3025\n", "1 33124\n", "2 33124\n", "3 33124\n", "4 3025\n", "Name: seats, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Exponentiation\n", "(planes['seats'] ** 2).head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Your Turn\n", "\n", "Create a new column named \"first_class_seats\" that is 1/5 of the total seats on the plane. You will have some results with fractional seats; don't worry about this." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Overwriting Columns\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "What if we discovered a systematic error in our data?\n", "Perhaps we find out that the \"engines\" column is only the number of engines *per wing* -- so the total number of engines is actually double the value in that column." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We could create a new column, \"real_engine_count\" or \"total_engines\".\n", "But we're not going to need the original \"engines\" column, and leaving it could cause confusion for others looking at our data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "A better solution would be to replace the original column with the new, recalculated, values.\n", "We can do so using the same syntax as for creating a new column." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tailnumyeartypemanufacturermodelenginesseatsspeedenginecapacity
0N101562004.0Fixed wing multi engineEMBRAEREMB-145XR255NaNTurbo-fan60
1N102UW1998.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan187
2N103US1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan187
3N104UW1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2142182NaNTurbo-fan187
4N105752002.0Fixed wing multi engineEMBRAEREMB-145LR255NaNTurbo-fan60
\n", "
" ], "text/plain": [ " tailnum year type manufacturer model \\\n", "0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n", "1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n", "\n", " engines seats speed engine capacity \n", "0 2 55 NaN Turbo-fan 60 \n", "1 2 182 NaN Turbo-fan 187 \n", "2 2 182 NaN Turbo-fan 187 \n", "3 2 182 NaN Turbo-fan 187 \n", "4 2 55 NaN Turbo-fan 60 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes.head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Multiply the engines column by 2, and then overwrite the original data.\n", "planes['engines'] = planes['engines'] * 2" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tailnumyeartypemanufacturermodelenginesseatsspeedenginecapacity
0N101562004.0Fixed wing multi engineEMBRAEREMB-145XR455NaNTurbo-fan60
1N102UW1998.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2144182NaNTurbo-fan187
2N103US1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2144182NaNTurbo-fan187
3N104UW1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2144182NaNTurbo-fan187
4N105752002.0Fixed wing multi engineEMBRAEREMB-145LR455NaNTurbo-fan60
\n", "
" ], "text/plain": [ " tailnum year type manufacturer model \\\n", "0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n", "1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n", "\n", " engines seats speed engine capacity \n", "0 4 55 NaN Turbo-fan 60 \n", "1 4 182 NaN Turbo-fan 187 \n", "2 4 182 NaN Turbo-fan 187 \n", "3 4 182 NaN Turbo-fan 187 \n", "4 4 55 NaN Turbo-fan 60 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Calculating Values Based on Multiple Columns" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "So far we've only seen vector-scalar math.\n", "But vector-vector math is supported as well." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let's look at a toy example of creating a column that contains the number of seats per engine." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "0 13.75\n", "1 45.50\n", "2 45.50\n", "3 45.50\n", "4 13.75\n", "dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seats_per_engine = planes['seats'] / planes['engines']\n", "seats_per_engine.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tailnumyeartypemanufacturermodelenginesseatsspeedenginecapacityseats_per_engine
0N101562004.0Fixed wing multi engineEMBRAEREMB-145XR455NaNTurbo-fan6013.75
1N102UW1998.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2144182NaNTurbo-fan18745.50
2N103US1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2144182NaNTurbo-fan18745.50
3N104UW1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2144182NaNTurbo-fan18745.50
4N105752002.0Fixed wing multi engineEMBRAEREMB-145LR455NaNTurbo-fan6013.75
\n", "
" ], "text/plain": [ " tailnum year type manufacturer model \\\n", "0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n", "1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n", "\n", " engines seats speed engine capacity seats_per_engine \n", "0 4 55 NaN Turbo-fan 60 13.75 \n", "1 4 182 NaN Turbo-fan 187 45.50 \n", "2 4 182 NaN Turbo-fan 187 45.50 \n", "3 4 182 NaN Turbo-fan 187 45.50 \n", "4 4 55 NaN Turbo-fan 60 13.75 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes['seats_per_engine'] = seats_per_engine\n", "planes.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "You can combine vector-vector and vector-scalar calculations in arbitrarily complex ways." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tailnumyeartypemanufacturermodelenginesseatsspeedenginecapacityseats_per_enginenonsense
0N101562004.0Fixed wing multi engineEMBRAEREMB-145XR455NaNTurbo-fan6013.758110.0
1N102UW1998.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2144182NaNTurbo-fan18745.508213.0
2N103US1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2144182NaNTurbo-fan18745.508217.0
3N104UW1999.0Fixed wing multi engineAIRBUS INDUSTRIEA320-2144182NaNTurbo-fan18745.508217.0
4N105752002.0Fixed wing multi engineEMBRAEREMB-145LR455NaNTurbo-fan6013.758102.0
\n", "
" ], "text/plain": [ " tailnum year type manufacturer model \\\n", "0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n", "1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n", "\n", " engines seats speed engine capacity seats_per_engine nonsense \n", "0 4 55 NaN Turbo-fan 60 13.75 8110.0 \n", "1 4 182 NaN Turbo-fan 187 45.50 8213.0 \n", "2 4 182 NaN Turbo-fan 187 45.50 8217.0 \n", "3 4 182 NaN Turbo-fan 187 45.50 8217.0 \n", "4 4 55 NaN Turbo-fan 60 13.75 8102.0 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes['nonsense'] = (planes['year'] + 12) * planes['engines'] + planes['seats'] - 9\n", "planes.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Your Turn\n", "\n", "Create a new column in the planes DataFrame, \"technology_index\", that is calculated with the formula:\n", "\n", "`technology_index = (year-1900) / 4 + engines * 2`\n", "\n", "
\n", "

Note

\n", "

Remember order of operations!

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Non-numeric Column Operations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "For simplicity, we started with mathematical operations.\n", "However, pandas supports string operations as well." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can use `+` to concatenate strings, with both vectors and scalars." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 Tailnum is N10156 and Model is EMB-145XR\n", "1 Tailnum is N102UW and Model is A320-214\n", "2 Tailnum is N103US and Model is A320-214\n", "3 Tailnum is N104UW and Model is A320-214\n", "4 Tailnum is N10575 and Model is EMB-145LR\n", "dtype: object" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary = 'Tailnum is ' + planes['tailnum'] + ' and Model is ' + planes['model']\n", "summary.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "More complex string operations are possible using methods available through the `.str` *accessor*.\n", "We won't cover them in detail, so refer to the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) if you're interested." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 embraer\n", "1 airbus industrie\n", "2 airbus industrie\n", "3 airbus industrie\n", "4 embraer\n", "Name: manufacturer, dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make the manufacturer field lowercase.\n", "lowercase_mfctr = planes['manufacturer'].str.lower()\n", "lowercase_mfctr.head()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 6\n", "1 6\n", "2 6\n", "3 6\n", "4 6\n", "Name: tailnum, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the length of the tail number.\n", "tailnum_len = planes['tailnum'].str.len()\n", "tailnum_len.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## More Complex Column Manipulation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Replacing Values" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One fairly common situation in data wrangling is needing to convert one set of values to another, where there is a one-to-one correspondence between the values currently in the column and the new values that should replace them.\n", "This operation can be described as \"mapping one set of values to another\"." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let's look at an example of this." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
carriername
09EEndeavor Air Inc.
1AAAmerican Airlines Inc.
2ASAlaska Airlines Inc.
3B6JetBlue Airways
4DLDelta Air Lines Inc.
\n", "
" ], "text/plain": [ " carrier name\n", "0 9E Endeavor Air Inc.\n", "1 AA American Airlines Inc.\n", "2 AS Alaska Airlines Inc.\n", "3 B6 JetBlue Airways\n", "4 DL Delta Air Lines Inc." ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airlines = pd.read_csv('../data/airlines.csv')\n", "# Keep just the first 5 rows for this example.\n", "airlines = airlines.iloc[:5]\n", "airlines" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Suppose we learn that there is a mistake in the carrier codes and they should be updated.\n", "- 9E should be PE\n", "- B6 should be BB\n", "- The other codes should stay as they are." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can express this *mapping* of old values to new values using a Python dictionary." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Only specify the values we want to replace; don't include the ones that should stay the same.\n", "value_mapping = {'9E': 'PE',\n", " 'B6': 'BB'}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Pandas provides a handy method on Series, `.replace`, that accepts this value mapping and updates the Series accordingly.\n", "We can use it to create a new column, \"updated_carrier\", with the proper carrier code values." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
carriernameupdated_carrier
09EEndeavor Air Inc.PE
1AAAmerican Airlines Inc.AA
2ASAlaska Airlines Inc.AS
3B6JetBlue AirwaysBB
4DLDelta Air Lines Inc.DL
\n", "
" ], "text/plain": [ " carrier name updated_carrier\n", "0 9E Endeavor Air Inc. PE\n", "1 AA American Airlines Inc. AA\n", "2 AS Alaska Airlines Inc. AS\n", "3 B6 JetBlue Airways BB\n", "4 DL Delta Air Lines Inc. DL" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airlines['updated_carrier'] = airlines['carrier'].replace(value_mapping)\n", "airlines" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "If you are a SQL user, this workflow may look familiar to you;\n", "it's quite similar to a `CASE WHEN` statement in SQL." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Your Turn\n", "\n", "1. Open the weather CSV (path: `../data/weather.csv`) and store it in a variable called `weather`.\n", "2. Take a look at the \"month\" column. Observe that its values are numeric, not strings. How do you think these values relate to months of the year?\n", "3. Create a mapping from each number to the corresponding month name, as a dictionary. For example, one of the keys would be `5` and its value would be `May`. Store it in a variable called `month_mapping`.\n", "4. Use the `.replace` method to overwrite the current month column with the month names as strings, using your newly created mapping." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The `apply` Method and Beyond" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "If you can think of a way to express a new column as a combination of other columns and constants, it can be created using Python and Pandas.\n", "However, column calculations beyond the above are outside the scope of this training." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "If you wish to learn more, take a look at the [`DataFrame.apply` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Questions\n", "\n", "Are there any questions before we move on?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "rise": { "autolaunch": true, "transition": "none" } }, "nbformat": 4, "nbformat_minor": 4 }