{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Packages, Modules, Methods, and Functions" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "> The Python source distribution has long maintained the philosophy of \"batteries included\" -- having a rich and versatile standard library which is immediately available, without making the user download separate packages. This gives the Python language a head start in many projects.\n", ">\n", "> \\- PEP 206" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Applied Review" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Python and Jupyter Overview" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- We're working with Python through Jupyter, the most common IDE for data science." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Fundamentals" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Python's common *atomic*, or basic, data types are:\n", " - Integers\n", " - Floats (decimals)\n", " - Strings\n", " - Booleans" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- These simple types can be combined to form more complex types, including:\n", " - Lists: Ordered collections\n", " - Dictionaries: Key-value pairs\n", " - DataFrames: Tabular datasets" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Packages (aka *Modules*)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "So far we've seen several data types that Python offers out-of-the-box.\n", "However, to keep things organized, some Python functionality is stored in standalone *packages*, or libraries of code.\n", "The word \"module\" is generally synonymous with package; you will hear both in discussions of Python." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "For example, functionality related to the operating system -- such as creating files and folders -- is stored in a package called `os`.\n", "To use the tools in `os`, we *import* the package." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import os" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Once we import it, we gain access to everything inside.\n", "With Jupyter's autocomplete, we can view what's available." 
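, "\n",
"\n",
"Autocomplete only works in interactive tools such as Jupyter. As a rough alternative sketch, Python's built-in `dir` function returns the names inside a package as a list of strings (the variable name `names` is just an illustrative choice):\n",
"```python\n",
"import os\n",
"\n",
"# dir() returns the names defined inside the package as a list of strings.\n",
"names = dir(os)\n",
"print(len(names))\n",
"\n",
"# Check whether a particular name is available before using it.\n",
"print('getcwd' in names)\n",
"```"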
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" }, "tags": [ "ci-skip" ] }, "outputs": [], "source": [ "# Move your cursor to the end of the line below and press Tab.\n", "os." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Some packages, like `os`, are bundled with every Python install; downloading Python guarantees you'll have these packages.\n", "Collectively, this group of packages is known as the *standard library*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Other packages must be downloaded separately, either because\n", "- they aren't sufficiently popular to merit inclusion in the standard library\n", "- *or* they change too quickly for the maintainers of Python to keep up" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The DataFrame type that we saw earlier is part of one such package called `pandas` (short for *Panel Data*).\n", "Since pandas is specific to data science and is still rapidly evolving, it is not part of the standard library." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can download packages like pandas from the internet using a website called PyPI, the *Python Package Index*.\n", "Fortunately, since we are using Binder today, that has been handled for us and pandas is already installed." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "It's possible to import packages under an *alias*, or a nickname.\n", "The community has adopted certain conventions for aliases for common packages;\n", "while following them isn't mandatory, it's highly recommended, as it makes your code easier for others to understand." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "pandas is conventionally imported under the alias `pd`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Importing pandas has given us access to the DataFrame, accessible as pd.DataFrame\n", "pd.DataFrame" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "

Question

\n", "

What is the type of pd? Guess before you run the code below.

\n", "
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "module" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(pd)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Third-party packages unlock a huge range of functionality that isn't available in native Python; much of Python's data science capabilities come from a handful of packages outside the standard library:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- pandas\n", "- numpy (numerical computing)\n", "- scikit-learn (modeling)\n", "- scipy (scientific computing)\n", "- matplotlib (graphing)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We won't have time to touch on most of these in this training, but if you're interested in one, google it!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Your Turn\n", "\n", "1. Import the `numpy` library, listed above. Give it the alias \"np\".\n", "2. Using autocomplete, determine what variable or function inside the numpy library starts with \"asco\". *Hint: remember you'll need to preface the variable name with the package alias, e.g. `np.asco`*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Dot Notation with Packages" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We've seen it a few times already, but now it's time to discuss it explicitly:\n", "things inside packages can be accessed with *dot-notation*." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Dot notation looks like this:\n", "```python\n", "pd.Series\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "or\n", "```python\n", "import numpy as np\n", "np.array\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "You can read this as \"the `array` variable, within the NumPy library\"." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Packages can contain pretty much anything that's legal in Python;\n", "if it's code, it can be in a package.\n", "\n", "This flexibility is part of the reason that Python's package ecosystem is so expansive and powerful." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Functions" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As you may have noticed already, occasionally we run code using parentheses.\n", "The feature that permits this in Python is **functions** -- code snippets wrapped up into a single name." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "For example, take the `type` function we saw above.\n", "```python\n", "type(x)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "`type` does some complex things under the hood -- it looks at the variable inside the parentheses, determines what type of thing it is, and then returns that type to the user." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "int" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = 7\n", "type(x)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "But the beauty of `type`, and of all functions, is that you (as the user) don't need to know all the complex code that's necessary to figure out that x is an `int` -- you just need to remember that there's a `type` function to do that for you." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Functions make you much more powerful, as they unlock lots of functionality within a simple interface." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```python\n", "# Get the first few rows of the planes data.\n", "planes.head()\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```python\n", "# Read in the planes.csv file.\n", "pd.read_csv('../data/planes.csv')\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The variables within the parens are called function arguments, or simply **arguments**.\n", "\n", "Above, the string `'../data/planes.csv'` is the argument to the `pd.read_csv` function." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Functions are integral to using Python, because it's much more efficient to use pre-written code than to always write your own." 
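, "\n",
"\n",
"As a small illustration of pre-written functions and their arguments, here is a sketch using a few of Python's built-in functions (the list `values` is made up for this example):\n",
"```python\n",
"values = [3, 8, 5]\n",
"\n",
"# One argument: the list to measure or search.\n",
"print(len(values))\n",
"print(max(values))\n",
"\n",
"# Two arguments, separated by a comma: the number and how many decimals to keep.\n",
"print(round(3.14159, 2))\n",
"```"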
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "If you ever do want to write your own function -- perhaps to share with others, or to make it easier to reuse your work -- it's fairly simple to do so, but beyond the scope of this training." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Objects and Dot Notation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Dot-notation, which we discussed in relation to packages, has another use -- accessing things inside of *objects*.\n", "\n", "What's an object? Basically, a variable that contains other data or functionality inside of it that is exposed to users." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "For example, DataFrames are objects." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "df = pd.DataFrame({'first_name': ['Ethan', 'Brad'], 'last_name': ['Swan', 'Boehmke']})" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
first_namelast_name
0EthanSwan
1BradBoehmke
\n", "
" ], "text/plain": [ " first_name last_name\n", "0 Ethan Swan\n", "1 Brad Boehmke" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(2, 2)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
first_namelast_name
count22
unique22
topEthanSwan
freq11
\n", "
" ], "text/plain": [ " first_name last_name\n", "count 2 2\n", "unique 2 2\n", "top Ethan Swan\n", "freq 1 1" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "You can see that DataFrames have a `shape` variable and a `describe` function inside of them, both accessible through dot notation.\n", "\n", "
\n", "

Note

\n", "

Variables inside an object are often called attributes and functions inside objects are called methods.

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### On Consistency and Language Design" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One of the great things about Python is that its creators really cared about internal consistency." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "What that means to us, as users, is that syntax is consistent and predictable -- even across different uses that would appear to be different at first." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Dot notation reveals something kind of cool about Python: packages are just like other objects, and the variables inside them are just attributes and methods!\n", "\n", "This standardization across packages and objects helps us remember a single, intuitive syntax that works for many different things." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Functions, Objects, and Methods in the Context of DataFrames" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As we saw above, DataFrames are a type of Python object, so let's use them to explore the new Python features we've learned." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Using the `read_csv` function from the Pandas package to read in a DataFrame" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "df = pd.read_csv('../data/airlines.csv')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Using the `type` function to determine the type of `df`" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Using the `head` method of the DataFrame to view some of its rows" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
carriername
09EEndeavor Air Inc.
1AAAmerican Airlines Inc.
2ASAlaska Airlines Inc.
3B6JetBlue Airways
4DLDelta Air Lines Inc.
\n", "
" ], "text/plain": [ " carrier name\n", "0 9E Endeavor Air Inc.\n", "1 AA American Airlines Inc.\n", "2 AS Alaska Airlines Inc.\n", "3 B6 JetBlue Airways\n", "4 DL Delta Air Lines Inc." ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Examining the `columns` attribute of the DataFrame to see the names of its columns." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "Index(['carrier', 'name'], dtype='object')" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Inspecting the `shape` attribute to find the *dimensions* (rows and columns) of the DataFrame." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(16, 2)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Calling the `describe` method to get a summary of the data in the DataFrame." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
carriername
count1616
unique1616
top9EEndeavor Air Inc.
freq11
\n", "
" ], "text/plain": [ " carrier name\n", "count 16 16\n", "unique 16 16\n", "top 9E Endeavor Air Inc.\n", "freq 1 1" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Now let's combine them: using the `type` function to determine what `df.describe` holds." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "method" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df.describe)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "

Question

\n", "

Does this result make sense? What would happen if you added parens? i.e. type(df.describe())

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Your Turn\n", "\n", "Spend some time using autocomplete to explore the methods and attributes of the `df` object we used above.\n", "Remember from the Jupyter lesson that you can use a question mark to see the documentation for a function or method (e.g. `df.describe?`)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Deeper Dive on DataFrames" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Now that we understand objects and functions better, let's look more at DataFrames." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What Are DataFrames Made of?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Accessing an individual column of a DataFrame can be done by passing the column name as a string, in brackets." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 9E\n", "1 AA\n", "2 AS\n", "3 B6\n", "4 DL\n", "5 EV\n", "6 F9\n", "7 FL\n", "8 HA\n", "9 MQ\n", "10 OO\n", "11 UA\n", "12 US\n", "13 VX\n", "14 WN\n", "15 YV\n", "Name: carrier, dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "carrier_column = df['carrier']\n", "carrier_column" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Individual columns are pandas `Series` objects." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(carrier_column)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "How are Series different from DataFrames?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- They're always 1-dimensional" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- They have different attributes than DataFrames\n", " - For example, Series have a `to_list` method -- which doesn't make sense to have on DataFrames" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- They don't print in the pretty format of DataFrames, but in plain text (see above)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "(16,)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "carrier_column.shape" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(16, 2)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['9E',\n", " 'AA',\n", " 'AS',\n", " 'B6',\n", " 'DL',\n", " 'EV',\n", " 'F9',\n", " 'FL',\n", " 'HA',\n", " 'MQ',\n", " 'OO',\n", " 'UA',\n", " 'US',\n", " 'VX',\n", " 'WN',\n", " 'YV']" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"carrier_column.to_list()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [ "ci-skip" ] }, "outputs": [ { "ename": "AttributeError", "evalue": "'DataFrame' object has no attribute 'to_list'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[23], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m df\u001b[39m.\u001b[39mto_list()\n", "File \u001b[0;32m/usr/local/anaconda3/envs/uc-python/lib/python3.11/site-packages/pandas/core/generic.py:5989\u001b[0m, in \u001b[0;36mNDFrame.__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m 5982\u001b[0m \u001b[39mif\u001b[39;00m (\n\u001b[1;32m 5983\u001b[0m name \u001b[39mnot\u001b[39;00m \u001b[39min\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_internal_names_set\n\u001b[1;32m 5984\u001b[0m \u001b[39mand\u001b[39;00m name \u001b[39mnot\u001b[39;00m \u001b[39min\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_metadata\n\u001b[1;32m 5985\u001b[0m \u001b[39mand\u001b[39;00m name \u001b[39mnot\u001b[39;00m \u001b[39min\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_accessors\n\u001b[1;32m 5986\u001b[0m \u001b[39mand\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_info_axis\u001b[39m.\u001b[39m_can_hold_identifiers_and_holds_name(name)\n\u001b[1;32m 5987\u001b[0m ):\n\u001b[1;32m 5988\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39m[name]\n\u001b[0;32m-> 5989\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mobject\u001b[39m\u001b[39m.\u001b[39m\u001b[39m__getattribute__\u001b[39m(\u001b[39mself\u001b[39m, name)\n", "\u001b[0;31mAttributeError\u001b[0m: 'DataFrame' object has no attribute 'to_list'" ] } ], "source": [ "df.to_list()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, 
"source": [ "It's important to be familiar with Series because they are fundamentally the core of DataFrames.\n", "Not only are columns represented as Series, but so are rows!" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "carrier 9E\n", "name Endeavor Air Inc.\n", "Name: 0, dtype: object" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fetch the first row of the DataFrame\n", "first_row = df.loc[0]\n", "first_row" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(first_row)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Whenever you select individual columns or rows, you'll get Series objects." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What Can You Do with a Series?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "First, let's create our own Series object from scratch -- they don't need to come from a DataFrame." 
] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 10\n", "1 20\n", "2 30\n", "3 40\n", "4 50\n", "dtype: int64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Pass a list in as an argument and it will be converted to a Series.\n", "s = pd.Series([10, 20, 30, 40, 50])\n", "s" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "0 10\n", "1 20\n", "2 30\n", "3 40\n", "4 50\n", "dtype: int64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Pass a list in as an argument and it will be converted to a Series.\n", "s = pd.Series([10, 20, 30, 40, 50])\n", "s" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "There are 3 things to notice about this Series:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The values (10, 20, 30...)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The *dtype*, short for data type." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The *index* (0, 1, 2...)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Values\n", "Values are fairly self-explanatory; we chose them in our input list." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### dtype\n", "Data types are also straightforward." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Series are always homogeneous, holding only integers, floats, or generic Python objects (called just `object`)." 
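, "\n",
"\n",
"A brief sketch of how the dtype follows from the values, assuming pandas' default type inference (the variable names are illustrative, and the exact dtype names can vary with the pandas version):\n",
"```python\n",
"import pandas as pd\n",
"\n",
"ints = pd.Series([1, 2, 3])\n",
"floats = pd.Series([1.5, 2.5, 3.5])\n",
"strings = pd.Series(['a', 'b', 'c'])\n",
"\n",
"# Each Series reports a single dtype for all of its values.\n",
"print(ints.dtype)     # an integer dtype, typically int64\n",
"print(floats.dtype)   # float64\n",
"print(strings.dtype)  # usually object; newer pandas versions may report a string dtype\n",
"```"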
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Because a Python object is general enough to contain any other type, any Series holding strings or other non-numeric data will typically default to be of type `object`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "For example, going back to our carriers DataFrame, note that the carrier column is of type `object`." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 9E\n", "1 AA\n", "2 AS\n", "3 B6\n", "4 DL\n", "5 EV\n", "6 F9\n", "7 FL\n", "8 HA\n", "9 MQ\n", "10 OO\n", "11 UA\n", "12 US\n", "13 VX\n", "14 WN\n", "15 YV\n", "Name: carrier, dtype: object" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['carrier']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Index\n", "Indexes are more interesting.\n", "Every Series has an index, a way to reference each element.\n", "The index of a Series is a lot like the keys of a dictionary: each index element corresponds to a value in the Series, and can be used to look up that element." 
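, "\n",
"\n",
"To make the dictionary comparison concrete, here is a small sketch using a hypothetical `prices` dictionary and a Series built with the same labels and values:\n",
"```python\n",
"import pandas as pd\n",
"\n",
"prices = {'apple': 3, 'pear': 5}\n",
"price_series = pd.Series([3, 5], index=['apple', 'pear'])\n",
"\n",
"# Both structures look up a value by its label.\n",
"print(prices['pear'])\n",
"print(price_series['pear'])\n",
"```"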
] }, { "cell_type": "code", "execution_count": 29, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=5, step=1)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Our index is a range from 0 (inclusive) to 5 (exclusive).\n", "s.index" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 10\n", "1 20\n", "2 30\n", "3 40\n", "4 50\n", "dtype: int64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "40" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[3]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "In our example, the index is just the integers 0-4, so right now it looks no different than referencing elements of a regular Python list.\n", "*But* indexes can be changed to something different -- like the letters a-e, for example." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "a 10\n", "b 20\n", "c 30\n", "d 40\n", "e 50\n", "dtype: int64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.index = ['a', 'b', 'c', 'd', 'e']\n", "s" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Now to look up the value 40, we reference `'d'`." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "40" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s['d']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We saw earlier that rows of a DataFrame are Series.\n", "In such cases, the flexibility of Series indexes comes in handy;\n", "the index is set to the DataFrame column names." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
carriername
09EEndeavor Air Inc.
1AAAmerican Airlines Inc.
2ASAlaska Airlines Inc.
3B6JetBlue Airways
4DLDelta Air Lines Inc.
\n", "
" ], "text/plain": [ " carrier name\n", "0 9E Endeavor Air Inc.\n", "1 AA American Airlines Inc.\n", "2 AS Alaska Airlines Inc.\n", "3 B6 JetBlue Airways\n", "4 DL Delta Air Lines Inc." ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "carrier 9E\n", "name Endeavor Air Inc.\n", "Name: 0, dtype: object" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note that the index is ['carrier', 'name']\n", "first_row = df.loc[0]\n", "first_row" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This is particularly handy because it means you can extract individual elements based on a column name." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'9E'" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_row['carrier']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## DataFrame Indexes" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "It's not just Series that have indexes!\n", "DataFrames have them too.\n", "Take a look at the carrier DataFrame again and note the bold numbers on the left." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
carriername
09EEndeavor Air Inc.
1AAAmerican Airlines Inc.
2ASAlaska Airlines Inc.
3B6JetBlue Airways
4DLDelta Air Lines Inc.
\n", "
" ], "text/plain": [ " carrier name\n", "0 9E Endeavor Air Inc.\n", "1 AA American Airlines Inc.\n", "2 AS Alaska Airlines Inc.\n", "3 B6 JetBlue Airways\n", "4 DL Delta Air Lines Inc." ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "These numbers are an index, just like the one we saw on our example Series.\n", "And DataFrame indexes support similar functionality." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=16, step=1)" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Our index is a range from 0 (inclusive) to 16 (exclusive).\n", "df.index" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "When you load a DataFrame without specifying an index, the default index runs from 0 to N-1, where N is the number of rows in your DataFrame.\n", "This is called a `RangeIndex`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Selecting individual rows by their index is done with the `.loc` accessor.\n", "An *accessor* is an attribute designed specifically to help users reference something else (like rows within a DataFrame)." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "carrier DL\n", "name Delta Air Lines Inc.\n", "Name: 4, dtype: object" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the row at index 4 (the fifth row).\n", "df.loc[4]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "As with Series, DataFrames support reassigning their index."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "However, with DataFrames it often makes sense to change one of your columns into the index." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This is analogous to a primary key in relational databases: a way to rapidly look up rows within a table." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "In our case, maybe we will often use the carrier code (`carrier`) to look up the full name of the airline.\n", "In that case, it would make sense to set the carrier column as our index." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name
carrier
9EEndeavor Air Inc.
AAAmerican Airlines Inc.
ASAlaska Airlines Inc.
B6JetBlue Airways
DLDelta Air Lines Inc.
\n", "
" ], "text/plain": [ " name\n", "carrier \n", "9E Endeavor Air Inc.\n", "AA American Airlines Inc.\n", "AS Alaska Airlines Inc.\n", "B6 JetBlue Airways\n", "DL Delta Air Lines Inc." ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.set_index('carrier')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Now the RangeIndex has been replaced with a more meaningful index, and it's possible to look up rows of the table by passing a carrier code to the `.loc` accessor." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "name United Air Lines Inc.\n", "Name: UA, dtype: object" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['UA']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "

Caution!

\n", "

Pandas does not require that indexes have unique values (that is, no duplicates), although many relational databases do require that of a primary key. This means that it is *possible* to create a non-unique index, but highly inadvisable. If several rows share an index value, looking up that value returns all of those rows rather than a single one, which can cause unexpected results. Don't do it if you can help it!

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "When starting to work with a DataFrame, it's often a good idea to determine what column makes sense as your index and to set it immediately." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This will make your code nicer -- by letting you directly look up values with the index -- and also make your selections and filters faster, because Pandas is optimized for operations by index." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "If you want to change the index of your DataFrame later, you can always `reset_index` (and then assign a new one)." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name
carrier
9EEndeavor Air Inc.
AAAmerican Airlines Inc.
ASAlaska Airlines Inc.
B6JetBlue Airways
DLDelta Air Lines Inc.
\n", "
" ], "text/plain": [ " name\n", "carrier \n", "9E Endeavor Air Inc.\n", "AA American Airlines Inc.\n", "AS Alaska Airlines Inc.\n", "B6 JetBlue Airways\n", "DL Delta Air Lines Inc." ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
carriername
09EEndeavor Air Inc.
1AAAmerican Airlines Inc.
2ASAlaska Airlines Inc.
3B6JetBlue Airways
4DLDelta Air Lines Inc.
\n", "
" ], "text/plain": [ " carrier name\n", "0 9E Endeavor Air Inc.\n", "1 AA American Airlines Inc.\n", "2 AS Alaska Airlines Inc.\n", "3 B6 JetBlue Airways\n", "4 DL Delta Air Lines Inc." ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.reset_index()\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Your Turn\n", "\n", "The cell below loads the first 100 rows of the airports data as `airports`.\n", "The data contains the airport code, airport name, and some basic facts about the airport location." ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
faanamelatlonalttzdsttzone
004GLansdowne Airport41.130472-80.6195831044-5AAmerica/New_York
106AMoton Field Municipal Airport32.460572-85.680028264-6AAmerica/Chicago
206CSchaumburg Regional41.989341-88.101243801-6AAmerica/Chicago
306NRandall Airport41.431912-74.391561523-5AAmerica/New_York
409JJekyll Island Airport31.074472-81.42777811-5AAmerica/New_York
\n", "
" ], "text/plain": [ " faa name lat lon alt tz dst \\\n", "0 04G Lansdowne Airport 41.130472 -80.619583 1044 -5 A \n", "1 06A Moton Field Municipal Airport 32.460572 -85.680028 264 -6 A \n", "2 06C Schaumburg Regional 41.989341 -88.101243 801 -6 A \n", "3 06N Randall Airport 41.431912 -74.391561 523 -5 A \n", "4 09J Jekyll Island Airport 31.074472 -81.427778 11 -5 A \n", "\n", " tzone \n", "0 America/New_York \n", "1 America/Chicago \n", "2 America/Chicago \n", "3 America/New_York \n", "4 America/New_York " ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airports = pd.read_csv('../data/airports.csv')\n", "# Keep the first 100 rows; .iloc slices by position and excludes the end.\n", "airports = airports.iloc[0:100]\n", "airports.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "1. What kind of index is the current index of `airports`? \n", "2. Is this a good choice for the DataFrame's index? If not, what column or columns would be a better candidate?\n", "3. If you chose a different column to be the index, make it your index using `airports.set_index()`.\n", "4. Using your new index, look up \"Pittsburgh-Monroeville Airport\", code 4G0. What is its altitude?\n", "5. Reset your index in case you want to make a different column your index in the future." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Questions\n", "\n", "Are there any questions before we move on?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "rise": { "autolaunch": true, "transition": "none" } }, "nbformat": 4, "nbformat_minor": 4 }