{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab Lesson\n", "\n", "## Processing Data with Python, Part 1\n", "\n", "### Topics\n", "\n", "* introduction to/motivation of pandas\n", "* analyzing real-life data\n", "* Series\n", "* DataFrames\n", "* manipulating DataFrames with Index\n", "\n", "### Resources\n", "\n", "For more information, see the following books on Python and data analysis:\n", "\n", "* [The Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/), by Jake VanderPlas. This link is to the entire book in Jupyter notebook form.\n", "* [Python for Data Analysis](https://proquest.safaribooksonline.com/9781491957653), by Wes McKinney. This book is available for free through the Pitt library. \n", "\n", "### Exercises\n", "\n", "Back to the normal notebook-of-exercises format for the next few weeks! The exercises for this week build on what we talked about today, and they primarily focus on a set of data about individual Chipotle orders. \n", "\n", "**TURNING IT IN:** Submit your completed Jupyter notebook on Canvas by the deadline. Don't submit any other files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# So, what is pandas anyway?\n", "\n", "`pandas` is a third-party Python library for doing data analysis. It was originally created by [Wes McKinney](http://wesmckinney.com/pages/about.html), who at the time of initial development was working in finance, and it is now maintained by a community of contributors. Suffice it to say, **pandas is very much beta software**, like JupyterLab. There are going to be some bugs, design flaws, and other issues that you're going to run into; I'll address them as they come up in class, and if you run into them on your own, we can work through them together in office hours.\n", "\n", "pandas is a foundational part of using Python for data science. Most if not all things that pandas does can be done with plain-jane Python, but, most of the time, pandas does them *faster* and *easier*. 
It's built on top of another extremely powerful third-party Python library called `numpy`; if you are doing data analysis on a huge scale, consider `numpy`.\n", "\n", "## Why are *we* using pandas?\n", "\n", "We're using pandas because it has a powerful set of *structures* and *functions* that make working with large datasets simple. Once you learn these structures and functions, it becomes extremely easy to answer any question you want to ask with a given data set. This is my definition of **data analysis**: answering questions asked of data.\n", "\n", "pandas also interacts nicely with a bunch of other Python libraries and programs, which make doing data science easier.\n", "\n", "* Jupyter notebooks, of course, allow you to construct computational narratives with code, data, and text. Displaying dataframes (one of pandas' data structures) as an inline HTML table is one of the major interactions between Jupyter and pandas.\n", "* [Matplotlib](https://matplotlib.org/) is an incredibly powerful graphing library for Python. Generating plots from dataframes is simple with matplotlib and pandas.\n", "* pandas also integrates with scientific computing/machine learning Python libraries, like [SciKit](http://scikit-learn.org/stable/) and [SciPy](https://www.scipy.org/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# pandas basics\n", "\n", "Importing pandas is simple." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use the `as` keyword in Python to specify a different name for a library you're importing. Here, I'm telling Python that I'm going to use the term `pd` whenever I want to use pandas functions. 
It's basically like giving a Python library a nickname.\n", "\n", "That second line, `%matplotlib inline`, is a specifier within Jupyter for the `matplotlib` library, which, as we already discussed, is for making 2D graphs and charts from data. This specifies that any plots we generate are displayed below the code block, like Jupyter normally does. We'll see this later.\n", "\n", "## Diving into data\n", "\n", "We're going to do some stuff with a dataset that I provided for y'all here. It contains information about attendance at various community centers in Pittsburgh. \n", "\n", "I said at the beginning of this lecture that pandas allows us to ask questions of data and get answers from it. In our analysis, we want to answer the following question of this dataset:\n", "\n", "     **Which community center has had the most attendance over time?**\n", "\n", "**NOTE**: It's important to look at the **comments** in the code we're running through today. I'll be describing what's going on, line by line, by commenting my code. Confused as to what something does? Look at the comments first." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load in the community center data file\n", "data = pd.read_csv(\"community-center-attendance-2019.csv\", index_col=\"date\", parse_dates=True)\n", "\n", "# look at the first ten rows of the data\n", "data.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, this has the name of the community center and the attendance, all organized in chronological order. Let's see what the attendance values look like over time." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# no prizes for guessing what this does\n", "data.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can **pivot** the data so the center names are columns and each row is the number of people attending that community center per day. This is basically rotating the data; it's a common data operation. You'll often hear accountant-types talking about \"pivot tables\" in Excel. This is like that, but way better." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use the pivot function to make row values into data columns\n", "data.pivot(columns=\"center_name\", values=\"attendance_count\").head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's a little ugly, because whenever there's no attendance data for a center on a given day, the default is `NaN`, or \"not a number\". Thanks, Python. That's really useful.\n", "\n", "Maybe we should consider separating the attendance values for each center. Let's first get a list of how much attendance data we have for each center." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# count the number of rows per center\n", "data.groupby(\"center_name\").count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hm. We don't have a ton of data for quite a few centers, either because they don't report attendance all that often, or they just aren't open all that often. \n", "\n", "So, we're going to write a function that'll filter out any community center that doesn't have a lot of attendance data. We'll apply that filter to every group of rows in the dataframe (one group per center) using the [groupby filter function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.filter.html). \n", "\n", "This is... a little esoteric. 
What we're doing is handing `filter` a function *we write*. Instead of deciding anything itself, `filter` calls our function once per community center and keeps only the centers where it returns `True`. (A short function like this is often written inline as a *lambda*.) **This sounds more complicated than it is.** All we're doing is testing to see if each community center has more than 1000 attendance values; that's our standard." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create a function we will use to perform a filtering operation on the data\n", "# filter out centers that have 1000 or fewer total entries\n", "# (the name has_enough_entries is just a placeholder; any name works)\n", "def has_enough_entries(group, threshold):\n", "    # group holds all of the rows for one community center\n", "    return len(group) > threshold\n", "\n", "# use the custom function to filter out rows\n", "popular_centers = data.groupby(\"center_name\").filter(has_enough_entries,\n", "                                                     threshold=1000)\n", "# look at what centers are in the data now\n", "popular_centers.groupby(\"center_name\").count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That looks better. Now we've got some good data to work with! Let's look at the data again, now that we've filtered it. \n", "\n", "**NOTE**: That code stores the filtered result under a new name. The original `data` is unchanged; the filtered dataset is called `popular_centers`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the first 5 rows\n", "popular_centers.head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot it\n", "popular_centers.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This plot... isn't great. It just doesn't do a great job at displaying anything useful. Let's try pivoting again, now that we've eliminated some of the sparser centers. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use the pivot function to make rows into columns with only the popular community centers\n", "pivoted_data = popular_centers.pivot(columns=\"center_name\", values=\"attendance_count\")\n", "pivoted_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not nearly as bad as before! Now, we can plot the attendance over time of individual community centers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot the data! \n", "pivoted_data.plot(figsize=(10,10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pretty messy, but definitely better. Now, let's calculate the **cumulative sum**, a measure that will add up attendance over time. This can give us both attendance values *and* a general sense of how they're changing over time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compute the cumulative sum for every column and make a chart\n", "pivoted_data.cumsum().plot(figsize=(10,10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Brookline seems to be the winner, but attendance isn't growing as it has been in the past. Let's look at month-by-month attendance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# resample and compute the monthly totals for the popular community centers\n", "pivoted_data.resample(\"M\").sum().plot(figsize=(10,10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmm, that's way too much noise. Let's plot it by year, instead. That's as simple as changing the `M` to a `Y`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# resample and compute the yearly totals for the popular community centers\n", "pivoted_data.resample(\"Y\").sum().plot(figsize=(10,10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now, hey presto, we've turned this unwieldy, huge dataset into something that we can use to answer questions. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Let's talk data structures\n", "\n", "The major thing that pandas brings to Python is three new data structures: the Series, the Dataframe, and the Index. \n", "\n", "## Series: a list, but a dictionary\n", "\n", "A series is a *one-dimensional array* of *indexed* data. Let's break that down.\n", "\n", "* **one-dimensional**: only one column of data\n", "* **array**: a contiguous representation of the data (i.e. data[2] comes after data[1])\n", "* **indexed**: the data has indices that you use to access the data. These can be numbers or other, more complicated keys.\n", "\n", "Here's a default, empty Series. What do you get when you run that? What about if you take the `type()` of it?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.Series()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can create a Series from a single Python list. When you do that, the indices are just numbers, starting from 0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create a regular Python list\n", "my_list = [0.25, 0.5, 0.75, 1.0]\n", "\n", "# transform that list into a Series\n", "data = pd.Series(my_list)\n", "\n", "# display the data in the series\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Series is kind of like a list, in that *order matters*. The first element will always be before the second element, and so on. 
You can use indexing to grab an item in a series, like so:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the first item in the series\n", "data[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pandas makes this way more powerful and a little harder with the `.iloc` indexer. `.iloc` selects by integer position, no matter what the index labels are. This is the idiomatic \"pandas-y\" way to do it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the first item, but pandas-y, with iloc\n", "data.iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# go crazy, grab the fourth element\n", "data.iloc[3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of the normal Python indexing styles work with `.iloc`. You can use *slicing* to grab certain sub-lists within a Series." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the first two elements\n", "data.iloc[0:2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the last two elements\n", "data.iloc[2:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Alternate indices\n", "\n", "You can also think of a Series as a Python dictionary. The main difference is that **order matters in a Series**. You can grab things by their name, in addition to their numerical position, and you can generate a Series from a dictionary." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create a regular Python Dictionary\n", "population_dict = {'California': 38332521,\n", "                   'Texas': 26448193,\n", "                   'New York': 19651127,\n", "                   'Florida': 19552860,\n", "                   'Illinois': 12882135}\n", "\n", "# make that dictionary into a Series \n", "population = pd.Series(population_dict)\n", "\n", "# display the data\n", "population" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also make a named Series from two independent Python lists." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create two ordered lists\n", "population_list = [38332521, 26448193, 19651127, 19552860, 12882135]\n", "states = ['California', 'Texas', 'New York', 'Florida', 'Illinois']\n", "\n", "# Create a Series from those two lists\n", "population = pd.Series(population_list, index=states)\n", "\n", "# display the data\n", "population" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can index and slice, just like we did before, but you can use the keys instead of the numbers! You use the `.loc` indexer, which just stands for \"location\", to access items by their named keys." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# select the data value with the name \"California\"\n", "population.loc['California']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can do slicing with these keys, too! Unlike a slice by position, a slice by key includes both endpoints." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "population.loc['Texas':'Florida']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Group Assignment**: some indexing questions\n", "\n", "Answer the following questions concerning `.iloc`, `.loc` and Series indexing with pandas.\n", "\n", "1. 
Make a pandas Series from a Python list that contains the following numbers: `[0.1, 0.3, 0.75, 1.2, 1.6]`\n", "2. Use `iloc` and slices to grab the second and third elements of your Series.\n", "3. Grab the last element of your Series.\n", "4. **As a challenge**, grab the last element of your Series without using the length of the list. You can't use slicing or just say `iloc[4]`. There's a way to do it, search around. (Hint: think small.)\n", "5. Create another Series from the following dictionary:\n", " ```\n", " {'Aaron': 65,\n", " 'Lauren': 24,\n", " 'Joseph': 49,\n", " 'Mallory': 32,\n", " 'Eric': 19,\n", " 'Jeff': 84}\n", " ```\n", "6. Print out Lauren's age, using the `.loc` method.\n", "7. Print out the elements between Joseph and Jeff using `.loc`.\n", "8. Do the same thing, but with numerical slicing using `.iloc`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataframe: adding a dimension\n", "\n", "A pandas DataFrame is the main way to display and manipulate data with pandas. You're going to be suuuuuper familiar with DataFrames by the time we're done here. \n", "\n", "A DataFrame is just a two-dimensional Series. It's like a table, or an Excel spreadsheet. Functionally, it's a Series-of-Serieses, a bunch of Series lined up together." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# remember our population Series?\n", "population" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's make one for the areas, too!\n", "area_dict = {'Illinois': 149995, 'California': 423967, \n", "             'Texas': 695662, 'Florida': 170312, \n", "             'New York': 141297}\n", "area = pd.Series(area_dict)\n", "area" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# now, we create a dictionary containing both of our Series\n", "# meta, huh?\n", "state_info_dictionary = {'population': population,\n", "                         'area': area}\n", "\n", "# then we mash them together into a DataFrame\n", "states = pd.DataFrame(state_info_dictionary)\n", "# let's check our work!\n", "states" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pandas just takes care of lining everything up, because our indices (state names) are the same across our two Series!\n", "\n", "You can also generate a DataFrame from a list of dictionaries, and from a list of lists." 
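, "\n",
"\n",
"Before moving on to those, here is a small sketch of the index alignment described above, for the case where the labels do *not* all match. The Series names and numbers are made up purely for illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# made-up Series whose labels only partly overlap (values are placeholders)\n",
"pop = pd.Series({'Texas': 3, 'Florida': 2})\n",
"size = pd.Series({'Florida': 20, 'Ohio': 10})\n",
"\n",
"# pandas lines the values up by label, not by position\n",
"combined = pd.DataFrame({'population': pop, 'area': size})\n",
"print(combined)\n",
"```\n",
"\n",
"Florida has both values. Texas has no area and Ohio has no population, so those two cells come out as `NaN`."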
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create a list of dictionaries that contain our data.\n", "# one dictionary per observation/row\n", "dead_people = [\n", " {\"ssn\":1, \"first_name\": \"Bob\", \"last_name\": \"Jones\", \"age\": 200},\n", " {\"ssn\":2, \"first_name\": \"Jane\", \"last_name\": \"Jones\", \"age\": 199},\n", " {\"ssn\":3, \"first_name\": \"Ethel\", \"last_name\": \"Jones\", \"age\": 180},\n", " {\"ssn\":4, \"first_name\": \"Hortense\", \"last_name\": \"Jones\", \"age\": 178},\n", " {\"ssn\":5, \"first_name\": \"Vern\", \"last_name\": \"Jones\", \"age\": 178}\n", "]\n", "\n", "# create a DataFrame from a list of dictionaries\n", "pd.DataFrame(dead_people)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you create a DataFrame from a list of lists, you can either specify the row indices, or it'll automatically number them, starting at 0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# auto-number the rows\n", "\n", "# create a list of lists, each sub-list is an observation/row\n", "dead_people = [\n", " [1,\"Bob\",\"Jones\",200],\n", " [2,\"Jane\",\"Jones\",199],\n", " [3,\"Ethel\",\"Jones\",180],\n", " [4,\"Hortense\",\"Jones\",178],\n", " [5,\"Vern\",\"Jones\",178]\n", "]\n", "\n", "# specify the column names separately\n", "column_names = [\"ssn\",\"first_name\", \"last_name\", \"age\"]\n", "\n", "# make a DataFrame with column names specified separately\n", "pd.DataFrame(dead_people, columns=column_names)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# specify the row indices\n", "\n", "# create a list of lists, each sub-list is an observation/row\n", "dead_people = [\n", " [1,\"Bob\",\"Jones\",200],\n", " [2,\"Jane\",\"Jones\",199],\n", " [3,\"Ethel\",\"Jones\",180],\n", " [4,\"Hortense\",\"Jones\",178],\n", " [5,\"Vern\",\"Jones\",178]\n", "]\n", "\n", "# specify the 
column names separately\n", "column_names = [\"ssn\",\"first_name\", \"last_name\", \"age\"]\n", "\n", "row_ids = [123,3452,3235,4345,563463]\n", "\n", "# make a DataFrame with both column names and row indices specified\n", "dead_dataframe = pd.DataFrame(dead_people, columns=column_names, index=row_ids)\n", "dead_dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Index and DataFrame slicing\n", "\n", "In pandas, the Series and the DataFrame are both containers for data; they store information. The Index is what makes that data useful.\n", "\n", "* In a Series, the Index is the key to each value in the list.\n", "* In a DataFrame, there are two: the column labels and the row labels.\n", "\n", "You can use indexes to line up otherwise disparate datasets, as we did with `population` and `area`. Remember our table of states?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "states" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also access the column and row labels, programmatically, using the `.columns` and `.index` attributes. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the column labels as a list-like data structure\n", "states.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the row labels as a list-like data structure\n", "states.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `.loc` and `.iloc`, and indexing, on DataFrames too! `.loc` lets us select specific rows and columns by their name. The syntax for indexing a DataFrame is `[ROW, COLUMN]`, where `ROW` and `COLUMN` are the index values." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's get Illinois' population\n", "states.loc[\"Illinois\", \"population\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can get really powerful, really quickly. \n", "\n", "You can use slicing within a row or column when indexing a DataFrame; this includes the `:` operator, which selects all indices. You can also use a list in place of a row or column to select all items on that list from the DataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# two-dimensional slicing\n", "\n", "# get the area for states from Florida to Texas\n", "states.loc[\"Florida\":\"Texas\", \"area\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# using a list to select values\n", "\n", "# get the area for Florida and Texas\n", "states.loc[[\"Florida\", \"Texas\"], \"area\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use a \":\" to specify \"all columns\"\n", "\n", "# get area and population for Florida and Texas\n", "states.loc[[\"Florida\", \"Texas\"], :]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# select all the rows and columns\n", "states.loc[:,:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And this all works for `.iloc`, too." 
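, "\n",
"\n",
"One difference to keep in mind when moving between the two: a `.loc` slice includes both of its endpoints, while an `.iloc` slice stops just before the stop position, like a regular Python list slice. Here is a minimal sketch of that. `demo` is a stand-in table built only for this illustration (it reuses the area numbers from earlier), so the block runs on its own.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# stand-in table for illustration, with the state names in sorted order\n",
"demo = pd.DataFrame(\n",
"    {'area': [423967, 170312, 149995, 141297, 695662]},\n",
"    index=['California', 'Florida', 'Illinois', 'New York', 'Texas'],\n",
")\n",
"\n",
"# label slice: both endpoints are included\n",
"by_label = demo.loc['Florida':'New York', 'area']\n",
"\n",
"# position slice: the stop position is excluded\n",
"by_position = demo.iloc[1:3, 0]\n",
"\n",
"print(list(by_label.index))\n",
"print(list(by_position.index))\n",
"```\n",
"\n",
"The label slice returns three rows and the position slice returns two, even though both start at Florida."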
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# two-dimensional slicing\n", "\n", "# get the area for states from Florida to Texas\n", "states.iloc[1:, 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# using a list to select values\n", "\n", "# get the area for Florida and Texas\n", "states.iloc[[1, 4], 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# using -1 in the list to pick the last row\n", "\n", "# get the area for Florida and Texas\n", "states.iloc[[1, -1], 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# using a \":\" to specify \"all columns\"\n", "\n", "# get area and population for Florida and Texas\n", "states.iloc[[1, -1], :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Iterating through a DataFrame\n", "\n", "Sometimes you might want to go through each row in a DataFrame one at a time. Luckily, pandas has a method called `iterrows` that lets you accomplish this. The basic blueprint looks like this:\n", "\n", "```python\n", "# df is the name of a dataframe that we have already created\n", "for index, row in df.iterrows():\n", "    # do stuff with each row\n", "```\n", "\n", "Note that when you loop through the data frame, it *unpacks* the row into two separate parts: the row index, and the actual contents of the row.\n", "\n", "The loop variable `row` is a pandas Series that you can treat like a dictionary. Each column name in the data frame is a key; the value associated with the key is the value of that column in that particular row.\n", "\n", "Now let's see some examples of how we can use `iterrows` to loop through a dataframe. We will use our `states` dataframe in all of these examples." 
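, "\n",
"\n",
"If you want to check what `iterrows` hands you on each pass, a quick sketch like the following can help. It uses a tiny made-up table called `small` (hypothetical names and numbers) so that it runs on its own.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# a tiny made-up table, just for this check\n",
"small = pd.DataFrame(\n",
"    {'population': [100, 200], 'area': [10, 20]},\n",
"    index=['A', 'B'],\n",
")\n",
"\n",
"seen = []\n",
"for index, row in small.iterrows():\n",
"    # record the row label, the type of row, and the keys it offers\n",
"    seen.append((index, type(row).__name__, list(row.index)))\n",
"\n",
"print(seen)\n",
"```\n",
"\n",
"Each `row` comes back as a Series whose labels are the column names, which is why `row['population']` works the way a dictionary lookup would."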
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Print out every index one by one" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use states.iterrows() because we are iterating through the states dataframe\n", "for index, row in states.iterrows():\n", "    print(index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Print out every row one by one" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use states.iterrows() because we are iterating through the states dataframe\n", "for index, row in states.iterrows():\n", "    print(row)\n", "    print() # add a blank line between each row" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each row is a Series with two labels, `population` and `area`, which work like dictionary keys. The values associated with those keys are the population and area of that row." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Printing out the population for each row" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use states.iterrows() because we are iterating through the states dataframe\n", "for index, row in states.iterrows():\n", "    population = row['population']\n", "    print(population)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Printing out the area for each row" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use states.iterrows() because we are iterating through the states dataframe\n", "for index, row in states.iterrows():\n", "    area = row['area']\n", "    print(area)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding up the total population in all the rows" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create a variable to keep track of the total population\n", "total = 0\n", "\n", "# loop through each row one at a 
time\n", "for index, row in states.iterrows():\n", "    # get the population of the current row\n", "    population = row['population']\n", "    \n", "    # update the total to include the current population\n", "    total = total + population\n", "\n", "print(total)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## One last note\n", "As you can imagine, being able to loop through the rows opens up a lot of possibilities for performing data analysis. However, writing a for loop to go through each row manually is actually quite inefficient. Luckily, pandas has many built-in functions that allow you to accomplish the tasks that you would want to accomplish with looping. These *functions* take advantage of the *structure* of the dataframe! You'll learn about all of this next week." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Structure and Function\n", "\n", "Take a look at the `community-center-attendance-2019.csv` file one more time. How is the data structured in the CSV? How does the way the data is stored in the CSV affect how it is read in Python? How could I restructure this data to make certain data analysis tasks faster?\n", "\n", "What about in pandas? How do the data structures in pandas versus vanilla Python inform our use of the library?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The end!\n", "\n", "That's it for today! There's a bit more slicing in your lab exercises for the week, and some more serious data analysis as well. Any questions before we begin?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "nteract": { "version": "0.12.3" } }, "nbformat": 4, "nbformat_minor": 4 }