{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data analysis with pandas\n", "\n", "> Objectives:\n", "> * Be able to load and save data using pandas\n", "> * Be able to access columns, rows, and elements in DataFrames and Series objects\n", "> * Be able to perform aggregate computations across different variables\n", "> * Be able to filter and subset data\n", "\n", "[pandas](http://pandas.pydata.org/) is a Python library that provides tools for processing and manipulating data.\n", "\n", "Typically, you will see pandas imported as \"`pd`\", which is shorter and therefore easier to type than the full name `pandas`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading data\n", "\n", "First and foremost, pandas gives us a really convenient way to read in data in CSV (\"comma separated value\") format. In this lesson, we have two CSV files containing information about precipitation in California. The first file we'll take a look at is `precip_monthly.csv`, which contains monthly aggregate data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!head precip_monthly.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To load it, we call `read_csv` function, and pandas automatically figures out how to read the file for us:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "monthly = pd.read_csv(\"precip_monthly.csv\")\n", "monthly" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The type of object that is returned is called a \"data frame\", and is one of two fundamental data types that pandas uses:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "type(monthly)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A data frame is essentially a table which has labeled rows and columns. By default when reading in a csv file, pandas will create a numerical index for the rows, and will use the first row of the CSV as the column names. However, having a numerical index isn't necessarily what we want. In this case, it might be more useful for us to have the rows correspond to different stations and the dates the data was recorded. To do this, we can use the `set_index` method of the data frame object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "monthly = monthly.set_index('station')\n", "monthly" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accessing rows, columns, and elements" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To access a column in a DataFrame, we index into the DataFrame as if it were a dictionary. For example, to get just the precipitation for each station and date:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "monthly['precip']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The type of object that is returned is a `Series` object, which is the 1D equivalent of a DataFrame. We can further index into this Series object, for example, to get the precipitation for one particular station:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "monthly['precip']['San Jose']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to access the data the other way around -- i.e., access the row(s) first, and then the `'precip'` column -- we need to index slightly differently, using the `.loc` attribute:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "monthly.loc['San Jose']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This returns another DataFrame, which we can then index as we saw earlier:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "monthly.loc['San Jose']['precip']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To summarize:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# column indexing --> Series\n", "monthly['precip']\n", "\n", "# column, then row indexing --> Series or element\n", "monthly['precip']['San Jose']\n", "\n", "# row indexing --> DataFrame or Series\n", "monthly.loc['San Jose']\n", "\n", "# row, then column indexing --> Series or element\n", "monthly.loc['San Jose']['precip']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1\n", "\n", "Load the monthly data again from the CSV file and extract only the data from the month of July.\n", "\n", "**Hint:** Take a look at `monthly.head()` and `monthly.columns` to see what data is available to you. \n", "**Hint:** Use `.set_index()` to set a certain column as the DataFrame index." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say we want to save out just the San Jose data. We can do this using the `.to_csv()` method of the DataFrame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "monthly.loc['San Jose'].to_csv('san_jose.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "!head san_jose.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregate computations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the really powerful operations that pandas can do involves splitting the data up into particular groups, performing some operation on each group, and then recombining the results.\n", "\n", "For example, how would we compute the total precipitation per year for each station? To do this, we want to:\n", "\n", "* split the data into groups, where there each group corresponds to one station and year\n", "* sum across the precipitation values for each group\n", "* recombine the resulting sums\n", "\n", "In pandas, this is really easy! First, we'll want to \"reset\" the index of our DataFrame so that everything is in columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "monthly = monthly.reset_index()\n", "monthly" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we use the `groupby` command to specify which columns should be used to form the groups:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gb = monthly.groupby(['station', 'year'])\n", "gb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can index into this \"groupby\" object just like a DataFrame, and select only the precipitation data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "gb['precip']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can calculate summary statistics on these groups. For example, a sum of the precipitation each year for each station:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gb['precip'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a Series object that has as its index labels for the stations and years. The values of the Series objects are the total precipitation for the corresponding station and year.\n", "\n", "Similar computations follow the same basic recipe. For example, to compute the *average* precipitation per *month*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gb['precip'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.apply()` function of the groupby object is incredibly powerful, and allows us to perform even more complex computations. We can use the `.apply()` function to perform *any* computation we can write a function for! For example, if we wanted to compute the mean and standard deviation of the precipitation in one go:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def stats(data):\n", " return pd.Series(\n", " [data.mean(), data.std()], # compute the mean and standard deviation of one particular group\n", " index=['mean', 'stddev'], # label the computed statistics\n", " name=data.name # give a name to the result, so pandas knows how to put everything\n", " # back together\n", " )\n", "\n", "gb['precip'].apply(stats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to make these statistics (`mean` and `stddev`) correspond to columns, rather than an additional level in the index, we can use the `.unstack()` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gb['precip'].apply(stats).unstack()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2\n", "\n", "Use the `.groupby()` method to compute the average yearly precipitation for each region, and modify the resulting DataFrame so that the rows correspond to regions, and the columns correspond to years. Store the result in a variable called `region_yearly_precip`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Filtering Data\n", "\n", "We've looked at how to group our data into subsets and how to use get subsets using indexing, but there are also other ways to create subsets of data.\n", "Above we used `.set_index` to move the `'station'` column into the index then grab the data for the `'San Jose'` station. Here's another way to do that using \"boolean indexing\":" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "monthly.loc[monthly.station == 'San Jose']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to filter is with the `.isin` method. This tests whether values are in another set of values.\n", "Say we want all the data for the winter months Dec, Jan, Feb:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "monthly.loc[monthly.month.isin(['Dec', 'Jan', 'Feb'])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the `.query` method is useful for more complex queries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "monthly.query('region == \"SAN FRANCISCO BAY\" and month in [\"Dec\", \"Jan\", \"Feb\"]')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3\n", "\n", "Use any of the above filtering methods to filter the `monthly` data to only those rows from the `'COLORADO RIVER'` subregion.\n", "\n", "(**Bonus:** Can you further filter the data so it's only for the years 1999 - 2003?)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }