{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start with a hypothetical problem we want to solve. We are interested in understanding the relationship between the weather and the number of mosquitos occurring in a particular year so that we can plan mosquito control measures accordingly. Since we want to apply these mosquito control measures at a number of different sites, we need to understand both the relationship at a particular site and whether or not it is consistent across sites. The data we have to address this problem come from the local government and are stored in tables in comma-separated values (CSV) files. Each file holds the data for a single location, each row holds the information for a single year at that location, and the columns hold the data on both mosquito numbers and the average temperature and rainfall from the beginning of mosquito breeding season. The first few rows of our first file look like:\n", "\n", "~~~\n", "year,temperature,rainfall,mosquitos\n", "2001,87,222,198\n", "2002,72,103,105\n", "2003,77,176,166\n", "~~~" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download the data archive from http://bit.ly/mosquitodata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Objectives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Conduct variable assignment, looping, and conditionals in Python\n", "* Use an external Python library\n", "* Read tabular data from a file\n", "* Subset and perform analysis on data\n", "* Display simple graphs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to load the data, we need to import a library called Pandas that knows\n", "how to operate on tables of data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now use Pandas to read our data file." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pandas.read_csv('A1_mosquito_data.csv', sep=',')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `read_csv()` function belongs to the `pandas` library. In order to run it we need to tell Python that it is part of `pandas` and we do this using the dot notation, which is used everywhere in Python to refer to parts of larger things.\n", "\n", "When we are finished typing and press Shift+Enter, the notebook runs our command and shows us its output. In this case, the output is the data we just loaded." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first column on the left is the `index` column, which `pandas` uses to label each row. By default it just uses integers.\n", "However, in this case we have a better way of indexing the data: the `year` column. We can use the `index_col` keyword argument to `read_csv()` to specify the index column:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pandas.read_csv('A1_mosquito_data.csv', sep=',', index_col='year')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our call to `pandas.read_csv()` read data into memory, but didn't save it anywhere. To do that, we need to assign the data frame to a variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = pandas.read_csv('A1_mosquito_data.csv', sep=',', index_col='year')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This statement doesn't produce any output because assignment doesn't display anything. 
If we want to check that our data has been loaded, we can print the variable's value:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`print(data)` tells Python to display the data frame as text. Alternatively, we could just include `data` as the last value in a code cell:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tells the IPython Notebook to display the `data` object, which is why we see a nicely formatted table." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Manipulating data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have imported the data we can start doing things with it. First, let's ask what type of thing `data` refers to:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(type(data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data is stored in a data structure called a DataFrame. There are other kinds of data structures that are also commonly used in scientific computing, including NumPy arrays and NumPy matrices, which can be used for doing linear algebra." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can select an individual column of data using its name:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(data['temperature'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can select several columns of data at once:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(data[['rainfall', 'temperature']])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also select subsets of rows using slicing. Say we just want the first two rows of data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(data[:2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a couple of important things to note here. First, Python indexing starts at zero. In contrast, programming languages like R and MATLAB start counting at 1, because that's what human beings have done for thousands of years. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that's simpler for computers to do. This means that if we have 5 things in Python they are numbered 0, 1, 2, 3, 4, and the first row in a data frame is always row 0.\n", "\n", "The other thing to note is that the subset of rows starts at the first value and goes up to, but does not include, the second value. Again, the up-to-but-not-including takes a bit of getting used to, but the rule is that the difference between the upper and lower bounds is the number of values in the slice." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python slices are half-open intervals that include the start but exclude the stop: $[1, 4)$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(data[1:4])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, most of the time we want to select rows by their labels in the `index`, using `.loc`. Unlike positional slices, label-based slices with `.loc` include both endpoints." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.loc[2004:2008]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.loc[2004:2008][\"temperature\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.loc[2004]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also subset the data based on the values in other columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data['temperature'][data['mosquitos'] > 200]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Challenge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print the number of mosquitos for the years between 2005 and 2008 in which the temperature was above 75 degrees." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data['mosquitos'][data['temperature'] > 75].loc[2005:2008]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data['mosquitos'].loc[2005:2008][data['temperature'] > 75]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data frames also know how to perform common mathematical operations on their values. 
If we want to find the average value for each variable, we can just ask the data frame for its mean values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(data.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data frames have lots of useful methods:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(data.max())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(data['temperature'].min())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(data['mosquitos'][1:3].std())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Challenge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the data from `A2_mosquito_data.csv`, create a new variable that holds a data frame with only the weather data, and print the means and standard deviations for the weather variables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary of when to use parentheses and square brackets" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "abs(-1) # call a function\n", "data.max() # call a method on an object\n", "\n", "data[:3] # slice rows by position\n", "data[\"temperature\"] # select a column of a data frame by name\n", "dataslice = data.loc[2004:2008] # special case: .loc slices rows by index label, both endpoints included" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mathematician Richard Hamming once said, \"The purpose of computing is insight, not numbers,\" and the best way to develop insight is often to visualize data. The main plotting library in Python is `matplotlib`. 
To get started, let's tell the IPython Notebook that we want our plots displayed inline, rather than in a separate viewing window:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `%` at the start of the line signals that this is a command for the notebook, rather than a statement in Python. Next, we will import the `pyplot` module from `matplotlib`, but since `pyplot` is a fairly long name to type repeatedly let's give it an alias." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from matplotlib import pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This import statement shows two new things. First, we can import part of a library by using the `from library import submodule` syntax. Second, we can use a different name to refer to the imported library by using `as newname`.\n", "\n", "Now, let's make a simple plot showing how the number of mosquitos varies over time. We'll use the site you've been doing exercises with since it has a longer time-series." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = pandas.read_csv('A2_mosquito_data.csv', index_col=\"year\")\n", "data['mosquitos'].plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More complicated plots can be created by adding a little additional information. Let's say we want to look at how the different weather variables vary over time." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# temperature over time: red circles joined by lines\n", "plt.plot(data.index, data['temperature'], 'ro-')\n", "plt.xlabel('Year')\n", "plt.ylabel('Temperature')\n", "\n", "# start a second figure for rainfall: blue squares joined by lines\n", "plt.figure()\n", "plt.plot(data.index, data['rainfall'], 'bs-')\n", "plt.xlabel('Year')\n", "plt.ylabel('Rainfall')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Challenge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the data in `A2_mosquito_data.csv`, plot the relationship between the number of mosquitos and temperature, and between the number of mosquitos and rainfall." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Key Points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Import a library into a program using `import libraryname`.\n", "* Use the `pandas` library to work with data tables in Python.\n", "* Use `variable = value` to assign a value to a variable.\n", "* Use `print(something)` to display the value of `something`.\n", "* Use `dataframe['columnname']` to select a column of data.\n", "* Use `dataframe[start_row:stop_row]` to select rows from a data frame.\n", "* Indices start at 0, not 1.\n", "* Use `dataframe.mean()`, `dataframe.max()`, and `dataframe.min()` to calculate simple statistics.\n", "* Use `for x in list:` to loop over values.\n", "* Use `if condition:` to make conditional decisions.\n", "* Use the `pyplot` module from `matplotlib` for creating simple visualizations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the requisite Python background out of the way, we're now ready to dig into analyzing our data, and along the way learn how to write better code, more efficiently, that is more likely to be correct." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }