{ "metadata": { "name": "", "signature": "sha256:af2901089fe431c3c9794d6878e7eeff568c2bc204f2f019c92180eac7e09600" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Geospatial Data in Python: Database, Desktop, and the Web\n", "## Tutorial (Part 0b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with Data in Python\n", "\n", "Data are central to GIS, and the tabular arrangement of data is very common. Accordingly, Python provides a large number of ways to read in tabular data. These vary depending on how the data are stored, where they are located, etc. To help keep things as simple as possible, the 'pandas' Python library provides a function, `read_csv()`, that allows you to access data files in tabular format on your computer as well as data stored in online repositories, such as one that a course instructor might set up for his or her students.\n", "\n", "If you successfully ran the install script, then you already have 'pandas' installed. Now, you simply need to `import pandas` in order to use `read_csv()`, as well as a variety of other 'pandas' functions that you will encounter later (it is also usually a good idea to `import numpy as np` at the same time that we `import pandas as pd`)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import numpy as np" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You need to do this only once in each session of Python, and on systems such as IPython, the library will sometimes be reloaded automatically. (If you get an error message, it\u2019s likely that the 'pandas' library has not been installed on your system.)\n", "\n", "Reading in a data table with `read_csv()` is simply a matter of knowing the name (and location) of the data set. For instance, one data table I use in my statistics classes is `\"swim100m.csv\"`. 
To read in this data table and create an object in Python that contains the data, use a command like this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "swim = pd.read_csv(\"http://www.mosaic-web.org/go/datasets/swim100m.csv\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The csv part of the name in `\"swim100m.csv\"` indicates that the file has been stored in a particular data format, comma-separated values, that is handled by spreadsheet software as well as many other kinds of software. The part of this command that requires creativity is choosing a name for the Python object that will hold the data. In the above command it is called `swim`, but you might prefer another name (e.g., `s` or `sdata` or even `ralph`). Of course, it's sensible to choose names that are short, easy to type and remember, and remind you what the contents of the object are about." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Frames\n", "\n", "The type of Python object created by `read_csv()` is called a data frame and is essentially a tabular layout. To illustrate, here are the first several cases of the `swim` data frame created by the previous use of `read_csv()`:\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "swim.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the `head()` function, one of several functions built into 'pandas' data frames, is a function of the Python object (data frame) itself, not of the main 'pandas' library.\n", "\n", "Data frames, like tabular data generally, involve variables and cases. In 'pandas' data frames, each of the variables is given a name. You can refer to the variable by name in a couple of different ways. 
To see the variable names in a data frame, something you might want to do to remind yourself of how names are spelled and capitalized, use the `columns` attribute of the data frame object:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "swim.columns" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we have **not** used parentheses `()` in the above command. This is because `columns` is not a function; it is an *attribute* of the data frame. Attributes add 'extra' information (or metadata) to objects in the form of additional Python objects. In this case, the attribute holds the names of the columns. Another way to get quick information about the variables in a data frame is with `describe()`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "swim.describe()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This provides a numerical summary of each of the numeric variables contained in the data frame. To keep things simple, the output from `describe()` is itself a data frame.\n", "\n", "There are lots of different functions and attributes available for data frames (and any other Python objects). For instance, to see how many cases and variables there are in a data frame, you can use the `shape` attribute:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "swim.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Variables in Data Frames\n", "\n", "Perhaps the most common operation on a data frame is to refer to the values in a single variable. The two ways you will most commonly use involve referring to a variable by string-quoted name (`swim[\"year\"]`) and as an attribute of a data frame without quotes (`swim.year`)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Each column or variable in a 'pandas' data frame is called a 'series', and each series can contain one of many different data types. For more information on series, data frames, and other objects in 'pandas', [have a look here][intro].\n", "\n", "\n", "[intro]: http://pandas.pydata.org/pandas-docs/dev/dsintro.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the statistical/mathematical functions you will encounter in this tutorial are designed to work with data frames and allow you to refer directly to variables within a data frame. For instance:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "swim.year.mean()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "swim[\"year\"].min()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to combine 'numpy' functions with 'pandas' variables:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "np.min(swim[\"year\"])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "np.min(swim.year)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `swim` portion of the above commands tells Python which data frame we want to operate on. Leaving off the data frame name leads to an error, because `year` on its own is not defined:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "year.min()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The advantage of referring to variables by name becomes evident when you construct statements that involve more than one variable within a data frame. 
For instance, here's a calculation of the mean year, separately for (grouping by) the different sexes:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "swim.groupby('sex')['year'].mean()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both the `mean()` and `min()` functions have been arranged by the 'pandas' library to look in the data frame when interpreting variables, but not all Python functions are designed this way. For instance, the following produces an error because a series has no `sqrt()` function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "swim.year.sqrt()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you encounter a function that isn't supported by data frames, you can use 'numpy' functions and the special `apply` function built into data frames (note that writing the `func=` keyword is optional; `swim.year.apply(np.sqrt)` does the same thing):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "swim.year.apply(func=np.sqrt).head() # There are 62 cases in total" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, since columns are basically just arrays, we can use built-in 'numpy' functions directly on the columns:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "np.sqrt(swim.year).head() # Again, there are 62 cases in total" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding a New Variable\n", "\n", "Sometimes you will compute a new quantity from the existing variables and want to treat this as a new variable. Adding a new variable to a data frame can be done similarly to *accessing* a variable. For instance, here is how to create a new variable in `swim` that holds the `time` converted from seconds to units of minutes:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "swim['minutes'] = swim.time/60. # or swim['time']/60." 
], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, columns get inserted at the end. The `insert` function is available to insert a column at a particular location among the columns. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "swim.insert(1, 'mins', swim.time/60.)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You could also, if you want, redefine an existing variable, for instance:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "swim['time'] = swim.time/60." ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As always, we can take a quick look at the results of our operations by using the `head()` function of our data frame:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "swim.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Such assignment operations do not change the original file from which the data were read, only the data frame in the current session of Python. This is an advantage, since it means that your data in the data file stay in their original state and therefore won\u2019t be corrupted by operations made during analysis." ] } ], "metadata": {} } ] }