{ "metadata": { "name": "", "signature": "sha256:7418e89cae3044231b13f1dbeec637f5690e04d9e7076b1431efc244d171bed9" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "EOAS Python - Reading and Exploring CSV Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Learning Goals\n", "- Read CSV data from files into NumPy data structures,\n", "using [pandas](http://pandas.pydata.org/)\n", "- Use list boolean slices to select data\n", "- Use requests to get data from the web" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Getting the Data\n", "- Use your browser to go to [Environment Canada Climate Data](http://climate.weather.gc.ca/) and work your way through to the \"Hourly Data Report\" for yesterday at the *Vancouver Intl A* station.\n", "- Download the August 2013 hourly data as a CSV file\n", "- Use your shelf skills to confirm that:\n", " * You really got a CSV file\n", " * It's for the *Vancouver Intl A* station\n", " * It contains hourly data for the whole month of August 2013\n", "- Move or copy the CSV file into the *data-explore/* directory in your repo and commit it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Reading CSV Data Files" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Earlier we used `numpy.loadtxt()` to read our csv file.\n", "Now we will use the [pandas](http://pandas.pydata.org/) data analysis library,\n", "that handles data better;\n", "in particular, \n", "headers in our files.\n", "\n", "However even with panda if we try the simplest possible thing:\n", "```python\n", "data = pd.read_csv('eng-hourly-08012013-08312013.csv')\n", "```\n", "we get a dismaying number of errors.\n", "\n", "Let's take a look at the data.\n", "You can use shell commands within notebook cells by prefixing them with `!`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!head eng-hourly-08012013-08312013.csv" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\"Station Name\",\"VANCOUVER INTL A\"\r\n", "\"Province\",\"BRITISH COLUMBIA\"\r\n", "\"Latitude\",\"49.19\"\r\n", "\"Longitude\",\"-123.18\"\r\n", "\"Elevation\",\"4.30\"\r\n", "\"Climate Identifier\",\"1108395\"\r\n", "\"WMO Identifier\",\"71892\"\r\n", "\"TC Identifier\",\"YVR\"\r\n", "\"All times are specified in Local Standard Time (LST). Add 1 hour to adjust for Daylight Saving Time where and when it is observed.\"\r\n", "\r\n" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "!head -20 eng-hourly-08012013-08312013.csv" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\"Station Name\",\"VANCOUVER INTL A\"\r\n", "\"Province\",\"BRITISH COLUMBIA\"\r\n", "\"Latitude\",\"49.19\"\r\n", "\"Longitude\",\"-123.18\"\r\n", "\"Elevation\",\"4.30\"\r\n", "\"Climate Identifier\",\"1108395\"\r\n", "\"WMO Identifier\",\"71892\"\r\n", "\"TC Identifier\",\"YVR\"\r\n", "\"All times are specified in Local Standard Time (LST). Add 1 hour to adjust for Daylight Saving Time where and when it is observed.\"\r\n", "\r\n", "\"Legend\"\r\n", "\"M\",\"Missing\"\r\n", "\"E\",\"Estimated\"\r\n", "\"NA\",\"Not Available\"\r\n", "\"**\",\"Partner data that is not subject to review by the National Climate Archives\"\r\n", "\r\n", "\"Date/Time\",\"Year\",\"Month\",\"Day\",\"Time\",\"Data Quality\",\"Temp (\ufffdC)\",\"Temp Flag\",\"Dew Point Temp (\ufffdC)\",\"Dew Point Temp Flag\",\"Rel Hum (%)\",\"Rel Hum Flag\",\"Wind Dir (10s deg)\",\"Wind Dir Flag\",\"Wind Spd (km/h)\",\"Wind Spd Flag\",\"Visibility (km)\",\"Visibility Flag\",\"Stn Press (kPa)\",\"Stn Press Flag\",\"Hmdx\",\"Hmdx Flag\",\"Wind Chill\",\"Wind Chill Flag\",\"Weather\"\r\n", "\"2013-08-01 00:00\",\"2013\",\"08\",\"01\",\"00:00\",\"**\",\"18.7\",\"\",\"14.6\",\"\",\"77\",\"\",\"13\",\"\",\"3\",\"\",\"32.2\",\"\",\"101.60\",\"\",\"\",\"\",\"\",\"\",\"Rain\"\r\n", "\"2013-08-01 01:00\",\"2013\",\"08\",\"01\",\"01:00\",\"**\",\"18.4\",\"\",\"15.1\",\"\",\"81\",\"\",\"1\",\"\",\"5\",\"\",\"32.2\",\"\",\"101.57\",\"\",\"\",\"\",\"\",\"\",\"Cloudy\"\r\n", "\"2013-08-01 02:00\",\"2013\",\"08\",\"01\",\"02:00\",\"**\",\"17.8\",\"\",\"16.0\",\"\",\"89\",\"\",\"18\",\"\",\"9\",\"\",\"24.1\",\"\",\"101.57\",\"\",\"\",\"\",\"\",\"\",\"Rain\"\r\n" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "!tail eng-hourly-08012013-08312013.csv" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\"2013-08-31 14:00\",\"2013\",\"08\",\"31\",\"14:00\",\"**\",\"19.4\",\"\",\"16.3\",\"\",\"82\",\"\",\"30\",\"\",\"19\",\"\",\"48.3\",\"\",\"101.43\",\"\",\"\",\"\",\"\",\"\",\"NA\"\r\n", "\"2013-08-31 15:00\",\"2013\",\"08\",\"31\",\"15:00\",\"**\",\"19.3\",\"\",\"16.0\",\"\",\"81\",\"\",\"31\",\"\",\"22\",\"\",\"48.3\",\"\",\"101.32\",\"\",\"\",\"\",\"\",\"\",\"NA\"\r\n", "\"2013-08-31 16:00\",\"2013\",\"08\",\"31\",\"16:00\",\"**\",\"19.6\",\"\",\"16.1\",\"\",\"80\",\"\",\"30\",\"\",\"18\",\"\",\"48.3\",\"\",\"101.23\",\"\",\"\",\"\",\"\",\"\",\"Mainly Clear\"\r\n", "\"2013-08-31 17:00\",\"2013\",\"08\",\"31\",\"17:00\",\"**\",\"19.5\",\"\",\"16.0\",\"\",\"80\",\"\",\"30\",\"\",\"18\",\"\",\"48.3\",\"\",\"101.16\",\"\",\"\",\"\",\"\",\"\",\"NA\"\r\n", "\"2013-08-31 18:00\",\"2013\",\"08\",\"31\",\"18:00\",\"**\",\"18.7\",\"\",\"16.0\",\"\",\"84\",\"\",\"30\",\"\",\"16\",\"\",\"48.3\",\"\",\"101.08\",\"\",\"\",\"\",\"\",\"\",\"NA\"\r\n", "\"2013-08-31 19:00\",\"2013\",\"08\",\"31\",\"19:00\",\"**\",\"17.6\",\"\",\"16.1\",\"\",\"91\",\"\",\"30\",\"\",\"13\",\"\",\"48.3\",\"\",\"101.06\",\"\",\"\",\"\",\"\",\"\",\"Mainly Clear\"\r\n", "\"2013-08-31 20:00\",\"2013\",\"08\",\"31\",\"20:00\",\"**\",\"17.4\",\"\",\"15.8\",\"\",\"90\",\"\",\"29\",\"\",\"12\",\"\",\"32.2\",\"\",\"101.03\",\"\",\"\",\"\",\"\",\"\",\"NA\"\r\n", "\"2013-08-31 21:00\",\"2013\",\"08\",\"31\",\"21:00\",\"**\",\"16.7\",\"\",\"15.6\",\"\",\"93\",\"\",\"30\",\"\",\"9\",\"\",\"32.2\",\"\",\"101.00\",\"\",\"\",\"\",\"\",\"\",\"NA\"\r\n", "\"2013-08-31 22:00\",\"2013\",\"08\",\"31\",\"22:00\",\"**\",\"17.1\",\"\",\"16.1\",\"\",\"94\",\"\",\"32\",\"\",\"13\",\"\",\"32.2\",\"\",\"100.99\",\"\",\"\",\"\",\"\",\"\",\"Mainly Clear\"\r\n", "\"2013-08-31 23:00\",\"2013\",\"08\",\"31\",\"23:00\",\"**\",\"15.5\",\"\",\"14.6\",\"\",\"94\",\"\",\"34\",\"\",\"10\",\"\",\"32.2\",\"\",\"101.02\",\"\",\"\",\"\",\"\",\"\",\"NA\"\r\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "So,\n", "we have several lines of header data,\n", "with a couple of empty lines thrown in,\n", "a line of column names,\n", "and then line after line of data.\n", "The data values appear to be all `\"\"` quoted strings delimited by commas. Comma is assumed delimiter by pandas.read_csv.\n", "\n", "We will just throw away all of the non-data lines.\n", "Counting them manually we find that there are 17 total, but we'll keep the column names." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data = pd.read_csv('eng-hourly-08012013-08312013.csv', skiprows=16)\n", "print data[0:4]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " Date/Time Year Month Day Time Data Quality Temp (\ufffdC) \\\n", "0 2013-08-01 00:00 2013 8 1 00:00 ** 18.7 \n", "1 2013-08-01 01:00 2013 8 1 01:00 ** 18.4 \n", "2 2013-08-01 02:00 2013 8 1 02:00 ** 17.8 \n", "3 2013-08-01 03:00 2013 8 1 03:00 ** 17.2 \n", "\n", " Temp Flag Dew Point Temp (\ufffdC) Dew Point Temp Flag ... \\\n", "0 NaN 14.6 NaN ... \n", "1 NaN 15.1 NaN ... \n", "2 NaN 16.0 NaN ... \n", "3 NaN 15.6 NaN ... \n", "\n", " Wind Spd Flag Visibility (km) Visibility Flag Stn Press (kPa) \\\n", "0 NaN 32.2 NaN 101.60 \n", "1 NaN 32.2 NaN 101.57 \n", "2 NaN 24.1 NaN 101.57 \n", "3 NaN 32.2 NaN 101.57 \n", "\n", " Stn Press Flag Hmdx Hmdx Flag Wind Chill Wind Chill Flag Weather \n", "0 NaN NaN NaN NaN NaN Rain \n", "1 NaN NaN NaN NaN NaN Cloudy \n", "2 NaN NaN NaN NaN NaN Rain \n", "3 NaN NaN NaN NaN NaN NaN \n", "\n", "[4 rows x 25 columns]\n" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "print data.tail(1)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " Date/Time Year Month Day Time Data Quality Temp (\ufffdC) \\\n", "743 2013-08-31 23:00 2013 8 31 23:00 ** 15.5 \n", "\n", " Temp Flag Dew Point Temp (\ufffdC) Dew Point Temp Flag ... \\\n", "743 NaN 14.6 NaN ... \n", "\n", " Wind Spd Flag Visibility (km) Visibility Flag Stn Press (kPa) \\\n", "743 NaN 32.2 NaN 101.02 \n", "\n", " Stn Press Flag Hmdx Hmdx Flag Wind Chill Wind Chill Flag Weather \n", "743 NaN NaN NaN NaN NaN NaN \n", "\n", "[1 rows x 25 columns]\n" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "!tail -1 eng-hourly-08012013-08312013.csv" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\"2013-08-31 23:00\",\"2013\",\"08\",\"31\",\"23:00\",\"**\",\"15.5\",\"\",\"14.6\",\"\",\"94\",\"\",\"34\",\"\",\"10\",\"\",\"32.2\",\"\",\"101.02\",\"\",\"\",\"\",\"\",\"\",\"NA\"\r\n" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "That looks like progress and looks like what we expect.\n", "\n", "One remaining issue\n", "\n", "The degree symbol in the data file has been mangled.\n", "This is because we have not told pandas what encoding to use for non-ASCII characters.\n", "From trial and error we find that EC is using Windows encoding : `ISO-8859-1`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data = pd.read_csv('eng-hourly-08012013-08312013.csv', skiprows=16, encoding=\"ISO-8859-1\")\n", "print data.columns" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Index([u'Date/Time', u'Year', u'Month', u'Day', u'Time', u'Data Quality', u'Temp (\u00b0C)', u'Temp Flag', u'Dew Point Temp (\u00b0C)', u'Dew Point Temp Flag', u'Rel Hum (%)', u'Rel Hum Flag', u'Wind Dir (10s deg)', u'Wind Dir Flag', u'Wind Spd (km/h)', u'Wind Spd Flag', u'Visibility (km)', u'Visibility Flag', u'Stn Press (kPa)', u'Stn Press Flag', u'Hmdx', u'Hmdx Flag', u'Wind Chill', u'Wind Chill Flag', u'Weather'], dtype='object')\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Descriptive Statistics\n", "\n", "Now we can use array methods to do some basic analysis of the August 2013 weather.\n", "Let's look at temperatures:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "temps = data[u'Temp (\u00b0C)']\n", "print 'max:', temps.max(), 'on', data['Date/Time'][temps.argmax()]\n", "print 'min:', temps.min(), 'on', data['Date/Time'][temps.argmin()]\n", "print 'mean:', temps.mean()\n", "print 'std dev:', temps.std()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "max: 24.6 on 2013-08-10 15:00\n", "min: 12.5 on 2013-08-31 06:00\n", "mean: 18.4383064516\n", "std dev: 2.67368872431\n" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we need to include the `u` infront of the `Temp (\u00b0C)` to warn Python that this string includes a Unicode character.\n", "I got the degree symbol by copying and pasting,\n", "but you could also spell the key as `u'Temp (\\u00b0C)'`\n", "(because `u'\\u00b0'` is the Python string representation of the\n", "[Unicode DEGREESIGN character](http://www.fileformat.info/info/unicode/char/b0/index.htm))\n", "\n", "Now let's use Boolean slicing to dig down to the day level in our data.\n", "Let's get the maximum temperature each day,\n", "and the time when it occurred." ] }, { "cell_type": "code", "collapsed": false, "input": [ "temps = data[u'Temp (\u00b0C)']\n", "for day in range(1, 32):\n", " mask = data['Day']==day\n", " max_temp = temps[mask].max()\n", " date = data[mask]['Date/Time'][temps[mask].argmax()][:11]\n", " hour = data[mask]['Time'][temps[mask].argmax()] \n", " print 'max temperature on',date, 'was', max_temp, 'at', hour" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "max temperature on 2013-08-01 was 21.2 at 14:00\n", "max temperature on 2013-08-02 was 17.6 at 16:00\n", "max temperature on 2013-08-03 was 20.2 at 17:00\n", "max temperature on 2013-08-04 was 22.3 at 15:00\n", "max temperature on 2013-08-05 was 22.7 at 14:00\n", "max temperature on 2013-08-06 was 23.6 at 17:00\n", "max temperature on 2013-08-07 was 24.5 at 17:00\n", "max temperature on 2013-08-08 was 23.7 at 17:00\n", "max temperature on 2013-08-09 was 23.9 at 14:00\n", "max temperature on 2013-08-10 was 24.6 at 15:00\n", "max temperature on 2013-08-11 was 21.1 at 14:00\n", "max temperature on 2013-08-12 was 23.0 at 13:00\n", "max temperature on" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2013-08-13 was 22.6 at 18:00\n", "max temperature on 2013-08-14 was 23.4 at 11:00\n", "max temperature on 2013-08-15 was 21.8 at 10:00\n", "max temperature on 2013-08-16 was 23.5 at 13:00\n", "max temperature on 2013-08-17 was 23.8 at 16:00\n", "max temperature on 2013-08-18 was 22.0 at 17:00\n", "max temperature on 2013-08-19 was 22.4 at 16:00\n", "max temperature on 2013-08-20 was 21.4 at 16:00\n", "max temperature on 2013-08-21 was 22.1 at 14:00\n", "max temperature on 2013-08-22 was 24.2 at 15:00\n", "max temperature on 2013-08-23 was 22.1 at 14:00\n", "max temperature on 2013-08-24 was 20.9 at 13:00\n", "max temperature on 2013-08-25 was 22.0 at 16:00\n", "max temperature on 2013-08-26 was 22.3 at 14:00\n", "max temperature on 2013-08-27 was 22.0 at 14:00\n", "max temperature on" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2013-08-28 was 21.8 at 16:00\n", "max temperature on 2013-08-29 was 20.5 at 15:00\n", "max temperature on 2013-08-30 was 21.6 at 16:00\n", "max temperature on 2013-08-31 was 19.6 at 16:00\n" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:**\n", "Plot the daily maximum temperature." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But what if you want to look at lots of files. Can we automate downloading of the files? Yes! 1) You can use subprocess to go out to shell and run curl or wget \n", "2) You can use the requests library \n", "3) You can use the url library" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It takes some digging around, but it turns out that sending an HTTP GET request to:\n", "\n", "http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&stationID=51442&Year=2013&Month=8&Day=1&format=csv\n", "\n", "will return the hourly data CSV file for YVR that we all downloaded earlier.\n", "\n", "Note: The program that accepts that URL and processes it to return the data is very picky about capitalization. That's not good design, but it's what we have to live with.\n", "\n", "We can write that URL more readably in Python by separating it into a string for the URL of the page, and a dictionary containing the keys and values in the query part. Then we can use `requests.get()` function to get the content at that URL:\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import requests\n", "url = 'http://climate.weather.gc.ca/climateData/bulkdata_e.html'\n", "params = {\n", " 'timeframe': 1,\n", " 'stationID': 51442,\n", " 'Year': 2013,\n", " 'Month': 7,\n", " 'Day': 1,\n", " 'format': 'csv',\n", "}\n", "response = requests.get(url, params=params)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The response object that we get back has a variety of properties and methods. We can look at the response headers:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "response.headers" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "{'content-disposition': 'attachment; filename=\"eng-hourly-07012013-07312013.csv\"', 'content-transfer-encoding': 'binary', 'set-cookie': 'jsenabled=0; expires=Wed, 24-Sep-2014 19:42:43 GMT; path=/', 'accept-ranges': 'bytes', 'expires': 'Mon, 26 Jul 1997 05:00:00 GMT', 'keep-alive': 'timeout=3, max=100', 'server': 'Apache', 'transfer-encoding': 'chunked', 'connection': 'Keep-Alive', 'pragma': 'public', 'cache-control': 'private', 'date': 'Wed, 24 Sep 2014 18:42:43 GMT', 'content-type': 'application/force-download'}" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "to see that:\n", "\n", "'content-disposition': 'attachment; filename=\"eng-hourly-08012013-08312013.csv\"'\n", "\n", "which is why our browsers download the file with the name that they do, or offer to open it for us in an appropriate application. We can also see that:\n", "\n", "'content-type': 'text/csv'\n", "\n", "confirms that the server is sending us CSV data.\n", "\n", "Now we have a bit of a library mis-match. Requests produces a `response` object that has a bunch of neat properties BUT pandas wants to read a file. So we use the python library `StringIO` to produce a file-like object from the `response` content.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from StringIO import StringIO\n", "fakefile = StringIO(response.content)\n", "datajul = pd.read_csv(fakefile, skiprows=16, encoding=\"ISO-8859-1\")\n", "\n", "print datajul.head(2)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " Date/Time Year Month Day Time Data Quality Temp (\u00b0C) \\\n", "0 2013-07-01 00:00 2013 7 1 00:00 ** 19.6 \n", "1 2013-07-01 01:00 2013 7 1 01:00 ** 18.7 \n", "\n", " Temp Flag Dew Point Temp (\u00b0C) Dew Point Temp Flag ... \\\n", "0 NaN 17.8 NaN ... \n", "1 NaN 16.7 NaN ... \n", "\n", " Wind Spd Flag Visibility (km) Visibility Flag Stn Press (kPa) \\\n", "0 NaN 32.2 NaN 101.22 \n", "1 NaN 32.2 NaN 101.26 \n", "\n", " Stn Press Flag Hmdx Hmdx Flag Wind Chill Wind Chill Flag Weather \n", "0 NaN NaN NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN NaN Clear \n", "\n", "[2 rows x 25 columns]\n" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:**\n", "Plot the daily maximum temperature for both August and July" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Key Points\n", "- Use the `pandas.read_csv()` to read data from text files into NumPy arrays\n", "- Use `!` to prefix shell commands to execute them from a notebook cell\n", "- Use Boolean slices to explore and analyze data\n", "- Use requests to download a file" ] } ], "metadata": {} } ] }