{ "metadata": { "name": "", "signature": "sha256:1dec014908b350d0c54505ef3415b91eabd06f7a0dacf1eba412ea9fed14afea" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Weather Changes: an Introduction to Data Analysis with Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a basic introduction to real-world data analysis and visualization with Python. To see the finished product and a bit more analysis, read my [blog](http://corbt.com/posts/2014/05/10/climate-change-since-1950.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step in data analysis is finding the right data. For this project we'll be using the Global Historical Climatology Network (GHCN) monthly data compiled by the US government. In a [previous post](http://corbt.com/posts/2014/05/01/world-temperatures.html) I explain where this data comes from.\n", "\n", "A professor of mine, and an experienced data scientist, once mentioned that \"80% of the effort in machine learning is getting the source data into a manageable format.\" The following section simply does exactly that. :) The source files are in a custom fixed-width-column format specified on the [NOAA website](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v3/README).\n", "\n", "The first block here parses three \\*.dat files, one each for minimum, maximum and average temperature measurements. Each of these files consists of one line per station per year. A line includes the station ID, year of observation, and measurements for all 12 months.\n", "\n", "These files are simply read in, split at the correct locations, and written out to a CSV file. This will allow us to import the data into pandas directly using pandas's `read_csv` functionality." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "from __future__ import division\n", "from glob import glob\n", "import csv,re\n", "\n", "for name in glob('data/raw/*.dat'):\n", "    label = name[15:19]\n", "    print \"Parsing data:\",label\n", "    with open(name) as in_file:\n", "        with open(\"data/csv/\"+label+\".csv\",'w') as out_file:\n", "            writer = csv.writer(out_file)\n", "            header = [\"id\",\"year\"]\n", "            header += [str(month+1)+\"_\"+label for month in xrange(12)]\n", "            writer.writerow(header)\n", "\n", "            for line in in_file:\n", "                # if int(line[11:15]) < 2000:\n", "                #     continue\n", "                components = [line[0:11],line[11:15]]\n", "                # Values are hundredths of a degree; -9999 marks a missing month\n", "                components += [int(line[19+x*8:24+x*8])/100 for x in range(12)]\n", "                writer.writerow(components)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Parsing data: tavg\n", "Parsing data:" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " tmax\n", "Parsing data:" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " tmin\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `country-codes` file is provided, which correlates each station ID with a country. Because this file is small we can read it straight into a Python dictionary of the form `{[country_code]: [country_name]}`, which we will use in the next step to associate countries with station IDs."
] }, { "cell_type": "code", "collapsed": false, "input": [ "def title(name):\n", "    \"\"\"Title case with correct handling of internal apostrophes\"\"\"\n", "    return re.sub(\"([a-z])'([A-Z])\", lambda m: m.group(0).lower(), name.title())\n", "\n", "with open('data/raw/country-codes') as cc_file:\n", "    ccodes = {l[:3]: title(l[4:].strip()) for l in cc_file}" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Index files, ending in \*.inv, correlate the station IDs in the .dat files with station metadata, such as latitude, longitude and station name. The following lines open each .inv file and convert it to a CSV, as above, using the `ccodes` dictionary from the previous step to look up the country name for each station." ] }, { "cell_type": "code", "collapsed": false, "input": [ "for name in glob('data/raw/*.inv'):\n", "    label = name[15:19]\n", "    print \"Parsing index:\",label\n", "    with open(name) as in_file:\n", "        with open(\"data/csv/\"+label+\"-inv.csv\",'w') as out_file:\n", "            writer = csv.writer(out_file)\n", "            writer.writerow([\"id\",\"lat\",\"long\",\"name\",\"country\"])\n", "            for line in in_file:\n", "                # if not line[74:79].strip().isdigit() or int(line[74:79]) < 100:\n", "                #     continue\n", "                name = title(line[38:68])\n", "                name = name[:name.find(\"  \")].strip()\n", "                components = [line[0:11],line[12:20],line[21:30],name,ccodes[line[:3]]]\n", "                writer.writerow(components)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Parsing index: tavg\n", "Parsing index:" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " tmin\n", "Parsing index:" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " tmax\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregating the Data\n", "\n", "Now that all the relevant data is stored as easily-parsed CSV files, we can begin processing it!\n",
"\n", "The main tool we'll be using for this is a pandas `DataFrame`. Think of a DataFrame as a table in SQL, or a spreadsheet in Excel, with lots of data indexing, retrieval and aggregation tools built right in. For more information on DataFrames, see the pandas [introduction](http://pandas.pydata.org/pandas-docs/stable/dsintro.html). Pandas lets you build a DataFrame from a dictionary, but in this case we'll be reading one right from a CSV file on disk. It looks something like this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "\n", "tavg_dataframe = pd.read_csv(\"data/csv/tavg.csv\")\n", "tavg_dataframe.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
| | id | year | 1_tavg | 2_tavg | 3_tavg | 4_tavg | 5_tavg | 6_tavg | 7_tavg | 8_tavg | 9_tavg | 10_tavg | 11_tavg | 12_tavg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10160355000 | 1878 | 9.60 | 10.2 | 11.80 | 16.80 | 20.5 | 23.10 | 25.60 | 27.50 | 23.9 | -99.99 | 14.40 | 12.20 |
| 1 | 10160355000 | 1879 | 12.50 | 12.4 | 12.90 | 16.20 | 16.3 | 23.40 | 24.70 | 25.80 | 23.1 | 18.20 | 15.20 | 9.70 |
| 2 | 10160355000 | 1880 | 10.30 | 11.8 | 13.10 | 15.90 | 17.8 | 21.40 | 26.50 | 26.40 | 23.6 | 21.30 | 16.40 | 13.60 |
| 3 | 10160355000 | 1931 | -99.99 | 10.4 | -99.99 | -99.99 | 19.2 | 24.60 | -99.99 | 26.70 | 22.3 | 20.00 | 16.20 | 11.30 |
| 4 | 10160355000 | 1932 | 10.80 | 10.5 | -99.99 | 14.90 | 19.1 | -99.99 | 23.60 | -99.99 | 25.1 | -99.99 | -99.99 | -99.99 |

5 rows × 14 columns
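The cells that produce the merged per-station, per-year view shown next are not visible in this excerpt. A minimal sketch of the two steps it implies, masking the -99.99 missing-value sentinel, averaging the month columns, and joining on station ID, might look like the following (the toy frames stand in for the tavg.csv and tavg-inv.csv files written above; only two month columns are shown for brevity):

```python
import pandas as pd
import numpy as np

# Toy frames mimicking data/csv/tavg.csv and data/csv/tavg-inv.csv
tavg = pd.DataFrame({
    "id": ["10160355000"] * 2,
    "year": [1931, 1932],
    "1_tavg": [-99.99, 10.80],
    "2_tavg": [10.4, 10.5],
})
inv = pd.DataFrame({
    "id": ["10160355000"],
    "lat": [36.93], "long": [6.95],
    "name": ["Skikda"], "country": ["Algeria"],
})

months = [c for c in tavg.columns if c.endswith("_tavg")]
# -99.99 is the missing-value sentinel; mask it so it doesn't distort the mean
tavg[months] = tavg[months].replace(-99.99, np.nan)
# Yearly average over whichever months are actually present
tavg["tavg"] = tavg[months].mean(axis=1)

# Join the yearly averages with the station metadata on the station ID
merged = pd.merge(tavg[["id", "year", "tavg"]], inv, on="id")
print(merged)
```

Averaging with `skipna` (the default for `mean`) means a year with a few missing months still gets a value, which is why the merged table below shows uneven-looking means like 14.942857.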
| | id | year | tavg | tmin | tmax | lat | long | name | country |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10160355000 | 1996 | 19.370000 | 15.860000 | 24.200000 | 36.93 | 6.95 | Skikda | Algeria |
| 1 | 10160355000 | 1997 | 18.790000 | 14.942857 | 21.828571 | 36.93 | 6.95 | Skikda | Algeria |
| 2 | 10160355000 | 1998 | 18.550000 | 14.730000 | 22.460000 | 36.93 | 6.95 | Skikda | Algeria |
| 3 | 10160355000 | 1999 | 19.536364 | 12.914286 | 20.600000 | 36.93 | 6.95 | Skikda | Algeria |
| 4 | 10160355000 | 2000 | 19.070000 | 16.111111 | 23.933333 | 36.93 | 6.95 | Skikda | Algeria |

5 rows × 9 columns
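The per-country tally shown next can be reproduced by counting station rows per country; the cell that does so is not visible in this excerpt, but a sketch using the column names from the *-inv.csv files above might be:

```python
import pandas as pd

# Toy station metadata, one row per station, as in the *-inv.csv files
inv = pd.DataFrame({
    "id": ["425000100", "425000200", "501000100"],
    "country": ["United States Of America", "United States Of America", "Australia"],
})

# One row per station, so row counts per country are station counts
counts = inv.groupby("country").size().sort_values(ascending=False)
print(counts)
```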
| country | count |
|---|---|
| United States Of America | 1222 |
| Russian Federation (Asian Sector) | 71 |
| China | 52 |
| Australia | 49 |
| Japan | 46 |
| Canada | 34 |
| Russian Federation (European Sector) | 19 |
| Pakistan | 17 |
| Sudan | 15 |
| Italy | 15 |
| Philippines | 14 |
| Thailand | 13 |
| Malaysia | 11 |
| Ukraine | 10 |
| Poland | 10 |
| South Africa | 9 |
| Republic Of Korea | 9 |
| Algeria | 6 |
| Turkmenistan | 6 |
| Uzbekistan | 6 |
| Kazakhstan | 5 |
| Zambia | 3 |
| Belarus | 3 |
| Egypt | 3 |
| Federated States Of Micronesia | 3 |
| Bangladesh | 3 |
| Zimbabwe | 2 |
| Libya | 2 |
| Lithuania | 2 |
| Tanzania | 2 |
| Nigeria | 2 |
| Cocos Islands (Australia) | 1 |
| Antarctica | 1 |
| Armenia | 1 |
| Wake Island (U.S.A.) | 1 |
| Austria | 1 |
| Saudi Arabia | 1 |
| Belau | 1 |
| Bermuda (U.K.) | 1 |
| Portugal | 1 |
| Northern Mariana Islands (U.S.A.) | 1 |
| Latvia | 1 |
| Kyrgyzstan | 1 |
| Cuba | 1 |
| Czech Republic | 1 |
| Norfolk Island (Australia) | 1 |
| Estonia | 1 |
| Ghana | 1 |
| Iceland | 1 |
| Indonesia | 1 |
| Moldova | 1 |
| Marshall Islands | 1 |
| Coral Sea Islands (Australia) | 1 |

53 rows × 1 columns