{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with time series data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Some imports:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "try:\n", " import seaborn\n", "except:\n", " pass\n", "\n", "pd.options.display.max_rows = 8" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Case study: air quality data of European monitoring stations (AirBase)\n", "\n", "AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from IPython.display import HTML\n", "HTML('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I downloaded and preprocessed some of the data ([python-airbase](https://github.com/jorisvandenbossche/python-airbase)): `data/airbase_data.csv`. This file includes the hourly concentrations of NO2 for 4 different measurement stations:\n", "\n", "- FR04037 (PARIS 13eme): urban background site at Square de Choisy\n", "- FR04012 (Paris, Place Victor Basch): urban traffic site at Rue d'Alesia\n", "- BETR802: urban traffic site in Antwerp, Belgium\n", "- BETN029: rural background site in Houtem, Belgium\n", "\n", "See http://www.eea.europa.eu/themes/air/interactive/no2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Importing the data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Import the csv file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!head -5 data/airbase_data.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the missing values are indicated by `-9999`. This can be recognized by `read_csv` by passing the `na_values` keyword:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = pd.read_csv('data/airbase_data.csv', index_col=0, parse_dates=True, na_values=[-9999])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Exploring the data - recap of some useful methods" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Some useful methods:\n", "\n", "`head` and `tail`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "data.head(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.tail()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "`info()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.info()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "source": [ "Getting some basic summary statistics about the data with `describe`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.describe()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Quickly visualizing the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "data.plot(kind='box', ylim=[0,250])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "data['BETR801'].plot(kind='hist', bins=50)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "data.plot(figsize=(12,6))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This does not say too much .." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can select part of the data (eg the latest 500 data points):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data[-500:].plot(figsize=(12,6))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can use some more advanced time series features -> next section!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Working with time series data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "When we ensure the DataFrame has a `DatetimeIndex`, time-series related functionality becomes available:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.index" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Indexing a time series works with strings:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data[\"2010-01-01 09:00\":\"2010-01-01 12:00\"]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A nice feature is **\"partial string\" indexing**, where we can do implicit slicing by providing a partial datetime string.\n", "\n", "E.g. all data of 2012:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "data['2012']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normally you would expect this to access a column named '2012', but as for a DatetimeIndex, pandas also tries to interprete it as a datetime slice." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Or all data of January up to March 2012:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data['2012-01':'2012-03']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Time and date components can be accessed from the index:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.index.hour" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.index.year" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: select all data starting from 1999\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = data['1999':]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: select all data in January for all different years\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data[data.index.month == 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: select all data in January, February and March for all different years\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data['months'] = data.index.month\n", "data[data['months'].isin([1, 2, 3])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: select all 'daytime' data (between 8h and 20h) for all days\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data[(data.index.hour >= 8) & (data.index.hour < 20)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.between_time('08:00', '20:00')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = pd.read_csv('data/airbase_data.csv', index_col=0, parse_dates=True, na_values=[-9999])\n", "data = data['1999':]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## The power of pandas: `resample`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A very powerfull method is **`resample`: converting the frequency of the time series** (e.g. from hourly to daily data).\n", "\n", "The time series has a frequency of 1 hour. I want to change this to daily:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.resample('D').mean().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " NOTE: with older versions of pandas, data.resample('D').mean() was expressed as data.resample('D', how='mean').\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Similar to `groupby`, other methods can also be specified:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.resample('D').max().head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The string to specify the new time frequency: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#offset-aliases \n", "These strings can also be combined with numbers, eg `'10D'`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Further exploring the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.resample('M').mean().plot() # 'A'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# data['2012'].resample('D').mean().plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " QUESTION: plot the monthly mean and median concentration of the 'FR04037' station for the years 2009-2012\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/05 - Time series data29.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# %load snippets/05 - Time series data30.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " QUESTION: plot the monthly mininum and maximum daily concentration of the 'BETR801' station\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true }, "outputs": [], "source": [ "# %load snippets/05 - Time series data31.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/05 - Time series data32.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " QUESTION: make a bar plot of the mean of the stations in year of 2012\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/05 - Time series data33.py" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " QUESTION: The evolution of the yearly averages with, and the overall mean of all stations (indicate the overall mean with a thicker black line)?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/05 - Time series data34.py" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Combination with groupby" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`resample` can actually be seen as a specific kind of `groupby`. E.g. taking annual means with `data.resample('A', 'mean')` is equivalent to `data.groupby(data.index.year).mean()` (only the result of `resample` still has a DatetimeIndex).\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.groupby(data.index.year).mean().plot()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "But, `groupby` is more flexible and can also do resamples that do not result in a new continuous time series, e.g. by grouping by the hour of the day to get the diurnal cycle." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " QUESTION: how does the *typical monthly profile* look like for the different stations?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1\\. add a column to the dataframe that indicates the month (integer value of 1 to 12):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/05 - Time series data36.py" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "2\\. Now, we can calculate the mean of each month over the different years:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/05 - Time series data37.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3\\. plot the typical monthly profile of the different stations:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# %load snippets/05 - Time series data38.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " QUESTION: plot the weekly 95% percentiles of the concentration in 'BETR801' and 'BETN029' for 2011\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df2011 = data['2011'].dropna()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/05 - Time series data40.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/05 - Time series data41.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": false, "collapsed": true }, "outputs": [], "source": [ "data = data.drop('month', axis=1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " QUESTION: The typical diurnal profile for the different stations?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# %load snippets/05 - Time series data43.py" ] }, { "cell_type": "markdown", "metadata": { "clear_cell": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " QUESTION: What are the number of exceedances of hourly values above the European limit 200 µg/m3 for each year/station?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true }, "outputs": [], "source": [ "# %load snippets/05 - Time series data44.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/05 - Time series data45.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/05 - Time series data46.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/05 - Time series data47.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " QUESTION: And are there exceedances of the yearly limit value of 40 µg/m3 since 200 ?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true }, "outputs": [], "source": [ "# %load snippets/05 - Time series data48.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/05 - Time series data49.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/05 - Time series data50.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " QUESTION: The maximum daily 8 hour mean should be below 100 µg/m³. What are the number of exceedances of this limit for each year/station?\n", "
\n", "\n", "Tip: have a look at the `rolling` method to perform moving window operations.\n", "\n", "Note: this is not an actual limit for NO2, but a nice exercise to introduce the `rolling` method. Other pollutans, such as 03 have actually such kind of limit values." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/05 - Time series data52.py" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " QUESTION: Calculate the correlation between the different stations\n", "
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/05 - Time series data53.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "# %load snippets/05 - Time series data54.py" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }