{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Pandas: Working with time series data

\n", "\n", "> *© 2016, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "%matplotlib notebook\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "pd.options.display.max_rows = 8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction: `datetime` module" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Standard Python contains the `datetime` module to handle with date and time data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "import datetime" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "dt = datetime.datetime(year=2016, month=12, day=19, hour=13, minute=30)\n", "dt" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "print(dt) # .day,..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "print(dt.strftime(\"%d %B %Y\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dates and times in pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The ``Timestamp`` object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas has its own date and time objects, which are compatible with the standard `datetime` objects, but provide some more functionality to work with. \n", "\n", "The `Timestamp` object can also be constructed from a string:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "ts = pd.Timestamp('2016-12-19')\n", "ts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like with `datetime.datetime` objects, there are several useful attributes available on the `Timestamp`. For example, we can get the month:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "ts.month" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "ts + pd.Timedelta('5 days')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parsing datetime strings " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](http://imgs.xkcd.com/comics/iso_8601.png)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, when working with real world data, you encounter many different `datetime` formats. Most of the time when you have to deal with them, they come in text format, e.g. from a `CSV` file. To work with those data in Pandas, we first have to *parse* the strings to actual `Timestamp` objects." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "REMEMBER:

\n", "\n", "from string formatted dates to Timestamp objects: `to_datetime` function\n", "
\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "pd.to_datetime(\"2016-12-09\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "pd.to_datetime(\"09/12/2016\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "pd.to_datetime(\"09/12/2016\", dayfirst=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "pd.to_datetime(\"09/12/2016\", format=\"%d/%m/%Y\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "A detailed overview of how to specify the `format` string, see the table in the python documentation: https://docs.python.org/3.5/library/datetime.html#strftime-and-strptime-behavior" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `Timestamp` data in a Series or DataFrame column" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "s = pd.Series(['2016-12-09 10:00:00', '2016-12-09, 11:00:00', '2016-12-09 12:00:00'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `to_datetime` function can also be used to convert a full series of strings:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "ts = pd.to_datetime(s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "ts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the data type of this series: the `datetime64[ns]` dtype. This indicates that we have a series of actual datetime values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same attributes as on single `Timestamp`s are also available on a Series with datetime data, using the **`.dt`** accessor:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false }, "scrolled": true }, "outputs": [], "source": [ "ts.dt.hour" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "ts.dt.weekday" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To quickly construct some regular time series data, the [``pd.date_range``](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html) function comes in handy:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "pd.Series(pd.date_range(start=\"2016-01-01\", periods=10, freq='3H'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Time series data: `Timestamp` in the index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## River discharge example data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the following demonstration of the time series functionality, we use a sample of discharge data of the Maarkebeek (Flanders) with 3 hour averaged values, derived from the [Waterinfo website](https://www.waterinfo.be/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data = pd.read_csv(\"data/flowdata.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We already know how to parse a date column with Pandas:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data['Time'] = pd.to_datetime(data['Time'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `set_index('datetime')`, we set the column with datetime values as the index, which can be done by both `Series` and `DataFrame`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data = data.set_index(\"Time\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The steps above are provided as built-in functionality of `read_csv`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data = pd.read_csv(\"data/flowdata.csv\", index_col=0, parse_dates=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "REMEMBER:

\n", "\n", "`pd.read_csv` provides a lot of built-in functionality to support this kind of transactions when reading in a file! Check the help of the read_csv function...\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The DatetimeIndex" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "When we ensure the DataFrame has a `DatetimeIndex`, time-series related functionality becomes available:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false }, "scrolled": true }, "outputs": [], "source": [ "data.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to a Series with datetime data, there are some attributes of the timestamp values available:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data.index.day" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data.index.dayofyear" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data.index.year" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `plot` method will also adapt it's labels (when you zoom in, you can see the different levels of detail of the datetime labels):\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "data.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have to much data to sensibly plot on one figure. Let's see how we can easily select part of the data or aggregate the data to other time resolutions in the next sections." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting data from a time series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use label based indexing on a timeseries as expected:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data[pd.Timestamp(\"2012-01-01 09:00\"):pd.Timestamp(\"2012-01-01 19:00\")]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "But, for convenience, indexing a time series also works with strings:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data[\"2012-01-01 09:00\":\"2012-01-01 19:00\"]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A nice feature is **\"partial string\" indexing**, where we can do implicit slicing by providing a partial datetime string.\n", "\n", "E.g. all data of 2013:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false }, "scrolled": true }, "outputs": [], "source": [ "data['2013']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normally you would expect this to access a column named '2013', but as for a DatetimeIndex, pandas also tries to interprete it as a datetime slice." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Or all data of January up to March 2012:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data['2012-01':'2012-03']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "EXERCISE:\n", "\n", "
    \n", "
  • select all data starting from 2012
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/05 - Time series data36.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "EXERCISE:\n", "\n", "
    \n", "
  • select all data in January for all different years
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/05 - Time series data37.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "EXERCISE:\n", "\n", "
    \n", "
  • select all data in January, February and March for all different years
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/05 - Time series data38.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data = data.drop(\"months\", axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "EXERCISE:\n", "\n", "
    \n", "
  • select all 'daytime' data (between 8h and 20h) for all days
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/05 - Time series data40.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/05 - Time series data41.py" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## The power of pandas: `resample`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A very powerfull method is **`resample`: converting the frequency of the time series** (e.g. from hourly to daily data).\n", "\n", "The time series has a frequency of 1 hour. I want to change this to daily:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data.resample('D').mean().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "NOTE:\n", "\n", "
    \n", "
  • with older versions of pandas, data.resample('D').mean() was expressed as data.resample('D', how='mean').
  • \n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Other mathematical methods can also be specified:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data.resample('D').max().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "REMEMBER:

\n", "\n", " The string to specify the new time frequency: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

\n", " \n", " These strings can also be combined with numbers, eg `'10D'`...\n", "\n", "
\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "data.resample('A').mean().plot() # 10D" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "EXERCISE:\n", "\n", "
    \n", "
  • plot the monthly standard deviation of the columns
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/05 - Time series data45.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "EXERCISE:\n", "\n", "
    \n", "
  • plot the monthly mean and median values for the years 2011-2012 for 'L06_347'

  • \n", "
\n", " \n", " **Note**
You can create a new figure with `fig, ax = plt.subplots()` and add each of the plots to the created `ax` object (see documentation of pandas plot function)\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/05 - Time series data46.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# %load snippets/05 - Time series data47.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "EXERCISE:\n", "\n", "
    \n", "
  • plot the monthly mininum and maximum daily average value of the 'LS06_348' column
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/05 - Time series data48.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/05 - Time series data49.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "EXERCISE:\n", "\n", "
    \n", "
  • make a bar plot of the mean of the stations in year of 2013 (Remark: create a `fig, ax = plt.subplots()` object and add the plot to the created ax
  • \n", "
\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/05 - Time series data50.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "EXERCISE:\n", "\n", "
    \n", "
  • Calculate the typical yearly pattern with monthly resolution (plot of the typical monthly average over the years)
  • \n", "
\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" }, "nav_menu": {}, "toc": { "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 6, "toc_cell": false, "toc_section_display": "block", "toc_window_display": true }, "toc_position": { "height": "500px", "left": "0px", "right": "986px", "top": "106px", "width": "214px" } }, "nbformat": 4, "nbformat_minor": 1 }