{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Reshaping data with `stack` and `unstack`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "try:\n", " import seaborn\n", "except ImportError:\n", " pass\n", "\n", "pd.options.display.max_rows = 8" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Case study: air quality data of European monitoring stations (AirBase)\n", "\n", "Going further with the time series case study [test](05 - Time series data.ipynb) on the AirBase (The European Air quality dataBase) data: the actual data downloaded from the Airbase website did not look like a nice csv file (`data/airbase_data.csv`)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "One of the actual downloaded raw data files of AirBase is included in the repo:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "!head -1 ./data/BETR8010000800100hour.1-1-1990.31-12-2012" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Just reading the tab-delimited data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = pd.read_csv(\"data/BETR8010000800100hour.1-1-1990.31-12-2012\", sep='\\t')#, header=None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above data is clearly not ready to be used! Each row contains the 24 measurements for each hour of the day, and also contains a flag (0/1) indicating the quality of the data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " EXERCISE: Clean up this dataframe using more options of `read_csv`\n", "
\n", "\n", "- specify that the values of -999 and -9999 should be regarded as NaN\n", "- specify are own column names (http://stackoverflow.com/questions/6356041/python-intertwining-two-lists)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "hours = [\"{:02d}\".format(i) for i in range(24)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data6.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data7.py" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "For now, we disregard the 'flag' columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data8.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime-index." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Overview: reshaping your data with `stack`, `unstack` and `pivot`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The docs say:\n", "\n", "> Pivot a level of the (possibly hierarchical) column labels, returning a\n", "DataFrame (or Series in the case of an object with a single level of\n", "column labels) having a hierarchical index with a new inner-most level\n", "of row labels.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "df = pd.DataFrame({'A':['one', 'one', 'two', 'two'], 'B':['a', 'b', 'a', 'b'], 'C':range(4)})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use `stack`/`unstack`, we need the values we want to shift from rows to columns or the other way around as the index:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = df.set_index(['A', 'B'])\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "result = df['C'].unstack()\n", "result" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = result.stack().reset_index(name='C')\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "`pivot` is similar to `unstack`, but let you specify column names:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "df.pivot(index='A', columns='B', values='C')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "`pivot_table` is similar as `pivot`, but can work with duplicate indices and let you specify an aggregation function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.DataFrame({'A':['one', 'one', 'two', 'two', 'one', 'two'], 'B':['a', 'b', 'a', 'b', 'a', 'b'], 'C':range(6)})\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.pivot_table(index='A', columns='B', values='C', aggfunc='count') #'mean'" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Back to our case study" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now use `stack` and some other functions to create a timeseries from the original dataframe:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "colnames = ['date'] + [item for pair in zip([\"{:02d}\".format(i) for i in range(24)], ['flag']*24) for item in pair]\n", "\n", "data = pd.read_csv(\"data/BETR8010000800100hour.1-1-1990.31-12-2012\",\n", " sep='\\t', header=None, na_values=[-999, -9999], names=colnames)\n", "data = data.drop('flag', axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Reshape the dataframe to a timeseries\n", "
\n", "\n", "The end result should look like:\n", "\n", "\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BETR801
1990-01-02 09:00:0048.0
1990-01-02 12:00:0048.0
1990-01-02 13:00:0050.0
1990-01-02 14:00:0055.0
......
2012-12-31 20:00:0016.5
2012-12-31 21:00:0014.5
2012-12-31 22:00:0016.5
2012-12-31 23:00:0015.0
\n", "

170794 rows × 1 columns

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, reshape the dataframe so that each row consists of one observation for one date + hour combination:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data18.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data19.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data20.py" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Now, combine the date and hour colums into a datetime (tip: string columns can be summed to concatenate the strings):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data21.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data22.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data23.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data24.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data25.py" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " QUESTION: What is the difference in the typical diurnal profile between week and weekend days? (and visualise it)\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# use the data of above\n", "data = data_stacked\n", "# or read from the processed csv file\n", "#data = pd.read_csv('data/airbase_data.csv', index_col=0, parse_dates=True, na_values=[-9999])[['BETR801']]\n", "#data = data['1999':]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": false, "collapsed": true }, "outputs": [], "source": [ "data.index.weekday?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data28.py" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Add a column indicating week/weekend" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data29.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data30.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data31.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data32.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data33.py" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " QUESTION: Visualize the typical week profile for the different stations as boxplots (where the values in one boxplot are the daily means for the different weeks for a certain weekday).\n", "
\n", "\n", "Tip: the boxplot method of a DataFrame expects the data for the different boxes in different columns). For this, you can either use `pivot_table` as a combination of `groupby` and `unstack`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data34.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data35.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data36.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An alternative method using `groupby` and `unstack`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data37.py" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## We can also use the reshaping methods with the movie data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cast = pd.read_csv('data/cast.csv')\n", "cast.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "titles = pd.read_csv('data/titles.csv')\n", "titles.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Define a year as a \"Superman year\" whose films feature more Superman characters than Batman. How many years in film history have been Superman years?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data40.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data41.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Plot the number of actor roles each year and the number of actress roles each year over the history of film.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data42.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Plot the number of actor roles each year and the number of actress roles each year, but this time as a kind='area' plot.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data43.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Nbtutor - export exercises", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }