{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reshaping data with `stack` and `unstack`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "try:\n",
    "    import seaborn\n",
    "except ImportError:\n",
    "    pass\n",
    "\n",
    "pd.options.display.max_rows = 8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Case study: air quality data of European monitoring stations (AirBase)\n",
    "\n",
    "Going further with the time series case study [test](05 - Time series data.ipynb) on the AirBase (The European Air quality dataBase) data: the actual data downloaded from the Airbase website did not look like a nice csv file (`data/airbase_data.csv`)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "One of the actual downloaded raw data files of AirBase is included in the repo:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "!head -1 ./data/BETR8010000800100hour.1-1-1990.31-12-2012"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Just reading the tab-delimited data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data = pd.read_csv(\"data/BETR8010000800100hour.1-1-1990.31-12-2012\", sep='\\t')#, header=None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above data is clearly not ready to be used! Each row contains the 24 measurements for each hour of the day, and also contains a flag (0/1) indicating the quality of the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: Clean up this dataframe using more options of `read_csv`\n",
    "</div>\n",
    "\n",
    "- specify that the values of -999 and -9999 should be regarded as NaN\n",
    "- specify are own column names (http://stackoverflow.com/questions/6356041/python-intertwining-two-lists)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "hours = [\"{:02d}\".format(i) for i in range(24)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data6.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data7.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "For now, we disregard the 'flag' columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data8.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime-index."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Overview: reshaping your data with `stack`, `unstack` and `pivot`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "The docs say:\n",
    "\n",
    "> Pivot a level of the (possibly hierarchical) column labels, returning a\n",
    "DataFrame (or Series in the case of an object with a single level of\n",
    "column labels) having a hierarchical index with a new inner-most level\n",
    "of row labels.\n",
    "\n",
    "<img src=\"img/stack.png\" width=70%>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "df = pd.DataFrame({'A':['one', 'one', 'two', 'two'], 'B':['a', 'b', 'a', 'b'], 'C':range(4)})\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use `stack`/`unstack`, we need the values we want to shift from rows to columns or the other way around as the index:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df = df.set_index(['A', 'B'])\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "result = df['C'].unstack()\n",
    "result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df = result.stack().reset_index(name='C')\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "`pivot` is similar to `unstack`, but let you specify column names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "df.pivot(index='A', columns='B', values='C')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "`pivot_table` is similar as `pivot`, but can work with duplicate indices and let you specify an aggregation function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df = pd.DataFrame({'A':['one', 'one', 'two', 'two', 'one', 'two'], 'B':['a', 'b', 'a', 'b', 'a', 'b'], 'C':range(6)})\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df.pivot_table(index='A', columns='B', values='C', aggfunc='count') #'mean'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Back to our case study"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now use `stack` and some other functions to create a timeseries from the original dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "colnames = ['date'] + [item for pair in zip([\"{:02d}\".format(i) for i in range(24)], ['flag']*24) for item in pair]\n",
    "\n",
    "data = pd.read_csv(\"data/BETR8010000800100hour.1-1-1990.31-12-2012\",\n",
    "                   sep='\\t', header=None, na_values=[-999, -9999], names=colnames)\n",
    "data = data.drop('flag', axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: Reshape the dataframe to a timeseries\n",
    "</div>\n",
    "\n",
    "The end result should look like:\n",
    "\n",
    "\n",
    "<div>\n",
    "<table border=\"1\" class=\"dataframe\">\n",
    "  <thead>\n",
    "    <tr style=\"text-align: right;\">\n",
    "      <th></th>\n",
    "      <th>BETR801</th>\n",
    "    </tr>\n",
    "  </thead>\n",
    "  <tbody>\n",
    "    <tr>\n",
    "      <th>1990-01-02 09:00:00</th>\n",
    "      <td>48.0</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th>1990-01-02 12:00:00</th>\n",
    "      <td>48.0</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th>1990-01-02 13:00:00</th>\n",
    "      <td>50.0</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th>1990-01-02 14:00:00</th>\n",
    "      <td>55.0</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th>...</th>\n",
    "      <td>...</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th>2012-12-31 20:00:00</th>\n",
    "      <td>16.5</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th>2012-12-31 21:00:00</th>\n",
    "      <td>14.5</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th>2012-12-31 22:00:00</th>\n",
    "      <td>16.5</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th>2012-12-31 23:00:00</th>\n",
    "      <td>15.0</td>\n",
    "    </tr>\n",
    "  </tbody>\n",
    "</table>\n",
    "<p>170794 rows × 1 columns</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, reshape the dataframe so that each row consists of one observation for one date + hour combination:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data18.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data19.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data20.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Now, combine the date and hour colums into a datetime (tip: string columns can be summed to concatenate the strings):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data21.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data22.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data23.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data24.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data25.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>QUESTION</b>: What is the difference in the typical diurnal profile between week and weekend days? (and visualise it)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# use the data of above\n",
    "data = data_stacked\n",
    "# or read from the processed csv file\n",
    "#data = pd.read_csv('data/airbase_data.csv', index_col=0, parse_dates=True, na_values=[-9999])[['BETR801']]\n",
    "#data = data['1999':]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": false,
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data.index.weekday?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data28.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Add a column indicating week/weekend"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data29.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data30.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data31.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data32.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data33.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>QUESTION</b>: Visualize the typical week profile for the different stations as boxplots (where the values in one boxplot are the daily means for the different weeks for a certain weekday).\n",
    "</div>\n",
    "\n",
    "Tip: the boxplot method of a DataFrame expects the data for the different boxes in different columns). For this, you can either use `pivot_table` as a combination of `groupby` and `unstack`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data34.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data35.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data36.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An alternative method using `groupby` and `unstack`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data37.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## We can also use the reshaping methods with the movie data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "cast = pd.read_csv('data/cast.csv')\n",
    "cast.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "titles = pd.read_csv('data/titles.csv')\n",
    "titles.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: Define a year as a \"Superman year\" whose films feature more Superman characters than Batman. How many years in film history have been Superman years?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data40.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data41.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: Plot the number of actor roles each year and the number of actress roles each year over the history of film.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data42.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: Plot the number of actor roles each year and the number of actress roles each year, but this time as a kind='area' plot.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load snippets/06 - Reshaping data43.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Nbtutor - export exercises",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}