{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Reshaping data

\n", "\n", "> *© 2016, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pivoting data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cfr. excel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "People who know Excel, probably know the **Pivot** functionality:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](img/pivot_excel.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data of the table:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "excelample = pd.DataFrame({'Month': [\"January\", \"January\", \"January\", \"January\", \n", " \"February\", \"February\", \"February\", \"February\", \n", " \"March\", \"March\", \"March\", \"March\"],\n", " 'Category': [\"Transportation\", \"Grocery\", \"Household\", \"Entertainment\",\n", " \"Transportation\", \"Grocery\", \"Household\", \"Entertainment\",\n", " \"Transportation\", \"Grocery\", \"Household\", \"Entertainment\"],\n", " 'Amount': [74., 235., 175., 100., 115., 240., 225., 125., 90., 260., 200., 120.]})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "excelample" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "excelample_pivot = excelample.pivot(index=\"Category\", columns=\"Month\", values=\"Amount\")\n", "excelample_pivot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interested in *Grand totals*?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sum columns\n", "excelample_pivot.sum(axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sum rows\n", "excelample_pivot.sum(axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pivot is just reordering your data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Small subsample of the titanic dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df = pd.DataFrame({'Fare': [7.25, 71.2833, 51.8625, 30.0708, 7.8542, 13.0],\n", " 'Pclass': [3, 1, 1, 2, 3, 2],\n", " 'Sex': ['male', 'female', 'male', 'female', 'female', 'male'],\n", " 'Survived': [0, 1, 0, 1, 0, 1]})" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df.pivot(index='Pclass', columns='Sex', values='Fare')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df.pivot(index='Pclass', columns='Sex', values='Survived')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far, so good..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now use the full titanic dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df = pd.read_csv(\"data/titanic.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And try the same pivot (*no worries about the try-except, this is here just used to catch a loooong error*):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false }, "scrolled": true }, "outputs": [], "source": [ "try:\n", " df.pivot(index='Sex', columns='Pclass', values='Fare')\n", "except Exception as e:\n", " print(\"Exception!\", e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This does not work, because we would end up with multiple values for one cell of the resulting frame, as the error says: `duplicated` values for the columns in the selection. As an example, consider the following rows of our three columns of interest:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[[1, 3], [\"Sex\", 'Pclass', 'Fare']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since `pivot` is just restructuring data, where would both values of `Fare` for the same combination of `Sex` and `Pclass` need to go?\n", "\n", "Well, they need to be combined, according to an `aggregation` functionality, which is supported by the function`pivot_table`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "NOTE:\n", "\n", "
    \n", "
  • **Pivot** is purely restructuring: a single value for each index/column combination is required.
  • \n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pivot tables - aggregating while pivoting" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df = pd.read_csv(\"data/titanic.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false }, "scrolled": true }, "outputs": [], "source": [ "df.pivot_table(index='Sex', columns='Pclass', values='Fare')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "REMEMBER:\n", "\n", "
    \n", "
  • By default, `pivot_table` takes the **mean** of all values that would end up into one cell. However, you can also specify other aggregation functions using the `aggfunc` keyword.
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df.pivot_table(index='Sex', columns='Pclass', \n", " values='Fare', aggfunc='max')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df.pivot_table(index='Sex', columns='Pclass', \n", " values='Fare', aggfunc='count')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "REMEMBER:\n", "\n", "
    \n", "
  • There is a shortcut function for a `pivot_table` with a `aggfunc=count` as aggregation: `crosstab`
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(index=df['Sex'], columns=df['Pclass'])" ] }, { "cell_type": "markdown", "metadata": { "clear_cell": false }, "source": [ "
\n", "\n", "EXERCISE:\n", "\n", "
    \n", "
  • Make a pivot table with the survival rates (= number of persons survived / total number of persons) for Pclass vs Sex.
  • \n", "
  • Plot the result as a bar plot.
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data20.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data21.py" ] }, { "cell_type": "markdown", "metadata": { "clear_cell": false }, "source": [ "
\n", "\n", "EXERCISE:\n", "\n", "
    \n", "
  • Make a table of the median Fare payed by aged/underaged vs Sex.
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data22.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data23.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Melt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `melt` function performs the inverse operation of a `pivot`. This can be used to make your frame longer, i.e. to make a *tidy* version of your data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pivoted = df.pivot_table(index='Sex', columns='Pclass', values='Fare').reset_index()\n", "pivoted.columns.name = None" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pivoted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assume we have a DataFrame like the above. The observations (the average Fare people payed) are spread over different columns. In a tidy dataset, each observation is stored in one row. To obtain this, we can use the `melt` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.melt(pivoted)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see above, the `melt` function puts all column labels in one column, and all values in a second column.\n", "\n", "In this case, this is not fully what we want. We would like to keep the 'Sex' column separately:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.melt(pivoted, id_vars=['Sex']) #, var_name='Pclass', value_name='Fare')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reshaping with `stack` and `unstack`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The docs say:\n", "\n", "> Pivot a level of the (possibly hierarchical) column labels, returning a\n", "DataFrame (or Series in the case of an object with a single level of\n", "column labels) having a hierarchical index with a new inner-most level\n", "of row labels.\n", "\n", "Indeed... \n", "\n", "\n", "Before we speak about `hierarchical index`, first check it in practice on the following dummy example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "df = pd.DataFrame({'A':['one', 'one', 'two', 'two'], \n", " 'B':['a', 'b', 'a', 'b'], \n", " 'C':range(4)})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use `stack`/`unstack`, we need the values we want to shift from rows to columns or the other way around as the index:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df = df.set_index(['A', 'B']) # Indeed, you can combine two indices\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "result = df['C'].unstack()\n", "result" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df = result.stack().reset_index(name='C')\n", "df" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "source": [ "
\n", "\n", "REMEMBER:\n", "\n", "
    \n", "
  • **stack**: make your data *longer* and *smaller*
  • \n", "
  • **unstack**: make your data *shorter* and *wider*
  • \n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mimick pivot table " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To better understand and reason about pivot tables, we can express this method as a combination of more basic steps. In short, the pivot is a convenient way of expressing the combination of a `groupby` and `stack/unstack`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df = pd.read_csv(\"data/titanic.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false }, "scrolled": true }, "outputs": [], "source": [ "df.pivot_table(index='Pclass', columns='Sex', \n", " values='Survived', aggfunc='mean')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "EXERCISE:\n", "\n", "
    \n", "
  • Get the same result as above based on a combination of `groupby` and `unstack`
  • \n", "
  • First use `groupby` to calculate the survival ratio for all groups
  • \n", "
  • Then, use `unstack` to reshape the output of the groupby operation
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data37.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mimick melt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like the pivot table above, we can now also obtain the result of `melt` with stack/unstack.\n", "\n", "Let's use the same `pivoted` frame as above, and look at the final melt result:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pivoted = df.pivot_table(index='Sex', columns='Pclass', values='Fare').reset_index()\n", "pivoted.columns.name = None\n", "pivoted" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.melt(pivoted, id_vars=['Sex'], var_name='Pclass', value_name='Fare')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "EXERCISE:\n", "\n", "
    \n", "
  • Get the same result as above using `stack`/`unstack` (combined with `set_index` / `reset_index`)
  • \n", "
  • Tip: set those columns as the index that you do not want to stack
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data40.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data41.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data42.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data43.py" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Exercises: use the reshaping methods with the movie data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These exercises are based on the [PyCon tutorial of Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial/) (so credit to him!) and the datasets he prepared for that. You can download these data from here: [`titles.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKajNMa1pfSzN6Q3M) and [`cast.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKal9UYTJSR2ZhSW8) and put them in the `/data` folder." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "cast = pd.read_csv('data/cast.csv')\n", "cast.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "titles = pd.read_csv('data/titles.csv')\n", "titles.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "EXERCISE:\n", "\n", "
    \n", "
  • Plot the number of actor roles each year and the number of actress roles each year over the whole period of available movie data.
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data46.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data47.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data48.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "EXERCISE:\n", "\n", "
    \n", "
  • Plot the number of actor roles each year and the number of actress roles each year. Use kind='area' as plot type
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data49.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "EXERCISE:\n", "\n", "
    \n", "
  • Plot the fraction of roles that have been 'actor' roles each year over the whole period of available movie data.
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data50.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "EXERCISE:\n", "\n", "
    \n", "
  • Define a year as a \"Superman year\" when films of that year feature more Superman characters than Batman characters. How many years in film history have been Superman years?
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data51.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "# %load snippets/06 - Reshaping data52.py" ] } ], "metadata": { "celltoolbar": "Nbtutor - export exercises", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" }, "nav_menu": {}, "toc": { "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 6, "toc_cell": false, "toc_section_display": "block", "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 1 }