{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Groupby operations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Some imports:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "try:\n", " import seaborn\n", "except ImportError:\n", " pass\n", "\n", "pd.options.display.max_rows = 10" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Some 'theory': the groupby operation (split-apply-combine)\n", "\n", "The \"group by\" concept: we want to **apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets**\n", "\n", "This operation is also referred to as the \"split-apply-combine\" operation, involving the following steps:\n", "\n", "* **Splitting** the data into groups based on some criteria\n", "* **Applying** a function to each group independently\n", "* **Combining** the results into a data structure\n", "\n", "\n", "\n", "Similar to SQL `GROUP BY`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The example of the image in pandas syntax:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],\n", " 'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the filtering and reductions operations we have seen in the previous notebooks, we could do something like:\n", "\n", "\n", " df[df['key'] == \"A\"].sum()\n", " df[df['key'] == \"B\"].sum()\n", " ...\n", "\n", "But pandas provides the `groupby` method to do this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "df.groupby('key').aggregate(np.sum) # 'sum'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.groupby('key').sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And many more methods are available. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## And now applying this on some real data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We go back to the titanic survival data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_csv(\"data/titanic.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Using groupby(), calculate the average age for each sex.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Calculate the average survival ratio for all passengers.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Calculate this survival ratio for all passengers younger that 25 (remember: filtering/boolean indexing).\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Is there a difference in this survival ratio between the sexes? (tip: write the above calculation of the survival ratio as a function)\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Make a bar plot of the survival ratio for the different classes ('Pclass' column).\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are ready, more groupby exercises can be found in the \"Advanded groupby operations\" notebook." ] } ], "metadata": { "celltoolbar": "Nbtutor - export exercises", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }