{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Groupby operations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Some imports:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "try:\n", " import seaborn\n", "except ImportError:\n", " pass\n", "\n", "pd.options.display.max_rows = 10" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Some 'theory': the groupby operation (split-apply-combine)\n", "\n", "The \"group by\" concept: we want to **apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets**\n", "\n", "This operation is also referred to as the \"split-apply-combine\" operation, involving the following steps:\n", "\n", "* **Splitting** the data into groups based on some criteria\n", "* **Applying** a function to each group independently\n", "* **Combining** the results into a data structure\n", "\n", "\n", "\n", "Similar to SQL `GROUP BY`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The example of the image in pandas syntax:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],\n", " 'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the filtering and reductions operations we have seen in the previous notebooks, we could do something like:\n", "\n", "\n", " df[df['key'] == \"A\"].sum()\n", " df[df['key'] == \"B\"].sum()\n", " ...\n", "\n", "But pandas provides the `groupby` method to do this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "df.groupby('key').aggregate(np.sum) # 'sum'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.groupby('key').sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And many more methods are available. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## And now applying this on some real data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We go back to the titanic survival data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_csv(\"data/titanic.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "