{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Groupby operations"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Some imports:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"try:\n",
" import seaborn\n",
"except ImportError:\n",
" pass\n",
"\n",
"pd.options.display.max_rows = 10"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Some 'theory': the groupby operation (split-apply-combine)\n",
"\n",
"The \"group by\" concept: we want to **apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets**\n",
"\n",
"This operation is also referred to as the \"split-apply-combine\" operation, involving the following steps:\n",
"\n",
"* **Splitting** the data into groups based on some criteria\n",
"* **Applying** a function to each group independently\n",
"* **Combining** the results into a data structure\n",
"\n",
"
\n",
"\n",
"Similar to SQL `GROUP BY`"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The example of the image in pandas syntax:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],\n",
" 'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the filtering and reductions operations we have seen in the previous notebooks, we could do something like:\n",
"\n",
"\n",
" df[df['key'] == \"A\"].sum()\n",
" df[df['key'] == \"B\"].sum()\n",
" ...\n",
"\n",
"But pandas provides the `groupby` method to do this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"df.groupby('key').aggregate(np.sum) # 'sum'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.groupby('key').sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And many more methods are available. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## And now applying this on some real data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We go back to the titanic survival data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = pd.read_csv(\"data/titanic.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"