{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Split apply and combine" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import addutils.toc ; addutils.toc.js(ipy_notebook=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial we are going to see advanced data management with `pandas` data structures." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "from IPython.display import (display, HTML)\n", "from addutils import side_by_side2\n", "from addutils import css_notebook\n", "css_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Categorizing a data set and applying a function to each group, is often a critical component of a data analysis workflow. After\n", "loading, merging, and preparing a data set, a familiar task is to compute group statistics or possibly pivot tables for reporting or visualization purposes. *pandas* provides a flexible and high-performance groupby facility.\n", "\n", "By *'group by'* we refer to a process involving one or more of the following steps:\n", "\n", "* **Splitting** the data into groups based on some criteria\n", "* **Applying** a function to each group independently\n", "* **Combining** the results into a data structure\n", "\n", "Suppose we are managing a website and we have a log-file with number of *wiews* and *likes* coming from different cities:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateCityViewsLikes
0NEPage104
1KYStone93
2CORye30
3CORye72
4KYDema41
5KYKeavy21
6CORye10
7NECairo83
8CODumont127
\n", "
" ], "text/plain": [ " State City Views Likes\n", "0 NE Page 10 4\n", "1 KY Stone 9 3\n", "2 CO Rye 3 0\n", "3 CO Rye 7 2\n", "4 KY Dema 4 1\n", "5 KY Keavy 2 1\n", "6 CO Rye 1 0\n", "7 NE Cairo 8 3\n", "8 CO Dumont 12 7" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "d1 = pd.read_csv('temp/p07_d1.txt', index_col=0)\n", "d1 = d1.reindex(columns=['State','City','Views','Likes'])\n", "display(d1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1 Groupby" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`groupby` groups DataFrame or Series by a parameter on a given axis:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'NE': [0, 7], 'CO': [2, 3, 6, 8], 'KY': [1, 4, 5]}\n" ] } ], "source": [ "g1 = d1.groupby('State')\n", "print (g1.groups)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The variable `groups` of a `GroupBy` object is a dictionary containing indexes of each group member." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CO\n", " State City Views Likes\n", "2 CO Rye 3 0\n", "3 CO Rye 7 2\n", "6 CO Rye 1 0\n", "8 CO Dumont 12 7\n", "Total Views: 23 - Total Likes: 9\n", "\n", "\n", "KY\n", " State City Views Likes\n", "1 KY Stone 9 3\n", "4 KY Dema 4 1\n", "5 KY Keavy 2 1\n", "Total Views: 15 - Total Likes: 5\n", "\n", "\n", "NE\n", " State City Views Likes\n", "0 NE Page 10 4\n", "7 NE Cairo 8 3\n", "Total Views: 18 - Total Likes: 7\n", "\n", "\n" ] } ], "source": [ "for name,group in g1:\n", " print (name)\n", " print (group)\n", " print ('Total Views: %d - Total Likes: %d\\n\\n' %(group['Views'].sum(),\n", " group['Likes'].sum()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possibile to apply a `groupby` over a hierarchical index `DataFrame`" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ViewsLikes
StateCity
NEPage104
KYStone93
CORye30
Rye72
KYDema41
Keavy21
CORye10
NECairo83
CODumont127
\n", "
" ], "text/plain": [ " Views Likes\n", "State City \n", "NE Page 10 4\n", "KY Stone 9 3\n", "CO Rye 3 0\n", " Rye 7 2\n", "KY Dema 4 1\n", " Keavy 2 1\n", "CO Rye 1 0\n", "NE Cairo 8 3\n", "CO Dumont 12 7" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "d2 = d1.set_index(['State','City'])\n", "display(d2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2 Aggregate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data. Here we use `aggregate`. The result of the aggregation will have the group names as the new index along the grouped axis. In the case of multiple keys, the result is a MultiIndex by default, though this can be changed by using the `as_index option`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'NE': [('NE', 'Page'), ('NE', 'Cairo')], 'CO': [('CO', 'Rye'), ('CO', 'Rye'), ('CO', 'Rye'), ('CO', 'Dumont')], 'KY': [('KY', 'Stone'), ('KY', 'Dema'), ('KY', 'Keavy')]}\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ViewsLikes
State
CO239
KY155
NE187
\n", "
" ], "text/plain": [ " Views Likes\n", "State \n", "CO 23 9\n", "KY 15 5\n", "NE 18 7" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g2 = d2.groupby(level=[0])\n", "print (g2.groups)\n", "g2.aggregate(np.sum)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ViewsLikes
StateCity
CODumont127
Rye112
KYDema41
Keavy21
Stone93
NECairo83
Page104
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ViewsLikes
0127
1112
241
321
493
583
6104
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g3 = d2.groupby(level=[0,1])\n", "g4 = d2.groupby(level=[0,1], as_index=False)\n", "HTML(side_by_side2(g3.aggregate(np.sum), g4.aggregate(np.sum)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`aggregate` allows to pass any function that returns a scalar value from a vector and can handle list of functions:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Views
summeanstd
State
CO235.754.856267
KY155.003.605551
NE189.001.414214
\n", "
" ], "text/plain": [ " Views \n", " sum mean std\n", "State \n", "CO 23 5.75 4.856267\n", "KY 15 5.00 3.605551\n", "NE 18 9.00 1.414214" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d1[['State', 'Views']].groupby('State').aggregate([np.sum, np.mean, np.std])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3 Apply" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`apply` will extend the previous concepts to any Python function:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateCityViewsLikes
0NEPage104
1KYStone93
2CORye30
3CORye72
4KYDema41
5KYKeavy21
6CORye10
7NECairo83
8CODumont127
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateCityViewsLikesTot.ViewsLikes[%]
0NEPage1041857.1
1KYStone931560.0
2CORye30230.0
3CORye722322.2
4KYDema411520.0
5KYKeavy211520.0
6CORye10230.0
7NECairo831842.9
8CODumont1272377.8
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.set_option('display.float_format', lambda x: '{:.1f}'.format(x))\n", "\n", "def add_field(group):\n", " group['Tot.Views'] = group['Views'].sum()\n", " group['Likes[%]'] = 100.0*group['Likes']/group['Likes'].sum()\n", " return group\n", "\n", "HTML(side_by_side2(d1, d1.groupby('State').apply(add_field)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4 A pratical example: Normalize by year" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "idx = pd.date_range('1999/5/28', periods=1500, freq='1B')\n", "s1 = pd.Series(np.random.normal(5.5, 2, 1500), idx)\n", "s1 = pd.rolling_mean(s1, 10, 10).dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we define a grouping key for months and one for years:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def my_groupby_key_year(timestamp):\n", " return timestamp.year\n", "\n", "def my_groupby_key_month(timestamp):\n", " return timestamp.month\n", "\n", "def my_normalization(group):\n", " return (group-group.mean())/group.std()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we normalize the data on a monthly base and check mean and std on an yearly base:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
1999-06-105.3
1999-06-115.3
1999-06-145.6
1999-06-156.0
1999-06-166.0
1999-06-175.8
1999-06-186.0
1999-06-215.7
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
1999-06-10-0.8
1999-06-11-0.9
1999-06-14-0.3
1999-06-150.5
1999-06-160.6
1999-06-170.2
1999-06-180.7
1999-06-21-0.0
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meanstd
19990.10.9
2000-0.20.9
2001-0.31.0
20020.11.0
20030.21.1
20040.10.9
20050.40.8
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t1 = s1.groupby(my_groupby_key_month).apply(my_normalization)\n", "\n", "HTML(side_by_side2(s1.head(8),\n", " t1.head(8),\n", " t1.groupby(my_groupby_key_year).aggregate([np.mean, np.std])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5 A practical example: Group and standardize by dimension" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ValueexpensesemployeesDimension
Company
Dapibus Company96 008712478Big
Pede Blandit Congue Company61 562845460Big
Pede Suspendisse Associates54 72866414Small
Dictum Associates16 802649889Big
Dui Cras Pellentesque Ltd93 954504097Big
\n", "
" ], "text/plain": [ " Value expenses employees Dimension\n", "Company \n", "Dapibus Company 96 008 7124 78 Big\n", "Pede Blandit Congue Company 61 562 8454 60 Big\n", "Pede Suspendisse Associates 54 728 6641 4 Small\n", "Dictum Associates 16 802 6498 89 Big\n", "Dui Cras Pellentesque Ltd 93 954 5040 97 Big" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "d3 = pd.read_csv('example_data/company.csv', index_col=0)\n", "display(d3.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the column \"Value\" is made by strings with a space separator we need a simpel intermediate step to convert values from string to floats:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ValueexpensesemployeesDimension
Company
Dapibus Company96008.0712478Big
Pede Blandit Congue Company61562.0845460Big
Pede Suspendisse Associates54728.066414Small
Dictum Associates16802.0649889Big
Dui Cras Pellentesque Ltd93954.0504097Big
\n", "
" ], "text/plain": [ " Value expenses employees Dimension\n", "Company \n", "Dapibus Company 96008.0 7124 78 Big\n", "Pede Blandit Congue Company 61562.0 8454 60 Big\n", "Pede Suspendisse Associates 54728.0 6641 4 Small\n", "Dictum Associates 16802.0 6498 89 Big\n", "Dui Cras Pellentesque Ltd 93954.0 5040 97 Big" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d3['Value'] = d3['Value'].apply(lambda x: float(x.replace(' ', '')))\n", "d3.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Valueexpensesemployees
Dimension
Big49445.45474.961.1
Small60947.84521.431.8
\n", "
" ], "text/plain": [ " Value expenses employees\n", "Dimension \n", "Big 49445.4 5474.9 61.1\n", "Small 60947.8 4521.4 31.8" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d3.groupby('Dimension').mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "Visit [www.add-for.com]() for more tutorials and updates.\n", "\n", "This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }