{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# groupby\n", "\n", "**With `groupby`, you can group data in a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and apply calculations on those groups in various ways.**\n", "\n", "\n", "\n", "This Cheatbook (Cheatsheet + Notebook) introduces you to the core functionality of pandas' [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function. [Here can find the executable Jupyter Notebook version to directly play around with it](https://mybinder.org/v2/gh/feststelltaste/software-analytics/master?filepath=cheatbooks%2Fgroupby.ipynb)!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "Here you can find out more about this function.\n", "\n", "* [API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)\n", "* [Pandas Grouper and Agg Functions Explained](https://pbpython.com/pandas-grouper-agg.html)\n", "* [Understanding the Transform Function in Pandas](https://pbpython.com/pandas_transform.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example Scenario\n", "This is an excerpt of a file list from a directory with the following information as separate columns / [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html):\n", "\n", "* `file`: The name of the file\n", "* `dir`: The name of the directory where the file lives in\n", "* `bytes`: The size of the file in bytes\n", "\n", "This data is stored into a pandas' DataFrame named `df`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
filedirbytes
0hello.javasrc54
1tutorial.mddocs124
2controller.javasrc36
3build.shsrc78
4deploy.shsrc62
\n", "
" ], "text/plain": [ " file dir bytes\n", "0 hello.java src 54\n", "1 tutorial.md docs 124\n", "2 controller.java src 36\n", "3 build.sh src 78\n", "4 deploy.sh src 62" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.DataFrame({\n", " \"file\" : ['hello.java', 'tutorial.md', 'controller.java', \"build.sh\", \"deploy.sh\"],\n", " \"dir\" : [\"src\", \"docs\", \"src\", \"src\", \"src\"],\n", " \"bytes\" : [54, 124, 36, 78, 62]\n", " })\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## When to use it\n", "`groupby` is a great way to summarize data in a specific way to build a more higher-level view on your data (e.g., to go from code level to module level).\n", "E.g., in our scenario, we could count the number of files per directory.\n", "Let's take a look at this use case step by step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Principles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use the `groupby` function on our DataFrame `df`.\n", "As parameter, you can put in the name (or a list of names) of the Series you want to group.\n", "In our case, we want to group the directories / the Series `dir`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('dir')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives you a [GroupBy](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) object. We can take a look at the built groups by inspecting the `groups` object of the GroupBy object." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'docs': Int64Index([1], dtype='int64'),\n", " 'src': Int64Index([0, 2, 3, 4], dtype='int64')}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('dir').groups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `groups` object shows you the groups and their members, using their indexes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregating Values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have built some groups, but now what? The next step is to decide what we want to do with the values that belong to a group. This means we need to tell the GroupBy object how we want to group the values. We can apply a [multitude of aggregating functions](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#computations-descriptive-stats) here, e.g." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `count`: count the number of entries of each group" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
filebytes
dir
docs11
src44
\n", "
" ], "text/plain": [ " file bytes\n", "dir \n", "docs 1 1\n", "src 4 4" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('dir').count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `first`: take the first entry of each group" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
filebytes
dir
docstutorial.md124
srchello.java54
\n", "
" ], "text/plain": [ " file bytes\n", "dir \n", "docs tutorial.md 124\n", "src hello.java 54" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('dir').first()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `max`: take the entry with the highest value" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
filebytes
dir
docstutorial.md124
srchello.java78
\n", "
" ], "text/plain": [ " file bytes\n", "dir \n", "docs tutorial.md 124\n", "src hello.java 78" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('dir').max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `sum`: sum up all values within one group" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bytes
dir
docs124
src230
\n", "
" ], "text/plain": [ " bytes\n", "dir \n", "docs 124\n", "src 230" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('dir').sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives us the number of bytes of all files that reside in a directory. Note that there is no more `file` Series because it doesn't contain any values we could sum up. So this Series was thrown away." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also apply [dedicated functions](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#function-application) on each group using e.g.,\n", "\n", "* `agg`: apply a variety of aggregating functions on the groups (e.g., building the sum as well as counting the values at once)\n", "* `apply`: apply a custom function on each group to execute calculations as you like\n", "* `transform`: calculate summarizing values for each group (e.g., the sum of all entries for each group)\n", "\n", "We'll see these operations later on!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More Advanced Use Cases\n", "Let's dig deeper into our example scenario.\n", "We want to find out which kind of files occupy what space in which directory.\n", "For this, we extract the files' extensions from the `file` series.\n", "We use the string `split` function to split by the `.` sign and keep just the last piece of the split file name (which is the file's extension)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
filedirbytesext
0hello.javasrc54java
1tutorial.mddocs124md
2controller.javasrc36java
3build.shsrc78sh
4deploy.shsrc62sh
\n", "
" ], "text/plain": [ " file dir bytes ext\n", "0 hello.java src 54 java\n", "1 tutorial.md docs 124 md\n", "2 controller.java src 36 java\n", "3 build.sh src 78 sh\n", "4 deploy.sh src 62 sh" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['ext'] = df[\"file\"].str.split(\".\").str[-1]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then group this data in a more sophisticated way by using two Series for our groups.\n", "We sum up the numeric values (= the `bytes`) for each file for each group." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bytes
dirext
docsmd124
srcjava90
sh140
\n", "
" ], "text/plain": [ " bytes\n", "dir ext \n", "docs md 124\n", "src java 90\n", " sh 140" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir_ext_bytes = df.groupby(['dir', 'ext']).sum()\n", "dir_ext_bytes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Last, we want to calculate the ratio of the files' bytes for each extension.\n", "We first calculate the overall size for each extension in each directory by using `transform`.\n", "The `transform` function doesn't compute results for each value of a group.\n", "Instead, it provides results for all values of a group." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bytes
dirext
docsmd124
srcjava230
sh230
\n", "
" ], "text/plain": [ " bytes\n", "dir ext \n", "docs md 124\n", "src java 230\n", " sh 230" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bytes_per_dir = dir_ext_bytes.groupby('dir').transform('sum')\n", "bytes_per_dir" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our case, we summed up all the files' bytes of the file extensions per directory.\n", "We can add this new information to our existing DataFrame." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bytesall
dirext
docsmd124124
srcjava90230
sh140230
\n", "
" ], "text/plain": [ " bytes all\n", "dir ext \n", "docs md 124 124\n", "src java 90 230\n", " sh 140 230" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir_ext_bytes['all'] = bytes_per_dir\n", "dir_ext_bytes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are able to calculate the ratio." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bytesallratio
dirext
docsmd1241241.000000
srcjava902300.391304
sh1402300.608696
\n", "
" ], "text/plain": [ " bytes all ratio\n", "dir ext \n", "docs md 124 124 1.000000\n", "src java 90 230 0.391304\n", " sh 140 230 0.608696" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir_ext_bytes['ratio'] = dir_ext_bytes['bytes'] / dir_ext_bytes['all']\n", "dir_ext_bytes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result gives us the ratios of file sizes per file type for each directory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "The `groupby` function is an excellent way to summarize data.\n", "It will create a higher-level view of your fine-grained raw data." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 2 }