{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# groupby\n",
    "\n",
    "**With `groupby`, you can group data in a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and apply calculations on those groups in various ways.**\n",
    "\n",
    "\n",
    "\n",
    "This Cheatbook (Cheatsheet + Notebook) introduces you to the core functionality of pandas' [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function. [Here can find the executable Jupyter Notebook version to directly play around with it](https://mybinder.org/v2/gh/feststelltaste/software-analytics/master?filepath=cheatbooks%2Fgroupby.ipynb)!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "Here you can find out more about this function.\n",
    "\n",
    "* [API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)\n",
    "* [Pandas Grouper and Agg Functions Explained](https://pbpython.com/pandas-grouper-agg.html)\n",
    "* [Understanding the Transform Function in Pandas](https://pbpython.com/pandas_transform.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example Scenario\n",
    "This is an excerpt of a file list from a directory with the following information as separate columns /  [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html):\n",
    "\n",
    "* `file`: The name of the file\n",
    "* `dir`: The name of the directory where the file lives in\n",
    "* `bytes`: The size of the file in bytes\n",
    "\n",
    "This data is stored into a pandas' DataFrame named `df`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>file</th>\n",
       "      <th>dir</th>\n",
       "      <th>bytes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>hello.java</td>\n",
       "      <td>src</td>\n",
       "      <td>54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>tutorial.md</td>\n",
       "      <td>docs</td>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>controller.java</td>\n",
       "      <td>src</td>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>build.sh</td>\n",
       "      <td>src</td>\n",
       "      <td>78</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>deploy.sh</td>\n",
       "      <td>src</td>\n",
       "      <td>62</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              file   dir  bytes\n",
       "0       hello.java   src     54\n",
       "1      tutorial.md  docs    124\n",
       "2  controller.java   src     36\n",
       "3         build.sh   src     78\n",
       "4        deploy.sh   src     62"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "df = pd.DataFrame({\n",
    "    \"file\" : ['hello.java', 'tutorial.md', 'controller.java', \"build.sh\", \"deploy.sh\"],\n",
    "    \"dir\" : [\"src\", \"docs\", \"src\", \"src\", \"src\"],\n",
    "    \"bytes\" : [54, 124, 36, 78, 62]\n",
    "    })\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## When to use it\n",
    "`groupby` is a great way to summarize data in a specific way to build a more higher-level view on your data (e.g., to go from code level to module level).\n",
    "E.g., in our scenario, we could count the number of files per directory.\n",
    "Let's take a look at this use case step by step."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Principles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use the `groupby` function on our DataFrame `df`.\n",
    "As parameter, you can put in the name (or a list of names) of the Series you want to group.\n",
    "In our case, we want to group the directories / the Series `dir`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb30ce49520>"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('dir')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This gives you a [GroupBy](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) object. We can take a look at the built groups by inspecting the `groups` object of the GroupBy object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'docs': Int64Index([1], dtype='int64'),\n",
       " 'src': Int64Index([0, 2, 3, 4], dtype='int64')}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('dir').groups"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `groups` object shows you the groups and their members, using their indexes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Aggregating Values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have built some groups, but now what? The next step is to decide what we want to do with the values that belong to a group. This means we need to tell the GroupBy object how we want to group the values. We can apply a [multitude of aggregating functions](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#computations-descriptive-stats) here, e.g."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* `count`: count the number of entries of each group"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>file</th>\n",
       "      <th>bytes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dir</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>docs</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>src</th>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      file  bytes\n",
       "dir              \n",
       "docs     1      1\n",
       "src      4      4"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('dir').count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* `first`: take the first entry of each group"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>file</th>\n",
       "      <th>bytes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dir</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>docs</th>\n",
       "      <td>tutorial.md</td>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>src</th>\n",
       "      <td>hello.java</td>\n",
       "      <td>54</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             file  bytes\n",
       "dir                     \n",
       "docs  tutorial.md    124\n",
       "src    hello.java     54"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('dir').first()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* `max`: take the entry with the highest value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>file</th>\n",
       "      <th>bytes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dir</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>docs</th>\n",
       "      <td>tutorial.md</td>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>src</th>\n",
       "      <td>hello.java</td>\n",
       "      <td>78</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             file  bytes\n",
       "dir                     \n",
       "docs  tutorial.md    124\n",
       "src    hello.java     78"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('dir').max()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* `sum`: sum up all values within one group"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>bytes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dir</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>docs</th>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>src</th>\n",
       "      <td>230</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      bytes\n",
       "dir        \n",
       "docs    124\n",
       "src     230"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('dir').sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This gives us the number of bytes of all files that reside in a directory. Note that there is no more `file` Series because it doesn't contain any values we could sum up. So this Series was thrown away."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also apply [dedicated functions](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#function-application) on each group using e.g.,\n",
    "\n",
    "* `agg`: apply a variety of aggregating functions on the groups (e.g., building the sum as well as counting the values at once)\n",
    "* `apply`: apply a custom function on each group to execute calculations as you like\n",
    "* `transform`: calculate summarizing values for each group (e.g., the sum of all entries for each group)\n",
    "\n",
    "We'll see these operations later on!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## More Advanced Use Cases\n",
    "Let's dig deeper into our example scenario.\n",
    "We want to find out which kind of files occupy what space in which directory.\n",
    "For this, we extract the files' extensions from the `file` series.\n",
    "We use the string `split` function to split by the `.` sign and keep just the last piece of the split file name (which is the file's extension)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>file</th>\n",
       "      <th>dir</th>\n",
       "      <th>bytes</th>\n",
       "      <th>ext</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>hello.java</td>\n",
       "      <td>src</td>\n",
       "      <td>54</td>\n",
       "      <td>java</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>tutorial.md</td>\n",
       "      <td>docs</td>\n",
       "      <td>124</td>\n",
       "      <td>md</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>controller.java</td>\n",
       "      <td>src</td>\n",
       "      <td>36</td>\n",
       "      <td>java</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>build.sh</td>\n",
       "      <td>src</td>\n",
       "      <td>78</td>\n",
       "      <td>sh</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>deploy.sh</td>\n",
       "      <td>src</td>\n",
       "      <td>62</td>\n",
       "      <td>sh</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              file   dir  bytes   ext\n",
       "0       hello.java   src     54  java\n",
       "1      tutorial.md  docs    124    md\n",
       "2  controller.java   src     36  java\n",
       "3         build.sh   src     78    sh\n",
       "4        deploy.sh   src     62    sh"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['ext'] = df[\"file\"].str.split(\".\").str[-1]\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can then group this data in a more sophisticated way by using two Series for our groups.\n",
    "We sum up the numeric values (= the `bytes`) for each file for each group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>bytes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dir</th>\n",
       "      <th>ext</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>docs</th>\n",
       "      <th>md</th>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">src</th>\n",
       "      <th>java</th>\n",
       "      <td>90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sh</th>\n",
       "      <td>140</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           bytes\n",
       "dir  ext        \n",
       "docs md      124\n",
       "src  java     90\n",
       "     sh      140"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dir_ext_bytes = df.groupby(['dir', 'ext']).sum()\n",
    "dir_ext_bytes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Last, we want to calculate the ratio of the files' bytes for each extension.\n",
    "We first calculate the overall size for each extension in each directory by using `transform`.\n",
    "The `transform` function doesn't compute results for each value of a group.\n",
    "Instead, it provides results for all values of a group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>bytes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dir</th>\n",
       "      <th>ext</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>docs</th>\n",
       "      <th>md</th>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">src</th>\n",
       "      <th>java</th>\n",
       "      <td>230</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sh</th>\n",
       "      <td>230</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           bytes\n",
       "dir  ext        \n",
       "docs md      124\n",
       "src  java    230\n",
       "     sh      230"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bytes_per_dir = dir_ext_bytes.groupby('dir').transform('sum')\n",
    "bytes_per_dir"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In our case, we summed up all the files' bytes of the file extensions per directory.\n",
    "We can add this new information to our existing DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>bytes</th>\n",
       "      <th>all</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dir</th>\n",
       "      <th>ext</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>docs</th>\n",
       "      <th>md</th>\n",
       "      <td>124</td>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">src</th>\n",
       "      <th>java</th>\n",
       "      <td>90</td>\n",
       "      <td>230</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sh</th>\n",
       "      <td>140</td>\n",
       "      <td>230</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           bytes  all\n",
       "dir  ext             \n",
       "docs md      124  124\n",
       "src  java     90  230\n",
       "     sh      140  230"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dir_ext_bytes['all'] = bytes_per_dir\n",
    "dir_ext_bytes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are able to calculate the ratio."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>bytes</th>\n",
       "      <th>all</th>\n",
       "      <th>ratio</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dir</th>\n",
       "      <th>ext</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>docs</th>\n",
       "      <th>md</th>\n",
       "      <td>124</td>\n",
       "      <td>124</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">src</th>\n",
       "      <th>java</th>\n",
       "      <td>90</td>\n",
       "      <td>230</td>\n",
       "      <td>0.391304</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sh</th>\n",
       "      <td>140</td>\n",
       "      <td>230</td>\n",
       "      <td>0.608696</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           bytes  all     ratio\n",
       "dir  ext                       \n",
       "docs md      124  124  1.000000\n",
       "src  java     90  230  0.391304\n",
       "     sh      140  230  0.608696"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dir_ext_bytes['ratio'] = dir_ext_bytes['bytes'] / dir_ext_bytes['all']\n",
    "dir_ext_bytes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result gives us the ratios of file sizes per file type for each directory."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "The `groupby` function is an excellent way to summarize data.\n",
    "It will create a higher-level view of your fine-grained raw data."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}