{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TLDR; I show how you can visualize the knowledge distribution of your source code by mining version control systems.\n",
    "\n",
    "# Introduction\n",
    "In software development, it's all about knowledge &ndash; both technical and the business domain. But we software developers transfer only a small part of this knowledge into code. But code alone isn't enough to get a glimpse of the greater picture and the interrelations of all the different concepts. There will be always developers that know more about some concept as laid down in source code. It's important to make sure that this knowledge is distributed over more than one head. More developers mean more different perspectives on the problem domain leading to a more robust and understandable code bases.\n",
    "\n",
    "How can we get insights about knowledge in code?\n",
    "\n",
    "It's possible to estimate the knowledge distribution by analyzing the version control system. We can use active changes in the code as proxy for \"someone knew what he did\" because otherwise, he wouldn't be able to contribute code at all. To find spots where the knowledge about the code could be improved, we can identify areas in the code that are possibly known by only one developer. This gives you a hint where you should start some pair programming or invest in redocumentation.\n",
    "\n",
    "In this blog post, we approximate the knowledge distribution by counting the number of additions per file that each developer contributed to a software system. I'll show you step by step how you can do this by using Python and [Pandas](http://pandas.pydata.org/).\n",
    "\n",
    "\n",
    "Attribution: The work is heavily inspired by [Adam Tornhill](http://adamtornhill.com/)'s book [\"Your Code as a Crime Scene\"](https://pragprog.com/book/atcrime/your-code-as-a-crime-scene), who did a similar analysis called \"knowledge map\". I use the similar visualization style of a \"bubble chart\" based on his work as well."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Import history\n",
    "For this analysis, you need a log from your Git repository. In this example, we analyze a [fork of the Spring PetClinic project](https://github.com/buschmais/spring-petclinic). \n",
    "\n",
    "To avoid some noise, we add the parameters <tt>--no-merges</tt> and <tt>--no-renames</tt>, too.\n",
    "\n",
    "```bash\n",
    "git log --no-merges --no-renames --numstat --pretty=format:\"%x09%x09%x09%aN\"\n",
    "```\n",
    "\n",
    "We read the log output into a Pandas' <tt>DataFrame</tt> by using the method described in this [blog post](https://www.feststelltaste.de/reading-a-git-repos-commit-history-with-pandas-efficiently/), but slightly modified (because we need less data):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>additions</th>\n",
       "      <th>deletions</th>\n",
       "      <th>path</th>\n",
       "      <th>author</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>docs/README.md</td>\n",
       "      <td>Markus Harrer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>76</td>\n",
       "      <td>0</td>\n",
       "      <td>docs/README.md</td>\n",
       "      <td>Markus Harrer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>290</td>\n",
       "      <td>0</td>\n",
       "      <td>docs/assets/css/style.scss</td>\n",
       "      <td>Markus Harrer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>-</td>\n",
       "      <td>-</td>\n",
       "      <td>docs/documentation/images/class-diagram.png</td>\n",
       "      <td>Markus Harrer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1224</td>\n",
       "      <td>0</td>\n",
       "      <td>docs/documentation/index.html</td>\n",
       "      <td>Markus Harrer</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  additions deletions                                         path  \\\n",
       "1         1         1                               docs/README.md   \n",
       "3        76         0                               docs/README.md   \n",
       "4       290         0                   docs/assets/css/style.scss   \n",
       "5         -         -  docs/documentation/images/class-diagram.png   \n",
       "6      1224         0                docs/documentation/index.html   \n",
       "\n",
       "          author  \n",
       "1  Markus Harrer  \n",
       "3  Markus Harrer  \n",
       "4  Markus Harrer  \n",
       "5  Markus Harrer  \n",
       "6  Markus Harrer  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import git\n",
    "from io import StringIO\n",
    "import pandas as pd\n",
    "\n",
    "# connect to repo\n",
    "git_bin = git.Repo(\"../../buschmais-spring-petclinic/\").git\n",
    "\n",
    "# execute log command\n",
    "git_log = git_bin.execute('git log --no-merges --no-renames --numstat --pretty=format:\"%x09%x09%x09%aN\"')\n",
    "\n",
    "# read in the log\n",
    "git_log = pd.read_csv(StringIO(git_log), sep=\"\\x09\", header=None, names=['additions', 'deletions', 'path','author'])\n",
    "\n",
    "# convert to DataFrame\n",
    "commit_data = git_log[['additions', 'deletions', 'path']].join(git_log[['author']].fillna(method='ffill')).dropna()\n",
    "commit_data.head()"
   ]
  },
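  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the forward fill in the cell above easier to follow, here is a minimal sketch on a hand-written log snippet. The file names and authors are made up for illustration and are not taken from the PetClinic repository, and the snippet is a simplified version of the real log output. Each commit starts with a line that holds only the author in the fourth column. The <tt>--numstat</tt> lines that follow hold additions, deletions and path, but no author.\n",
    "\n",
    "```python\n",
    "from io import StringIO\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical, simplified log excerpt: an author line, then that commit's numstat lines\n",
    "sample_log = (\n",
    "    '\\t\\t\\tAlice\\n'\n",
    "    '3\\t1\\tsrc/A.java\\n'\n",
    "    '-\\t-\\timg/logo.png\\n'\n",
    "    '\\n'\n",
    "    '\\t\\t\\tBob\\n'\n",
    "    '10\\t0\\tsrc/B.java\\n'\n",
    ")\n",
    "\n",
    "raw = pd.read_csv(StringIO(sample_log), sep='\\t', header=None,\n",
    "                  names=['additions', 'deletions', 'path', 'author'])\n",
    "\n",
    "# author rows have no path, file rows have no author:\n",
    "# copy the author downwards, then drop the author-only rows\n",
    "raw['author'] = raw['author'].ffill()\n",
    "sample_commits = raw.dropna()\n",
    "print(sample_commits)\n",
    "```\n",
    "\n",
    "Note that the binary file keeps Git's <tt>-</tt> placeholder in both number columns, so these columns are still text at this point."
   ]
  },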
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting data that matters\n",
    "\n",
    "In this example, we are only interested in Java source code files that still exist in the software project.\n",
    "\n",
    "We can retrieve the existing Java source code files by using Git's <tt>ls-files</tt> combined with a filter for the Java source code file extension. The command will return a plain text string that we split by the line endings to get a list of files. Because we want to combine this information with the other above, we put it into a <tt>DataFrame</tt> with the column name <tt>path</tt>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>path</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                path\n",
       "0  src/main/java/org/springframework/samples/petc...\n",
       "1  src/main/java/org/springframework/samples/petc...\n",
       "2  src/main/java/org/springframework/samples/petc...\n",
       "3  src/main/java/org/springframework/samples/petc...\n",
       "4  src/main/java/org/springframework/samples/petc..."
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "existing_files = pd.DataFrame(git_bin.execute('git ls-files -- *.java').split(\"\\n\"), columns=['path'])\n",
    "existing_files.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next step is to combine the <tt>commit_data</tt> with the <tt>existing_files</tt> information by using Pandas' <tt>merge</tt> function. By default, <tt>merge</tt> will \n",
    "- combine the data by the columns with the same name in each <tt>DataFrame</tt> \n",
    "- only leave those entries that have the same value (using an \"inner join\"). \n",
    "\n",
    "In plain English, <tt>merge</tt> will only leave the still existing Java source code files in the <tt>DataFrame</tt>. This is exactly what we need."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>additions</th>\n",
       "      <th>deletions</th>\n",
       "      <th>path</th>\n",
       "      <th>author</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>src/test/java/org/springframework/samples/petc...</td>\n",
       "      <td>Antoine Rey</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>53</td>\n",
       "      <td>0</td>\n",
       "      <td>src/test/java/org/springframework/samples/petc...</td>\n",
       "      <td>Colin But</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>25</td>\n",
       "      <td>7</td>\n",
       "      <td>src/test/java/org/springframework/samples/petc...</td>\n",
       "      <td>Antoine Rey</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>167</td>\n",
       "      <td>0</td>\n",
       "      <td>src/test/java/org/springframework/samples/petc...</td>\n",
       "      <td>Colin But</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>21</td>\n",
       "      <td>9</td>\n",
       "      <td>src/test/java/org/springframework/samples/petc...</td>\n",
       "      <td>Antoine Rey</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  additions deletions                                               path  \\\n",
       "0         4         5  src/test/java/org/springframework/samples/petc...   \n",
       "1        53         0  src/test/java/org/springframework/samples/petc...   \n",
       "2        25         7  src/test/java/org/springframework/samples/petc...   \n",
       "3       167         0  src/test/java/org/springframework/samples/petc...   \n",
       "4        21         9  src/test/java/org/springframework/samples/petc...   \n",
       "\n",
       "        author  \n",
       "0  Antoine Rey  \n",
       "1    Colin But  \n",
       "2  Antoine Rey  \n",
       "3    Colin But  \n",
       "4  Antoine Rey  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "contributions = pd.merge(commit_data, existing_files)\n",
    "contributions.head()"
   ]
  },
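  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the default behavior of <tt>merge</tt> is new to you, the following small sketch may help. It uses two made-up <tt>DataFrame</tt>s (the names <tt>changes</tt>, <tt>still_there</tt> and <tt>kept</tt> are only for illustration), where one of the changed files no longer exists.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical data: Gone.java was deleted in the meantime\n",
    "changes = pd.DataFrame({\n",
    "    'path': ['A.java', 'B.java', 'Gone.java'],\n",
    "    'additions': [10, 5, 7]})\n",
    "still_there = pd.DataFrame({'path': ['A.java', 'B.java']})\n",
    "\n",
    "# default merge: join on the shared column 'path' and keep only matching rows\n",
    "kept = pd.merge(changes, still_there)\n",
    "print(kept)\n",
    "```\n",
    "\n",
    "The row for <tt>Gone.java</tt> has no partner in <tt>still_there</tt>, so the inner join drops it."
   ]
  },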
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now convert some columns to their correct data types. The columns <tt>additions</tt> and <tt>deletions</tt> columns are representing the added or deleted lines of code as numbers. We have to convert those accordingly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>additions</th>\n",
       "      <th>deletions</th>\n",
       "      <th>path</th>\n",
       "      <th>author</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>src/test/java/org/springframework/samples/petc...</td>\n",
       "      <td>Antoine Rey</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>53</td>\n",
       "      <td>0</td>\n",
       "      <td>src/test/java/org/springframework/samples/petc...</td>\n",
       "      <td>Colin But</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>25</td>\n",
       "      <td>7</td>\n",
       "      <td>src/test/java/org/springframework/samples/petc...</td>\n",
       "      <td>Antoine Rey</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>167</td>\n",
       "      <td>0</td>\n",
       "      <td>src/test/java/org/springframework/samples/petc...</td>\n",
       "      <td>Colin But</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>21</td>\n",
       "      <td>9</td>\n",
       "      <td>src/test/java/org/springframework/samples/petc...</td>\n",
       "      <td>Antoine Rey</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   additions  deletions                                               path  \\\n",
       "0          4          5  src/test/java/org/springframework/samples/petc...   \n",
       "1         53          0  src/test/java/org/springframework/samples/petc...   \n",
       "2         25          7  src/test/java/org/springframework/samples/petc...   \n",
       "3        167          0  src/test/java/org/springframework/samples/petc...   \n",
       "4         21          9  src/test/java/org/springframework/samples/petc...   \n",
       "\n",
       "        author  \n",
       "0  Antoine Rey  \n",
       "1    Colin But  \n",
       "2  Antoine Rey  \n",
       "3    Colin But  \n",
       "4  Antoine Rey  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "contributions['additions'] = pd.to_numeric(contributions['additions'])\n",
    "contributions['deletions'] = pd.to_numeric(contributions['deletions'])\n",
    "contributions.head()"
   ]
  },
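  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The conversion above works because only Java files are left in the data. If binary files were still included, Git's <tt>-</tt> placeholder would make <tt>pd.to_numeric</tt> raise an error. One possible way to handle this, sketched here on made-up values, is the <tt>errors='coerce'</tt> option, which turns entries that can't be parsed into <tt>NaN</tt>.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# made-up numstat values; '-' is what Git reports for binary files\n",
    "raw_additions = pd.Series(['4', '53', '-'])\n",
    "\n",
    "# entries that can't be parsed become NaN instead of raising a ValueError\n",
    "additions = pd.to_numeric(raw_additions, errors='coerce')\n",
    "print(additions)\n",
    "```\n",
    "\n",
    "Whether you then drop those rows or count them as zero depends on what you want to measure. This is not needed for the analysis in this post."
   ]
  },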
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Calculating the knowledge about code\n",
    "We want to estimate the knowledge about code as the proportion of additions to the whole source code file. This means we need to calculate the relative amount of added lines for each developer. To be able to do this, we have to know the sum of all additions for a file.\n",
    "\n",
    "Additionally, we calculate it for deletions as well to easily get the number of lines of code later on.\n",
    "\n",
    "We use an additional <tt>DataFrame</tt> to do these calculations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>path</th>\n",
       "      <th>additions</th>\n",
       "      <th>deletions</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>111</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>70</td>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>67</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>290</td>\n",
       "      <td>137</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>79</td>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                path  additions  deletions\n",
       "0  src/main/java/org/springframework/samples/petc...        111          0\n",
       "1  src/main/java/org/springframework/samples/petc...         70         23\n",
       "2  src/main/java/org/springframework/samples/petc...         67         19\n",
       "3  src/main/java/org/springframework/samples/petc...        290        137\n",
       "4  src/main/java/org/springframework/samples/petc...         79         23"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "contributions_sum = contributions.groupby('path').sum()[['additions', 'deletions']].reset_index()\n",
    "contributions_sum.head()"
   ]
  },
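  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate the idea of a proportion of additions, here is a small worked example with made-up numbers for a single file. It is only a sketch of the arithmetic and takes a shortcut via <tt>transform</tt>. The names <tt>toy</tt>, <tt>per_author</tt> and <tt>share</tt> are hypothetical.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# made-up changes to one file by two authors\n",
    "toy = pd.DataFrame({\n",
    "    'path': ['A.java', 'A.java', 'A.java'],\n",
    "    'author': ['Alice', 'Bob', 'Alice'],\n",
    "    'additions': [30, 10, 10]})\n",
    "\n",
    "# additions per file and author\n",
    "per_author = toy.groupby(['path', 'author'])['additions'].sum()\n",
    "\n",
    "# divide by the total additions of the respective file\n",
    "share = per_author / per_author.groupby(level='path').transform('sum')\n",
    "print(share)\n",
    "```\n",
    "\n",
    "Alice added 40 of the 50 lines ever added to the file, so her share is 0.8, and Bob's is 0.2."
   ]
  },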
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also want to have an indicator about the quantity of the knowledge. This can be achieved if we calculate the lines of code for each file, which is a simple subtraction of the deletions from the additions (be warned: this does only work for simple use cases where there are no heavy renames of files as in our case)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>path</th>\n",
       "      <th>additions</th>\n",
       "      <th>deletions</th>\n",
       "      <th>lines</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>111</td>\n",
       "      <td>0</td>\n",
       "      <td>111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>70</td>\n",
       "      <td>23</td>\n",
       "      <td>47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>67</td>\n",
       "      <td>19</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>290</td>\n",
       "      <td>137</td>\n",
       "      <td>153</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>79</td>\n",
       "      <td>23</td>\n",
       "      <td>56</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                path  additions  deletions  \\\n",
       "0  src/main/java/org/springframework/samples/petc...        111          0   \n",
       "1  src/main/java/org/springframework/samples/petc...         70         23   \n",
       "2  src/main/java/org/springframework/samples/petc...         67         19   \n",
       "3  src/main/java/org/springframework/samples/petc...        290        137   \n",
       "4  src/main/java/org/springframework/samples/petc...         79         23   \n",
       "\n",
       "   lines  \n",
       "0    111  \n",
       "1     47  \n",
       "2     48  \n",
       "3    153  \n",
       "4     56  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "contributions_sum['lines'] = contributions_sum['additions'] - contributions_sum['deletions']\n",
    "contributions_sum.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We combine both <tt>DataFrame</tt>s with a <tt>merge</tt> analog as above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>additions</th>\n",
       "      <th>deletions</th>\n",
       "      <th>path</th>\n",
       "      <th>author</th>\n",
       "      <th>additions_sum</th>\n",
       "      <th>deletions_sum</th>\n",
       "      <th>lines</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>src/test/java/org/springframework/samples/petc...</td>\n",
       "      <td>Antoine Rey</td>\n",
       "      <td>57</td>\n",
       "      <td>5</td>\n",
       "      <td>52</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>53</td>\n",
       "      <td>0</td>\n",
       "      <td>src/test/java/org/springframework/samples/petc...</td>\n",
       "      <td>Colin But</td>\n",
       "      <td>57</td>\n",
       "      <td>5</td>\n",
       "      <td>52</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>25</td>\n",
       "      <td>7</td>\n",
       "      <td>src/test/java/org/springframework/samples/petc...</td>\n",
       "      <td>Antoine Rey</td>\n",
       "      <td>192</td>\n",
       "      <td>7</td>\n",
       "      <td>185</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>167</td>\n",
       "      <td>0</td>\n",
       "      <td>src/test/java/org/springframework/samples/petc...</td>\n",
       "      <td>Colin But</td>\n",
       "      <td>192</td>\n",
       "      <td>7</td>\n",
       "      <td>185</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>21</td>\n",
       "      <td>9</td>\n",
       "      <td>src/test/java/org/springframework/samples/petc...</td>\n",
       "      <td>Antoine Rey</td>\n",
       "      <td>134</td>\n",
       "      <td>9</td>\n",
       "      <td>125</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   additions  deletions                                               path  \\\n",
       "0          4          5  src/test/java/org/springframework/samples/petc...   \n",
       "1         53          0  src/test/java/org/springframework/samples/petc...   \n",
       "2         25          7  src/test/java/org/springframework/samples/petc...   \n",
       "3        167          0  src/test/java/org/springframework/samples/petc...   \n",
       "4         21          9  src/test/java/org/springframework/samples/petc...   \n",
       "\n",
       "        author  additions_sum  deletions_sum  lines  \n",
       "0  Antoine Rey             57              5     52  \n",
       "1    Colin But             57              5     52  \n",
       "2  Antoine Rey            192              7    185  \n",
       "3    Colin But            192              7    185  \n",
       "4  Antoine Rey            134              9    125  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "contributions_all = pd.merge(\n",
    "    contributions, \n",
    "    contributions_sum, \n",
    "    left_on='path', \n",
    "    right_on='path', \n",
    "    suffixes=['', '_sum'])\n",
    "contributions_all.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Identify knowledge hotspots"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, here comes the key: We group all additions by the file paths and the authors. This gives us all the additions to a file per author. Additionally, we want to keep the sum of all additions as well as the information about the lines of code. Because those are contained in the <tt>DataFrame</tt> multiple times, we just get the first entry for each."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>additions</th>\n",
       "      <th>additions_sum</th>\n",
       "      <th>lines</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>path</th>\n",
       "      <th>author</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>src/main/java/org/springframework/samples/petclinic/PetclinicInitializer.java</th>\n",
       "      <th>Antoine Rey</th>\n",
       "      <td>111</td>\n",
       "      <td>111</td>\n",
       "      <td>111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java</th>\n",
       "      <th>Antoine Rey</th>\n",
       "      <td>3</td>\n",
       "      <td>70</td>\n",
       "      <td>47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Faisal Hameed</th>\n",
       "      <td>1</td>\n",
       "      <td>70</td>\n",
       "      <td>47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Gordon Dickens</th>\n",
       "      <td>14</td>\n",
       "      <td>70</td>\n",
       "      <td>47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Michael Isvy</th>\n",
       "      <td>51</td>\n",
       "      <td>70</td>\n",
       "      <td>47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>boly38</th>\n",
       "      <td>1</td>\n",
       "      <td>70</td>\n",
       "      <td>47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">src/main/java/org/springframework/samples/petclinic/model/NamedEntity.java</th>\n",
       "      <th>Antoine Rey</th>\n",
       "      <td>3</td>\n",
       "      <td>67</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Gordon Dickens</th>\n",
       "      <td>15</td>\n",
       "      <td>67</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Michael Isvy</th>\n",
       "      <td>49</td>\n",
       "      <td>67</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>src/main/java/org/springframework/samples/petclinic/model/Owner.java</th>\n",
       "      <th>Antoine Rey</th>\n",
       "      <td>14</td>\n",
       "      <td>290</td>\n",
       "      <td>153</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                   additions  \\\n",
       "path                                               author                      \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey           111   \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey             3   \n",
       "                                                   Faisal Hameed           1   \n",
       "                                                   Gordon Dickens         14   \n",
       "                                                   Michael Isvy           51   \n",
       "                                                   boly38                  1   \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey             3   \n",
       "                                                   Gordon Dickens         15   \n",
       "                                                   Michael Isvy           49   \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey            14   \n",
       "\n",
       "                                                                   additions_sum  \\\n",
       "path                                               author                          \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey               111   \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey                70   \n",
       "                                                   Faisal Hameed              70   \n",
       "                                                   Gordon Dickens             70   \n",
       "                                                   Michael Isvy               70   \n",
       "                                                   boly38                     70   \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey                67   \n",
       "                                                   Gordon Dickens             67   \n",
       "                                                   Michael Isvy               67   \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey               290   \n",
       "\n",
       "                                                                   lines  \n",
       "path                                               author                 \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey       111  \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey        47  \n",
       "                                                   Faisal Hameed      47  \n",
       "                                                   Gordon Dickens     47  \n",
       "                                                   Michael Isvy       47  \n",
       "                                                   boly38             47  \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey        48  \n",
       "                                                   Gordon Dickens     48  \n",
       "                                                   Michael Isvy       48  \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey       153  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_contributions = contributions_all.groupby(\n",
    "    ['path', 'author']).agg(\n",
    "    {'additions' : 'sum',\n",
    "     'additions_sum' : 'first',\n",
    "     'lines' : 'first'})\n",
    "grouped_contributions.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are ready to calculate the knowledge \"ownership\". The ownership is the relative amount of additions to all additions of one file per author."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>additions</th>\n",
       "      <th>additions_sum</th>\n",
       "      <th>lines</th>\n",
       "      <th>ownership</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>path</th>\n",
       "      <th>author</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>src/main/java/org/springframework/samples/petclinic/PetclinicInitializer.java</th>\n",
       "      <th>Antoine Rey</th>\n",
       "      <td>111</td>\n",
       "      <td>111</td>\n",
       "      <td>111</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java</th>\n",
       "      <th>Antoine Rey</th>\n",
       "      <td>3</td>\n",
       "      <td>70</td>\n",
       "      <td>47</td>\n",
       "      <td>0.042857</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Faisal Hameed</th>\n",
       "      <td>1</td>\n",
       "      <td>70</td>\n",
       "      <td>47</td>\n",
       "      <td>0.014286</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Gordon Dickens</th>\n",
       "      <td>14</td>\n",
       "      <td>70</td>\n",
       "      <td>47</td>\n",
       "      <td>0.200000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Michael Isvy</th>\n",
       "      <td>51</td>\n",
       "      <td>70</td>\n",
       "      <td>47</td>\n",
       "      <td>0.728571</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                   additions  \\\n",
       "path                                               author                      \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey           111   \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey             3   \n",
       "                                                   Faisal Hameed           1   \n",
       "                                                   Gordon Dickens         14   \n",
       "                                                   Michael Isvy           51   \n",
       "\n",
       "                                                                   additions_sum  \\\n",
       "path                                               author                          \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey               111   \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey                70   \n",
       "                                                   Faisal Hameed              70   \n",
       "                                                   Gordon Dickens             70   \n",
       "                                                   Michael Isvy               70   \n",
       "\n",
       "                                                                   lines  \\\n",
       "path                                               author                  \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey       111   \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey        47   \n",
       "                                                   Faisal Hameed      47   \n",
       "                                                   Gordon Dickens     47   \n",
       "                                                   Michael Isvy       47   \n",
       "\n",
       "                                                                   ownership  \n",
       "path                                               author                     \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey      1.000000  \n",
       "src/main/java/org/springframework/samples/petcl... Antoine Rey      0.042857  \n",
       "                                                   Faisal Hameed    0.014286  \n",
       "                                                   Gordon Dickens   0.200000  \n",
       "                                                   Michael Isvy     0.728571  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_contributions['ownership'] = grouped_contributions['additions'] / grouped_contributions['additions_sum']\n",
    "grouped_contributions.head()"
   ]
  },
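  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the calculation concrete, here is a small worked example in plain Python. It reuses the additions for <tt>BaseEntity.java</tt> from the grouped table above; the variable names are only chosen for illustration.\n",
    "\n",
    "```python\n",
    "# Additions per author for BaseEntity.java, copied from the grouped table\n",
    "additions = {'Antoine Rey': 3, 'Faisal Hameed': 1, 'Gordon Dickens': 14,\n",
    "             'Michael Isvy': 51, 'boly38': 1}\n",
    "\n",
    "# All additions to the file (this is what 'additions_sum' holds)\n",
    "total = sum(additions.values())\n",
    "\n",
    "# Ownership = an author's additions divided by all additions to the file\n",
    "ownership = {author: count / total for author, count in additions.items()}\n",
    "print(total, round(ownership['Michael Isvy'], 6))\n",
    "```\n",
    "\n",
    "The shares of all authors of one file add up to 1, so the ownership values of a file can be read like percentages."
   ]
  },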
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Having this data, we can now extract the author with the highest ownership value for each file. This gives us a list with the knowledge \"holder\" for each file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>author</th>\n",
       "      <th>additions</th>\n",
       "      <th>additions_sum</th>\n",
       "      <th>lines</th>\n",
       "      <th>ownership</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>path</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>src/main/java/org/springframework/samples/petclinic/PetclinicInitializer.java</th>\n",
       "      <td>Antoine Rey</td>\n",
       "      <td>111</td>\n",
       "      <td>111</td>\n",
       "      <td>111</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java</th>\n",
       "      <td>boly38</td>\n",
       "      <td>51</td>\n",
       "      <td>70</td>\n",
       "      <td>47</td>\n",
       "      <td>0.728571</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>src/main/java/org/springframework/samples/petclinic/model/NamedEntity.java</th>\n",
       "      <td>Michael Isvy</td>\n",
       "      <td>49</td>\n",
       "      <td>67</td>\n",
       "      <td>48</td>\n",
       "      <td>0.731343</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>src/main/java/org/springframework/samples/petclinic/model/Owner.java</th>\n",
       "      <td>Michael Isvy</td>\n",
       "      <td>164</td>\n",
       "      <td>290</td>\n",
       "      <td>153</td>\n",
       "      <td>0.565517</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>src/main/java/org/springframework/samples/petclinic/model/Person.java</th>\n",
       "      <td>Michael Isvy</td>\n",
       "      <td>59</td>\n",
       "      <td>79</td>\n",
       "      <td>56</td>\n",
       "      <td>0.746835</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                          author  additions  \\\n",
       "path                                                                          \n",
       "src/main/java/org/springframework/samples/petcl...   Antoine Rey        111   \n",
       "src/main/java/org/springframework/samples/petcl...        boly38         51   \n",
       "src/main/java/org/springframework/samples/petcl...  Michael Isvy         49   \n",
       "src/main/java/org/springframework/samples/petcl...  Michael Isvy        164   \n",
       "src/main/java/org/springframework/samples/petcl...  Michael Isvy         59   \n",
       "\n",
       "                                                    additions_sum  lines  \\\n",
       "path                                                                       \n",
       "src/main/java/org/springframework/samples/petcl...            111    111   \n",
       "src/main/java/org/springframework/samples/petcl...             70     47   \n",
       "src/main/java/org/springframework/samples/petcl...             67     48   \n",
       "src/main/java/org/springframework/samples/petcl...            290    153   \n",
       "src/main/java/org/springframework/samples/petcl...             79     56   \n",
       "\n",
       "                                                    ownership  \n",
       "path                                                           \n",
       "src/main/java/org/springframework/samples/petcl...   1.000000  \n",
       "src/main/java/org/springframework/samples/petcl...   0.728571  \n",
       "src/main/java/org/springframework/samples/petcl...   0.731343  \n",
       "src/main/java/org/springframework/samples/petcl...   0.565517  \n",
       "src/main/java/org/springframework/samples/petcl...   0.746835  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ownerships = grouped_contributions.reset_index().groupby(['path']).max()\n",
    "ownerships.head(5)"
   ]
  },
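  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One thing to keep in mind when picking the top contributor per file: the complete row of the author with the highest ownership has to be kept. A column-wise <tt>groupby(...).max()</tt> takes the maximum of every column on its own, so the <tt>author</tt> column would end up with the alphabetically last name. Here is a minimal sketch with made-up toy data (the file and author names are hypothetical):\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up toy data: two authors contributed to one hypothetical file\n",
    "toy = pd.DataFrame({\n",
    "    'path': ['A.java', 'A.java'],\n",
    "    'author': ['alice', 'bob'],\n",
    "    'ownership': [0.8, 0.2]})\n",
    "\n",
    "# Column-wise maximum: 'author' and 'ownership' are maximized independently\n",
    "columnwise = toy.groupby('path').max()\n",
    "\n",
    "# Row-wise selection: keep the whole row with the highest ownership per file\n",
    "rowwise = toy.loc[toy.groupby('path')['ownership'].idxmax()].set_index('path')\n",
    "\n",
    "print(columnwise.loc['A.java', 'author'])\n",
    "print(rowwise.loc['A.java', 'author'])\n",
    "```\n",
    "\n",
    "In this toy example, the column-wise variant pairs the ownership of 0.8 with the wrong author, while the row-wise variant keeps author and ownership together."
   ]
  },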
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preparing the visualization\n",
    "Reading tables is not as much fun as a good visualization. I find Adam Tornhill's suggestion of an enclosure or bubble chart very good:\n",
    "\n",
    "<img src=\"https://pbs.twimg.com/media/C-fYgvCWsAAB1y8.jpg\" style=\"width: 500px;\"/>\n",
    "\n",
    "*Source: Thorsten Brunzendorf ([@thbrunzendorf](https://twitter.com/thbrunzendorf/status/857892329285451776))*   \n",
    "  \n",
    "\n",
    "The visualization is written in [D3](https://d3js.org/) and just need data in a specific format called \"[flare](http://matthewfieger.com/posts/me/2014/06/27/nesting_data_for_d3.html)\". So let's prepare some data for this!\n",
    "\n",
    "First, we calculate the <tt>responsible</tt> author. We say that an author that contributed more than 70% of the source code is the responsible person that we have to ask if we want to know something about the code. For all the other code parts, we assume that the knowledge is distributed among different heads."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>path</th>\n",
       "      <th>author</th>\n",
       "      <th>additions</th>\n",
       "      <th>additions_sum</th>\n",
       "      <th>lines</th>\n",
       "      <th>ownership</th>\n",
       "      <th>responsible</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>Antoine Rey</td>\n",
       "      <td>111</td>\n",
       "      <td>111</td>\n",
       "      <td>111</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>Antoine Rey</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>boly38</td>\n",
       "      <td>51</td>\n",
       "      <td>70</td>\n",
       "      <td>47</td>\n",
       "      <td>0.728571</td>\n",
       "      <td>boly38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>Michael Isvy</td>\n",
       "      <td>49</td>\n",
       "      <td>67</td>\n",
       "      <td>48</td>\n",
       "      <td>0.731343</td>\n",
       "      <td>Michael Isvy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>Michael Isvy</td>\n",
       "      <td>164</td>\n",
       "      <td>290</td>\n",
       "      <td>153</td>\n",
       "      <td>0.565517</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>Michael Isvy</td>\n",
       "      <td>59</td>\n",
       "      <td>79</td>\n",
       "      <td>56</td>\n",
       "      <td>0.746835</td>\n",
       "      <td>Michael Isvy</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                path        author  additions  \\\n",
       "0  src/main/java/org/springframework/samples/petc...   Antoine Rey        111   \n",
       "1  src/main/java/org/springframework/samples/petc...        boly38         51   \n",
       "2  src/main/java/org/springframework/samples/petc...  Michael Isvy         49   \n",
       "3  src/main/java/org/springframework/samples/petc...  Michael Isvy        164   \n",
       "4  src/main/java/org/springframework/samples/petc...  Michael Isvy         59   \n",
       "\n",
       "   additions_sum  lines  ownership   responsible  \n",
       "0            111    111   1.000000   Antoine Rey  \n",
       "1             70     47   0.728571        boly38  \n",
       "2             67     48   0.731343  Michael Isvy  \n",
       "3            290    153   0.565517          None  \n",
       "4             79     56   0.746835  Michael Isvy  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "plot_data = ownerships.reset_index()\n",
    "plot_data['responsible']  = plot_data['author']\n",
    "plot_data.loc[plot_data['ownership'] <= 0.7, 'responsible']  = \"None\"\n",
    "plot_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we need some colors per author to be able to differ them in our visualization. We use the two classic data analysis libraries for this. We just draw some colors from a color map here for each author."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>author</th>\n",
       "      <th>color</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Antoine Rey</td>\n",
       "      <td>#006837</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>boly38</td>\n",
       "      <td>#39a758</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Michael Isvy</td>\n",
       "      <td>#9dd569</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Tomas Repel</td>\n",
       "      <td>#e3f399</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42</th>\n",
       "      <td>Tejas Metha</td>\n",
       "      <td>#fee999</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          author    color\n",
       "0    Antoine Rey  #006837\n",
       "1         boly38  #39a758\n",
       "2   Michael Isvy  #9dd569\n",
       "9    Tomas Repel  #e3f399\n",
       "42   Tejas Metha  #fee999"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "from matplotlib import cm\n",
    "from matplotlib.colors import rgb2hex\n",
    "\n",
    "authors = plot_data[['author']].drop_duplicates()\n",
    "rgb_colors = [rgb2hex(x) for x in cm.RdYlGn_r(np.linspace(0,1,len(authors)))]\n",
    "authors['color'] = rgb_colors\n",
    "authors.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we combine the colors to the plot data and whiten the minor ownership with all the <tt>None</tt> responsibilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>path</th>\n",
       "      <th>author</th>\n",
       "      <th>additions</th>\n",
       "      <th>additions_sum</th>\n",
       "      <th>lines</th>\n",
       "      <th>ownership</th>\n",
       "      <th>responsible</th>\n",
       "      <th>author_color</th>\n",
       "      <th>color</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>Antoine Rey</td>\n",
       "      <td>111</td>\n",
       "      <td>111</td>\n",
       "      <td>111</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>Antoine Rey</td>\n",
       "      <td>Antoine Rey</td>\n",
       "      <td>#006837</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>boly38</td>\n",
       "      <td>51</td>\n",
       "      <td>70</td>\n",
       "      <td>47</td>\n",
       "      <td>0.728571</td>\n",
       "      <td>boly38</td>\n",
       "      <td>boly38</td>\n",
       "      <td>#39a758</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>Michael Isvy</td>\n",
       "      <td>49</td>\n",
       "      <td>67</td>\n",
       "      <td>48</td>\n",
       "      <td>0.731343</td>\n",
       "      <td>Michael Isvy</td>\n",
       "      <td>Michael Isvy</td>\n",
       "      <td>#9dd569</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>Michael Isvy</td>\n",
       "      <td>164</td>\n",
       "      <td>290</td>\n",
       "      <td>153</td>\n",
       "      <td>0.565517</td>\n",
       "      <td>None</td>\n",
       "      <td>NaN</td>\n",
       "      <td>white</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>src/main/java/org/springframework/samples/petc...</td>\n",
       "      <td>Michael Isvy</td>\n",
       "      <td>59</td>\n",
       "      <td>79</td>\n",
       "      <td>56</td>\n",
       "      <td>0.746835</td>\n",
       "      <td>Michael Isvy</td>\n",
       "      <td>Michael Isvy</td>\n",
       "      <td>#9dd569</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                path        author  additions  \\\n",
       "0  src/main/java/org/springframework/samples/petc...   Antoine Rey        111   \n",
       "1  src/main/java/org/springframework/samples/petc...        boly38         51   \n",
       "2  src/main/java/org/springframework/samples/petc...  Michael Isvy         49   \n",
       "3  src/main/java/org/springframework/samples/petc...  Michael Isvy        164   \n",
       "4  src/main/java/org/springframework/samples/petc...  Michael Isvy         59   \n",
       "\n",
       "   additions_sum  lines  ownership   responsible  author_color    color  \n",
       "0            111    111   1.000000   Antoine Rey   Antoine Rey  #006837  \n",
       "1             70     47   0.728571        boly38        boly38  #39a758  \n",
       "2             67     48   0.731343  Michael Isvy  Michael Isvy  #9dd569  \n",
       "3            290    153   0.565517          None           NaN    white  \n",
       "4             79     56   0.746835  Michael Isvy  Michael Isvy  #9dd569  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "colored_plot_data = pd.merge(\n",
    "    plot_data, authors, \n",
    "    left_on='responsible', \n",
    "    right_on='author', \n",
    "    how='left', \n",
    "    suffixes=['', '_color'])\n",
    "colored_plot_data.loc[colored_plot_data['responsible'] == 'None', 'color'] = \"white\"\n",
    "colored_plot_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualizing\n",
    "\n",
    "The [bubble chart](https://github.com/feststelltaste/software-analytics/blob/master/notebooks/vis/knowledge_islands.html) needs D3's flare format for displaying. We just dump the <tt>DataFrame</tt> data into this hierarchical format. As for hierarchy, we use the Java source files that are structured via directories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import json\n",
    "\n",
    "    \n",
    "json_data = {}\n",
    "json_data['name'] = 'flare'\n",
    "json_data['children'] = []\n",
    "\n",
    "for row in colored_plot_data.iterrows():\n",
    "    series = row[1]\n",
    "    path, filename = os.path.split(series['path'])\n",
    "\n",
    "    last_children = None\n",
    "    children = json_data['children']\n",
    "\n",
    "    for path_part in path.split(\"/\"):\n",
    "        entry = None\n",
    "\n",
    "        for child in children:\n",
    "            if \"name\" in child and child[\"name\"] == path_part:\n",
    "                entry = child\n",
    "        if not entry:\n",
    "            entry = {}\n",
    "            children.append(entry)\n",
    "\n",
    "        entry['name'] = path_part\n",
    "        if not 'children' in entry: \n",
    "            entry['children'] = []\n",
    "\n",
    "        children = entry['children']\n",
    "        last_children = children\n",
    "\n",
    "    last_children.append({\n",
    "        'name' : filename + \" [\" + series['responsible'] + \", \" + \"{:6.2f}\".format(series['ownership']) + \"]\",\n",
    "        'size' :  series['lines'],\n",
    "        'color' : series['color']})\n",
    "\n",
    "with open ( \"vis/flare.json\", mode='w', encoding='utf-8') as json_file:\n",
    "    json_file.write(json.dumps(json_data, indent=3))"
   ]
  },
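  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get a feeling for the target format, here is a minimal sketch that builds a tiny flare-style hierarchy. The helper <tt>add_file</tt> as well as the two file paths and sizes are made up for illustration and are not part of the analysis above; the real file additionally stores a color per leaf.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "def add_file(root, path, size):\n",
    "    # Hypothetical helper: insert one file into nested 'children' lists\n",
    "    *directories, filename = path.split('/')\n",
    "    children = root['children']\n",
    "    for directory in directories:\n",
    "        # Reuse the node of this directory if it already exists\n",
    "        node = next((c for c in children\n",
    "                     if c['name'] == directory and 'children' in c), None)\n",
    "        if node is None:\n",
    "            node = {'name': directory, 'children': []}\n",
    "            children.append(node)\n",
    "        children = node['children']\n",
    "    children.append({'name': filename, 'size': size})\n",
    "\n",
    "# Made-up paths and sizes, only for illustration\n",
    "flare = {'name': 'flare', 'children': []}\n",
    "add_file(flare, 'src/model/Owner.java', 150)\n",
    "add_file(flare, 'src/model/Pet.java', 100)\n",
    "print(json.dumps(flare, indent=2))\n",
    "```\n",
    "\n",
    "Both files end up as leaves under the same <tt>src</tt> / <tt>model</tt> nodes, which is the nesting the bubble chart uses to draw bubbles inside bubbles."
   ]
  },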
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Results\n",
    "You can see the complete, interactive visualization [here](https://www.feststelltaste.de/wp-content/uploads/demos/knowledge_islands/spring-petclinic.html). Just tap at one of the bubbles and you will see how it works.\n",
    "\n",
    "The source code files are ordered hierarchically into bubbles. The size of the bubbles represents the lines of code and the different colors stand for each developer.\n",
    "\n",
    "<a href=\"https://www.feststelltaste.de/wp-content/uploads/demos/knowledge_islands/spring-petclinic.html\">![](./resources/knowledge_island_1.png)</a>\n",
    "\n",
    "On the left side, you can see that there are some red bubbles. Drilling down, we see that one developer did add almost all the code for the tests:\n",
    "\n",
    "![](./resources/knowledge_island_2.png)\n",
    "\n",
    "On the right side, you see that some knowledge is evenly distributed (white bubbles), but there are also some knowledge islands. Especially the <tt>PetClinicInitializer.java</tt> class got my attention because it's big and only one developer knows what's going on here:\n",
    "\n",
    "![](./resources/knowledge_island_3.png)\n",
    "\n",
    "\n",
    "I also did the analysis for the huge repository of [IntelliJ IDEA Community Edition](https://github.com/JetBrains/intellij-community). It contains over 170000 commits for 55391 Java source code files. The visualization works even here (it's just a little bit slow and confusing), but the <tt>flare.json</tt> file is almost 30 MB and therefore it's not practical for viewing it online. But here is the overview picture:\n",
    "\n",
    "![](./resources/knowledge_island_4.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Summary\n",
    "We can quickly create an impression about the knowledge distribution of a software system. With the bubble chart visualization, you can get an overview as well as detailed information about the contributors of your source code.\n",
    "\n",
    "But I want to point out two points against this method:\n",
    "\n",
    "- Renamed or split source files will also get new file names. This will \"reset\" the history for older, renamed files. Thus developers that added code before a rename or split aren't included in the result. But we could argue that they can't remember \"the old code\" anyhow ;-)\n",
    "- We use additions as proxy for knowledge. Developers could also gain knowledge by doing code reviews or working together while coding. We cannot capture those constellations with such a simply analysis.\n",
    "\n",
    "But as you have seen, the analysis can guide you nevertheless and gives you great insights very quickly."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}