{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "Knowing all about the software system we are developing is valuable, but all too rarely the situation we find ourselves in today. In times of software developer shortages and stressful software projects, high turnover in teams quickly leads to lost knowledge about the source code (<scrumcasm>And who needs to document? We do Scrum!</scrumcasm>).\n", "\n", "I already did some calculations of the [knowledge distribution](https://www.feststelltaste.de/knowledge-islands/) of software systems based on Adam Tornhill's marvelous ideas in \"Your Code as a Crime Scene\" ([publisher's site](https://pragprog.com/book/atcrime/your-code-as-a-crime-scene), but my recommendation is to get the [newer book](https://pragprog.com/book/atevol/software-design-x-rays)). It really did its job. But after my experiences with a Subversion to Git migration (where I, who would have guessed, ensured the quality of the migrated data :-) ), I've found another neat idea for another model to find lost knowledge." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The Idea\n", "\n", "First: How do you define knowledge about your code? We could get into a deep philosophical discussion about what knowledge is and whether there is such a thing as knowledge (invite me for a beer if you want to start this discourse ;-)). But let's look at it from a developer's point of view: In my case, for example, every time I want to know who could possibly know something about a particular line of code, say in this piece of code (the meaning of the source code is irrelevant here),\n", "\n", "![](resources/blame_game_source.png)\n", "\n", "I catch myself heavily using the blame features of version control systems. This feature determines the latest change for each line. 
The resulting view gives me very helpful hints about which changes to a source code line happened recently\n", "\n", "![](resources/blame_game_annotate1.png)\n", "\n", "with plenty of information about who made the change and when, for each source code line.\n", "\n", "![](resources/blame_game_annotate2.png)\n", "\n", "With these results, I can take a look at the author of the code and the time stamp of the code change. This gives me enough hints about the circumstances in which the code change happened, and it helps answer the question of whether the original author could still possibly know anything about the source code line. For the latter, the answer depends heavily on how recent the code change is. Here, I'm using Eagleson's law as a heuristic:\n", "\n", "

Eagleson's Law: Any code of your own that you haven't looked at for six or more months might as well have been written by someone else.

-- Programming Wisdom (@CodeWisdom), December 10th 2017
\n", "\n", "\n", "If the code change is older than six months, it's very likely that the knowledge about the source code line is lost completely.\n", "\n", "So there we have it: A reasonable model of code knowledge! If we had this information for all source code files of our software project, we would be able to identify the areas in our software system with lost knowledge (and find out if we are screwed in a challenging situation!).\n", "\n", "So, let's get the data we need for this analysis!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Number Crunching (aka Data Preparation)\n", "*Note: If you can't stand the sight of bad data, skip ahead to the section \"Analyzing the knowledge about Linux\".*\n", "\n", "Getting the data is straightforward: We just need a software system that uses a version control system like Git for managing the code changes. I chose the Git repository of Linux for this demonstration. The reason? It's big! In this repo, we have a snapshot of the last 13 years of Linux kernel development at hand, with over 750000 commits and several million lines of source code.\n", "\n", "We can get our hands on the necessary data with the `git blame` command that we execute for each source code file in the repository. There is a nice little bash command that makes this happen for us:\n", "\n", "```\n", "find . -type d -name \".git\" -prune -o -type f \\( -iname \"*.c\" -o -iname \"*.h\" \\) | xargs -n1 git blame -w -f\n", "```\n", "\n", "It first finds all C programming language source code files (`.h` header files as well as the `.c` program files) and then retrieves the `git blame` information for each source code line in each file. 
This information includes\n", "* the sha id of the commit\n", "* the relative path of the file\n", "* the name of the author\n", "* the commit timestamp\n", "* the source line number\n", "* the source code line itself\n", "\n", "At this point you might ask: \"OK, wait a minute: 13 years of development efforts, with around 750000 commits and (spoiler alert!) 10235 source code files that sum up to 5.6 million lines of code. Are you insane?\"\n", "\n", "Well, kind of. I hadn't thought that retrieving the data set would take so long. But in the end, I didn't care because I did the calculation on a Google Cloud Compute Engine, which was pretty busy for 11 hours on an n1-standard-1 CPU (which cost me 30 cents, donations welcome :-D):\n", "\n", "![](resources/blame_game_gcp.png)\n", "\n", "The result was a 3 GB log file that I packed and downloaded to my computer. And this is where we start the first part of our analysis with Python and Pandas!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrangling the raw data\n", "First, I want to create a nice little comma-separated file with only the data we need for later analysis. This is why we first take the raw data and transform it into something we can call a decent data set.\n", "\n", "I have my experiences with old and big repositories, so I know that it's not going to be easy to read in such a long runner. The nice and bad thing at the same time is that the Linux kernel was developed internationally. This means fun with file encodings! It seems that every nation has its own way of encoding its special character sets. Especially when working with the authors' names, it's really a PITA. You simply cannot read in the dataset with standard means. 
The universal weapon for this is to import the data into a Pandas DataFrame with the encoding `latin-1`, which doesn't care if there are some weird characters in the data because every byte decodes to some character (unfortunately, this garbles characters from other encodings, which is not so important in our case).\n", "\n", "Additionally, there is no good format for outputting the Git blame log in a way that is easy to process. There are some flags for machine-friendly output, but these are multi-line formats (which are...not very nice to work with in Pandas). So in these cases, I like to get my data in a very raw format. There is this other trick of using an unused separator (I prefer `\\u0012`) that makes Pandas read in a text file as one single column.\n", "\n", "So let's do this!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
raw
0linux_blame_temp.log
1NaN
2889d0d42667c9 drivers/scsi/bfa/bfad_drv.h (Ani...
3889d0d42667c9 drivers/scsi/bfa/bfad_drv.h (Ani...
47725ccfda5971 drivers/scsi/bfa/bfad_drv.h (Jin...
\n", "
" ], "text/plain": [ " raw\n", "0 linux_blame_temp.log\n", "1 NaN\n", "2 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h (Ani...\n", "3 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h (Ani...\n", "4 7725ccfda5971 drivers/scsi/bfa/bfad_drv.h (Jin..." ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "PATH = r\"C:\\Users\\Markus\\Downloads\\linux_blame_temp.tar.gz\"\n", "blame_raw = pd.read_csv(PATH, encoding=\"latin-1\", sep=\"\\u0012\", names=[\"raw\"])\n", "blame_raw.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After around 30 seconds, we read in our 3 GB Git blame log (I think it's kind of okayish). Maybe you're wondering why this takes so long. Well, let's see how many entries are in our dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5665949" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(blame_raw)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 5.6 million entries (= source code lines) that needed to be read in.\n", "\n", "In the next code cell, we extract all the data from the `raw` column into a new DataFrame called `blame`. We just need to figure out the right regular expression for this and we are fine. In this step, we also exclude the source code of the blame log because we actually don't need this information for our knowledge loss calculation.\n", "\n", "As always when working with string data, this needs time." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
shapathauthortimestampline
0NaNNaNNaNNaNNaN
1NaNNaNNaNNaNNaN
2889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy2015-11-26 03:54:45 -05002
3889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy2015-11-26 03:54:45 -05003
47725ccfda5971drivers/scsi/bfa/bfad_drv.hJing Huang2009-09-23 17:46:15 -07004
\n", "
" ], "text/plain": [ " sha path author \\\n", "0 NaN NaN NaN \n", "1 NaN NaN NaN \n", "2 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "3 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "4 7725ccfda5971 drivers/scsi/bfa/bfad_drv.h Jing Huang \n", "\n", " timestamp line \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 2015-11-26 03:54:45 -0500 2 \n", "3 2015-11-26 03:54:45 -0500 3 \n", "4 2009-09-23 17:46:15 -0700 4 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blame = \\\n", " blame_raw.raw.str.extract(\n", " r\"(?P<sha>.*?) (?P<path>.*?) \\((?P<author>.* ?) (?P<timestamp>[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} .[0-9]{4}) *(?P<line>[0-9]*)\\) .*\",\n", " expand=True)\n", "blame.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But after a minute we have the data in a nice DataFrame `blame`, too.\n", "\n", "## Fixing wrong entries\n", "In the next step, we have to face some missing data. I think I made a mistake by executing the bash command in the wrong directory. This could explain why there are some missing values. We simply drop the missing data but keep track of the dropped entries to make sure we are working reproducibly." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
shapathauthortimestampline
0NaNNaNNaNNaNNaN
1NaNNaNNaNNaNNaN
5665948NaNNaNNaNNaNNaN
\n", "
" ], "text/plain": [ " sha path author timestamp line\n", "0 NaN NaN NaN NaN NaN\n", "1 NaN NaN NaN NaN NaN\n", "5665948 NaN NaN NaN NaN NaN" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "has_null_value = blame.isnull().any(axis=1)\n", "dropped_entries = blame[has_null_value]\n", "blame = blame[~has_null_value]\n", "dropped_entries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we lost the first nonsense line (because of my mistake) and, with it, the second entry, which held the first Git blame result, as well as the last line (which is negligible in our case). For completeness, let's add the first line manually (because I want to share a clean dataset with you as well!)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
shapathauthortimestampline
2889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy2015-11-26 03:54:45 -05002
3889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy2015-11-26 03:54:45 -05003
47725ccfda5971drivers/scsi/bfa/bfad_drv.hJing Huang2009-09-23 17:46:15 -07004
5889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy2015-11-26 03:54:45 -05005
67725ccfda5971drivers/scsi/bfa/bfad_drv.hJing Huang2009-09-23 17:46:15 -07006
\n", "
" ], "text/plain": [ " sha path author \\\n", "2 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "3 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "4 7725ccfda5971 drivers/scsi/bfa/bfad_drv.h Jing Huang \n", "5 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "6 7725ccfda5971 drivers/scsi/bfa/bfad_drv.h Jing Huang \n", "\n", " timestamp line \n", "2 2015-11-26 03:54:45 -0500 2 \n", "3 2015-11-26 03:54:45 -0500 3 \n", "4 2009-09-23 17:46:15 -0700 4 \n", "5 2015-11-26 03:54:45 -0500 5 \n", "6 2009-09-23 17:46:15 -0700 6 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blame.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We take the second line and alter the data so that we have the first line again (as long as we have good and comprehensible reasons for this, I find this approach OK here)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sha 889d0d42667c9\n", "path drivers/scsi/bfa/bfad_drv.h\n", "author Anil Gurumurthy \n", "timestamp 2015-11-26 03:54:45 -0500\n", "line 1\n", "Name: 2, dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_line = blame.loc[2].copy()\n", "first_line.line = 1\n", "first_line" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We transform our single entry back into a DataFrame, transpose the data and concatenate our `blame` DataFrame to it." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
shapathauthortimestampline
2889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy2015-11-26 03:54:45 -05001
2889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy2015-11-26 03:54:45 -05002
3889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy2015-11-26 03:54:45 -05003
47725ccfda5971drivers/scsi/bfa/bfad_drv.hJing Huang2009-09-23 17:46:15 -07004
5889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy2015-11-26 03:54:45 -05005
\n", "
" ], "text/plain": [ " sha path author \\\n", "2 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "2 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "3 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "4 7725ccfda5971 drivers/scsi/bfa/bfad_drv.h Jing Huang \n", "5 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "\n", " timestamp line \n", "2 2015-11-26 03:54:45 -0500 1 \n", "2 2015-11-26 03:54:45 -0500 2 \n", "3 2015-11-26 03:54:45 -0500 3 \n", "4 2009-09-23 17:46:15 -0700 4 \n", "5 2015-11-26 03:54:45 -0500 5 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blame = pd.concat([pd.DataFrame(first_line).T, blame])\n", "blame.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After this, we do some housekeeping of the data:\n", "* We fix the comma-separated values in the author column by replacing all commas with a semicolon because we want to use the comma as separator for our comma-separated values (CSV) file\n", "* We also change the data type of the time stamp from a string to a `Timestamp` to check its data quality\n", "\n", "To improve performance, we first convert both columns into `Categorical` data to enable working with references instead of manipulating the values for all entries. In the end, this means that the same values need to be changed and converted only once." 
] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 5665947 entries, 2 to 5665947\n", "Data columns (total 5 columns):\n", "sha object\n", "path object\n", "author object\n", "timestamp datetime64[ns]\n", "line object\n", "dtypes: datetime64[ns](1), object(4)\n", "memory usage: 419.4+ MB\n" ] } ], "source": [ "blame.author = pd.Categorical(blame.author)\n", "blame.author = blame.author.str.replace(\",\", \";\")\n", "\n", "blame.timestamp = pd.Categorical(blame.timestamp)\n", "blame.timestamp = pd.to_datetime(blame.timestamp)\n", "blame.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This operation takes a minute, but we now have a decent dataset that we can use further on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we take a look at another possible area of errors: The `timestamp` column." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 5665947\n", "unique 92466\n", "top 2005-04-16 22:20:36\n", "freq 836898\n", "first 2002-04-09 19:14:34\n", "last 2018-04-11 17:26:09\n", "Name: timestamp, dtype: object" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blame.timestamp.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apparently, the Linux Git repository also has some issues with timestamps. The most frequent time stamp is (still) the initial commit by Linus Torvalds in April 2005. There is nothing we can do about that other than mention it here. But there are also changes that are older than that. This is an error caused by wrongly configured clocks of a few developers. This leaves us with no choice other than either deleting the entries with that data or assigning them to the initial commit.\n", "\n", "Because we want to have a complete dataset in our case, we choose the second option. 
Not quite clean, but the best decision we can make in our situation. And always remember:\n", "\n", "

All models are wrong but some are useful

— George Box
\n", "\n", "Let's find the initial commit of Linus Torvalds." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2005-04-16 22:20:36')" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "initial_commit = blame[blame.author == \"Linus Torvalds\"].timestamp.min()\n", "initial_commit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We keep track of all wrong timestamps and set those to the initial commit time stamp." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4955" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "is_wrong_timestamp = blame.timestamp < initial_commit\n", "wrong_timestamps = blame[is_wrong_timestamp]\n", "blame.timestamp = blame.timestamp.clip(initial_commit)\n", "len(wrong_timestamps)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Further, we convert the `timestamp` column to an integer to save some storage and to enable an efficient time stamp conversion when reading the data in later on.\n" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
shapathauthortimestampline
2889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy14485280850000000001
2889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy14485280850000000002
3889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy14485280850000000003
47725ccfda5971drivers/scsi/bfa/bfad_drv.hJing Huang12537531750000000004
5889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy14485280850000000005
\n", "
" ], "text/plain": [ " sha path author \\\n", "2 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "2 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "3 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "4 7725ccfda5971 drivers/scsi/bfa/bfad_drv.h Jing Huang \n", "5 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "\n", " timestamp line \n", "2 1448528085000000000 1 \n", "2 1448528085000000000 2 \n", "3 1448528085000000000 3 \n", "4 1253753175000000000 4 \n", "5 1448528085000000000 5 " ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blame.timestamp = blame.timestamp.astype('int64')\n", "blame.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Last, we store the result in a gzipped CSV file as an intermediate result." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "blame.to_csv(\"C:/Temp/linux_blame.gz\", encoding='utf-8', compression='gzip', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All right, the ugly work is finished! Let's get some insights!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Analyzing the knowledge about Linux" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing the dataset (again)\n", "First, we reimport our newly created data set to check if it's as expected." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
shapathauthortimestampline
0889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy14485280850000000001
1889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy14485280850000000002
2889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy14485280850000000003
37725ccfda5971drivers/scsi/bfa/bfad_drv.hJing Huang12537531750000000004
4889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy14485280850000000005
\n", "
" ], "text/plain": [ " sha path author \\\n", "0 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "1 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "2 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "3 7725ccfda5971 drivers/scsi/bfa/bfad_drv.h Jing Huang \n", "4 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "\n", " timestamp line \n", "0 1448528085000000000 1 \n", "1 1448528085000000000 2 \n", "2 1448528085000000000 3 \n", "3 1253753175000000000 4 \n", "4 1448528085000000000 5 " ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "git_blame = pd.read_csv(\"C:/Temp/linux_blame.gz\")\n", "git_blame.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's first have a look at what we've got here now." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 5665947 entries, 0 to 5665946\n", "Data columns (total 5 columns):\n", "sha object\n", "path object\n", "author object\n", "timestamp int64\n", "line int64\n", "dtypes: int64(2), object(3)\n", "memory usage: 1.3 GB\n" ] } ], "source": [ "git_blame.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have our 5.6 million Git blame log entries, which take up 1.3 GB in memory. We need (as always) to do some data wrangling by applying some data conversions. In this case, we convert the `sha`, `path`, and `author` columns to `Categorical` data (as mentioned, for performance reasons)." 
] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 5665947 entries, 0 to 5665946\n", "Data columns (total 5 columns):\n", "sha category\n", "path category\n", "author category\n", "timestamp int64\n", "line int64\n", "dtypes: category(3), int64(2)\n", "memory usage: 140.3 MB\n" ] } ], "source": [ "git_blame.sha = pd.Categorical(git_blame.sha)\n", "git_blame.path = pd.Categorical(git_blame.path)\n", "git_blame.author = pd.Categorical(git_blame.author)\n", "git_blame.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This brings the data in memory down to 140 MB. Next, we convert the time stamp from nanoseconds to real `Timestamp` data." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 5665947 entries, 0 to 5665946\n", "Data columns (total 5 columns):\n", "sha category\n", "path category\n", "author category\n", "timestamp datetime64[ns]\n", "line int64\n", "dtypes: category(3), datetime64[ns](1), int64(1)\n", "memory usage: 140.3 MB\n" ] } ], "source": [ "git_blame.timestamp = pd.to_datetime(git_blame.timestamp)\n", "git_blame.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This didn't improve the memory usage further, but we now have a nice `Timestamp` data type in the `timestamp` column that makes working with time-based data easy as pie." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate basic data\n", "Alright, time to get some overview of our dataset.\n", "\n", "How many files are we talking about?" 
] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10235" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "git_blame.path.nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the biggest file?" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
shapathauthortimestampline
44010678dbb0dc2cb160drivers/net/ethernet/broadcom/tg3.cPeter Hüwe2013-05-21 12:58:0618336
\n", "
" ], "text/plain": [ " sha path author \\\n", "4401067 8dbb0dc2cb160 drivers/net/ethernet/broadcom/tg3.c Peter Hüwe \n", "\n", " timestamp line \n", "4401067 2013-05-21 12:58:06 18336 " ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "git_blame[git_blame.line == git_blame.line.max()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which source file has the most authors we could ask?" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "path\n", "drivers/pci/quirks.c 145\n", "Name: author, dtype: int64" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "authors_per_file = git_blame.groupby(['path']).author.nunique()\n", "authors_per_file[authors_per_file == authors_per_file.max()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And how many source files have only a single author?" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1903" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "authors_per_file[authors_per_file == 1].count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do some analysis regarding knowledge." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Active knowledge carriers\n", "Let's see (following Eagleson's law) which developers changed the most source code in the last six months.\n", "\n", "We first create the timestamp for six months ago" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2017-10-23 21:20:02.969229')" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "six_months_ago = pd.Timestamp('now') - pd.DateOffset(months=6)\n", "six_months_ago" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and create a new column named `knowing` that flags the more recent changes." ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
shapathauthortimestamplineknowing
0889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy2015-11-26 08:54:451False
1889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy2015-11-26 08:54:452False
2889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy2015-11-26 08:54:453False
37725ccfda5971drivers/scsi/bfa/bfad_drv.hJing Huang2009-09-24 00:46:154False
4889d0d42667c9drivers/scsi/bfa/bfad_drv.hAnil Gurumurthy2015-11-26 08:54:455False
\n", "
" ], "text/plain": [ " sha path author \\\n", "0 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "1 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "2 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "3 7725ccfda5971 drivers/scsi/bfa/bfad_drv.h Jing Huang \n", "4 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "\n", " timestamp line knowing \n", "0 2015-11-26 08:54:45 1 False \n", "1 2015-11-26 08:54:45 2 False \n", "2 2015-11-26 08:54:45 3 False \n", "3 2009-09-24 00:46:15 4 False \n", "4 2015-11-26 08:54:45 5 False " ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "git_blame['knowing'] = git_blame.timestamp >= six_months_ago\n", "git_blame.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at the ratio of known code to unknown code (= code older than six months)." ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0399647225786\n" ] }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAWUAAADuCAYAAADhhRYUAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFINJREFUeJzt3X2UXVV9xvHvnvCWIOQNRJoETggBY3gJCQIFu0BABQ61oIAvoC5WQbGUql1CT6tIqrg8iqAgitRi0arVpbwUORWEJEAixIIQwARoCB5ISIBASEIiBMic/rFPmAmZycy9c+/9nZfns9Zdd3K5d+bJWuHJzj777O2yLENERIqhyzqAiIj0UCmLiBSISllEpEBUyiIiBaJSFhEpEJWyiEiBqJRFRApEpSwiUiAqZRGRAlEpi4gUiEpZRKRAVMoiIgWiUhYRKRCVsohIgaiURUQKRKUsIlIgKmURkQJRKYuIFIhKWUSkQFTKIiIFolIWESkQlbKISIGolEVECkSlLCJSICplEZECUSmLiBSISllEpEBUyiIiBaJSFhEpEJWyiEiBbGMdQKQ/QZTsCOwKvAXYNn9s1+vr/n49DHgNeAXYkD9veqwHXsofa9M4/HPnfkciA3NZlllnkJoIomQbYAKwG75s+3u8NX8e3oFYG4F1wCpgBfA0sDx/bPZ1GofrOpBHak6lLC0VREkXsAcwuddjn/w5wI9my+olekp6GfAYsDB/LEnjsNswm1SESlmaEkTJcOBAYH96SncyMAnY3jCalVfoKelFqKylSSplGVAQJTsA04AZ+eNgYAq6JjEYvcv6XmAe8EAahxtNU0lhqZRlC0GU7AH8Za/HNPwFNGmNdcDv8QU9D5iv+WrZRKUsBFGyC/A+4Djg3cA420S1sxF4kJ6SnpvG4TO2kcSKSrmGgigZBhyGL+HjgOlozXrRLAFuBRJgThqHLxvnkQ5RKddEECXj6SnhY4BRtomkAS8Ds/EFnaRx+JRxHmkjlXKFBVEyDfgIEAJTjeNI69wPXA/ckMbhIusw0loq5YrJL9Kdnj9UxNX3GHAD8PM0Dh+0DiNDp1KugCBKRgOn4Yv4XYCzTSRGHgCuAX6axuFq6zDSHJVySeVrh08EzgCOR0vWpMcrwI34gp6VxqH+Jy8RlXLJBFGyP3AefmQ80jiOFN+TwLXAf6Rx+KRxFhkElXIJBFHigBOAz+FXTog0KgNmAT8Erkvj8FXjPNIPlXKBBVEyAvgE8BlgX+M4Uh3LgcuAq3UnYfGolAsoiJJxwN8DnwTGGMeR6loFXAlckcbhC9ZhxFMpF0gQJQfjpyhOpdxbXEq5rAd+AFyaxuEy6zB1p1IugCBKZgBfxe8/IWLlNeAnwNfTOHzMOkxdqZQNBVHyDuArwAess4j00o2/IeXLaRw+ZB2mblTKBoIoCYAv42/20EZAUlTd+LXOX0zj8DnrMHWhUu6gIEpGAV/ArzOu4+kcUk5rgYuBy7WUrv1Uyh0QRMm2wN8BFwJjjeOINOtx4Pw0Dm+0DlJlKuU2C6LkKOBq/Dl2IlUwG/hsGocPWwepIpVymwRRsjNwCXA22iBIqmcj8O/AhWkcrrQOUyUq5TYIouRE4PvoWCWpvtX4UfOPrINUhUq5hfKz7q7AbywvUic3A59M43CFdZCy03KsFgmi5KPAI6iQpZ5OBBYGUXKGdZCy00h5iPKz767C/6EUEbgOODuNwxetg5SRSnkIgig5AX9b6mjrLCIFswz4eBqHc6yDlI1KuQn5/sZfAi5CKytE+tONX4F0YRqHr1mHKQuVcoPyu/J+gj8hWkQGdjdwsm7VHhyVcgOCKDkAf7T7JOssIiXzJPDXuuFkYFp9MUhBlJwO3IMKWaQZewK/y9fwy1ZopDyAfN+KS/GbCInI0HQDF6RxeKl1kKJSKW9Ffqv0TcCR1llEKuYa4NO6ALgllXI/gigZC9wKzLDOIlJRdwIf1PmAm1M
p9yGIkt2A24H9rLOIVNwS4H1pHC6xDlIUKuU3CaJkAjALmGydRaQmlgJHpnH4J+sgRaDVF70EUTIJmIsKWaSTJgCz8wFR7amUc0GUTAHuwi/dEZHOCoA5QZT8hXUQayplIIiSafiLDrX/AyFiaBJ+xPw26yCWaj+nHETJ24HfAWOss4gIAIuAo+p6okmtR8r5KovfoEIWKZJ3ALfny1Jrp7alHETJjkCCn8sSkWI5APhtECUjrIN0Wi1LOYiSYcAv0I0hIkU2HX84a63UspSBr6OtN0XK4CNBlHzGOkQn1e5CXxAlHwN+bJ1DRAbtdeDoNA7nWgfphFqVchAlh+CXvu1gnUVEGvIMML0Op2XXppSDKBkNPAyMs84iIk25G79UrtI7y9VpTvlyVMgiZXY4cJl1iHarxUg5P+3g19Y5RKQlzkjj8KfWIdql8qWcT1ssBHa3ziIiLbEGmFLV+eU6TF98GxWySJWMBL5jHaJdKj1S1rSFSKWdnMbhjdYhWq2ypRxEySj8tIV2fhOppqfx0xgvWQdppSpPX3wLFbJIlY0DvmgdotUqOVLO90e+H3DWWUSkrV4F9kvjcLF1kFap6kj5IlTIInWwHRVbu1y5kbJGySK1dEwah7OtQ7RCFUfKGiWL1M/51gFapVIjZY2SRWorw88tL7IOMlRVGylrlCxSTw74R+sQrVCZkbJGySK1twHYI43D56yDDEWVRspfRIUsUmfbA+dahxiqSoyUgygZg98Ee1vrLCJiaiV+tPyKdZBmVWWk/EFUyCICuwIftw4xFFUp5Q9bBxCRwjjHOsBQlH76IoiSt+E3JqnKXzAiMnTj0zh82jpEM6pQZKdSjd+HiLROaB2gWVUoM01diMiblbaUB5y+cM5txJ8CvclJWZal/bw3AG7Osmy/FuXbqiBK9gBStBRORDa3HhibxuEG6yCN2mYQ73k5y7JpbU/SnFNQIYvIlnYEjgJuNc7RsKamL5xzgXNurnPu/vxxeB/vmeqc+1/n3ALn3EPOucn562f0ev1q59ywIeQ/cgifFZFqO9E6QDMGU8rD8wJd4Jy7IX/tOeA9WZZNBz4EXNHH584BLs9H2QcDy5xzU/L3H5G/vhE4fQj5Dx3CZ0Wk2ko5r9zs9MW2wJXOuU3Fuk8fn7sH+IJzbjxwfZZli51zxwAzgHudcwDD8QXfsCBKJgK7NfNZEamFiUGUBGkcptZBGjGYUu7L54BngQPxo+0tbmnMsuxnzrnf4/+2utU5dxZ+/vdHWZb9c5M/tzeNkkVkIPviFwOURrNL4kYCK7Is6wY+BmwxL+yc2wt4IsuyK4CbgAOAWcApzrm35u8Z45zbs8kMBzb5ORGpj77+FV9ozZby94BPOOfm43/T6/t4z4eAPzrnFgBvB36cZdki/G5uv3XOPQTcBuzeZIapTX5OROpjsnWARpX2NusgSp4AJlrnEJFCuyWNw+OtQzSilHf0BVEyAgisc4hI4ZVupFzKUgZ2QTeNiMjAgiBKSrWtb1lLeSfrACJSCsOAvaxDNKKspfwW6wAiUhqluvakUhaRqhthHaARZS1lTV+IyGBtZx2gEWUtZY2URWSwtrcO0Ihmb7O2plKWLWzD66/t65Y++c6ux1aOd8+/BuVcgy+t9Xw28tUy7U2kUpbKeJ1ttl2YTdx74caJe49hzQsHdT2+9NCuR9ce1LWYSW7FzqNYN77LZbtY55SOuxauts4waGUt5dKdJiCdtYqRY2d1zxg7q3vGZq/nZb3skK5H10zvKetxXS7b1SiqtN/r1gEaUdZSXmEdQMpJZV1Lf7YO0AiVsggq64p71jpAI1TKIluhsq6EUvWFSlmkCVsr62ldS5Yd2vXImuldj7OXW77TaH+BUWVtp1R9UeatO1fjN9sXKTyVtZm1zFxTqp4o60gZ4BlUylISqxg5dnb39LGzu6dv9rrKuu2aGiU758biT0oCeBv+LNKV+a8PybLs1RZk61OZS3kF/vwtkdJSWbf
dY818KMuyF4BpAM65mcC6LMu+2fs9zp/+7PJj8VqmzKX8KHCUdQiRdhi4rB9de1DX4u5JbvnOo1k3QTfF9OuhVn4z59zewI3APPzhzSc55x7MsmxU/t8/DBybZdlZzrndgKuAPYBu4B+yLJs/0M8ocynPBc6xDiHSSf2V9WjWrup1B6PKukdLSzn3DuDMLMvOcc5trUOvAL6RZdl851wA3AzsN9A3L3Mp32kdQKQoXmTnMbO7p49RWW/h3jZ8zyVZlg3m+x4L7OtnOQAY7ZwbnmXZy1v7UGlLOY3Dp/PDU0t1qoBIJ9W8rJ9j5pq0Dd93fa+vu9n8aLoden3taOKiYGlLOXcnKmWRhg1Q1ssO6Xp09fSuxdkkt3zkKNaNG1bOC4zz2v0Dsizrds696JybDCwBTqZnlcbtwLnAtwCcc9OyLFsw0PcseynfBZxpHUKkKipW1kmHfs4/AbcATwGL6Nm/+VzgKufcmfiunZO/tlWlvXkEIIiSicAT1jlE6qrAZZ0BuzNzTan2vYCSlzJAECVPAROsc4hIjwKU9X3MXPPODv2slir79AX4fzacbR1CRHpsbRpkWteSpf6mmMVMcst3GsW68W0o605NXbRcFUr5P1Epi5TCi+w8Zk73QWPmdB+02ettKOtfDz2tjSpMXzjgcbQKQ6RymizrRcxcM7VjIVus9KUMEETJRcBM6xwi0hkDlPUFzFxziXXGZlWllPfEr8Loss4iInbGsPaZ04bdcUD01atXDvzuYqpEKQMEUfLfwPutc4iIqZ+lcXi6dYihqNLI8rvWAUTE3JXWAYaqSqV8G03unSoilfCHNA7vsQ4xVJUp5TQOM+Ab1jlExExsHaAVKlPKuWuBh61DiEjH3ZPG4a+sQ7RCpUo5jcNu4HzrHCLScZ+3DtAqlSplgDQObwV+a51DRDrmujQO77YO0SqVK+Xc5/GbT4tItb2G3zqzMipZymkcPoyfXxaRavteGodLrEO0UiVLOXchmx/bIiLVshr4inWIVqtsKadxuBz4pnUOEWmbr6Zx+IJ1iFarbCnnvgY8aB1CRFpuPvBt6xDtUJm9L/oTRMkU4D5ghHUWEWmJtcC0NA7/ZB2kHao+UiaNw0eAz1rnEJGW+VRVCxlqMFLeJIiSXwKnWOcQkSG5No3DSp9gX/mRci9n448AF5FyWgycZx2i3WpTymkcrgZOBzZaZxGRhr0KfDiNw3XWQdqtNqUMkMbhPCq4rlGkBv4ljcP7rUN0Qq1KOXcx8BvrECIyaNcBl1mH6JTaXOjrLYiSEcAs4DDrLCKyVXOA49M43GAdpFNqWcoAQZSMAeYBU6yziEifFgBHpnG41jpIJ9W2lAGCKJkA3A2Mt84iIptZAhyRxuGz1kE6rY5zym9I43Ap8F5glXUWEXnDs8D76ljIUPNShjfu+AuBP1tnERHWAsdVbTvORtS+lAHSOJyPv9vvdessIjW2AfibNA4XWAexpFLOpXH4G+AM/EkGItJZG4DT0ji8wzqItVpf6OtLECUnAL8ChltnEamJl4D3q5A9lXIfgig5ArgZGGWdRaTiVuLXIf/BOkhRqJT7EUTJAcAtwO7WWUQq6ingvWkcPmYdpEg0p9yPNA4fwt/x90frLCIV9AfgUBXyllTKW5HG4VPAEcBt1llEKuQm/J16z1gHKSKV8gDyWzxPAK6xziJSAd8BTk7jUCfN90Nzyg0IouRT+MMad7DOIlIy64Bz0zj8sXWQolMpNyiIkv2BX6CNjEQGawHwoTQO/886SBlo+qJBaRw+DBwM/NA6i0gJXAkcpkIePI2UhyCIko8C3wd2ss4iUjAvAn+bxuEN1kHKRqU8REGU7A38HJhhnUWkIO4GPpKvXpIGafpiiNI4fBw4HLjcOouIsQz4Gn65mwq5SRopt1AQJe8BvgtMts4i0mELgU+ncTjXOkjZaaTcQmkc3gbsD1wEvGIcR6QT1gMXANNUyK2hkXKbBFGyF/7K8/HWWUTa5Hr
gs/kJPtIiKuU2C6LkA/gbTiZYZxFpkSeA89I4/B/rIFWk6Ys2S+PwevyNJpegDfSl3DYAFwP7qZDbRyPlDgqiZCr+3v93W2cRadAtwGd0E0j7qZQNBFFyNPCvwLuss4gMYBYwM43DedZB6kKlbCiIkmPx5Xy4dRaRN7kDuCiNw7usg9SNSrkAgih5L76cD7POIrV3F76M77AOUlcq5QIJouQ4fDkfYp1FamcevoxnWwepO5VyAeUnan8JONQ6i1TeXODLaRzebh1EPJVygQVRcihwHnAqsJ1xHKmOV4D/Ar6TxuED1mFkcyrlEgiiZDfgU8A56HRtad5S4CrgB2kcPm8dRvqmUi6RIEq2Bd4PfBJ4D+BsE0kJdOPXGP8bcHMahxuN88gAVMolFUTJROAs4Ew0epYtLcMf9nuN9qYoF5VyyQVRsg1+1HwqcBIw2jaRGHoeuBH4JTBLo+JyUilXSD69cQxwCr6gx9omkg5YCdyAL+I70jh83TiPDJFKuaLyEfTR+II+GdjFNpG00LP0FPGdGhFXi0q5BvKCPoqeEfRupoGkGcvpmZq4K43DbuM80iYq5ZoJosQB++F3qjsaOBIYZRpK+rIKv//EbGB2GoeP2MaRTlEp11wQJV3AdHpK+q+AHU1D1dNL+LvrZuePBzUarieVsmwmv1h4CL6g343fJGm4aahqWg/Mp6eE79NFOgGVsgwgiJJhwD7AQcC0Xo9dLXOVzHPAA8CC/PEAsFgjYemLSlmaEkTJOHoKelNh70W97zLMgMfZvHwXpHG4wjSVlIpKWVomiJKdgKnAJHxB937enWoUdgY8jT88dEn+vOnrhWkcrjPMJhWgUpaOCKJkB2A8/lTvTc+bvh6LXwEyMn8eYRBxPbAaWJM/v4C/VXnpmx7L0jh81SCf1IRKWQonv9i4qaBH9fH1jjQ26s7wpbupcDc93vi1LrJJUaiURUQKpMs6gIiI9FApi4gUiEpZRKRAVMoiIgWiUhYRKRCVsohIgaiURUQKRKUsIlIgKmURkQJRKYuIFIhKWUSkQFTKIiIFolIWESkQlbKISIGolEVECkSlLCJSICplEZECUSmLiBSISllEpEBUyiIiBaJSFhEpEJWyiEiBqJRFRApEpSwiUiAqZRGRAlEpi4gUiEpZRKRAVMoiIgWiUhYRKRCVsohIgaiURUQK5P8BJLOyn5Euo8YAAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "git_blame['knowing'].value_counts().plot.pie(label=\"\")\n", "print(git_blame['knowing'].mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Phew...only 4% of the code changes occured in the last six months. This is what I would call \"challenging\".\n", "\n", "We need to take action! So let's find out which developers did the most changes (maybe sending them some presents can help here :-))." 
] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Anirudh Venkataramanan 18790\n", "Yasunari Takiguchi 12891\n", "Tomer Tayar 6458\n", "Jacopo Mondi 5260\n", "Mauro Carvalho Chehab 4643\n", "Edward Cree 4484\n", "Linus Walleij 3720\n", "Salil Mehta 3719\n", "Tim Harvey 3460\n", "Bryan Whitehead 3364\n", "Name: author, dtype: int64" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top10 = git_blame[git_blame.knowing].author.value_counts().head(10)\n", "top10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Actively developed components\n", "\n", "We can also find out in which code areas knowledge is still available and which ones are the \"no-go areas\" in our code base. For this, we aggregate our data at a higher level by looking at the source code at the component level. In the Linux kernel, we can easily create this view because the first two parts of the source code path mostly indicate the component." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "drivers/media/dvb-frontends/drx39xyj/drxj_map.h 15055\n", "drivers/isdn/hardware/eicon/message.c 14954\n", "drivers/net/ethernet/sfc/mcdi_pcol.h 14534\n", "drivers/net/ethernet/intel/i40e/i40e_main.c 14484\n", "drivers/staging/rdma/hfi1/chip.c 13914\n", "Name: path, dtype: int64" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "git_blame.path.value_counts().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to do some string magic to fetch only the first two parts." ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sha path author \\\n", "0 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "1 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "2 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "3 7725ccfda5971 drivers/scsi/bfa/bfad_drv.h Jing Huang \n", "4 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "\n", " timestamp line knowing component \n", "0 2015-11-26 08:54:45 1 False drivers:scsi \n", "1 2015-11-26 08:54:45 2 False drivers:scsi \n", "2 2015-11-26 08:54:45 3 False drivers:scsi \n", "3 2009-09-24 00:46:15 4 False drivers:scsi \n", "4 2015-11-26 08:54:45 5 False drivers:scsi " ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "git_blame['component'] = git_blame.path.str.split(\"/\", n=2).str[:2].str.join(\":\")\n", "git_blame.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After this, we can group our data by the new `component` column and calculate the ratio of known and not known code by using the `mean` method." ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "component\n", "arch:arc 0.000000\n", "arch:arm 0.000118\n", "arch:i386 0.000000\n", "arch:ia64 0.000000\n", "arch:mips 0.000000\n", "Name: knowing, dtype: float64" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knowledge_per_component = git_blame.groupby('component').knowing.mean()\n", "knowledge_per_component.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create a little visualization of the top 10 known parts by sorting the \"knowledge\" for a component and plotting a a bar chart." 
] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAFZCAYAAABjZm+4AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzt3Xu8HWV97/HPFwJyBESUHKtCCNiojQoCQUFUFO0RvIR6VC6KehDFSwWtVotSBWm1FWvtQTkqL5WbtwReeowUij0QQFEwQe43G+OFFNuGahGvCHzPHzOLrCxWslfI3jPPTL7v12u/9p5Zs+f5vdbe+7ef+T3PPCPbREREv2zWdgARETH9ktwjInooyT0iooeS3CMieijJPSKih5LcIyJ6KMk9IqKHktwjInooyT0ioodmtdXwDjvs4Llz57bVfEREJ1111VV32J491XGtJfe5c+eyfPnytpqPiOgkST+e5LiUZSIieijJPSKih5LcIyJ6KMk9IqKHktwjInpoyuQu6XOS/kPSDet4XZJOkbRC0nWS9pz+MCMiYkNM0nM/AzhwPa8fBMyrP44GPrnxYUVExMaYMrnbvgz42XoOORg4y5UrgIdLevR0BRgRERtuOm5ieixw29D2qnrfT0cPlHQ0Ve+eOXPmrPekc4/7x40O7Ed/+6KNPkdERBdNx4Cqxuwb+9Rt26fZXmB7wezZU949GxERD9J0JPdVwE5D2zsCt0/DeSMi4kGajuS+BHhNPWtmH+BO2w8oyURERHOmrLlL+hLwHGAHSauAE4AtAGx/CjgfeCGwAvg1cORMBRsREZOZMrnbPnyK1w386bRFFBERGy13qEZE9FCSe0REDyW5R0T0UJJ7REQPJblHRPRQkntERA8luUdE9FCSe0REDyW5R0T0UJJ7REQPJblHRPRQkntERA8luUdE9FCSe0REDyW5R0T0UJJ7REQPJblHRPRQkntERA8luUdE9FCSe0REDyW5R0T0UJJ7REQPJblHRPRQkntERA8luUdE9FCSe0REDyW5R0T0UJJ7REQPJblHRPRQkntERA8luUdE9FCSe0RED02U3CUdKOlWSSskHTfm9TmSlkq6WtJ1kl44/aFGRMSkpkzukjYHTgUOAuYDh0uaP3LYXwKLbe8BHAb8n+kONCIiJjdJz/1pwArbK23fDXwZOHjkGAMPq7/eDrh9+kKMiIgNNWuCYx4L3Da0vQp4+sgxJwLfkHQMsDXw/GmJLiIiHpRJeu4as88j24cDZ9jeEXghcLakB5xb0tGSlktavnr16g2PNiIiJjJJcl8F7DS0vSMPLLscBSwGsP0dYCtgh9ET2T7N9gLbC2bPnv3gIo6IiClNktyXAfMk7SJpS6oB0yUjx/wEeB6ApD+iSu7pmkdEtGTK5G77HuCtwIXAzVSzYm6UdJKkhfVh7wTeIOla4EvA/7I9WrqJiIiGTDKgiu3zgfNH9r1/6OubgP2mN7SIiHiwcodqREQPJblHRPRQkntERA8luUdE9FCSe0REDyW5R0T0UJJ7REQPJblHRPRQkntERA8luUdE9FCSe0REDyW5R0T0UJJ7REQPJblHRPRQkntERA8luUdE9FCSe0REDyW5R0T0UJJ7REQPJblHRPRQkntERA8luUdE9FCSe0REDyW5R0T0UJJ7REQPJblHRPRQkntERA8luUdE9FCSe0REDyW5R0T0UJJ7REQPJblHRPRQkntERA9NlNwlHSjpVkkrJB23jmMOkXSTpBslfXF6w4yIiA0xa6oDJG0OnAr8MbAKWCZpie2bho6ZB7wH2M/2zyX995kKOCIipjZJz/1pwArbK23fDXwZOHjkmDcAp9r+OYDt/5jeMCMiYkNMktwfC9
w2tL2q3jfs8cDjJV0u6QpJB447kaSjJS2XtHz16tUPLuKIiJjSJMldY/Z5ZHsWMA94DnA48BlJD3/AN9mn2V5ge8Hs2bM3NNaIiJjQJMl9FbDT0PaOwO1jjvma7d/b/iFwK1Wyj4iIFkyS3JcB8yTtImlL4DBgycgx/xd4LoCkHajKNCunM9CIiJjclMnd9j3AW4ELgZuBxbZvlHSSpIX1YRcC/ynpJmAp8C7b/zlTQUdExPpNORUSwPb5wPkj+94/9LWBd9QfERHRstyhGhHRQ0nuERE9lOQeEdFDSe4RET2U5B4R0UNJ7hERPZTkHhHRQ0nuERE9lOQeEdFDSe4RET2U5B4R0UNJ7hERPZTkHhHRQ0nuERE9lOQeEdFDSe4RET2U5B4R0UNJ7hERPZTkHhHRQ0nuERE9lOQeEdFDSe4RET2U5B4R0UNJ7hERPZTkHhHRQ0nuERE9lOQeEdFDSe4RET2U5B4R0UNJ7hERPZTkHhHRQ0nuERE9NFFyl3SgpFslrZB03HqOe7kkS1owfSFGRMSGmjK5S9ocOBU4CJgPHC5p/pjjtgWOBa6c7iAjImLDTNJzfxqwwvZK23cDXwYOHnPcXwEnA7+dxvgiIuJBmCS5Pxa4bWh7Vb3vfpL2AHayfd76TiTpaEnLJS1fvXr1BgcbERGTmSS5a8w+3/+itBnwMeCdU53I9mm2F9heMHv27MmjjIiIDTJJcl8F7DS0vSNw+9D2tsCTgUsk/QjYB1iSQdWIiPZMktyXAfMk7SJpS+AwYMngRdt32t7B9lzbc4ErgIW2l89IxBERMaUpk7vte4C3AhcCNwOLbd8o6SRJC2c6wIiI2HCzJjnI9vnA+SP73r+OY5+z8WFFRMTGyB2qERE9lOQeEdFDSe4RET2U5B4R0UNJ7hERPZTkHhHRQ0nuERE9lOQeEdFDSe4RET2U5B4R0UNJ7hERPZTkHhHRQ0nuERE9lOQeEdFDSe4RET2U5B4R0UNJ7hERPZTkHhHRQ0nuERE9lOQeEdFDSe4RET2U5B4R0UNJ7hERPZTkHhHRQ7PaDqB4J243Dee4c+PPERGxAdJzj4jooST3iIgeSnKPiOihJPeIiB5Kco+I6KEk94iIHkpyj4jooST3iIgemii5SzpQ0q2SVkg6bszr75B0k6TrJF0kaefpDzUiIiY1ZXKXtDlwKnAQMB84XNL8kcOuBhbY3g04Fzh5ugONiIjJTdJzfxqwwvZK23cDXwYOHj7A9lLbv643rwB2nN4wIyJiQ0yS3B8L3Da0varety5HAReMe0HS0ZKWS1q+evXqyaOMiIgNMkly15h9HnugdASwAPjIuNdtn2Z7ge0Fs2fPnjzKiIjYIJOsCrkK2Gloe0fg9tGDJD0fOB7Y3/bvpie8iIh4MCbpuS8D5knaRdKWwGHAkuEDJO0BfBpYaPs/pj/MiIjYEFMmd9v3AG8FLgRuBhbbvlHSSZIW1od9BNgGOEfSNZKWrON0ERHRgIke1mH7fOD8kX3vH/r6+dMcV0REbITcoRoR0UNJ7hERPZTkHhHRQ0nuERE9lOQeEdFDSe4RET2U5B4R0UNJ7hERPZTkHhHRQ0nuERE9lOQeEdFDSe4RET2U5B4R0UNJ7hERPTTRkr/Rrqec+ZSNPsf1r71+GiKJiK5Izz0iooeS3CMieihlmZjYzU/8o40+xx/dcvM0RBIRU0nPPSKih5LcIyJ6KMk9IqKHUnOPTjn1TRdv9Dn+9FMHTEMkEWVLco94ED566Is3+hzvXHTeRn3/quO+udEx7Pi3z9roc0SZktwjYqOceOKJRZwj1pbkHhGdd9HFj9voczzvgB9s9Dn+YOk1G32Of3vuUzf6HJAB1YiIXkpyj4jooST3iIgeSnKPiOihJPeIiB5Kco+I6KEk94iIHpoouUs6UNKtklZIOm7M6w+RtKh+/UpJc6c70IiImNyUyV3S5sCpwEHAfOBwSfNHDjsK+LntPwQ+Bnx4ugONiIjJTd
JzfxqwwvZK23cDXwYOHjnmYODM+utzgedJ0vSFGRERG0K213+A9HLgQNuvr7dfDTzd9luHjrmhPmZVvf2D+pg7Rs51NHB0vfkE4NaNjH8H4I4pj5pZJcQAZcRRQgxQRhwlxABlxFFCDFBGHNMRw862Z0910CRry4zrgY/+R5jkGGyfBpw2QZsTkbTc9oLpOl9XYygljhJiKCWOEmIoJY4SYigljiZjmKQsswrYaWh7R+D2dR0jaRawHfCz6QgwIiI23CTJfRkwT9IukrYEDgOWjByzBHht/fXLgYs9Vb0nIiJmzJRlGdv3SHorcCGwOfA52zdKOglYbnsJ8FngbEkrqHrsh81k0EOmrcSzEUqIAcqIo4QYoIw4SogByoijhBigjDgai2HKAdWIiOie3KEaEdFDSe4RET2U5B4R0UNJ7htA0n7154e0HUtExPp0MrlL2rqlpk+pP3+npfaLI+laSe+VtPFPKN7wts+uP7+t6bZj/STtMsm+GY5hC0nHSjq3/jhG0hZNxlDHsbWkzYa2N5P00Blvt0uzZSQ9A/gMsI3tOZJ2B95o+y0NtX8FcDPwIqo1dtZi+9gGYlhs+xBJ17P2XcCqQvBuMx3DSDw7A4fWH/cBi4DFtn/SQNs3US1otwR4DiN3Sttu9Ea6+sruRGBnqmnGg5/Jrg20/Q+23y7p64y/O3zhTMcwEs/3bO85su8q23s1GMNngC1Ys+7Vq4F7B0upNBjHFcDzbf+y3t4G+IbtZ8xku5MsP1CSjwEvoL6Jyva1kp7dYPsvBp4PHABc1WC7wwa91Be31P5abP8YOBk4WdI84H1Uq4Ju3kDznwL+CdiV6ucxnNxd72/SZ4E/q2O5t+G2z64//13D7a5F0hOBJwHbSfqfQy89DNiq4XD2tr370PbFkq5tOAaArQaJHcD2L5vouXctuWP7tpEFJxv7I7J9h6RzgMfYPnPKb5iZGH5af3kH8Bvb90l6PPBE4II2YqrX7z+Eqvd+L/DuJtq1fQpwiqRP2n5zE21O4U7brfwMbA86G8upfy/g/iW7mxwjegJVx+PhwEuG9t8FvKHBOADulfQ42z8AkLQrzf/TBfiVpD1tf6+OYy/gNzPdaNfKMucCfw98AtgHOBZYYLupO2IHcSy1/dwm2xwTw1XAs4DtgSuo/qh/bftVDcdxJdWl7znAItsrm2x/KI7dqd4PgMtsX9dg24PywyFUVyxfAX43eH3wR91QLK2UAMbEsa/tVsemJD0POB1YSXVVtzNwpO2lDcexN1UZd7Am16OBQ4f+Ic9Mux1L7jsA/5uqNCLgG8DbbP9nw3F8kGpxtEXArwb7G/4j/p7tPSUdA/w32ydLutr2Hk3FUMfxRNu3NNnmmBiOpVpK+iv1rpcCp9n+eEPtry9Z2PYBTcRRx3KN7adOta+BOB4PfBJ4lO0nS9oNWGj7rxuO4yFUVxMCbrH9uym+Zabi2GIkjt/PeJtdSe715eWxtj9WQCzj/pib/iO+GngL1TjEUfV6P9fbfkpTMdRxPAR4GTCXoTKf7ZMajOE6YF/bv6q3twa+08Lg8q6jVy7j9s1wDJcDx4yUAD5he9+mYqjbvRR4F/DpQYdD0g22n9xA2wfYvnik5n8/218Zt38G49mcahLGXNb+G/n7mWy3MzV32/dKOpgqmbUdS6slmdrbgPcAX60T+65Ao5ebta8Bd1INIrbSK6LqDQ3XUu9l/DMGZtq5wJ4j+84BGpshArwdOEfSWiWABtsfeKjt746Mj93TUNv7Axezds1/wKy5wmvK14HfAtdTzShrRGeSe+1ySZ+gxXIIgKRHAR+iGlg9qH6m7L62P9tA22fbfjWwx/D0trp3OONTMcfY0faBLbQ77HTgSklfrbf/hGrmSiNKmiFie1kdT6MlgDHuqO99MNz/RLefrv9bpoftE+rPRzbR3gR2bPoqEjpUloEyyiF1HBdQJZTjbe9eP6Dk6iZKIgXO7T4N+Ljt65tsd0wcewH7Ub
0fl9m+usG2D6b6h7KQtZ91cBfwZdvfbjCWrajKdc+kSqzfBD5l+7dNxVDHsSvV8rbPAH4O/BA4wvaPGozh/eP2N1kyrOP4MHCR7W802m6XknspJC2zvffwAGZTg1b14OGbqeZw/ysjc7ubuGGmjmNwE9UsYB7VjITf0dLNVCUoZIbIYqp/Kp+vdx0ObG/7FS3FszWwme27Wmj7nUObW1FN0bzZ9usajuOlVD+PzYDfs+Zv5GEz2m4XkrukI2x/XtI7xr0+0wMTY+K5hGoQ8Z/rGSv7AB+2vX+DMbQ6t7u+M3Wd6pubGiPpPNsvXtd2QzHsCHyc6grCwLeoZnOtajCGa0du3Bm7r4E4Wh9oX0dMS2y/oOF2V1Jd2V3vBhNuV2rug7Vktm01ijXeQXX5/bh6dsJsqscLNqbtm3aaTt4TGL1BpukbZqAq1X0RGPSSj6j3/XGDMVwtaR/bVwBIejpweYPtD5Qw0D7qoTR/1zLAvwA3NJnYoSM99wFJWzVdO1yXus4+GLS6tY1BqxJ6qyXFUbe9PbBTkzcxDbU9rtfc6BxzSTdT/V4O1vaZQ7Ue0n00WC5ratrjFDEMr7+0OVUn7CTbn2g4jjOo/qlcwNo3t2Uq5JAbJP071SDRZcDltu9sOohxg1aSGh+0ooze6rh2G42jLpMtpPp9vgZYLelS22PLeDNotaQjgC/V24cDjd5gB7Q9c2ng25Ke0vJA+3AH4x7g3203NR1z2A/rjy3rj0Z0qucOIGkO1W3m+wEvBP6rhbvvShu0aq23WkIcg4FtSa+v2z9B0nUt3MQ0h2ppjMENQ5dT1dxLK2HNuHpW1x9SJbVNeqC9LZ3qudcDVvtRJffdgRupBq2a9oSRy++lani1uVJ6q4XEMUvSo6nWdjm+wXbX4mqZ40aX1h2nkDLZQQ23N1YJ74WkBVS/l4OloAGY6X90nUruVHXEZcCHbL+pxThKGLTazvYv6t7q6YPeasMxlBLHB4ALgW/VN/HsSjWI1agSZsvUWi/XFXS10vp7AXyBaimGRu9Q7dqTmPYAzgJeKek7ks6SdFQLcTydqqb4I0k/onoy0/6Srm8wsQ33Vs9rqM3i4qjX7djJ9m6uH9pie6XtlzUdC9XMmCXAY4DHUt12fnrTQbheFlrS9pJ285plohsl6bz1bTehkPdite0ltn9o+8eDj5lutIs1922oBjKfRTXVzLbnNhxD63O869u530/VW31L3Vv9SNNJrYQ4VMASzHUcra/IOK5MBrQxuIykRw8n09HtBtq/hALeC1VLDx8OXMTas2VmdI2bTpVlJC2nevDAt6kueZ/dxuVf25ecw73VwT5Xa8s0ndiLiIPqKqr1NYeo1lNpe7ZMCWUyYO1eM+0M+JfyXhxJ9TCdLVhTlpnxBcw6ldyBg2yvbjsIaHegxtUKmQtpeYXMUuKgWr8EYPjuR1M9DrFJr6OaLfOxuv1vU/1hN6mIweUMtK9ldze8FDd0JLkPLzsgPXAl16aXH6i1PVBTSm+19ThKKMnUdvLIg6hVPTR7xh8WPqSIwWXK6DWX8l5cIWm+7ZuabLQTyZ01yw48AdibNSvvvYTqZqbGFXDJWUpvtfU41OISzCM+zgPXcx+3b0YUVCaDlnvNhb0XzwReK6nROf+dGlCV9A3gZa5XmJO0LXCOG15PvJSBmqioxSWY6/b3pfon93bWLlE9DHjp6JIEMxxLKYPLGWhfE8fYCRgzPXbXlZ77wBzg7qHtu6lWnWta65ecpfRWC4ljB9uLJb0HwPY9kpp8yv2WwDZUf0/Di9v9goYXlKOAMllBvebW34u6vVYmYHQtuZ8NfFfVE3dM9SDks1qIo4SBmjOoe6v19vepfombLkWUEMevJD2SNU/92YdqRcJG2L4UuFTSGbZ/LGlr189zbUHrZbIMtD9QGxMwOpXcbX9Q0j9R1bAAjnSDT9wZUsJATdu91ZLieCctL8Fce0xdItoGmCNpd+CNg5urmlBCGaLWeq+5oP
cCWpiA0ankXruG6lmMs6BarMnVmh6NKOiSs9Xeaklx2L5K0v60vAQz8A/AC6gH/G1fK+nZTQZQSJkMCug1F/RetDMBw3ZnPoBjgDuoFgy7jmqthutaiGNpAe/FXlTr2dxZf/4+sNumGAdwLfBe4HEt/0yurD9fPRxbwzFcQFUuvLbenkX1BKDW3pcWfx5FvBfAJVSD64+gmhZ7FfD3M91u13rub6NakbHpu/5GlXDJWURvtZA4FgKHAosl3Uf1c1nsBq/oardJegZgSVsCx1I9KKNJJZTJSuk1F/Fe0NIEjK4tHHYb7ZQeRj0DeBLVJedH64+/azIAVUsMvxv4re0b2kjspcThaiGmk23vBbwS2I1qHfGmvQn4U6pFw1YBT623m9R6max2BtW41GPq7e9TTRVtUinvRSuL63Wt574SuETSP9Lg46pGuYyBmlJ6q0XEIWku1R/PocC9VP9wGmX7DuBVTbc7opTB5RJ6zaW8F61MwOjaTUwnjNtv+wMNx1HCJedwPPOA9wGvsr15GzG0GYekK6kWZToHWORqgLtxkk5nzTM772f7dQ3HUcLzfS+hmmTwz7b3rHvNH7a9f8NxtPpe1BMwjrXd+LTQTiX3UrR9R+RQHHNZu7e6yPZHm4yhhDgkPdH2LU21t544hmdMbUV1H8btto9tMIZrqa6eFtn+QVPtjoljL+AU4MnADdS9Zje4TEdB78XSNq72O5XcJS1lfM+o0ZsSJC2zvbfqZ3fW+5pet7uU3mprcUg6wvbnNbSw3LCmy3WjJG0G/L8mfz/rW90PrT/aLNeV0Gsu4r2Q9EFgOxqegNG1mvufD329FdVlXxtPMy9hoOa1JfRWaTeOrevP2673qPbMo1oyozGubnU/GTh5qEz2YaDRcl0JveZS3gtamvPfqeRu+6qRXZdLurSFUFobqBn0VoEXSnrh6OtN9VZLiMP2p+ua5i/aqGmOknQX1R+t6s//BvxFC3HMpeXBZTLQfr+2JmB0KrlLesTQ5mZUN9D8QdNxtDy3u5TeahFxuJx1TLDd9s9ktEz2irbKdSX0mkt5L9qagNG1mvsPWdMzuodqLvNJtr/VcBytXnK2OQJfaByt1DSH2l/veu1N3txWyuAyZKB9KI5WJmB0KrmXooSBmrZG4EuMox5ohzWD7YOHITQykDnU/jiNxFHa4HIG2teKp5UJGF0ry2wBvBkYLMZ0CfDppkfhS7jkpIAlENqOY+iP9zzWXNHdH8JMt39/QwX8k6WQMtmQDLSv0coEjE713CV9hqo3cGa969XAvbZf30Isc2n3krPV3moJcQzd1DZ4/OLX6vZfAlzW9O+FpK2At1AtSW3gm8CnbP+2ofZbL5OV0msu4b0YiqWVOf+d6rkDe3vtR5ZdXNe/G9XmQE0pvdUS4hjcmazq8Yt7es3jF0+k+tk07SzgLqrnpgIcTvWAmVc00Xghg8tF9JoLeS8GsbQyAaNryf1eSY8bDGLWazS0scpbm5ecow8LX6u3ugnGAeU8fvEJI52PpS10Plot1xU2PbWI0mVbEzC6VpZ5HtWo80qqRLIz1dOY1jegNZ3tF3HJWcdSysPCW49D0vFUJbLhxy8usv03TcVQx3EGVRnminr76VQdgcaexFRSua7tsYiC3otWJmB0qudu+6J6AHNweXOL7d9N8W3TqYhLzlopvdXW43D1+MULgGfVuxp9/KKk66kSyBbAayT9pN7eGbipoRhaL5ON2OQH2u9vsKUJGJ1K7pJeAfyT7esk/SVwgqS/3kQvOcc9LPzM9X9Lf+Oofweanik0MKMPOp5QSWUyWHPL/WDF1sFdu030mkt7L1q5U7ZrZZnrbO8m6ZnA31A9IOO9tp/ecBytX3LWcezJmt7qZU32VkuMI9ovkw31mgfJfK1e8yZaumxlzn+neu6sGTx9EfBJ21+rZ0Y0rYiBmpZ7q8XFUQJJ59l+8bq2G9B2maykXnPb78VAKxMwutZzPw/4V+D5VOvK/Ab47s
gMhSbiKGKgJsoj6dGun3Q/bruB9ksZXG6919z2e9H2BIyu9dwPAQ4E/s72f6l6LuG7mmq8tIGaKM8gkUvaHthppm9UGdN+q4PLQ1rvNRfwXrQ6AaNTyd32r4GvDG3/FGisV0RZl5xRGFWPlltI9Xd1DbBa0qW2x/bcZkohZbJNfqC97QkYnSrLQBE1zSIuOaM8g4WhJL2eqtd+wmASQNuxtSED7ZW2JmB0qudee8MU201o/ZIzijSrLhUeAhzfdjBtK+QKogStTMDoXHJvu6ZZK+KSM4rzAeBC4Fu2l9XLY/xLyzFF+1qZ89+pssy4mibQeE2zjiWXnHG/klYhjDK0Pee/az337Wz/oq5pnj6oabYRSC45Y1hJqxBGMVqdgNG15J6aZpSsiJvbogxtL0ndteSemmaUbFBbPWloX1PrqUS5WpmA0ZnkXtc0dxqeVlav0fCy9qKKWKOE9YaiSK1MwOjagGoRC3ZFjCPpUcCHgMfYPkjSfGBf259tObRoWRsTMLqW3D8IbEdqmlGg+lb304Hjbe8uaRZwte2ntBxabIK6ltzHPXEpC3ZFESQts7334E7Vet81tp/admyx6elMzR1S04zi/UrSI6kXkZO0D3BnuyHFpmqztgPYEJIeJemz9eUvkuZLOqrtuCJq7wSWAI+TdDlwFnBMuyHFpqprZZnUNKNo9e/k4Bm/t9r+fcshxSaqUz13YAfbi6meII7te1jzdKaIVkm6lurZmL+1fUMSe7Spa8k9Nc0o2ULgHmCxpGWS/lzSnLaDik1T18oyewGnAE8GbgBmAy9vaWXIiHWSNA94H/Aq25u3HU9sejqV3CE1zSibpLlUax8dSlUyXGT7o23GFJumTk2FrGuai6j+YH7QdjwRwyRdCWxBtSjUK+rlMSJa0ameu6SdqXpEh1INqi4CFtv+SauBRQCSnmj7lrbjiICOJfdhqWlGKSQdYfvzQw9nWMtMP5QhYpxOlWVgbE3z3W3GEwFsXX/edr1HRTSoUz33kZrmotQ0oxR5zF6UpmvJPTXNKFaWpI6SdCK5p6YZXZAlqaMkXam5p6YZXTB4zN4H6s+Dp95nSepoXCeSu+1P1zXNX6SmGaUZuqI8jyqZa+jl8i+No5c6s7aM7Xup1u6IKM229cdewJuBRwOPAd4IzG8xrtiEdaLmPpCaZpRM0jeAl9m+q97eFjjH9oHtRhabok6UZYakphklmwPcPbRQLeHSAAAC7klEQVR9NzC3nVBiU9eJ5J6aZnTE2cB3JX2V6vfypcCZ7YYUm6pOJHfWzJJ5ArA38DWqBP8S4LK2gooYZvuD9dPCnlXvOtL21W3GFJuurtXcU9OMiJhAZ2bL1FLTjIiYQFfKMgOpaUZETKBTZRkASXuypqZ5WWqaEREP1LnkHhERU+tazT0iIiaQ5B4R0UNJ7hEtkPR2SQ9tO47or9TcI1og6UfAAtt3tB1L9FN67tEZkl4j6TpJ10o6W9LOki6q910kaU593BmSPilpqaSVkvaX9DlJN0s6Y+h8v5T0UUnfq79/dr3/qZKuqM/7VUnb1/svkfRhSd+V9H1Jz6r3by7pI5KW1d/zxnr/c+rvOVfSLZK+oMqxVKtGLpW0tOG3MTYRSe7RCZKeBBwPHGB7d+BtwCeAs2zvBnwBOGXoW7anWlDuz4CvAx8DngQ8RdJT62O2Br5ne0/gUuCEev9ZwF/U571+aD/ALNtPA94+tP8o4E7be1Mtj/EGSbvUr+1RHzsf2BXYz/YpwO3Ac/NYvpgpSe7RFQcA5w7KGLZ/BuwLfLF+/WzgmUPHf91VzfF64N9tX2/7PuBG1tzVfB/V8tEAnweeKWk74OG2L633nwk8e+i8X6k/XzV0nv8BvEbSNcCVwCOBefVr37W9qm77GnJHdTSka3eoxqZrsLzz+gy//rv6831DXw+21/V7P8kA1OBc9w6dR8Axti8cPlDSc0baHv6eiBmVnnt0xUXAIZIeCSDpEcC3gcPq118FfG
sDz7kZ8PL661cC37J9J/DzQT0deDVVyWZ9LgTeLGmLOrbHS9p6iu+5izwTOGZQehHRCbZvrJ/Edamke4GrgWOBz0l6F7AaOHIDT/sr4EmSrgLuBA6t978W+FQ9VXHlBOf9DFW55XuSVMfyJ1N8z2nABZJ+mrp7zIRMhYxNlqRf2t6m7TgiZkLKMhERPZSee0RED6XnHhHRQ0nuERE9lOQeEdFDSe4RET2U5B4R0UP/H9Ohze3l0TrOAAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "knowledge_per_component.sort_values(ascending=False).head(10).plot.bar();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Discussion** \n", "We see that in the last six months, some developers worked heavily on the SoundWIRE capabilities of Linux. So there might be still the chance to find some author how knows everything about this component. We also see some minor changes in other areas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Top 10 No-Go Areas\n", "\n", "But where are the no-go areas in the Linux kernel project? That means which are the oldest parts of the system (where it is also likely that nobody knows anything all)? For this, we create a new column named `age` and calculate the difference between today and the `timestamp` column." ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sha path author \\\n", "0 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "1 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "2 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "3 7725ccfda5971 drivers/scsi/bfa/bfad_drv.h Jing Huang \n", "4 889d0d42667c9 drivers/scsi/bfa/bfad_drv.h Anil Gurumurthy \n", "\n", " timestamp line knowing component age \n", "0 2015-11-26 08:54:45 1 False drivers:scsi 879 days 12:25:30.819337 \n", "1 2015-11-26 08:54:45 2 False drivers:scsi 879 days 12:25:30.819337 \n", "2 2015-11-26 08:54:45 3 False drivers:scsi 879 days 12:25:30.819337 \n", "3 2009-09-24 00:46:15 4 False drivers:scsi 3133 days 20:34:00.819337 \n", "4 2015-11-26 08:54:45 5 False drivers:scsi 879 days 12:25:30.819337 " ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "git_blame['age'] = pd.Timestamp('now') - git_blame.timestamp\n", "git_blame.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need a little helper method `mean` that calculates the mean timestamp delta for each component (for reasons unknown, it doesn't work with the standard `mean()` method of Pandas)." ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mean count\n", "component \n", "arch:arc 1142 days 23:53:26.745382 311\n", "arch:arm 2874 days 19:59:45.053026 8507\n", "arch:i386 4637 days 17:37:10.991262 6561\n", "arch:ia64 4507 days 16:19:43.812625 298\n", "arch:mips 2803 days 13:41:28.230795 192" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def mean(x):\n", " return x.mean()\n", "\n", "mean_age_per_component = git_blame.groupby('component').age.agg([mean, 'count'])\n", "mean_age_per_component.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, we sort the resulting values and just take the first 10 rows with the oldest mean age." ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mean count\n", "component \n", "arch:sparc64 4744 days 06:10:06.800469 530\n", "arch:i386 4637 days 17:37:10.991262 6561\n", "drivers:usb 4588 days 04:09:13.479819 9163\n", "arch:ia64 4507 days 16:19:43.812625 298\n", "drivers:parisc 4483 days 12:38:23.081460 12765\n", "drivers:sn 4438 days 15:36:58.457533 843\n", "include:asm-arm 4426 days 02:40:34.803833 129\n", "drivers:isdn 4324 days 16:56:44.996032 184917\n", "arch:powerpc 4312 days 13:35:52.304385 1973\n", "drivers:message 4286 days 00:53:10.580111 35594" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top10_no_go_areas = mean_age_per_component.sort_values('mean', ascending=False).head(10)\n", "top10_no_go_areas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Discussion** \n", "The result isn't as bad as it looks at the first glance: The oldest parts of the system are archaic computer architectures like SPARC64 (`arch:sparc64`), x86 (`arch:i386`), Itanium (`arch:ia64`) or PowerPC (`arch:powerpc`). So this is negligible because these architectures are dead anyways. \n", "\n", "We also got lucky with the component `drivers:isdn`, where we have around 184917 lines of rotted code. In the age of glass fiber, we surely don't need any improvements in the [ISDN](https://en.wikipedia.org/wiki/Integrated_Services_Digital_Network) features of Linux.\n", "\n", "Unfortunately, I'm not a Linux expert, so I can't judge the impact of the lost knowledge in the component `drivers:usb` as well as the remaining components." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion\n", "\n", "OK, I hope you've enjoyed this little longer analysis of the Linux Git repository! 
We created a big data set right from the origin and wrangled our way through the Git blame data to finally get some insights into the parts of the Linux kernel where knowledge is most likely lost forever.\n", "\n", "\n", "One remark on the meta-level: Did I exaggerate by taking the complete Git blame log from one of the biggest open-source projects out there? Absolutely! Couldn't I have just exported only the subset of the data that the analysis really needs? For sure! But besides showing you what you can find out with this data set, I wanted to show you that it's absolutely no problem to work with a 3 GB data set like the one we had. You don't need any Big Data tooling or a cluster for the computations. With Pandas, for example, I can run this analysis on my six-year-old notebook. Period!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }