# Introduction
There are multiple reasons for analyzing a version control system like your Git repository. See, for example, Adam Tornhill's book ["Your Code as a Crime Scene"](https://pragprog.com/book/atcrime/your-code-as-a-crime-scene) or his upcoming book ["Software Design X-Rays"](http://www.adamtornhill.com/swevolution/reviewersprogress.html) for plenty of inspiration.

You can 
- analyze knowledge islands
- distinguish often changing code from stable code parts
- identify code that is temporally coupled to other code

Having the necessary data for those analyses in a [Pandas](http://pandas.pydata.org/) <tt>DataFrame</tt> gives you many possibilities to quickly gain insights into the evolution of your software system in various ways.

# The idea

In a preceding [blog post](https://www.feststelltaste.de/reading-a-git-log-file-output-with-pandas/), I showed you a way to read a Git log file with Pandas' DataFrame and [GitPython](https://gitpython.readthedocs.io/en/stable/). Looking back, this was really complicated and tedious. So, with a few tricks we can do it much better this time:

- We use GitPython's feature to directly access the underlying Git installation. This is much faster than using GitPython's object representation of the commits and makes it possible to have everything we need in one notebook.
- We read the data in memory by using StringIO. This avoids storing the Git output on disk and reading it back from disk again, which makes this method faster, too.
- We also exploit Pandas' <tt>read_csv</tt> method even more.  This makes the transformation of the Git log into a <tt>DataFrame</tt> as easy as pie.

# Exporting the Git repo's history
The first step is to connect GitPython with the Git repo. If we have an instance of the repo, we can gain access to the underlying Git installation of the operating system via <tt>repo.git</tt>.

In our case, we tap the [Spring PetClinic repo](https://github.com/spring-projects/spring-petclinic), a small sample application for the Spring framework (I also analyzed the huge [Linux repo](https://github.com/torvalds/linux/), which works as well).

In [1]:
import git 

GIT_REPO_PATH = r'../../spring-petclinic/'
repo = git.Repo(GIT_REPO_PATH)
git_bin = repo.git
git_bin

<git.cmd.Git at 0x24a61ce8ee8>

With the <tt>git_bin</tt>, we can execute almost any Git command we like directly. In our hypothetical use case, we want to retrieve some information about the change frequency of files. For this, we need the complete history of the Git repo including statistics for the changed files (via <tt>--numstat</tt>).

We use a little trick to make sure that the format of the file statistics fits nicely with the commit's metadata (SHA <tt>%h</tt>, UNIX timestamp <tt>%at</tt> and author's name <tt>%aN</tt>). The <tt>--numstat</tt> option provides the additions and deletions for each affected file name in one line &ndash; separated by the tab character <tt>\t</tt>:  
<p>
<tt>1<b>\t</b>1<b>\t</b>some/file/name.ext</tt>
</p>

We use the same tab separator <tt>\t</tt> for the format string:
<p>
<tt>%h<b>\t</b>%at<b>\t</b>%aN</tt>
</p>

And here is the trick: Additionally, we put as many tab characters as a file statistics line contains (two) plus one additional tab in front of the format string. This pretends that there is empty file statistics information in front of each commit meta data string.

The result looks like this:

<p>
<tt>\t\t\t%h\t%at\t%aN</tt>
</p>
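
To see why the padding helps, here is a minimal sketch in plain Python that splits both kinds of lines into the same six-column layout. The helper `to_record` and the column list are only illustrative assumptions; the sample values are taken from the output shown in this post.

```python
# Column layout that both kinds of lines should fit into.
columns = ["additions", "deletions", "filename", "sha", "timestamp", "author"]

# A file statistics line as printed by --numstat.
numstat_line = "2\t3\tpom.xml"

# A commit line produced by the padded format string "\t\t\t%h\t%at\t%aN".
commit_line = "\t\t\t101c9dc\t1498817227\tDave Syer"


def to_record(line):
    """Split a tab-separated line and pad it to the full column layout."""
    fields = line.split("\t")
    fields += [""] * (len(columns) - len(fields))
    return dict(zip(columns, fields))


numstat_record = to_record(numstat_line)
commit_record = to_record(commit_line)

print(numstat_record)
print(commit_record)
```

The three leading tabs leave the first three columns of the commit line empty, so the commit data lands in the last three columns, while the file statistics line fills only the first three.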

Note: If you want to export the Git log on the command line into a file, you need to use the horizontal tab <tt>%x09</tt> as separator instead of <tt>\t</tt> in the format string. Otherwise, the trick doesn't work (I'll show the corresponding format string at the end of this article).
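
In a Git format string, <tt>%x09</tt> stands for the byte with the hexadecimal code 09, which is the horizontal tab. As a small sketch, the command line variant of the format string can be derived from the Python one by a simple replacement (the variable names here are just for illustration):

```python
# Format string as used from Python, with literal tab characters.
python_format = "\t\t\t%h\t%at\t%aN"

# For the command line, each tab is written as Git's hex placeholder %x09.
shell_format = python_format.replace("\t", "%x09")

# The resulting command exports the log into a file named git.log.
command = f'git log --numstat --pretty=format:"{shell_format}" > git.log'
print(command)
```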


OK, let's execute the Git log export:

In [2]:
# The arguments are passed as a list, so no shell quoting is needed
git_log = git_bin.execute(['git', 'log', '--numstat', '--pretty=format:\t\t\t%h\t%at\t%aN'])
git_log[:80]

'\t\t\t101c9dc\t1498817227\tDave Syer\n2\t3\tpom.xml\n\n\t\t\tffa967c\t1492026060\tAntoine Rey\n1'

# Reading the Git log
We now have the complete file history in the <tt>git_log</tt> variable. Don't let all the <tt>\t</tt> characters confuse you.

Let's read the result into a Pandas <tt>DataFrame</tt> by using the <tt>read_csv</tt> method. Because we don't have a file path to CSV data, we use StringIO to read our in-memory buffered content.

Pandas reads the first line of the tab-separated "file", sees the many tab-separated columns and parses all other lines in the same column layout. Additionally, we set <tt>header</tt> to <tt>None</tt> because the data has no header row, and we provide nice names for all the columns that we read in.

In [3]:
import pandas as pd
from io import StringIO

commits_raw = pd.read_csv(StringIO(git_log), 
    sep="\t",
    header=None,              
    names=['additions', 'deletions', 'filename', 'sha', 'timestamp', 'author']
    )
commits_raw.head()

Unnamed: 0,additions,deletions,filename,sha,timestamp,author
0,,,,101c9dc,1498817000.0,Dave Syer
1,2.0,3.0,pom.xml,,,
2,,,,ffa967c,1492026000.0,Antoine Rey
3,1.0,1.0,readme.md,,,
4,,,,fd1c742,1488785000.0,Antoine Rey


Now we have two different kinds of content for the rows:
- The commit meta data without file statistics (see rows with the indexes 0, 2 and 4 above)
- The file statistics without the commit meta data (see rows with the indexes 1 and 3 above)

But we are interested in the commit meta data for each file's statistics. For this, we forward fill the empty commit meta data entries of the file statistics rows with the preceding commit's meta data via the <tt>DataFrame</tt>'s <tt>ffill</tt> method and <tt>join</tt> this data with the existing columns of the file statistics.

In [4]:
commits = commits_raw[['additions', 'deletions', 'filename']]\
            .join(commits_raw[['sha', 'timestamp', 'author']].ffill())
commits.head()

Unnamed: 0,additions,deletions,filename,sha,timestamp,author
0,,,,101c9dc,1498817000.0,Dave Syer
1,2.0,3.0,pom.xml,101c9dc,1498817000.0,Dave Syer
2,,,,ffa967c,1492026000.0,Antoine Rey
3,1.0,1.0,readme.md,ffa967c,1492026000.0,Antoine Rey
4,,,,fd1c742,1488785000.0,Antoine Rey


This gives us the commit meta data for each file change!
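
If you want to try the forward fill and join without a Git repository at hand, the following sketch runs the same steps on a small in-memory sample. The sample text only mimics the layout of the Git log output shown above, and <tt>ffill()</tt> is used here as the shorthand for forward filling.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample in the same layout as the Git log output above.
sample_log = (
    "\t\t\t101c9dc\t1498817227\tDave Syer\n"
    "2\t3\tpom.xml\n"
    "\n"
    "\t\t\tffa967c\t1492026060\tAntoine Rey\n"
    "1\t1\treadme.md\n"
)

raw = pd.read_csv(
    StringIO(sample_log),
    sep="\t",
    header=None,
    names=["additions", "deletions", "filename", "sha", "timestamp", "author"],
)

# Forward fill only the commit columns, then attach them to the file statistics.
meta = raw[["sha", "timestamp", "author"]].ffill()
filled = raw[["additions", "deletions", "filename"]].join(meta)

# Keep only the rows that carry file statistics.
result = filled.dropna()
print(result)
```

Only the commit columns are forward filled. Filling the whole <tt>DataFrame</tt> instead would also copy file statistics into the commit rows, and those rows could then no longer be told apart from real file changes.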

Because we aren't interested in the pure commit meta data anymore, we use <tt>dropna</tt> to drop all rows that don't contain file statistics, i.e. rows that contain null values.

In [5]:
commits = commits.dropna()
commits.head()

Unnamed: 0,additions,deletions,filename,sha,timestamp,author
1,2,3,pom.xml,101c9dc,1498817000.0,Dave Syer
3,1,1,readme.md,ffa967c,1492026000.0,Antoine Rey
5,1,0,pom.xml,fd1c742,1488785000.0,Antoine Rey
8,1,1,pom.xml,75912a0,1487331000.0,Stephane Nicoll
9,11,9,src/main/java/org/springframework/samples/petc...,75912a0,1487331000.0,Stephane Nicoll


And that's it! We are finished!
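
Depending on the analysis, you might want proper data types as an optional follow-up step. The sketch below uses hypothetical rows in the shape of the <tt>DataFrame</tt> above, with the file <tt>logo.png</tt> made up for illustration. It assumes that binary files can occur, for which Git prints <tt>-</tt> instead of numbers in the <tt>--numstat</tt> output.

```python
import pandas as pd

# Hypothetical rows in the shape of the cleaned-up DataFrame above.
# "logo.png" is a made-up binary file: Git reports "-" for its line counts.
commits_sample = pd.DataFrame({
    "additions": ["2", "-", "11"],
    "deletions": ["3", "-", "9"],
    "filename": ["pom.xml", "logo.png", "readme.md"],
    "sha": ["101c9dc", "101c9dc", "75912a0"],
    "timestamp": [1498817227.0, 1498817227.0, 1487331000.0],
    "author": ["Dave Syer", "Dave Syer", "Stephane Nicoll"],
})

typed = commits_sample.copy()

# Non-numeric entries such as "-" become NaN instead of raising an error.
typed["additions"] = pd.to_numeric(typed["additions"], errors="coerce")
typed["deletions"] = pd.to_numeric(typed["deletions"], errors="coerce")

# The %at placeholder yields seconds since the UNIX epoch.
typed["timestamp"] = pd.to_datetime(typed["timestamp"], unit="s")

print(typed.dtypes)
```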

In summary, you just need a "one-liner" to convert a Git log file that was exported with
```
git log --numstat --pretty=format:"%x09%x09%x09%h%x09%at%x09%aN" > git.log
```
and then read into a <tt>DataFrame</tt>:

In [6]:
# reading
git_log = pd.read_csv(
    "../../spring-petclinic/git.log",
    sep="\t", 
    header=None,
    names=[
        'additions', 
        'deletions', 
        'filename', 
        'sha', 
        'timestamp', 
        'author'])

# converting in "one line"
git_log[['additions', 'deletions', 'filename']]\
    .join(git_log[['sha', 'timestamp', 'author']]\
    .ffill())\
    .dropna().head()

Unnamed: 0,additions,deletions,filename,sha,timestamp,author
1,2,3,pom.xml,101c9dc,1498817000.0,Dave Syer
3,1,1,readme.md,ffa967c,1492026000.0,Antoine Rey
5,1,0,pom.xml,fd1c742,1488785000.0,Antoine Rey
8,1,1,pom.xml,75912a0,1487331000.0,Stephane Nicoll
9,11,9,src/main/java/org/springframework/samples/petc...,75912a0,1487331000.0,Stephane Nicoll


# Summary
In this notebook, I showed you how you can read a Git log output in only one line by using Pandas' <tt>read_csv</tt> method. This is a very handy method and a good starting point for further analyses! 
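
As one possible next step, here is a rough sketch of the change frequency analysis mentioned earlier. It uses a few made-up rows instead of the real <tt>commits</tt> <tt>DataFrame</tt>; the file name <tt>src/Main.java</tt> is purely hypothetical.

```python
import pandas as pd

# Hypothetical file change rows; in practice this would be the commits DataFrame.
changes = pd.DataFrame({
    "filename": ["pom.xml", "readme.md", "pom.xml", "pom.xml", "src/Main.java"],
    "sha": ["101c9dc", "ffa967c", "fd1c742", "75912a0", "75912a0"],
})

# Each row is one file change in one commit, so counting the rows per file
# gives the number of commits that touched it.
change_frequency = changes["filename"].value_counts()
print(change_frequency.head(3))
```

Files at the top of this list change most often and are therefore candidates for a closer look.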