{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "This notebook is a simple mini-tutorial to introduce you to basic functions of Jupyter, Python, Pandas and matplotlib with the aim of analyzing software data. Therefore, the example is chosen in such a way that we come across the typical methods in a data analysis. Have fun!\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The Jupyter Notebook System\n", "First, we'll take a closer look at Jupyter Notebook. What you see here is Jupyter, the interactive notebook environment for programming. We see below a cell in which we can enter Python code. Let's just type in a string called `\"Hello World\"` here. With the key combination `Ctrl` + `Enter` we can execute this cell." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Hello World'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"Hello World\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is immediately visible under the cell. Let's create another cell! This works by pressing the `ESC` key followed by the letter `b`. Alternatively, at the end of a notebook, we can run a cell with `Shift` + `Enter` and create a new cell right away.\n", "\n", "Here we see an important feature of Jupyter: The distinction between command mode (accessible via the `Esc` key) and input mode (accessible via the `Enter` key). In command mode, the border of the current cell is blue. In input mode, the border turns green. Let's go to the command mode and press `m`. This changes the cell type to a markdown cell. Markdown is a simple markup language that can be used to write and format texts. This allows us to directly document the steps we have taken." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# A quick Introduction to Python\n", "Let's take a look at some basic Python programming constructs that we will need later when working with Pandas.\n", "\n", "First, we store our text \"Hello World!\" in a variable called `text` by assigning it with the `=` symbol. We write `text` again on the line below, execute the cell, and see the result displayed under the cell." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Hello World!'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"Hello World!\"\n", "text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using square brackets `[` and `]`, we can access the first letter of our text with a 0-based index (this also works for other sequence types such as lists)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'H'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can output the last letter with `[-1]` as the selector, because negative numbers inside the square brackets index from the end." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'!'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also work with so-called \"slices\". A slice lets us output any range of values from the content of our `text` variable." 
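, "\n", "\n", "As a small illustrative sketch (the list `words` and its contents are made up for this example and are not part of the dataset), indexing and slicing behave the same way on a list as on our string:\n", "\n", "
```python
# Illustrative only: `words` is a hypothetical list, not data from the notebook.
text = 'Hello World!'
words = ['Linux', 'Git', 'Pandas', 'Jupyter']

first_letter = text[0]   # 0-based index on a string
last_word = words[-1]    # a negative index counts from the end of the list
middle = words[1:3]      # slice: start index inclusive, end index exclusive

print(first_letter, last_word, middle)
```
"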
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'llo'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text[2:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The slice `[:-1]` is shorthand for `[0:-1]`: it starts at index 0 and stops before the last element." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Hello World'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text[:-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we take a look at how to reverse a text (or a list). This works with the `::` notation and a step of `-1`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'!dlroW olleH'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text[::-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can explore the further functionality of a library (or of an object stored in a variable) by looking at the methods and attributes of a class or object. When we write the beginning of a command, like `text.` in our string example, we can press the `Tab` key to use Jupyter's integrated autocompletion and see which methods the current object offers. We can then select a method with the `down` arrow key, or narrow the search by typing, for example, the first letters of `upper`. If we press `Enter` on the selected item and then `Shift` + `Tab`, the signature of the method and the first section of its help documentation appear. Pressing `Shift` + `Tab` twice displays the full help. So in our case, by calling `upper()` on our `text` variable, we can have our text written in capital letters." 
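, "\n", "\n", "Outside of the interactive completion, similar information is available programmatically. The following is only a sketch using Python's built-in `dir()`; the variable names `public_methods` and `shouted` are chosen for illustration:\n", "\n", "
```python
text = 'Hello World!'

# dir() lists the attribute names of an object, roughly what Tab completion offers.
# Names starting with an underscore are filtered out to keep the list short.
public_methods = [name for name in dir(text) if not name.startswith('_')]
print('upper' in public_methods)

# Calling the method (note the parentheses) returns the upper-cased text.
shouted = text.upper()
print(shouted)
```
"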
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'HELLO WORLD!'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text.upper()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The interactive documentation also helps us find out which optional arguments a method accepts in addition to its required parameters. This is easy to see when using the `split` method on our `text` variable together with the integrated help functionality." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Hello', 'World!']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text.split(maxsplit=2, sep=\" \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Git history analysis\n", "OK, let's start the analysis!\n", "\n", "In this notebook, we want to take a closer look at the development history of the open source project \"Linux\" based on the history of the corresponding GitHub mirror repository.\n", "\n", "A local clone of the GitHub repository https://github.com/torvalds/linux/ was created by using the command \n", "\n", "```\n", "git clone https://github.com/torvalds/linux.git\n", "```\n", "\n", "The relevant parts of the history for this analysis were produced by using\n", "\n", "```\n", "git log --pretty=\"%ad,%aN\" --no-merges > git_demo_timestamp_linux.csv\n", "```\n", "\n", "This command returned the author date (`%ad`) and the author name (`%aN`) for each commit of the Git repository. The corresponding values are separated by commas. We also indicated that we do not want to receive merge commits (via `--no-merges`). The output was saved in the file `git_demo_timestamp_linux.csv`.\n", "\n", "_Note: For an optimized demo, the headers and the separator have been changed manually in the provided dataset to get through this analysis more easily. 
The differences can be seen at https://www.feststelltaste.de/developers-habits-linux-edition/, which was done with the original dataset._" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting to know Pandas\n", "Pandas is a data analysis tool written in Python (and partly in C), which is ideally suited for the evaluation of tabular data due to the use of effective data structures and built-in statistics functions.\n", "\n", "## Basics\n", "\n", "We import the data from above with the help of Pandas. We import `pandas` with the common abbreviation `pd` using the `import... as..` syntax of Python." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check whether the import of the module really worked by checking the documentation of the `pd` module. To do this, we append the `?` operator to the `pd` variable and execute the cell. The documentation of the module appears in the lower part of the browser window. We can read through this area and make it disappear again with the `ESC` key." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "pd?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We read the compressed CSV file `git_demo_timestamp_linux.gz` with the `read_csv()` method and the parameter `URL` which defines the location of the file. Since the file is a `gzip`-packed file and we get this file over the web, we have to specify the used compression algorithm using `compression='gzip'`. \n", "\n", "The result of our execution is stored in the variable `git_log`. We've just loaded data into a so-called **DataFrame** (something like a programmable Excel worksheet), which in our case consists of two **Series** (= columns). \n", "\n", "We can now perform operations on the DataFrame. For example, we can use `head()` to display the first five entries." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " timestamp author\n", "0 2017-12-31 14:47:43 Linus Torvalds\n", "1 2017-12-31 13:13:56 Linus Torvalds\n", "2 2017-12-31 13:03:05 Linus Torvalds\n", "3 2017-12-31 12:30:34 Linus Torvalds\n", "4 2017-12-31 12:29:02 Linus Torvalds" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "URL = \"https://raw.githubusercontent.com/feststelltaste/software-analytics/master/demos/dataset/git_demo_timestamp_linux.gz\"\n", "git_log = pd.read_csv(URL, compression=\"gzip\")\n", "git_log.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we call `info()` on the `DataFrame` to get some basic data about the read in data." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 723214 entries, 0 to 723213\n", "Data columns (total 2 columns):\n", "timestamp 723214 non-null object\n", "author 723213 non-null object\n", "dtypes: object(2)\n", "memory usage: 11.0+ MB\n" ] } ], "source": [ "git_log.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can access the individual Series / columns by using the `['']` or (in most cases, i.e. as long as the column names do not overlap with the method name offered by the `DataFrame` itself) by directly using the name of the `Series`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Linus Torvalds\n", "1 Linus Torvalds\n", "2 Linus Torvalds\n", "3 Linus Torvalds\n", "4 Linus Torvalds\n", "Name: author, dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "git_log.author.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also perform various operations on a `Series`. 
For example, `value_counts()` counts how often each value occurs in a `Series` and sorts the counts in descending order. The result is again a `Series`, this time holding the sorted counts. Calling `head(10)` on that result is a quick way to display the top 10 values of a `Series`. We store the result in a variable `top10` and display it by writing the variable name on the last line of the cell." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Linus Torvalds 24259\n", "David S. Miller 9563\n", "Mark Brown 6917\n", "Takashi Iwai 6293\n", "Al Viro 6064\n", "H Hartley Sweeten 5942\n", "Ingo Molnar 5462\n", "Mauro Carvalho Chehab 5384\n", "Arnd Bergmann 5305\n", "Greg Kroah-Hartman 4687\n", "Name: author, dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top10 = git_log.author.value_counts().head(10)\n", "top10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First visualizations\n", "Next, we want to plot the result. To display the output of the underlying plotting library `matplotlib` directly in the notebook, we have to execute the magic command\n", "\n", "```\n", "%matplotlib inline\n", "```\n", "\n", "before calling the `plot()` method.\n", "\n", "By default, when `plot()` is called on a `DataFrame` or `Series`, a line chart is created." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "top10.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A line chart doesn't make much sense for this data, so we use `bar()`, a method of the `plot` accessor, to create a bar chart." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "top10.plot.bar()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This data can also be visualized as a pie chart. For this, we call the `pie()` method instead of `bar()`. We can also add a semicolon `;` after the `plot` command to suppress the printed text representation of the returned object. " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "top10.plot.pie();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, the diagram does not look very nice here.\n", "\n", "With a few optional styling parameters, we get a nicer graphic. We use\n", "* `figsize=[5,5]` as size\n", "* `title=\"Top 10 Authors\"` as title\n", "* `label=\"\"` to avoid displaying the superfluous label on the left." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "top10.plot.pie(\n", "    figsize=[5,5],\n", "    title=\"Top 10 Authors\",\n", "    label=\"\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with dates\n", "Now let's look at the timestamp information. We want to find out at what time of day the developers commit." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2017-12-31 14:47:43\n", "1 2017-12-31 13:13:56\n", "2 2017-12-31 13:03:05\n", "3 2017-12-31 12:30:34\n", "4 2017-12-31 12:29:02\n", "Name: timestamp, dtype: object" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "git_log.timestamp.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we can enter the world of time series processing, we must first convert our column with the dates into the appropriate data type. At the moment, the values in our column `timestamp` are still strings, i.e. plain text. We can see this by using the helper function `type()` to display the type of the first entry of the `timestamp` column:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "str" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(git_log.timestamp[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, Pandas also helps us to convert data types. The function `pd.to_datetime` takes a `Series` of dates as its first parameter and converts the values. The return value is a `Series` with values of the data type `Timestamp`. For most textual dates the conversion works automatically, because Pandas recognizes many common date formats. We write the result back into the same column." 
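, "\n", "\n", "Before converting the real column, the behaviour can be tried on two hand-written values. This is only a sketch: the `Series` named `raw` is built by hand from two timestamps shown above, and the names `raw`, `parsed`, `hours` and `weekdays` are illustrative.\n", "\n", "
```python
import pandas as pd

# Two hand-written timestamps in the same textual format as the dataset.
raw = pd.Series(['2017-12-31 14:47:43', '2017-12-31 13:13:56'])

# to_datetime parses the strings into datetime values.
parsed = pd.to_datetime(raw)

# The dt accessor exposes date parts; weekday is 0-based with Monday as 0.
hours = parsed.dt.hour.tolist()
weekdays = parsed.dt.weekday.tolist()

print(parsed.dtype, hours, weekdays)
```
"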
] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " timestamp author\n", "0 2017-12-31 14:47:43 Linus Torvalds\n", "1 2017-12-31 13:13:56 Linus Torvalds\n", "2 2017-12-31 13:03:05 Linus Torvalds\n", "3 2017-12-31 12:30:34 Linus Torvalds\n", "4 2017-12-31 12:29:02 Linus Torvalds" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "git_log.timestamp = pd.to_datetime(git_log.timestamp)\n", "git_log.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To check if the conversion was successful, we can check the first value of our converted column `timestamp_local` by calling `type()` again." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas._libs.tslib.Timestamp" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(git_log.timestamp[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now also access individual parts of the date values. For this, we use the `dt` (\"datetime\") object with its properties like `hour`." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 14\n", "1 13\n", "2 13\n", "3 12\n", "4 12\n", "Name: timestamp, dtype: int64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "git_log.timestamp.dt.hour.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Together with the `value_counts()` method that I've already introduced above, we can now count values again after their occurrence. However, it is important that we also set the parameter `sort=False` to avoid sorting according to keep the order of the hours." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 19533\n", "1 15044\n", "2 10420\n", "3 7000\n", "4 6068\n", "Name: timestamp, dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "commits_per_hour = git_log.timestamp.dt.hour.value_counts(sort=False)\n", "commits_per_hour.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can display the result as a bar chart and thus get an overview of how many commits occurred in each hour." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "commits_per_hour.plot.bar();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now additionally label the plot. To do this, we store the return object of the `bar()` function in the variable `ax`. This is an `Axes` object of the underlying plotting library `matplotlib`, through which we can customize additional properties of the plot. Here we set\n", "\n", "* the title via `set_title(\"Title\")`\n", "* the label of the X-axis with `set_xlabel(\"X-axis label\")`\n", "* the label of the Y-axis with `set_ylabel(\"Y-axis label\")`\n", "\n", "The result is a more meaningful, labeled bar chart." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0,0.5,'Number of Commits')" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ax = commits_per_hour.plot.bar()\n", "ax.set_title(\"Commits per Hour\")\n", "ax.set_xlabel(\"Hour of Day\")\n", "ax.set_ylabel(\"Number of Commits\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also analyze the commits per weekday. 
To do this, we use the `weekday` attribute of the datetime attribute `dt`. The values here are 0-based with Monday as the first day of the week. As usual, we count the values using `value_counts` and do not sort the values by size but keep the sorting by weekday." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 124296\n", "1 131690\n", "2 131019\n", "3 127097\n", "4 117635\n", "5 44877\n", "6 46600\n", "Name: timestamp, dtype: int64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "commits_per_weekday = git_log.timestamp.dt.weekday.value_counts(sort=False)\n", "commits_per_weekday" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result in `commits_per_weekday` can be output as a bar chart using `plot.bar()`." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAY0AAAD4CAYAAAAQP7oXAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAE6xJREFUeJzt3X+s3fV93/HnK3ZgpF2AhBuW2qymiteEkP4Ay7BlqzLowCRRjCbQzKpiZaRWI1jTqdoC6yRvSZgSrRsNEmGyYicmyuIw1gqvdeJakKzLFgiXwADjUN9BCnf8upkdkpYmxMl7f5yPu9PLte+He3w51+H5kI7O9/v+fr6f8z7Cuq/z/XEOqSokSerxqnE3IEk6fhgakqRuhoYkqZuhIUnqZmhIkroZGpKkboaGJKmboSFJ6mZoSJK6LR93A8faaaedVqtWrRp3G5J0XLn33nu/VVUT8437sQuNVatWMTk5Oe42JOm4kuRPe8Z5ekqS1M3QkCR1MzQkSd0MDUlSN0NDktTN0JAkdTM0JEndDA1JUrcfuy/3veL865MXef7nFnd+SccVQ0Nj9bbtb1u0uR/c+OCizS29Unl6SpLUzdCQJHUzNCRJ3QwNSVI3Q0OS1M3QkCR185ZbaYH2vfktizr/W76xb1HnlxbCIw1JUjdDQ5LUzdCQJHV7xV/TWHXtHy7q/N/86LsWdX5Jejl5pCFJ6mZoSJK6zRsaSbYleTbJQ0O1f5fkG0keSPL7SU4Z2nZdkqkkjyS5eKi+rtWmklw7VD8zyd1J9if5fJITWv3Etj7Vtq86Vm9akrQwPUcanwbWzartAc6uqp8D/gS4DiDJWcAG4K1tn08kWZZkGXATcAlwFnBFGwvwMeCGqloNHASuavWrgINV9SbghjZOkjRG84ZGVf0xcGBW7Y+q6lBbvQtY2ZbXAzuq6vtV9RgwBaxtj6mqerSqXgB2AOuTBLgAuK3tvx24dGiu7W35NuDCNl6SNCbH4prGPwG+0JZXAE8MbZtutSPVXw98eyiADt
f/ylxt+3NtvCRpTEYKjSS/DRwCPnu4NMewWkD9aHPN1cemJJNJJmdmZo7etCRpwRYcGkk2Au8GfqWqDv8xnwbOGBq2EnjyKPVvAackWT6r/lfmattPZtZpssOqaktVramqNRMTEwt9S5KkeSwoNJKsAz4IvKeqnh/atBPY0O58OhNYDXwNuAdY3e6UOoHBxfKdLWy+BFzW9t8I3D4018a2fBlw51A4SZLGYN5vhCf5HPAO4LQk08BmBndLnQjsadem76qqX6+qvUluBR5mcNrq6qr6YZvnGmA3sAzYVlV720t8ENiR5CPAfcDWVt8KfCbJFIMjjA3H4P1Kam769TsXdf6r/+MFizq/xmPe0KiqK+Yob52jdnj89cD1c9R3AbvmqD/K4O6q2fXvAZfP158k6eXjN8IlSd0MDUlSN0NDktTN0JAkdTM0JEndDA1JUjdDQ5LUzdCQJHUzNCRJ3QwNSVI3Q0OS1M3QkCR1MzQkSd0MDUlSN0NDktTN0JAkdTM0JEndDA1JUjdDQ5LUzdCQJHUzNCRJ3QwNSVI3Q0OS1G3e0EiyLcmzSR4aqr0uyZ4k+9vzqa2eJDcmmUryQJJzhvbZ2MbvT7JxqH5ukgfbPjcmydFeQ5I0Pj1HGp8G1s2qXQvcUVWrgTvaOsAlwOr22ATcDIMAADYD5wFrgc1DIXBzG3t4v3XzvIYkaUzmDY2q+mPgwKzyemB7W94OXDpUv6UG7gJOSfJG4GJgT1UdqKqDwB5gXdv22qr6alUVcMusueZ6DUnSmCz0msbpVfUUQHt+Q6uvAJ4YGjfdakerT89RP9prvEiSTUkmk0zOzMws8C1JkuZzrC+EZ45aLaD+klTVlqpaU1VrJiYmXurukqROCw2NZ9qpJdrzs60+DZwxNG4l8OQ89ZVz1I/2GpKkMVloaOwEDt8BtRG4fah+ZbuL6nzguXZqaTdwUZJT2wXwi4Ddbdt3k5zf7pq6ctZcc72GJGlMls83IMnngHcApyWZZnAX1EeBW5NcBTwOXN6G7wLeCUwBzwPvBaiqA0k+DNzTxn2oqg5fXH8/gzu0TgK+0B4c5TUkSWMyb2hU1RVH2HThHGMLuPoI82wDts1RnwTOnqP+f+d6DUnS+PiNcElSN0NDktTN0JAkdTM0JEndDA1JUjdDQ5LUzdCQJHUzNCRJ3QwNSVI3Q0OS1M3QkCR1MzQkSd0MDUlSN0NDktTN0JAkdTM0JEndDA1JUjdDQ5LUzdCQJHUzNCRJ3QwNSVI3Q0OS1G2k0Ejyz5LsTfJQks8l+WtJzkxyd5L9ST6f5IQ29sS2PtW2rxqa57pWfyTJxUP1da02leTaUXqVJI1uwaGRZAXwG8CaqjobWAZsAD4G3FBVq4GDwFVtl6uAg1X1JuCGNo4kZ7X93gqsAz6RZFmSZcBNwCXAWcAVbawkaUxGPT21HDgpyXLgNcBTwAXAbW37duDStry+rdO2X5gkrb6jqr5fVY8BU8Da9piqqker6gVgRxsrSRqTBYdGVf0f4HeAxxmExXPAvcC3q+pQGzYNrGjLK4An2r6H2vjXD9dn7XOkuiRpTEY5PXUqg0/+ZwI/BfwEg1NJs9XhXY6w7aXW5+plU5LJJJMzMzPztS5JWqBRTk/9MvBYVc1U1Q+A3wP+DnBKO10FsBJ4si1PA2cAtO0nAweG67P2OVL9RapqS1Wtqao1ExMTI7wlSdLRjBIajwPnJ3lNuzZxIfAw8CXgsjZmI3B7W97Z1mnb76yqavUN7e6qM4HVwNeAe4DV7W6sExhcLN85Qr+SpBEtn3/I3Krq7iS3AV8HDgH3AVuAPwR2JPlIq21tu2wFPpNkisERxoY2z94ktzIInEPA1VX1Q4Ak1wC7GdyZta2q9i60X0nS6BYcGgBVtRnYPKv8KIM7n2aP/R5w+RHmuR64fo76LmDXKD1Kko4dvxEuSepmaEiSuhkakqRuhoYkqZuhIUnqZmhIkroZGpKkboaGJKmboSFJ6mZoSJ
K6GRqSpG6GhiSpm6EhSepmaEiSuhkakqRuhoYkqZuhIUnqZmhIkroZGpKkboaGJKmboSFJ6mZoSJK6GRqSpG4jhUaSU5LcluQbSfYl+dtJXpdkT5L97fnUNjZJbkwyleSBJOcMzbOxjd+fZONQ/dwkD7Z9bkySUfqVJI1m1CONjwNfrKo3Az8P7AOuBe6oqtXAHW0d4BJgdXtsAm4GSPI6YDNwHrAW2Hw4aNqYTUP7rRuxX0nSCBYcGkleC/wSsBWgql6oqm8D64Htbdh24NK2vB64pQbuAk5J8kbgYmBPVR2oqoPAHmBd2/baqvpqVRVwy9BckqQxGOVI42eAGeBTSe5L8skkPwGcXlVPAbTnN7TxK4AnhvafbrWj1afnqL9Ikk1JJpNMzszMjPCWJElHM0poLAfOAW6uql8E/pz/fypqLnNdj6gF1F9crNpSVWuqas3ExMTRu5YkLdgooTENTFfV3W39NgYh8kw7tUR7fnZo/BlD+68EnpynvnKOuiRpTBYcGlX1NPBEkp9tpQuBh4GdwOE7oDYCt7flncCV7S6q84Hn2umr3cBFSU5tF8AvAna3bd9Ncn67a+rKobkkSWOwfMT9/ynw2SQnAI8C72UQRLcmuQp4HLi8jd0FvBOYAp5vY6mqA0k+DNzTxn2oqg605fcDnwZOAr7QHpKkMRkpNKrqfmDNHJsunGNsAVcfYZ5twLY56pPA2aP0KEk6dvxGuCSpm6EhSepmaEiSuhkakqRuhoYkqZuhIUnqZmhIkroZGpKkboaGJKmboSFJ6mZoSJK6GRqSpG6GhiSpm6EhSepmaEiSuo36P2GSJC3A9LX/fVHnX/nRv7co83qkIUnqZmhIkrp5ekrScenf/6N3L+r8v/X5P1jU+Y9XHmlIkroZGpKkboaGJKnbyKGRZFmS+5L8QVs/M8ndSfYn+XySE1r9xLY+1bavGprjulZ/JMnFQ/V1rTaV5NpRe5UkjeZYHGl8ANg3tP4x4IaqWg0cBK5q9auAg1X1JuCGNo4kZwEbgLcC64BPtCBaBtwEXAKcBVzRxkqSxmSk0EiyEngX8Mm2HuAC4LY2ZDtwaVte39Zp2y9s49cDO6rq+1X1GDAFrG2Pqap6tKpeAHa0sZKkMRn1SON3gX8B/Kitvx74dlUdauvTwIq2vAJ4AqBtf66N/8v6rH2OVJckjcmCQyPJu4Fnq+re4fIcQ2uebS+1Plcvm5JMJpmcmZk5SteSpFGMcqTxduA9Sb7J4NTRBQyOPE5JcvhLgyuBJ9vyNHAGQNt+MnBguD5rnyPVX6SqtlTVmqpaMzExMcJbkiQdzYJDo6quq6qVVbWKwYXsO6vqV4AvAZe1YRuB29vyzrZO235nVVWrb2h3V50JrAa+BtwDrG53Y53QXmPnQvuVJI1uMX5G5IPAjiQfAe4Dtrb6VuAzSaYYHGFsAKiqvUluBR4GDgFXV9UPAZJcA+wGlgHbqmrvIvQrSep0TEKjqr4MfLktP8rgzqfZY74HXH6E/a8Hrp+jvgvYdSx6lCSNzm+ES5K6GRqSpG6GhiSpm6EhSepmaEiSuhkakqRuhoYkqZuhIUnqZmhIkroZGpKkboaGJKmboSFJ6mZoSJK6GRqSpG6GhiSpm6EhSepmaEiSuhkakqRuhoYkqZuhIUnqZmhIkroZGpKkboaGJKnbgkMjyRlJvpRkX5K9ST7Q6q9LsifJ/vZ8aqsnyY1JppI8kOScobk2tvH7k2wcqp+b5MG2z41JMsqblSSNZpQjjUPAb1XVW4DzgauTnAVcC9xRVauBO9o6wCXA6vbYBNwMg5ABNgPnAWuBzYeDpo3ZNLTfuhH6lSSNaMGhUVVPVdXX2/J3gX3ACmA9sL0N2w5c2pbXA7fUwF3AKUneCFwM7KmqA1V1ENgDrGvbXltVX62qAm4ZmkuSNAbH5JpGklXALwJ3A6dX1VMwCBbgDW3YCuCJod2mW+1o9ek56nO9/qYkk0kmZ2ZmRn07kqQjGD
k0kvwk8F+A36yq7xxt6By1WkD9xcWqLVW1pqrWTExMzNeyJGmBRgqNJK9mEBifrarfa+Vn2qkl2vOzrT4NnDG0+0rgyXnqK+eoS5LGZJS7pwJsBfZV1X8Y2rQTOHwH1Ebg9qH6le0uqvOB59rpq93ARUlObRfALwJ2t23fTXJ+e60rh+aSJI3B8hH2fTvwq8CDSe5vtX8JfBS4NclVwOPA5W3bLuCdwBTwPPBegKo6kOTDwD1t3Ieq6kBbfj/waeAk4AvtIUkakwWHRlV9hbmvOwBcOMf4Aq4+wlzbgG1z1CeBsxfaoyTp2PIb4ZKkboaGJKmboSFJ6mZoSJK6GRqSpG6GhiSpm6EhSepmaEiSuhkakqRuhoYkqZuhIUnqZmhIkroZGpKkboaGJKmboSFJ6mZoSJK6GRqSpG6GhiSpm6EhSepmaEiSuhkakqRuhoYkqduSD40k65I8kmQqybXj7keSXsmWdGgkWQbcBFwCnAVckeSs8XYlSa9cSzo0gLXAVFU9WlUvADuA9WPuSZJesVJV4+7hiJJcBqyrqve19V8Fzquqa2aN2wRsaqs/CzyyiG2dBnxrEedfbPY/Psdz72D/47bY/f90VU3MN2j5IjZwLGSO2otSrqq2AFsWvx1IMllVa16O11oM9j8+x3PvYP/jtlT6X+qnp6aBM4bWVwJPjqkXSXrFW+qhcQ+wOsmZSU4ANgA7x9yTJL1iLenTU1V1KMk1wG5gGbCtqvaOua2X5TTYIrL/8Tmeewf7H7cl0f+SvhAuSVpalvrpKUnSEmJoSJK6GRqSpG5L+kL4uCV5M4NvoK9g8P2QJ4GdVbVvrI3puJBkLVBVdU/7+Zt1wDeqateYW1uQJLdU1ZXj7kPj5YXwI0jyQeAKBj9dMt3KKxnc9rujqj46rt5eKVporwDurqo/G6qvq6ovjq+z+SXZzOA305YDe4DzgC8Dvwzsrqrrx9fd/JLMvrU9wN8H7gSoqve87E2NIMnfZfCzRA9V1R+Nu5/5JDkP2FdV30lyEnAtcA7wMPBvq+q5sfVmaMwtyZ8Ab62qH8yqnwDsrarV4+ns2Ejy3qr61Lj7OJIkvwFcDewDfgH4QFXd3rZ9varOGWd/80nyIIO+TwSeBlYO/QG4u6p+bqwNziPJ1xn8gfokg6PsAJ9j8KGJqvpv4+tufkm+VlVr2/KvMfi39PvARcB/Xeof+pLsBX6+fe1gC/A8cBtwYav/w3H15umpI/sR8FPAn86qv7FtO979G2DJhgbwa8C5VfVnSVYBtyVZVVUfZ+6fl1lqDlXVD4Hnk/zvqvoOQFX9RZLj4d/PGuADwG8D/7yq7k/yF0s9LIa8emh5E/APqmomye8AdwFLOjSAV1XVoba8ZuhD0leS3D+upsDQOJrfBO5Ish94otX+JvAm4Joj7rWEJHngSJuA01/OXhZg2eFTUlX1zSTvYBAcP83xERovJHlNVT0PnHu4mORkjoMPHVX1I+CGJP+5PT/D8fX34lVJTmVws0+qagagqv48yaGj77okPDR0NuB/JVlTVZNJ/hbwg/l2XkzH0z+Cl1VVfbH9B1rL4Lx6GFzbuKd9gjwenA5cDBycVQ/wP1/+dl6Sp5P8QlXdD9COON4NbAPeNt7WuvxSVX0f/vIP8GGvBjaOp6WXrqqmgcuTvAv4zrj7eQlOBu5l8G+9kvyNqno6yU9yfHzoeB/w8ST/isEv2341yRMMPsC+b5yNeU3jx1iSrcCnquorc2z7T1X1j8fQVpckKxmc4nl6jm1vr6r/MYa2dJxL8hrg9Kp6bNy99Ejy14GfYfABf7qqnhlzS4aGJKmfX+6TJHUzNCRJ3QwNSVI3Q0OS1O3/AevW/zlWd4B6AAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ 
"commits_per_weekday.plot.bar();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Displaying the commit history\n", "In the following, we want to see the progress of the number of commits over the last years by using a `DatetimeIndex` based DataFrame. To do this, we set the `timestamp` column as index using `set_index('')`. Furthermore, we select just the `author` column. Thus we work continuously on a pure `Series` instead of a `DataFrame`. \n", "\n", "Side note: The usage of a `Series` is almost similar to a `DataFrame` with regard to the statistical functions. However, a `Series` is not beautifully formatted in a table, which is why I personally prefer using a `DataFrame`." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "timestamp\n", "2017-12-31 14:47:43 Linus Torvalds\n", "2017-12-31 13:13:56 Linus Torvalds\n", "2017-12-31 13:03:05 Linus Torvalds\n", "2017-12-31 12:30:34 Linus Torvalds\n", "2017-12-31 12:29:02 Linus Torvalds\n", "Name: author, dtype: object" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "git_timed = git_log.set_index('timestamp').author\n", "git_timed.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the `resample(\"