{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "c2e32c39", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T18:06:33.787598Z", "start_time": "2023-05-01T18:06:29.137044Z" } }, "outputs": [], "source": [ "!git clone https://github.com/gotec/git2net-tutorials\n", "import os\n", "os.chdir('git2net-tutorials')\n", "!pip install -r requirements.txt\n", "os.chdir('..')\n", "!git clone https://github.com/gotec/git2net git2net4analysis" ] }, { "cell_type": "code", "execution_count": null, "id": "156945bb", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T18:06:35.885944Z", "start_time": "2023-05-01T18:06:33.792821Z" } }, "outputs": [], "source": [ "import git2net\n", "import os\n", "import sqlite3\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "id": "26cd1e5f", "metadata": {}, "source": [ "## Complexity computation\n", "\n", "In this final tutorial, we will take a look at how you can use `git2net` to compute the complexity of files in a repository, and proxy the productivity of developers contributing to it.\n", "\n", "As before, we start by mining a new database for which we also disambiguate the authors." ] }, { "cell_type": "code", "execution_count": null, "id": "732e193e", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T18:06:50.521060Z", "start_time": "2023-05-01T18:06:35.887724Z" } }, "outputs": [], "source": [ "# We assume a clone of git2net's repository exists in the folder below following the first tutorial.\n", "git_repo_dir = 'git2net4analysis'\n", "\n", "# Here, we specify the database in which we will store the results of the mining process.\n", "sqlite_db_file = 'git2net4analysis.db'\n", "\n", "# Remove database if exists\n", "if os.path.exists(sqlite_db_file):\n", " os.remove(sqlite_db_file)\n", " \n", "git2net.mine_git_repo(git_repo_dir, sqlite_db_file)\n", "git2net.disambiguate_aliases_db(sqlite_db_file)" ] }, { "cell_type": "markdown", "id": "32ac6f35", "metadata": {}, "source": [ "By calling the function `git2net.compute_complexity()` we can now create a new table `complexity` in our database which contains the complexity of all modified files for all commits.\n", "Let's create this table and have a look at what it contains." ] }, { "cell_type": "code", "execution_count": null, "id": "207c6290", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T18:07:06.560823Z", "start_time": "2023-05-01T18:06:50.525104Z" } }, "outputs": [], "source": [ "git2net.compute_complexity(git_repo_dir, sqlite_db_file)\n", "\n", "with sqlite3.connect(sqlite_db_file) as con:\n", " complexity = pd.read_sql(\"SELECT * FROM complexity\", con)\n", " \n", "complexity.head()" ] }, { "cell_type": "markdown", "id": "d6259896", "metadata": {}, "source": [ "As we can see, the table contains information about the commit/file pair (i.e., `commit_hash`, `old_path`, `new_path`).\n", "It further contains the number of edit events (`events`)—i.e. 
, { "cell_type": "markdown", "id": "d6259896", "metadata": {}, "source": [ "As we can see, the table contains information about the commit/file pair (i.e., `commit_hash`, `old_path`, `new_path`).\n", "It further contains the number of edit events (`events`) that the commit made in a given file, i.e., additions, deletions, and replacements, as well as the total Levenshtein edit distance (`levenshtein_distance`) of these edits.\n", "Finally, the table contains the Halstead effort (`HE`), the cyclomatic complexity (`CCN`), the number of lines of code (`NLOC`), the number of tokens (`TOK`), and the number of functions (`FUN`) in all modified files before (`*_pre`) and after (`*_post`) each commit.\n", "As we show in [this publication](https://arxiv.org/abs/2201.04588), we can use the absolute value of the change in complexity (`*_delta`) as a proxy for the productivity of developers in Open Source software projects.\n", "\n", "Let's look at the changes in complexity for the `git2net` project.\n", "To do this, we first create a dataframe listing the absolute changes in complexity for all contributors to `git2net` over time." ] },
{ "cell_type": "code", "execution_count": null, "id": "d4444d7d", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T18:07:06.621028Z", "start_time": "2023-05-01T18:07:06.565368Z" } }, "outputs": [], "source": [ "with sqlite3.connect(sqlite_db_file) as con:\n", "    complexity = pd.read_sql(\"\"\"SELECT commit_hash,\n", "                                       events,\n", "                                       levenshtein_distance,\n", "                                       HE_delta,\n", "                                       CCN_delta,\n", "                                       NLOC_delta,\n", "                                       TOK_delta,\n", "                                       FUN_delta\n", "                                FROM complexity\"\"\", con)\n", "\n", "# We compute the absolute differences.\n", "complexity['HE_absdelta'] = np.abs(complexity.HE_delta)\n", "complexity['CCN_absdelta'] = np.abs(complexity.CCN_delta)\n", "complexity['NLOC_absdelta'] = np.abs(complexity.NLOC_delta)\n", "complexity['TOK_absdelta'] = np.abs(complexity.TOK_delta)\n", "complexity['FUN_absdelta'] = np.abs(complexity.FUN_delta)\n", "\n", "# We drop the signed deltas and aggregate per commit, summing over all files modified in that commit.\n", "complexity.drop(columns=['HE_delta', 'CCN_delta', 'NLOC_delta', 'TOK_delta',\n", "                         'FUN_delta'], inplace=True)\n", "\n", "complexity = complexity.groupby(['commit_hash']) \\\n", "                       .agg({'events': 'sum',\n", "                             'levenshtein_distance': 'sum',\n", "                             'HE_absdelta': 'sum',\n", "                             'CCN_absdelta': 'sum',\n", "                             'NLOC_absdelta': 'sum',\n", "                             'TOK_absdelta': 'sum',\n", "                             'FUN_absdelta': 'sum'}).reset_index()\n", "\n", "# We add a counter for the commits.\n", "complexity['commits'] = 1\n", "\n", "with sqlite3.connect(sqlite_db_file) as con:\n", "    commits = pd.read_sql(\"SELECT hash, author_date, author_id FROM commits\", con)\n", "\n", "complexity = pd.merge(complexity, commits, left_on='commit_hash', right_on='hash', how='left')\n", "complexity = complexity.set_index(pd.DatetimeIndex(complexity['author_date']))\n", "complexity = complexity.sort_index()\n", "complexity.drop(columns=['hash', 'commit_hash', 'author_date'], inplace=True)\n", "\n", "complexity.head()" ] }
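, { "cell_type": "markdown", "id": "5b6c7d8e", "metadata": {}, "source": [ "Before plotting how these quantities evolve over time, we can get a quick overview by summing them per author.\n", "The cell below is a minimal sketch based only on the dataframe we just built; it groups by the `author_id` column added in the merge above." ] },
{ "cell_type": "code", "execution_count": null, "id": "9e0f1a2b", "metadata": {}, "outputs": [], "source": [ "# Total commits, edit events, and absolute complexity changes per disambiguated author (sketch).\n", "complexity.groupby('author_id')[['commits', 'events', 'levenshtein_distance',\n", "                                 'HE_absdelta', 'CCN_absdelta', 'NLOC_absdelta',\n", "                                 'TOK_absdelta', 'FUN_absdelta']].sum()" ] }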
] }, { "cell_type": "code", "execution_count": null, "id": "ec644540", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T18:07:08.619742Z", "start_time": "2023-05-01T18:07:06.622791Z" } }, "outputs": [], "source": [ "with sqlite3.connect(sqlite_db_file) as con:\n", " commits = pd.read_sql(\"SELECT author_name, author_id FROM commits\", con)\n", "\n", "commits.drop_duplicates(inplace=True)\n", " \n", "author_id_name_map = {}\n", "for idx, group in commits.groupby('author_id'):\n", " author_id_name_map[idx] = list(group.author_name)\n", "\n", "fig, axs = plt.subplots(4,2, figsize=(14,10))\n", "\n", "axs[0,0].set_title('Commits')\n", "axs[0,1].set_title('Events')\n", "axs[1,0].set_title('Levenshtein Distance')\n", "axs[1,1].set_title('Halstead Effort')\n", "axs[2,0].set_title('Cyclomatic Complexity')\n", "axs[2,1].set_title('Lines of Code')\n", "axs[3,0].set_title('Tokens')\n", "axs[3,1].set_title('Functions')\n", "\n", "for author_id, group in complexity.groupby(['author_id']):\n", " group_cs = group.cumsum()\n", " axs[0,0].plot(group_cs.index, group_cs.commits, label=author_id_name_map[author_id][0])\n", " axs[0,1].plot(group_cs.index, group_cs.events)\n", " axs[1,0].plot(group_cs.index, group_cs.levenshtein_distance)\n", " axs[1,1].plot(group_cs.index, group_cs.HE_absdelta)\n", " axs[2,0].plot(group_cs.index, group_cs.CCN_absdelta)\n", " axs[2,1].plot(group_cs.index, group_cs.NLOC_absdelta)\n", " axs[3,0].plot(group_cs.index, group_cs.TOK_absdelta)\n", " axs[3,1].plot(group_cs.index, group_cs.FUN_absdelta)\n", "\n", "plt.tight_layout()\n", "axs[0,0].legend(loc='upper center', bbox_to_anchor=(1, 1.4), ncol=8)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "bc5c2db3", "metadata": {}, "source": [ "Can you create a similar plot with a rolling window instead of the cumulative sum?" ] }, { "cell_type": "markdown", "id": "7ab1e4d2", "metadata": {}, "source": [ "With this, we conclude both this tutorial and the series of tutorials for `git2net`.\n", "We hope you found them helpful.\n", "Enjoy using `git2net`, and best of luck with your research!\n", "If you have any feedback or find bugs within the code, please let us know at [`gotec/git2net`](https://github.com/gotec/git2net).\n", "\n", "`git2net` is developed as an Open Source project, which means your ideas and inputs are highly welcome.\n", "Feel free to share the project and contribute yourself.\n", "You can immediately get started on the repository you just downloaded!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10", "language": "python", "name": "py310" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" } }, "nbformat": 4, "nbformat_minor": 5 }