{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "126da169", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:15.559077Z", "start_time": "2023-05-01T20:49:11.655594Z" }, "scrolled": true }, "outputs": [], "source": [ "!git clone https://github.com/gotec/git2net-tutorials\n", "import os\n", "os.chdir('git2net-tutorials')\n", "!pip install -r requirements.txt\n", "os.chdir('..')" ] }, { "cell_type": "code", "execution_count": null, "id": "229321b7", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:17.503816Z", "start_time": "2023-05-01T20:49:15.564914Z" } }, "outputs": [], "source": [ "import git2net\n", "import networkx as nx\n", "import matplotlib.pyplot as plt\n", "import shutil\n", "import sqlite3\n", "import pandas as pd\n", "from collections import Counter, defaultdict\n", "import pathpy as pp\n", "from datetime import timedelta\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "d00a8948", "metadata": {}, "source": [ "# git2net @ MSR 2023\n", "\n", "This is a 40-min tutorial prepared for the 20th International Conference on Mining Software Repositories (MSR 2023) held on the 15-16th of May 2023 in Melbourne, Australia.\n", "\n", "## The Basics\n", "\n", "Wary of the short time, let's jump right in:" ] }, { "cell_type": "code", "execution_count": null, "id": "99fd392b", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:17.516496Z", "start_time": "2023-05-01T20:49:17.505523Z" } }, "outputs": [], "source": [ "github_url = 'gotec/git2net'\n", "git_repo_dir = 'git2net4analysis'\n", "sqlite_db_file = 'git2net4analysis.db'\n", "\n", "# Remove the clone of the repository if it already exists from a previous run\n", "if os.path.exists(git_repo_dir):\n", " shutil.rmtree(git_repo_dir)\n", "\n", "# Remove resulting sqlite database if it already exists from a previous run\n", "if os.path.exists(sqlite_db_file):\n", " os.remove(sqlite_db_file)" ] }, { "cell_type": "code", "execution_count": null, "id": "0dffe0cc", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:33.759614Z", "start_time": "2023-05-01T20:49:17.518086Z" } }, "outputs": [], "source": [ "# Call git2net\n", "git2net.mine_github(github_url, git_repo_dir, sqlite_db_file)\n", "n, node_info, edge_info = git2net.get_coediting_network(sqlite_db_file,\n", " author_identifier='author_name',\n", " engine='networkx')\n", "\n", "# Visualise the result\n", "plt.figure(figsize=(15,10))\n", "nx.draw(n, with_labels=True)\n", "plt.margins(x=.1)" ] }, { "cell_type": "markdown", "id": "c1c88537", "metadata": { "ExecuteTime": { "end_time": "2023-04-30T15:29:42.859938Z", "start_time": "2023-04-30T15:29:42.851736Z" } }, "source": [ "Congratulations, you have just mined and visualised your first repository using `git2net`!\n", "\n", "So what did we just do?\n", "The function `git2net.mine_github()` clones a given repository from GitHub and mines it's content.\n", "Next to repositories from GitHub, `git2net` also works on all other git repositories.\n", "To mine these, simply create a local clone yourself and call `git2net`'s mining function:" ] }, { "cell_type": "code", "execution_count": null, "id": "a4ce741a", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:33.843513Z", "start_time": "2023-05-01T20:49:33.762695Z" } }, "outputs": [], "source": [ "git2net.mine_git_repo(git_repo_dir, sqlite_db_file)" ] }, { "cell_type": "markdown", "id": "c62b3cc9", "metadata": {}, "source": [ "This also shows a core feature of `git2net`. It is able to seamlessly resume mining from a previously mined repository.\n", "But what happens during mining?" ] }, { "cell_type": "code", "execution_count": null, "id": "f8ed9fa6", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:33.849591Z", "start_time": "2023-05-01T20:49:33.845139Z" } }, "outputs": [], "source": [ "from IPython.display import IFrame\n", "IFrame(\"./figures/mining.pdf\", width=900, height=450)" ] }, { "cell_type": "markdown", "id": "9857ccf0", "metadata": {}, "source": [ "Mining a repository with `git2net` yields an SQLite database that contains three tables: `_metadata`, `commits`, and `edits`.\n", "The table `_metadata` contains information such as the path to the original repository, the time it was mined, and the settings for `git2net` that were used." ] }, { "cell_type": "code", "execution_count": null, "id": "9e089657", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:33.864518Z", "start_time": "2023-05-01T20:49:33.851180Z" } }, "outputs": [], "source": [ "with sqlite3.connect(sqlite_db_file) as con:\n", " commit_data = pd.read_sql_query(\"SELECT * FROM _metadata\", con)\n", "\n", "commit_data.tail()" ] }, { "cell_type": "markdown", "id": "cf218860", "metadata": { "ExecuteTime": { "end_time": "2023-04-30T16:06:00.959223Z", "start_time": "2023-04-30T16:06:00.950299Z" } }, "source": [ "The table `commits` stores all information related to the commits themselves, e.g., information regarding the author and time, the branch(es) in which it appears, and their parent-commit(s).\n", "All commits are uniquely identified by their hash." ] }, { "cell_type": "code", "execution_count": null, "id": "f303d650", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:33.888960Z", "start_time": "2023-05-01T20:49:33.866041Z" } }, "outputs": [], "source": [ "with sqlite3.connect(sqlite_db_file) as con:\n", " commit_data = pd.read_sql_query(\"SELECT * FROM commits\", con)\n", "\n", "commit_data.tail()" ] }, { "cell_type": "markdown", "id": "d089c22d", "metadata": {}, "source": [ "Each commit may contain multiple files, each including numerous changes to lines.\n", "These changes are reflected in the `edits` table, which contains detailed information about all modifications of the files contained in the commits.\n", "Here `git2net` distinguishes between different file-level modifications (`modification_type`) and the corresponding line-level edits within files (`edit_type`).\n", "\n", "A file included in a commit may be an added (`ADD`), deleted (`DELETE`), or modified (`MODIFY`) file.\n", "In addition, changes to individual lines are saved for each file.\n", "These can be newly added (`addition`), changed (`replacement`) or deleted (`deletion`) lines." ] }, { "cell_type": "code", "execution_count": null, "id": "6fccc113", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:34.442094Z", "start_time": "2023-05-01T20:49:33.890524Z" } }, "outputs": [], "source": [ "with sqlite3.connect(sqlite_db_file) as con:\n", " commit_data = pd.read_sql_query(\"SELECT * FROM edits\", con)\n", "\n", "commit_data.tail()" ] }, { "cell_type": "markdown", "id": "c8bc14d9", "metadata": {}, "source": [ "## Author Disambiguation\n", "\n", "Some of you may have may have spotted an issue with the network I presented in the very beginning.\n", "Going back and looking at it in more detail, we observe that one user appears twice.\n", "This is obviously an issue.\n", "Let's assume that we want to study the relationship between team size and team productivity like we did in a paper presented at ICSE last year (doi.org/10.1145/3510003.3510619).\n", "If we were to split all authors into two identities we would double the size of the team and halve the producvity of each member (on average).\n", "So how can we avoid this?\n", "Using `git2net` for our own research on a daily basis, we wanted to make this process as simple as possible:" ] }, { "cell_type": "code", "execution_count": null, "id": "2b823b80", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:34.541133Z", "start_time": "2023-05-01T20:49:34.443822Z" } }, "outputs": [], "source": [ "git2net.disambiguate_aliases_db(sqlite_db_file, method='gambit', thresh=.45)\n", "\n", "with sqlite3.connect(sqlite_db_file) as con:\n", " authors = pd.read_sql(\"\"\"SELECT author_name, author_email, author_id FROM commits\"\"\", con)\n", "\n", "Counter(['{} --- {}, <{}>'.format(row.author_id, row.author_name, row.author_email)\n", " for idx, row in authors.iterrows()])" ] }, { "cell_type": "markdown", "id": "41579e67", "metadata": { "ExecuteTime": { "end_time": "2023-04-30T16:28:46.891224Z", "start_time": "2023-04-30T16:28:46.845150Z" } }, "source": [ "Next to our disambiguation algorithm `gambit` (doi.org/10.1109/MSR52588.2021.00021), `git2net` also supports the algorithms from Goeminne and Mens (`method='simple'`, doi.org/10.1016/j.scico.2011.11.004) and Bird et al. (`method='bird'`, doi.org/10.1145/1137983.1138016).\n", "Try those for yourself!" ] }, { "cell_type": "markdown", "id": "c3588682", "metadata": {}, "source": [ "## Network Analysis and Visualisation\n", "\n", "So much on the mining part.\n", "Let's now look at the network that we obtained with the second call in our short example (`git2net.get_coediting_network()`)" ] }, { "cell_type": "code", "execution_count": null, "id": "55bb1fd2", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:34.547199Z", "start_time": "2023-05-01T20:49:34.542863Z" } }, "outputs": [], "source": [ "from IPython.display import IFrame\n", "IFrame(\"./figures/coediting_network_creation.pdf\", width=900, height=700)" ] }, { "cell_type": "markdown", "id": "47013a79", "metadata": {}, "source": [ "Before, we were looking at the static network visualisation. Let's now have a look at the temporal network shown in the middle.\n", "\n", "For this example, we will use a rolling time window spanning 365 days, which we then shift in 30-day increments.\n", "We can use the `RollingTimeWindow()` function in `pathpy` to compute the networks over time.\n", "Then, we only need to add the statistics of interest to a dictionary and plot them." ] }, { "cell_type": "code", "execution_count": null, "id": "fbfd53f2", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:35.132211Z", "start_time": "2023-05-01T20:49:34.548853Z" } }, "outputs": [], "source": [ "n_coediting_t, node_info, edge_info = git2net.get_coediting_network(sqlite_db_file)\n", "\n", "WINDOW_SIZE = 365*24*60*60\n", "STEP_SIZE = 30*24*60*60\n", "\n", "data = defaultdict(list)\n", "for network, window in pp.RollingTimeWindow(n_coediting_t,\n", " window_size=WINDOW_SIZE,\n", " step_size=STEP_SIZE,\n", " directed=True, return_window=True):\n", " data['number of developers'].append(network.ncount())\n", " data['time'].append(window[1]) # append window end time\n", "\n", "df = pd.DataFrame(data, columns=list(data.keys()))\n", "df.set_index(pd.to_datetime(df.time, unit='s'), inplace=True)\n", "df.drop('time', axis=1, inplace=True)\n", "\n", "df.plot(y='number of developers', figsize=(15,7))" ] }, { "cell_type": "markdown", "id": "2b0c6bbe", "metadata": {}, "source": [ "## Database Analysis: Code Ownership\n", "\n", "Let's next look at the database that `git2net` creates.\n", "Specifically, we will look at an exemplary analysis on how the editing of own and foreign code evolves in the repository of `git2net`, similarly to the analyses we performed here (doi.org/10.1007/s10664-020-09928-2).\n", "\n", "To do so, we first get the required data from the SQLite database.\n", "Then, we join the two resulting data frames to get a single one containing information about how much code of whom and when." ] }, { "cell_type": "code", "execution_count": null, "id": "7fbc1b27", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:35.189728Z", "start_time": "2023-05-01T20:49:35.133861Z" } }, "outputs": [], "source": [ "with sqlite3.connect(sqlite_db_file) as con:\n", " edits = pd.read_sql(\"\"\"SELECT\n", " commit_hash,\n", " original_commit_deletion,\n", " levenshtein_dist\n", " FROM edits\n", " WHERE edit_type=='replacement'\"\"\", con).drop_duplicates()\n", " commits = pd.read_sql(\"\"\"SELECT\n", " hash,\n", " author_id,\n", " author_date\n", " FROM commits\"\"\", con)\n", " \n", "edit_info = pd.merge(edits, commits, how='left', left_on='commit_hash', right_on='hash') \\\n", " .drop(columns=['commit_hash', 'hash'])\n", "\n", "edit_info = pd.merge(edit_info, commits, how='left', left_on='original_commit_deletion',\n", " right_on='hash', suffixes=('', '_before')) \\\n", " .drop(columns=['original_commit_deletion', 'hash', 'author_date_before'])\n", "\n", "edit_info = edit_info[['author_id_before', 'author_id', 'author_date', 'levenshtein_dist']]\n", "\n", "edit_info.index = pd.DatetimeIndex(edit_info.author_date)\n", "edit_info = edit_info.drop(columns=['author_date'])\n", "\n", "edit_info" ] }, { "cell_type": "markdown", "id": "5b27bd91", "metadata": {}, "source": [ "Now, we can compute the Levenshtein distances for changes in both own and foreign code for a rolling time window.\n", "In a final step, we normalise them to allow a better comparison and plot them over time." ] }, { "cell_type": "code", "execution_count": null, "id": "df0deb28", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:37.717043Z", "start_time": "2023-05-01T20:49:35.193191Z" } }, "outputs": [], "source": [ "windowsize = timedelta(days=365)\n", "increment = timedelta(days=30)\n", "\n", "plot_data = defaultdict(list)\n", "\n", "time = min(edit_info.index) + windowsize\n", "while time < max(edit_info.index):\n", " mask = (edit_info.index > time - windowsize) & (edit_info.index <= time)\n", " wdata = edit_info.loc[mask]\n", " self_changes_dist = 0\n", " foreign_changes_dist = 0\n", " for idx, row in wdata.iterrows():\n", " self_changes_dist += row['levenshtein_dist'] * (row['author_id_before'] == row['author_id'])\n", " foreign_changes_dist += row['levenshtein_dist'] * (row['author_id_before'] != row['author_id'])\n", " plot_data['time'].append(time)\n", " plot_data['self_changes_dist'].append(self_changes_dist)\n", " plot_data['foreign_changes_dist'].append(foreign_changes_dist)\n", " time += increment\n", "\n", "\n", "plot_data['self_changes_dist_norm'] = [s / (s + f) for s, f in zip(plot_data['self_changes_dist'],\n", " plot_data['foreign_changes_dist'])]\n", "plot_data['foreign_changes_dist_norm'] = [f / (s + f) for s, f in zip(plot_data['self_changes_dist'],\n", " plot_data['foreign_changes_dist'])]\n", "\n", "plot_data = pd.DataFrame(plot_data)\n", "\n", "plt.figure(figsize=(15,10))\n", "ax = plt.subplot(2,1,1)\n", "plot_data.plot(x='time', y=['self_changes_dist', 'foreign_changes_dist'], ax=ax, ylabel=\"Levenshtein distance\")\n", "\n", "ax = plt.subplot(2,1,2)\n", "plot_data.plot(x='time', y=['self_changes_dist_norm', 'foreign_changes_dist_norm'], kind='area', ax=ax,\n", " ylabel=\"relative Levenshtein distance\")\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "13ef8184", "metadata": {}, "source": [ "As you can see, code ownership in the repository of `git2net` is very high.\n", "However, this should be expected as, as we saw earlier, most of the code is written by a single person.\n", "With minor modifications to the code above, you can also figure out how much code each contributor owns at every point in time.\n", "\n", "`git2net` also works with the git repositories behind [Overleaf projects](https://www.overleaf.com/).\n", "So go ahead and try it with your most recent paper :)" ] }, { "cell_type": "markdown", "id": "87a65194", "metadata": {}, "source": [ "## Complexity Analysis\n", "\n", "Finally, let's look at code complexity.\n", "By calling the function `git2net.compute_complexity()` we can now create a new table `complexity` in our database which contains the complexity of all modified files for all commits.\n", "Let's create this table and have a look at what it contains." ] }, { "cell_type": "code", "execution_count": null, "id": "9ac745ff", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:52.045090Z", "start_time": "2023-05-01T20:49:37.718706Z" } }, "outputs": [], "source": [ "git2net.compute_complexity(git_repo_dir, sqlite_db_file)\n", "\n", "with sqlite3.connect(sqlite_db_file) as con:\n", " complexity = pd.read_sql(\"SELECT * FROM complexity\", con)\n", " \n", "complexity.head()" ] }, { "cell_type": "markdown", "id": "25b0a7f4", "metadata": {}, "source": [ "To see how code complexity changes for the `git2net`project, we create a dataframe listing the absolute changes in complexity for all contributors to `git2net` over time. " ] }, { "cell_type": "code", "execution_count": null, "id": "c0e7d242", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:52.087816Z", "start_time": "2023-05-01T20:49:52.046838Z" } }, "outputs": [], "source": [ "with sqlite3.connect(sqlite_db_file) as con:\n", " complexity = pd.read_sql(\"\"\"SELECT commit_hash,\n", " events,\n", " levenshtein_distance,\n", " HE_delta,\n", " CCN_delta,\n", " NLOC_delta,\n", " TOK_delta,\n", " FUN_delta\n", " FROM complexity\"\"\", con)\n", "\n", "# We compute the absolute differences.\n", "complexity['HE_absdelta'] = np.abs(complexity.HE_delta)\n", "complexity['CCN_absdelta'] = np.abs(complexity.CCN_delta)\n", "complexity['NLOC_absdelta'] = np.abs(complexity.NLOC_delta)\n", "complexity['TOK_absdelta'] = np.abs(complexity.TOK_delta)\n", "complexity['FUN_absdelta'] = np.abs(complexity.FUN_delta)\n", "\n", "complexity.groupby(['commit_hash']) \\\n", " .agg({'events': 'sum',\n", " 'levenshtein_distance': 'sum',\n", " 'HE_absdelta': 'sum',\n", " 'CCN_absdelta': 'sum',\n", " 'NLOC_absdelta': 'sum',\n", " 'TOK_absdelta': 'sum',\n", " 'FUN_absdelta': 'sum'}).reset_index()\n", " \n", "complexity.drop(columns=['HE_delta', 'CCN_delta', 'NLOC_delta', 'TOK_delta',\n", " 'FUN_delta'], inplace=True) \n", " \n", "# We add a counter for the commits.\n", "complexity['commits'] = 1\n", " \n", "with sqlite3.connect(sqlite_db_file) as con:\n", " commits = pd.read_sql(\"SELECT hash, author_date, author_id FROM commits\", con)\n", "\n", "complexity = pd.merge(complexity, commits, left_on='commit_hash', right_on='hash', how='left')\n", "complexity = complexity.set_index(pd.DatetimeIndex(complexity['author_date']))\n", "complexity = complexity.sort_index()\n", "complexity.drop(columns=['hash', 'commit_hash', 'author_date'], inplace=True)\n", "\n", "complexity.head()" ] }, { "cell_type": "markdown", "id": "a59955d0", "metadata": {}, "source": [ "We can then plot these changes, e.g., as cumulative sums as shown below." ] }, { "cell_type": "code", "execution_count": null, "id": "323f2e28", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T20:49:54.112552Z", "start_time": "2023-05-01T20:49:52.089538Z" } }, "outputs": [], "source": [ "with sqlite3.connect(sqlite_db_file) as con:\n", " commits = pd.read_sql(\"SELECT author_name, author_id FROM commits\", con)\n", "\n", "commits.drop_duplicates(inplace=True)\n", " \n", "author_id_name_map = {}\n", "for idx, group in commits.groupby('author_id'):\n", " author_id_name_map[idx] = list(group.author_name)\n", "\n", "fig, axs = plt.subplots(4,2, figsize=(14,10))\n", "\n", "axs[0,0].set_title('Commits')\n", "axs[0,1].set_title('Events')\n", "axs[1,0].set_title('Levenshtein Distance')\n", "axs[1,1].set_title('Halstead Effort')\n", "axs[2,0].set_title('Cyclomatic Complexity')\n", "axs[2,1].set_title('Lines of Code')\n", "axs[3,0].set_title('Tokens')\n", "axs[3,1].set_title('Functions')\n", "\n", "for author_id, group in complexity.groupby('author_id'):\n", " group_cs = group.cumsum()\n", " axs[0,0].plot(group_cs.index, group_cs.commits, label=author_id_name_map[author_id][0])\n", " axs[0,1].plot(group_cs.index, group_cs.events)\n", " axs[1,0].plot(group_cs.index, group_cs.levenshtein_distance)\n", " axs[1,1].plot(group_cs.index, group_cs.HE_absdelta)\n", " axs[2,0].plot(group_cs.index, group_cs.CCN_absdelta)\n", " axs[2,1].plot(group_cs.index, group_cs.NLOC_absdelta)\n", " axs[3,0].plot(group_cs.index, group_cs.TOK_absdelta)\n", " axs[3,1].plot(group_cs.index, group_cs.FUN_absdelta)\n", "\n", "plt.tight_layout()\n", "axs[0,0].legend(loc='upper center', bbox_to_anchor=(1, 1.5), ncol=8)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "594ff88a", "metadata": {}, "source": [ "Try this out for your own repository!\n", "Can you create a similar plot with a rolling window instead of the cumulative sum?" ] }, { "cell_type": "markdown", "id": "e827f1b5", "metadata": { "ExecuteTime": { "end_time": "2023-05-01T18:07:50.881784Z", "start_time": "2023-05-01T18:07:50.870332Z" } }, "source": [ "With this, we conclude this tutorial on `git2net`.\n", "We hope you found it interesting and helpful.\n", "If you did, you can find a broad range of more detailed tutorials on `git2net` at https://github.com/gotec/git2net-tutorials.\n", "You can also find a full documentation at https://git2net.readthedocs.io/en/latest/.\n", "\n", "Enjoy using `git2net`, and best of luck with your research!\n", "If you have any feedback or find bugs within the code, please let us know at https://github.com/gotec/git2net." ] }, { "cell_type": "code", "execution_count": null, "id": "b0d611bd", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10", "language": "python", "name": "py310" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" } }, "nbformat": 4, "nbformat_minor": 5 }