{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "32f62787", "metadata": {}, "outputs": [], "source": [ "!git clone https://github.com/gotec/git2net-tutorials\n", "import os\n", "os.chdir('git2net-tutorials')\n", "!pip install -r requirements.txt\n", "os.chdir('..')\n", "!git clone https://github.com/gotec/git2net git2net4analysis" ] }, { "cell_type": "code", "execution_count": null, "id": "9ea4ccd2", "metadata": { "ExecuteTime": { "end_time": "2022-02-18T12:17:22.407323Z", "start_time": "2022-02-18T12:17:21.727563Z" } }, "outputs": [], "source": [ "import git2net\n", "import os\n", "import pandas as pd\n", "import sqlite3\n", "from collections import Counter" ] }, { "cell_type": "markdown", "id": "7025b745", "metadata": {}, "source": [ "# Author disambiguation\n", "\n", "In the previous tutorial we discussed the options of `git2net`'s `mine_git_repo()` function.\n", "Let's now call this function with its default options and have a closer look at the results we get." ] }, { "cell_type": "code", "execution_count": null, "id": "9ac40578", "metadata": { "ExecuteTime": { "end_time": "2022-02-18T12:17:34.188904Z", "start_time": "2022-02-18T12:17:22.409373Z" } }, "outputs": [], "source": [ "# We assume a clone of git2net's repository exists in the folder below following the first tutorial.\n", "git_repo_dir = 'git2net4analysis'\n", "\n", "# Here, we specify the database in which we will store the results of the mining process.\n", "sqlite_db_file = 'git2net4analysis.db'\n", "\n", "# Remove database if exists\n", "if os.path.exists(sqlite_db_file):\n", " os.remove(sqlite_db_file)\n", " \n", "git2net.mine_git_repo(git_repo_dir, sqlite_db_file)" ] }, { "cell_type": "markdown", "id": "8bd742a9", "metadata": {}, "source": [ "Specifically, we are interested in who worked on the project.\n", "Using the `commits` table in the resulting database, we can find this out." ] }, { "cell_type": "code", "execution_count": null, "id": "32c3ffbc", "metadata": { "ExecuteTime": { "end_time": "2022-02-18T12:17:34.219142Z", "start_time": "2022-02-18T12:17:34.190986Z" } }, "outputs": [], "source": [ "with sqlite3.connect(sqlite_db_file) as con:\n", " authors = pd.read_sql(\"\"\"SELECT author_name, author_email FROM commits\"\"\", con)\n", "\n", "Counter(['{}, <{}>'.format(row.author_name, row.author_email) for idx, row in authors.iterrows()])" ] }, { "cell_type": "markdown", "id": "956a9aa5", "metadata": {}, "source": [ "The results show that both *Ingo Scholtes* and *Christoph Gote* made commits with multiple different name-email combinations, i.e., using multiple different aliases.\n", "Let's try to understand why this could be an issue based on an example from one of our [recent publications](https://arxiv.org/abs/2201.04588).\n", "In this project, we use `git2net` to study the relationship between team size and productivity using over 200 repositories from GitHub.\n", "If we used the identities above, we would overestimate the team size (assuming the team size is the number of name-email combinations).\n", "Simultaneously, we would underestimate the productivity of individual team members (assuming the productivity is the number of commits).\n", "As a consequence, we would potentially come to vastly different conclusions.\n", "Therefore, we need a method to disambiguate which aliases belong to the same author.\n", "\n", "One might assume from the results above that we could simply use the names instead of name-email information.\n", "However, similar challenges can arise here too.\n", "Just think of middle names or apostrophes/umlauts, which might be stored differently on one computer compared to another.\n", "To deal with these challenges, `git2net` uses the Open Source name disambiguation tool `gambit` that we have developed specifically with the application in `git2net` in mind.\n", "We provide links to its [development repository](https://github.com/gotec/gambit) and the [original publication](https://arxiv.org/abs/2103.05666) for those of you who are interested.\n", "\n", "Let's now apply `gambit` to the database from above and look at the results.\n", "To do so, we can simply call the function `disambiguate_aliases_db()` on the `sqlite_db_file` resulting from `mine_git_repo()`." ] }, { "cell_type": "code", "execution_count": null, "id": "a13cda9d", "metadata": { "ExecuteTime": { "end_time": "2022-02-18T12:17:34.327503Z", "start_time": "2022-02-18T12:17:34.220836Z" } }, "outputs": [], "source": [ "git2net.disambiguate_aliases_db(sqlite_db_file)\n", "\n", "with sqlite3.connect(sqlite_db_file) as con:\n", " authors = pd.read_sql(\"\"\"SELECT author_name, author_email, author_id FROM commits\"\"\", con)\n", "\n", "Counter(['{} --- {}, <{}>'.format(row.author_id, row.author_name, row.author_email)\n", " for idx, row in authors.iterrows()])" ] }, { "cell_type": "markdown", "id": "f3dd4bdd", "metadata": {}, "source": [ "We find that—as we were hoping for—`gambit` assigns both aliases of Ingo Scholtes and Christoph Gote the same `author_id`s, respectively.\n", "Hence, we can use `author_id` in our subsequent analyses and visualisations as a unique identifier for the different authors active in the repository." ] }, { "cell_type": "markdown", "id": "4c554470", "metadata": {}, "source": [ "With this, we conclude our tutorial on author disambiguation.\n", "We have now covered the cloning of repositories and the different options you can select during mining.\n", "Finally, we showed how to disambiguate author identities, providing clean data for our subsequent analyses.\n", "In the following tutorial, it is finally time to look at the `net`work part in `git2net`!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }