{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "8f2b7a84", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:22.933907Z", "start_time": "2022-03-04T15:08:20.024616Z" } }, "outputs": [], "source": [ "!git clone https://github.com/gotec/git2net-tutorials\n", "import os\n", "os.chdir('git2net-tutorials')\n", "!pip install -r requirements.txt\n", "os.chdir('..')\n", "!git clone https://github.com/gotec/git2net git2net4analysis" ] }, { "cell_type": "code", "execution_count": null, "id": "4cacb8c8", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:23.538874Z", "start_time": "2022-03-04T15:08:22.938644Z" } }, "outputs": [], "source": [ "import git2net\n", "import os\n", "from collections import defaultdict\n", "import pandas as pd\n", "import pathpy as pp" ] }, { "cell_type": "markdown", "id": "96a05281", "metadata": {}, "source": [ "## Network Analysis and Visualisation\n", "\n", "This tutorial discusses the functions of `git2net` that allow us to generate various network projections of a git repository.\n", "However, before we do so, let's mine the repository and disambiguate the author aliases again to have a clean starting point to work with.\n", "For the sake of exposition, we start with a repository only including commits that touched five or fewer files." ] }, { "cell_type": "code", "execution_count": null, "id": "bcee1c86", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:36.733139Z", "start_time": "2022-03-04T15:08:23.541143Z" } }, "outputs": [], "source": [ "# We assume a clone of git2net's repository exists in the folder below following the first tutorial.\n", "git_repo_dir = 'git2net4analysis'\n", "\n", "# Here, we specify the database in which we will store the results of the mining process.\n", "sqlite_db_file = 'git2net4analysis.db'\n", "\n", "# Remove database if exists\n", "if os.path.exists(sqlite_db_file):\n", " os.remove(sqlite_db_file)\n", " \n", "git2net.mine_git_repo(git_repo_dir, sqlite_db_file, max_modifications=5)\n", "git2net.disambiguate_aliases_db(sqlite_db_file)" ] }, { "cell_type": "markdown", "id": "5a574d23", "metadata": {}, "source": [ "Great, with these few lines, we're already all set.\n", "\n", "## Network projections\n", "\n", "We now provide a brief overview of all network projections included in `git2net`.\n", "\n", "### Co-editing networks\n", "\n", "To start our exploration, let's try to obtain a co-editing network for our project.\n", "We can do this by simply calling the `get_coediting_network()` function and providing the database we just mined.\n", "\n", "Note that by default, all network visualisations use the `author_id` that we obtained from the author disambiguation in the previous tutorial to plot the networks.\n", "However, if you want to use the `author_name` or `author_email` instead, you can provide it as an optional argument, e.g., `author_identifier='author_email'`." ] }, { "cell_type": "code", "execution_count": null, "id": "227c7755", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:09:12.768793Z", "start_time": "2022-03-04T15:09:12.613122Z" } }, "outputs": [], "source": [ "t, node_info, edge_info = git2net.get_coediting_network(sqlite_db_file)\n", "print(t)\n", "t" ] }, { "cell_type": "markdown", "id": "e94e6a8d", "metadata": {}, "source": [ "The function returns a `pathpy` temporal network object and two dictionaries containing properties of nodes and edges.\n", "As of writing this tutorial, not all of them are used. \n", "However, they are set as placeholders for future versions of `git2net`.\n", "\n", "As shown above, a `pathpy` temporal network object can be visualised by itself.\n", "In addition, we can also aggregate the network by dropping the order of events, yielding a standard network object.\n", "Let's do this next." ] }, { "cell_type": "code", "execution_count": null, "id": "5cacfd45", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:36.880564Z", "start_time": "2022-03-04T15:08:36.868227Z" } }, "outputs": [], "source": [ "pp.Network.from_temporal_network(t)" ] }, { "cell_type": "markdown", "id": "acd584d2", "metadata": {}, "source": [ "In both the temporal and aggregated network, a node represents an author, whereas edges point from the person changing a line of code to the person who was the original author.\n", "\n", "### Bipartite author-file networks\n", "\n", "Next, we could ask the question of which files different authors collaborated on. Therefore, we can plot a bipartite network containing both files and authors as nodes." ] }, { "cell_type": "code", "execution_count": null, "id": "7a154ef6", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:36.990352Z", "start_time": "2022-03-04T15:08:36.882220Z" } }, "outputs": [], "source": [ "t, node_info, edge_info = git2net.get_bipartite_network(sqlite_db_file)\n", "n = pp.Network.from_temporal_network(t)\n", "n" ] }, { "cell_type": "markdown", "id": "841609dc", "metadata": {}, "source": [ "For this network, `node_info` contains the classes of authors in the network. These can e.g. be used to colour nodes as shown below." ] }, { "cell_type": "code", "execution_count": null, "id": "4ff2fbf3", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:36.997534Z", "start_time": "2022-03-04T15:08:36.991683Z" } }, "outputs": [], "source": [ "colour_map = {'author': '#73D2DE', 'file': '#2E5EAA'}\n", "node_color = {node: colour_map[node_info['class'][node]] for node in n.nodes}\n", "pp.visualisation.plot(n, node_color=node_color)" ] }, { "cell_type": "markdown", "id": "f9d20507", "metadata": {}, "source": [ "If we are interested in, e.g. more recently edited files, we can filter the database by providing the `time_from` and `time_to` options. Let's check the files edited since May 2019." ] }, { "cell_type": "code", "execution_count": null, "id": "d1634386", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:37.092994Z", "start_time": "2022-03-04T15:08:36.998744Z" } }, "outputs": [], "source": [ "from datetime import datetime\n", "time_from = datetime(2019, 5, 1)\n", "t, node_info, edge_info = git2net.get_bipartite_network(sqlite_db_file, time_from=time_from)\n", "n = pp.Network.from_temporal_network(t)\n", "colour_map = {'author': '#73D2DE', 'file': '#2E5EAA'}\n", "node_color = {node: colour_map[node_info['class'][node]] for node in n.nodes}\n", "pp.visualisation.plot(n, node_color=node_color)" ] }, { "cell_type": "markdown", "id": "d6ce2900", "metadata": {}, "source": [ "### Co-authorship networks\n", "\n", "The projection of this network that links authors editing the same file is the co-authorship network." ] }, { "cell_type": "code", "execution_count": null, "id": "26e7bd1f", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:38.072470Z", "start_time": "2022-03-04T15:08:37.094396Z" } }, "outputs": [], "source": [ "n, node_info, edge_info = git2net.get_coauthorship_network(sqlite_db_file)\n", "n" ] }, { "cell_type": "markdown", "id": "a31b8bfb", "metadata": {}, "source": [ "Note that it looks similar as, at its core, the co-authorship network is a projection of the bipartite network to authors." ] }, { "cell_type": "markdown", "id": "abb6ebb7", "metadata": {}, "source": [ "### Line editing paths\n", "\n", "`git2net` allows the extraction of editing paths on the level of individual lines. I.e. we can track consecutive changes made to a single line over time—even if these lines move within a file or even across files. This is very powerful, as it allows us to determine editing sequences and find lines that require more editing than others.\n", "These could either be lines that are tough to implement, or they could contain essential information, such as the version number in an `__init__.py` file.\n", "\n", "To extract these paths, we can use the `get_line_editing_paths` function. As these networks tend to be very large, we limit the analysis to a small file for this tutorial.\n", "To only look at a specific set of file paths, we can use the `file_paths` option." ] }, { "cell_type": "code", "execution_count": null, "id": "4872cfed", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:38.504550Z", "start_time": "2022-03-04T15:08:38.074957Z" } }, "outputs": [], "source": [ "file_paths = ['setup.py']\n", "paths, dag, node_info, edge_info = git2net.get_line_editing_paths(sqlite_db_file, git_repo_dir,\n", " file_paths=file_paths)\n", "pp.visualisation.plot(dag, node_color=node_info['colors'])" ] }, { "cell_type": "markdown", "id": "a4052f52", "metadata": {}, "source": [ "Notice that despite only looking at a single file, the network shown above is not connected. This is due to our database not being complete. Let's fix this now and try again." ] }, { "cell_type": "code", "execution_count": null, "id": "fb731222", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:46.151351Z", "start_time": "2022-03-04T15:08:38.505867Z" } }, "outputs": [], "source": [ "git2net.mine_git_repo(git_repo_dir, sqlite_db_file)" ] }, { "cell_type": "markdown", "id": "41dfd57a", "metadata": {}, "source": [ "As we now have more commits in our database, we also need to rerun the disambiguation." ] }, { "cell_type": "code", "execution_count": null, "id": "7e3c6782", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:46.256815Z", "start_time": "2022-03-04T15:08:46.154930Z" } }, "outputs": [], "source": [ "git2net.disambiguate_aliases_db(sqlite_db_file)" ] }, { "cell_type": "markdown", "id": "631fdda3", "metadata": {}, "source": [ "Let's colour the nodes by type and look at the visualisation again." ] }, { "cell_type": "code", "execution_count": null, "id": "38fd19f6", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:46.926460Z", "start_time": "2022-03-04T15:08:46.261004Z" } }, "outputs": [], "source": [ "paths, dag, node_info, edge_info = git2net.get_line_editing_paths(sqlite_db_file, git_repo_dir,\n", " file_paths=file_paths)\n", "colors = {}\n", "for x in dag.nodes:\n", " colors[x] = '#FBB13C'\n", "for x in dag.roots:\n", " colors[x] = '#A83236'\n", "for x in dag.leafs:\n", " colors[x] = '#21830'\n", "\n", "pp.visualisation.plot(dag, node_color=colors, width=1000, height=1000)" ] }, { "cell_type": "markdown", "id": "05ccc1fa", "metadata": {}, "source": [ "As mentioned before, these networks get very large very quickly. Therefore, it is often more helpful to work with the `pathpy` path object that is also returned by the function.\n", "It contains all paths and subpaths contained in the network shown above. More information regarding this object can be found in the documentation on [pathpy.net](http://www.pathpy.net/)." ] }, { "cell_type": "markdown", "id": "a0765b8e", "metadata": {}, "source": [ "### Commit editing paths\n", "\n", "Finally, let's look at a projection in which nodes represent commits and a directed edge $c_1 \\rightarrow c_2$ between two commits $c_1, c_2$ exists if commit $c_2$ modified lines written in $c_1$." ] }, { "cell_type": "code", "execution_count": null, "id": "948379b2", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:47.270703Z", "start_time": "2022-03-04T15:08:46.928190Z" } }, "outputs": [], "source": [ "dag, node_info, edge_info = git2net.get_commit_editing_dag(sqlite_db_file)\n", "\n", "dag" ] }, { "cell_type": "markdown", "id": "c2f67455", "metadata": {}, "source": [ "## Comparison co-editing and co-authorship networks\n", "\n", "In our [original publication](https://arxiv.org/abs/1903.10180) of `git2net`, we compared the co-editing and co-authorship networks of an Open Source and proprietary software development project.\n", "Let's see how a similar comparison would work for the repository behind `git2net`." ] }, { "cell_type": "code", "execution_count": null, "id": "b804e2b8", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:49.110484Z", "start_time": "2022-03-04T15:08:47.272605Z" } }, "outputs": [], "source": [ "# Get the co-authorship network\n", "n_coauthorship, node_info, edge_info = git2net.get_coauthorship_network(sqlite_db_file)\n", "\n", "# Get the (aggregated) co-editing network\n", "n_coediting_t, node_info, edge_info = git2net.get_coediting_network(sqlite_db_file)\n", "n_coediting = pp.Network.from_temporal_network(n_coediting_t)" ] }, { "cell_type": "code", "execution_count": null, "id": "a9d1d213", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:49.116672Z", "start_time": "2022-03-04T15:08:49.111940Z" } }, "outputs": [], "source": [ "print('==============================================\\n')\n", "print('# co-authorship')\n", "print(n_coauthorship)\n", "print('# co-editing')\n", "print(n_coediting)\n", "print('==============================================')" ] }, { "cell_type": "markdown", "id": "a51702d5", "metadata": {}, "source": [ "We find that both networks have seven nodes representing the seven individuals that worked on `git2net`.\n", "In the co-authorship network, the nodes are connected by eight undirected links indicating the two people edited the same file.\n", "In the co-editing network, we have nine directed links that show who edited lines from whom.\n", "\n", "Finally, let's see how the co-editing network evolves over time!\n", "For this example, we will use a rolling time window spanning 365 days, which we then shift in 30-day increments.\n", "We can use the `RollingTimeWindow()` function in `pathpy` to compute the networks over time.\n", "Then, we only need to add the statistics of interest to a dictionary." ] }, { "cell_type": "code", "execution_count": null, "id": "adc30ea9", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:49.154589Z", "start_time": "2022-03-04T15:08:49.117883Z" } }, "outputs": [], "source": [ "WINDOW_SIZE = 365*24*60*60\n", "STEP_SIZE = 30*24*60*60\n", "\n", "data = defaultdict(list)\n", "for network, window in pp.RollingTimeWindow(n_coediting_t,\n", " window_size=WINDOW_SIZE,\n", " step_size=STEP_SIZE,\n", " directed=True, return_window=True):\n", " data['number of developers'].append(network.ncount())\n", " data['unique relations directed'].append(network.ecount())\n", " data['mean outdegree'].append(pp.algorithms.statistics.mean_degree(network, degree='outdegree'))\n", " data['time'].append(window[1]) # append window end time" ] }, { "cell_type": "markdown", "id": "3f0112f9", "metadata": {}, "source": [ "Now that we have the data let's plot them and see what we get!" ] }, { "cell_type": "code", "execution_count": null, "id": "e3841c54", "metadata": { "ExecuteTime": { "end_time": "2022-03-04T15:08:49.719501Z", "start_time": "2022-03-04T15:08:49.155911Z" } }, "outputs": [], "source": [ "# Plot time-variable network measures\n", "df = pd.DataFrame(data, columns=list(data.keys()))\n", "df.set_index(pd.to_datetime(df.time, unit='s'), inplace=True)\n", "df.drop('time', axis=1, inplace=True)\n", "df.plot(y='number of developers')\n", "df.plot(y='unique relations directed')\n", "df.plot(y='mean outdegree')" ] }, { "cell_type": "markdown", "id": "1541474f", "metadata": {}, "source": [ "With this, we conclude this part of the tutorial.\n", "The following part will look at the remaining information mined by `git2net` and see how we can efficiently handle the resulting SQLite database." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }