{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!git clone https://github.com/gotec/git2net-tutorials\n", "import os\n", "os.chdir('git2net-tutorials')\n", "!pip install -r requirements.txt\n", "os.chdir('..')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Cloning a git repository for analysis with `git2net`\n", "\n", "In this notebook, we go through the first steps necessary to analyse your repository. Specifically, we show how to create a local copy (a clone) of an existing git repository. In sections 1 and 2 of this tutorial, we will solely consider the default branch (typically the `main` or `master` branch) of repositories. In section 3, we will then show how different or even multiple branches of repositories can be analysed." ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2022-02-14T08:06:42.318806Z", "start_time": "2022-02-14T08:06:42.312112Z" } }, "source": [ "## 1 - Manually cloning git repositories\n", "\n", "The easiest way to clone a git repository and immediately be ready for analysis with `git2net` is to clone the repository manually.\n", "To do this, you need the URL to the repository you want to analyse. For repositories on GitHub, [this](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository) short manual shows you how to obtain the URL.\n", "Once you have the URL, you can clone the repository using the command `git clone ` from a terminal, where you replace `` with the URL to the repository of interest.\n", "Subsequently, when starting to analyse the repository with `git2net`, you need to provide the path to the repository you just cloned.\n", "\n", "Cloning repositories in this way works great for both public and private repositories as you will be asked for your access credentials during the cloning process." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2 - Cloning git repositories with Python\n", "\n", "Manually cloning repositories is excellent when the number of repositories you aim to analyse is low.\n", "But what when you are faced with analysing many hundred repositories for a study?\n", "In this case, you will likely be looking to write a script that can clone the repositories for you.\n", "Below, we show you how you can achieve this directly from Python.\n", "\n", "As the process will be completely automated, you will not be asked for your access credentials while cloning the repository.\n", "Therefore, the processes for cloning public and private repositories is slightly different, as we will explain in the following sections.\n", "\n", "### Public repositories\n", "\n", "Let's start with public repositories.\n", "To start, you will need to select and clone a git repository that you are interested in analysing. For the purpose of this tutorial, we will explore the repository behind `git2net`—aiming to finally find a solution to the well-known chicken and egg problem.\n", "\n", "The following lines will clone the `git2net` repository to your current working directory. You can edit the path to the local directory stored in `git_repo_dir` to change this location." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2022-02-18T11:20:50.561284Z", "start_time": "2022-02-18T11:20:49.970067Z" } }, "outputs": [], "source": [ "import pygit2 as git2\n", "import os\n", "import shutil\n", "\n", "git_repo_url = 'https://github.com/gotec/git2net.git'\n", "git_repo_dir = 'git2net4analysis'\n", "\n", "# Remove the clone of the repository if it already exists from a previous run\n", "if os.path.exists(git_repo_dir):\n", " shutil.rmtree(git_repo_dir)\n", "\n", "repo = git2.clone_repository(git_repo_url, git_repo_dir) # Clones a non-bare repository" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Public repositories on GitHub\n", "\n", "For public repositories hosted on GitHub, `git2net` also provides the function `mine_github()` that allows you to clone and analyse a repository in a single step.\n", "The URL to the repository can be provided either as a full HTTPS URL (e.g. `https://github.com/gotec/git2net`) or simply as a combination `/` (e.g. `gotec/git2net`).\n", "Further, you need to specify the path where the repository is cloned to, `git_repo_dir`, and the path to the `sqlite_db_file` to which the results are written." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2022-02-18T11:21:02.275144Z", "start_time": "2022-02-18T11:20:50.562804Z" } }, "outputs": [], "source": [ "import git2net\n", "import os\n", "import shutil\n", "\n", "github_url = 'gotec/git2net'\n", "git_repo_dir = 'git2net4analysis'\n", "sqlite_db_file = 'git2net4analysis.db'\n", "\n", "# Remove the clone of the repository if it already exists from a previous run\n", "if os.path.exists(git_repo_dir):\n", " shutil.rmtree(git_repo_dir)\n", "\n", "# Remove resulting sqlite database if it already exists from a previous run\n", "if os.path.exists(sqlite_db_file):\n", " os.remove(sqlite_db_file)\n", " \n", "git2net.mine_github(github_url, git_repo_dir, sqlite_db_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From here, you can immediately start analysing the repository, e.g., by looking at network projections or performing time series analysis.\n", "We will provide more details on how to do so in the subsequent tutorials." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Private repositories\n", "\n", "Private repositories require some more effort. Firstly, you have to generate a personal token. The procedure on the GitHub side is explained [here](https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token). Make sure to copy your new access token to a file (`secret.txt` for instance). You won't be able to see it again! Please add `secret.txt` directly to your `.gitignore` file! You wouldn't believe how many access tokens are freely available at GitHub :-)\n", "\n", "Now, we can pass the token as a third parameter embedded in a callback method to `clone_repository()`.\n", "\n", "*The code below is commented to allow for the execution of the entire notebook. Uncomment it to use your own private repository for the tutorial.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2022-02-18T11:21:02.279144Z", "start_time": "2022-02-18T11:21:02.276882Z" } }, "outputs": [], "source": [ "# import pygit2 as git2\n", "# import os\n", "# import shutil\n", "\n", "# git_repo_url = 'https://github.com/user/SecretRepository.git' # does not exist :-)\n", "# git_repo_dir = 'secretRepository'\n", "\n", "# f = open(\"secret.txt\", \"r\")\n", "# token = f.read()\n", "\n", "# if os.path.exists(git_repo_dir):\n", "# shutil.rmtree(git_repo_dir)\n", "\n", "# callbacks = git2.RemoteCallbacks(git2.UserPass(token, 'x-oauth-basic'))\n", "# repo = git2.clone_repository(git_repo_url, git_repo_dir, callbacks=callbacks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3 - Analysing different or multiple branches of a repository\n", "\n", "So far, we have focused on looking at the default branch (typically the `main` or `master` branch) of a repository.\n", "For most analyses, this is sufficient as this is (usually) the branch to which other branches are merged once their content is sufficiently developed.\n", "That said, there are repositories where this does not occur.\n", "Further, you might be interested in analysing the development in a specific branch or all branches of a repository.\n", "\n", "### Tracking multiple branches of a repository\n", "\n", "Again we first look at the general approach to achieve this.\n", "Here, we need to first clone the repository as before.\n", "Then, we need to track all branches in which we are interested.\n", "Below, we show you how you can do this using `GitPython`, another library next to `pygit2` that we can use to interact with git repositories." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2022-02-18T11:21:02.970964Z", "start_time": "2022-02-18T11:21:02.280661Z" } }, "outputs": [], "source": [ "import git\n", "import os\n", "import shutil\n", "\n", "# Step 1: Clone repository as before\n", "\n", "git_repo_url = 'https://github.com/gotec/git2net.git'\n", "git_repo_dir = 'git2net4analysis'\n", "\n", "if os.path.exists(git_repo_dir):\n", " shutil.rmtree(git_repo_dir)\n", "\n", "repo = git.Repo.clone_from(git_repo_url, git_repo_dir)\n", "\n", "\n", "# Step 2: Track remote branches\n", "\n", "existing_branches = [b.name for b in repo.branches]\n", "for ref in repo.remote().refs:\n", " branch_name = ref.name.split('/')[-1]\n", " if branch_name != 'HEAD' and branch_name not in existing_branches:\n", " repo.git.branch('--track', branch_name, 'remotes/origin/' + branch_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Subsequently, in step 3, we can start mining the repository using the option `all_branches` set to `True`.\n", "As stated earlier, we will provide more details on the options you can use with `git2net` in the subsequent tutorials." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2022-02-18T11:21:13.731627Z", "start_time": "2022-02-18T11:21:02.974209Z" } }, "outputs": [], "source": [ "import git2net\n", "import sqlite3\n", "import pandas as pd\n", "from collections import Counter\n", "\n", "# Step 3: Crawl the local repository\n", "\n", "git_repo_dir = 'git2net4analysis'\n", "sqlite_db_file = 'git2net4analysis.db'\n", "\n", "if os.path.exists(sqlite_db_file):\n", " os.remove(sqlite_db_file) \n", "git2net.mine_git_repo(git_repo_dir, sqlite_db_file, all_branches=True)\n", "\n", "\n", "# Step 4: Check the covered branches\n", "\n", "with sqlite3.connect(sqlite_db_file) as con:\n", " branches = pd.read_sql_query(\"SELECT branches FROM commits\", con).branches\n", "\n", "print(Counter([b for b_list in branches for b in b_list.split(',')]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we check the resulting database, we can see that we have processed the commits from multiple branches." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analysing a specific branch from public GitHub repository\n", "\n", "When analysing a repository using `git2net`'s `mine_github()` function introduced above, you can also specify the branch you want to analyse.\n", "We show an example for this below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2022-02-18T11:21:25.040515Z", "start_time": "2022-02-18T11:21:13.733032Z" } }, "outputs": [], "source": [ "import git2net\n", "import os\n", "import shutil\n", "\n", "github_url = 'gotec/git2net'\n", "git_repo_dir = 'git2net4analysis'\n", "sqlite_db_file = 'git2net4analysis.db'\n", "branch = 'object-oriented'\n", "\n", "# Remove the clone of the repository if it already exists from a previous run\n", "if os.path.exists(git_repo_dir):\n", " shutil.rmtree(git_repo_dir)\n", "\n", "# Remove resulting sqlite database if it already exists from a previous run\n", "if os.path.exists(sqlite_db_file):\n", " os.remove(sqlite_db_file)\n", "\n", "# Mine a specific branch using the branch option\n", "git2net.mine_github(github_url, git_repo_dir, sqlite_db_file, branch=branch)\n", "\n", "# Check the resulting database\n", "with sqlite3.connect(sqlite_db_file) as con:\n", " branches = pd.read_sql_query(\"SELECT branches FROM commits\", con).branches\n", "print(Counter([b for b_list in branches for b in b_list.split(',')]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the resulting repository, we can see that only commits from the selected branch are in the database.\n", "\n", "We can extend the database with commits from another branch by rerunning the command with another active branch.\n", "Note, however, that to do so, you first need to remove the already existing clone of the repository as `git2net` will not overwrite your existing data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2022-02-18T11:21:25.957136Z", "start_time": "2022-02-18T11:21:25.042176Z" } }, "outputs": [], "source": [ "branch = 'main'\n", "\n", "# Remove the clone of the repository if it already exists from a previous run\n", "if os.path.exists(git_repo_dir):\n", " shutil.rmtree(git_repo_dir)\n", "\n", "git2net.mine_github(github_url, git_repo_dir, sqlite_db_file, branch=branch)\n", "\n", "with sqlite3.connect(sqlite_db_file) as con:\n", " branches = pd.read_sql_query(\"SELECT branches FROM commits\", con).branches\n", "print(Counter([b for b_list in branches for b in b_list.split(',')]))" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2022-02-14T09:51:22.549902Z", "start_time": "2022-02-14T09:51:22.543877Z" } }, "source": [ "As you can see, if a commit from the new branch is already present in the database, `git2net` will recognise this and not mine the commit again.\n", "Instead, only the information on which branches these commits appear in is updated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this, we conclude the first tutorial in which we showed you how to clone a repository.\n", "With some examples, we even started looking at how you can start mining the cloned repositories using `git2net`.\n", "In these examples, we have used `git2net`'s default settings.\n", "However, depending on your research or application, you might be interested in extracting additional data, such as the content of modified lines or information on the cyclomatic complexity of files.\n", "As we will cover in the following tutorial, `git2net` comes with various options that allow you to achieve this." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 2 }