---
layout: docu
redirect_from:
- /docs/guides/snippets/analyze_git_repository
title: Analyzing a Git Repository
---

You can use DuckDB to analyze Git logs using the output of the [`git log` command](https://git-scm.com/docs/git-log).

## Exporting the Git Log

We start by picking a character that doesn't occur in any part of the commit log (author names, messages, etc).
Since version v1.2.0, DuckDB's CSV reader supports [4-byte delimiters]({% post_url 2025-02-05-announcing-duckdb-120 %}#csv-features), making it possible to use emojis! πŸŽ‰

Despite being featured in the [Emoji Movie](https://www.imdb.com/title/tt4877122/) (IMDb rating: 3.4),
we can assume that the [Fish Cake with Swirl emoji (πŸ₯)](https://emojipedia.org/fish-cake-with-swirl) is not a common occurrence in most Git logs.
So, let's clone the [`duckdb/duckdb` repository](https://github.com/duckdb/duckdb) and export its log as follows:

```bash
git log --date=iso-strict --pretty=format:%adπŸ₯%hπŸ₯%anπŸ₯%s > git-log.csv
```

The resulting file looks like this:

```text
2025-02-25T18:12:54+01:00πŸ₯d608a31e13πŸ₯MarkπŸ₯MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation (#16400)
2025-02-25T15:05:56+01:00πŸ₯920b39ad96πŸ₯MarkπŸ₯Read support for Parquet Float16 (#16395)
2025-02-25T13:43:52+01:00πŸ₯61f55734b9πŸ₯Carlo PiovesanπŸ₯MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation
2025-02-25T12:35:28+01:00πŸ₯87eff7ebd3πŸ₯MarkπŸ₯Fix issue #16377 (#16391)
2025-02-25T10:33:49+01:00πŸ₯35af26476eπŸ₯Hannes MΓΌhleisenπŸ₯Read support for Parquet Float16
```

## Loading the Git Log into DuckDB

Start DuckDB and read the log as a <s>CSV</s> πŸ₯SV:

```sql
CREATE TABLE commits AS 
    FROM read_csv(
            'git-log.csv',
            delim = 'πŸ₯',
            header = false,
            column_names = ['timestamp', 'hash', 'author', 'message']
        );
```

This will result in a nice DuckDB table:

```sql
FROM commits
LIMIT 5;
```

```text
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚      timestamp      β”‚    hash    β”‚      author      β”‚                                    message                                    β”‚
β”‚      timestamp      β”‚  varchar   β”‚     varchar      β”‚                                    varchar                                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2025-02-25 17:12:54 β”‚ d608a31e13 β”‚ Mark             β”‚ MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation (#16400) β”‚
β”‚ 2025-02-25 14:05:56 β”‚ 920b39ad96 β”‚ Mark             β”‚ Read support for Parquet Float16 (#16395)                                     β”‚
β”‚ 2025-02-25 12:43:52 β”‚ 61f55734b9 β”‚ Carlo Piovesan   β”‚ MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation          β”‚
β”‚ 2025-02-25 11:35:28 β”‚ 87eff7ebd3 β”‚ Mark             β”‚ Fix issue #16377 (#16391)                                                     β”‚
β”‚ 2025-02-25 09:33:49 β”‚ 35af26476e β”‚ Hannes MΓΌhleisen β”‚ Read support for Parquet Float16                                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

## Analyzing the Log

We can analyze the table as any other in DuckDB.

### Common Topics

Let's start with a simple question: which topic was the most commonly mentioned in the commit messages: CI, CLI, or Python?

```sql
SELECT
    message.lower().regexp_extract('\b(ci|cli|python)\b') AS topic,
    count(*) AS num_commits
FROM commits
WHERE topic <> ''
GROUP BY ALL
ORDER BY num_commits DESC;
```

```text
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  topic  β”‚ num_commits β”‚
β”‚ varchar β”‚    int64    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ ci      β”‚         828 β”‚
β”‚ python  β”‚         666 β”‚
β”‚ cli     β”‚          49 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

Out of these three topics, commits related to continuous integration dominate the log!

We can also do a more exploratory analysis by looking at all words in the commit messages.
To do so, we first tokenize the messages:

```sql
CREATE TABLE words AS
    SELECT unnest(
        message
            .lower()
            .regexp_replace('\W', ' ')
            .trim(' ')
            .string_split_regex('\W')
        ) AS word    
FROM commits;
```

Then, we remove stopwords using a pre-defined list:

```sql
CREATE TABLE stopwords AS
    SELECT unnest(['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'did', 'do', 'does', 'doing', 'don', 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'itself', 'just', 'me', 'more', 'most', 'my', 'myself', 'no', 'nor', 'not', 'now', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 's', 'same', 'she', 'should', 'so', 'some', 'such', 't', 'than', 'that', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very', 'was', 'we', 'were', 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'you', 'your', 'yours', 'yourself', 'yourselves']) AS word;

CREATE OR REPLACE TABLE words AS
    FROM words
    NATURAL ANTI JOIN stopwords
    WHERE word != '';
```

> We use the `NATURAL ANTI JOIN` clause here, which allows us to elegantly filter out values that occur in the `stopwords` table.

Finally, we select the top-20 most common words.

```sql
SELECT word, count(*) AS count FROM words
GROUP BY ALL
ORDER BY count DESC
LIMIT 20;
```

```text
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚    w     β”‚ count β”‚
β”‚ varchar  β”‚ int64 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€
β”‚ merge    β”‚ 12550 β”‚
β”‚ fix      β”‚  6402 β”‚
β”‚ branch   β”‚  6005 β”‚
β”‚ pull     β”‚  5950 β”‚
β”‚ request  β”‚  5945 β”‚
β”‚ add      β”‚  5687 β”‚
β”‚ test     β”‚  3801 β”‚
β”‚ master   β”‚  3289 β”‚
β”‚ tests    β”‚  2339 β”‚
β”‚ issue    β”‚  1971 β”‚
β”‚ main     β”‚  1935 β”‚
β”‚ remove   β”‚  1884 β”‚
β”‚ format   β”‚  1819 β”‚
β”‚ duckdb   β”‚  1710 β”‚
β”‚ use      β”‚  1442 β”‚
β”‚ mytherin β”‚  1410 β”‚
β”‚ fixes    β”‚  1333 β”‚
β”‚ hawkfish β”‚  1147 β”‚
β”‚ feature  β”‚  1139 β”‚
β”‚ function β”‚  1088 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€
β”‚     20 rows      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

As expected, there are many Git terms (`merge`, `branch`, `pull`, etc.), followed by terminology related to development (`fix`, `test`/`tests`, `issue`, `format`).
We also see the account names of some developers ([`mytherin`](https://github.com/Mytherin), [`hawkfish`](https://github.com/hawkfish)), which are likely there due to commit message for merging pull requests (e.g., [”Merge pull request #13776 from Mytherin/expressiondepth”](https://github.com/duckdb/duckdb/commit/4d18b9d05caf88f0420dbdbe03d35a0faabf4aa7)).
Finally, we also see some DuckDB-related terms such as `duckdb` (shocking!) and `function`.

### Visualizing the Number of Commits

Let's visualize the number of commits each year:

```sql
SELECT
    year(timestamp) AS year,
    count(*) AS num_commits,
    num_commits.bar(0, 20_000) AS num_commits_viz
FROM commits
GROUP BY ALL
ORDER BY ALL;
```

```text
β”Œβ”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ year  β”‚ num_commits β”‚                                 num_commits_viz                                  β”‚
β”‚ int64 β”‚    int64    β”‚                                     varchar                                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  2018 β”‚         870 β”‚ β–ˆβ–ˆβ–ˆβ–                                                                             β”‚
β”‚  2019 β”‚        1621 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                          β”‚
β”‚  2020 β”‚        3484 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰                                                                   β”‚
β”‚  2021 β”‚        6488 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰                                                       β”‚
β”‚  2022 β”‚        9817 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž                                         β”‚
β”‚  2023 β”‚       14585 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž                      β”‚
β”‚  2024 β”‚       15949 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š                 β”‚
β”‚  2025 β”‚        1788 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

We see a steady growth over the years –
especially considering that many of DuckDB's functionalities and clients, which were originally part of the main repository, are now maintained in separate repositories
(e.g., [Java](https://github.com/duckdb/duckdb-java), [R](https://github.com/duckdb/duckdb-r)).

Happy hacking!