--- layout: docu redirect_from: - /docs/guides/snippets/analyze_git_repository title: Analyzing a Git Repository --- You can use DuckDB to analyze Git logs using the output of the [`git log` command](https://git-scm.com/docs/git-log). ## Exporting the Git Log We start by picking a character that doesn't occur in any part of the commit log (author names, messages, etc). Since version v1.2.0, DuckDB's CSV reader supports [4-byte delimiters]({% post_url 2025-02-05-announcing-duckdb-120 %}#csv-features), making it possible to use emojis! π Despite being featured in the [Emoji Movie](https://www.imdb.com/title/tt4877122/) (IMDb rating: 3.4), we can assume that the [Fish Cake with Swirl emoji (π₯)](https://emojipedia.org/fish-cake-with-swirl) is not a common occurrence in most Git logs. So, let's clone the [`duckdb/duckdb` repository](https://github.com/duckdb/duckdb) and export its log as follows: ```bash git log --date=iso-strict --pretty=format:%adπ₯%hπ₯%anπ₯%s > git-log.csv ``` The resulting file looks like this: ```text 2025-02-25T18:12:54+01:00π₯d608a31e13π₯Markπ₯MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation (#16400) 2025-02-25T15:05:56+01:00π₯920b39ad96π₯Markπ₯Read support for Parquet Float16 (#16395) 2025-02-25T13:43:52+01:00π₯61f55734b9π₯Carlo Piovesanπ₯MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation 2025-02-25T12:35:28+01:00π₯87eff7ebd3π₯Markπ₯Fix issue #16377 (#16391) 2025-02-25T10:33:49+01:00π₯35af26476eπ₯Hannes MΓΌhleisenπ₯Read support for Parquet Float16 ``` ## Loading the Git Log into DuckDB Start DuckDB and read the log as a <s>CSV</s> π₯SV: ```sql CREATE TABLE commits AS FROM read_csv( 'git-log.csv', delim = 'π₯', header = false, column_names = ['timestamp', 'hash', 'author', 'message'] ); ``` This will result in a nice DuckDB table: ```sql FROM commits LIMIT 5; ``` ```text βββββββββββββββββββββββ¬βββββββββββββ¬βββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β timestamp β hash β author β message β β timestamp β varchar β varchar β varchar β βββββββββββββββββββββββΌβββββββββββββΌβββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β 2025-02-25 17:12:54 β d608a31e13 β Mark β MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation (#16400) β β 2025-02-25 14:05:56 β 920b39ad96 β Mark β Read support for Parquet Float16 (#16395) β β 2025-02-25 12:43:52 β 61f55734b9 β Carlo Piovesan β MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation β β 2025-02-25 11:35:28 β 87eff7ebd3 β Mark β Fix issue #16377 (#16391) β β 2025-02-25 09:33:49 β 35af26476e β Hannes MΓΌhleisen β Read support for Parquet Float16 β βββββββββββββββββββββββ΄βββββββββββββ΄βββββββββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ``` ## Analyzing the Log We can analyze the table as any other in DuckDB. ### Common Topics Let's start with a simple question: which topic was the most commonly mentioned in the commit messages: CI, CLI, or Python? ```sql SELECT message.lower().regexp_extract('\b(ci|cli|python)\b') AS topic, count(*) AS num_commits FROM commits WHERE topic <> '' GROUP BY ALL ORDER BY num_commits DESC; ``` ```text βββββββββββ¬ββββββββββββββ β topic β num_commits β β varchar β int64 β βββββββββββΌββββββββββββββ€ β ci β 828 β β python β 666 β β cli β 49 β βββββββββββ΄ββββββββββββββ ``` Out of these three topics, commits related to continuous integration dominate the log! We can also do a more exploratory analysis by looking at all words in the commit messages. To do so, we first tokenize the messages: ```sql CREATE TABLE words AS SELECT unnest( message .lower() .regexp_replace('\W', ' ') .trim(' ') .string_split_regex('\W') ) AS word FROM commits; ``` Then, we remove stopwords using a pre-defined list: ```sql CREATE TABLE stopwords AS SELECT unnest(['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'did', 'do', 'does', 'doing', 'don', 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'itself', 'just', 'me', 'more', 'most', 'my', 'myself', 'no', 'nor', 'not', 'now', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 's', 'same', 'she', 'should', 'so', 'some', 'such', 't', 'than', 'that', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very', 'was', 'we', 'were', 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'you', 'your', 'yours', 'yourself', 'yourselves']) AS word; CREATE OR REPLACE TABLE words AS FROM words NATURAL ANTI JOIN stopwords WHERE word != ''; ``` > We use the `NATURAL ANTI JOIN` clause here, which allows us to elegantly filter out values that occur in the `stopwords` table. Finally, we select the top-20 most common words. ```sql SELECT word, count(*) AS count FROM words GROUP BY ALL ORDER BY count DESC LIMIT 20; ``` ```text ββββββββββββ¬ββββββββ β w β count β β varchar β int64 β ββββββββββββΌββββββββ€ β merge β 12550 β β fix β 6402 β β branch β 6005 β β pull β 5950 β β request β 5945 β β add β 5687 β β test β 3801 β β master β 3289 β β tests β 2339 β β issue β 1971 β β main β 1935 β β remove β 1884 β β format β 1819 β β duckdb β 1710 β β use β 1442 β β mytherin β 1410 β β fixes β 1333 β β hawkfish β 1147 β β feature β 1139 β β function β 1088 β ββββββββββββ΄ββββββββ€ β 20 rows β ββββββββββββββββββββ ``` As expected, there are many Git terms (`merge`, `branch`, `pull`, etc.), followed by terminology related to development (`fix`, `test`/`tests`, `issue`, `format`). We also see the account names of some developers ([`mytherin`](https://github.com/Mytherin), [`hawkfish`](https://github.com/hawkfish)), which are likely there due to commit message for merging pull requests (e.g., [βMerge pull request #13776 from Mytherin/expressiondepthβ](https://github.com/duckdb/duckdb/commit/4d18b9d05caf88f0420dbdbe03d35a0faabf4aa7)). Finally, we also see some DuckDB-related terms such as `duckdb` (shocking!) and `function`. ### Visualizing the Number of Commits Let's visualize the number of commits each year: ```sql SELECT year(timestamp) AS year, count(*) AS num_commits, num_commits.bar(0, 20_000) AS num_commits_viz FROM commits GROUP BY ALL ORDER BY ALL; ``` ```text βββββββββ¬ββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β year β num_commits β num_commits_viz β β int64 β int64 β varchar β βββββββββΌββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β 2018 β 870 β ββββ β β 2019 β 1621 β βββββββ β β 2020 β 3484 β ββββββββββββββ β β 2021 β 6488 β ββββββββββββββββββββββββββ β β 2022 β 9817 β ββββββββββββββββββββββββββββββββββββββββ β β 2023 β 14585 β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β 2024 β 15949 β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β 2025 β 1788 β ββββββββ β βββββββββ΄ββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ``` We see a steady growth over the years β especially considering that many of DuckDB's functionalities and clients, which were originally part of the main repository, are now maintained in separate repositories (e.g., [Java](https://github.com/duckdb/duckdb-java), [R](https://github.com/duckdb/duckdb-r)). Happy hacking!