---
name: codebase-comprehension-algorithms
description: Mapping an unfamiliar codebase into feature/business domains — answering "what is this about", "which files implement feature X", "where is the architectural spine", or reviewing a refactor that crosses module boundaries. 47 algorithms across 9 categories — graph construction (omnipresent filter, multilayer, SCC), lexical preprocessing (Samurai, TF-IDF), community detection (Leiden, Infomap, SBM, MCL, Walktrap, spectral, HDBSCAN), architecture recovery (Bunch+MQ, ACDC, Limbo, Reflexion, DSM), topic modelling (LDA, LSI, NMF, HDP), evolutionary coupling (Gall, ROSE), information-theoretic (NCD, MI, MDL, naturalness), centrality (PageRank, HITS, betweenness, TextRank), validation (MoJoFM, ARI/NMI, resolution limit, consensus, co-change prediction, ablation). Trigger without explicit "clustering" mention — codebase grokking, dependency mapping, domain extraction, architecture-recovery validation all apply.
---
# Community Codebase Comprehension And Domain Mapping Algorithms Best Practices

A practitioner-oriented reference of the **algorithms that work for mapping a codebase into understandable feature/business domains**. Most of these techniques live in the Software Architecture Recovery and Mining Software Repositories literatures and are invisible to working engineers — yet they're the right tools for the job a coding agent is asked to do every day: *"what does this codebase do, and where?"*

The 47 rules are organized by **execution-lifecycle impact**: a wrong decision early in the pipeline (which graph to build, which identifiers to keep) propagates through everything downstream. The three CRITICAL categories (`graph-`, `clust-`, `valid-`) are the ones a wrong call cannot be recovered from later. Read them first.

**Scope**: proven algorithms with peer-reviewed citations or canonical books — Newman *Networks*, Leskovec-Rajaraman-Ullman *Mining of Massive Datasets*, Ganter-Wille *Formal Concept Analysis*, plus 40+ ICSE / FSE / TSE / PNAS / JMLR papers. No tutorial sites, no Stack Overflow, no marketing posts. Deliberately deferred to a future version: GNN/CodeBERT/code2vec (not "proven over decades" yet) and refactoring-recipe stuff (covered by sibling skills like `react-refactor` and `typescript-refactor`).

## When to Apply

Use these rules when:

- Onboarding an agent into an unfamiliar codebase: "explain what this codebase does, by domain"
- Producing an architecture map: "what are the main subsystems and how do they connect?"
- Locating a feature: "which files implement payments / authentication / search?"
- Reviewing a refactor: "did this change respect the architectural boundaries?"
- Detecting architectural debt: "what files have surprising coupling?"
- Validating an existing decomposition: "does the README's architecture match the code?"
- Picking algorithms for any of the above — the user wants something that's *proven*, not vibes

## Rule Categories By Priority

| # | Category | Prefix | Impact | What it does |
|---|----------|--------|--------|--------------|
| 1 | Graph Construction & Edge Weighting | `graph-` | CRITICAL | Which graph to build; omnipresent filter; cycle handling; multilayer |
| 2 | Community Detection & Clustering | `clust-` | CRITICAL | Leiden, Infomap, SBM, MCL, Walktrap, spectral, HDBSCAN |
| 3 | Validation & Quality Metrics | `valid-` | CRITICAL | MoJoFM, ARI/NMI, resolution limit, consensus, co-change prediction, ablation |
| 4 | Identifier & Lexical Preprocessing | `lex-` | HIGH | Samurai splitting, abbreviation expansion, TF-IDF/BM25, stemming, V-O parsing |
| 5 | Software-Specific Architecture Recovery | `arch-` | HIGH | Bunch + MQ, ACDC, Limbo, Reflexion, DSM |
| 6 | Topic Modelling on Source Code | `topic-` | HIGH | LDA, LSI/SVD, NMF, HDP, coherence-based K selection |
| 7 | Evolutionary Coupling & Co-Change Mining | `evol-` | HIGH | Lift / confidence / support, large-commit filter, temporal decay, logical coupling |
| 8 | Information-Theoretic Methods | `info-` | MEDIUM-HIGH | Normalized Compression Distance, Mutual Information, MDL, code naturalness |
| 9 | Centrality, Hierarchy & Labelling | `rank-` | MEDIUM | PageRank, HITS, betweenness, TextRank/YAKE labels |

## Quick Reference

### 1. Graph Construction & Edge Weighting (CRITICAL)

- [`graph-filter-omnipresent-utilities-before-clustering`](references/graph-filter-omnipresent-utilities-before-clustering.md) — Drop the loggers and base classes BEFORE clustering (20-40 MoJoFM points)
- [`graph-pick-edge-type-by-question-asked`](references/graph-pick-edge-type-by-question-asked.md) — Call, import, co-change, bipartite — the question determines the graph
- [`graph-collapse-sccs-before-clustering`](references/graph-collapse-sccs-before-clustering.md) — Tarjan SCC condensation makes cycles explicit and stabilises every algorithm
- [`graph-weight-edges-by-information-content`](references/graph-weight-edges-by-information-content.md) — IDF / PMI / Jaccard on edges suppresses noise (2-5× MoJoFM)
- [`graph-bipartite-file-term-for-joint-structure`](references/graph-bipartite-file-term-for-joint-structure.md) — When DI / dynamic dispatch hides the call graph
- [`graph-combine-signals-in-multilayer-graphs`](references/graph-combine-signals-in-multilayer-graphs.md) — Mucha 2010 multilayer modularity over normalised α-weighted layers

### 2. Community Detection & Clustering (CRITICAL)

- [`clust-leiden-not-louvain`](references/clust-leiden-not-louvain.md) — Louvain produces disconnected communities on 5-25% of nodes (Traag 2019)
- [`clust-infomap-mdl-on-random-walks`](references/clust-infomap-mdl-on-random-walks.md) — MDL on random walks; the right tool for flow-meaningful graphs
- [`clust-stochastic-block-model`](references/clust-stochastic-block-model.md) — Bayesian, hierarchical, learns K from data; handles non-assortative structure
- [`clust-mcl-markov-clustering`](references/clust-mcl-markov-clustering.md) — Flow simulation; dominant in bioinformatics; robust to noise
- [`clust-walktrap-short-random-walks`](references/clust-walktrap-short-random-walks.md) — Random-walk distance + hierarchical agglomerative
- [`clust-spectral-laplacian-fiedler`](references/clust-spectral-laplacian-fiedler.md) — Optimal k-way normalised cut via Laplacian eigenvectors
- [`clust-hdbscan-density-based`](references/clust-hdbscan-density-based.md) — When clustering on file embeddings, not graphs

### 3. Validation & Quality Metrics (CRITICAL)

- [`valid-mojofm-as-software-clustering-distance`](references/valid-mojofm-as-software-clustering-distance.md) — The SAR gold-standard distance metric (Wen-Tzerpos 2004)
- [`valid-adjusted-rand-index-and-nmi`](references/valid-adjusted-rand-index-and-nmi.md) — Chance-corrected cross-algorithm comparison
- [`valid-be-aware-of-resolution-limit`](references/valid-be-aware-of-resolution-limit.md) — Modularity can't see clusters smaller than √(2m) (Fortunato-Barthélemy PNAS 2007)
- [`valid-consensus-clustering-for-stability`](references/valid-consensus-clustering-for-stability.md) — A single-run answer is unreliable; consensus across runs is the right answer
- [`valid-cochange-prediction-as-ground-truth-proxy`](references/valid-cochange-prediction-as-ground-truth-proxy.md) — Temporal held-out co-change replaces missing ground truth
- [`valid-ablate-each-input-signal`](references/valid-ablate-each-input-signal.md) — Leave-one-out; reveals which input actually drives the result

### 4. Identifier & Lexical Preprocessing (HIGH)

- [`lex-split-identifiers-with-samurai`](references/lex-split-identifiers-with-samurai.md) — 87% precision vs 60% for naive camelCase (Enslen MSR 2009)
- [`lex-build-programming-language-stop-words`](references/lex-build-programming-language-stop-words.md) — Three-layer: keywords + generic + IDF-driven
- [`lex-expand-abbreviations-with-context`](references/lex-expand-abbreviations-with-context.md) — usr → user, ctx → context (Lawrie GenTest 2011)
- [`lex-tf-idf-and-bm25-on-identifiers`](references/lex-tf-idf-and-bm25-on-identifiers.md) — Raw counts are dominated by common terms; TF-IDF / BM25 fix it
- [`lex-stem-versus-subword-tokenization`](references/lex-stem-versus-subword-tokenization.md) — Porter stemmer for clustering, BPE for embeddings
- [`lex-extract-verb-object-pattern-from-method-names`](references/lex-extract-verb-object-pattern-from-method-names.md) — getUserById → (verb=get, object=user); compound concept signal

### 5. Software-Specific Architecture Recovery (HIGH)

- [`arch-bunch-with-mq-fitness`](references/arch-bunch-with-mq-fitness.md) — MQ fitness function + search; better than Q-maximization on code
- [`arch-acdc-subgraph-patterns`](references/arch-acdc-subgraph-patterns.md) — Subsystem and skeleton patterns; matches architect intuition
- [`arch-limbo-information-bottleneck`](references/arch-limbo-information-bottleneck.md) — Tishby's IB applied to software (Andritsos-Tzerpos 2005)
- [`arch-reflexion-model`](references/arch-reflexion-model.md) — Compare hypothesized vs actual; the underused gem from Murphy-Notkin 1995
- [`arch-dsm-partitioning`](references/arch-dsm-partitioning.md) — Design Structure Matrix; 60-year-old engineering technique

### 6. Topic Modelling on Source Code (HIGH)

- [`topic-lda-on-source-code`](references/topic-lda-on-source-code.md) — Probabilistic per-file topic distributions over identifier+comment text
- [`topic-lsi-svd-on-term-document`](references/topic-lsi-svd-on-term-document.md) — Deterministic SVD-based semantic embeddings (Maletic-Marcus 2001)
- [`topic-nmf-non-negative-factorization`](references/topic-nmf-non-negative-factorization.md) — Parts-based additive topics, fully reproducible
- [`topic-hdp-for-nonparametric-topic-count`](references/topic-hdp-for-nonparametric-topic-count.md) — Hierarchical Dirichlet Process — learns K from data
- [`topic-pick-topic-count-by-coherence-not-perplexity`](references/topic-pick-topic-count-by-coherence-not-perplexity.md) — Perplexity is anti-correlated with human topic quality

### 7. Evolutionary Coupling & Co-Change Mining (HIGH)

- [`evol-mine-cochange-with-lift-and-confidence`](references/evol-mine-cochange-with-lift-and-confidence.md) — Lift > 2 is the cutoff; raw co-change count is noise
- [`evol-filter-large-commits`](references/evol-filter-large-commits.md) — A 200-file commit produces 20K spurious pair-counts; filter aggressively
- [`evol-temporal-decay-on-edge-weights`](references/evol-temporal-decay-on-edge-weights.md) — Exponential decay with 6-month half-life
- [`evol-logical-coupling-as-architectural-signal`](references/evol-logical-coupling-as-architectural-signal.md) — 30-50% of strongest coupling is invisible to static analysis (Gall 1998)

### 8. Information-Theoretic Methods (MEDIUM-HIGH)

- [`info-normalized-compression-distance`](references/info-normalized-compression-distance.md) — Cluster without feature engineering; gzip-based universal similarity
- [`info-mutual-information-as-coupling`](references/info-mutual-information-as-coupling.md) — Catches non-linear / conditional coupling that lift misses
- [`info-mdl-for-model-selection`](references/info-mdl-for-model-selection.md) — Principled K selection; Occam's razor as a code length
- [`info-naturalness-of-code-as-quality-signal`](references/info-naturalness-of-code-as-quality-signal.md) — Hindle 2012 — code is 30-50% more predictable than English; bugs spike entropy

### 9. Centrality, Hierarchy & Labelling (MEDIUM)

- [`rank-pagerank-for-module-importance`](references/rank-pagerank-for-module-importance.md) — Architectural spine via PageRank on the reversed dependency graph
- [`rank-hits-hubs-and-authorities`](references/rank-hits-hubs-and-authorities.md) — Orchestrators vs implementations (Kleinberg 1999)
- [`rank-betweenness-centrality-for-bottlenecks`](references/rank-betweenness-centrality-for-bottlenecks.md) — Bridges between domains; god-class detection
- [`rank-textrank-for-cluster-labels`](references/rank-textrank-for-cluster-labels.md) — Multi-word keyphrases as cluster labels (Mihalcea-Tarau 2004, YAKE 2020)

## How to Use

Start with the question the agent is trying to answer:

- **"What are the main domains in this codebase?"** → `graph-` (pick a graph) → `clust-` (Leiden / Infomap / SBM) → `topic-` (label them) → `valid-` (sanity-check stability and ablate)
- **"Which files implement feature X?"** → `topic-lda-on-source-code` for theme location; `rank-pagerank-for-module-importance` with X's files as seed for personalized PageRank
- **"Where is the architectural spine?"** → `rank-pagerank-for-module-importance` + `rank-hits-hubs-and-authorities` on the dependency graph
- **"Does the README's architecture match the code?"** → `arch-reflexion-model` is purpose-built for this
- **"What's the *real* coupling here (beyond static dependencies)?"** → `evol-logical-coupling-as-architectural-signal` and `evol-mine-cochange-with-lift-and-confidence`
- **"How do I cluster without designing features?"** → `info-normalized-compression-distance`
- **"How big are the clusters supposed to be?"** → `valid-be-aware-of-resolution-limit` and `topic-hdp-for-nonparametric-topic-count`
- **"How do I know my decomposition is right?"** → the entire `valid-` category; multi-proxy evaluation is mandatory

The skill's worldview: **build the right graph first** (and filter omnipresent files), **pick an algorithm matching the graph and the question**, **use a code-specific preprocessing pipeline** (Samurai + stop-words + stemming + TF-IDF) where lexical signals matter, and **always validate** — MoJoFM if you have expert ground truth, consensus + co-change prediction + ablation if you don't.

Code examples are in Python because the reference implementations (networkx, igraph, leidenalg, scikit-learn, gensim, graph-tool, hdbscan) all live there. The reasoning generalises to any language.

## Reference Files

| File | Description |
|------|-------------|
| [references/_sections.md](references/_sections.md) | Category definitions and ordering |
| [assets/templates/_template.md](assets/templates/_template.md) | Template for new rules |
| [metadata.json](metadata.json) | Version and reference information |
| [AGENTS.md](AGENTS.md) | Auto-built TOC navigation |

## Related Skills

- `computer-science-algorithms` — Algorithm-and-data-structure reference (this skill cross-references it for MinHash/LSH, Aho-Corasick, etc.)
- `complexity-optimizer` — Static analysis for hot paths the rules here identify
- `design-to-react-algorithms` — Companion skill for design-to-code structural recovery