--- name: codebase-reading description: Systematic methodology for reading and understanding large codebases efficiently. Use when (1) Understanding a new or unfamiliar codebase quickly, (2) Preparing to modify or extend existing code safely, (3) Debugging complex issues requiring deep code understanding, (4) Onboarding new team members to a codebase, (5) Performing code audits or security reviews, (6) Refactoring legacy code with confidence, (7) Creating documentation for existing systems, (8) Tracing execution flows and data transformations metadata: short-description: Efficiently read and understand large codebases --- # Large Codebase Reading Methodology A systematic approach to understanding large codebases efficiently: **read to modify safely, not to memorize everything**. ## Core Principle **Goal-oriented reading**: Always start with a concrete objective (fix bug, add feature, debug issue, security audit). Without a goal, you'll get lost in details. **Success criteria**: You can explain the execution flow and modify code confidently, not that you've read every file. ## Documentation Structure When documenting your understanding, use this structure: ``` code-reading/ ├── README.md # Index and progress tracking ├── code-reading.md # Main framework (methodology + findings + terminology) ├── architecture.md # C4 architecture map (Level 1-4) ├── api-flow.md # Execution flow tracing └── key-modules.md # Detailed module analysis ``` **Progressive disclosure:** - README: Overview and navigation - Main doc: Methodology and high-level findings - Specialized docs: Detailed analysis (loaded only when needed) **Terminology section:** - Include a dedicated "Terminology" or "Key Concepts and Terminology" section in `code-reading.md` (e.g., section 9) - Build incrementally from day 1, don't wait until everything is understood - Cross-reference terms throughout documentation (architecture, flows, modules) - Maintain as a living document - update as understanding deepens ## Quick Start ### Step 1: Define Your Goal Start with a concrete, one-sentence objective: - ✅ Good: "Trace HTTP request from route handler to database and back" - ✅ Good: "Understand how authentication tokens are validated and refreshed" - ❌ Bad: "Understand the entire codebase" ### Step 2: Run the Project First **First hour priority:** 1. Read README - project purpose, setup, minimal example 2. Read CONTRIBUTING/development guide - testing, structure, workflow 3. Run a minimal path - local build, one test, or demo **Why:** Running code transforms reading from guessing to verification. ### Step 3: Map the Architecture (C4 Thinking) Draw a rough map from far to near (no need to be perfect): **Level 1: System Context** - Who uses it? What external dependencies? **Level 2: Containers** - What are the deployable units? (services, databases, workers) **Level 3: Components** - What are the key components within one container? > You don't need beautiful diagrams - just enough to navigate: **where's the entry point, how does data flow, where are the boundaries?** ### Step 4: Trace a Real Request Path Instead of reading modules, **read one execution path**: - Web: route → controller → service → repository → external API - CLI: main → command parser → execution path - Async: consumer → handler → processing → ack/retry **Strongly recommended:** Use a debugger, add logging, set breakpoints to walk through once. ### Step 5: Treat Tests as Executable Documentation If code is old/hard to test, write **characterization tests** first - record current behavior, then refactor under protection. ## Core Workflow ### 1. Goal-Oriented Setup **Define your reading task:** Write a one-sentence goal that describes what you want to achieve: - "I can explain the request flow from HTTP to database" - "I can safely modify the authentication module" - "I can trace the OCR pipeline from image input to result output" **Avoid:** Reading without purpose or starting from directory trees. ### 2. Project Bootstrap **Priority order (first hour):** 1. **README.md** - What does it do? How to start? Minimal example? 2. **CONTRIBUTING.md / docs/** - How to test? Code structure? Branch strategy? 3. **Run minimal path** - Can you build it? Run one test? Execute a demo? **Verify:** Model files exist, tests run, examples work. ### 3. Architecture Mapping (C4 Model) **Level 1: System Context** - Users/applications that interact with the system - External dependencies (databases, APIs, services) - System boundaries **Level 2: Containers** - Deployable units (frontend, API server, workers, databases, queues) - Communication between containers **Level 3: Components** - Key components within a container (auth service, domain service, repository, adapter) - Component relationships and dependencies **Output:** A rough map that answers: - Where's the entry point? - How does data flow? - Where are the boundaries? ### 4. Entry Point + Request Tracing **Find the entry point:** - Web: `main()`, route handlers, controllers - CLI: `main()`, command parsers - Library: public API functions, constructors **Trace one complete path:** - Use debugger to step through - Add logging at key points - Set breakpoints to verify understanding **Document the flow:** Write down the execution path as you trace it. ### 5. Tests as Documentation **Existing tests:** - Read tests to understand expected behavior - Tests show how components are used - Tests document edge cases and error handling **Missing tests:** - Write characterization tests (record current behavior) - Use tests as a safety net before modifications - Tests become regression suite ### 6. Git Archaeology **When you see "weird code":** Don't dismiss it immediately - ask: **What historical problem is this solving?** **Commands:** ```bash git blame -w -- path/to/file # Who changed this and when? git log -p -- path/to/file # Full change history git log --grep="keyword" # Find related commits git show # View specific change ``` **What to look for:** - Commit messages explaining why - PR discussions showing trade-offs - Major refactors showing architecture evolution ### 7. Use Tools Efficiently **Code search:** ```bash rg "keyword" -n . # Ripgrep (faster than grep) git grep "keyword" # Git-optimized search rg "main\(" . # Find entry points rg "TODO|FIXME" . # Find todos ``` **Git tools:** ```bash git log --graph --oneline --all # Visual history git log --follow -- path/to/file # File rename tracking git diff HEAD~5 HEAD # Compare versions ``` **Language-specific tools:** - Rust: `cargo tree`, `cargo clippy`, `cargo doc` - Python: `pytest`, `mypy`, `pylint` - JavaScript: `npm list`, `eslint`, `tsc` ### 8. Build and Maintain Terminology Glossary **Critical for understanding:** Build a living terminology glossary from day 1, not as an afterthought. **Why it matters:** - Domain-specific terms (OCR: detection, recognition, CTC, NMS) - Acronyms that need expansion (CLS, DB, IOU) - Project-specific abstractions (custom types, internal concepts) - Confusion between similar concepts (Orientation vs CLS, Detection vs Recognition) **When to build:** - **Initial collection**: During first reading pass (Steps 1-4) - collect terms as you encounter them - **Deep dive enrichment**: During detailed module analysis (Step 10) - add technical details and context - **Continuous maintenance**: Update as understanding deepens, add cross-references, refine definitions **What to include:** - **Definition**: Clear explanation of what the term means - **Context**: Where and how it's used in the codebase - **Code references**: File paths, function names, line numbers - **Relationships**: Related terms, parent/child concepts, synonyms - **Examples**: Usage examples, code snippets when helpful **Organization:** - Classify by domain (OCR terms, ML terms, project-specific) - Group by abstraction level (concepts, algorithms, data structures, implementation) - Cross-reference to architecture diagrams, API flows, module analysis **Example structure:** ```markdown ## Terminology Glossary ### Domain-Specific Terms - **Detection (检测)**: Locating text regions in images - **Implementation**: `src/det.rs` - **Related**: Recognition, NMS, DB - **See**: `api-flow.md` section 4.1 ### Data Structures - **`Mat`**: Image matrix abstraction - **Implementation**: `src/image_impl.rs` - **Purpose**: Unified image representation for Pure Rust and OpenCV backends ### Algorithms - **NMS (Non-Maximum Suppression)**: Algorithm for filtering overlapping boxes - **Implementation**: `src/geometry.rs::nms()` - **Related**: IOU (Intersection over Union) ``` **Tools for term extraction:** ```bash # Extract acronyms (all caps, 2-5 chars) rg "\b[A-Z]{2,5}\b" README.md docs/ | sort -u # Extract struct/enum/trait names rg "^(pub )?(struct|enum|trait|type) \w+" src/ -o | sort -u # Find config-related terms rg "Config.*\{|struct \w+Config" src/ ``` **Maintenance checklist:** - [ ] All acronyms have expansions - [ ] All domain terms have clear definitions - [ ] All key data structures are documented - [ ] Cross-references to code are accurate - [ ] Related terms are linked - [ ] Examples provided for complex terms ## Common Patterns ### Pattern 1: Reading Without a Goal **Problem:** Reading files alphabetically or by directory structure. **Solution:** Start with a concrete task. Even if it's "understand how X works," make it specific. ### Pattern 2: Getting Lost in Details **Problem:** Deep-diving into every module before understanding the flow. **Solution:** First trace one complete execution path. Then dive into specific modules as needed. ### Pattern 3: Ignoring Tests **Problem:** Treating tests as optional or skipping them. **Solution:** Tests are the most reliable documentation. Read them first. If missing, write characterization tests. ### Pattern 4: Not Using Git History **Problem:** Seeing "weird code" and assuming it's wrong. **Solution:** Use `git blame` and `git log` to understand context. Code exists for reasons, even if not obvious. ### Pattern 5: Ignoring Terminology (Acronyms and Domain Terms) **Problem:** Encountering acronyms (CTC, CLS, DB, NMS) or domain terms (Detection vs Recognition) and assuming you'll remember what they mean later. Repeatedly looking up the same terms. **Solution:** Build a terminology glossary from day 1. Collect terms as you encounter them, enrich with context as understanding deepens. Cross-reference terms throughout documentation. Treat it as a living document, not a one-time task. ## Success Indicators You've successfully understood a codebase when: ✅ You can explain the main execution flow from entry to exit ✅ You can locate where specific functionality is implemented ✅ You can trace data transformations through the system ✅ You can identify where to make changes for your goal ✅ You can explain architectural decisions (even if you'd do differently) ✅ You have a comprehensive terminology glossary that helps navigate the codebase ✅ You can explain any domain-specific term in one sentence ✅ You rarely need to re-lookup the same term ✅ You feel confident modifying code without breaking things **Not required:** - ❌ Having read every file - ❌ Memorizing every function - ❌ Understanding every detail ## Troubleshooting **"I don't know where to start"** → Define a concrete goal. Even "understand how user authentication works" is better than "understand everything." **"The codebase is too large"** → Focus on your goal. Trace one execution path. Ignore unrelated modules. **"I can't find the entry point"** → Look for `main()`, route definitions, or public API functions. Use code search tools. **"The code doesn't make sense"** → Use Git history to understand why it's written this way. Check tests for usage examples. **"There are no tests"** → Write characterization tests. Record current behavior before modifying. **"I keep forgetting what CTC/CLS/DB means"** → Build a terminology glossary. Start early, collect terms as you encounter them. Include definitions, context, code references, and relationships. Cross-reference throughout documentation. **"The domain terms are confusing"** → Don't just look them up once. Document them with context, usage examples, and relationships to other terms. Classify by domain and abstraction level. Update as understanding deepens. ## References For detailed methodology and examples, see: - [methodology.md](references/methodology.md) - Complete 11-step methodology with detailed explanations - [tool-checklist.md](references/tool-checklist.md) - Comprehensive tool checklist by category (search, Git, language-specific, debugging) - [terminology-building.md](references/terminology-building.md) - **Complete guide to building and maintaining terminology glossaries** (5-step process, classification methods, best practices, common patterns)