# Benchmarks A **reproducible, deterministic** measure of Ama's *retrieval efficiency*: how many tool calls and tokens it takes to **obtain** the answer to a code question — one Ama MCP call versus the grep-then-read an agent falls back to without a code graph. Run it yourself (no LLM, no API key): ```bash node scripts/benchmark.mjs [repoPath] # defaults to this repo ``` ## Java Resolution Rate Track Java deep-tier call-graph completeness on deterministic fixture repos: ```bash npm run bench:java-resolution npm run bench:java-resolution -- --json ``` The harness reports `callsResolved / callsTotal` plus diagnostic buckets per fixture. By default it sets `AMA_JAVA_INCLUDE_STDLIB=0` for reproducible fixture-only numbers; set it explicitly before running if you want local JDK standard-library symbols included. ## Results — Ama's own repo (~317 files) | Question | Ama (calls / tokens) | Baseline (calls / tokens) | Fewer tokens | |---|---|---|---| | who calls `symbolId`? | 1 / 6,177 | 66 / 298,920 | 98% | | who calls `fileId`? | 1 / 930 | 24 / 295,439 | ~100% | | what breaks if `AmaSession` changes? | 1 / 1,082 | 21 / 270,948 | ~100% | | show `createServer`'s source | 1 / 4,679 | 10 / 264,251 | 98% | | who calls `walkSymbols`? | 1 / 344 | 8 / 254,489 | ~100% | | **Total** | **5 / 13,212** | **129 / 1,384,047** | **99%** | **→ ~99% fewer tokens and ~96% fewer tool calls** to retrieve these answers. ## Methodology - **With Ama:** one MCP tool call (`find_callers` / `impact_analysis` / `get_code_snippet`). Cost = the result's size in tokens. - **Baseline (no graph):** an agent greps for the symbol (1 call), then reads each matching file **in full** (N calls) to find and confirm the answer. Cost = those files' tokens. - Tokens ≈ bytes ÷ 4 (the usual rough rule). File set = `git grep -lI`. ## What this does and doesn't measure - ✅ **Retrieval cost** — tool calls + tokens to *get* the answer. This is the mechanism behind the savings: one focused, structured result instead of pulling whole files into context. - ⚠️ **The baseline is an upper bound.** It reads *every* grep-matching file in full — a thorough agent confirming each reference. A leaner agent reads fewer files, so real-world savings are **lower than the headline**; treat 99% as the ceiling, not a promise. - ❌ **Not a full agent-loop benchmark.** It excludes the LLM's reasoning tokens and multi-turn overhead. Some code-graph tools publish whole-agent "% cheaper" figures; a comparable number for Ama would need an LLM harness (future work). - The deeper win is **qualitative**: Ama returns the *precise, structured* answer — real callers, transitive blast radius, exact snippet — with no grep false positives and no call-vs-mention guesswork, which grep can't give at any token budget.