# Lumen Benchmarks

Lumen is evaluated using **bench-swe**: a SWE-bench-style harness that measures whether Lumen reduces cost, time, and token usage when Claude fixes real GitHub bugs. Results are fully reproducible and all artifacts are committed to this repository.

## Methodology

### Evaluation Framework

`bench-swe` tests two scenarios head-to-head against real, fixed GitHub issues:

- **baseline** — Claude with default tools only (Read, Write, Edit, Grep, Bash etc.), no Lumen
- **with-lumen** — all default tools plus Lumen's `semantic_search` MCP tool

Each task is a real GitHub bug from an open-source project. Claude is given the issue description and the codebase at the pre-fix commit. It must produce a patch that fixes the issue.

### Judging

Patches are rated by Claude Sonnet 4.6 acting as a blind judge, comparing each generated patch to the known-correct gold patch:

- **Perfect** — fixes the issue with equivalent or better logic than the gold patch
- **Good** — fixes the issue correctly using a different valid approach
- **Poor** — wrong, incomplete, doesn't compile, or doesn't fix the issue

The judge also evaluates `files_correct` (did the patch touch the right files?) and `logic_equivalent` (is the fix semantically identical to the gold patch?).
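
A judge verdict as described above can be pictured as a small record. The sketch below is illustrative only: the `JudgeVerdict` class, the `parse_verdict` helper, the `rating` key, and the JSON layout are assumptions, not the actual `judge.json` schema. Only the three rating names and the `files_correct` and `logic_equivalent` fields come from the text.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Rating(Enum):
    """The three ratings the judge can assign."""
    PERFECT = "Perfect"
    GOOD = "Good"
    POOR = "Poor"


@dataclass(frozen=True)
class JudgeVerdict:
    """Hypothetical in-memory form of one judge result."""
    rating: Rating
    files_correct: bool
    logic_equivalent: bool

    @property
    def fixes_issue(self) -> bool:
        # Perfect and Good both mean the issue is fixed; Poor does not.
        return self.rating is not Rating.POOR


def parse_verdict(raw: str) -> JudgeVerdict:
    """Parse a JSON string in the assumed layout; unknown ratings raise ValueError."""
    data = json.loads(raw)
    return JudgeVerdict(
        rating=Rating(data["rating"]),
        files_correct=bool(data["files_correct"]),
        logic_equivalent=bool(data["logic_equivalent"]),
    )


example = parse_verdict(
    '{"rating": "Good", "files_correct": true, "logic_equivalent": false}'
)
print(example.rating.value, example.fixes_issue)
```

Keeping the rating as an enum rather than a free string means a misspelled or unexpected rating fails loudly at parse time instead of silently skewing a summary.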

### Metrics Captured

For each run, bench-swe captures:

| Metric        | Source                               |
| ------------- | ------------------------------------ |
| Cost (USD)    | Claude API usage from raw JSONL      |
| Duration      | Wall time from session start to exit |
| Output tokens | Tokens generated by Claude           |
| Cache reads   | Tokens read from prompt cache        |
| Tool calls    | Number of tool invocations           |

### Current Test Suite

9 languages, hard difficulty — all against real GitHub issues (8 bug fixes and 1 feature implementation):

| Task            | Language   | Repository                                                   | Issue                                                                             |
| --------------- | ---------- | ------------------------------------------------------------ | --------------------------------------------------------------------------------- |
| go-hard         | Go         | [goccy/go-yaml](https://github.com/goccy/go-yaml)            | Decoder overrides defaults with null values                                       |
| javascript-hard | JavaScript | [markedjs/marked](https://github.com/markedjs/marked)        | Blockquotes in lists ignore indentation for nesting                               |
| php-hard        | PHP        | [Seldaek/monolog](https://github.com/Seldaek/monolog)        | JsonFormatter crashes on stringable object error                                  |
| python-hard     | Python     | [pallets/click](https://github.com/pallets/click)            | Boolean flag show_default ignores default_map                                     |
| typescript-hard | TypeScript | [commander-js/commander](https://github.com/tj/commander.js) | Negative flag negation doesn't propagate to aliases                               |
| ruby-hard       | Ruby       | [ruby-grape/grape](https://github.com/ruby-grape/grape)      | Wrong content type when Accept header is a wildcard                               |
| cpp-hard        | C++        | [fmtlib/fmt](https://github.com/fmtlib/fmt)                  | Add a C API (feature implementation)                                              |
| dart-hard       | Dart       | [dart-lang/shelf](https://github.com/dart-lang/shelf)        | shelf_router HEAD request incorrectly sets content-length to 0                    |
| rust-hard       | Rust       | [toml-rs/toml](https://github.com/toml-rs/toml)              | False duplicate key error for dotted keys when parent table is implicitly created |

Embedding model: `ordis/jina-embeddings-v2-base-code` (Ollama, 768-dim). Claude model: Sonnet (execution), Sonnet 4.6 (judging).

---

## Results Overview

**9 benchmark tasks across 9 languages, each run in both scenarios.** Quality was maintained in every single task — no regressions. Cost was reduced in every language tested.

### Bug-Fix Tasks (8 languages, excluding C++ feature task)

| Metric        | Baseline avg | With-Lumen avg | Delta    |
| ------------- | ------------ | -------------- | -------- |
| Cost          | $0.43        | $0.27          | **-37%** |
| Time          | 183s         | 116s           | **-37%** |
| Output tokens | 8,278        | 4,787          | **-42%** |

### All 9 Tasks (including C++ feature task)

| Metric        | Baseline avg | With-Lumen avg | Delta    |
| ------------- | ------------ | -------------- | -------- |
| Cost          | $0.50        | $0.37          | **-26%** |
| Time          | 204s         | 146s           | **-28%** |
| Output tokens | 9,439        | 7,042          | **-25%** |

Cost was reduced in **all 9 languages** — the only universally positive metric. Quality was maintained in every task.

---

## Full Results Table

| Task            | Lang | Scenario   | Rating  | Cost   | Time   | Output Tok | Cache Read | Tool Calls |
| --------------- | ---- | ---------- | ------- | ------ | ------ | ---------- | ---------- | ---------- |
| javascript-hard | JS   | baseline   | Perfect | $0.482 | 254.7s | 14,286     | 486K       | 18         |
| javascript-hard | JS   | with-lumen | Perfect | $0.325 | 119.3s | 4,872      | 464K       | 16         |
| rust-hard       | Rust | baseline   | Poor    | $0.611 | 309.7s | 17,717     | 719K       | 22         |
| rust-hard       | Rust | with-lumen | Poor    | $0.375 | 204.0s | 12,291     | 241K       | 9          |
| php-hard        | PHP  | baseline   | Good    | $0.186 | 51.5s  | 1,936      | 249K       | 10         |
| php-hard        | PHP  | with-lumen | Good    | $0.136 | 34.0s  | 796        | 66K        | 7          |
| typescript-hard | TS   | baseline   | Good    | $0.186 | 84.4s  | 4,994      | 120K       | 6          |
| typescript-hard | TS   | with-lumen | Good    | $0.136 | 56.3s  | 1,813      | 183K       | 9          |
| python-hard     | Py   | baseline   | Perfect | $0.119 | 43.0s  | 1,710      | 132K       | 7          |
| python-hard     | Py   | with-lumen | Perfect | $0.096 | 30.6s  | 1,092      | 90K        | 5          |
| ruby-hard       | Ruby | baseline   | Good    | $0.539 | 185.5s | 6,143      | 517K       | 53         |
| ruby-hard       | Ruby | with-lumen | Good    | $0.411 | 165.2s | 5,581      | 295K       | 47         |
| go-hard         | Go   | baseline   | Good    | $0.646 | 291.2s | 11,475     | 658K       | 51         |
| go-hard         | Go   | with-lumen | Good    | $0.568 | 264.1s | 10,283     | 538K       | 35         |
| dart-hard       | Dart | baseline   | Good    | $0.634 | 246.1s | 21,286     | 4,126K     | 61         |
| dart-hard       | Dart | with-lumen | Good    | $0.153 | 50.9s  | 3,862      | 663K       | 14         |
| cpp-hard        | C++  | baseline   | Good    | $1.102 | 370.7s | 15,506     | 1,327K     | 63         |
| cpp-hard        | C++  | with-lumen | Good    | $1.014 | 359.1s | 22,056     | 1,019K     | 51         |

---

## Per-Language Results

### JavaScript — marked (blockquote nesting)

**One of the strongest results.** Lumen found the exact function (`list()` in `Tokenizer.ts`) on the first semantic search, eliminating all exploratory file reading.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Perfect  | Perfect    | Same       |
| Cost          | $0.482   | $0.325     | **-32.6%** |
| Time          | 254.7s   | 119.3s     | **-53.2%** |
| Output tokens | 14,286   | 4,872      | **-65.9%** |
| Cache reads   | 486K     | 464K       | -4.5%      |
| Tool calls    | 18       | 16         | -11.1%     |

Both scenarios produced **functionally identical patches** — the same `blockquoteBeginRegex` function added to `rules.ts` and the same break condition in `Tokenizer.ts`. The judge rated both Perfect:

> "The candidate patch implements identical logic to the gold patch in both
> `src/Tokenizer.ts` and `src/rules.ts`."

Lumen cut time by more than half and output tokens by two-thirds while delivering the same perfect fix.

### Rust — toml (dotted key duplicate error)

**The second-largest cost savings.** Lumen cut cost by 39% and time by 34%, the largest cost reduction after Dart (-76%). Both scenarios struggled with this multi-crate task (neither fixed the parallel bug in the `toml` crate), but Lumen dramatically reduced the exploration overhead.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Poor     | Poor       | Same       |
| Cost          | $0.611   | $0.375     | **-38.7%** |
| Time          | 309.7s   | 204.0s     | **-34.1%** |
| Output tokens | 17,717   | 12,291     | **-30.6%** |
| Cache reads   | 719K     | 241K       | **-66.5%** |
| Tool calls    | 22       | 9          | **-59.1%** |

Even when both approaches fail to produce a correct fix, Lumen saves money by reducing exploration. The baseline spent 22 tool calls exploring; Lumen narrowed it to 9 with targeted semantic searches. Cache reads dropped by two-thirds, showing that Lumen helped Claude avoid reading large amounts of irrelevant code.

### PHP — monolog (JsonFormatter crash)

Lumen navigated from the parent class (`NormalizerFormatter`) to the correct child class (`JsonFormatter`) in two semantic searches, reaching the fix location with minimal exploration.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Good     | Good       | Same       |
| Cost          | $0.186   | $0.136     | **-26.8%** |
| Time          | 51.5s    | 34.0s      | **-34.0%** |
| Output tokens | 1,936    | 796        | **-58.9%** |
| Cache reads   | 249K     | 66K        | **-73.5%** |
| Tool calls    | 10       | 7          | -30.0%     |

Both patches wrap the `__toString()` call in a try/catch and fall back to the class name. The 73.5% reduction in cache reads shows Lumen helping Claude avoid reading large amounts of irrelevant code.

### TypeScript — commander.js (negative flag negation)

Lumen found the option parsing logic directly, letting Claude focus on the fix rather than exploring the codebase structure.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Good     | Good       | Same       |
| Cost          | $0.186   | $0.136     | **-27.1%** |
| Time          | 84.4s    | 56.3s      | **-33.3%** |
| Output tokens | 4,994    | 1,813      | **-63.7%** |
| Cache reads   | 120K     | 183K       | +52.4%     |
| Tool calls    | 6        | 9          | +50.0%     |

Despite using more tool calls (Lumen search calls + follow-up reads), the net effect was strongly positive: 64% fewer output tokens and 33% faster completion. The additional cache reads came from Lumen loading relevant context that Claude would otherwise have had to discover through exploration.

### Python — click (boolean flag default_map)

Both scenarios found the one-line fix immediately. Lumen's semantic search located `get_help_record` in the `Option` class directly, saving a few Grep round-trips.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Perfect  | Perfect    | Same       |
| Cost          | $0.119   | $0.096     | **-19.5%** |
| Time          | 43.0s    | 30.6s      | **-28.8%** |
| Output tokens | 1,710    | 1,092      | **-36.1%** |
| Cache reads   | 132K     | 90K        | **-32.1%** |
| Tool calls    | 7        | 5          | -28.6%     |

Both produced the **identical single-line patch** — changing `self.default` to `default_value` on line 2800 of `core.py`. The judge confirmed:

> "The candidate patch makes the identical one-line change as the gold patch."

### Ruby — grape (wrong content type with wildcard Accept)

Lumen helped Claude navigate a large, convention-heavy Ruby codebase more efficiently, reducing cache reads by 43% and tool calls by 11%.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Good     | Good       | Same       |
| Cost          | $0.539   | $0.411     | **-23.7%** |
| Time          | 185.5s   | 165.2s     | **-10.9%** |
| Output tokens | 6,143    | 5,581      | -9.1%      |
| Cache reads   | 517K     | 295K       | **-43.0%** |
| Tool calls    | 53       | 47         | -11.3%     |

Ruby showed the most modest output token improvement (-9%) but strong cache read and cost reductions. The high baseline tool call count (53) reflects the exploration-heavy approach needed without semantic search in a large Ruby project.

### Go — go-yaml (null value decoder)

Lumen helped Claude find `createDecodedNewValue` in `decode.go` and produce a complete patch including test files.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Good     | Good       | Same       |
| Cost          | $0.646   | $0.568     | **-12.2%** |
| Time          | 291.2s   | 264.1s     | -9.3%      |
| Output tokens | 11,475   | 10,283     | -10.4%     |
| Cache reads   | 658K     | 538K       | **-18.2%** |
| Tool calls    | 51       | 35         | **-31.4%** |

Both scenarios produced correct patches with test files. The with-lumen patch was more thorough — table-driven tests covering both null values and comments-only nodes, vs a single test case in the baseline.

### Dart — shelf (HEAD content-length RFC violation)

**The strongest result overall.** Lumen cut cost by 76% and time by 79% — the largest improvements across all 9 languages. The bug was an RFC 9110 violation where `shelf_router`'s `_removeBody` middleware incorrectly set `content-length` to 0 for HEAD requests.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Good     | Good       | Same       |
| Cost          | $0.634   | $0.153     | **-75.8%** |
| Time          | 246.1s   | 50.9s      | **-79.3%** |
| Output tokens | 21,286   | 3,862      | **-81.9%** |
| Cache reads   | 4,126K   | 663K       | **-83.9%** |
| Tool calls    | 61       | 14         | **-77.0%** |

Both scenarios fixed the bug correctly.
The baseline spent 61 tool calls and over 4 minutes exploring the monorepo structure (`pkgs/shelf_router/` inside the larger `shelf` repository). With Lumen, semantic search located `_removeBody` and the router's HEAD handling directly, completing the fix in under a minute with only 14 tool calls.

### C++ — fmt (C API feature)

The only **feature implementation** task (not a bug fix). Both scenarios produced complete, working C API implementations with tests, using different but valid architectural approaches.

| Metric        | Baseline | With Lumen | Delta     |
| ------------- | -------- | ---------- | --------- |
| Rating        | Good     | Good       | Same      |
| Cost          | $1.102   | $1.014     | **-8.0%** |
| Time          | 370.7s   | 359.1s     | -3.1%     |
| Output tokens | 15,506   | 22,056     | +42.2%    |
| Cache reads   | 1,327K   | 1,019K     | -23.2%    |
| Tool calls    | 63       | 51         | -19.0%    |

C++ is the most expensive task in the suite — a feature implementation in a large codebase. Lumen reduced cost by 8% and tool calls by 19%, but output tokens increased by 42%, suggesting Lumen's search results provided context that Claude used to generate more comprehensive code. Despite being the one task type where Lumen's advantage is smallest, it still delivered cost savings.

---

## Quality Summary

| Language   | Baseline Rating | With-Lumen Rating | Quality Delta |
| ---------- | --------------- | ----------------- | ------------- |
| JavaScript | Perfect         | Perfect           | Same          |
| Python     | Perfect         | Perfect           | Same          |
| Dart       | Good            | Good              | Same          |
| PHP        | Good            | Good              | Same          |
| TypeScript | Good            | Good              | Same          |
| Ruby       | Good            | Good              | Same          |
| Go         | Good            | Good              | Same          |
| C++        | Good            | Good              | Same          |
| Rust       | Poor            | Poor              | Same          |

Quality was maintained in **all 9 tasks** — zero regressions. Where the baseline produced Perfect patches, Lumen matched it. Where the baseline produced Good patches, Lumen matched it. And where the task was too hard for the baseline (Rust), Lumen didn't make it worse — it just made the failure cheaper.
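
The "zero regressions" claim can be checked mechanically from the table above. The sketch below assumes the ordering Poor < Good < Perfect implied by the rating definitions; `RATING_ORDER`, `QUALITY`, and `regressions` are hypothetical names for illustration and are not part of bench-swe.

```python
# Ordering assumed from the rating definitions: Poor < Good < Perfect.
RATING_ORDER = {"Poor": 0, "Good": 1, "Perfect": 2}

# (baseline rating, with-lumen rating) per language, copied from the table above.
QUALITY = {
    "JavaScript": ("Perfect", "Perfect"),
    "Python": ("Perfect", "Perfect"),
    "Dart": ("Good", "Good"),
    "PHP": ("Good", "Good"),
    "TypeScript": ("Good", "Good"),
    "Ruby": ("Good", "Good"),
    "Go": ("Good", "Good"),
    "C++": ("Good", "Good"),
    "Rust": ("Poor", "Poor"),
}


def regressions(quality):
    """Return the languages whose with-lumen rating is lower than the baseline rating."""
    return [
        lang
        for lang, (baseline, with_lumen) in quality.items()
        if RATING_ORDER[with_lumen] < RATING_ORDER[baseline]
    ]


print(regressions(QUALITY))  # prints []
```

An empty list means no task was rated lower with Lumen than without it; it says nothing about whether a rating improved, which did not happen in this suite either.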

---

## Key Findings

### 1. Cost Reduced in Every Language

Lumen reduced cost in **all 9 languages** — the only universally positive metric. The range spans from -8% (C++) to -76% (Dart):

| Language   | Baseline cost | With-Lumen cost | Delta      |
| ---------- | ------------- | --------------- | ---------- |
| Dart       | $0.634        | $0.153          | **-75.8%** |
| Rust       | $0.611        | $0.375          | **-38.7%** |
| JavaScript | $0.482        | $0.325          | **-32.6%** |
| TypeScript | $0.186        | $0.136          | **-27.1%** |
| PHP        | $0.186        | $0.136          | **-26.8%** |
| Ruby       | $0.539        | $0.411          | **-23.7%** |
| Python     | $0.119        | $0.096          | **-19.5%** |
| Go         | $0.646        | $0.568          | **-12.2%** |
| C++        | $1.102        | $1.014          | **-8.0%**  |

### 2. Output Token Reduction Is the Primary Driver

In 8/9 languages, output tokens dropped — up to 82% for Dart. The one exception is C++ where output tokens increased (+42%) due to more comprehensive code generation. Fewer output tokens means Claude explores less and acts more:

| Language   | Baseline output | With-Lumen output | Delta      |
| ---------- | --------------- | ----------------- | ---------- |
| Dart       | 21,286          | 3,862             | **-81.9%** |
| JavaScript | 14,286          | 4,872             | **-65.9%** |
| TypeScript | 4,994           | 1,813             | **-63.7%** |
| PHP        | 1,936           | 796               | **-58.9%** |
| Python     | 1,710           | 1,092             | **-36.1%** |
| Rust       | 17,717          | 12,291            | **-30.6%** |
| Go         | 11,475          | 10,283            | -10.4%     |
| Ruby       | 6,143           | 5,581             | -9.1%      |
| C++        | 15,506          | 22,056            | +42.2%     |

### 3. Time Savings Scale with Exploration

The languages where the baseline needed the most exploration saw the largest time reductions:

| Language   | Baseline time | With-Lumen time | Delta      |
| ---------- | ------------- | --------------- | ---------- |
| Dart       | 246.1s        | 50.9s           | **-79.3%** |
| JavaScript | 254.7s        | 119.3s          | **-53.2%** |
| Rust       | 309.7s        | 204.0s          | **-34.1%** |
| PHP        | 51.5s         | 34.0s           | **-34.0%** |
| TypeScript | 84.4s         | 56.3s           | **-33.3%** |
| Python     | 43.0s         | 30.6s           | **-28.8%** |
| Ruby       | 185.5s        | 165.2s          | **-10.9%** |
| Go         | 291.2s        | 264.1s          | -9.3%      |
| C++        | 370.7s        | 359.1s          | -3.1%      |

### 4. Search Calls Are Modest

Lumen typically uses 1-10 search calls per task. It supplements rather than replaces other tool usage:

| Language   | Lumen search calls | Total tool calls (Lumen) | Total tool calls (baseline) |
| ---------- | ------------------ | ------------------------ | --------------------------- |
| Python     | 2                  | 5                        | 7                           |
| PHP        | 2                  | 7                        | 10                          |
| Rust       | 2                  | 9                        | 22                          |
| TypeScript | 1                  | 9                        | 6                           |
| Dart       | —                  | 14                       | 61                          |
| JavaScript | 2                  | 16                       | 18                          |
| Go         | 3                  | 35                       | 51                          |
| Ruby       | 10                 | 47                       | 53                          |
| C++        | 6                  | 51                       | 63                          |

### 5. Zero Quality Regressions

Lumen maintained patch quality in all 9 tasks. Two tasks achieved Perfect ratings (JavaScript, Python) — identical patches to the gold standard. Six achieved Good ratings with correct fixes via different approaches. Even the one task too hard for either approach (Rust) showed no degradation — Lumen just made the failure 39% cheaper.

### 6. Results Are Reproducible

All benchmark artifacts — raw JSONL streams, patch diffs, metrics, and judge ratings — are committed to this repository. The benchmark framework is deterministic in setup (same commit, same issue, same tools) while allowing natural LLM variation in execution. The consistent direction of improvement across 9 independent language benchmarks suggests the effect is not specific to one language or codebase.
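
As a quick sanity check on the committed numbers, the Delta columns can be recomputed as the relative change from baseline. The sketch below uses the Dart figures from the tables above; `pct_delta` is a hypothetical helper name, not part of bench-swe. Deltas recomputed from the rounded figures shown in this report can differ from the published ones in the last digit (PHP cost comes out at -26.9% rather than -26.8%), presumably because the report works from unrounded values.

```python
def pct_delta(baseline: float, with_lumen: float) -> float:
    """Relative change from baseline, in percent (negative means a reduction)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (with_lumen - baseline) / baseline * 100.0


# Dart figures as (baseline, with-lumen), copied from the tables above.
dart = {
    "time_s": (246.1, 50.9),
    "output_tokens": (21286, 3862),
    "tool_calls": (61, 14),
}

for metric, (base, lumen) in dart.items():
    print(f"{metric}: {pct_delta(base, lumen):+.1f}%")
# time_s: -79.3%
# output_tokens: -81.9%
# tool_calls: -77.0%
```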

---

## Reproduce

Requirements: Ollama running with `ordis/jina-embeddings-v2-base-code`, the `claude` CLI, `git`, `go`, `jq`.

```bash
cd bench-swe

# Run all tasks, both scenarios
go run ./cmd/run --output ../bench-results/my-run

# Run a single language
go run ./cmd/run --filter go-hard --output ../bench-results/my-run

# Generate report from existing results
go run ./cmd/report --input ../bench-results/my-run
```

Results land in `bench-results//`. Each run produces:

- `--raw.jsonl` — full Claude session stream
- `--metrics.json` — extracted cost/time/tokens
- `--patch.diff` — generated patch
- `--judge.json` — judge rating and reasoning
- `--judge.md` — judge rationale in markdown
- `detail-report.md` / `summary-report.md` — human-readable output

The benchmark is entirely self-contained in `bench-swe/`. Tasks are defined as JSON files in `bench-swe/tasks/`. To add a new language or difficulty level, add a task JSON and re-run. Current results are committed at `bench-results/`.