# Lumen Benchmarks

Lumen is evaluated using **bench-swe**: a SWE-bench-style harness that measures
whether Lumen reduces cost, time, and token usage when Claude fixes real GitHub
bugs. Results are fully reproducible and all artifacts are committed to this
repository.

## Methodology

### Evaluation Framework

`bench-swe` tests two scenarios head-to-head against real, fixed GitHub issues:

- **baseline** — Claude with default tools only (Read, Write, Edit, Grep, Bash
  and others), no Lumen
- **with-lumen** — all default tools plus Lumen's `semantic_search` MCP tool

Each task is a real GitHub bug from an open-source project. Claude is given the
issue description and the codebase at the pre-fix commit. It must produce a
patch that fixes the issue.
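
The run matrix implied by this setup is every task paired with both scenarios. A minimal sketch, assuming the task IDs listed under Current Test Suite and the two scenario names above; the `run_matrix` helper is hypothetical and not part of bench-swe:

```python
from itertools import product

# Task IDs as listed under "Current Test Suite"; scenario names from the list above.
TASKS = [
    "go-hard", "javascript-hard", "php-hard", "python-hard", "typescript-hard",
    "ruby-hard", "cpp-hard", "dart-hard", "rust-hard",
]
SCENARIOS = ["baseline", "with-lumen"]


def run_matrix(tasks, scenarios):
    """Return every (task, scenario) pair; each pair is one benchmark run."""
    return list(product(tasks, scenarios))


runs = run_matrix(TASKS, SCENARIOS)
print(len(runs))  # 18
print(runs[0])    # ('go-hard', 'baseline')
```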

### Judging

Patches are rated by Claude Sonnet 4.6 acting as a blind judge, comparing each
generated patch to the known-correct gold patch:

- **Perfect** — fixes the issue with equivalent or better logic than the gold
  patch
- **Good** — fixes the issue correctly using a different valid approach
- **Poor** — wrong, incomplete, doesn't compile, or doesn't fix the issue

The judge also evaluates `files_correct` (did the patch touch the right files?)
and `logic_equivalent` (is the fix semantically identical to the gold patch?).
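
One way to picture a single judge verdict is as a small record. This is only a sketch: `files_correct` and `logic_equivalent` come from the description above, while the `rating` key, the `JudgeVerdict` class, and the `parse_verdict` helper are assumptions and may not match the actual `judge.json` schema.

```python
from dataclasses import dataclass
from enum import IntEnum


class Rating(IntEnum):
    """Judge ratings, ordered so that a higher value is a better patch."""
    POOR = 0
    GOOD = 1
    PERFECT = 2


@dataclass(frozen=True)
class JudgeVerdict:
    rating: Rating
    files_correct: bool     # did the patch touch the right files?
    logic_equivalent: bool  # is the fix semantically identical to the gold patch?


def parse_verdict(raw: dict) -> JudgeVerdict:
    """Build a verdict from a dict; the key names are assumed, not the real schema."""
    return JudgeVerdict(
        rating=Rating[raw["rating"].upper()],
        files_correct=bool(raw["files_correct"]),
        logic_equivalent=bool(raw["logic_equivalent"]),
    )


verdict = parse_verdict(
    {"rating": "Good", "files_correct": True, "logic_equivalent": False}
)
print(verdict.rating.name, verdict.rating >= Rating.GOOD)  # GOOD True
```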

### Metrics Captured

For each run, bench-swe captures:

| Metric        | Source                               |
| ------------- | ------------------------------------ |
| Cost (USD)    | Claude API usage from raw JSONL      |
| Duration      | Wall time from session start to exit |
| Output tokens | Tokens generated by Claude           |
| Cache reads   | Tokens read from prompt cache        |
| Tool calls    | Number of tool invocations           |

### Current Test Suite

9 languages, hard difficulty — all against real GitHub bugs:

| Task            | Language   | Repository                                                    | Issue                                                                            |
| --------------- | ---------- | ------------------------------------------------------------- | -------------------------------------------------------------------------------- |
| go-hard         | Go         | [goccy/go-yaml](https://github.com/goccy/go-yaml)            | Decoder overrides defaults with null values                                      |
| javascript-hard | JavaScript | [markedjs/marked](https://github.com/markedjs/marked)         | Blockquotes in lists ignore indentation for nesting                              |
| php-hard        | PHP        | [Seldaek/monolog](https://github.com/Seldaek/monolog)         | JsonFormatter crashes on stringable object error                                 |
| python-hard     | Python     | [pallets/click](https://github.com/pallets/click)             | Boolean flag show_default ignores default_map                                    |
| typescript-hard | TypeScript | [tj/commander.js](https://github.com/tj/commander.js)         | Negative flag negation doesn't propagate to aliases                              |
| ruby-hard       | Ruby       | [ruby-grape/grape](https://github.com/ruby-grape/grape)       | Wrong content type when Accept header is a wildcard                              |
| cpp-hard        | C++        | [fmtlib/fmt](https://github.com/fmtlib/fmt)                   | Add a C API (feature implementation)                                             |
| dart-hard       | Dart       | [dart-lang/shelf](https://github.com/dart-lang/shelf)          | shelf_router HEAD request incorrectly sets content-length to 0                   |
| rust-hard       | Rust       | [toml-rs/toml](https://github.com/toml-rs/toml)               | False duplicate key error for dotted keys when parent table is implicitly created |

Embedding model: `ordis/jina-embeddings-v2-base-code` (Ollama, 768-dim). Claude
model: Sonnet (execution), Sonnet 4.6 (judging).

---

## Results Overview

**9 benchmark tasks across 9 languages, each run in both scenarios.** Quality
was maintained in every task, with no regressions, and cost was reduced in every
language tested.

### Bug-Fix Tasks (8 languages, excluding C++ feature task)

| Metric        | Baseline avg | With-Lumen avg | Delta     |
| ------------- | ------------ | -------------- | --------- |
| Cost          | $0.425       | $0.275         | **-35%**  |
| Time          | 183s         | 116s           | **-37%**  |
| Output tokens | 9,943        | 5,074          | **-49%**  |

### All 9 Tasks (including C++ feature task)

| Metric        | Baseline avg | With-Lumen avg | Delta     |
| ------------- | ------------ | -------------- | --------- |
| Cost          | $0.501       | $0.357         | **-29%**  |
| Time          | 204s         | 143s           | **-30%**  |
| Output tokens | 10,561       | 6,961          | **-34%**  |

Cost and wall time were reduced in **all 9 languages**; output tokens, cache
reads, and tool calls each rose in at least one task. Quality was maintained in
every task.

---

## Full Results Table

| Task            | Lang | Scenario   | Rating  | Cost   | Time   | Output Tok | Cache Read | Tool Calls |
| --------------- | ---- | ---------- | ------- | ------ | ------ | ---------- | ---------- | ---------- |
| javascript-hard | JS   | baseline   | Perfect | $0.482 | 254.7s | 14,286     | 486K       | 18         |
| javascript-hard | JS   | with-lumen | Perfect | $0.325 | 119.3s | 4,872      | 464K       | 16         |
| rust-hard       | Rust | baseline   | Poor    | $0.611 | 309.7s | 17,717     | 719K       | 22         |
| rust-hard       | Rust | with-lumen | Poor    | $0.375 | 204.0s | 12,291     | 241K       | 9          |
| php-hard        | PHP  | baseline   | Good    | $0.186 | 51.5s  | 1,936      | 249K       | 10         |
| php-hard        | PHP  | with-lumen | Good    | $0.136 | 34.0s  | 796        | 66K        | 7          |
| typescript-hard | TS   | baseline   | Good    | $0.186 | 84.4s  | 4,994      | 120K       | 6          |
| typescript-hard | TS   | with-lumen | Good    | $0.136 | 56.3s  | 1,813      | 183K       | 9          |
| python-hard     | Py   | baseline   | Perfect | $0.119 | 43.0s  | 1,710      | 132K       | 7          |
| python-hard     | Py   | with-lumen | Perfect | $0.096 | 30.6s  | 1,092      | 90K        | 5          |
| ruby-hard       | Ruby | baseline   | Good    | $0.539 | 185.5s | 6,143      | 517K       | 53         |
| ruby-hard       | Ruby | with-lumen | Good    | $0.411 | 165.2s | 5,581      | 295K       | 47         |
| go-hard         | Go   | baseline   | Good    | $0.646 | 291.2s | 11,475     | 658K       | 51         |
| go-hard         | Go   | with-lumen | Good    | $0.568 | 264.1s | 10,283     | 538K       | 35         |
| dart-hard       | Dart | baseline   | Good    | $0.634 | 246.1s | 21,286     | 4,126K     | 61         |
| dart-hard       | Dart | with-lumen | Good    | $0.153 | 50.9s  | 3,862      | 663K       | 14         |
| cpp-hard        | C++  | baseline   | Good    | $1.102 | 370.7s | 15,506     | 1,327K     | 63         |
| cpp-hard        | C++  | with-lumen | Good    | $1.014 | 359.1s | 22,056     | 1,019K     | 51         |
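
The per-scenario averages can be recomputed directly from the rows above. A minimal sketch for the cost column, assuming the delta in an aggregate table is the relative change between the two averages; the `summarize` helper is hypothetical and not part of bench-swe:

```python
from statistics import mean

# task -> (baseline cost, with-lumen cost) in USD, copied from the table above.
COST = {
    "javascript-hard": (0.482, 0.325),
    "rust-hard": (0.611, 0.375),
    "php-hard": (0.186, 0.136),
    "typescript-hard": (0.186, 0.136),
    "python-hard": (0.119, 0.096),
    "ruby-hard": (0.539, 0.411),
    "go-hard": (0.646, 0.568),
    "dart-hard": (0.634, 0.153),
    "cpp-hard": (1.102, 1.014),
}


def summarize(rows, exclude=()):
    """Average each scenario over the kept tasks and return the relative change in percent."""
    kept = [pair for task, pair in rows.items() if task not in exclude]
    base = mean(b for b, _ in kept)
    lumen = mean(l for _, l in kept)
    return base, lumen, (lumen - base) / base * 100


bug_fix = summarize(COST, exclude={"cpp-hard"})
all_tasks = summarize(COST)
print("bug-fix tasks: ${:.3f} -> ${:.3f} ({:+.1f}%)".format(*bug_fix))
print("all tasks:     ${:.3f} -> ${:.3f} ({:+.1f}%)".format(*all_tasks))
```

The same function works for time or output tokens by swapping in the corresponding columns.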

---

## Per-Language Results

### JavaScript — marked (blockquote nesting)

**One of the strongest results.** Lumen found the exact function (`list()` in
`Tokenizer.ts`) on the first semantic search, eliminating all exploratory file
reading.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Perfect  | Perfect    | Same       |
| Cost          | $0.482   | $0.325     | **-32.6%** |
| Time          | 254.7s   | 119.3s     | **-53.2%** |
| Output tokens | 14,286   | 4,872      | **-65.9%** |
| Cache reads   | 486K     | 464K       | -4.5%      |
| Tool calls    | 18       | 16         | -11.1%     |

Both scenarios produced **functionally identical patches** — the same
`blockquoteBeginRegex` function added to `rules.ts` and the same break condition
in `Tokenizer.ts`. The judge rated both Perfect:

> "The candidate patch implements identical logic to the gold patch in both
> `src/Tokenizer.ts` and `src/rules.ts`."

Lumen cut time by more than half and output tokens by two-thirds while
delivering the same perfect fix.
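
The Delta column in the per-language tables is the relative change against the baseline. A minimal sketch using the JavaScript values above; the `pct_delta` helper is illustrative and not part of bench-swe:

```python
def pct_delta(baseline: float, with_lumen: float) -> float:
    """Relative change versus baseline, in percent (negative means a reduction)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return round((with_lumen - baseline) / baseline * 100, 1)


# Values copied from the JavaScript table above.
print(pct_delta(0.482, 0.325))  # -32.6 (cost)
print(pct_delta(254.7, 119.3))  # -53.2 (time)
print(pct_delta(14286, 4872))   # -65.9 (output tokens)
```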

### Rust — toml (dotted key duplicate error)

**The second-largest cost savings.** Lumen cut cost by 39% and time by 34%;
only Dart saw a larger cost reduction. Both scenarios struggled with this
multi-crate task (neither fixed the parallel bug in the `toml` crate), but Lumen
substantially reduced the exploration overhead.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Poor     | Poor       | Same       |
| Cost          | $0.611   | $0.375     | **-38.7%** |
| Time          | 309.7s   | 204.0s     | **-34.1%** |
| Output tokens | 17,717   | 12,291     | **-30.6%** |
| Cache reads   | 719K     | 241K       | **-66.5%** |
| Tool calls    | 22       | 9          | **-59.1%** |

Even when both approaches fail to produce a correct fix, Lumen saves money by
reducing exploration. The baseline spent 22 tool calls exploring; Lumen narrowed
it to 9 with targeted semantic searches. Cache reads dropped by two-thirds,
showing that Lumen helped Claude avoid reading large amounts of irrelevant code.

### PHP — monolog (JsonFormatter crash)

Lumen navigated from the parent class (`NormalizerFormatter`) to the correct
child class (`JsonFormatter`) in two semantic searches, reaching the fix
location with minimal exploration.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Good     | Good       | Same       |
| Cost          | $0.186   | $0.136     | **-26.8%** |
| Time          | 51.5s    | 34.0s      | **-34.0%** |
| Output tokens | 1,936    | 796        | **-58.9%** |
| Cache reads   | 249K     | 66K        | **-73.5%** |
| Tool calls    | 10       | 7          | -30.0%     |

Both patches wrap the `__toString()` call in a try/catch and fall back to the
class name. The 73.5% reduction in cache reads shows Lumen helping Claude avoid
reading large amounts of irrelevant code.

### TypeScript — commander.js (negative flag negation)

Lumen found the option parsing logic directly, letting Claude focus on the fix
rather than exploring the codebase structure.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Good     | Good       | Same       |
| Cost          | $0.186   | $0.136     | **-27.1%** |
| Time          | 84.4s    | 56.3s      | **-33.3%** |
| Output tokens | 4,994    | 1,813      | **-63.7%** |
| Cache reads   | 120K     | 183K       | +52.4%     |
| Tool calls    | 6        | 9          | +50.0%     |

Despite using more tool calls (Lumen search calls plus follow-up reads), the net
effect was strongly positive: 64% fewer output tokens and 33% faster completion.
The additional cache reads likely came from Lumen loading relevant context that
Claude would otherwise have had to discover through exploration.

### Python — click (boolean flag default_map)

Both scenarios found the one-line fix immediately. Lumen's semantic search
located `get_help_record` in the `Option` class directly, saving a few Grep
round-trips.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Perfect  | Perfect    | Same       |
| Cost          | $0.119   | $0.096     | **-19.5%** |
| Time          | 43.0s    | 30.6s      | **-28.8%** |
| Output tokens | 1,710    | 1,092      | **-36.1%** |
| Cache reads   | 132K     | 90K        | **-32.1%** |
| Tool calls    | 7        | 5          | -28.6%     |

Both produced the **identical single-line patch** — changing `self.default` to
`default_value` on line 2800 of `core.py`. The judge confirmed:

> "The candidate patch makes the identical one-line change as the gold patch."

### Ruby — grape (wrong content type with wildcard Accept)

Lumen helped Claude navigate a large, convention-heavy Ruby codebase more
efficiently, reducing cache reads by 43% and tool calls by 11%.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Good     | Good       | Same       |
| Cost          | $0.539   | $0.411     | **-23.7%** |
| Time          | 185.5s   | 165.2s     | **-10.9%** |
| Output tokens | 6,143    | 5,581      | -9.1%      |
| Cache reads   | 517K     | 295K       | **-43.0%** |
| Tool calls    | 53       | 47         | -11.3%     |

Ruby showed the most modest output token improvement (-9%) but strong cache read
and cost reductions. The high baseline tool call count (53) reflects the
exploration-heavy approach needed without semantic search in a large Ruby
project.

### Go — go-yaml (null value decoder)

Lumen helped Claude find `createDecodedNewValue` in `decode.go` and produce a
complete patch including test files.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Good     | Good       | Same       |
| Cost          | $0.646   | $0.568     | **-12.2%** |
| Time          | 291.2s   | 264.1s     | -9.3%      |
| Output tokens | 11,475   | 10,283     | -10.4%     |
| Cache reads   | 658K     | 538K       | **-18.2%** |
| Tool calls    | 51       | 35         | **-31.4%** |

Both scenarios produced correct patches with test files. The with-lumen patch
was more thorough — table-driven tests covering both null values and
comments-only nodes, vs a single test case in the baseline.

### Dart — shelf (HEAD content-length RFC violation)

**The strongest result overall.** Lumen cut cost by 76% and time by 79% — the
largest improvements across all 9 languages. The bug was an RFC 9110 violation
where `shelf_router`'s `_removeBody` middleware incorrectly set `content-length`
to 0 for HEAD requests.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Good     | Good       | Same       |
| Cost          | $0.634   | $0.153     | **-75.8%** |
| Time          | 246.1s   | 50.9s      | **-79.3%** |
| Output tokens | 21,286   | 3,862      | **-81.9%** |
| Cache reads   | 4,126K   | 663K       | **-83.9%** |
| Tool calls    | 61       | 14         | **-77.0%** |

Both scenarios fixed the bug correctly. The baseline spent 61 tool calls and
over 4 minutes exploring the monorepo structure (`pkgs/shelf_router/` inside
the larger `shelf` repository). With Lumen, semantic search located
`_removeBody` and the router's HEAD handling directly, completing the fix in
under a minute with only 14 tool calls.

### C++ — fmt (C API feature)

The only **feature implementation** task (not a bug fix). Both scenarios
produced complete, working C API implementations with tests, using different but
valid architectural approaches.

| Metric        | Baseline | With Lumen | Delta      |
| ------------- | -------- | ---------- | ---------- |
| Rating        | Good     | Good       | Same       |
| Cost          | $1.102   | $1.014     | **-8.0%**  |
| Time          | 370.7s   | 359.1s     | -3.1%      |
| Output tokens | 15,506   | 22,056     | +42.2%     |
| Cache reads   | 1,327K   | 1,019K     | -23.2%     |
| Tool calls    | 63       | 51         | -19.0%     |

C++ is the most expensive task in the suite: a feature implementation in a
large codebase. Lumen reduced cost by 8% and tool calls by 19%, but output
tokens increased by 42%, suggesting that Lumen's search results provided context
Claude used to generate more comprehensive code. This is the task where Lumen's
advantage is smallest, yet it still delivered cost savings.

---

## Quality Summary

| Language   | Baseline Rating | With-Lumen Rating | Quality Delta |
| ---------- | --------------- | ----------------- | ------------- |
| JavaScript | Perfect         | Perfect           | Same          |
| Python     | Perfect         | Perfect           | Same          |
| Dart       | Good            | Good              | Same          |
| PHP        | Good            | Good              | Same          |
| TypeScript | Good            | Good              | Same          |
| Ruby       | Good            | Good              | Same          |
| Go         | Good            | Good              | Same          |
| C++        | Good            | Good              | Same          |
| Rust       | Poor            | Poor              | Same          |

Quality was maintained in **all 9 tasks** — zero regressions. Where the baseline
produced Perfect patches, Lumen matched it. Where the baseline produced Good
patches, Lumen matched it. And where the task was too hard for the baseline
(Rust), Lumen didn't make it worse — it just made the failure cheaper.
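
The zero-regression claim reduces to an ordering check on the ratings. A minimal sketch, assuming the ordering Poor < Good < Perfect; the `regressions` helper is illustrative and not part of bench-swe:

```python
RATING_ORDER = {"Poor": 0, "Good": 1, "Perfect": 2}

# language -> (baseline rating, with-Lumen rating), copied from the table above.
RATINGS = {
    "JavaScript": ("Perfect", "Perfect"),
    "Python": ("Perfect", "Perfect"),
    "Dart": ("Good", "Good"),
    "PHP": ("Good", "Good"),
    "TypeScript": ("Good", "Good"),
    "Ruby": ("Good", "Good"),
    "Go": ("Good", "Good"),
    "C++": ("Good", "Good"),
    "Rust": ("Poor", "Poor"),
}


def regressions(ratings):
    """Return the languages where the with-Lumen rating is lower than the baseline rating."""
    return [
        lang
        for lang, (base, lumen) in ratings.items()
        if RATING_ORDER[lumen] < RATING_ORDER[base]
    ]


print(regressions(RATINGS))  # []
```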

---

## Key Findings

### 1. Cost Reduced in Every Language

Lumen reduced cost in **all 9 languages**. Together with time, it is one of
only two metrics that improved in every task. The range spans from -8% (C++) to
-76% (Dart):

| Language   | Baseline cost | With-Lumen cost | Delta      |
| ---------- | ------------- | --------------- | ---------- |
| Dart       | $0.634        | $0.153          | **-75.8%** |
| Rust       | $0.611        | $0.375          | **-38.7%** |
| JavaScript | $0.482        | $0.325          | **-32.6%** |
| TypeScript | $0.186        | $0.136          | **-27.1%** |
| PHP        | $0.186        | $0.136          | **-26.8%** |
| Ruby       | $0.539        | $0.411          | **-23.7%** |
| Python     | $0.119        | $0.096          | **-19.5%** |
| Go         | $0.646        | $0.568          | **-12.2%** |
| C++        | $1.102        | $1.014          | **-8.0%**  |

### 2. Output Token Reduction Is the Primary Driver

In 8/9 languages, output tokens dropped, by up to 82% for Dart. The one
exception is C++, where output tokens increased (+42%), likely due to more
comprehensive code generation. Fewer output tokens mean Claude explores less and
acts more:

| Language   | Baseline output | With-Lumen output | Delta      |
| ---------- | --------------- | ----------------- | ---------- |
| Dart       | 21,286          | 3,862             | **-81.9%** |
| JavaScript | 14,286          | 4,872             | **-65.9%** |
| TypeScript | 4,994           | 1,813             | **-63.7%** |
| PHP        | 1,936           | 796               | **-58.9%** |
| Python     | 1,710           | 1,092             | **-36.1%** |
| Rust       | 17,717          | 12,291            | **-30.6%** |
| Go         | 11,475          | 10,283            | -10.4%     |
| Ruby       | 6,143           | 5,581             | -9.1%      |
| C++        | 15,506          | 22,056            | +42.2%     |

### 3. Time Savings Scale with Exploration

The languages where the baseline needed the most exploration saw the largest
time reductions:

| Language   | Baseline time | With-Lumen time | Delta      |
| ---------- | ------------- | --------------- | ---------- |
| Dart       | 246.1s        | 50.9s           | **-79.3%** |
| JavaScript | 254.7s        | 119.3s          | **-53.2%** |
| Rust       | 309.7s        | 204.0s          | **-34.1%** |
| PHP        | 51.5s         | 34.0s           | **-34.0%** |
| TypeScript | 84.4s         | 56.3s           | **-33.3%** |
| Python     | 43.0s         | 30.6s           | **-28.8%** |
| Ruby       | 185.5s        | 165.2s          | **-10.9%** |
| Go         | 291.2s        | 264.1s          | -9.3%      |
| C++        | 370.7s        | 359.1s          | -3.1%      |

### 4. Search Calls Are Modest

Lumen typically uses 1-10 search calls per task. It supplements rather than
replaces other tool usage:

| Language   | Lumen search calls | Total tool calls (Lumen) | Total tool calls (baseline) |
| ---------- | ------------------ | ------------------------ | --------------------------- |
| Python     | 2                  | 5                        | 7                           |
| PHP        | 2                  | 7                        | 10                          |
| Rust       | 2                  | 9                        | 22                          |
| TypeScript | 1                  | 9                        | 6                           |
| Dart       | —                  | 14                       | 61                          |
| JavaScript | 2                  | 16                       | 18                          |
| Go         | 3                  | 35                       | 51                          |
| Ruby       | 10                 | 47                       | 53                          |
| C++        | 6                  | 51                       | 63                          |
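
To see how small a part of the with-Lumen tool usage the searches are, the table can be reduced to a per-language share. A minimal sketch, treating the dash in the Dart row as a missing value; the `search_share` helper is illustrative and not part of bench-swe:

```python
# language -> (Lumen search calls, total tool calls with Lumen), from the table above.
# None marks the row where the table shows a dash instead of a count.
CALLS = {
    "Python": (2, 5),
    "PHP": (2, 7),
    "Rust": (2, 9),
    "TypeScript": (1, 9),
    "Dart": (None, 14),
    "JavaScript": (2, 16),
    "Go": (3, 35),
    "Ruby": (10, 47),
    "C++": (6, 51),
}


def search_share(calls):
    """Percentage of with-Lumen tool calls that were searches, skipping missing counts."""
    return {
        lang: round(searches / total * 100, 1)
        for lang, (searches, total) in calls.items()
        if searches is not None
    }


shares = search_share(CALLS)
print(shares["Python"])  # 40.0
print(shares["Ruby"])    # 21.3
```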

### 5. Zero Quality Regressions

Lumen maintained patch quality in all 9 tasks. Two tasks achieved Perfect
ratings (JavaScript, Python) — identical patches to the gold standard. Six
achieved Good ratings with correct fixes via different approaches. Even the
one task too hard for either approach (Rust) showed no degradation — Lumen
just made the failure 39% cheaper.

### 6. Results Are Reproducible

All benchmark artifacts (raw JSONL streams, patch diffs, metrics, and judge
ratings) are committed to this repository. The benchmark framework is
deterministic in setup (same commit, same issue, same tools) while allowing
natural LLM variation in execution. The consistent direction of improvement
across 9 independent language benchmarks suggests the effect is not specific to
one language or codebase, though individual figures will vary from run to run.

---

## Reproduce

Requirements: Ollama running with `ordis/jina-embeddings-v2-base-code`, the
`claude` CLI, `git`, `go`, `jq`.

```bash
cd bench-swe

# Run all tasks, both scenarios
go run ./cmd/run --output ../bench-results/my-run

# Run a single language
go run ./cmd/run --filter go-hard --output ../bench-results/my-run

# Generate report from existing results
go run ./cmd/report --input ../bench-results/my-run
```

Results land in `bench-results/<run-id>/`. Each run produces:

- `<task>-<scenario>-raw.jsonl` — full Claude session stream
- `<task>-<scenario>-metrics.json` — extracted cost/time/tokens
- `<task>-<scenario>-patch.diff` — generated patch
- `<task>-<scenario>-judge.json` — judge rating and reasoning
- `<task>-<scenario>-judge.md` — judge rationale in markdown
- `detail-report.md` / `summary-report.md` — human-readable output
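
The naming pattern above makes it easy to check whether a results directory is complete. A minimal sketch that only relies on the `<task>-<scenario>-<suffix>` convention listed here; the `expected_artifacts` and `missing_artifacts` helpers are hypothetical and not part of bench-swe:

```python
from pathlib import Path

# Per-run artifact suffixes, taken from the list above.
SUFFIXES = ["raw.jsonl", "metrics.json", "patch.diff", "judge.json", "judge.md"]


def expected_artifacts(run_dir, tasks, scenarios):
    """List the per-run files a complete results directory should contain."""
    root = Path(run_dir)
    return [
        root / f"{task}-{scenario}-{suffix}"
        for task in tasks
        for scenario in scenarios
        for suffix in SUFFIXES
    ]


def missing_artifacts(run_dir, tasks, scenarios):
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_artifacts(run_dir, tasks, scenarios) if not p.exists()]


paths = expected_artifacts("bench-results/my-run", ["go-hard"], ["baseline", "with-lumen"])
print(len(paths))     # 10
print(paths[0].name)  # go-hard-baseline-raw.jsonl
```

Calling `missing_artifacts` with the same arguments after a run returns an empty list when every per-run file was written.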

The benchmark is entirely self-contained in `bench-swe/`. Tasks are defined as
JSON files in `bench-swe/tasks/`. To add a new language or difficulty level, add
a task JSON and re-run.

Current results are committed at `bench-results/`.