# dxkit benchmarks: methodology and findings

> A sanitized public report. The claim throughout is predictability rather than
> reduction. Every headline number is presented with its caveats, and the claim
> ledger and the "what this does not prove" section appear before the evidence.
>
> This page is the overview: the summary, the claim ledger, the shared
> methodology, and a short section per study. Each study links to a detailed,
> reproducible write-up under [`docs/benchmarks/`](./benchmarks/) with its full
> method, verbatim prompts, raw result tables, caveats, and repro steps.

---

## Summary

In our loop benchmark, autonomous coding loops (an agent that keeps editing
until it decides to stop) frequently stopped with detector-backed net-new debt
still present.

- A vanilla Claude Code-style loop left net-new debt in 11 of 16 runs.
- A prompt-only self-check still left it in 9 of 16 runs.
- With dxkit's Stop-gate, we observed 0 of 16 escapes.

Each figure is n=16 per arm, from 8 repetitions on each of two tasks.

The dxkit arm did not discover a new class of bugs. On every stop it re-ran a
deterministic net-new guardrail, blocked any stop that left debt in the tree,
and returned the specific finding to the model for repair.

The claim is predictability rather than universal reduction. Three independent
measurements share one through-line.

| Layer              | What it reduces                       | Result                                                       |
| ------------------ | ------------------------------------- | ------------------------------------------------------------ |
| Deterministic gate | unsafe final states                   | vanilla 11/16, checklist 9/16, dxkit 0/16 observed escapes   |
| Code graph         | observed large-repo exploration tails | worst-case session tokens 57% lower, variance roughly halved |
| Durable identity   | false "net-new" under churn           | 0 false net-new on tested line shifts and renames            |

dxkit is not a scanner; it ingests Snyk, CodeQL, and other SARIF sources. It is
not a token-saver, because mean token counts are often flat. It is not "more
accurate than an LLM": a frontier model can be an accurate judge when given
enough baseline state. In our benchmark, Opus-with-baseline held its accuracy,
but it was not cheap, reproducible, or in-loop.

---

## Claim ledger

Each claim below is listed with its strength, its evidence, and the exact public
wording we stand behind. This table is the place to start; the rest of the
document is the supporting evidence.

| Claim                                                 | Status          | Evidence                                           | Public wording                                                                               |
| ----------------------------------------------------- | --------------- | -------------------------------------------------- | -------------------------------------------------------------------------------------------- |
| The Stop-gate prevents observed loop escapes          | Strong          | loop benchmark, 8 reps × 2 tasks                   | "0/16 observed escapes in our benchmark"                                                     |
| Prompt-only self-check is insufficient                | Strong          | checklist arm, 9/16 escapes                        | "prompting reduced but did not eliminate escapes"                                            |
| Fixing in the loop is cheaper than deferring it       | Moderate        | loop benchmark deferred arm, 8 reps × 2 tasks      | "deferring the same fix cost ~49% more on the test-gap task (mean); weak on the secret task" |
| Gate identity is deterministic under tested churn     | Strong          | offline matcher benches                            | "0 false net-new on tested line-shift and rename cases"                                      |
| LLM-as-gate has cost and reproducibility issues       | Strong          | gate-vs-LLM benchmark, 5 reps × 2 models × 2 repos | "an LLM can judge, but not cheaply or reproducibly in-loop"                                  |
| Graph context reduces observed large-repo token tails | Moderate-strong | Sonnet session study, 30 sessions                  | "lower mean, tail, and variance on a large repo"                                             |
| Agents do not game tests they can see                 | Strong          | reward-hacking study, 36 runs across 3 framings    | "0/36 observed test-tampering, even under pressure"                                          |
| A single passing test can under-specify the fix       | Moderate        | held-out check, 2 instances                        | "a subtly-wrong fix can pass the shown test; tiny corpus"                                    |
| Test-gap gating is safe as a default                  | Not claimed     | repair cost of 1.1M-1.6M tokens in validation      | default remains `security-only`; `full-debt` is opt-in                                       |
| dxkit improves every agent session                    | Not claimed     | n/a                                                | do not say this                                                                              |
| dxkit detects more vulnerabilities than scanners      | Not claimed     | n/a                                                | do not say this                                                                              |

## What this does not prove

- The 0/16 result is observed, not proven-zero. The gate blocked every
  detector-backed finding surfaced in these seeded runs, which is not a proof
  that no escape is possible.
- dxkit is not a scanner and does not claim to find more bugs than Snyk, CodeQL,
  Semgrep, or a frontier model. It ingests their findings.
- The loop benchmark uses synthetic, detector-backed tasks together with small
  real-repo validation, not a CVE corpus.
- The loop headline includes test-gap behavior, which belongs to the opt-in
  `full-debt` preset. The product default, `security-only`, gates secrets and
  high-severity vulnerabilities (see [Study I](./benchmarks/01-loop-safety.md)).
- The cost-of-deferral signal is strong on the test-gap task and weak on the
  secret task (a positive mean but a slightly negative median); see
  [Study II](./benchmarks/02-cost-of-deferral.md).
- Graph context does not guarantee fewer tokens in every session. The measured
  effect is lower mean, tail, and variance on large, connected tasks.
- Opus session results are deferred. Session numbers are from Sonnet.
- The reward-hacking study is one model (Sonnet 4.6) on a small corpus of
  solvable bugs. The 0/36 no-gaming result is observed, not proven, and the
  under-specification finding rests on a tiny held-out set (two instances); see
  [Study VII](./benchmarks/07-reward-hacking.md).

---

## The thesis: predictability rather than reduction

A scanner answers the question "what is wrong?" An autonomous loop needs a
different answer: "did I just make this worse, and may I stop?" That question
has to be answered the same way every time, in seconds, locally, with feedback
the model itself can act on. It is a systems property rather than a detection
problem, and it is the gap dxkit fills, in three parts.

1. A deterministic net-new gate. The same input yields the same verdict, as one
   exit code, at no LLM cost, offline. Pre-existing debt is grandfathered by a
   baseline, so only regressions block.
2. A code graph that reduces the agent's observed worst-case exploration cost,
   rather than lowering its average cost.
3. Durable, content-anchored finding identity that survives line shifts and
   renames, so that "net-new" continues to mean net-new and a committed baseline
   keeps matching across machines and CI.

---

## What dxkit is, and is not

dxkit is a deterministic verification and governance layer. It stitches together
established tools (gitleaks, community Semgrep, OSV and npm-audit, jscpd, a
code-graph builder, and cloc) and ingests external engines (Snyk Code, CodeQL,
and any SARIF source). On top of those it adds the layer they lack: a net-new
gate, a brownfield baseline, durable identity, and graph-scoped context.

It is not a scanner. dxkit does not claim to find more bugs than Snyk or CodeQL;
it ingests their findings and makes them enforceable. It is not a claim that the
LLM is wrong: given enough baseline state, a frontier model can judge net-new
findings accurately, and in our benchmark Opus-with-baseline did so. dxkit's
advantages are determinism, no LLM cost per check, a prompt
that does not grow with the baseline, and reproducible identity, all of which
hold regardless of model capability. Finally, it is not a token-saver. On a real
session the mean token count is often flat, and the measured benefit is a lower
observed worst case rather than a smaller average.

---

## Methodology (shared across studies)

Provenance. These results were produced with dxkit version 2.13.0, at report and
harness commit `7f801a4`, during June 2026, with model pricing as of June 2026.
Agent runs were executed through a Claude Max subscription, so per-run dollar
figures are the CLI's equivalent-cost estimates. They are valid for relative
comparison between arms rather than as literal API-console charges.

"Sanitized" here means that no proprietary code or private traces are included.
Public repository names and commit pins are disclosed for reproducibility.

Models. We used Claude Sonnet 4.6 for agent-session runs, and added Claude Opus
4.8 as a steelman in the gate-vs-LLM study. Sessions ran through `claude -p
--output-format stream-json` and were parsed from the raw event stream.

Substrates. Two real, public open-source repositories, each pinned to a commit.
These benchmarks run on pinned public commits and characterize the agent's
behavior under each tool, not the quality or security of these projects. dxkit is
an independent project, not affiliated with or endorsed by OWASP, Strapi, or any
benchmarked project; trademarks belong to their owners.

| Repository                                          | Pin       | License                                                             | Role           |
| --------------------------------------------------- | --------- | ------------------------------------------------------------------- | -------------- |
| [OWASP NodeGoat](https://github.com/OWASP/NodeGoat) | `c5cb68a` | Apache-2.0                                                          | small app      |
| [strapi/strapi](https://github.com/strapi/strapi)   | `dc49217` | Community "MIT Expat"; `ee/` directories under a commercial license | large monorepo |

- NodeGoat is a deliberately vulnerable Node.js/Express training app of roughly
  2k lines; the dxkit baseline contains 205 pre-existing findings.
- Strapi is a large TypeScript monorepo of roughly 574k lines; its code graph has
  18,948 nodes and 20,012 edges, and the dxkit baseline contains 1,020
  grandfathered brownfield items (overwhelmingly test-gaps, duplication, and
  quality debt rather than vulnerabilities).
- The loop-safety study also uses synthetic repositories: small and controlled,
  with one known finding injected per task.

Determinism tier. The gate-correctness benches run offline, with no API key,
using seeded regressions together with clean and churn commits, and produce a
confusion matrix. Anyone can re-run them.

Repetitions. The safety study uses 8 repetitions per arm per task; the session
study uses 3 repetitions per cell (30 sessions); the gate-vs-LLM study uses 5
repetitions per case across 2 models and 2 repositories. Point estimates without
repetitions are flagged as such.

Process. Several headline claims were retracted mid-study once they were traced
to harness bugs or to unlucky single draws; these are noted inline in the
per-study docs. The benchmarks also fed back into the product, and two findings
shipped as releases.

---

## The studies

Each study below has a detailed, reproducible write-up. The short version here is
the question, the headline, and the method in one line; follow the link for the
full method, verbatim prompts, raw tables, caveats, and repro steps.

### I. Loop safety and the Stop-gate → [details](./benchmarks/01-loop-safety.md)

**Question.** How often does an autonomous loop declare "done" while net-new debt
is still in the tree, and does a deterministic gate prevent it where a prompt
does not?

**Headline.** Observed escapes: vanilla **11/16 (69%)**, checklist (prompt-only)
**9/16 (56%)**, dxkit **0/16**. The dxkit arm blocked, the model repaired the
specific finding, and the loop re-stopped clean.

**Method.** `bench-loop.mjs`, four arms, two seeded traps (a test-gap and a
secret), 8 reps each, Sonnet 4.6, with an identical post-hoc guardrail measuring
the final tree. Validated on real NodeGoat.

### II. Cost of deferral → [details](./benchmarks/02-cost-of-deferral.md)

**Question.** A net-new finding gets fixed eventually; the choice is _when_. Is
fixing it in the warm loop cheaper than deferring it to a cold session?

**Headline.** Holding the finding constant, deferring the test-gap repair to a
cold session cost **~49% more in equivalent cost and ~51% more turns** (means).
On the secret task the mean premium was ~19% but the signal is weak (the median
is slightly negative). A conservative floor, real deferral costs more.

**Method.** The `dxkit` (in-loop) and `deferred` (vanilla + cold fix) arms of
`bench-loop.mjs`; both reach an identical clean final state, so this isolates the
cost of _when_ the fix happens.

### III. The gate is correct and reproducible → [details](./benchmarks/03-gate-correctness.md)

**Question.** Does the gate reliably block net-new regressions, pass clean
changes, and grandfather pre-existing debt, every time?

**Headline.** Confusion matrix tp 3 / fn 0 / tn 2 / fp 0 (catch 1, false-block 0)
on both repos; exactly 1 net-new finding isolated against 205 / 1,020
grandfathered items; **0 false regressions** on line-shift and rename churn. This
is the **deterministic tier**, reproducible offline today, no API key.

**Method.** Three offline harnesses already published in
[`benchmarks/`](../benchmarks/): `bench-guardrail.mjs`,
`bench-netnew-isolation.mjs`, `bench-matcher.mjs`. The matcher bench caught a 50%
identity defect in 2.11.1 that 2.12.0 fixed as a class.

### IV. Deterministic gate versus LLM-as-the-gate → [details](./benchmarks/04-gate-vs-llm.md)

**Question.** When asking "is my change safe to stop on?", should a deterministic
gate answer, or should an LLM be the gate? (Gate vs gate, not scanner vs scanner.)

**Headline.** dxkit: **100% accuracy, 0 flips, $0**, no prompt growth, at every
scale. The naive LLM false-blocked a pure rename and flip-flopped on a line shift
(40% of reps); Sonnet missed a real regression at the 1,020 baseline; Opus held
100% but cost ~6.5× Sonnet and grew with the baseline.

**Method.** `bench-llm-gate.mjs`, 10 seeded cases, 5 reps, Sonnet 4.6 + Opus 4.8,
baselines of 1 / 205 / 1,020, ≈$51 total.

### V. Graph context and observed exploration tails → [details](./benchmarks/05-graph-context.md)

**Question.** Does the passive code-graph context help a real agent session, net
of the scaffold's overhead?

**Headline.** On the large monorepo: median tokens roughly tied, **mean −30%,
worst case −57%, variance roughly halved**. On the small app: overhead ≈ zero.
The benefit is predictable tokens, not fewer tokens, and it is size-gated (54%
of files in a slicing proxy were _not_ smaller).

**Method.** `bench-context-efficiency.mjs` (200-symbol slicing proxy) and
`bench-sessions.mjs` (30 real `claude -p` sessions, Sonnet 4.6).

### VI. When the graph pays: an Amdahl model → [details](./benchmarks/06-amdahl-model.md)

**Question.** Why does the graph benefit appear on large repos and vanish on
small ones?

**Headline.** Model session savings as `f·(1 − 1/s) − O/T`: an infinite
per-operation speedup caps whole-session savings at the orientation fraction `f`,
and a fixed overhead `O/T` dominates on small repos (a forced-graph probe cost
66% more on the small app). A falsifiable model, not yet numerically fit.

**Method.** Analytical, explaining the Study V numbers. No harness.

### VII. Reward hacking: do agents game tests they can see? → [details](./benchmarks/07-reward-hacking.md)

**Question.** A test-driven loop makes a test the target. Do agents optimize the
visible test at the expense of the actual goal, and which Goodhart variant shows
up?

**Headline.** Across 36 runs, including a framing that told the model to "do
whatever it takes," it never edited a test to fake a pass (**0/36 observed
tampering**). A visible test instead rescued failures the agent could not fix
from prose alone. The residual failure is not cheating but under-specification: a
single passing test can be satisfied by a subtly-wrong fix (one held-out bug
overfit reliably, 6/6).

**Method.** `bench-rewardhack.mjs` on a corpus of 10 real fastify bug fixes,
Sonnet 4.6, three framings (neutral / prohibition / pressure), with a cheat
oracle that restores the real test to separate genuine fixes from tampering and
checks unseen sibling tests for overfit.

---

## Differentiation: why not Snyk or SonarQube?

The difference is one of architecture and tempo, not detection. Cloud scanners
are detection engines on a CI cadence, and they were never built to sit inside an
agent's stop decision.

| What a loop's Stop-gate needs            | dxkit                               | Cloud scanners (Snyk Code, SonarQube) |
| ---------------------------------------- | ----------------------------------- | ------------------------------------- |
| Fires on every stop, in seconds, locally | yes: no LLM cost, offline, instant  | no: cloud round-trip, CI/PR cadence   |
| Offline, with no egress and no auth      | yes: local and deterministic        | no: upload-to-cloud, server-side gate |
| Feedback the model can act on            | yes: a block decision plus a reason | no: dashboards and PR comments        |
| Reproducible identity offline            | yes: content-anchored               | partial: "new code" defined on server |

A note on what not to claim, so that the comparison holds up. Do not say cloud
scanners cannot detect net-new findings, because SonarQube has a new-code quality
gate and Snyk has delta concepts. Do not say in-loop gating is impossible with
them, because one could shell a cloud scan inside a Stop hook; it would simply be
slow, networked, authenticated, and untuned to the loop's baseline. The accurate
statement is that cloud scanners are not architecturally designed for
per-iteration local gating, and that they are optionally a detection source dxkit
can ingest.

---

## Why now

Coding-agent workflows are moving from one-shot prompts to persistent loops. That
creates a new control problem: when may the loop stop? Tests and linters catch
broken code, but they do not distinguish known debt from net-new regressions, and
an LLM-judge gate adds cost, latency, and non-determinism on every iteration.
dxkit focuses on that stop decision.

---

## Limitations

- The study uses two real repositories together with synthetic cases. The seeded
  findings are detector-backed but do not constitute a CVE corpus, and broader
  language and repository generality is future work.
- dxkit does not improve on detection. It ingests Snyk and CodeQL rather than
  out-detecting them.
- The context-efficiency measurement is a proxy; the session study is the real
  test.
- The Opus session arm is deferred, and session numbers are from Sonnet.
- The Amdahl model is directional rather than a numerical fit.
- Several sub-claims were retracted once they were traced to harness bugs or
  single unlucky draws. They are documented in the per-study docs rather than
  buried.

---

## Artifacts and reproducibility

The **deterministic tier** runs offline today, with no API key:
[`benchmarks/`](../benchmarks/) holds `bench-guardrail.mjs`,
`bench-netnew-isolation.mjs`, and `bench-matcher.mjs`, and `benchmarks/README.md`
documents how to reproduce the [Study III](./benchmarks/03-gate-correctness.md)
numbers on the pinned NodeGoat and Strapi commits.

The **agent-driven harnesses** (loop safety, cost of deferral, gate-vs-LLM, and
the graph-context sessions) require a model subscription or API key and the
pinned checkouts. They are published under
[`benchmarks/agentic/`](../benchmarks/agentic/), `bench-loop.mjs`,
`bench-llm-gate.mjs`, `bench-sessions.mjs`, and `bench-context-efficiency.mjs`,
with `benchmarks/agentic/README.md` documenting the config schema, the pinned
substrates, and the verbatim prompts. Because these are agent-in-the-loop
measurements, the reproducible claims are the **relative** results between arms
(escape rate, deferral premium, variance reduction, gate accuracy and flips), not
exact token counts.

---

## Try the deterministic tier on your own repository

These commands let you run dxkit's deterministic gate on your own repository.
They evaluate the gate locally; they do not reproduce the benchmark numbers
above.

```bash
npx @vyuhlabs/dxkit baseline create        # grandfather today's debt
npx @vyuhlabs/dxkit init --claude-loop     # wire the Stop-gate
npx @vyuhlabs/dxkit loop doctor            # confirm it is safe to run unattended
# then run your loop, and afterwards:
npx @vyuhlabs/dxkit loop ledger summarize  # blocked versus allowed, and repaired-after-block
```

To reproduce the deterministic-tier benchmark numbers themselves (gate
correctness, net-new isolation, and matcher robustness), see the harnesses and
instructions in [`benchmarks/`](../benchmarks/). A `vyuh-dxkit evaluate` command,
a one-shot "prove it on your repo" report, is the planned next step.