# SirixDB Performance Benchmarks Two benchmarks with real, reproducible numbers: REST-API behavior under concurrency (validating the unordered-`executeBlocking` fix) and a core-level large-history benchmark (10,000 commits). Raw logs for every number in this document live in `/tmp/wave4-d/logs/` (`read-ladder.log`, `mixed.log`, `read-recheck.log`, `read-control-reseed.log`, `seed.log`, `large-history-10k.log`, `large-history-1k.log`, `server.log`). ## Environment | | | |---|---| | CPU | Intel Core i7-12700H (12th gen, 14 cores / 20 threads, hybrid P+E) | | RAM | 31 GiB | | Disk | WDC PC SN810 NVMe 1 TB, ext4 | | OS | Linux 6.8.0-107-generic | | JVM | Oracle GraalVM 25.0.3+9.1 (JDK 25), ZGC | | Server | `sirix-rest-api-1.0.0-alpha22-fat.jar`, `-Xms1g -Xmx4g`, CI launch flags, HTTP (no TLS) | | Auth | Keycloak 25.0.1 from `bundles/sirix-rest-api/src/test/resources/docker-compose.yml` (test realm `sirixdb`, user `admin`) | | Topology | Client and server co-located on the same host, loopback, HTTP/1.1 keep-alive | Benchmark sources (zero dependencies beyond the JDK + sirix-core test classpath): - `bundles/sirix-core/src/test/java/io/sirix/bench/RestConcurrencyBenchMain.java` — load generator (virtual threads, closed loop, exact percentiles over all per-request latencies) - `bundles/sirix-core/src/test/java/io/sirix/bench/LargeHistoryBenchMain.java` — large-history core benchmark (compile/run instructions in the class javadoc; no gradle needed) --- ## Benchmark 1 — REST API under concurrency **What it validates.** The REST handlers previously ran every blocking task on the Vert.x context's **ordered** worker queue: all blocking work of the verticle executed strictly serially, server-wide, so p95 latency at concurrency C approached `C × p95(1)` (the "~20× at c=16" finding from the 2026-06-09 audit, measured against the pre-fix server). The fix passes `ordered = false` (`AbstractGetHandler.kt`), keeping per-resource write exclusivity via sirix's single-writer lock. The pre-fix server was not re-measured here (it would require a rebuild); the validation criterion is that the ratio is now far from the concurrency level and near-flat until CPU saturation. **Setup.** One database/resource seeded with a 1.71 MB JSON document (`{"hot":0,"data":[…20,000 objects…]}`) + 5 single-field update revisions (6 revisions total). Measured request: `GET /bench-db/big?maxLevel=4&maxChildren=50&withMetaData=nodeKeyAndChildCount` (~18 KB response). Closed loop from N virtual threads, 5 s warm-up (excluded), 30 s measured window, every request timed; zero HTTP/transport errors in every run. ### Read-only ladder (6-revision resource) | Concurrency | Throughput (req/s) | p50 (ms) | p95 (ms) | p99 (ms) | max (ms) | errors | |---:|---:|---:|---:|---:|---:|---:| | 1 | 2,155 | 0.41 | 0.87 | 1.19 | 6.5 | 0 | | 4 | 7,001 | 0.55 | 0.77 | 0.90 | 8.7 | 0 | | 8 | 8,826 | 0.84 | 1.37 | 1.68 | 9.4 | 0 | | 16 | 10,451 | 1.36 | 2.68 | 4.17 | 13.3 | 0 | | 32 | 11,030 | 2.64 | 5.58 | 7.88 | 14.9 | 0 | ### Headline: concurrency ratio > **p95(c=16) / p95(c=1) = 2.68 ms / 0.87 ms ≈ 3.1×** (old ordered-queue behavior > approached ~16×). A later same-process re-run on a freshly seeded resource gave > 2.0 ms / 0.7 ms ≈ **2.9×** (`read-control-reseed.log`), so ~3× is stable. Throughput scales 1→16 by 4.85× and only +5.5% from 16→32 while p95 doubles — classic CPU saturation (client and server share the 20 hardware threads), not queue serialization. Minor oddity: p95(c=4)=0.77 ms is *below* p95(c=1)=0.87 ms; the c=1 run executed first on the freshly started server, so its tail still contains some JIT/page-cache warmth — if anything the true ratio is slightly better than reported. ### Mixed workload: 16 readers + 1 writer (single-field commit per request) | Role | Throughput | p50 (ms) | p95 (ms) | p99 (ms) | max (ms) | |---|---:|---:|---:|---:|---:| | 16 readers | 1,791 req/s | 4.7 | 10.8 | 224.1 | 448.7 | | 1 writer (commits) | 39.1 commits/s | 14.0 | 33.7 | 283.5 | 467.5 | A single small-commit writer (~39 commits/s) costs the readers ~6× throughput vs the read-only c=16 run and introduces a heavy tail (p99 224 ms vs 4.2 ms). Part of this is *not* classic write-contention — see the anomaly below: the commits themselves grow the revision history, which slows every subsequent read. ### ANOMALY (controlled): read-latest latency degrades with revision count After the mixed run the resource had ~1.4k revisions (6 seed + ~1,370 writer commits). Re-running the *pure read-only* bench on the same server process, then deleting and re-seeding back to 6 revisions and running it again: | Resource state | c | Throughput (req/s) | p50 (ms) | p95 (ms) | p99 (ms) | max (ms) | |---|---:|---:|---:|---:|---:|---:| | 6 revisions (ladder) | 1 | 2,155 | 0.41 | 0.87 | 1.19 | 6.5 | | ~1.4k revisions | 1 | 501 | 1.6 | 2.2 | 2.5 | 307.6 | | 6 revisions (re-seeded, same process) | 1 | 2,363 | 0.4 | 0.7 | 0.9 | 7.1 | | 6 revisions (ladder) | 16 | 10,451 | 1.36 | 2.68 | 4.17 | 13.3 | | ~1.4k revisions | 16 | 1,042 | 7.8 | 13.9 | 245.4 | 1,312.3 | | 6 revisions (re-seeded, same process) | 16 | 12,426 | 1.1 | 2.0 | 3.3 | 70.4 | Reading the **latest** revision of the **same-sized document** is ~4× slower at c=1 and ~10–12× lower throughput at c=16 once the resource carries ~1.4k revisions — and fully recovers after re-seeding on the same JVM, ruling out server aging/GC/heap as the cause. The degradation is a function of revision count alone. Code-level correlate (hypothesis, consistent with Benchmark 2's measurements): the REST layer opens the database + resource session per request, and every storage open eagerly runs `loadRevisionFileDataIntoMemory` + `loadRevisionIndex` (`bundles/sirix-core/src/main/java/io/sirix/io/StorageType.java`, `FILE_CHANNEL.getInstance`) — O(revisions) work per request, measured at ~0.46 µs/revision in-core (see below). At c=16 this O(R)-per-request work multiplies across all workers and saturates CPU early; the 245 ms / 1.3 s tail spikes under concurrency are unexplained by the linear term alone and deserve profiling (suspects: contended Caffeine revision-data cache loads and allocation bursts from per-open array copies). **Follow-up: cache the loaded revision index across request-scoped opens (it is already held in a global `REVISION_INDEX_REPOSITORY`, but the eager per-open reload dominates).** --- ## Benchmark 2 — Large history (10,000 commits, core API) **Setup.** `LargeHistoryBenchMain`: one resource (FILE_CHANNEL storage, SLIDING_SNAPSHOT versioning, rolling hashes, path summary on), initial tiny document `{"counter":0,"label":…,"tags":[…]}`, then 9,999 explicit `setNumberValue` + `wtx.commit()` commits (no auto-commit batching) on one field. **Cold** = first run after `Databases.clearGlobalCaches()` (in-process caches dropped; OS page cache stays warm). **Warm** = median of 7 runs. A 3-iteration JIT warm-up precedes each metric so cold isolates cache state, not compilation. **Build:** 10,000 commits in **48.6 s** (4.86 ms/commit average), 15.9 MB on disk (~1.6 KB/commit). Single 1k/10k-commit runs; treat small deltas (<2×) as noise. | Metric | Cold (ms) | Warm median (ms) | |---|---:|---:| | open database + resource session (incl. close) | 6.28 | 4.64 | | `getHistory()` full list [10,000 revisions] | 50.77 | 3.05 | | `getHistory(100)` most-recent page | 0.90 | 0.03 | | `beginNodeReadOnlyTrx(1)` + 3-step read | 0.54 | 0.018 | | `beginNodeReadOnlyTrx(5000)` + read | 0.46 | 0.018 | | `beginNodeReadOnlyTrx(10000)` + read (latest) | 0.33 | 0.018 | | `diff(1, 2)` (BasicJsonDiff) | 1.74 | 0.18 | | `diff(9999, 10000)` | 1.31 | 0.29 | | serialize revision 1 (full document) | 1.13 | 0.20 | | serialize revision 10000 | 1.04 | 0.21 | **Flat (good):** random-revision access is position-independent — trx open+read is ~18 µs warm whether the revision is the 1st, 5,000th, or 10,000th; diff and full-document serialization are likewise flat across history position. `getHistory(100)` does *not* scan the full history (0.9 ms cold vs 50.8 ms for the full list — properly paged). Cold `getHistory()` of all 10k revisions is 50.8 ms (~5 µs/revision) and the (alpha20) history cache brings warm calls to 3 ms. ### SCALING FLAG 1: session open is linear in history length Warm open+close of the same resource at three history sizes (200-commit run from the smoke log, plus `large-history-1k.log` / `large-history-10k.log`): | Revisions | Warm open (ms) | Cold open (ms) | |---:|---:|---:| | 200 | 0.42 | 0.58 | | 1,000 | 0.95 | 1.24 | | 10,000 | 4.64 | 6.28 | Linear fit ≈ 0.4 ms fixed + **~0.46 µs per revision**. Cause (by code inspection): every storage open eagerly loads all per-revision file data and rebuilds/loads the revision index (`StorageType.FILE_CHANNEL.getInstance` → `loadRevisionFileDataIntoMemory` + `loadRevisionIndex`). Harmless at 10k revisions in absolute terms (4.6 ms), but it is exactly the per-request cost that produces the REST anomaly above, and extrapolates to ~0.5 s per open at 1 M revisions. ### SCALING FLAG 2: per-commit cost grows linearly with history Per-1,000-commit build rate declines monotonically once JIT-warm: | Commits | 2k | 3k | 4k | 5k | 6k | 7k | 8k | 9k | 10k | |---|---:|---:|---:|---:|---:|---:|---:|---:|---:| | commits/s (last 1000) | 289 | 275 | 246 | 225 | 199 | 187 | 176 | 163 | 154 | Per-commit cost roughly doubles from ~3.5 ms to ~6.5 ms over 10k commits (≈ +0.33 µs per existing revision per commit ⇒ cumulative build is O(R²)). A code-confirmed O(R)-per-commit path exists: `RevisionIndex.withNewRevision` (`bundles/sirix-core/src/main/java/io/sirix/io/RevisionIndex.java`) copies the full timestamp/offset arrays **and rebuilds the Eytzinger search layout on every commit** (the comment "O(n) but only on commit" acknowledges it). Whether that copy dominates the measured +0.33 µs/rev/commit, or the eager revision-data reload / another O(R) path contributes, needs a profile — flagged for follow-up. An incremental (append-only or batched) index update would remove the quadratic term. --- ## Methodology notes & honest caveats - **Local loopback, single machine, co-located client+server.** No network latency; the load generator competes with the server for CPU, so absolute throughput ceilings (≈11–12k req/s) understate a dedicated server and saturation onset (~c=16) is partly client-induced. Latency *ratios* between concurrency levels are the meaningful signal. - **Closed-loop load** (each worker waits for its response): percentiles are exact over every measured request (~65k–373k samples/run), but there is no coordinated-omission correction; under saturation closed loops self-throttle. - **"Cold" ≠ cold disk.** `Databases.clearGlobalCaches()` drops sirix's in-process caches only; the OS page cache stays warm (dropping it needs root). True cold-disk numbers would be higher. - **Single runs** per configuration (plus the c=1/c=16 re-run and re-seed control for Bench 1, and 1k/10k history sizes for Bench 2). Variance was not characterized beyond those repeats; treat <2× differences as noise, the flagged 3×–12× effects replicated. - **Tiny document in Benchmark 2** (4-key object) — deliberately isolates per-revision overheads from data-volume effects; it says nothing about large-document scaling. - **HTTP/1.1 forced** in the client so concurrency = real connections (no h2 multiplexing). - Auth token fetched once per run (admin/admin against the test realm); token validation is part of every measured request, as in production. ## Reproducing ```bash # Core benchmark (no gradle; uses prebuilt classes + the captured test classpath) javac --enable-preview --release 25 --add-modules jdk.incubator.vector \ -cp "$(cat /tmp/sirix-test-cp.txt)" -d /tmp/wave4-d/classes \ bundles/sirix-core/src/test/java/io/sirix/bench/*.java java --enable-preview --add-modules jdk.incubator.vector --enable-native-access=ALL-UNNAMED \ --add-opens java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED \ -Xms1g -Xmx4g -cp "/tmp/wave4-d/classes:$(cat /tmp/sirix-test-cp.txt)" \ io.sirix.bench.LargeHistoryBenchMain 10000 # REST benchmark (cd bundles/sirix-rest-api/src/test/resources && docker compose up -d keycloak) # wait for realm + users java -Xms1g -Xmx4g -Duser.home=/tmp/wave4-d/server-home --enable-preview \ --enable-native-access=ALL-UNNAMED --add-modules=jdk.incubator.vector \ --add-exports=java.base/sun.nio.ch=ALL-UNNAMED ... (CI flag set, see .github/workflows/gradle.yml) \ -jar bundles/sirix-rest-api/build/libs/sirix-rest-api-1.0.0-alpha22-fat.jar \ -conf bundles/sirix-rest-api/src/main/resources/sirix-conf.json & java --enable-preview -cp /tmp/wave4-d/classes io.sirix.bench.RestConcurrencyBenchMain seed http://localhost:9443 bench-db big 20000 5 java --enable-preview -cp /tmp/wave4-d/classes io.sirix.bench.RestConcurrencyBenchMain read http://localhost:9443 bench-db big 16 5 30 java --enable-preview -cp /tmp/wave4-d/classes io.sirix.bench.RestConcurrencyBenchMain mixed http://localhost:9443 bench-db big 16 5 30 (cd bundles/sirix-rest-api/src/test/resources && docker compose down) ``` --- ## Post-fix re-measurement (same day): revision-history scaling The two anomalies above (read throughput collapsing with revision count; session opens linear in history) were root-caused to `IOStorage.loadRevisionIndex` re-reading **every** revision record on **every** storage open, while the in-JVM `RevisionIndexHolder` was already kept current by the writer. Fixes: - `loadRevisionIndex` now reloads only when the in-memory index size disagrees with the on-disk revision count (covers fresh processes AND out-of-band truncation in both directions) — session opens are O(1) in history. - `RevisionIndex.withNewRevision` appends amortized (capacity-doubling shared arrays + deferred Eytzinger rebuild once the uncovered tail exceeds max(64, size/8); searches bridge with a bounded binary search on the tail) — removes the former O(size) copy + rebuild per commit (O(size²) cumulative). ### Large-history core benchmark, 10,000 commits (before → after) | Metric | Before (warm) | After (warm) | |---|---:|---:| | open database+session | 4.64 ms (linear: ~0.46 µs/revision) | **0.18 ms, flat** | | getHistory() full [10k] | 3.05 ms | 0.84 ms | | everything else | — | unchanged (already flat) | The per-commit rate decline was subsequently root-caused and FIXED (same day). The hunt eliminated, by direct experiment: the revision index (above), `storeNodeHistory` record growth, GC, buffer-pool occupancy, per-transaction state, syscall-count growth in opens/stats/preads/fsyncs, and file-extent fragmentation. Wall-clock profiling (async-profiler, `wall` event) then showed the late-phase main thread dominated by `access(2)` — and a syscall census confirmed a perfect quadratic: **50,196,928 access() calls over a 10k-commit build (~50M of them ENOENT), vs 520k for 1k commits (Σi ≈ N²/2)**. Root cause: `AbstractResourceSession.initializeIndexController` probed `revision.xml, (revision-1).xml, …, 0.xml` with one `Files.exists` per step to find the most recent index definitions — O(revision) syscalls per index- controller creation, and a new controller is created per commit. With no secondary indexes (the default), NO file ever exists and every commit walked the entire history. Fix: one directory listing picking the max-numbered file ≤ revision (an empty directory short-circuits instantly). Second contributor fixed: the commit protocol issued 7 sync calls per commit (strace: 5 fsync + 2 fdatasync). The t3 `forceAll` was fully redundant with `writeUberPageReference`'s internal write-ahead barrier (which flushes the buffered tail FIRST and then forces both files — the t3 barrier ran while the tail was still buffered and covered strictly less), and the commit-acknowledge barrier only needs a data-only `fdatasync` (the primary beacon is an in-place overwrite; the revisions file saw no writes after its own barrier). New protocol: **4 sync calls** — fsync(data) write-ahead, fsync(revisions), fdatasync(data) beacon-order, fsync(data) acknowledge. The two data barriers that cover the tail append stay full fsyncs deliberately: the power-loss simulation's metadata-split model (stricter than POSIX fdatasync) loses acked revisions if size durability leans on fdatasync semantics. Re-validated GREEN by the power-loss gate (force-contract AND metadata-split) and the SIGKILL gate. ### Result (same 10k-commit build) | | Before | After | |---|---:|---:| | total build | 48.4 s (4.84 ms/commit) | **20.5 s (2.05 ms/commit)** | | commit rate at depth 10k | ~150 commits/s, declining | **~570 commits/s, FLAT** | | access() syscalls | 50.2M (quadratic) | O(commits) | | sync calls per commit | 7 | 4 | The decline is eliminated, not reduced — the curve is flat, so 100k+ revision builds no longer degrade. On "is one fsync per commit enough": with per-commit acknowledged durability and the dual-file layout, the logical floor is two ordered barriers (write-ahead: data+revisions durable before beacons; acknowledge: primary beacon durable before return), which costs 4 calls across two files as implemented. Reaching ONE explicit sync call per commit is possible by opening the revisions channel and a dedicated beacon channel with `StandardOpenOption.DSYNC` — tiny synchronous writes (FUA on NVMe, cheaper than full cache flushes) make the record and beacons durable at write-return, leaving a single explicit `fdatasync` for the data tail. Documented as a follow-up design; the ordering guarantees stay identical. ### REST read throughput at high revision count (the collapse scenario) Re-run with the rebuilt fat jar, `auth.mode=none` (no per-request JWT validation — within-run comparisons are the meaningful ones), same generated document, history grown to **1,901 revisions** via the mixed workload: | Cell | Before fix (~1.4k revs) | After fix (1.9k revs) | |---|---:|---:| | read c=1 | 501 req/s, p50 1.6 ms | **2,897 req/s, p50 0.29 ms** | | read c=16 | 1,042 req/s, p99 245 ms | **18,361 req/s, p99 1.84 ms** | | fresh-resource baseline c=1 (same run) | 2,155–2,273 req/s | 2,273 req/s | Read performance at 1.9k revisions now **exceeds** the fresh-resource baseline — the history-depth degradation is eliminated, not merely reduced. Zero errors in both measured cells. ### Follow-up implemented: write-through (O_SYNC/O_DSYNC) commit protocol The "one explicit sync" design was implemented: the revisions record goes through an `O_SYNC` channel (durable incl. size at write-return), both beacon slots through an `O_DSYNC` channel (in-place overwrites; write-return gives secondary-before-primary ordering and makes the primary's return the commit acknowledge). Per commit: ONE explicit `fsync` (data tail write-ahead) plus three small write-through writes; the async acknowledge machinery is gone and `Writer.writeUberPageReference` now carries a durable-on-return contract. Measured on this workstation (ext4, consumer NVMe with FUA): **parity** with the 4-sync protocol (2.09 vs 2.05 ms/commit) — three serialized write-through round-trips cost about what the saved flushes did here. The win is structural (simpler, contract-explicit) with expected gains on server stacks where FUA writes are materially cheaper than cache flushes. All power-loss and SIGKILL gates re-validated green (the simulation now models per-write durability for write-through channels).