# sirix-bench native image (GraalVM)
The `:sirix-query:nativeCompile` task builds a Graal native image for
`io.sirix.query.bench.ScaleBenchMain` → `build/native/nativeCompile/sirix-bench`.
## Build modes
- `./gradlew :sirix-query:nativeCompile`:
  default `quickBuild = true` (good enough for iterative dev).
- `./gradlew :sirix-query:nativeCompile -Pquick-build=false`:
  full `-O3 -march=native` build. On this codebase/host (20 cores, GraalVM 25.0.3)
  both modes finish in ~2 min; the compile phase is only ~70-80 s because the
  reachable method count is modest. Peak builder RSS is ~6 GB.
The builder heap defaults to `-XX:MaxRAMPercentage=65` (coexists with a live
Gradle daemon on a 31 GB host). Pass `-Pnative.builderXmx=10g` to hard-cap it
when other agents/suites share the box.
Override the main class/image name to reuse this recipe for the write smoke:
`-Pnative.mainClass=io.sirix.query.bench.NativeWriteSmokeMain -Pnative.imageName=sirix-write-smoke`.
## PGO (profile-guided optimization)
```bash
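# 1. build an instrumented image
./gradlew :sirix-query:nativeCompile -Ppgo-instrument
# 2. run a representative workload from the repo root (same binary path as the perf commands)
./bundles/sirix-query/build/native/nativeCompile/sirix-bench 1000000 true 5 # produces default.iprof
# 3. rebuild with the collected profile
./gradlew :sirix-query:nativeCompile -Ppgo=default.iprof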
```
## Arena blocker — RESOLVED (`io.sirix.io.SharedArenas`)
The write path is now native-clean. `MMStorage` (and the `ProjectionIndexHOTStorage`
scratch) used to map each generation into an `Arena.ofShared()` and `close()` it on
remap/teardown; closing a shared arena in a native image requires
`-H:+SharedArenaSupport`, which **GraalVM 25 cannot combine with the Vector API**.
`SharedArenas` routes all shared-access arena creation through a pluggable strategy:
`Arena.ofShared()` + explicit close on HotSpot (unchanged, deterministic unmap),
`Arena.ofAuto()` in a native image (same cross-thread access semantics, GC-reclaimed,
`close()` is a no-op). The full create/shred/commit/reopen/append-remap/time-travel
lifecycle passes in a native image — see `NativeWriteSmokeMain` (`:sirix-query:writeSmoke`
on the JVM, or build natively with `-Pnative.mainClass=io.sirix.query.bench.NativeWriteSmokeMain`).
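
The strategy switch described above can be sketched roughly as follows. This is a minimal illustration, not the actual `io.sirix.io.SharedArenas` source: the class name `SharedArenaSketch`, the `Handle` wrapper, and the `create(boolean)` overload are hypothetical. The native-image check assumes the `org.graalvm.nativeimage.imagecode` system property is set inside an image. A raw `Arena.ofAuto()` throws `UnsupportedOperationException` from `close()`, so a no-op close needs a small wrapper of this kind.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

/** Hypothetical sketch of a runtime-dependent shared-arena strategy. */
final class SharedArenaSketch {

    /** Pairs an arena with the close behaviour that suits the current runtime. */
    static final class Handle implements AutoCloseable {
        private final Arena arena;
        private final boolean closeExplicitly;

        Handle(Arena arena, boolean closeExplicitly) {
            this.arena = arena;
            this.closeExplicitly = closeExplicitly;
        }

        Arena arena() {
            return arena;
        }

        @Override
        public void close() {
            // Arena.ofAuto() rejects close(), so the native-image variant does nothing
            // here and leaves reclamation to the garbage collector.
            if (closeExplicitly) {
                arena.close();
            }
        }
    }

    /** True when the image-code property is present (native image build or run time). */
    static boolean inNativeImage() {
        return System.getProperty("org.graalvm.nativeimage.imagecode") != null;
    }

    static Handle create() {
        return create(inNativeImage());
    }

    /** Overload with an explicit flag so both branches can be exercised on a JVM. */
    static Handle create(boolean nativeImage) {
        return nativeImage
            ? new Handle(Arena.ofAuto(), false)   // GC-reclaimed, close() is a no-op
            : new Handle(Arena.ofShared(), true); // deterministic release on close()
    }

    public static void main(String[] args) {
        try (Handle handle = create()) {
            MemorySegment segment = handle.arena().allocate(64);
            System.out.println("allocated " + segment.byteSize() + " bytes");
        }
    }
}
```

Callers keep a single `try`/`close()` shape on both runtimes; only the factory decides whether the close actually releases the memory.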
### Why not `-H:+SharedArenaSupport` (GraalVM 25.0.3 matrix, reproduced)
| Build/run config (GraalVM 25.0.3, Vector API reachable) | Outcome |
|---|---|
| `-H:+SharedArenaSupport` **and** `-H:+VectorAPISupport` | build **rejected** up front: `Error: Support for Arena.ofShared is not available with Vector API support. Either disable Vector API support ... or replace usages of Arena.ofShared with Arena.ofAuto` |
| `-H:+SharedArenaSupport`, no `-H:+VectorAPISupport`, vector classes reachable | build **crashes** during `[6/8] Compiling`: `GraalError: ... AbstractLayout.varHandleInternal was not inlined and could access a session` at `SubstrateOptimizeSharedArenaAccessPhase.cleanupClusterNodes(:772)` (identical on 25.0.1 and 25.0.3) |
| `Arena.ofShared()` + `close()`, no `-H:+SharedArenaSupport` | builds; at **run time** the *close* throws `UnsupportedFeatureError: Support for Arena.ofShared is not active` — creation/mapping/cross-thread reads all succeed, only `close()` is gated |
| `Arena.ofShared()` **without** `close()`, no flag | works, but leaks the mapping every remap — rejected in favour of `Arena.ofAuto()` |
| **`Arena.ofAuto()`, no flag, `-H:+VectorAPISupport`** (current) | **works** — native write smoke passes (~50 ms), SIMD kernels keep AVX codegen |
The restriction is on shared-arena **close**, not creation; since the SIMD kernels are
non-negotiable for query speed we keep `-H:+VectorAPISupport` and drop the shared-arena
close instead (exactly what the builder's own error message recommends).
## Measured: native vs JVM (GraalVM 25.0.3, `-O3 -march=native`, no PGO)
Apples-to-apples: shred a 1 M-record DB **once**, then run the 9-query
`ScaleBenchMain` workload against that same on-disk DB on both runtimes
(`-Dsirix.db=$DB` in the commands below), so only query execution differs. Both use the columnar
`SirixVectorizedExecutor` over `jdk.incubator.vector` (AVX on both). Build:
```bash
# full -O3 -march=native (quickBuild=false); cap the builder heap on a shared box
./gradlew :sirix-query:nativeCompile -Pquick-build=false -Pnative.builderXmx=10g
DB=/tmp/sirix-perf-db
./bundles/sirix-query/build/native/nativeCompile/sirix-bench -Dsirix.shredDbPath=$DB 1000000 true 0 # shred once
perf stat -- ./bundles/sirix-query/build/native/nativeCompile/sirix-bench -Dsirix.db=$DB 1000000 true 30
```
### Warm steady-state (reuse DB, 30 iters): native wins 1.5–17×
| query | JVM avg | native avg | factor |
|---|---|---|---|
| filterCount | 0.630 ms | **0.037 ms** | 17× |
| groupByDept | 0.323 ms | **0.067 ms** | 4.8× |
| sumAge | 0.300 ms | **0.028 ms** | 11× |
| avgAge | 0.191 ms | **0.029 ms** | 6.6× |
| minMaxAge | 0.262 ms | **0.049 ms** | 5.3× |
| groupBy2Keys | 0.282 ms | **0.096 ms** | 2.9× |
| filterGroupBy | 0.155 ms | **0.086 ms** | 1.8× |
| countDistinct | 0.092 ms | **0.060 ms** | 1.5× |
| compoundAndFilterCount | 0.128 ms | **0.065 ms** | 2.0× |
`perf stat` deltas (whole 30-iter run): native retires more of its work
(`tma_retiring` 71 % vs JVM 53 %), runs at higher IPC (3.18 vs 2.90 core),
and has a lower branch-miss rate (0.05 % vs 0.51 %) — no deopt guards, no
tiering, fully-hydrated data. This is the headline native query win and it
holds **even though predicate codegen falls back** (below).
### The runtime-codegen-fallback loss (the real native cost)
`SirixVectorizedExecutor.compileToClass()` JIT-emits a specialised
`BatchPredicate` class per distinct predicate via
`MethodHandles.Lookup.defineHiddenClass`. **A native image cannot define
classes at runtime**, so the first call of every distinct predicate throws
```
UnsupportedFeatureError: Classes cannot be defined at runtime by default
when using ahead-of-time Native Image compilation.
Tried to define class 'io/sirix/query/scan/SirixBatchPred$1'
```
and the executor falls back to the **interpreted** op-array predicate
(`evalCompiledBatch`). Correctness is identical, and *warm* the interpreter is
actually faster than the JVM's compiled predicate (above). But on a **true cold
first query** (`-Dsirix.noWarmup=true`, iter 1) the interpreted full-scan over
freshly page-faulted mmap data, with no JIT to amortise, is far slower than the
JVM:
| query (cold iter-1, no warmup) | JVM | native | note |
|---|---|---|---|
| filterCount (first predicate query) | 11.9 s | **43 s** | cold mmap hydrate + interpreted predicate |
| groupByDept | 1.48 s | 13.6 s | |
| sumAge | 0.29 s | 4.7 s | |
| avgAge / minMaxAge (pure aggregate, no predicate codegen) | ~1–2 ms | **~0.1 ms** | native already optimal — no fallback on this path |
So the earlier "cold iter-1 → 0.22 ms, 4000×" claim only held with a covering
**projection index** (`-Dprojection=true`, the `ProjectionIndexByteScan` path),
which sidesteps predicate codegen entirely — not the default generic predicate
path measured here. **Cheap mitigation applied:** `COMPILED_PREDICATE_ENABLED`
now defaults off in a native image (`SirixVectorizedExecutor`, gated on the
`org.graalvm.nativeimage.imagecode` property), so we skip the doomed classfile
build + throw/catch on the first call of each predicate and the noisy stderr
dump; the result is unchanged. The real fixes are larger and out of scope here:
emit the predicate variants at **build time** into static fields
(`--initialize-at-build-time`), or always use the projection-index scan for
predicate-bearing queries in native images.
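
The gating idea can be illustrated with a small sketch. Everything here is hypothetical and is not the `SirixVectorizedExecutor` code: the class `PredicateGateSketch`, the `{opcode, constant}` op-array encoding, and the `Supplier`-based stand-in for runtime class generation are assumptions made for illustration. The only detail taken from the text is that the gate keys off the `org.graalvm.nativeimage.imagecode` property.

```java
import java.util.function.IntPredicate;
import java.util.function.Supplier;

/** Hypothetical sketch: skip runtime codegen in a native image, interpret instead. */
final class PredicateGateSketch {
    static final int OP_GT = 0;
    static final int OP_LT = 1;
    static final int OP_EQ = 2;

    /** Compiled predicates default on for a JVM, off when the image-code property is set. */
    static boolean compiledPredicatesEnabled(String imageCode) {
        return imageCode == null;
    }

    /** Interpreted path: ops is a flat array of {opcode, constant} pairs, AND-ed together. */
    static int countInterpreted(int[] column, int[] ops) {
        int matches = 0;
        for (int value : column) {
            boolean ok = true;
            for (int i = 0; ok && i < ops.length; i += 2) {
                int constant = ops[i + 1];
                switch (ops[i]) {
                    case OP_GT: ok = value > constant; break;
                    case OP_LT: ok = value < constant; break;
                    case OP_EQ: ok = value == constant; break;
                    default: throw new IllegalArgumentException("unknown opcode " + ops[i]);
                }
            }
            if (ok) {
                matches++;
            }
        }
        return matches;
    }

    /** Uses the code generator only when the gate is open; otherwise never invokes it. */
    static int filterCount(int[] column, int[] ops, boolean compiledEnabled,
                           Supplier<IntPredicate> codegen) {
        if (!compiledEnabled) {
            return countInterpreted(column, ops);
        }
        IntPredicate compiled = codegen.get();
        int matches = 0;
        for (int value : column) {
            if (compiled.test(value)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        boolean enabled = compiledPredicatesEnabled(
            System.getProperty("org.graalvm.nativeimage.imagecode"));
        int[] ages = {10, 25, 40, 55};
        int[] ops = {OP_GT, 20, OP_LT, 50}; // 20 < age < 50
        // Stand-in for runtime class generation; a lambda here, not defineHiddenClass.
        Supplier<IntPredicate> codegen = () -> v -> v > 20 && v < 50;
        System.out.println(filterCount(ages, ops, enabled, codegen));
    }
}
```

Because the gate is checked before the generator is touched, a native image never pays for the class build or the throw/catch, and both paths return the same count.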
### Ingest regression (unchanged): native shred ~8× slower
Native shred measured here: **25.4 K rec/s** (1 M records), vs JVM **~200 K
rec/s**. Profile (prior `perf stat`, 200 K records) showed IPC 3.74 but
`CPUs utilized = 1.72 / 20`: the Gson `JsonReader` tokenizer is effectively
single-threaded and native-image's per-thread tokenizer throughput is ~10×
below HotSpot tiered. Not compute-bound — it's serialization + single-thread.
**Pragmatic split: ingest on the JVM, query on native** — both share the
on-disk V0 format. A JVM-shredded DB queried natively hits the same warm
numbers above. Future levers: an AOT-friendly JSON parser (fastjson2 / a
record-shape-specific parser / simdjson) and parallel shred partitioning.
(**This split is largely obsolete on the GraalVM 25.1 line — see the update
below.**)
### Update — GraalVM 25.1-dev (EA): MemorySegment intrinsification closes most of the ingest gap
GraalVM commit `8edcbb77` ("Intrinsify MemorySegment.get/set before analysis",
2026-03-06) makes native-image intrinsify the scalar `MemorySegment` accessors
that HotSpot's JIT already intrinsifies. It is **not in any stable release**; it
first appears in the Oracle GraalVM **25.1-dev** EA line (verified here in
`graalvm-jdk-25e1-25.0.3-ea.32`, 2026-06-16). A standalone scalar get/set
microbench goes **4466 ms → ~75 ms native (≈56×)** on it — native is now ~2× the
JVM instead of ~100×.
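
For readers who want to reproduce the shape of that measurement, a scalar get/set loop might look like the sketch below. It is not the microbench that produced the numbers above: the class name, segment size, pass count, and the naive `System.nanoTime()` timing are all assumptions, and a serious comparison would use a proper harness.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

/** Hypothetical scalar MemorySegment get/set loop; not the original microbench. */
final class SegmentAccessBenchSketch {

    /** Writes every long slot, reads it back, and returns the running checksum. */
    static long fillAndSum(MemorySegment segment, int passes) {
        long slots = segment.byteSize() / Long.BYTES;
        long checksum = 0;
        for (int pass = 0; pass < passes; pass++) {
            for (long i = 0; i < slots; i++) {
                // Scalar store: the accessor the intrinsification targets.
                segment.set(ValueLayout.JAVA_LONG, i * Long.BYTES, i + pass);
            }
            for (long i = 0; i < slots; i++) {
                // Scalar load through the same accessor family.
                checksum += segment.get(ValueLayout.JAVA_LONG, i * Long.BYTES);
            }
        }
        return checksum;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // 8 MiB, 8-byte aligned; size and pass count are arbitrary choices.
            MemorySegment segment = arena.allocate(8L << 20, Long.BYTES);
            long start = System.nanoTime();
            long checksum = fillAndSum(segment, 50);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("checksum=" + checksum + " elapsed=" + millis + " ms");
        }
    }
}
```

Printing the checksum keeps the loops from being optimized away; building the same class with each GraalVM version and comparing the elapsed time is the intended use.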
Re-measured on this codebase/host, same binary recipe (`-O3 -march=native`),
1 M records:
| ingest (shred 1 M) | rec/s | vs JVM |
|---|---|---|
| JVM (GraalVM 25.1-dev) | 110 K | 1.0× |
| **native, GraalVM 25.1-dev `-O3`** | **~90 K** | **~1.2× slower** |
| native, GraalVM 25.0.3 `-O3` | ~23 K | ~4.8× slower |
The native ingest penalty drops from **~4.8× → ~1.2×** (near parity). That a
*MemorySegment*-specific fix alone buys ~4× indicates the un-intrinsified scalar
accessor in the page-serialization **write path** was a substantial part of the
native ingest cost — not only the single-threaded Gson tokenizer the earlier
profile flagged. On the 25.1 line a single native binary can ingest *and* query
with only a ~20 % ingest tax (was ~5×).
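
The ratios quoted in this section follow directly from the throughput table: 110 K / 23 K ≈ 4.8, 110 K / 90 K ≈ 1.2, and 90 K / 23 K ≈ 3.9 for the native-to-native gain. A small helper (hypothetical names, rates copied from the table) makes the arithmetic explicit:

```java
/** Hypothetical helper that recomputes the ingest ratios from the table's rec/s values. */
final class IngestRatioSketch {

    /** How many times slower the native shred is than the JVM baseline. */
    static double slowdown(double jvmRecPerSec, double nativeRecPerSec) {
        return jvmRecPerSec / nativeRecPerSec;
    }

    /** Wall-clock seconds to shred the given number of records at a steady rate. */
    static double seconds(long records, double recPerSec) {
        return records / recPerSec;
    }

    public static void main(String[] args) {
        double jvm = 110_000;       // JVM on GraalVM 25.1-dev, rec/s
        double native251 = 90_000;  // native, GraalVM 25.1-dev -O3, rec/s
        double native250 = 23_000;  // native, GraalVM 25.0.3 -O3, rec/s

        System.out.printf("25.0.3 native vs JVM:   %.2fx slower%n", slowdown(jvm, native250));
        System.out.printf("25.1-dev native vs JVM: %.2fx slower%n", slowdown(jvm, native251));
        System.out.printf("native gain 25.0.3 to 25.1-dev: %.2fx%n", native251 / native250);
        System.out.printf("1 M records at 25.1-dev native rate: %.1f s%n",
            seconds(1_000_000, native251));
    }
}
```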
The warm analytical kernels above are **unchanged** — they are already
AVX-vectorized (Vector API) and never touched the slow scalar accessor.
**PGO did not help**: ingest stayed flat and the sub-ms query kernels *regressed*
(e.g. `filterCount` 0.053 ms `-O3` → 0.165 ms PGO) because the instrumented
profile is dominated by the ~28 s shred and mis-weights the microsecond kernels;
plain `-O3` is the better build here.
Caveat: 25.1-dev is a **pre-release EA build** — treat these as a preview until
the intrinsification ships in a stable GraalVM.