# Why SirixDB

*A document store where history is the data model, not a feature.*

SirixDB is an embeddable, open-source (BSD-3) store for JSON and XML that
**never overwrites data**. Every commit creates a new, immutable revision that
structurally shares everything it didn't change with the revision before it.
Everything below follows from that single design decision, the good parts
and the trade-offs alike. This document is the honest version of the pitch:
what the architecture buys you, what we've measured, and where it
loses.

---

## The one-paragraph mental model

Think of a persistent (copy-on-write) tree, like the data structures inside
Clojure or Git, but as a database engine with fine-grained nodes instead of
whole files: every JSON object, array, field, and value is a node with a
stable identity across revisions. Commit `N+1` copies only the page fragments
it touches; everything else is a reference into the past. A *sliding
snapshot* algorithm bounds how many fragments any read must consult to
reconstruct a page, so a database with 10,000 revisions opens and reads as
fast as one with ten — we benchmarked exactly that claim, found it false in
two places, and fixed both (see "Receipts" below).
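
To make the structural-sharing idea concrete, here is a minimal Java sketch of path copying in an immutable tree. It is not SirixDB's page layout: `PNode` is a name invented for illustration, and the engine shares page fragments rather than individual tree nodes. The point is only that a "commit" allocates new nodes along the path it touches and reuses everything else by reference.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not SirixDB's data structures: an immutable
// binary search tree where each "commit" copies only the path to the change.
final class PNode {
    final int key;
    final String value;
    final PNode left, right;

    PNode(int key, String value, PNode left, PNode right) {
        this.key = key;
        this.value = value;
        this.left = left;
        this.right = right;
    }

    // Returns a new root; subtrees off the search path are reused, not copied.
    static PNode put(PNode n, int key, String value) {
        if (n == null) return new PNode(key, value, null, null);
        if (key < n.key) return new PNode(n.key, n.value, put(n.left, key, value), n.right);
        if (key > n.key) return new PNode(n.key, n.value, n.left, put(n.right, key, value));
        return new PNode(key, value, n.left, n.right);
    }

    static String get(PNode n, int key) {
        while (n != null && n.key != key) n = key < n.key ? n.left : n.right;
        return n == null ? null : n.value;
    }

    public static void main(String[] args) {
        List<PNode> revisions = new ArrayList<>();    // index = revision number
        PNode r0 = null;
        for (int k : new int[] {50, 30, 70, 20, 40}) r0 = put(r0, k, "v0");
        revisions.add(r0);
        revisions.add(put(r0, 70, "v1"));             // revision 1 touches key 70 only

        PNode old = revisions.get(0), head = revisions.get(1);
        System.out.println(get(old, 70));             // v0: the old revision is intact
        System.out.println(get(head, 70));            // v1
        System.out.println(old.left == head.left);    // true: untouched subtree is shared
    }
}
```

Keeping every root in a list is all it takes to keep every revision readable, which is why history is cheap in this family of data structures.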

## What this buys you

### 1. Time travel is a query, not a restore job

Any revision or wall-clock instant is addressable directly in the query
language (XQuery / JSONiq with temporal extensions):

```xquery
(: Open the resource as of the given instant, then project customer names. :)
jn:open('orders', 'orders.json', xs:dateTime('2026-05-01T00:00:00Z'))
  .customers[].name
```

Every node also knows its own history — `jn:all-times($node)` walks every
version of one field without touching the rest of the document. There is no
"as-of replica", no WAL archaeology, no application-level `valid_from`
columns. Auditing "what did this record say when we made that decision?" is a
one-liner.
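
Conceptually, opening a resource at a wall-clock instant comes down to mapping that instant to a revision number. The sketch below shows one way to do that in Java with a sorted map. `RevisionIndex` and its methods are hypothetical names, and the rule it uses (the latest revision committed at or before the instant) is an assumption; SirixDB's exact resolution rule may differ.

```java
import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not SirixDB's API: resolve a wall-clock instant to the
// revision that was current at that time.
final class RevisionIndex {
    private final TreeMap<Instant, Integer> byCommitTime = new TreeMap<>();

    void recordCommit(int revision, Instant committedAt) {
        byCommitTime.put(committedAt, revision);
    }

    // Latest revision committed at or before the instant; -1 if none existed yet.
    int revisionAsOf(Instant pointInTime) {
        Map.Entry<Instant, Integer> e = byCommitTime.floorEntry(pointInTime);
        return e == null ? -1 : e.getValue();
    }

    public static void main(String[] args) {
        RevisionIndex idx = new RevisionIndex();
        idx.recordCommit(1, Instant.parse("2026-04-01T09:00:00Z"));
        idx.recordCommit(2, Instant.parse("2026-04-20T09:00:00Z"));
        idx.recordCommit(3, Instant.parse("2026-05-03T09:00:00Z"));

        System.out.println(idx.revisionAsOf(Instant.parse("2026-05-01T00:00:00Z"))); // 2
        System.out.println(idx.revisionAsOf(Instant.parse("2026-01-01T00:00:00Z"))); // -1
    }
}
```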

### 2. Diffs are semantic and instant

Because revisions share structure physically, SirixDB computes diffs by tree
comparison with rolling hashes, not by serializing two snapshots and running a
text diff. Diffing two revisions of a document returns exact node-level
operations (insert/update/delete with stable node keys) in **~0.3 ms** in our
benchmark — and it stays flat regardless of document size, because unchanged
subtrees hash-skip. The same machinery powers the web UI's revision scrubber
and structured diff view.
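
The hash-skip idea can be sketched in a few lines of Java. This is an illustration under stated assumptions, not SirixDB's diff algorithm: `DNode` and `HashSkipDiff` are invented names, the hash is a simple polynomial combination rather than the engine's rolling hash, and moves or reordering are not handled. Each node's hash covers its whole subtree, so two subtrees with equal hashes are treated as identical and never descended into.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: diff two revisions by stable node key, skipping any
// subtree whose hash is unchanged.
final class HashSkipDiff {
    int visited = 0;                            // node pairs actually compared
    final List<String> ops = new ArrayList<>();

    void diff(DNode a, DNode b) {
        visited++;
        if (a.hash == b.hash) return;           // unchanged subtree: skip it entirely
        if (!a.value.equals(b.value)) ops.add("UPDATE " + a.key);
        Map<Long, DNode> oldKids = new LinkedHashMap<>();
        for (DNode c : a.children) oldKids.put(c.key, c);
        for (DNode c : b.children) {
            DNode match = oldKids.remove(c.key);
            if (match == null) ops.add("INSERT " + c.key);
            else diff(match, c);
        }
        for (Long gone : oldKids.keySet()) ops.add("DELETE " + gone);
    }

    public static void main(String[] args) {
        DNode alice = new DNode(2, "alice", new DNode(3, "x"), new DNode(4, "y"));
        DNode oldRoot = new DNode(1, "orders", alice, new DNode(5, "bob", new DNode(6, "z")));
        DNode newRoot = new DNode(1, "orders", alice,
                new DNode(5, "bob", new DNode(6, "z2"), new DNode(7, "new")));
        HashSkipDiff d = new HashSkipDiff();
        d.diff(oldRoot, newRoot);
        System.out.println(d.ops);      // [UPDATE 6, INSERT 7]
        System.out.println(d.visited);  // 4 comparisons for a 7-node tree
    }
}

// Immutable node with a stable key and a hash over its value and descendants.
// A production engine would need a stronger hash or a collision check.
final class DNode {
    final long key;
    final String value;
    final List<DNode> children;
    final long hash;

    DNode(long key, String value, DNode... children) {
        this.key = key;
        this.value = value;
        this.children = List.of(children);
        long h = 31 * key + value.hashCode();
        for (DNode c : children) h = 1_000_003L * h + c.hash;
        this.hash = h;
    }
}
```

The `alice` subtree costs one comparison no matter how large it is, which is the property that keeps diff time tied to the size of the change rather than the size of the document.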

### 3. Storage grows with *change*, not with *data size*

A commit costs O(changed nodes), not O(document). Update one field in a 100 MB
document and you pay for one page-fragment chain, not a 100 MB copy. We
decomposed the actual per-commit byte cost on disk (currently ~1.7 KB fixed
overhead per small commit, fully attributed byte-by-byte in
[`STORAGE_COST.md`](STORAGE_COST.md), with a roadmap to ~700 B). For
small-document workloads PostgreSQL's storage is still tighter (see the
honest comparison below); the crossover argument is about *large documents
with small edits*, and we won't quote a crossover point until we've
benchmarked it.

### 4. Crash safety you can audit, not just trust

The commit protocol is two ordered write barriers. First the data is
written ahead of the commit point; then a dual-slot "uber beacon" is flipped
with data-integrity write-through (O_DSYNC; FUA on NVMe). The revisions file
is opened O_SYNC, and each commit issues one explicit fsync. We built a
**power-loss simulation harness** that records every write
and force at the FileChannel level, then materializes thousands of crash
states — torn writes, dropped unforced writes, metadata/size splits — and
cold-opens each one: acked revisions must survive, unacked ones must be
rejected *cleanly*. 0 failures across the state space, and the harness is in
the tree (`bundles/sirix-core/src/test/java/io/sirix/crash/`), not in a
slide deck.
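
The general dual-slot technique looks roughly like the Java sketch below. Everything specific here is an assumption made for illustration: the 24-byte slot layout, the CRC32 checksum, and the `DualSlotLog` class are not SirixDB's disk format, and the sketch uses two `force` calls for simplicity where the text above describes write-through for the beacon. What it does show is the ordering: data becomes durable first, the pointer flips second, and recovery picks the newest slot whose checksum validates.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32;

// Hypothetical layout for illustration only (not SirixDB's disk format):
// [slot 0: 24 bytes][slot 1: 24 bytes][appended data ...]
// A slot holds revision, end-of-data offset, and a CRC32 over those two fields.
final class DualSlotLog implements AutoCloseable {
    static final int SLOT_SIZE = 24, DATA_START = 2 * SLOT_SIZE;
    private final FileChannel ch;
    long revision = 0;          // last acknowledged revision; 0 means empty
    long dataEnd = DATA_START;  // end of that revision's data

    DualSlotLog(Path file) throws IOException {
        ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
        if (ch.size() < DATA_START) writeFully(ByteBuffer.allocate(DATA_START), 0);
        for (int slot = 0; slot < 2; slot++) {   // recovery: newest valid slot wins
            ByteBuffer b = ByteBuffer.allocate(SLOT_SIZE);
            long pos = (long) slot * SLOT_SIZE;
            while (b.hasRemaining() && ch.read(b, pos + b.position()) > 0) { }
            if (b.hasRemaining()) continue;
            b.flip();
            long rev = b.getLong(), end = b.getLong(), crc = b.getLong();
            if (crc == checksum(rev, end) && rev > revision) { revision = rev; dataEnd = end; }
        }
    }

    static long checksum(long rev, long end) {
        CRC32 c = new CRC32();
        c.update(ByteBuffer.allocate(16).putLong(rev).putLong(end).array());
        return c.getValue();
    }

    void commit(byte[] payload) throws IOException {
        // Barrier 1: new data must be durable before any slot points at it.
        writeFully(ByteBuffer.wrap(payload), dataEnd);
        ch.force(true);
        // Barrier 2: publish by overwriting the slot that does NOT hold the current revision.
        long next = revision + 1, end = dataEnd + payload.length;
        ByteBuffer slot = ByteBuffer.allocate(SLOT_SIZE)
                .putLong(next).putLong(end).putLong(checksum(next, end));
        slot.flip();
        writeFully(slot, (next % 2) * SLOT_SIZE);
        ch.force(true);
        revision = next;
        dataEnd = end;
    }

    private void writeFully(ByteBuffer buf, long pos) throws IOException {
        while (buf.hasRemaining()) pos += ch.write(buf, pos);
    }

    @Override public void close() throws IOException { ch.close(); }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("dualslot", ".db");
        try (DualSlotLog log = new DualSlotLog(f)) {
            log.commit("rev-1".getBytes(StandardCharsets.UTF_8));
            log.commit("rev-2".getBytes(StandardCharsets.UTF_8));
        }
        // Simulate a torn beacon write: damage slot 0, which holds revision 2.
        byte[] raw = Files.readAllBytes(f);
        raw[0] ^= 0x7F;
        Files.write(f, raw);
        try (DualSlotLog log = new DualSlotLog(f)) {
            System.out.println(log.revision);   // 1: falls back to the intact slot
        }
        Files.delete(f);
    }
}
```

Because a commit only ever overwrites the slot that is not current, a crash in the middle of the flip leaves the previous revision's slot untouched and valid.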

### 5. Analytics without an ETL hop

Pages carry columnar (PAX) regions — dictionary-encoded strings, bit-packed
numbers with zone maps, bit-packed booleans — and a vectorized executor with
SIMD kernels uses them for group-by, filtered counts, aggregates, and
count-distinct. At 1M records (cold executor, results verified byte-identical
against the interpreted pipeline):

| query | interpreted | vectorized |
|---|---|---|
| group-by (string key) | 18.2 s | 3.2 s |
| group-by (two keys) | 20.4 s | 1.6 s |
| group-by (numeric key) | 18.3 s | 1.4 s |
| sum / avg / min+max | 15–37 s | 1.4–1.8 s |
| count-distinct | 18.4 s | 1.7 s |

With the in-memory columnar projection installed, the whole suite lands
within 1.1–4.5× of DuckDB 1.5.2 on the same machine at 100M records — sum
16 ms, two-key group-by 240 ms — and the profile-guided-optimized native
binary comes out ahead of DuckDB on three of nine shapes (filtered count,
filtered group-by, compound-range count). Full methodology and honest
caveats in
[`COMPARISON_DUCKDB.md`](COMPARISON_DUCKDB.md). Every fast path is
**fail-closed**: the optimizer only claims a pipeline when it can prove the
query's shape matches what the kernel emits, and kernels verify their own
coverage per page (a value the column can't represent falls back to the
general path). Wrong-but-fast is treated as a bug class, not a configuration
option — a differential suite runs every shape through both pipelines and
requires byte-identical output.
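
A toy version of the fail-closed pattern, in Java, might look like this. The names `ColumnPage` and `FilteredCount` are hypothetical and the page model is far simpler than a real PAX region; the sketch only illustrates three ideas from the paragraph above: zone maps let a filtered count skip or bulk-accept whole pages, a page whose values do not fit the column falls back to the general path, and a differential check compares both pipelines.

```java
import java.util.List;

// Hypothetical sketch of a fail-closed fast path for "count values > threshold".
final class FilteredCount {
    // General path: inspects every raw value; this is the reference answer.
    static long interpreted(List<ColumnPage> pages, long threshold) {
        long n = 0;
        for (ColumnPage p : pages)
            for (Object o : p.raw)
                if (o instanceof Long && (Long) o > threshold) n++;
        return n;
    }

    // Fast path: uses zone maps, and falls back per page when coverage is missing.
    static long vectorized(List<ColumnPage> pages, long threshold) {
        long n = 0;
        for (ColumnPage p : pages) {
            if (p.column == null) { n += interpreted(List.of(p), threshold); continue; }
            if (p.max <= threshold) continue;                           // whole page excluded
            if (p.min > threshold) { n += p.column.length; continue; }  // whole page included
            for (long v : p.column) if (v > threshold) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<ColumnPage> pages = List.of(
                new ColumnPage(List.of(1L, 2L, 3L)),
                new ColumnPage(List.of(10L, 20L, 30L)),
                new ColumnPage(List.of(5L, "n/a", 50L)),   // not representable as a long column
                new ColumnPage(List.of(4L, 40L)));
        long slow = interpreted(pages, 8), fast = vectorized(pages, 8);
        if (slow != fast) throw new AssertionError("pipelines disagree");
        System.out.println(fast);   // 5
    }
}

// A page keeps its raw values plus, when every value fits, a long column
// with a min/max zone map. `column == null` means "no columnar coverage".
final class ColumnPage {
    final List<?> raw;
    final long[] column;
    final long min, max;

    ColumnPage(List<?> raw) {
        this.raw = raw;
        long[] col = new long[raw.size()];
        long lo = Long.MAX_VALUE, hi = Long.MIN_VALUE;
        boolean ok = true;
        for (int i = 0; i < col.length; i++) {
            Object o = raw.get(i);
            if (!(o instanceof Long)) { ok = false; break; }
            col[i] = (Long) o;
            lo = Math.min(lo, col[i]);
            hi = Math.max(hi, col[i]);
        }
        column = ok ? col : null;
        min = lo;
        max = hi;
    }
}
```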

### 6. It's a library first — and it compiles to a native binary

The core is an embeddable Java library (also usable from Kotlin); the
Vert.x-based REST server, the CLI, and the SolidJS web UI are layers on top.
Single process, no sidecar, no cluster to operate. `jn:store` a document and
you have a versioned database in a directory.

It also builds as a GraalVM native image, *including the write path*. That
required resolving a real toolchain blocker: GraalVM restricts closing a
shared `Arena`, not creating one, so in AOT the off-heap allocator uses an
auto-managed arena and lets the GC reclaim mappings (the on-disk file is
fsync'd at commit independently). A native binary creates, shreds, commits,
reopens, and
time-travels with no JVM warmup, and on warm analytical queries the
ahead-of-time binary runs **7–17× faster than the JVM** (better instruction
throughput, no JIT ramp). The honest caveat: a cold query whose predicate
needs runtime code generation falls back to the interpreter in AOT (no
class-loading at image runtime), and single-threaded ingest is slower than the
JVM — so the natural split is *ingest on the JVM, embed the native binary for
read/query latency*. Both verdicts and the full perf tables are in
[`NATIVE_IMAGE.md`](NATIVE_IMAGE.md).

## Receipts (we benchmark against ourselves and publish the losses)

- **History-independent performance**: we found that session opens were
  O(history) (a quadratic `access()` syscall storm: 50M syscalls over a
  10k-commit build) and that commit throughput degraded from 296 to 154
  commits/s as history grew. Both were root-caused and fixed: opens are now
  0.18 ms flat at 10k revisions, and the commit rate is flat at ~570/s.
  The full causal chain is in [`BENCHMARKS.md`](BENCHMARKS.md).
- **Concurrent reads while a writer commits**: a mixed-workload benchmark (16
  reader threads + 1 committing writer over REST) exposed two real defects —
  a reader-side page-lifecycle bug that freed a shared page out from under a
  concurrent reader (sporadic 500s *and* silently-wrong reads), and a second
  O(history) cost where every storage open racing a commit re-read the whole
  revision index one syscall at a time. Both root-caused (via a use-after-free
  stress gate that reproduced the crash in seconds, and a wall-clock profile)
  and fixed at the root. On a 12,800-revision database the same workload went
  from 361 to **11,198 reads/s**, reader p99 from 334 ms to **4.8 ms**, with
  **zero errors** — the aged database now outruns the pre-fix fresh one.
- **Honest PostgreSQL comparison** ([`COMPARISON_POSTGRES.md`](COMPARISON_POSTGRES.md)):
  PG 17 with a history table wins raw small-document numbers — ingest 4,015
  vs ~430 commits/s (PG sits at 84% of the device's fsync floor; that's its
  home turf) and total storage 4.7 vs ~12 MiB. SirixDB wins on per-statement
  embedded reads, 0.3 ms semantic diffs, and sub-document time travel, which
  PG simply doesn't have. Durability settings were verified equivalent before
  measuring.
- **Correctness sweeps as release gates**: adversarial round-trip fidelity
  (83 JSON shapes × every serializer mode), a JSONiq result-correctness sweep,
  and the vectorized differential suite all run as tests. Several of the bugs
  they caught — invalid JSON from fused-node serialization, an optimizer
  rewrite that returned empty results for valid predicates, group-by over
  numeric keys returning nothing — were found *by these gates*, then fixed at
  the root.

## Where SirixDB is the wrong choice (today)

- **High-rate small-document OLTP.** PostgreSQL ingests ~10× faster in that
  regime and stores it tighter. If you don't need history, you don't need us.
- **Distributed workloads.** Single-node, single-writer-per-resource by
  design. Horizontal scale is not on the short-term roadmap.
- **Relational queries across many entities.** It's a document store; joins
  exist in JSONiq but a relational engine will beat it at relational shapes.
- **Maturity.** This is a beta of a research-grade engine. The disk format
  (V0) is now contract-documented and crash-tested, but you should expect
  rough edges, and the known limitations are listed in
  [`KNOWN_LIMITATIONS.md`](KNOWN_LIMITATIONS.md) rather than hidden.

## Use cases where it shines

- **Audit-grade record keeping** — financial/medical/compliance documents
  where "show me this record exactly as it was on date X, and prove what
  changed since" is the product requirement, not an afterthought.
- **Collaborative or ML-pipeline document evolution** — checkpoint every
  transformation of a config/feature/document tree and diff any two states
  semantically in sub-millisecond time.
- **Debugging production state** — keep the full history of a fast-changing
  JSON state tree and bisect *data* the way you bisect code.
- **Versioned content/configuration APIs** — serve any historical version
  with the same latency as the head revision.
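
Bisecting data works the same way as bisecting code: binary-search the revision range for the first revision where a check fails. The Java sketch below assumes a known-good and a known-bad revision and a defect that persists once introduced. `RevisionBisect` and `isBad` are hypothetical; in practice `isBad` would open the given revision and run a query against it.

```java
import java.util.function.IntPredicate;

// Hypothetical helper: `isBad` stands in for "open revision r and run a check".
final class RevisionBisect {
    // Returns the first failing revision in (good, bad], assuming revision
    // `good` passes, revision `bad` fails, and the defect never disappears again.
    static int firstBad(int good, int bad, IntPredicate isBad) {
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            if (isBad.test(mid)) bad = mid;
            else good = mid;
        }
        return bad;
    }

    public static void main(String[] args) {
        int[] probes = {0};
        int culprit = firstBad(1, 10_000, r -> {
            probes[0]++;                 // each probe is one historical read
            return r >= 7_321;           // stand-in for a real data check
        });
        System.out.println("first bad revision: " + culprit);
        System.out.println("probes used: " + probes[0]);
    }
}
```

With 10,000 revisions the search needs at most 14 probes, so the approach is only practical when opening an old revision costs about the same as opening the head.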

## Try it

- Live read-only demo (tree explorer, query editor with optimized plan view,
  revision scrubber, structured diffs): **https://demo.sirix.io**
- Quickstart (one process, no auth, for evaluation):
  [`QUICKSTART.md`](QUICKSTART.md)
- Sources: https://github.com/sirixdb/sirix — the engine,
  https://github.com/sirixdb/brackit — the query compiler,
  https://github.com/sirixdb/sirixdb-web-gui — the UI.