# Sirix — Operations Guide

Production deployment, tuning, and troubleshooting for `sirix-core` and the
`sirix-rest-api` server. This document focuses on the operational surface — JVM
flags, cache budgets, OS limits, observability, backups — rather than on API
usage. For API documentation, see the project README and JavaDoc; for storage-
format internals, see `docs/ARCHITECTURE.md`.

> **Status.** Sirix is currently at `1.0.0-alpha10`. The wire format is on
> `BinaryEncodingVersion.V0`; bumps are stamped into the page header and rejected
> on read with a clear "version not known" error. There is **no migration tool
> yet** — when V1 is introduced, a one-shot upgrader will ship alongside.

---

## 1. Supported environment

| Dimension | Value |
|---|---|
| **JDK** | Java 25 LTS (sourceCompatibility / targetCompatibility = 25). Earlier JDKs are not supported. |
| **OS / arch** | Linux x86_64 — fully supported, including the bundled native LZ77 decoder. macOS and Windows run on the pure-Java LZ77 fallback (correct, slower). |
| **Other JVMs** | OpenJDK HotSpot is the reference. GraalVM Community / EE work; the perf-campaign baseline runs on a recent EA build for the MemorySegment fixes (see `graal-issue-13377.md` in project memory). |
| **Native image** | Supported via GraalVM `native-image` for `sirix-rest-api` and `sirix-kotlin-cli`. See `docs/NATIVE_IMAGE.md`. |
| **Cluster** | Single-node only. No replication, no consensus. Multi-tenancy at the database level (one resource session writer per resource). |

---

## 2. Mandatory JVM flags

Sirix uses Foreign Function & Memory (FFM), the Vector API, preview features, and
several JDK-internal exports that must be opened. These flags **are not optional**
— omission produces `IllegalAccessError` at startup.

```
--enable-preview
--enable-native-access=ALL-UNNAMED
--add-modules=jdk.incubator.vector
--add-exports=java.base/jdk.internal.ref=ALL-UNNAMED
--add-exports=java.base/sun.nio.ch=ALL-UNNAMED
--add-exports=jdk.unsupported/sun.misc=ALL-UNNAMED
--add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED
--add-opens=jdk.compiler/com.sun.tools.javac=ALL-UNNAMED
--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED
--add-opens=java.base/java.io=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
```

The same set is applied in the project's Gradle build (`build.gradle:215`) and
in the REST API CI workflow (`.github/workflows/gradle.yml`).

Notes:

- `jdk.unsupported/sun.misc` is required only because of a transitive
  `net.openhft/zero-allocation-hashing` dependency. Sirix code itself no longer
  uses `sun.misc.Unsafe` directly.
- The `jdk.compiler` opens are needed when using the Brackit query stack with
  ahead-of-time AST compilation; they are harmless when not exercised.

---

## 3. Heap sizing and GC choice

Sirix is built around **off-heap** `MemorySegment`-allocated page memory. The
on-heap budget covers (a) the JVM and Brackit's query state, (b) per-thread
buffers and caches, (c) intermediate query result objects, and (d) on-heap
references to off-heap pages held by transactions. A typical production sizing:

| Workload | `-Xms` | `-Xmx` | `-XX:MaxDirectMemorySize` |
|---|---|---|---|
| Embedded library, single resource, ~1 GB working set | 2 GB | 4 GB | 1 GB |
| `sirix-rest-api` server, mixed workload | 4 GB | 8 GB | 1 GB |
| Analytical workload over multi-GB data (Chicago-scale) | 5 GB | 12 GB | 2 GB |

Defaults inside the gradle `:test` JVM are `-Xms5g -Xmx12g` (build.gradle:251) —
not because tests need 12 GB, but because they pre-touch the heap (`AlwaysPreTouch`)
to make GC behavior comparable across runs.

### GC

The reference GC is **ZGC** with always-pretouch and large-pages, configured in
the project's gradle test-JVM as:

```
-XX:+UseZGC
-XX:+AlwaysPreTouch
-XX:+UseLargePages
-XX:+UseStringDeduplication
-XX:+HeapDumpOnOutOfMemoryError
-XX:ReservedCodeCacheSize=1000m
-XX:EliminateAllocationArraySizeLimit=1024
```

Z is preferred because:

- Sirix's hot path is largely off-heap, so old-gen pressure is dominated by long-
  lived caches, not transient objects. Z's region-based collector handles this
  well.
- Sirix expects sub-second pause budgets; G1's 50–200 ms pauses on a 12 GB heap
  with deep object graphs are too disruptive.

Generational ZGC (`-XX:+ZGenerational`) is supported but currently commented out
in the build because some workloads regress versus single-gen Z; benchmark
before flipping it on.

### Direct memory

`-XX:MaxDirectMemorySize` should be at least **1 GB**. Sirix uses direct buffers
for FFI (LZ4), file-channel reads, and certain serialization paths.

### Other flags worth knowing

- `-XX:-UseJVMCICompiler` — workaround for a Graal JIT speculation bug
  (oracle/graal#13387) that caused 27% wall-clock regressions on
  `conjunctiveCountByGroup` queries. See `graal-jit-speculation-bug.md`.
- `-Xlog:gc*=debug:file=gc.log` for production GC tracing.
- `-Ddisable.single.threaded.check=true` — disables a single-threaded-access
  check in some legacy code paths; needed for the parallel path.

---

## 4. Cache budgets

Sirix's `BufferManager` is a multi-tier cache. The defaults are computed as
fractions of the **memory budget** (the off-heap allocator's max segment size),
and can be overridden via system properties.

| Cache | Default | Property | Purpose |
|---|---|---|---|
| `RecordPageCache` | 50% of budget | `sirix.cache.recordPage` | Most-recent record-page versions — primary data cache |
| `RecordPageFragmentCache` | 18.75% of budget | `sirix.cache.recordPageFragment` | Older revision fragments needed to reconstruct historical records |
| `PageCache` | 6.25% of budget (min 100 MB) | `sirix.cache.page` | Index pages, RevisionRoot pages — metadata, not records |
| `RevisionRootPageCache` | 5,000 entries (fixed count) | — | Revision root pointers |
| `RBTreeNodeCache` | 50,000 entries (fixed) | — | RB-tree index nodes |
| `NamesCache` | 500 entries (fixed) | — | Interned QName / property-name strings |
| `PathSummaryCache` | 20 entries (fixed) | — | Per-resource path-summary readers |

Set explicit byte counts when you know your working set:

```
-Dsirix.cache.recordPage=8589934592   # 8 GB
-Dsirix.cache.recordPageFragment=3221225472   # 3 GB
-Dsirix.cache.page=536870912   # 512 MB
```

Initial sizing log line (look for it in startup output):

```
INFO  io.sirix.access.Databases - Initializing global BufferManager with memory budget: 16 GB
INFO  io.sirix.access.Databases -   - RecordPageCache: 8589934592 bytes (8192 MB) (default: 25% of budget)
INFO  io.sirix.access.Databases -   - RecordPageFragmentCache: 3221225472 bytes (3072 MB) (default: 12.5% of budget)
INFO  io.sirix.access.Databases -   - PageCache: 1073741824 bytes (1024 MB) (default)
```

---

## 5. Native libraries

### `libsirix_lz77.so`

A bundled native LZ77 decoder for Linux x86_64. Embedded as a JAR resource at
`/native/linux-x86_64/libsirix_lz77.so` and extracted to a temp file at the
first decode call.

- **If present:** ~2× decompression throughput versus the pure-Java fallback.
- **If absent or platform mismatch:** falls back to `SirixLZ77Codec` pure-Java
  decoder, which is correct but slower.
- **Override:** `-Dsirix.lz77Codec.native.disable=true` forces pure-Java for A/B
  testing.

To rebuild from source: `./gradlew :sirix-core:buildNativeLz77` (requires `gcc`
on `PATH`). The build step is no-op when `gcc` is missing — the JAR ships only
the prebuilt `.so`.

### LZ4 (FFM)

The default `FFILz4Compressor` invokes the system `liblz4.so.1` via FFM. On
modern Linux distros this is in `apt install liblz4-1` / `dnf install lz4` and
present by default. macOS: `brew install lz4`. Windows: build / install
`liblz4.dll`.

If `liblz4` is unavailable the constructor throws at first compress/decompress.
Page writes succeed only when the compressor is functional; there is no
runtime fallback for LZ4 (unlike LZ77).

---

## 6. OS-level requirements

| Setting | Value | Why |
|---|---|---|
| `ulimit -n` | ≥ 65,536 | Each storage engine reader holds an open file handle to the resource; a busy server with hundreds of concurrent transactions will exceed the default 1024. |
| `vm.max_map_count` | ≥ 262144 | MemorySegment-backed allocations + memory-mapped file I/O can use many mappings. |
| Huge pages | enable `vm.nr_hugepages` (or `transparent_hugepage=always`) | `-XX:+UseLargePages` is the JVM default and falls back silently if huge pages aren't available, but you give up TLB efficiency on hot pages. |
| Disk | local NVMe SSD strongly preferred | Sirix's read path is page-random; spinning disks are roughly 100× slower per page read. |
| Filesystem | ext4 or xfs | btrfs and ZFS work but add their own copy-on-write layer that interacts oddly with Sirix's CoW page format. |
| Time source | NTP-synced | Sirix records commit timestamps; clock skew shows up as out-of-order revisions. |

---

## 7. Observability

The `sirix-rest-api` server exposes Prometheus-format metrics at `GET /metrics`
via [Micrometer](https://micrometer.io). Wired in
`bundles/sirix-rest-api/src/main/kotlin/io/sirix/rest/MetricsHandler.kt`.

Currently exported:

| Metric | Type | Labels | Notes |
|---|---|---|---|
| `http_request_duration_seconds` | Timer | method, path, status | per-request latency histogram |
| `http_requests_total` | Counter | method, path, status | request rate |
| `http_active_requests` | Gauge | — | in-flight requests |

**Sirix-internal metrics (active transaction count, page cache hit/miss/evict,
commit queue depth, GC pause attribution) are not yet exported through the
Prometheus registry.** A `ResourceSession.activeTrxCount()` accessor exists for
in-process diagnostics; bridging it through Micrometer is on the production-
readiness backlog. For now the recommended approach is JFR (`-XX:StartFlightRecording`)
plus the Sirix logback appender at `INFO` level.

For the embedded-library use case (no REST), Sirix logs cache initialization,
storage allocator decisions, and ClockSweeper progress at INFO. Logger names:

- `io.sirix.access.Databases` — startup, BufferManager init.
- `io.sirix.cache.BufferManagerImpl` / `io.sirix.cache.ShardedPageCache` — cache lifecycle.
- `io.sirix.cache.ClockSweeper` — eviction sweeps (PostgreSQL bgwriter pattern).
- `io.sirix.cache.LinuxMemorySegmentAllocator` — off-heap allocator events.
- `io.sirix.access.Databases$Databases` — close/cleanup warnings.

---

## 8. Backup and restore

Sirix has **no streaming or incremental backup tool**. Resource directories are
self-contained; the operational pattern is:

1. Stop the writer for the resource (close any active `NodeTrx`).
   Read-only transactions can continue.
2. `cp -a` or `rsync -a --inplace` the resource directory to the backup target.
   Sirix's append-only page format means this is consistent without additional
   coordination.
3. Verify the backup by opening it as a read-only resource:
   ```java
   try (var db = Databases.openJsonDatabase(backupPath);
        var session = db.beginResourceSession("...");
        var rtx = session.beginNodeReadOnlyTrx()) { /* ... */ }
   ```

Restoring is a directory move/copy back; no replay is required.

**Caveats:**

- Hot backup (writer running) is **not** safe — the in-flight Transaction Intent
  Log can leave the on-disk image inconsistent. Wait for `wtx.commit()` /
  `wtx.close()` first.
- Snapshot-based backups via filesystem snapshots (LVM, ZFS) are safe **iff** the
  snapshot is atomic across all files of the resource. ext4 + LVM is fine; per-
  file snapshots are not.

A point-in-time recovery is possible via Sirix's revision system: open the
resource at the desired revision number or timestamp via
`session.beginNodeReadOnlyTrx(revision)` /
`session.beginNodeReadOnlyTrx(Instant)`. No external tool needed.

---

## 9. Supported workloads

| Dimension | Supported | Notes |
|---|---|---|
| **Document model** | JSON, XML | one or the other per resource; no mixing |
| **Document size** | up to 64 KiB per LZ77 block, unlimited overall | LZ77's 16-bit offset caps the back-reference window; documents larger than 64 KiB fall back to a literal-only token stream (no compression) |
| **Page size** | 256 KiB ceiling | all in-memory page buffers use this as the practical max |
| **Concurrency** | many concurrent readers, exactly one writer per resource | the writer lock is a `Semaphore(1)` per resource |
| **Bitemporality** | system-time (revisions), valid-time (configurable paths via `validTimePaths`) | both queryable via `jn:all-times`, `jn:open-bitemporal`, `sdb:timestamp`, `sdb:valid-from` |
| **Versioning strategies** | FULL, INCREMENTAL, DIFFERENTIAL, SLIDING_SNAPSHOT | choose at resource creation; `SLIDING_SNAPSHOT` is the production default |
| **Indexes** | name index, path index, CAS index, HOT (height-optimized trie) | configured at resource creation |
| **Query language** | JSONiq via Brackit; XQuery via Brackit | the cost-based optimizer (M1–M5) is wired in for JSONiq |

---

## 10. Known limitations and operational caveats

1. **Single-writer-per-resource.** A second `beginNodeTrx()` on a resource with
   an active writer throws after a 5-second `tryAcquire` timeout. Plan for
   serialised writes; do batch ingestion in one writer.

2. **Brackit dependency.** Sirix depends on the released `io.sirix:brackit:1.0-alpha1`,
   so builds are reproducible from Maven Central with no local install or commit-hash
   pinning required. (Brackit is itself in its 1.0 alpha series alongside Sirix.)

3. **No on-disk format migration tool.** `BinaryEncodingVersion.V0` is the only
   shipping version. When V1 lands, an upgrader will ship; today, opening a
   resource written by an incompatible Sirix version raises
   `IllegalStateException: <n> not known.`

4. **HOT index does not isolate historical revisions on reads.** A read-only
   transaction at revision N opening a HOT index sub-tree may observe the
   latest committed state of the index rather than the state at revision N.
   Tracked as task #57 in project memory; not blocking the typical bench /
   analytical use case where the index reflects the most recent commit.

5. **Auto-commit features are in flight on multiple branches**
   (`feature/warm-auto-commit-v1`, `feature/async-auto-commit`,
   `feature/eager-serialize-gc-fix`). Production should currently use
   synchronous commits via `wtx.commit()` and avoid the
   `AfterCommitState.KEEP_OPEN_ASYNC` path until a single design lands on
   `main`.

6. **Chicago-scale ingestion tests are `@Disabled`.** The reference 3.6 GB
   Chicago dataset is not in CI; large-scale ingestion regressions are caught
   manually by removing the `@Disabled` annotation and running locally on a
   machine with ≥ 16 GB RAM.

7. **No automated crash-recovery test.** kill -9 mid-commit, partial fsync,
   torn writes — these scenarios are believed to be safe given the
   commit-file + UberPage swap protocol, but a fault-injection harness has not
   been built.

---

## 11. Quick-start: launch the REST API server

```bash
java \
  --enable-preview \
  --enable-native-access=ALL-UNNAMED \
  --add-modules=jdk.incubator.vector \
  --add-exports=java.base/jdk.internal.ref=ALL-UNNAMED \
  --add-exports=java.base/sun.nio.ch=ALL-UNNAMED \
  --add-exports=jdk.unsupported/sun.misc=ALL-UNNAMED \
  --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED \
  --add-opens=jdk.compiler/com.sun.tools.javac=ALL-UNNAMED \
  --add-opens=java.base/java.lang=ALL-UNNAMED \
  --add-opens=java.base/java.lang.reflect=ALL-UNNAMED \
  --add-opens=java.base/java.io=ALL-UNNAMED \
  --add-opens=java.base/java.util=ALL-UNNAMED \
  -Xms4g -Xmx8g \
  -XX:+UseZGC -XX:+AlwaysPreTouch -XX:MaxDirectMemorySize=1g \
  -Dsirix.cache.recordPage=4294967296 \
  -Dsirix.cache.recordPageFragment=1610612736 \
  -jar bundles/sirix-rest-api/build/libs/sirix-rest-api-1.0.0-alpha10-fat.jar \
  -conf bundles/sirix-rest-api/src/main/resources/sirix-conf.json
```

`/metrics` will be available on the configured port immediately; database
directories are created lazily under the path configured in `sirix-conf.json`.

---

## 12. Where to look when something is wrong

| Symptom | First place to check |
|---|---|
| `IllegalAccessError` on startup | mandatory JVM flags (§ 2). |
| `<n> not known.` on resource open | resource was written by an incompatible Sirix version (§ 1, § 10.3). |
| `OutOfMemoryError: Direct buffer memory` | raise `-XX:MaxDirectMemorySize` (§ 3). |
| `OutOfMemoryError: Java heap space` | raise `-Xmx`, OR shrink record-page cache (§ 4). |
| Page cache hit rate < 50 % | look at the working-set size in the startup log; raise `sirix.cache.recordPage`. |
| Long GC pauses | confirm ZGC is engaged (`-Xlog:gc*=info`); avoid G1 on heaps > 8 GB. |
| Slow LZ77 decompression | confirm `libsirix_lz77.so` extracted (look for `SirixLZ77NativeDecoder loaded` at INFO). |
| `No read-write transaction available` (5s timeout) | another writer is open on this resource session — close it first (§ 10.1). |
| Process-level slowdown after writer churn | check whether a writer was orphaned without `close()`; the deprecated `finalize`-based detector was replaced by Cleaner — leak warnings now appear at WARN with `NodeStorageEngineWriter FINALIZED WITHOUT CLOSE`. |
| Concurrent reader-open contention | Sirix 1.0.0-alpha5 onwards drops `synchronized` on `beginNodeReadOnlyTrx`; if you see throughput plateau, profile with `jfr`. |

---

## 13. Project memory

For deeper context, see:

- `docs/ARCHITECTURE.md` — page format, versioning, transaction model.
- `docs/cost-based-optimizer-design.md` — JQGM, histogram selectivity, DPhyp.
- `docs/NATIVE_IMAGE.md` — GraalVM native-image build/deploy.
- `CLAUDE.md` — internal developer expectations (HFT-grade hot path,
  no-Claude-in-commits, etc.).
- `ROADMAP.md` — open work items and target order.