# Rust core & language bindings (current architecture) **Status: implemented / current architecture.** This spec records the architectural *decisions* behind gitsheets' Rust-core engine with thin language bindings — and, as of the [#127](https://github.com/JarvusInnovations/gitsheets/issues/127) cutover, the shipped reality: `gitsheets-core` owns the engine, the Node package is a thin marshalling shell over `@gitsheets/core-napi`, and a Python binding (`rust/gitsheets-py`) runs over the same core. A spec-drift audit should find this implemented, not pending. The build-out landed via the (now-`done`) plan DAG in [`plans/`](../plans/) (see [Phasing](#phasing)). It completed the substrate evolution begun by the holo-tree migration ([#127](https://github.com/JarvusInnovations/gitsheets/issues/127), [`architecture.md`](architecture.md)). ## Goal gitsheets should eventually be consumable from **more than one language** (Node.js first, **Python** next, others later). The way to get there without re-implementing — or worse, *diverging* — the engine per language is a single **Rust core** that owns the engine, with **thin per-language bindings** that only adapt it to each host's idioms. ## The load-bearing principle: the bytes-authority lives in the core > **Anything that determines on-disk bytes, or must be identical across bindings, > lives in the Rust core. Anything that is purely a host language's API ergonomics > or the consumer's own types lives in the binding.** The on-disk form is a **contract every binding must agree on byte-for-byte**. If Node and Python each owned TOML serialization, normalization, path rendering, or validation, the *same logical write from two bindings would produce different commits* → divergence. That is unacceptable, and it is the principle that decides every "core vs binding" question below. This reverses the single-binding-era guidance (which kept serialization in JS): under multi-binding, centralizing the bytes-authority is mandatory, not optional. ## Division of labor ### Rust core (`gitsheets-core`) Owns everything bytes-determining or consistency-critical: - Tree / blob / commit operations (via `holo-tree`/gitoxide, linked directly into the core — the Node package no longer depends on `@hologit/holo-tree`). - **TOML parse + serialize** (`toml` / `toml_edit`). - **Normalization** — key sorting, byte-stable canonical form. - **Path-template rendering** (record → path), including partition derivations. - **Persisted-shape validation** (JSON Schema). - **Definition-embedded logic execution** — see [Embedded code](#embedded-code-execution). - Query traversal + filtering; secondary indexing. - Diff + patch semantics (RFC 7396 / RFC 6902). - The `Sheet` / `Transaction` / `Store` state machine. ### Language binding (thin: `node`, `python`, …) Owns only what is genuinely host-specific: - Idiomatic API surface (naming, async conventions). - **Marshalling** native objects ↔ the core value type, preserving type fidelity (TOML datetime → JS `Date` / Python `datetime`; integer vs float). - The **consumer-supplied runtime validator** hook (Standard Schema / Zod in JS, Pydantic in Python). This runs on the *native* object and is *allowed* to be language-specific — it is the consumer's app concern, not the store's contract. - Error-variant → idiomatic-exception mapping. The one real subtlety is the write order: native object → **binding** runs the consumer validator → marshal to core value → **core** does shape-validation + normalize + serialize + write. ## The FFI boundary - **Records cross as a typed core value, not JSON.** JSON flattens TOML datetimes to strings and blurs integer/float. The core value type preserves TOML's types; each binding maps it idiomatically (napi-rs / pyo3 both support custom conversions). - **Batch at the boundary.** Bulk APIs (`upsertMany`, `queryAll`) cross the FFI *once* with an array; the whole batch is parsed/serialized and the tree built natively in Rust. This is where the core most rewards bulk-write workloads — versus per-record marshalling + per-record tree writes. ### Type-fidelity rules The core value type is the lingua franca; these mappings are the round-trip contract every binding implements and tests against. The rule of thumb: **the core preserves whatever determines on-disk bytes** (bytes-authority), even when a binding's host language can't represent the distinction natively — the binding then picks the least-lossy idiomatic surface and the core retains the precise kind for faithful re-serialization. - **Integers — `i64` in the core; adaptive on the Node surface.** The core stores every integer as `i64` (TOML's full range). The Node binding marshals to a JS `number` when the value fits in ±(2^53−1) and to `BigInt` above that, so common small ids stay ergonomic numbers while large values never lose precision. Inbound it accepts both `number` and `bigint`. (A future Python binding maps to Python's arbitrary-precision `int` directly.) - **Floats — `f64`.** Kept distinct from integers; `1` and `1.0` are different core values and serialize differently. - **Datetimes — all four TOML kinds preserved distinctly.** Offset-datetime, local-datetime, local-date, and local-time serialize to *different bytes*, so the core keeps them as distinct value kinds. The Node binding surfaces them as JS `Date` to match the current `@iarna`-based v1.x behavior (no consumer-visible change), with the precise TOML kind retained core-side for byte-faithful re-serialization. - **Strings, booleans, arrays, tables** map to their obvious idiomatic counterparts (table ↔ plain object / dict). ## Embedded code execution Some behavior is **embedded in the sheet definition** itself — today, raw-JS sort comparators (`Sheet..sort` string rules, run via `node:vm`); plausibly later, partition-path derivations and custom validation expressions. Embedded code is the one thing that does **not** portability-flatten the way TOML bytes do: a definition-embedded snippet must produce identical results under every binding. The resolution is two-part: 1. **Declarative-first, native in the core.** The common cases stay *data, not code* and are evaluated natively in Rust: `${{ field }}` path substitution, `{field: ASC}` sort directives, JSON-Schema validation, built-in partition derivations (date-parts, hashing, bucketing). This already mirrors how `Sheet`'s sorter treats `true`/`false`/array/`{field:dir}` declaratively and reserves raw JS only as an escape hatch. **Declarative `sort = true` (locale-sensitive string-array sorting) is native — it does NOT go through the embedded engine.** It uses a Rust ICU collator that matches V8's `localeCompare`, so its order is byte-stable and identical across bindings. Because boa is built without `Intl`, routing locale sorting through the engine would diverge from `node:vm` for non-ASCII input; the engine is for *arbitrary* raw-JS comparators only. The ICU collator is part of the canonical-behavior contract — its **version is pinned** like the serializer and the engine. 2. **JS escape-hatch via an embedded engine *in the core*.** When a definition genuinely needs arbitrary logic, it runs in a JS engine **embedded in the Rust core** — *not* the host binding's JS runtime. Running it in the core is what keeps it portable: a Python consumer gets the *core's* engine, so Node and Python produce identical results. The engine + each definition's compiled snippets are held **persistently** on the `Sheet`/`Store` handle (compile once on open, reuse across every operation; no per-op re-parse). **Engine choice: `boa_engine`** (pure-Rust). The escape-hatch snippet set is simple and deterministic (sort comparators, partition derivations), so QuickJS's spec-completeness edge buys little, while its C toolchain is permanent overhead that multiplies per binding/target. Pure-Rust boa keeps the *whole* core C-free → trivial cross-platform prebuilds across all six targets (incl. musl + windows), clean Python wheels, and a plausible WASM binding later. Full V8 is overkill. The accepted trade-off is boa's divergence risk vs the current `node:vm` baseline (concentrated in `Intl`/locale + exotic built-ins our snippets don't use); the **`node:vm` parity gate is a hard validation criterion** that catches any real divergence on actual snippets before adoption. Flip to `rquickjs` only if parity fails on real snippets or a use case needs `Intl`/advanced built-ins. As boa evolves quickly, its **version is pinned** (per the contract constraint below) and upgraded deliberately. > **Shipped status (Node cutover).** The *declarative-first* half (point 1) is > live in the core: `${{ field }}` substitution, `{field: dir}` directives, > JSON-Schema validation, and locale-sensitive `sort = true` via the native ICU > collator all run natively in `gitsheets-core`. The *JS escape-hatch* half > (point 2) is **not yet routed through the core's boa engine on the Node > binding**: raw-JS sort comparators (`Sheet..sort` string rules) still run > host-side in `node:vm` via the retained `buildSorter`/exported `Template` > path. This is a **deliberate deferral** — it is single-binding today (Node only, > where `node:vm` is the parity baseline anyway) and moves to the in-core boa > engine when the Python binding needs raw-JS comparators to stay byte-identical. > Until then the core's boa engine is the specified target for this escape hatch, > not the Node runtime path. Tracked in [`node-binding-thin`](../plans/node-binding-thin.md). **Two constraints this creates:** - **Thread-confinement.** Embedded JS contexts are single-threaded (`!Send`). The persistent context is pinned to the `Store`'s owning thread — the same thread-model discipline holo-tree's cache already needs. - **The engine is part of the canonical-behavior contract.** Sort order and partition paths depend on its JS semantics, so the **engine version is pinned** and an upgrade is treated like a normalization change (same discipline as the TOML serializer and the validator). ## Canonical-form rebaseline Moving TOML serialization to the Rust `toml`/`toml_edit` stack changes the canonical bytes (at minimum integer-underscore: `31_618` → `31618`). gitsheets always serializes *fresh* from an object, so `toml_edit`'s format-preserving "0% churn" does not apply — expect the integer normalization. This is a **deliberate one-time re-baseline**: - **Decision: do it, and do it while gitsheets effectively has a single user.** The blast radius (re-normalizing existing repos) only grows with adoption. - It is a documented change to [`behaviors/normalization.md`](behaviors/normalization.md) plus a single re-normalize commit per repo. - Measurements behind this decision: [#196](https://github.com/JarvusInnovations/gitsheets/issues/196). ## Phasing The DAG in [`plans/`](../plans/) realized this in dependency order — **all three phases are now done**. The load-bearing sequencing rule held: **the bytes-authority landed before a second binding existed**, so Node and Python agree on bytes. 1. **Tree ops in Rust** *(done — [#203](https://github.com/JarvusInnovations/gitsheets/pull/203))* — `@hologit/holo-tree` binding; tree/blob/commit moved off hologit JS, TOML stays JS, behind the unchanged public API. (`holo-tree-napi-spike` → `holo-tree-migration`.) 2. **Bytes-authority core** *(done)* — established `gitsheets-core`; moved TOML + normalization, path templates, shape validation, and definition-embedded logic into it and executed the rebaseline. This was the gate for multi-binding. (`toml-core`, `canonical-rebaseline`, `path-template-core`, `validation-core`, the codec/collation/normalize plans, etc.) 3. **Engine + thin bindings** *(done — Node cutover [#216](https://github.com/JarvusInnovations/gitsheets/pull/216))* — record engine (CRUD/query/index/diff) and the `Sheet`/`Transaction`/`Store` state machine in the core; the Node binding re-thinned to a marshalling shell over `@gitsheets/core-napi` (`node-binding-thin`) with the direct `@hologit/holo-tree` dependency dropped; the Python binding (`rust/gitsheets-py`, pyo3) added over the same core, proven byte-identical. ## Non-negotiables (carried from the existing specs) - **No consumer-visible public-API change** for the Node surface across the migration (per [`architecture.md`](architecture.md) / #127). The bytes re-baseline is the one deliberate, documented exception, gated on the rebaseline decision above. - Parity passes are required wherever a Rust component replaces a JS one with observable output: the TOML serializer (#196), the JSON-Schema validator vs `ajv`, and the embedded engine vs `node:vm`. - **Markdown body normalization is native.** Content-typed (`markdown`/`mdx`) sheets normalize the body on write with the embedded `dprint-plugin-markdown` formatter rather than a host-side `markdownlint` pre-pass, so body bytes are identical across bindings. Its **version + config are pinned** as a canonical-behavior contract input (like the serializer, engine, validator, and collator); switching off the JS `markdownlint` pass is a one-time documented body re-baseline (see [`behaviors/content-types.md`](behaviors/content-types.md)).