# jz → native binary

Compile a jz JavaScript source to a standalone native executable. At the time of writing, the pipeline compiles watr's WAT compiler (`watr/src/compile.js`) and serves as the reference target for "how fast can a jz-produced wasm get if you really care."

```
watr/src/compile.js
   │  jz (NaN-boxed f64 ABI, JZ-aware)
   ▼
jz-watr.wasm
   │  wasm-opt -O3 (Binaryen)
   ▼
jz-watr-opt.wasm
   │  wasm2c --enable-exceptions
   ▼
watr.c  (post-processed: A2a nullify barriers + A2b hoist memory base)
   │  clang -O3 -flto -fprofile-instr-generate            ──► profraw
      clang -O3 -flto -fprofile-instr-use=watr.profdata   ──► watr-native
```

## Quick start

```bash
./scripts/native/build.sh                                 # full PGO pipeline → /tmp/jz-c/watr-native
./scripts/native/build.sh clean                           # wipe BUILD_DIR
BIN=/tmp/jz-c/watr-native node scripts/bench-native.mjs   # regression gate
```

Env overrides:

| Variable    | Default                      | Notes                                      |
|-------------|------------------------------|--------------------------------------------|
| `BUILD_DIR` | `/tmp/jz-c`                  | All transient artefacts land here.         |
| `WABT_DIR`  | `/Users/div/projects/wabt`   | Provides `bin/wasm2c` and `wasm2c/*`.      |
| `WASM_OPT`  | `$(which wasm-opt)`          | Binaryen.                                  |
| `CC`        | `clang`                      | Needs LTO + PGO.                           |

## Files

| Path                              | Role                                                                          |
|-----------------------------------|-------------------------------------------------------------------------------|
| `build.sh`                        | Three-stage PGO build orchestrator.                                           |
| `gen-watr-wasm.mjs`               | jz-compiles `watr/src/compile.js`, runs `wasm-opt -O3` → `jz-watr-opt.wasm`.  |
| `postprocess-watr.awk`            | A2b: hoist `instance->w2c_memory.data` per function + macro-shadow load/store. |
| `harness.c`                       | Median-of-90 bench harness; re-instantiates every 5 iters to bound bump-heap growth. |
| `env-stubs.c`                     | Empty `__ext_*` import stubs.                                                 |
| `wasm-rt-exceptions-stub.c`       | Trap-only EH (watr has 5 throws, 0 catches).                                  |

## Why each stage matters

**wasm-opt -O3** trims wasm2c's input by ~10% on parser-heavy paths.
Raw jz output has redundant locals and unhoisted loads that wasm2c can't undo once it's serialized to C.

**PGO** closes the last ~5% on the hottest inner loops (parser identifier walk, uleb encode, bump alloc) by giving clang accurate branch frequencies and inlining decisions. The profile is collected from a weighted sample of `watr/test/example/*.wat`: heavy iters on raycast/maze/containers/snake/etc., and a light pass over the rest.

**A1** (`-fno-exceptions` + trap-only EH stub) removes `throw_with_stack` machinery. watr has 5 throws and 0 catches, so nothing ever propagates and the runtime only needs `wasm_rt_trap`.

**A2a** (sed nullifies `FORCE_READ_INT`/`FORCE_READ_FLOAT`) is the biggest single win. wasm2c emits `__asm__("" ::"r"(var))` after every load to "force the value into a register," but clang's PGO+LTO treats those as side-effecting barriers that defeat CSE of `instance->w2c_memory.data`. Killing them unlocks the `.data` hoist on parser hot loops:

```
f5 inner loop, before:    12 insts/iter, .data reloaded 4×
f5 inner loop, after A2a:  4 insts/iter, .data hoisted above the loop
```

This is a 644M-call function on the PGO trace; the change is worth ~8% on parser-heavy workloads.

**A2b** (`postprocess-watr.awk`) goes further. Even with A2a, clang refuses to CSE `instance->w2c_memory.data` across CFG joins inside a single function: f6 still reloaded it 5 times. The awk pass injects, at the top of every function that takes `(w2c_jzwatr* instance, ...)`:

```c
__attribute__((unused)) u8* const __restrict__ _md = instance->w2c_memory.data;
```

…and shadows the wasm2c load/store inlines with macros that reference `_md` directly. The `__restrict__` plus const-locality plus PGO is what finally lets clang keep the base in a register across the entire function. f6: 5 reloads → 1.

**A3** removes C++ EH tables (`-fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables`), the stack protector (no untrusted input), and merges constants.
Smaller `.text` and `.rodata` → better i-cache / constant-pool behaviour.

**`WASM_RT_MEMCHECK_GUARD_PAGES`** moves bounds checks from inline branches to OS-level guard pages.

**`WASM_RT_NONCONFORMING_UNCHECKED_STACK_EXHAUSTION`** turns `FUNC_PROLOGUE` into a no-op (no `++wasm_rt_call_stack_depth` per call).

## Regression gate

`scripts/bench-native.mjs` walks `watr/test/example/*.wat`, runs each through both the native binary and a steady-state V8 baseline (200 iters or 200ms of warmup, whichever is longer; fresh `node` process per run to avoid in-process tier-up bias), and asserts that native is faster than V8 on every example. Each side is invoked `RUNS` times (default 3) and we take the min; this is robust against macOS scheduler jitter without burying real regressions.

```
ITERS=30  RUNS=3  MARGIN=1.0     # defaults
```

Current result on M4 Max (range across runs):

```
19/21 wins   (1.04× – 4.0×)
 2/21 ties   (raycast.wat, containers.wat — 0.97×–1.01×, within noise floor)
```

raycast and containers exercise the same identifier-resolution path that V8's TurboFan also optimises near-optimally; we've matched V8 to within measurement noise but not consistently beaten it. Tier B will close that gap through watr-source-level changes (the bottleneck is the structure of the JS, not the codegen).
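
The gate's decision rule (min of `RUNS` timings per side, compared against `MARGIN`) can be sketched roughly as follows. This is an illustration only, not the contents of `scripts/bench-native.mjs`: the function names, the shape of the `results` array, and the reading of `MARGIN` as "minimum required V8/native speedup" are all assumptions.

```javascript
// Hypothetical sketch of the gate's decision rule; names and data shapes
// are illustrative and not taken from scripts/bench-native.mjs.

// Best-of-N: the minimum is the run least disturbed by scheduler jitter.
function best(timings) {
  if (timings.length === 0) throw new Error("no timings recorded");
  return Math.min(...timings);
}

// Speedup > 1 means the native binary beat the V8 baseline.
function speedup(nativeRuns, v8Runs) {
  return best(v8Runs) / best(nativeRuns);
}

// ASSUMPTION: MARGIN is the minimum speedup that counts as a pass.
// Returns the examples that fall below it.
function gate(results, margin = 1.0) {
  const failures = [];
  for (const { name, nativeRuns, v8Runs } of results) {
    const s = speedup(nativeRuns, v8Runs);
    if (s < margin) failures.push({ name, speedup: s });
  }
  return failures;
}

// Made-up timings in milliseconds, RUNS=3 per side.
const sample = [
  { name: "a.wat", nativeRuns: [10, 12, 11], v8Runs: [20, 22, 21] },
  { name: "b.wat", nativeRuns: [10, 8, 11], v8Runs: [6, 4, 5] },
];

for (const { name, speedup: s } of gate(sample, 1.0)) {
  console.log(`REGRESSION ${name}: ${s.toFixed(2)}x`);
}
```

Taking the minimum rather than the mean or median means one noisy run cannot fail the gate, while a real regression still fails because it slows every run.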