# Polyglot Benchmark Results

Perry vs 9 other runtimes on 9 identical benchmarks. All implementations use `f64`/`double` arithmetic to match TypeScript's `number` type. No SIMD intrinsics, no unsafe code, no non-default optimization flags — each language's idiomatic release-mode build. A companion `RESULTS_OPT.md` (phase 2 of this investigation) shows what happens when each language is given flags equivalent to Perry's defaults. See [`METHODOLOGY.md`](./METHODOLOGY.md) for iteration counts, clocks, compiler versions, and a full explanation of which optimizations create each delta.

## Results

**Run date:** 2026-05-14 — Perry v0.5.908, full polyglot sweep across all 10 runtimes. Both `default` and `--fast-math` Perry columns, on an otherwise-idle machine.

**Hardware:** Apple M1 Max (10 cores, 64 GB RAM), macOS 26.4.

**Methodology:** RUNS=11 per cell. **Median wall-clock ms below**; full per-cell stats (median + p95 + σ + min + max) in `RESULTS_AUTO.md`.

**Pinning:** macOS scheduler hint via `taskpolicy -t 0 -l 0` (P-core preferred via throughput/latency tiers, NOT strict affinity — Apple does not expose unprivileged hard core pinning).

Lower is better.

**The two Perry columns (`default` / `--fast-math`):** since v0.5.585, LLVM `reassoc + contract` per-instruction fast-math flags on f64 ops are opt-in. Default mode produces bit-exact f64 output with Node; `--fast-math` permits the optimizer to reassociate FP chains and fuse multiply-adds. The flag's win is concentrated on trivially-foldable accumulator patterns (`loop_overhead`, `math_intensive`, `accumulate`); on data-dependent kernels (`loop_data_dependent`), the columns are identical because the sequential carry can't be reordered regardless. See `../docs/src/cli/fast-math.md` for the full behavior contract.
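The foldable-vs-data-dependent split comes down to two loop shapes, sketched below. This is an illustration of the kernel shapes only; the function names are hypothetical and the actual files live in `../suite/`.

```typescript
// Trivially-foldable accumulator: each `sum += 1.0` is independent of
// iteration order, so with reassociation allowed LLVM may fold, vectorize,
// or rewrite the loop (the loop_overhead / math_intensive / accumulate shape).
function foldableAccumulator(n: number): number {
  let sum = 0;
  for (let i = 0; i < n; i++) sum += 1.0;
  return sum;
}

// Data-dependent carry: the running value is multiplied by a runtime-loaded
// element every iteration, so the chain must execute sequentially and
// fast-math flags change nothing (the loop_data_dependent shape).
function dataDependentCarry(a: number[], b: number[]): number {
  let sum = 1.0;
  for (let i = 0; i < a.length; i++) sum = sum * a[i] + b[i];
  return sum;
}
```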
| Benchmark | Perry default | Perry --fast | Rust | C++ | Go | Swift | Java | Node | Bun | Python |
|---------------------|--------------:|-------------:|------:|------:|------:|------:|------:|------:|------:|--------:|
| fibonacci | 309 | 306 | 316 | 309 | 446 | 401 | 278 | 987 | 518 | 12382 |
| loop_overhead | 97 | 12 | 97 | 96 | 96 | 96 | 97 | 53 | 41 | 1967 |
| **loop_data_dependent** | **225** | **224** | **226** | **129** | **128** | **225** | **226** | **226** | **230** | **6068** |
| array_write | 3 | 4 | 7 | 2 | 9 | 2 | 6 | 9 | 6 | 331 |
| array_read | 11 | 11 | 9 | 9 | 10 | 9 | 11 | 14 | 16 | 236 |
| math_intensive | 51 | 14 | 48 | 50 | 48 | 48 | 50 | 49 | 50 | 1579 |
| object_create | 2 | 0 | 0 | 0 | 0 | 0 | 5 | 8 | 6 | 133 |
| nested_loops | 18 | 17 | 8 | 8 | 10 | 8 | 10 | 17 | 20 | 353 |
| accumulate | 97 | 34 | 97 | 96 | 96 | 96 | 98 | 597 | 98 | 4382 |

> Clean v0.5.908 sweep on an idle machine — both Perry columns and all
> peer languages re-measured together. σ on Perry cells is 0.3-2.2 ms
> across the board (vs the 25-57 ms spread in yesterday's contaminated
> v0.5.891 sweep). `fibonacci` and `loop_data_dependent` are within
> 3 ms between default and `--fast-math`, confirming the structural
> expectation (integer recursion / sequential FP carry are
> fast-math-immune). Trivially-foldable rows (`loop_overhead` /
> `math_intensive` / `accumulate`) reproduce the v0.5.585 fast-math
> win pattern to the millisecond.

**New benchmark in v0.5.249: `loop_data_dependent`.** Same shape as `loop_overhead` but with a multiplicative carry through `sum` and runtime-loaded array reads, so LLVM cannot apply reassoc, IV-simplify, or autovectorization. The Rust version's loop body is verified at the asm level (see `bench.rs` line 122) — it emits a scalar fmul + fadd chain with two array loads, no fold, no vectorize.

**The kernel splits the field into two FP-contract clusters:**

- **FMA-contract pack (~128 ms):** Go (default) and C++ (`g++ -O3` on Apple Clang). Both fuse `sum * a + b` into a single `FMADDD` instruction (one IEEE-754 rounding instead of two), shortening the per-iteration dependency chain from ~6-8 cycles (FMUL→FADD on Apple silicon) to ~4 cycles (single FMADDD).
- **No-contract pack (~225-230 ms):** Perry, Rust default `-O`, Swift `-O`, Java without `-XX:+UseFMA`, Bun. All emit separate `FMUL` + `FADD` because contracting changes the observable IEEE-754 result by up to 0.5 ULP per iteration and is not legal under these languages' default FP semantics.

LLVM matches the FMA pack with `-ffast-math` or `-ffp-contract=fast`; Apple Clang's default at `-O2+` already enables contract. Verified at the asm level on rustc 1.94.1 / Apple Clang on 2026-04-25 — see `bench.rs` line 115 for the full asm dump and toolchain notes. The dependency chain on `sum` is preserved in both clusters; the win is one ISA-level fusion, not a fold or a vectorize. Node's 322 ms in an earlier noisy run was a JIT-warm-up tail (σ=63, p95=447); on quieter runs, including this sweep, it lands in the no-contract pack alongside Bun.

**Interesting tails surfaced by RUNS=11:**

- Python `accumulate` median 5052 ms but p95 9388 ms (σ=1454) — one run took 9.4 s, likely from GC pressure or thermal throttling during a 10 s+ tight loop. Best-of-5 hid this; the p95 is real.
- Python `math_intensive` median 2244 ms but p95 4091 ms (σ=531) — same pattern.
- Most other cells have σ < 5% of median — distributions are tight.

The `nested_loops` and `accumulate` regressions vs the v0.5.164 baseline (8 → 17 ms and 24 → 34 ms) are still present and still caused by the v0.5.237 generational-GC default flip — `PERRY_GEN_GC=0` recovers the 8 / 24 ms baseline. The slight jitter on `fibonacci` (302 → 312 ms median across runs) is genuine variance from the underlying recursive call chain; best-of-5 had been picking the lower tail, and the median is more representative.
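The no-contract pack's legality constraint, that reassociating or contracting an FP chain changes the observable result, is plain floating-point non-associativity, visible from any f64 runtime:

```typescript
// f64 addition is not associative: grouping changes the rounding.
// This is why Perry's default mode must preserve evaluation order to
// stay bit-exact with Node, and why --fast-math can change results.
const leftToRight = (0.1 + 0.2) + 0.3;  // 0.6000000000000001
const reassociated = 0.1 + (0.2 + 0.3); // 0.6
console.log(leftToRight === reassociated); // false
```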
**Honest regressions vs v0.5.164** (when this table was last refreshed, before generational GC became the default in v0.5.237): `nested_loops` 8 → 17 ms, `accumulate` 24 → 34 ms. Both are caused by the v0.5.237 gen-GC default flip — the per-allocation gen-GC machinery (write-barrier potential, age-bump pass) is overhead that allocation-heavy compute benches don't recoup. `PERRY_GEN_GC=0` recovers the 8 / 24 ms baseline. The trade-off was deliberate; gen-GC's wins on long-running and RSS-sensitive workloads (`test_memory_json_churn` 115 → 91 MB) outweigh the small compute-bench regression. All other cells are unchanged or slightly faster.

**Original v0.5.164 fix** ([#140](https://github.com/PerryTS/perry/issues/140)) that established the current `loop_overhead`/`math_intensive`/`accumulate` baseline: an `asm sideeffect` loop-body barrier (from #74) and an over-eager i32 shadow counter were blocking LLVM's vectorizer on pure-accumulator loops; v0.5.164 scoped the i32 shadow to counters that actually appear in an index subtree and refined the loop-body barrier to fire only on truly-empty bodies — restoring the `<2 x double>` parallel-accumulator reduction (vectorization width 2, interleave count 4). Perry beats Rust 3–8× on these cells.

## How to reproduce

```bash
cd benchmarks/polyglot
bash run_all.sh     # best of 3 runs (default)
bash run_all.sh 5   # best of 5 runs (the table above used RUNS=11 medians)
```

**Required:** Perry (`cargo build --release` from repo root). **Optional** (any subset works; missing runtimes show as `-`): Node.js, Bun, Static Hermes (`shermes`), Rust (`rustc`), C++ (`g++` or `clang++`), Swift, Go, Java (`javac` + `java`), Python 3.

See [`METHODOLOGY.md`](./METHODOLOGY.md) for what each benchmark measures, compiler versions, why certain cells look the way they do, and where Perry wins (`loop_overhead`, `math_intensive`, `accumulate`, `array_read`) vs where it matches the compiled pack (`nested_loops`, `fibonacci`) vs where it loses (`object_create`).
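The `<2 x double>` parallel-accumulator reduction restored by #140 (vectorization width 2, interleave count 4) is conceptually the following source-level transform. This is a hand-rolled sketch of what the vectorizer does in IR, not actual Perry or LLVM output:

```typescript
// What the vectorizer effectively does to `sum += 1.0`: eight independent
// partial sums (2 lanes × 4 interleaved vector ops), combined once after
// the loop. Independent accumulators break the serial fadd latency chain;
// combining them is only legal when reassociation is allowed (fast-math).
function vectorizedStyleSum(n: number): number {
  const acc = [0, 0, 0, 0, 0, 0, 0, 0];
  let i = 0;
  for (; i + 8 <= n; i += 8) {
    for (let lane = 0; lane < 8; lane++) acc[lane] += 1.0;
  }
  let sum = acc.reduce((a, b) => a + b, 0);
  for (; i < n; i++) sum += 1.0; // scalar epilogue for the remainder
  return sum;
}
```

For this integer-valued accumulator the reassociated result is exact; for general f64 data the partial-sum grouping rounds differently, which is exactly why strict-IEEE builds cannot do this.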
## Benchmark-by-benchmark summary

### `loop_overhead` — `sum += 1.0` × 100M

Perry 12 ms vs Rust/C++/Go/Swift/Java ~96 ms — **Perry wins 8×**. Two stacked optimizations produce this: (1) Perry emits `reassoc contract` on f64 ops because JS/TS `number` semantics can't observe the difference (no signalling NaNs, no fenv, no strict `-0` rules at the operator level), so LLVM's IndVarSimplify can rewrite `sum + 1.0` × N as an integer induction variable. Rust/C++/Go/Swift default to strict-IEEE `fadd`, which has a 3-cycle latency wall and is unreassociable. (2) On top of the integer rewrite, LLVM autovectorizes the body into a `<2 x double>` parallel-accumulator reduction with interleave count 4 (four `<2 x double>` fadds in flight per unrolled iteration). `g++ -O3 -ffast-math` on `bench.cpp` drops C++ from 96 ms to 11 ms, matching Perry — same LLVM, same pipeline, one flag. See [RESULTS_OPT.md](./RESULTS_OPT.md) for the per-language opt sweep (C++ opt = 12 ms, Rust opt = 24 ms on stable, Go = 99 ms because Go has no fast-math flag).

This cell regressed to 32 ms between v0.5.91 and v0.5.162 after an `asm sideeffect` barrier (from #74) and an unconditional i32 shadow slot started blocking the vectorizer. Restored in v0.5.164 via [#140](https://github.com/PerryTS/perry/issues/140).

### `math_intensive` — `result += 1.0/i` × 50M

Perry 14 ms, Rust/C++/Go/Swift/Java/Node/Bun all ~48–50 ms — **Perry wins ~3×**. Same fast-math-default + vectorization story as `loop_overhead`: the reciprocal divide has a 10+ cycle latency, but vectorizing into a parallel 4-accumulator reduction keeps 4 independent divides in flight so the scheduler hides the latency. C++ `-ffast-math` matches Perry at 14 ms per [RESULTS_OPT.md](./RESULTS_OPT.md). Regressed to 48 ms at v0.5.162, restored in v0.5.164 ([#140](https://github.com/PerryTS/perry/issues/140)).

### `accumulate` — `sum += i % 1000` × 100M

Perry 34 ms with `--fast-math` (24 ms with `PERRY_GEN_GC=0`), Rust/C++/Go/Swift/Java/Bun all 96–98 ms — **Perry wins ~3×**.
Node 583 ms is an outlier because V8 doesn't inline the libm `fmod` call on this pattern. Perry's type analysis emits `srem` instead of `fmod` for the mod op (the same optimization Node misses), LLVM vectorizes the resulting integer chain + fadd reduction, and the two together beat Rust's strict-IEEE scalar `fmod`+`fadd`. Regressed to 97 ms at v0.5.162, restored in v0.5.164 ([#140](https://github.com/PerryTS/perry/issues/140)).

### `array_read` — sum 10M-element `number[]`

Perry 11 ms, Rust/C++/Swift 9 ms, Go 10 ms, Java 11 ms. Perry detects `for (let i = 0; i < arr.length; i++)` as statically in-bounds, skips the JS `undefined`-on-OOB check, caches the length at loop entry, and maintains a parallel i32 counter so the index never needs a float → int conversion. LLVM then autovectorizes to NEON 2-wide f64. C++ `std::vector` has no bounds check by default but pays the chunk-boundary check from `-O3`'s vectorizer framing. Rust's iterator form (not used here) matches Perry — see `bench_opt.rs` (phase 2).

### `array_write` — `arr[i] = i` × 10M

Perry 3 ms, C++/Swift 2 ms, Rust 7 ms, Go 9 ms. Perry matches C++ here. The Rust result is `-O` with bounds-checked indexing; `.iter_mut()` would match Perry.

### `nested_loops` — 3000×3000 flat-array sum

All compiled languages 8–10 ms; Perry 17 ms after the v0.5.237 gen-GC default flip (`PERRY_GEN_GC=0` recovers the 8 ms baseline; see the regressions note above). This benchmark is cache-bound, not compute-bound — beyond the GC setting there is no optimization lever to pull.

### `fibonacci` — recursive `fib(40)`

Java 278 ms (JIT inlining), Perry 306–309 ms, C++ 309 ms, Rust 316 ms: the compiled pack lands within 10 ms of each other, with Java's JIT roughly 30 ms ahead. Perry's type inference refines the TS `number` parameter to `i64` (because the function only ever performs integer operations), producing `add/sub/icmp` (1 cycle each) instead of the `fadd/fsub/fcmp` (2–3 cycles) that the f64-typed Rust and C++ benchmarks emit.
The reason Perry isn't dramatically further ahead is that LLVM's recursion-folding optimizations on fib-shaped code recover most of the gap at -O3. The Rust `f64 → i64` switch is a one-line change (tested in `bench_opt.rs`) and drops Rust to ~280 ms.

### `object_create` — allocate 1M `{x, y}` pairs, sum fields

Rust/C++/Go/Swift 0 ms: the compiler proves the struct never escapes and eliminates the whole loop. Java 5 ms, Bun 6 ms, Node 8 ms, Perry 2 ms, Hermes 2 ms. Perry is competitive here only because of the v0.5.17 scalar-replacement pass; without it this benchmark was ~10 ms. The 0 ms floor from statically-typed compiled languages is an inherent trade-off of compiling a dynamic language — see `METHODOLOGY.md`.

## Source files

- `bench.cpp` — C++17
- `bench.rs` — Rust (no dependencies)
- `bench.go` — Go
- `bench.swift` — Swift
- `bench.java` — Java
- `bench.py` — Python 3
- `bench.zig` — Zig (may need manual build; not in the current table)
- Perry / Node / Bun / Hermes run the TS files in `../suite/`

All implementations use the same algorithm, same data types (`f64`/`double` throughout), same iteration counts, and the same output format (`benchmark_name:elapsed_ms`), so the runner can grep a single key per row.
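That per-row contract can be sketched as a minimal timing harness. The function name is hypothetical and the real suite files in `../suite/` may be structured differently; only the `benchmark_name:elapsed_ms` output shape is taken from this document:

```typescript
// Minimal harness emitting the one-key-per-row format the runner greps:
// `benchmark_name:elapsed_ms`. Uses the global `performance.now()`
// available in Node 16+, Bun, and browsers.
function report(name: string, fn: () => void): string {
  const start = performance.now();
  fn();
  const elapsed = Math.round(performance.now() - start);
  const line = `${name}:${elapsed}`;
  console.log(line);
  return line;
}
```

A suite file would call `report("fibonacci", () => fib(40))` once per benchmark, so every runtime's stdout is greppable with the same key.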