# Polyglot Benchmark Results — Default vs Optimized
Same benchmarks as [`RESULTS.md`](./RESULTS.md), but with a second column
per native language showing what happens when each language is given the
flags and idioms equivalent to **Perry `--fast-math`**. (Since v0.5.585,
fast-math is opt-in; what was the Perry default through v0.5.584 is now
the `--fast-math` mode.)
**Run date (Perry):** 2026-05-06 — Perry commit `main` (v0.5.585).
The 2026-05-14 v0.5.908 default-mode polyglot sweep (see `RESULTS.md`)
did not rerun `bench_opt.*`, so the table here is preserved as the
last side-by-side opt-tuning sweep. The 2026-05-14 default-column
numbers in `RESULTS.md` match v0.5.585 within 1-4 ms (e.g. fibonacci
304 → 309, loop_overhead 95 → 97, math_intensive 50 → 51), so the
opt-side cells here are within noise of the current binary, and the
shape of the result (which language's `opt` flags close the gap to
Perry `--fast-math`) is unchanged.

**Run date (other languages):** 2026-04-15 — to be refreshed when the
next opt sweep runs.

**Hardware:** Apple M1 Max, macOS 26.4.

**Methodology:** Perry RUNS=11 median; other languages best of 5 per
cell (best of 20 for `fibonacci`). The methodology was modernized after
this table was first written; full RUNS=11 + p95 + σ numbers are in
`RESULTS_AUTO.md`.
## Side by side
All times in milliseconds. `Δ` = (default − opt) / default; positive
means the opt variant is faster.

**Both Perry default and Perry `--fast-math` get a column** because
v0.5.585's flip means the apples-to-apples comparison against each
language's `opt` column is the `--fast` value. Perry default sits
roughly where each language's `dflt` column sits on the FP-foldable
benches.

| Benchmark | Perry dflt | Perry --fast | C++ dflt | C++ opt | ΔC++ | Rust dflt | Rust opt | ΔRust | Go dflt | Go opt | ΔGo | Swift dflt | Swift opt | ΔSwift |
|------------------|--------------:|----------------:|-------------:|------------:|------:|-------------:|------------:|------:|------------:|-----------:|-----:|--------------:|-------------:|-------:|
| loop_overhead | 95 | 12 | 98 | 12 | 88% | 99 | 24 | 76% | 99 | 99 | 0% | 97 | 24 | 75% |
| math_intensive | 50 | 14 | 50 | 14 | 72% | 49 | 14 | 71% | 49 | 49 | 0% | 49 | 14 | 71% |
| accumulate | 95 | 33 | 97 | 26 | 73% | 97 | 41 | 58% | 99 | 70 | 29% | 96 | 42 | 56% |
| array_write | 4 | 3 | 2 | 2 | 0% | 7 | 7 | 0% | 9 | 9 | 0% | 2 | 2 | 0% |
| array_read | 11 | 11 | 9 | 1 | 89% | 10 | 9 | 10% | 10 | 11 | -10% | 9 | 9 | 0% |
| nested_loops | 17 | 17 | 8 | 1 | 88% | 8 | 8 | 0% | 10 | 9 | 10% | 8 | 8 | 0% |
| fibonacci | 304 | 304 | 310 | 312 | -1% | 319 | 319 | 0% | 450 | 454 | -1% | 403 | 360 | 11% |
| object_create | 2 | 0 | 0 | 0 | -- | 0 | 0 | -- | 0 | 0 | -- | 0 | 0 | -- |
## The one-line story per language
**C++ (`bench_opt.cpp`, `-O3 -ffast-math -std=c++17`):** adding `-ffast-math`
and switching `accumulate` to `int64_t` closes every gap. C++ matches Perry
`--fast-math` to the millisecond on `loop_overhead` (12 = 12) and
`math_intensive` (14 = 14), and **beats Perry** on `array_read` (1 < 11)
and `nested_loops` (1 < 17), because clang's autovectorizer is more
aggressive on `-ffast-math` flat-array sums than what Perry currently emits.
The thesis is confirmed: the entire Perry-vs-C++ advantage on numeric
f64 loops is one flag choice on each side. With v0.5.585's flip, that
flag choice is now visible in the table — Perry default sits with C++
default (95-98 ms, 50 ms); Perry `--fast` sits with C++ `-ffast-math`
(12 ms, 14 ms).
**Rust (`bench_opt.rs`, stable + `-C llvm-args=-fp-contract=fast`):** manual
4-way unrolling + iterator form + `i64` accumulate closes **most** of the
gap, but not all. `loop_overhead` goes from 99 → 24 ms (76% improvement)
but doesn't reach Perry `--fast`'s 12 ms — because stable Rust has no
way to expose LLVM's `reassoc` flag on individual fadd instructions.
Nightly Rust's `std::intrinsics::fadd_fast` (or the more recent
`#![feature(float_algebraic)]` API) would get there; we intentionally
stayed on stable. This is an interesting finding: Rust stable's
*type system* can express what Perry `--fast-math` does (via `i64`),
but Rust stable's *compile flags* cannot express what Perry
`--fast-math` does (via `reassoc`). Perry default sits at 95 ms,
right next to Rust default at 99 ms.
**Go (`bench_opt.go`, `go build`):** the only language that **cannot** close
the `loop_overhead` / `math_intensive` gap at all. Go has no `-ffast-math`,
no `reassoc` flag, and its compiler does not ship a floating-point
reassociation pass. `99 → 99` and `49 → 49` on the two fast-math-dependent
benchmarks, even with the full suite of type and loop-form changes that
helped the other languages. The only benchmark where Go opt improves on
Go default is `accumulate` (99 → 70), from the `int64` switch — and even
there, Go's 70 ms is well short of C++ opt's 26 ms, because Go's compiler
inserts a runtime integer-divide path that's slower than a bare ARM `sdiv`
+ `msub` for the modulo.
**Swift (`bench_opt.swift`, `-Ounchecked`):** manual unrolling and
`UnsafeBufferPointer` partially close the `loop_overhead` (97 → 24) and
`math_intensive` (49 → 14) gaps — the same profile as Rust. Swift
also has no reachable `reassoc` flag on its public release toolchain as of
6.3, so the remaining 24 → 12 gap is the same story as Rust. `fibonacci`
improves noticeably (403 → 360) with `-Ounchecked`.
## Where the opt variants matter less than expected
**`array_write` / `array_read`:** the bounds-check elimination story is
less dramatic than predicted in the phase-2 plan. Rust's default indexed
`arr[i]` access with `-O` already gets within 10% of optimal because rustc
is good at proving `i < arr.len()` for classic for-loops. `.iter().sum()`
only shaves 10 → 9 on `array_read`. Swift `UnsafeBufferPointer` on
`array_write` shaved 2 → 1 ms but that's mostly in the noise floor.

The real `array_read` win is on **C++ opt (1 ms)** — and that's from
`-ffast-math` enabling LLVM to break the sum reduction into 4 parallel
lanes, not from bounds elimination. C++ had no bounds checks to remove.

**`fibonacci`:** switching from i32 → i64 (C++, Rust) is a no-op for Go
and Swift, which are already Int64-native on arm64, and doesn't change
anyone's numbers materially. The fib recursion is bottlenecked on call
overhead, not arithmetic width, and ARM64 handles i32 and i64 ops at the
same rate. The
language-to-language fib gap (~315 ms for Rust/C++/Perry vs ~450 ms for
Go) is the compiler's recursion-folding quality, not expressible in
benchmark-source-level changes.
## Compile commands
| File | Command |
|------------------|--------------------------------------------------------------|
| `bench.cpp` | `g++ -O3 -std=c++17 bench.cpp -o bench_cpp` |
| `bench_opt.cpp` | `g++ -O3 -ffast-math -std=c++17 bench_opt.cpp -o bench_opt_cpp` |
| `bench.rs` | `rustc -O bench.rs -o bench_rs` |
| `bench_opt.rs` | `RUSTFLAGS="-C llvm-args=-fp-contract=fast" rustc -O bench_opt.rs -o bench_opt_rs` |
| `bench.go` | `go build -o bench_go bench.go` |
| `bench_opt.go` | `go build -o bench_opt_go bench_opt.go` (no opt flags exist) |
| `bench.swift` | `swiftc -O bench.swift -o bench_swift` |
| `bench_opt.swift` | `swiftc -Ounchecked bench_opt.swift -o bench_opt_swift` |
## Reproducing
```bash
cd benchmarks/polyglot
bash run_opt.sh # builds opt variants, runs best of 5, prints table
```
`run_opt.sh` reads default numbers from the last `run_all.sh` sweep
(stored in `/tmp/perry_polyglot_bench/results_*.txt`) so a full refresh
is `run_all.sh && run_opt.sh`.