# Fast-math (`--fast-math`)

Off by default. Opt in to permit LLVM optimizations on f64 arithmetic that produce observably different results from Node's V8, in exchange for faster code on a narrow class of numeric workloads.

## TL;DR

| Mode | Bit-exact with Node | Speed |
| --- | --- | --- |
| Default | Yes (~94% of random FP programs match Node bit-for-bit; the residual ~6% comes from the LLVM SLP vectorizer at `-O3`, not from fast-math) | Same as Node within noise on realistic FP code |
| `--fast-math` | No (~70%; ~30% of random FP programs diverge by 1 ULP) | ~7× faster on tight `sum += constant` loops; ~0% difference on dot products, array reductions, or any data-dependent FP-heavy code (M-series ARM64 numbers; x86_64 may differ) |

If your program does scientific computing, signal processing, or any hand-tuned numeric kernel that benefits from autovectorization or FMA fusion, `--fast-math` may help. For everything else (UI, business logic, crypto, networking, framework code), it changes nothing observable except correctness — leave it off.

## Three ways to enable it

The CLI flag wins over the env var; the env var wins over package.json:

```bash
# 1. Per-build CLI flag
perry --fast-math myapp.ts

# 2. Per-shell environment
PERRY_FAST_MATH=1 perry myapp.ts

# 3. Per-project package.json (most common)
{ "perry": { "fastMath": true } }
```

## What it actually changes

Exactly two LLVM per-instruction fast-math flags are emitted on every `fadd` / `fsub` / `fmul` / `fdiv` / `frem` / `fneg`:

- **`reassoc`** — permits the optimizer to reorder associative chains. `(a + b) + c` may become `a + (b + c)`. This is what the loop vectorizer needs to break a serial accumulator dependency chain into 4 parallel accumulators. Worst-case observable behavior: tiny ULP-level differences in long sum chains over operands of widely different magnitudes; rewrites like `(a / b) * b → (a * b) / b` (algebraically equal, IEEE-different).
- **`contract`** — permits fused multiply-add.
`a * b + c` may become a single FMA instruction with one rounding step instead of two. ARM and modern x86 both have hardware FMA. Worst-case observable behavior: the intermediate `a * b` no longer rounds independently, so code that depends on the rounding structure (Kahan summation, compensated arithmetic) sees different bits.

## What it deliberately does NOT enable

The full clang `-ffast-math` is **off** even with `--fast-math`. In particular, these flags stay clear:

- `nnan` / `ninf` — these tell LLVM to assume no NaN/Inf inputs, which is catastrophic for Perry: NaN-boxing uses NaN bit patterns for every non-number value (strings, objects, null, undefined, booleans). Enabling them caused LLVM to replace the `TAG_NULL` / `TAG_UNDEFINED` constants with `0.0` at codegen time. Tried at v0.2.x commit `083ce16`, reverted two days later in `b5a8c83f`. Will not return.
- `nsz` (no signed zeros) — would make `(a + 0) → a` a valid rewrite even when `a` is `-0`. `Object.is(-0, 0)` is observable in JS.
- `arcp` (allow reciprocal) — would rewrite `a / b → a * (1 / b)`, which loses precision when `b` is far from a power of two.
- `afn` (approximate functions) — would let LLVM substitute lower-precision math intrinsics.

For reference, Rust nightly's `#![feature(float_algebraic)]` enables `reassoc + contract + nsz + arcp + afn`. Perry's `--fast-math` is strictly more conservative than that.

## Performance numbers

Benchmarks on Apple Silicon (M-series, ARM64), `min` of 3 runs each, LLVM 19, perry 0.5.569. Run `scripts/perf_bench.sh` to reproduce.
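For orientation, the three kernels in the table below have roughly these shapes. This is a sketch: the function names are illustrative, and the exact loop bodies in `scripts/perf_bench.sh` may differ.

```typescript
// sum_loop: constant accumulator step. `reassoc` lets LLVM split the
// serial `sum += 1` dependency chain into parallel partial sums; this
// is the only shape in the table where fast-math pays off.
function sumLoop(n: number): number {
  let sum = 0;
  for (let i = 0; i < n; i++) sum += 1;
  return sum;
}

// dot_product: data-dependent accumulator. Per the table, fast-math
// makes no measurable difference here.
function dotProduct(a: Float64Array, b: Float64Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// array_sum: plain left-fold reduction; also mode-independent.
function arraySum(xs: Float64Array): number {
  let sum = 0;
  for (let i = 0; i < xs.length; i++) sum += xs[i];
  return sum;
}
```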
| Benchmark | Default | `--fast-math` | Ratio | Node |
| --- | ---: | ---: | ---: | ---: |
| `sum_loop` (100M `sum += 1`) | 96 ms | 13 ms | **7.4× faster** | 53 ms |
| `dot_product` (10M `sum += a[i]*b[i]`) | 13 ms | 13 ms | 1.00× | 12 ms |
| `array_sum` (10M `sum += xs[i]`) | 10 ms | 10 ms | 1.00× | 11 ms |

Read these together: `--fast-math` produces a large speedup ONLY on loops where the accumulator step is constant or trivially redundant enough that LLVM can split it into parallel partial sums. Real FP workloads rarely look like `sum += 1`, and so rarely benefit. The default mode beats Node on `array_sum` and matches it on `dot_product` without giving up bit-exact parity.

## Correctness numbers

`scripts/fp_fuzz.mjs` randomly generates TS programs exercising the six patterns most likely to trip per-instruction FMFs (left-fold, tree-fold, and right-fold reductions; FMA-shaped chains; algebraic identities like `(a/b)*b`; cancellation predicates). Each program is run under both Node and Perry, and stdout is diffed byte for byte.

| Mode | Pass rate (100 random programs, seed=200) |
| --- | --- |
| Default | 94/100 |
| `--fast-math` | ~70/100 |

The 6/100 default-mode failures are residual divergences from sources not gated by per-instruction FMFs — most originate in the LLVM SLP vectorizer at `-O3`, which can apply pairwise reduction even without the `reassoc` permission. Tracked separately; out of scope for this flag.

## Object-cache interaction

Perry's per-module `.o` cache (in `.perry-cache/objects/`) keys on the `fast_math` setting alongside the source hash and other compile options. Toggling the flag invalidates affected cache entries — `perry --fast-math` right after `perry` does a clean recompile of every module that contains f64 arithmetic. No `--no-cache` necessary. (This is a deliberate fix.
During the original investigation, an early version of the flag was not included in the cache key, and the result was that toggling the flag appeared to do nothing because all `.o` files came from the cache. If you ever see fast-math settings that *seem* not to take effect, suspect the cache key first.)

## Migration notes

- **For library authors:** if your TS library publishes benchmark numbers, document which mode you measured under. The 7× sum-loop case is the only place the gap is large; if your benchmark doesn't look like that, the numbers are mode-independent and you can publish one set.
- **For app authors:** there is no migration. The default is exactly the pre-flag behavior; bit-exact results are *more* compatible with Node, not less.
- **For determinism-critical code** (lockstep simulations, financial reconciliation, hash-function correctness): keep the default. Even with `--fast-math` off there is a residual ~6% divergence rate on random FP code, which is too high for true determinism work — but it is a 5× improvement over the ~30% with the flag on.
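The determinism caveat above is concrete. Kahan (compensated) summation, mentioned under `contract`, is the classic example: its correction term `(t - sum) - y` is algebraically zero, so an optimizer granted `reassoc` is permitted to fold it away, silently degrading the routine to naive summation. A minimal sketch (illustrative code, not part of Perry or its test suite):

```typescript
// Naive left-fold: each 1e-17 is below half an ULP of 1.0, so every
// add rounds straight back to 1.0 and the tail is lost entirely.
function naiveSum(xs: number[]): number {
  let sum = 0;
  for (const x of xs) sum += x;
  return sum;
}

// Kahan summation: `c` carries the rounding error of the previous add.
// The line computing `c` is algebraically zero, which is exactly why
// reassociation may delete it. This code should stay on the default mode.
function kahanSum(xs: number[]): number {
  let sum = 0;
  let c = 0; // running compensation for lost low-order bits
  for (const x of xs) {
    const y = x - c;
    const t = sum + y;
    c = (t - sum) - y; // rounding error of `sum + y` (algebraically 0)
    sum = t;
  }
  return sum;
}

const xs = [1.0, ...Array(1000).fill(1e-17)];
// naiveSum(xs) stays at exactly 1.0; kahanSum(xs) recovers the tail.
```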