# Sparkle RV32I SoC — Benchmark Results

Benchmark comparison of Verilator, CppSim, and JIT simulation backends.

## Quick Start

```bash
# Run all benchmarks (Verilator + JIT side-by-side)
cd verilator && ./bench.sh

# Custom cycle count
cd verilator && ./bench.sh 50000000

# Individual benchmarks
cd verilator && ./verilator_bench ../firmware/firmware.hex 10000000
cd verilator && ./jit_bench ../firmware/firmware.hex 10000000 generated_soc_jit.dylib

# Build benchmarks from scratch
cd verilator && make bench CYCLES=10000000
```

## Results (10M cycles, firmware.hex, Apple M4 Max)

| Backend | Speed (cyc/s) | vs Verilator |
|---------|--------------|-------------|
| **JIT evalTick (fused)** | **13.0M** | **1.22x** |
| JIT eval+tick (pure) | 13.0M | 1.22x |
| JIT eval+tick + 6 wire reads | 12.2M | 1.15x |
| JIT evalTick + 6 wire reads | 12.7M | 1.19x |
| Verilator 5.044 | 10.6M | 1.00x |

### JIT Wire Read Overhead

| Wires read/cycle | Speed (cyc/s) | Overhead |
|-----------------|--------------|----------|
| 0 (pure) | 13.0M | — |
| 1 (PC only) | 12.9M | 0.7% |
| 6 (SoCOutput) | 12.2M | 6.3% |

### JIT Fused evalTick Speedup

| Mode | eval+tick | evalTick | Speedup |
|------|-----------|----------|---------|
| Pure (no wires) | 768 ms | 771 ms | ~1.00x |
| With 6 wires | 816 ms | 788 ms | 1.04x |
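
The throughput and speedup figures can be cross-checked from the wall-clock times in the table above. The Python sketch below does so; the helper names are illustrative and not part of the benchmark harness, and it assumes each timing covers the 10M-cycle run named in the Results heading.

```python
CYCLES = 10_000_000  # assumption: each timing is for the 10M-cycle run


def cycles_per_second(cycles: int, elapsed_ms: float) -> float:
    """Simulation throughput in cycles per second."""
    return cycles / (elapsed_ms / 1000.0)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate run is than the baseline run."""
    return baseline_ms / candidate_ms


# Wall-clock times in milliseconds, copied from the table above.
timings_ms = {
    "pure": {"eval+tick": 768, "evalTick": 771},
    "6 wires": {"eval+tick": 816, "evalTick": 788},
}

for mode, t in timings_ms.items():
    split = cycles_per_second(CYCLES, t["eval+tick"]) / 1e6
    fused = cycles_per_second(CYCLES, t["evalTick"]) / 1e6
    ratio = speedup(t["eval+tick"], t["evalTick"])
    print(f"{mode}: eval+tick {split:.2f}M cyc/s, "
          f"evalTick {fused:.2f}M cyc/s, speedup {ratio:.3f}x")
```

With these inputs the pure-mode ratio comes out marginally below 1, which is consistent with the ~1.00x entry.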

Fused `evalTick()` keeps register `_next` values as stack-local variables,
eliminating ~260 intermediate memory operations per cycle. The speedup is
modest (0-4% in the table above) because Clang -O2 already promotes class
members to registers for simple workloads. Larger gains are expected on Linux
boot, where register pressure is higher.
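
The structural difference between the two modes can be illustrated with a toy model. This is a hedged Python sketch, not the generated simulator: the two-register design (`count`, `parity`) and the class names are made up for illustration, and Python cannot show the performance effect, only that the fused form computes the same state without storing `_next` values as members.

```python
MASK32 = 0xFFFFFFFF


class SplitSim:
    """Two-phase model: eval() stores `_next` values as members, tick() commits them."""

    def __init__(self) -> None:
        self.count = 0
        self.parity = 0
        self.count_next = 0   # `_next` values live in the object between calls
        self.parity_next = 0

    def eval(self) -> None:
        # Combinational logic: derive next-state values from current state.
        self.count_next = (self.count + 1) & MASK32
        self.parity_next = self.parity ^ (self.count & 1)

    def tick(self) -> None:
        # Clock edge: copy each `_next` value into its register.
        self.count = self.count_next
        self.parity = self.parity_next


class FusedSim:
    """Fused model: `_next` values are locals that never reach object state."""

    def __init__(self) -> None:
        self.count = 0
        self.parity = 0

    def eval_tick(self) -> None:
        # Compute every next-state value first, then commit, so each
        # expression still sees the pre-edge register values.
        count_next = (self.count + 1) & MASK32
        parity_next = self.parity ^ (self.count & 1)
        self.count = count_next
        self.parity = parity_next


def run(cycles: int):
    """Step both models in lockstep and return their final register states."""
    split, fused = SplitSim(), FusedSim()
    for _ in range(cycles):
        split.eval()
        split.tick()
        fused.eval_tick()
    return (split.count, split.parity), (fused.count, fused.parity)


split_state, fused_state = run(1000)
print(split_state == fused_state)
```

The ordering inside `eval_tick()` is the important design point: committing a register before all next-state values are computed would let later expressions observe post-edge state.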

## Profile Analysis (macOS `sample` profiler, 50M cycles)

### JIT Profile

| Component | Samples | % | Notes |
|-----------|---------|---|-------|
| `eval()` | 1906 | 74.7% | Combinational logic |
| `tick()` | 608 | 23.8% | Register updates |
| `jit_get_wire` | 3 | 0.1% | Wire reads (negligible) |
| `main` loop overhead | 33 | 1.3% | Loop, dlsym calls |

**Takeaway**: `eval()` dominates at 74.7%. This is the combinational logic
computation (ALU, decoder, hazard logic, TLB, page table walker, etc.).
Optimization should focus on reducing the instruction count of `eval()`.

### Verilator Profile

| Component | Samples | % | Notes |
|-----------|---------|---|-------|
| `nba_sequent__TOP__1` | 1033 | 41.1% | Sequential (register updates) |
| `nba_comb__TOP__0` | 530 | 21.1% | Combinational logic |
| `eval()` overhead | 151 | 6.0% | Eval dispatch |
| `nba_sequent__TOP__0` | 79 | 3.1% | Secondary sequential |
| `ico_sequent__TOP__0` | 40 | 1.6% | Input-combinational (`ico`) region |
| `VlDeleter/mutex` | 187 | 7.4% | **Thread sync overhead** |
| `__psynch_cvwait` | — | — | Idle thread wait (excluded) |

**Takeaway**: Verilator spends ~7.4% of its samples on mutex/thread
synchronization, even in single-threaded mode. The JIT has no thread overhead,
which contributes to its ~1.2x advantage.
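
As a rough sanity check on how much of the gap the synchronization cost could explain, the sketch below applies Amdahl's law to the 7.4% figure. It assumes profile sample share approximates wall-clock share and that the profiled run is representative of the 10M-cycle results; the function name is illustrative.

```python
def speedup_if_removed(fraction: float) -> float:
    """Amdahl's law: speedup from eliminating a component that takes
    `fraction` of total run time."""
    return 1.0 / (1.0 - fraction)


VERILATOR_MCYC = 10.6   # M cyc/s, from the results table
JIT_MCYC = 13.0         # M cyc/s, from the results table
SYNC_FRACTION = 0.074   # mutex/thread share from the Verilator profile

bound = speedup_if_removed(SYNC_FRACTION)
projected = VERILATOR_MCYC * bound
print(f"Verilator without sync overhead: ~{projected:.1f}M cyc/s ({bound:.2f}x)")
print(f"JIT measured: {JIT_MCYC:.1f}M cyc/s")
```

Under these assumptions, removing the synchronization cost alone would close only part of the gap to the JIT.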

## Why JIT is Faster Than Verilator

1. **No thread synchronization** — JIT is single-threaded with no mutex/lock overhead.
   Verilator 5.x uses a thread pool even for single-threaded workloads, wasting 7.4%
   on `VlDeleter::deleteAll()` → `std::mutex::try_lock()`.

2. **Observable wire optimization** — JIT has only 33 class member variables + 321
   `eval()`-local variables (L1-cache friendly). Verilator keeps all signals as
   class members (~1000+).

3. **Fewer CPU instructions per cycle** — The CppSim IR optimizer inlines single-use
   wires, folds constants, and eliminates dead code. The result is fewer instructions
   and fewer memory operations per simulation cycle.

4. **Fused evalTick** — Register `_next` values stay on the stack instead of being
   written to class members then read back.

## Bottleneck Analysis

### Current Bottleneck: `eval()` (74.7%)

The `eval()` function computes all combinational logic every cycle. At 13M cyc/s
a cycle takes ~77 ns, of which ~57 ns (74.7%) is spent in `eval()`.
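
The per-cycle time budget follows directly from the throughput and the profile shares. A small Python sketch of that arithmetic, assuming sample share approximates time share and that the 50M-cycle profile applies to the 13M cyc/s figure (the function names are illustrative):

```python
JIT_CYCLES_PER_SEC = 13.0e6                      # from the results table
PROFILE_SHARE = {"eval": 0.747, "tick": 0.238}   # from the JIT profile


def ns_per_cycle(cycles_per_sec: float) -> float:
    """Average wall-clock time per simulated cycle, in nanoseconds."""
    return 1e9 / cycles_per_sec


def time_budget_ns(cycles_per_sec: float, shares: dict) -> dict:
    """Split the per-cycle time across components by their profile share."""
    total = ns_per_cycle(cycles_per_sec)
    return {name: total * share for name, share in shares.items()}


total_ns = ns_per_cycle(JIT_CYCLES_PER_SEC)
print(f"total: {total_ns:.1f} ns/cycle")
for name, ns in time_budget_ns(JIT_CYCLES_PER_SEC, PROFILE_SHARE).items():
    print(f"{name}(): {ns:.1f} ns/cycle")
```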

**Optimization opportunities**:

| Optimization | Expected Impact | Difficulty |
|-------------|----------------|------------|
| Expression inlining in `eval()` | 10-20% | Medium |
| Memory access pattern optimization | 5-10% | Low |
| SIMD for parallel ALU ops | 5-15% | High |
| Partial evaluation (skip unused paths) | 10-30% | High |

### tick() Overhead (23.8%)

`tick()` copies `_next` register values to current state. With 130 registers,
this is ~130 memory copies per cycle. The fused `evalTick()` partially
mitigates this by keeping `_next` values on the stack.

## Cycle-Skipping Oracle Performance

When idle-loop detection is enabled via `mkSelfLoopOracle`:

| Mode | Effective Speed | Real Cycles | Skipped |
|------|----------------|-------------|---------|
| No oracle | 13.0M cyc/s | 10M | 0 |
| Fixed skip (1000) | ~1.25B eff cyc/s | 10M | 9,998K |
| Timer-compare skip | ~5.0B eff cyc/s | 10M | 10M |

The timer-compare-aware oracle (`skipToTimerCompare := true`) computes
`min(mtimecmp - mtime, maxSkip)` so that a skip never advances time past the
next timer compare. This enables Linux boot, where the idle CPU wakes via
timer interrupt.
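
The skip computation can be sketched in a few lines of Python. This is an illustration of the `min(mtimecmp - mtime, maxSkip)` formula only, not the oracle's actual implementation: the function name is hypothetical, clamping to zero when the compare value has already been reached is an assumption, and `mtime` is assumed to advance by one per skipped cycle.

```python
def timer_compare_skip(mtime: int, mtimecmp: int, max_skip: int) -> int:
    """Number of cycles to skip: min(mtimecmp - mtime, max_skip).

    Assumption: if the timer compare is already due (mtimecmp <= mtime),
    skip nothing so the pending interrupt is taken immediately.
    """
    remaining = mtimecmp - mtime
    if remaining <= 0:
        return 0
    return min(remaining, max_skip)


# Interrupt is 750 ticks away: skip exactly to it rather than overshooting.
mtime = 1_000
mtime += timer_compare_skip(mtime, mtimecmp=1_750, max_skip=1_000)
print(mtime)  # 1750

# Interrupt is far away: the skip is capped at max_skip.
print(timer_compare_skip(0, mtimecmp=5_000, max_skip=1_000))  # 1000
```

Capping at `max_skip` bounds how far a single skip can advance time, while the `mtimecmp - mtime` term keeps the skip from overshooting the timer interrupt.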

## Reproducing

### Prerequisites

```bash
# macOS
brew install verilator

# Build all simulation backends
cd verilator
make build          # Verilator
make build-cppsim   # CppSim
make build-jit      # JIT shared library
```

### Running Benchmarks

```bash
# Unified benchmark (recommended)
cd verilator && ./bench.sh 10000000

# Rebuild and run
cd verilator && make bench CYCLES=10000000

# JIT bench with detailed profiling
cd verilator && make build-jit && \
  clang++ -O2 -std=c++17 -o jit_bench tb_jit_bench.cpp -ldl && \
  ./jit_bench ../firmware/firmware.hex 10000000 generated_soc_jit.dylib

# Verilator minimal bench
cd verilator && ./verilator_bench ../firmware/firmware.hex 10000000

# macOS profiling (run both lines in the same shell: `$!` is the PID of the backgrounded bench, sampled for 3 s)
./jit_bench ../firmware/firmware.hex 50000000 generated_soc_jit.dylib &
sample $! 3 -file /tmp/jit_profile.txt
```

### Linux Boot Benchmark

Requires external builds of OpenSBI and the Linux kernel:

```bash
# Verilator Linux boot
cd verilator && ./obj_dir/Vrv32i_soc ../firmware/opensbi/boot.hex 10000000 \
    --dram /tmp/opensbi/build/platform/generic/firmware/fw_jump.bin \
    --dtb ../firmware/opensbi/sparkle-soc.dtb \
    --payload /tmp/linux/arch/riscv/boot/Image

# JIT with boot oracle (timer-compare-aware idle-loop skipping)
lake exe rv32-jit-boot-oracle-test
```