# Sparkle RV32I SoC — Benchmark Results

Benchmark comparison of Verilator, CppSim, and JIT simulation backends.

## Quick Start

```bash
# Run all benchmarks (Verilator + JIT side-by-side)
cd verilator && ./bench.sh

# Custom cycle count
cd verilator && ./bench.sh 50000000

# Individual benchmarks
cd verilator && ./verilator_bench ../firmware/firmware.hex 10000000
cd verilator && ./jit_bench ../firmware/firmware.hex 10000000 generated_soc_jit.dylib

# Build benchmarks from scratch
cd verilator && make bench CYCLES=10000000
```

## Results — RV32I SoC (10M cycles, Sparkle-native design)

| Backend | Speed (cyc/s) | vs Verilator |
|---------|--------------|-------------|
| **Sparkle JIT evalTick** | **14.2M** | **1.63x** |
| Verilator 5.040 (no trace) | 8.73M | 1.00x |

## Results — LiteX PicoRV32 SoC (10M cycles, 1730-line real-world design)

| Backend | Speed (cyc/s) | vs Verilator |
|---------|--------------|-------------|
| **Sparkle JIT evalTick** | **17.9M** | **1.70x** |
| Verilator 5.040 (-O2) | 10.5M | 1.00x |
| **Sparkle + Timer Oracle** | **49 GHz** | **~9,900x** |

### Optimization Impact (LiteX SoC, cumulative)

| Phase | Optimization | cyc/s | vs Verilator |
|-------|-------------|-------|-------------|
| Baseline (correct SSA) | Full case SSA merge | 8.17M | 0.79x |
| +Reachability DCE | Generic BFS from output ports | 8.49M | 0.82x |
| +Generic guard detection | Auto-detect `_valid`/`_trigger`/`_enable` | 9.76M | 0.94x |
| +evalTick wire localization | ~270 wires → stack locals | 13.5M | 1.29x |
| +Self-ref _next elimination | Direct register update | 17.9M | 1.70x |
| **+Reverse synthesis** | **Remove pcpi_mul carry-save chain (38 assigns)** | **18.1M** | **1.72x** |

Note: All optimizations are fully generic — no hardcoded signal names.
Reverse synthesis uses `OracleReduction` type class with mandatory Lean proof
(carry-save shift-and-add = multiplication, zero sorry).

### Timer Oracle (Proof-Driven Temporal Skip)

| Mode | Effective Speed | Speedup |
|------|----------------|---------|
| Normal simulation | 5.04M cyc/s | 1x |
| Timer oracle (countdown skip) | **48.9 GHz** | **9,707x** |

Timer oracle detects countdown timer (timer_value) and skips ahead by
timer_value cycles when CPU is idle. Verified with LiteX firmware that
sets TIMER_LOAD=100000, TIMER_EN=1 via CSR bus.

### Multi-Core Scaling (LiteX N-core, hierarchical instantiation)

| Cores | Sparkle Hierarchical | Sparkle Flat | Verilator (wrapper) |
|-------|---------------------|-------------|---------------------|
| 1 | 11.6M | 10.8M | 32.9M |
| 2 | 11.9M | 10.7M | 35.3M |
| 4 | 12.0M | 10.7M | 35.2M |
| 8 | 11.8M | 10.8M | 35.3M |

With proper module hierarchy (10 C++ classes) and shared bus
(all cores active, no dead code elimination possible):

| Cores | Verilator | Sparkle | Ratio |
|-------|-----------|---------|-------|
| 1 | 10.5M | **17.9M** | **1.70x** |
| 8-seq | — | 7.14M per-core | — |
| 8-parallel | 1.06M | **12.7M per-core** | **11.9x** |

Both simulators degrade with core count (D-cache pressure from instance data).
Sparkle degrades more slowly due to instruction sharing via function calls.

### Why Sparkle Beats Verilator

1. **Verified reverse synthesis**: Remove multi-cycle FSM logic (e.g., carry-save multiplier) verified by Lean proof
2. **Wire localization**: All combinational wires as stack-local variables (L1 cache)
3. **Generic conditional guards**: Auto-detect `_valid`/`_trigger`/`_enable` signals, skip inactive logic
4. **Reachability DCE**: BFS from output ports eliminates all unreachable signals (no hardcoded names)
5. **Self-referencing register optimization**: 156/303 registers use if-else instead of ternary
6. **Aggressive constant propagation**: IR-level const/alias elimination before codegen
7. **Fused evalTick**: Single function with all wire+register locals on stack

## Profile Analysis (macOS `sample` profiler, 50M cycles)

### JIT Profile

| Component | Samples | % | Notes |
|-----------|---------|---|-------|
| `eval()` | 1906 | 74.7% | Combinational logic |
| `tick()` | 608 | 23.8% | Register updates |
| `jit_get_wire` | 3 | 0.1% | Wire reads (negligible) |
| `main` loop overhead | 33 | 1.3% | Loop, dlsym calls |

**Takeaway**: `eval()` dominates at 74.7%. This is the combinational logic
computation (ALU, decoder, hazard logic, TLB, page table walker, etc.).
Optimization should focus on reducing the instruction count of `eval()`.

### Verilator Profile

| Component | Samples | % | Notes |
|-----------|---------|---|-------|
| `nba_sequent__TOP__1` | 1033 | 41.1% | Sequential (register updates) |
| `nba_comb__TOP__0` | 530 | 21.1% | Combinational logic |
| `eval()` overhead | 151 | 6.0% | Eval dispatch |
| `nba_sequent__TOP__0` | 79 | 3.1% | Secondary sequential |
| `ico_sequent__TOP__0` | 40 | 1.6% | Initial-cycle only |
| `VlDeleter/mutex` | 187 | 7.4% | **Thread sync overhead** |
| `__psynch_cvwait` | — | — | Idle thread wait (excluded) |

**Takeaway**: Verilator wastes ~7.4% on mutex/thread synchronization
overhead (even in single-threaded mode). The JIT has zero thread overhead,
contributing to its 1.2x advantage.

## Why JIT is Faster Than Verilator

1. **No thread synchronization** — JIT is single-threaded with no mutex/lock overhead.
   Verilator 5.x uses a thread pool even for single-threaded workloads, wasting 7.4%
   on `VlDeleter::deleteAll()` → `std::mutex::try_lock()`.

2. **Observable wire optimization** — JIT has only 33 class member variables + 321
   `eval()`-local variables (L1-cache friendly). Verilator keeps all signals as
   class members (~1000+).

3. **Fewer CPU instructions per cycle** — The CppSim IR optimizer inlines single-use
   wires, folds constants, and eliminates dead code. Result: fewer memory operations
   per simulation cycle.

4. **Fused evalTick** — Register `_next` values stay on the stack instead of being
   written to class members then read back.

## Bottleneck Analysis

### Current Bottleneck: `eval()` (74.7%)

The `eval()` function computes all combinational logic per cycle. At 13M cyc/s
this means ~77ns per cycle, of which ~57ns is spent in `eval()`.

**Optimization opportunities**:

| Optimization | Expected Impact | Difficulty |
|-------------|----------------|------------|
| Expression inlining in `eval()` | 10-20% | Medium |
| Memory access pattern optimization | 5-10% | Low |
| SIMD for parallel ALU ops | 5-15% | High |
| Partial evaluation (skip unused paths) | 10-30% | High |

### tick() Overhead (23.8%)

`tick()` copies `_next` register values to current state. With 130 registers,
this is ~130 memory copies per cycle. The fused `evalTick()` partially
mitigates this by keeping `_next` values on the stack.

## Cycle-Skipping Oracle Performance

When idle-loop detection is enabled via `mkSelfLoopOracle`:

| Mode | Effective Speed | Real Cycles | Skipped |
|------|----------------|-------------|---------|
| No oracle | 13.0M cyc/s | 10M | 0 |
| Fixed skip (1000) | ~1.25B eff cyc/s | 10M | 9,998K |
| Timer-compare skip | ~5.0B eff cyc/s | 10M | 10M |

The timer-compare-aware oracle (`skipToTimerCompare := true`) computes
`min(mtimecmp - mtime, maxSkip)` to advance time precisely, enabling
Linux boot where the CPU wakes via timer interrupt.

## Reproducing

### Prerequisites

```bash
# macOS
brew install verilator

# Build all simulation backends
cd verilator
make build          # Verilator
make build-cppsim   # CppSim
make build-jit      # JIT shared library
```

### Running Benchmarks

```bash
# Unified benchmark (recommended)
cd verilator && ./bench.sh 10000000

# Rebuild and run
cd verilator && make bench CYCLES=10000000

# JIT bench with detailed profiling
cd verilator && make build-jit && \
  clang++ -O2 -std=c++17 -o jit_bench tb_jit_bench.cpp -ldl && \
  ./jit_bench ../firmware/firmware.hex 10000000 generated_soc_jit.dylib

# Verilator minimal bench
cd verilator && ./verilator_bench ../firmware/firmware.hex 10000000

# macOS profiling (run in separate terminal)
./jit_bench ../firmware/firmware.hex 50000000 generated_soc_jit.dylib &
sample $! 3 -file /tmp/jit_profile.txt
```

### Linux Boot Benchmark

Requires external builds of OpenSBI and Linux kernel:

```bash
# Verilator Linux boot
cd verilator && ./obj_dir/Vrv32i_soc ../firmware/opensbi/boot.hex 10000000 \
    --dram /tmp/opensbi/build/platform/generic/firmware/fw_jump.bin \
    --dtb ../firmware/opensbi/sparkle-soc.dtb \
    --payload /tmp/linux/arch/riscv/boot/Image

# JIT with boot oracle (timer-compare-aware idle-loop skipping)
lake exe rv32-jit-boot-oracle-test
```