# RV32IMA RISC-V SoC
## RV32IMA RISC-V SoC — Boots Linux

A **complete pipelined RV32IMA CPU** with S-mode, Sv32 MMU, and peripherals — written entirely in Signal DSL. **Boots Linux 6.6.0 via OpenSBI v0.9 on Verilator** — both hand-written and **auto-generated SystemVerilog** reach the same kernel init point.

```
OpenSBI v0.9 — Platform: Sparkle RV32IMA SoC — ISA: rv32imasu
Linux version 6.6.0 ... #6 Thu Feb 26 06:29:23 UTC 2026
Machine model: Sparkle RV32IMA SoC
Memory: 26208K/28672K available (1279K kernel code, 465K rwdata, ...)
```

| | Hand-written SV | Generated SV (from Lean) |
|---|:---:|:---:|
| **UART output** | 3944 bytes | **5250 bytes** |
| **Linux boot** | Kernel init | Kernel init |
| **Final PC** | 0xC013A9xx | 0xC013A9xx |
| **Firmware test** | ALL PASS | ALL PASS |

```lean
-- 122 registers in a single Signal.loop — full SoC in one function
-- State fields accessed by name via declare_signal_state macro
def rv32iSoCSimulate (firmware : BitVec 12 → BitVec 32) : Signal dom SoCState :=
  Signal.loopMemo fun state => rv32iSoCWithFirmwareBody firmware state
```

### CPU Core
- **4-stage pipeline**: IF/ID/EX/WB with hazard detection and data forwarding
- **RV32IMA ISA**: Full integer + M-extension (MUL/DIV/REM) + A-extension (LR.W/SC.W/AMO)
- **Multi-cycle divider**: Restoring division algorithm in Signal DSL

### Privilege & Virtual Memory
- **S-mode + M-mode**: Full privilege separation with trap delegation (medeleg/mideleg)
- **Sv32 MMU**: 4-entry TLB + hardware page table walker (PTW), megapage support
- **CSRs**: mstatus, mie, mtvec, mepc, mcause, mtval, satp, sstatus, stvec, sepc, scause, stval, medeleg, mideleg, mcounteren, scounteren, PMP stubs

### Peripherals
- **UART 8250**: TX/RX with LSR/IER/LCR/DLL/DLM registers (Linux-compatible)
- **CLINT**: Machine timer with mtime/mtimecmp
- **Memory**: 32 MB DRAM (byte-addressable, sub-word load/store: LB/LH/LBU/LHU/SB/SH)

### Signal DSL Ergonomics

Ergonomic operators and macros reduce boilerplate in hardware descriptions:

```lean
open Sparkle.Core.Signal

-- Hardware equality (replaces (· == ·) <$> a <*> Signal.pure val)
let isIdle := fsmReg === (0#4)

-- Bool operators (replaces (· && ·) <$> a <*> b etc.)
let startAndIdle := start &&& isIdle
let shouldFlush  := branchTaken ||| trap_taken

-- Implicit constant lifting via Coe (replaces Signal.pure)
let p3SavedNext := Signal.mux cond (true : Signal dom _) p3SavedReg

-- Hardware conditional macro (replaces deeply nested Signal.mux)
let fsmNext := hw_cond fsmReg            -- default value
  | startAndIdle  => (1#4 : Signal dom _)  -- first match wins
  | stemDone      => (2#4 : Signal dom _)
  | stageConvDone => (3#4 : Signal dom _)
  | isDone        => (0#4 : Signal dom _)
```

All features are synthesis-compatible — `===` expands to `(· == ·) <$> a <*> b`, `hw_cond` expands to nested `Signal.mux` calls, and `Coe` unfolds to `Signal.pure`.

### Named State Accessors (`declare_signal_state`)

Eliminates error-prone magic-number indices for hardware state tuples:

```lean
-- Declare state with named fields, types, and defaults
declare_signal_state BottleneckState
  | fsmReg      : BitVec 2   := 0#2
  | residualReg : BitVec 8   := 0#8
  | resultReg   : BitVec 8   := 0#8
  | doneReg     : Bool        := false

-- Access fields by name (no more projN! state 122 47)
let fsmReg := BottleneckState.fsmReg state
let residualReg := BottleneckState.residualReg state
```

Generates a tuple type alias, synthesis-compatible accessor `def`s, default value, and `Inhabited` instance. The RV32 SoC uses this for all 122 registers — adding/removing a field no longer requires updating every index.

### JIT Simulation — Transparent & Fast (~200x faster than Lean)

Two APIs: **Signal API** (drop-in replacement for `loopMemo`) and **Streaming API** (O(1) memory for long runs):

```lean
-- Signal API: same interface as loopMemo, JIT speed under the hood
let soc ← rv32iSoCJITSimulate (jitCppPath := "verilator/generated_soc_jit.cpp") (firmware := fw)
let out := soc.atTime 1000  -- SoCOutput with pc, uartValid, uartData, ...

-- Streaming API: 10M+ cycles with per-cycle callback, O(1) memory
rv32iSoCJITRun (jitCppPath := cppPath) (firmware := fw) (cycles := 10000000)
  (callback := fun cycle vals => do
    let out := SoCOutput.fromWireValues vals
    if out.uartValid then IO.println s!"UART: {out.uartData}"
    return true)  -- continue
```

- **Compile C++ to shared library at runtime**, load via `dlopen` from Lean
- Hash-based caching: recompilation skipped if source unchanged
- Uses stable **named output wires** (`_gen_pcReg`, `_gen_uartValidBV`, etc.) — immune to DCE
- 6 observable wires (via `SoCOutput.wireNames`), 11 memories, 6 input ports

### Simulation Performance (10M cycles, Apple Silicon)

| Backend | Speed | vs Verilator | vs Lean |
|---------|-------|-------------|---------|
| **JIT evalTick (fused)** | **13.0M cyc/s** | **1.17x faster** | ~2600x |
| **JIT evalTick + 6 wires** | **12.7M cyc/s** | **1.14x faster** | ~2540x |
| JIT eval+tick (separate) | 12.7M cyc/s | 1.14x faster | ~2540x |
| Verilator 5.044 | 11.1M cyc/s | 1.00x | ~2220x |
| CppSim (-O3 AOT) | 6.0M cyc/s | 0.54x | ~1200x |
| Lean loopMemo | ~5K cyc/s | — | 1x |

JIT **exceeds Verilator speed** (1.17x faster) thanks to: (1) no mutex/thread overhead (Verilator 5.x wastes 17.4% on locks even single-threaded), (2) observable wire optimization (33 class members + 321 locals, L1-cache friendly), (3) fewer CPU instructions per sim-cycle, and (4) fused `evalTick()` with stack-local `_next` variables (eliminates ~260 intermediate memory ops/cycle). See [docs/BENCHMARK.md](docs/BENCHMARK.md) for detailed benchmark results, profiling data, and bottleneck analysis.

### JIT Cycle-Skipping — Dynamic Oracle

Detects when the CPU is stuck in a tight halt loop and skips forward by advancing the cycle counter + CLINT timer, achieving **706x effective speedup**. The oracle receives the `JITHandle` directly for dynamic register reads, bulk memory ops (`memsetWord`), and direct state mutation:

```lean
import Sparkle.Core.Oracle
open Sparkle.Core.Oracle

-- Create oracle: detects PC stuck in ≤12-byte range for 50+ consecutive cycles
-- Oracle receives JITHandle per-call and handles all state mutations internally
let (oracle, statsRef) ← mkSelfLoopOracle {}

-- Run with oracle — 10M cycles complete in 9ms instead of 5.5 seconds
let cycles ← JIT.runOptimized handle 10_000_000 wireIndices oracle callback

-- Post-run: 9,998 triggers, 9,998,000 cycles skipped, UART output identical
let stats ← statsRef.get
IO.println s!"Skipped {stats.totalSkipped} cycles in {stats.triggerCount} triggers"

-- Bulk memory fill (e.g., BSS zeroing for Linux boot)
JIT.memsetWord handle memIdx addr 0 count
```

| Metric | Without Oracle | With Oracle |
|--------|---------------|-------------|
| Wall-clock time (10M cycles) | ~770 ms | **7 ms** |
| Effective cyc/s | 13.0M | **~1.4 billion** |

**BSS-Clear Speculative Warp**: A custom inline firmware (7-instruction BSS-clear loop) demonstrates the full pattern — the oracle detects the memory-clearing loop, bulk-zeros all 4 DMEM byte banks via `memsetWord`, and skips ~100K cycles in <1 ms (389 triggers, 99,584 cycles skipped). See `Tests/RV32/JITDynamicWarpTest.lean`.

**Speculative Simulation with Rollback**: Full-state snapshot/restore via C++ default copy constructor enables guard-and-rollback speculation. The oracle snapshots state, speculatively applies bulk updates, checks guard conditions (e.g., timer interrupt), and rolls back if the guard fails — providing bit-accurate cycle-skipping even with interrupts. See `Tests/RV32/JITSpeculativeWarpTest.lean`.

Also includes register snapshot/restore API (130 registers), bulk memory API, and full-state snapshot/restore for state introspection:

### Verilator Backend (~1000x faster)
- Auto-generated SystemVerilog via `#writeDesign` — boots Linux (5250 UART bytes at 10M cycles)
- Hand-written SystemVerilog reference at `verilator/rv32i_soc.sv` (3944 UART bytes at 10M cycles)
- Both reach same kernel init region (PC 0xC013A9xx) — generated SV matches reference behavior
- VCD waveform tracing for debugging
- Firmware test mode + OpenSBI/Linux boot mode

```bash
# Lean simulation
lake test  # Includes RV32 simulation tests

# JIT simulation from Lean (compile + load + run)
lake exe rv32-jit-test verilator/generated_soc_jit.cpp firmware/firmware.hex 5000

# JIT loop test (loopMemoJIT Signal API + Streaming API)
lake exe rv32-jit-loop-test verilator/generated_soc_jit.cpp firmware/firmware.hex 5000

# JIT cycle-skip test (register snapshot/restore roundtrip)
lake exe rv32-jit-cycle-skip-test

# JIT oracle test (self-loop detection, 10M cycles with cycle-skipping)
lake exe rv32-jit-oracle-test

# JIT dynamic warp test (memsetWord + dynamic oracle with JITHandle access)
lake exe rv32-jit-dynamic-warp-test

# JIT speculative warp test (snapshot/restore + guard-and-rollback)
lake exe rv32-jit-speculative-warp-test

# CppSim (standalone C++)
cd verilator && make build-cppsim && make run-cppsim

# Verilator simulation (much faster)
cd verilator && make build
./obj_dir/Vrv32i_soc ../firmware/firmware.hex 500000  # Firmware tests

# Linux kernel boot
./obj_dir/Vrv32i_soc ../firmware/opensbi/boot.hex 10000000 \
    --dram /tmp/opensbi/build/platform/generic/firmware/fw_jump.bin \
    --dtb ../firmware/opensbi/sparkle-soc.dtb \
    --payload /tmp/linux/arch/riscv/boot/Image
```

---

## SV→Sparkle Transpiler {#sv-transpiler}

Parse existing SystemVerilog RTL (e.g., PicoRV32) into Sparkle IR and execute via JIT — **no Verilator required**.

### Pipeline

```
picorv32.v + picorv32_soc_m.v
    → [SV Parser]    Parse to SV AST (250 items, 32 always blocks)
    → [Lower]         MemorySSA sequential emitter + MUX for always @*
    → [Flatten]       Inline sub-modules with parameter overrides
    → [CppSim]        Generate C++ simulation class
    → [GCC -O2]       Compile to shared library
    → [JIT dlopen]    Load and simulate from Lean
```

### C Firmware Execution

GCC-compiled bare-metal firmware runs on the transpiled PicoRV32 SoC with M-extension:

```bash
# Build RV32I firmware
cd firmware
riscv32-none-elf-gcc -march=rv32i -mabi=ilp32 -O2 -nostdlib \
    -T link.ld -o firmware_rv32i.elf boot_rv32i.S main_rv32i.c

# Build RV32IM firmware (MUL/DIV/REM)
riscv32-none-elf-gcc -march=rv32im -mabi=ilp32 -O2 -nostdlib \
    -T link.ld -o firmware_rv32im.elf boot_rv32i.S main_rv32im.c

# Run 34 tests via Sparkle JIT (no external dependencies)
lake exe svparser-test
```

| Test | Expected | Actual |
|------|----------|--------|
| Fibonacci | 0,1,1,2,3,5,8,13,21,34 | ✓ exact match |
| Array Sum | 360 | ✓ |
| Bubble Sort | 3,8,17,42,55,99 | ✓ exact match |
| GCD | 6,25,1 | ✓ exact match |
| MUL 7*6 | 42 | ✓ |
| MUL 12345*6789 | 83810205 | ✓ |
| DIV 42/6, 100/7 | 7, 14 | ✓ |
| REM 42%6, 100%7 | 0, 2 | ✓ |
| Factorial 10! | 3628800 | ✓ |
| Store/Load 0x12345678 | 0x12345678 | ✓ |
| **Final (RV32I)** | **0xCAFE0000** | **ALL PASSED** |
| **Final (RV32IM)** | **0xCAFE0000** | **ALL PASSED** |

### 34 CI-Safe JIT Tests

All tests embed Verilog as string literals — no external file dependencies:

| Tests | Description |
|-------|-------------|
| 1-6 | Parser, lowering, round-trip, JIT simulation |
| 7-11 | PicoRV32 parse, SoC "Hello", C firmware (requires `/tmp` files) |
| 12-16 | IR regression: nested if/else, case(1'b1) priority, concat-LHS, for-loop SSA, array read |
| 17-21 | pcpi_mul: IR structure, SSA chain, topo sort, JIT 7*6/100*100/12345*6789/consecutive/wrapper |
| 22-26 | JIT pairs: if/else+register, array read, case decoder, read-overwrite, for-loop accumulator |
| 27-29 | Bit replication `{4{en}}`, byte-lane write/read, multi-bit replication `{2{din}}` |

### Key Algorithms

- **MemorySSA Sequential Emitter**: Process `always @*` statements top-to-bottom, creating SSA wires for each variable write. Correctly handles read-then-overwrite patterns that cause cyclic dependencies under pure MUX.
- **If-Conversion (Guarded Assignments)**: Walk statement tree tracking guard conditions; emit flat priority mux chain with `!covered` guard for first-match-wins in `case(1'b1)`.
- **Generate Block Evaluation**: Resolve `generate if (PARAM)` using parameter defaults + overrides. 2-pass architecture for nested hierarchies.
- **Concat-LHS Bit Scatter**: Handle `{next_rdt[j+3], next_rd[j+:4]} = expr` via read-modify-write decomposition with `__RMW_BASE__` placeholder resolved by `stmtsToMuxExprBlocking`.
- **Byte-Strobe Memory**: Detect `if(wstrb[N]) memory[addr][hi:lo] <= wdata[hi:lo]` and generate read-modify-write with per-byte mask and proper shift.
- **Bit Replication**: `{N{expr}}` → `(0-val) & mask` for 1-bit, `concat` for multi-bit.
- **For-Loop Unrolling**: Static unroll with parameter substitution. MemorySSA handles sequential dependencies natively.
- **$signed Sign Extension**: SHL+ASR pattern with min-32-bit types and 64-bit promotion to prevent C++ UB.

### Inline Formal Verification (`verilog!` Macro)

Write Verilog directly in Lean, prove properties at compile time:

```lean
import Tools.SVParser.Macro

verilog! "
module counter8_en (input clk, input rst, input en, output [7:0] count);
    reg [7:0] count_reg;
    assign count = count_reg;
    always @(posedge clk) begin
        if (rst) count_reg <= 0;
        else if (en) count_reg <= count_reg + 1;
        assert(rst ? (count_reg == 0) : 1);  // auto-proved!
    end
endmodule
"

open counter8_en.Verify

-- These theorems prove against the auto-generated nextState
theorem counter_holds (s : State) :
    nextState s { rst := 0, en := 0 } = s := by simp [nextState]

-- auto_assert_0 was auto-generated and auto-proved by bv_decide
-- Change the assert condition → instant error in editor
```

Features:
- **`verilog!` macro**: Parse Verilog at elaboration time, inject State/Input/nextState
- **`assert(cond)`**: Auto-generate `theorem` and prove via `simp [nextState]; bv_decide`
- **Instant feedback**: Edit Verilog, save → proofs re-check in milliseconds
- **No Lean knowledge needed**: Write standard Verilog assertions, get mathematical proofs

---