# RV32IMA RISC-V SoC ## RV32IMA RISC-V SoC — Boots Linux A **complete pipelined RV32IMA CPU** with S-mode, Sv32 MMU, and peripherals — written entirely in Signal DSL. **Boots Linux 6.6.0 via OpenSBI v0.9 on Verilator** — both hand-written and **auto-generated SystemVerilog** reach the same kernel init point. ``` OpenSBI v0.9 — Platform: Sparkle RV32IMA SoC — ISA: rv32imasu Linux version 6.6.0 ... #6 Thu Feb 26 06:29:23 UTC 2026 Machine model: Sparkle RV32IMA SoC Memory: 26208K/28672K available (1279K kernel code, 465K rwdata, ...) ``` | | Hand-written SV | Generated SV (from Lean) | |---|:---:|:---:| | **UART output** | 3944 bytes | **5250 bytes** | | **Linux boot** | Kernel init | Kernel init | | **Final PC** | 0xC013A9xx | 0xC013A9xx | | **Firmware test** | ALL PASS | ALL PASS | ```lean -- 122 registers in a single Signal.loop — full SoC in one function -- State fields accessed by name via declare_signal_state macro def rv32iSoCSimulate (firmware : BitVec 12 → BitVec 32) : Signal dom SoCState := Signal.loopMemo fun state => rv32iSoCWithFirmwareBody firmware state ``` ### CPU Core - **4-stage pipeline**: IF/ID/EX/WB with hazard detection and data forwarding - **RV32IMA ISA**: Full integer + M-extension (MUL/DIV/REM) + A-extension (LR.W/SC.W/AMO) - **Multi-cycle divider**: Restoring division algorithm in Signal DSL ### Privilege & Virtual Memory - **S-mode + M-mode**: Full privilege separation with trap delegation (medeleg/mideleg) - **Sv32 MMU**: 4-entry TLB + hardware page table walker (PTW), megapage support - **CSRs**: mstatus, mie, mtvec, mepc, mcause, mtval, satp, sstatus, stvec, sepc, scause, stval, medeleg, mideleg, mcounteren, scounteren, PMP stubs ### Peripherals - **UART 8250**: TX/RX with LSR/IER/LCR/DLL/DLM registers (Linux-compatible) - **CLINT**: Machine timer with mtime/mtimecmp - **Memory**: 32 MB DRAM (byte-addressable, sub-word load/store: LB/LH/LBU/LHU/SB/SH) ### Signal DSL Ergonomics Ergonomic operators and macros reduce boilerplate in hardware descriptions: ```lean open Sparkle.Core.Signal -- Hardware equality (replaces (· == ·) <$> a <*> Signal.pure val) let isIdle := fsmReg === (0#4) -- Bool operators (replaces (· && ·) <$> a <*> b etc.) let startAndIdle := start &&& isIdle let shouldFlush := branchTaken ||| trap_taken -- Implicit constant lifting via Coe (replaces Signal.pure) let p3SavedNext := Signal.mux cond (true : Signal dom _) p3SavedReg -- Hardware conditional macro (replaces deeply nested Signal.mux) let fsmNext := hw_cond fsmReg -- default value | startAndIdle => (1#4 : Signal dom _) -- first match wins | stemDone => (2#4 : Signal dom _) | stageConvDone => (3#4 : Signal dom _) | isDone => (0#4 : Signal dom _) ``` All features are synthesis-compatible — `===` expands to `(· == ·) <$> a <*> b`, `hw_cond` expands to nested `Signal.mux` calls, and `Coe` unfolds to `Signal.pure`. ### Named State Accessors (`declare_signal_state`) Eliminates error-prone magic-number indices for hardware state tuples: ```lean -- Declare state with named fields, types, and defaults declare_signal_state BottleneckState | fsmReg : BitVec 2 := 0#2 | residualReg : BitVec 8 := 0#8 | resultReg : BitVec 8 := 0#8 | doneReg : Bool := false -- Access fields by name (no more projN! state 122 47) let fsmReg := BottleneckState.fsmReg state let residualReg := BottleneckState.residualReg state ``` Generates a tuple type alias, synthesis-compatible accessor `def`s, default value, and `Inhabited` instance. The RV32 SoC uses this for all 122 registers — adding/removing a field no longer requires updating every index. ### JIT Simulation — Transparent & Fast (~200x faster than Lean) Two APIs: **Signal API** (drop-in replacement for `loopMemo`) and **Streaming API** (O(1) memory for long runs): ```lean -- Signal API: same interface as loopMemo, JIT speed under the hood let soc ← rv32iSoCJITSimulate (jitCppPath := "verilator/generated_soc_jit.cpp") (firmware := fw) let out := soc.atTime 1000 -- SoCOutput with pc, uartValid, uartData, ... -- Streaming API: 10M+ cycles with per-cycle callback, O(1) memory rv32iSoCJITRun (jitCppPath := cppPath) (firmware := fw) (cycles := 10000000) (callback := fun cycle vals => do let out := SoCOutput.fromWireValues vals if out.uartValid then IO.println s!"UART: {out.uartData}" return true) -- continue ``` - **Compile C++ to shared library at runtime**, load via `dlopen` from Lean - Hash-based caching: recompilation skipped if source unchanged - Uses stable **named output wires** (`_gen_pcReg`, `_gen_uartValidBV`, etc.) — immune to DCE - 6 observable wires (via `SoCOutput.wireNames`), 11 memories, 6 input ports ### Simulation Performance (10M cycles, Apple Silicon) | Backend | Speed | vs Verilator | vs Lean | |---------|-------|-------------|---------| | **JIT evalTick (fused)** | **13.0M cyc/s** | **1.17x faster** | ~2600x | | **JIT evalTick + 6 wires** | **12.7M cyc/s** | **1.14x faster** | ~2540x | | JIT eval+tick (separate) | 12.7M cyc/s | 1.14x faster | ~2540x | | Verilator 5.044 | 11.1M cyc/s | 1.00x | ~2220x | | CppSim (-O3 AOT) | 6.0M cyc/s | 0.54x | ~1200x | | Lean loopMemo | ~5K cyc/s | — | 1x | JIT **exceeds Verilator speed** (1.17x faster) thanks to: (1) no mutex/thread overhead (Verilator 5.x wastes 17.4% on locks even single-threaded), (2) observable wire optimization (33 class members + 321 locals, L1-cache friendly), (3) fewer CPU instructions per sim-cycle, and (4) fused `evalTick()` with stack-local `_next` variables (eliminates ~260 intermediate memory ops/cycle). See [docs/BENCHMARK.md](docs/BENCHMARK.md) for detailed benchmark results, profiling data, and bottleneck analysis. ### JIT Cycle-Skipping — Dynamic Oracle Detects when the CPU is stuck in a tight halt loop and skips forward by advancing the cycle counter + CLINT timer, achieving **706x effective speedup**. The oracle receives the `JITHandle` directly for dynamic register reads, bulk memory ops (`memsetWord`), and direct state mutation: ```lean import Sparkle.Core.Oracle open Sparkle.Core.Oracle -- Create oracle: detects PC stuck in ≤12-byte range for 50+ consecutive cycles -- Oracle receives JITHandle per-call and handles all state mutations internally let (oracle, statsRef) ← mkSelfLoopOracle {} -- Run with oracle — 10M cycles complete in 9ms instead of 5.5 seconds let cycles ← JIT.runOptimized handle 10_000_000 wireIndices oracle callback -- Post-run: 9,998 triggers, 9,998,000 cycles skipped, UART output identical let stats ← statsRef.get IO.println s!"Skipped {stats.totalSkipped} cycles in {stats.triggerCount} triggers" -- Bulk memory fill (e.g., BSS zeroing for Linux boot) JIT.memsetWord handle memIdx addr 0 count ``` | Metric | Without Oracle | With Oracle | |--------|---------------|-------------| | Wall-clock time (10M cycles) | ~770 ms | **7 ms** | | Effective cyc/s | 13.0M | **~1.4 billion** | **BSS-Clear Speculative Warp**: A custom inline firmware (7-instruction BSS-clear loop) demonstrates the full pattern — the oracle detects the memory-clearing loop, bulk-zeros all 4 DMEM byte banks via `memsetWord`, and skips ~100K cycles in <1 ms (389 triggers, 99,584 cycles skipped). See `Tests/RV32/JITDynamicWarpTest.lean`. **Speculative Simulation with Rollback**: Full-state snapshot/restore via C++ default copy constructor enables guard-and-rollback speculation. The oracle snapshots state, speculatively applies bulk updates, checks guard conditions (e.g., timer interrupt), and rolls back if the guard fails — providing bit-accurate cycle-skipping even with interrupts. See `Tests/RV32/JITSpeculativeWarpTest.lean`. Also includes register snapshot/restore API (130 registers), bulk memory API, and full-state snapshot/restore for state introspection: ### Verilator Backend (~1000x faster) - Auto-generated SystemVerilog via `#writeDesign` — boots Linux (5250 UART bytes at 10M cycles) - Hand-written SystemVerilog reference at `verilator/rv32i_soc.sv` (3944 UART bytes at 10M cycles) - Both reach same kernel init region (PC 0xC013A9xx) — generated SV matches reference behavior - VCD waveform tracing for debugging - Firmware test mode + OpenSBI/Linux boot mode ```bash # Lean simulation lake test # Includes RV32 simulation tests # JIT simulation from Lean (compile + load + run) lake exe rv32-jit-test verilator/generated_soc_jit.cpp firmware/firmware.hex 5000 # JIT loop test (loopMemoJIT Signal API + Streaming API) lake exe rv32-jit-loop-test verilator/generated_soc_jit.cpp firmware/firmware.hex 5000 # JIT cycle-skip test (register snapshot/restore roundtrip) lake exe rv32-jit-cycle-skip-test # JIT oracle test (self-loop detection, 10M cycles with cycle-skipping) lake exe rv32-jit-oracle-test # JIT dynamic warp test (memsetWord + dynamic oracle with JITHandle access) lake exe rv32-jit-dynamic-warp-test # JIT speculative warp test (snapshot/restore + guard-and-rollback) lake exe rv32-jit-speculative-warp-test # CppSim (standalone C++) cd verilator && make build-cppsim && make run-cppsim # Verilator simulation (much faster) cd verilator && make build ./obj_dir/Vrv32i_soc ../firmware/firmware.hex 500000 # Firmware tests # Linux kernel boot ./obj_dir/Vrv32i_soc ../firmware/opensbi/boot.hex 10000000 \ --dram /tmp/opensbi/build/platform/generic/firmware/fw_jump.bin \ --dtb ../firmware/opensbi/sparkle-soc.dtb \ --payload /tmp/linux/arch/riscv/boot/Image ``` --- ## SV→Sparkle Transpiler {#sv-transpiler} Parse existing SystemVerilog RTL (e.g., PicoRV32) into Sparkle IR and execute via JIT — **no Verilator required**. ### Pipeline ``` picorv32.v + picorv32_soc_m.v → [SV Parser] Parse to SV AST (250 items, 32 always blocks) → [Lower] MemorySSA sequential emitter + MUX for always @* → [Flatten] Inline sub-modules with parameter overrides → [CppSim] Generate C++ simulation class → [GCC -O2] Compile to shared library → [JIT dlopen] Load and simulate from Lean ``` ### C Firmware Execution GCC-compiled bare-metal firmware runs on the transpiled PicoRV32 SoC with M-extension: ```bash # Build RV32I firmware cd firmware riscv32-none-elf-gcc -march=rv32i -mabi=ilp32 -O2 -nostdlib \ -T link.ld -o firmware_rv32i.elf boot_rv32i.S main_rv32i.c # Build RV32IM firmware (MUL/DIV/REM) riscv32-none-elf-gcc -march=rv32im -mabi=ilp32 -O2 -nostdlib \ -T link.ld -o firmware_rv32im.elf boot_rv32i.S main_rv32im.c # Run 34 tests via Sparkle JIT (no external dependencies) lake exe svparser-test ``` | Test | Expected | Actual | |------|----------|--------| | Fibonacci | 0,1,1,2,3,5,8,13,21,34 | ✓ exact match | | Array Sum | 360 | ✓ | | Bubble Sort | 3,8,17,42,55,99 | ✓ exact match | | GCD | 6,25,1 | ✓ exact match | | MUL 7*6 | 42 | ✓ | | MUL 12345*6789 | 83810205 | ✓ | | DIV 42/6, 100/7 | 7, 14 | ✓ | | REM 42%6, 100%7 | 0, 2 | ✓ | | Factorial 10! | 3628800 | ✓ | | Store/Load 0x12345678 | 0x12345678 | ✓ | | **Final (RV32I)** | **0xCAFE0000** | **ALL PASSED** | | **Final (RV32IM)** | **0xCAFE0000** | **ALL PASSED** | ### 34 CI-Safe JIT Tests All tests embed Verilog as string literals — no external file dependencies: | Tests | Description | |-------|-------------| | 1-6 | Parser, lowering, round-trip, JIT simulation | | 7-11 | PicoRV32 parse, SoC "Hello", C firmware (requires `/tmp` files) | | 12-16 | IR regression: nested if/else, case(1'b1) priority, concat-LHS, for-loop SSA, array read | | 17-21 | pcpi_mul: IR structure, SSA chain, topo sort, JIT 7*6/100*100/12345*6789/consecutive/wrapper | | 22-26 | JIT pairs: if/else+register, array read, case decoder, read-overwrite, for-loop accumulator | | 27-29 | Bit replication `{4{en}}`, byte-lane write/read, multi-bit replication `{2{din}}` | ### Key Algorithms - **MemorySSA Sequential Emitter**: Process `always @*` statements top-to-bottom, creating SSA wires for each variable write. Correctly handles read-then-overwrite patterns that cause cyclic dependencies under pure MUX. - **If-Conversion (Guarded Assignments)**: Walk statement tree tracking guard conditions; emit flat priority mux chain with `!covered` guard for first-match-wins in `case(1'b1)`. - **Generate Block Evaluation**: Resolve `generate if (PARAM)` using parameter defaults + overrides. 2-pass architecture for nested hierarchies. - **Concat-LHS Bit Scatter**: Handle `{next_rdt[j+3], next_rd[j+:4]} = expr` via read-modify-write decomposition with `__RMW_BASE__` placeholder resolved by `stmtsToMuxExprBlocking`. - **Byte-Strobe Memory**: Detect `if(wstrb[N]) memory[addr][hi:lo] <= wdata[hi:lo]` and generate read-modify-write with per-byte mask and proper shift. - **Bit Replication**: `{N{expr}}` → `(0-val) & mask` for 1-bit, `concat` for multi-bit. - **For-Loop Unrolling**: Static unroll with parameter substitution. MemorySSA handles sequential dependencies natively. - **$signed Sign Extension**: SHL+ASR pattern with min-32-bit types and 64-bit promotion to prevent C++ UB. ### Inline Formal Verification (`verilog!` Macro) Write Verilog directly in Lean, prove properties at compile time: ```lean import Tools.SVParser.Macro verilog! " module counter8_en (input clk, input rst, input en, output [7:0] count); reg [7:0] count_reg; assign count = count_reg; always @(posedge clk) begin if (rst) count_reg <= 0; else if (en) count_reg <= count_reg + 1; assert(rst ? (count_reg == 0) : 1); // auto-proved! end endmodule " open counter8_en.Verify -- These theorems prove against the auto-generated nextState theorem counter_holds (s : State) : nextState s { rst := 0, en := 0 } = s := by simp [nextState] -- auto_assert_0 was auto-generated and auto-proved by bv_decide -- Change the assert condition → instant error in editor ``` Features: - **`verilog!` macro**: Parse Verilog at elaboration time, inject State/Input/nextState - **`assert(cond)`**: Auto-generate `theorem` and prove via `simp [nextState]; bv_decide` - **Instant feedback**: Edit Verilog, save → proofs re-check in milliseconds - **No Lean knowledge needed**: Write standard Verilog assertions, get mathematical proofs ---