# Known Issues

## Issue 1: pcpi_mul Standalone JIT Tests (Test 21-21e) FAIL — **RESOLVED (2026-04-08)**

**Status**: Resolved. Tests 21, 21b, 21c, 21d now PASS. Test 21e (wrapper) tracked as Issue 7.

### Root cause

Two independent bugs in `Sparkle/Backend/CppSim.lean` were both corrupting Verilog
non-blocking (`<=`) semantics in the generated C++ `evalTick()`:

- **Bug A — `_waiting` false-positive in generic conditional-guard detection.**
  `wrapConditionalGuards` treated `_waiting` like enable signals (`_valid`/`_trigger`/
  `_enable`) and wrapped every line containing the prefix `mul_` in `if (mul_waiting) {...}`.
  That incorrectly gated the combinational wire `mul_start = pcpi_wait && !pcpi_wait_q;`,
  freezing the FSM once `mul_waiting` went to 0. Fixed by removing `_waiting` from
  `enablePatterns`: it is FSM state, not an enable gate.

- **Bug B — Self-ref register in-place optimization violated `<=` semantics.**
  The "self-ref register" optimization rewrote `reg_next` → `reg` via string replacement,
  turning non-blocking into blocking assignment. When one register's condition read a
  second self-ref register that had already been in-place written earlier in the same
  `evalTick()`, the condition saw the NEW value (e.g. `mul_waiting` blocking-assigned
  before `mul_finish`'s condition read it), so `mul_finish` never pulsed.

### Fix

1. Drop `_waiting` from `enablePatterns` in `wrapConditionalGuards`.
2. Remove the self-ref `_next` → `reg` string-replacement optimization entirely.
3. Initialize every evalTick-local `_next` to the current register value:
   `{cppType} reg_next = reg;`. This preserves `<=` semantics and lets Clang -O2
   elide redundant default stores, recovering the performance.

### Verification

Before: 28 passed / 6 failed (Tests 21, 21b, 21c, 21d, 21e, plus Test 11 stuck at
0 UART words because the CPU itself was frozen). After: 32 passed / 2 failed.
Test 11 now runs the C firmware for the full 200000 cycles (up from 0 words),
exposing a separate pre-existing UART bug tracked as Issue 6. Test 21e's first
multiply now succeeds (42 ✓); the wrapper-level 2nd-multiply failure is tracked
as Issue 7.

---

## Issue 1 (historical): pcpi_mul Standalone JIT Tests (Test 21-21e) FAIL

**Status**: Pre-existing bug, unrelated to recent reachabilityDCE / reverse synthesis changes.

**Affected tests** (`Tests/SVParser/ParserTest.lean`):
- Test 21: pcpi_mul JIT (7*6=42)
- Test 21b: pcpi_mul JIT (100*100=10000)
- Test 21c: pcpi_mul JIT (12345*6789)
- Test 21d: pcpi_mul consecutive MUL
- Test 21e: pcpi_mul SoC-like wrapper

All 28 other SVParser tests pass (Tests 1-20, 22-29). Test 11 (C firmware) SKIPs because hex files aren't present.

### Symptom

Direct JIT of PicoRV32 `picorv32_pcpi_mul` standalone module (without flattening):

```
Test 21: pcpi_mul JIT (7*6=42)... FAIL: ready=false result=0x0 expected=42 cycles=100
  cyc 0: wr=0 rd=0x0 wait=0 ready=0
  cyc 1: wr=0 rd=0x0 wait=0 ready=0
  cyc 2: wr=0 rd=0x0 wait=1 ready=0
  cyc 3: wr=0 rd=0x0 wait=1 ready=0
  ...
  cyc 99: wr=0 rd=0x0 wait=1 ready=0
```

The FSM enters `wait` state but never transitions to computing:
- `mul_waiting` stays 1 forever
- `mul_counter` doesn't decrement
- `rd` stays 0

### Why LiteX PicoRV32 works but standalone fails

- **LiteX SoC** (Test 10, LiteXTest): pcpi_mul is flattened into `picorv32` top module. Works correctly. Full LiteX firmware executes.
- **Standalone pcpi_mul** (Test 21): pcpi_mul is the top module itself, no flattening. FAILS.

This suggests the bug is in how SVParser lowers **unflattened** modules with complex always-block / for-loop structure.

### Suspicious code paths

1. **`Tools/SVParser/Lower.lean`** — `parseAndLower` (no flatten, no optimize)
   - Used directly by Test 21 via `let pcpiMulIR := parseAndLower pcpiMulVerilog`
   - `parseAndLowerFlat` (flatten + DCE + optimize) works correctly for LiteX

2. **`Tools/SVParser/Lower.lean:820-1200`** — `emitSequentialSSA` for `always @*` blocks
   - Handles nested for-loops (CARRY_CHAIN=4, 16 unrolled iterations)
   - Handles concat-LHS part-select writes (`{next_rdt[j+3], next_rd[j+:4]} = ...`)
   - Uses `__RMW_BASE__` placeholder for read-modify-write resolution

3. **`Sparkle/Backend/CppSim.lean:460-469`** — `collectTickRefWires`
   - Now includes register input refs (fixed today)
   - Previously only included memory refs → wires wrongly localized as 0-init locals

4. **`Sparkle/Backend/CppSim.lean:753-790`** — evalTick wire localization + self-ref optimization
   - Self-ref registers get `reg_next` replaced with `reg` directly
   - Default case generates `reg = reg;` (identity)

### Today's partial fixes (all merged)

1. **`inlineAssigns` infinite loop** (`Tools/SVParser/Verify.lean`): Register output names appeared as both registers AND assigns, causing infinite self-referencing expansion in `verilog!` macro. Fixed by filtering register assigns before inlining.
2. **reachabilityDCE register roots** (`Tools/SVParser/Lower.lean`): Registers are now unconditional roots in reachability DCE, preventing wire elimination when they feed FSM state.
3. **tickRefs from register inputs** (`Sparkle/Backend/CppSim.lean`): `collectTickRefWires` now includes wires referenced by register input expressions, preventing them from being localized as 0-init local variables in evalTick.
4. **Identity assign removal** (`Sparkle/IR/Optimize.lean` + `Sparkle/Backend/CppSim.lean`): Assigns of the form `x = ref x` (generated by SSA lowering of `output reg`) are now filtered before code emission.
5. **protectedWires** (`Sparkle/IR/Optimize.lean`): Wires directly referenced by register inputs are excluded from single-use inlining.

After all fixes: symptom changed from "garbage values (ready=127, wr=127)" to "FSM frozen (ready=0, never advances)". The last step (FSM frozen) is the remaining bug.

### Next steps for investigation

1. **Compare IR between standalone and LiteX flattened**
   - Does standalone pcpi_mul have all the same statements?
   - Is `mul_start` computed correctly? (`mul_start = pcpi_valid & instr_any_mul & ...`)
   - Are the wire ordering dependencies correct in the flat vs hierarchical lowering?

2. **Diff generated C++** between a working minimal counter and pcpi_mul
   - Look for wire localization that shouldn't happen
   - Check evalTick ordering (assigns before register updates?)

3. **Check `emitSequentialSSA` for edge cases**
   - Does it handle `mul_start = ...` at the top level correctly?
   - Are the SSA wire deps propagated through `always @*` properly?

---

## Issue 2: `verilog!` macro auto-assert proofs use `sorry`

**Status**: Intentional workaround for `bv_decide` compilation-mode hang.

**Location**: `Tools/SVParser/Macro.lean:92`

```lean
elabStr s!"theorem {assertName} ... := by sorry"
```

**Why**: `bv_decide` and `native_decide` work in interpreter mode (`lake env lean`) but hang indefinitely in compilation mode (`lake build`). This was observed with Lean 4.28.0-rc1.

**Impact**: Users must manually prove assertions in separate theorem files. The `verilog!` macro still provides the State/Input/nextState definitions for manual proofs.

**Workaround**: Use `lake env lean` for rapid iteration with auto-proved assertions, or write manual proofs.

### Cross-reference: `#verify_eq`, `#verify_eq_at`, and `#verify_eq_git`

`Sparkle/Verification/Equivalence.lean` provides three equivalence-check
commands:

- `#verify_eq f g` — pure `BitVec` function equivalence. Generates
  `theorem {f}_eq_{g} : f = g := by funext ...; unfold f g; bv_decide`.
- `#verify_eq_at (cycles := N) (latency := L) impl spec` — Signal DSL
  cycle-accurate equivalence. Generates a conjunction of `(impl
  inputs).val (t + L) = (spec inputs).val t` for `t ∈ [0, N)` and
  discharges each conjunct with `simp only [Signal.val_*]; bv_decide`.
  On failure, silently probes neighboring latencies and prints a 💡
  hint if one works.
- `#verify_eq_git <commit-ref> <ident>` — time-travel equivalence.
  Runs `git show <commit-ref>:<file-of-ident>`, wraps the old source
  in an isolated namespace, elaborates it, and runs `#verify_eq`
  between the current and old definitions. Intended for PR regression
  checks and bisecting "when did this break?" investigations.

All three command modules are pure elaborators (no `bv_decide` call at
build time) and are always safe to `lake build`. However, any file
that **invokes** any of them hits the compilation-mode hang. The
demo file `Tests/Verification/EquivDemo.lean` is therefore deliberately
excluded from every `lean_exe` root and from `Sparkle.lean` /
`Tests/AllTests.lean`. Run it interactively:

```bash
lake env lean Tests/Verification/EquivDemo.lean
```

Expected output: thirteen lines of `✅ verified: …` (eight pure-BitVec
demos, four Signal DSL demos including the 2-cycle-latency MAC
pipeline headline case, and one `#verify_eq_git` smoke comparing
`reluInt8` against `HEAD`). Uncomment the commented-out `macPipeBuggy`
variant in §11 or the `distrib_rhs_buggy` variant in §1 to see
`bv_decide` emit a concrete counterexample.

---

## Issue 3: High-level multi-domain simulation — **RESOLVED (2026-04-08)**

**Status**: Resolved. A typed, auto-dispatching `runSim` function now lives
in `Sparkle/Core/SimParallel.lean` and wraps both the single-threaded
`evalTick` loop and `JIT.runCDC`. The `sim!` macro and `generateSimWrappers`
emit `outputPortIndexByName` / `inputPortIndexByName` / `toEndpoint` so that
`runSim` can resolve connections by string name.

**Usage**:
```lean
sim! "module producer (...) ..."
sim! "module consumer (...) ..."

let p ← producer.Sim.load; p.reset
let c ← consumer.Sim.load; c.reset
let stats ← runSim
  [p.toEndpoint, c.toEndpoint]
  (connections := [("data_out", "data_in")])
  (cycles := 1000000)
```

See `docs/Tutorial.md` Step 6 for full walkthrough and
`Tests/Sim/SimRunnerTest.lean` for the 27-test regression suite.

### Residual limitations (separately tracked)

**Issue 3.1 — Multi-connection between the same pair of endpoints.**
`JIT.runCDC` currently transfers one output→input pair at a time. Passing
`connections := [("a", "b"), ("c", "d")]` to `runSim` is rejected at the
Lean layer. Resolving this requires extending the C++ runtime's CDC runner
to accept arrays of `(outPort, inPort)` pairs and N SPSC queues.

**Issue 3.2 — 3+ endpoints / arbitrary topologies.** Passing more than two
endpoints to `runSim` is rejected. The current runner assumes exactly two
domains (one producer, one consumer). A topology-aware scheduler is future
work.

---

## Issue 4: `CounterProps.lean` has `sorry` in auto-generated assert theorems

**Status**: Direct consequence of Issue 2.

**Location**: `Sparkle/Verification/CounterProps.lean`

The `verilog!` macro generates:
```lean
theorem auto_assert_0 (s : State) (i : Input) : ... := by sorry
```

Users should replace these with manual proofs using `simp [nextState]` + `decide` / `bv_decide` in a separate file (via `lake env lean`).

---

## Issue 5: M-ext SoC standalone test (`MExtRv32iTest.lean`) broken

**Status**: Pre-existing bug (confirmed via `git stash` test before recent changes).

**Affected**: `Tests/SVParser/MExtRv32iTest.lean` — runs M-extension firmware on standalone PicoRV32 SoC.

**Symptom**: Firmware runs 200000 cycles with 0 UART output.

**Likely related to**: Issue 1 (standalone pcpi_mul FSM frozen). MExtRv32iTest uses a standalone SoC wrapper that instantiates pcpi_mul, which is affected by the same FSM lowering bug.

**LiteX M-extension**: Works correctly via LiteX SoC wrapper (LiteXTest).

---

## Issue 6: UART output stuck / garbage — **RESOLVED (2026-04-08)**

**Status**: Resolved. Tests 10 and 11 now PASS (Test 10 prints "Hello", Test 11
runs the full C firmware test suite — fib, sum, sort, gcd — all OK).

### Root cause

`wrapConditionalGuards` in `Sparkle/Backend/CppSim.lean` was applying an unsound
"subsystem gating" heuristic to eval body lines:

1. Detect enable-like signals (`_valid`, `_trigger`, `_enable`).
2. Find the longest shared prefix of that enable signal that appears in ≥20 lines.
3. Wrap those lines in `if (enable) { ... }` to skip computation when the
   subsystem is idle.
4. A "lookahead" kept the block open across non-matching gap lines.

The heuristic was wrong on two axes:

- **Overbroad prefix match**: For picorv32, the detected enable was
  `cpu_decoder_trigger` with prefix `cpu_`. That matched not just the decoder
  body but also the memory interface state machine
  (`cpu__reg_mem_valid_next`, `cpu__reg_mem_wstrb_next`, etc.). Those lines
  MUST run every cycle or the memory interface freezes whenever the decoder is
  between instructions. Result: the CPU never completed memory transactions,
  so the UART handler never saw `mem_valid & !mem_ready`.
- **Lookahead traps unrelated lines**: The gap-filling kept the block open
  across lines like `uart_valid = uart_valid_reg;` and
  `uart_data = uart_data_reg;`, which have nothing to do with the decoder.
  Those output-wire assignments got gated by `cpu_decoder_trigger`, so the
  UART output wire stopped reflecting the register's value.

When combined, the firmware could neither execute instructions nor observe
UART output, and the test read stale/garbage values (`0x3A3A3A3A`, `0x20202020`)
from the output port every cycle.

### Fix

Disabled `wrapConditionalGuards` entirely in `emitModule`. The eval body is
now emitted without any gating. Performance is recovered by Clang -O2's
dead-store elimination: assignments whose results are not read before the next
overwrite are removed automatically.

### Verification

- Test 10: 0 UART events → **5 UART bytes: "Hello"** (firmware is the hex-encoded
  `sb` loop that writes H, e, l, l, o to 0x10000000).
- Test 11: 0 words → **26 words, ALL C TESTS OK** (Fibonacci + sum + sort + GCD).

Total: 33 passed / 1 failed (only Test 21e — wrapper consecutive MUL — remains,
tracked as Issue 7).

### Historical observation

This optimization was originally added for I-cache locality. The intent was
reasonable but the heuristic was never correctness-safe: any line that sits
near "enable-like" code got gated by an unrelated signal. If this optimization
is ever revived, it must work from the IR (analyzing which wires genuinely
feed an enable-gated cone of logic) rather than by string pattern matching on
generated C++.

---

## Issue 6 (historical): UART output is a 4-byte replication of a single ASCII byte

**Status**: Investigating. Pre-existing bug, newly exposed after Issue 1 fix.

**Affected tests**:
- Test 10 (PicoRV32 SoC with Verilog firmware): 2000 UART bytes, all `0x20` (space)
- Test 11 (C firmware RV32I): 200000 UART words, all `0x20202020`

**Symptom**: Every UART word read via `JIT.getOutput handle 0` is the same value — a
single ASCII byte replicated across all 4 bytes of the 32-bit word
(`0x20202020` after fix, `0x3A3A3A3A` before). The firmware completes its full run
(cycle count reaches the loop limit) and `uart_valid` does pulse, but every sample
is the same.

**Diagnosis**: The byte value looks like the first byte of the firmware's `.rodata`
initialization string. Combined with the 4-way byte-replication, this smells like
the same family of bug as the historical `{4{mem_la_write}}` → 1-bit lowering bug
documented in Test 27: byte-enable (`wstrb`) not being honored, so 32-bit word
writes clobber adjacent bytes. The UART memory-mapped write path is the likely
location.

**Why newly visible**: Before the Issue 1 fix, Test 11 produced 0 UART words because
the entire CPU was frozen by the `mul_waiting` guard (even without multiplies, the
false-positive `_waiting` guard poisoned evalTick's wire computation). Now the CPU
runs the full firmware, exposing this underlying UART bug.

**Next steps**: Inspect SoC Verilog UART mmio write and compare generated C++
`wstrb` handling against Test 27's known-good fix.

---

## Issue 7: pcpi_mul SoC-like wrapper consecutive MUL (Test 21e) — **RESOLVED (2026-04-08)**

**Status**: Resolved. Test 21e now PASSes; total ParserTest score is 34/34.

### Root cause

NOT a wrapper bug. The bug was in `Sparkle/IR/Optimize.lean`'s constant-folding
rule for AND with all-ones:

```lean
| .op .and [x, .const v w] =>
  if v != 0 && v == (2 ^ w - 1 : Int) then x
```

This rule rewrites `x & all-ones-of-width-w → x`. The rewrite is sound only when
`x` itself has width `w`, but the IR Expr type does not carry per-node widths,
so the optimizer fired the rule whenever the constant happened to look like
an all-ones mask of *its own* width — regardless of `x`'s width.

In the carry-save adder of `picorv32_pcpi_mul`, the SSA lowering produces
nibble-wise masks like `(slice(rd) >> 40) & 0xF` where the slice is 64-bit
and the mask `0xF` is encoded as a 4-bit constant. Because `0xF == 2^4 - 1`,
the rewrite kicked in and dropped the mask, leaving `slice(rd) >> 40` — a
full 64-bit value — as one operand of the carry-save 4-bit add. The add then
included high bits that should have been masked off, corrupting the carry
chain and producing wrong multiplication results for any non-trivial operand.

### Why standalone (Test 21d) passed but flat-wrapped (Test 21e) failed

`Test 21d` uses `parseAndLower pcpiMulVerilog` (no optimize), so the masks
were preserved in the generated C++ (visible as `& (uint8_t)15U`).

`Test 21e` uses `parseAndLowerFlat mulWrapperVerilog`, which runs the full
optimize pipeline including `foldConstants`. Inspecting the generated C++:

- Standalone: `((next_rd_seq28 >> 40) & 0xF) + ((next_rdx_seq1 >> 40) & 0xF) + ...`
- Flat-wrapped: `(next_rd_seq28 >> 40) + (mul0_rdx >> 40) + ...`  ← masks dropped

The standalone test only happened to pass because the optimizer wasn't run,
not because pcpi_mul was correct. Test 21e exposed the latent bug because the
wrapper module forces flat optimization. The "first 7*6 PASS, second 12345*6789
FAIL" pattern was a red herring: small operands (7, 6, 100) happen to be
correct anyway because their high bits are zero, so the missing mask doesn't
matter. Larger operands like 12345 (15 bits set) overflow the unmasked nibble
add and propagate spurious carries.

### Fix

Tightened the AND-identity rule in `Sparkle/IR/Optimize.lean` to only fire
when both operands are constants (where the rewrite is unconditionally sound
because we collapse to a single constant). For the `and(non-const, all-ones)`
case, we conservatively leave the AND in place rather than risk dropping a
necessary truncation. The cost is a few extra `& 0xF...F` ops in generated
C++ that Clang -O2 would constant-fold anyway.

Note: this also explains why Test 21e was misclassified as a "wrapper FSM"
bug at first — the `0x82008ad` value looked like consecutive-state corruption
but it was actually the deterministic miscalculation of `12345 * 6789` whenever
optimization was enabled. The fix is in `Optimize.lean`, not the wrapper.

---

## Issue 7 (historical): pcpi_mul SoC-like wrapper consecutive MUL (Test 21e) 2nd result wrong

**Status**: Investigating.

Initial (incorrect) hypothesis: wrapper FSM hazard. Actual root cause is the
`Optimize.lean` mask-removal bug above.

---

## Test Status Summary (2026-04-08, post Issue 1 fix)

| Test Suite | Pass | Fail | Notes |
|-----------|------|------|-------|
| OptimizeTest | 10/10 | 0 | All IR optimizer tests |
| ParserTest | **34/34** | **0** | All tests passing |
| LiteXTest (Phase 1-3) | 3/3 | 0 | Phase 4 requires JIT FFI |
| BitNet tests | ✓ build | - | `#eval`-based, verified at build time |
| AXI4-Lite tests | ✓ build | - | `#eval`-based |
| IP builds (RV32/BitNet/YOLOv8/Arbiter/Video/Bus) | ✓ | - | All build clean |
| Proofs (MulProps/MulOracleProof/ArbiterProps/Basic) | ✓ | - | Zero sorry, zero axiom |
| CounterProps | ✓ | - | 1.4s build (was 10+min) |
| MExtRv32iTest | 0/3 | 3 | Pre-existing, related to Issue 1 |

## Non-synthesizable Signal DSL patterns

Catalog of patterns that elaborate in Lean and simulate fine, but crash
`#writeDesign` / `#synthesizeVerilog` / `#sim`. Indexed by the error message
the user actually sees. See `docs/Tutorial.md` § "Reference: Writing
Synthesizable Signal DSL" for the rules, and `#check_synthesizable <name>`
for an early-warning linter.

### N1. `unknown free variable '_fvar.N'` from `Id.run do` / `let mut`

**Symptom**: `#writeDesign myIP` fails with `unknown free variable` pointing
at a `let mut` / `for` loop inside the body.

**Cause**: `Id.run do { let mut acc := ...; for w in weights do ...; acc }`
is meta-level mutation. It reduces during elaboration but leaves dangling
free variables that the Verilog backend cannot see through.

**Fix**: hoist the loop into a pure-Lean helper that runs *before* the
Signal expression, and fold the result into a single `Signal` combinator:

```lean
-- BAD
def body (xs : Array (Signal dom (BitVec 8))) : Signal dom (BitVec 8) :=
  Id.run do
    let mut acc : Signal dom (BitVec 8) := Signal.pure 0
    for x in xs do acc := acc + x
    return acc

-- GOOD
def body (xs : Array (Signal dom (BitVec 8))) : Signal dom (BitVec 8) :=
  xs.foldl (· + ·) (Signal.pure 0)
```

### N2. `ArchMode.rec is not supported` / `match on non-BitVec inductive`

**Symptom**: `match cfg.archMode with | .HardwiredUnrolled => ... | .Sequential => ...`
surfaces `ArchMode.rec` in the Verilog backend trace.

**Cause**: the backend can see through `match` on `BitVec`, `Bool`, and
`Signal`-level constructors, but not arbitrary enums unless the scrutinee
is a *literal* that reduces at elab time.

**Fix**: if the enum is known at the call site, inline the branch; if it
is truly runtime, encode it as a `BitVec` and use `Signal.mux`:

```lean
-- OK — cfg.archMode is a literal at the instantiation site
def myIP (cfg : Config) : Signal dom _ :=
  match cfg.archMode with
  | .Hardwired => hardwiredBody
  | .Sequential => sequentialBody

def top : Signal dom _ := myIP { archMode := .Hardwired, .. }  -- reduces
```

### N3. `if-then-else cannot be synthesized`

**Symptom**: `bitLinearSignal` or similar pure Lean helper has
`if macs.size == 0 then ... else ...` and the backend rejects it.

**Cause**: `if` on a runtime `Prop` / `Bool` that is not a `BitVec` guard
cannot lower to a Verilog mux. `Signal.mux` is the only runtime choice
operator.

**Fix**: either (a) make the condition a compile-time literal so it
reduces away, or (b) replace with `Signal.mux cond t f` where `cond : Signal dom (BitVec 1)`.

### N4. `Signal.val unsupported application`

**Symptom**: backend trace shows `Signal.val sig t` appearing in the
expression tree.

**Cause**: `Signal.val` is a meta-level observer that extracts a specific
tick's value. It only makes sense in simulation / proofs, never in
synthesis. Usually leaks in via a helper that was intended for proof use.

**Fix**: refactor the helper to return a `Signal dom _` instead of a
`BitVec _`, and compose with `Signal.map` / `Signal.ap` / arithmetic.

### N5. `Signal.loop` in a path fed to `#verify_eq_at`

**Symptom**: `#verify_eq_at` reports that `bv_decide` cannot unfold
`Signal.loop`.

**Cause**: `Signal.loop` is `@[implemented_by]` opaque; the equivalence
tactic cannot symbolically execute it. Synthesis works fine — this is a
*proof* limitation, not a synthesis one.

**Fix**: prove equivalence on the loop-free core first; use
`#verify_eq_at` only on components that don't feed back. See TODO V2 for
a planned `unfold_loop n` tactic.

### N6. Arrays of heterogeneous `Signal`s via `Array.map` returning `Signal`

**Symptom**: `Array.map (fun w => Signal.pure w * input) weights` fails
elaboration with a universe or reduction error.

**Cause**: meta-level `Array.map` is fine *as long as* the result array
is consumed by a pure-Lean fold before returning. Leaving an `Array (Signal ..)`
as the final return value sometimes confuses the backend because the array
becomes part of the expression tree.

**Fix**: always collapse the array with `foldl` or an explicit tree
reduction before returning:

```lean
let taps := weights.map (fun w => Signal.pure w * input)
taps.foldl (· + ·) (Signal.pure 0)  -- collapse to single Signal
```

### N7. Pure Lean recursion in the Signal body

**Symptom**: `def body : Signal dom _ := if n == 0 then Signal.pure 0 else body (n-1) + input`
either fails to elaborate (non-terminating) or produces an `unknown free variable`.

**Cause**: recursion on Lean-level naturals must fully unfold at elab
time. If the recursion depth is a `Nat` parameter not known literally,
the backend sees a stuck recursor.

**Fix**: parameterize by a concrete literal `(n : Nat := 8)` at the call
site, or replace recursion with `List.range n |>.foldl`.

### Quick triage

| Error fragment | Pattern | Rule violated |
|---|---|---|
| `unknown free variable` inside body | N1 | Rule 1 (no meta-mutation) |
| `X.rec is not supported` | N2 | Rule 3 (match on enum) |
| `if-then-else cannot be synthesized` | N3 | Rule 2 (use Signal.mux) |
| `Signal.val unsupported application` | N4 | Rule 4 (proof helper leak) |
| `bv_decide ... Signal.loop` | N5 | not synthesis, proof limit |
| universe / reduction on Array | N6 | Rule 1 |
| stuck recursor `Nat.rec` | N7 | Rule 3 |

Run `#check_synthesizable myIP` before `#writeDesign` to catch N1/N2/N3
proactively.

---

## Verified Benchmark Results (unchanged)

| Benchmark | Sparkle JIT | Verilator | Ratio |
|-----------|------------|-----------|-------|
| LiteX 1-core | **18.1M cyc/s** | 10.5M | **1.72x** |
| LiteX + reverse synthesis | **18.1M cyc/s** | 8.4M baseline | **2.14x** |
| 8-core parallel | **12.7M per-core** | 1.1M | **11.9x** |
| Timer oracle (proof skip) | **49 GHz effective** | — | **9,900x** |