# Building `libiree_ffi.so` (and running mnist-mlp out of the box)

The Lake build links every trainer against `ffi/libiree_ffi.so`, but that
file is **not** checked in — you have to build it once. This page is the
step-by-step. After it, `lake build mnist-mlp-train` should just work.

If you only want to skim: you need (1) `iree-compile` from pip, (2) the
IREE runtime built from source as static archives, (3) one `gcc` invocation
that wraps `ffi/iree_ffi.c` + the runtime archives into `ffi/libiree_ffi.so`.

The narrative version of how this came together (with the gotchas as they
were hit) lives in [`IREE.md`](IREE.md). The ROCm-specific variant is in
[`ROCM.md`](ROCM.md). This file is the consolidated recipe.

## What you need

| Thing | Why | How |
|---|---|---|
| Lean 4.29.0 | builds the trainer | `elan` (see main README §1) |
| `iree-compile` | Lean shells out to it to lower StableHLO → `.vmfb` | `pip install iree-base-compiler` |
| IREE runtime (static `.a`) | linked into `libiree_ffi.so` | build from source, runtime-only |
| GPU toolchain | runtime needs a backend | CUDA toolkit *or* ROCm 6.x |
| `ffi/libiree_ffi.so` | every Lean trainer links `-liree_ffi` | the link command in §4 |
| MNIST data | input | `./download_mnist.sh` |

## 1. Install the IREE compiler (plus CMake / Ninja)

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install iree-base-compiler cmake ninja
iree-compile --version
cmake --version
ninja --version
```

`cmake` and `ninja` are only needed for §2 (building the IREE runtime). If
your distro already ships recent versions on `PATH` you can skip them, but
pip-installing inside the venv avoids version-skew surprises — the IREE
runtime-build CMakeLists wants a relatively new CMake.

The Lean trainers shell out to `iree-compile` from `$PATH` (actually from
`./.venv/bin/iree-compile` — see `LeanMlir/Types.lean`), so make sure the
venv is active when you run them.

## 2. Build the IREE runtime from source

A naive `git clone --recursive` of `iree-org/iree` pulls in LLVM via the
torch-mlir / stablehlo submodule chains and balloons past 9 GB. We only
need the **runtime** submodules (~470 MB).

```bash
# Pick a sibling directory — these paths are referenced below.
cd ~/src   # or wherever
git clone https://github.com/iree-org/iree.git
cd iree

# Init only the submodules listed in runtime_submodules.txt
xargs -a build_tools/scripts/git/runtime_submodules.txt \
  git submodule update --init --depth 1
```

Then a runtime-only CMake build:

```bash
mkdir -p ../iree-build && cd ../iree-build

cmake ../iree -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DIREE_BUILD_COMPILER=OFF        `# we use pip's iree-compile` \
  -DIREE_BUILD_TESTS=OFF \
  -DIREE_BUILD_SAMPLES=OFF \
  -DIREE_HAL_DRIVER_DEFAULTS=OFF \
  -DIREE_HAL_DRIVER_LOCAL_SYNC=ON \
  -DIREE_HAL_DRIVER_LOCAL_TASK=ON \
  -DIREE_HAL_DRIVER_CUDA=ON        `# pick ONE of CUDA / HIP, or both` \
  -DBUILD_SHARED_LIBS=OFF          `# we want static archives`

ninja
```

For AMD/ROCm, swap `-DIREE_HAL_DRIVER_CUDA=ON` for
`-DIREE_HAL_DRIVER_HIP=ON`. You can enable both if you want one
`libiree_ffi.so` that supports either.

Build is ~30 seconds on a modern box. Output sits under
`iree-build/runtime/src/iree/...` as a tree of `.a` files, with
`libiree_runtime_unified.a` containing most of the runtime.

After the build, export the paths so the next step can find them:

```bash
export IREE_SRC=$HOME/src/iree            # adjust to your clone
export IREE_BUILD=$HOME/src/iree-build    # adjust to your build dir
```

## 3. (CUDA only) Pin the compile target if you're on Ada or newer

IREE 3.11's compiler has no GPU target metadata for `sm_89` and later
(see [iree-org/iree#21122](https://github.com/iree-org/iree/issues/21122),
[#22147](https://github.com/iree-org/iree/issues/22147)). The Lean
trainers already pass `--iree-cuda-target=sm_86` for this reason; PTX
JITs forward to Ada at load time. If you change targets, do it in the
`Main*Train.lean` file, not in this doc.

## 4. Build `libiree_ffi.so`

This is the one missing piece. The C source `ffi/iree_ffi.c` already
supports both CUDA and HIP via `#ifdef USE_HIP`, so the only thing you
choose is which driver(s) to compile in.

From the repo root:

```bash
cd ffi

# 4a. Compile the wrapper. Add -DUSE_HIP for AMD/ROCm.
gcc -fPIC -O2 -c iree_ffi.c \
  -I"$IREE_SRC/runtime/src" \
  -I"$IREE_BUILD/runtime/src" \
  -DIREE_ALLOCATOR_SYSTEM_CTL=iree_allocator_libc_ctl \
  # -DUSE_HIP

# 4b. Link against the static runtime. Note the --start-group / --end-group:
#     flatcc_verify_* lives in libflatcc_parsing.a, the rest in
#     libflatcc_runtime.a, and they reference each other.
gcc -shared -o libiree_ffi.so iree_ffi.o \
  -Wl,--whole-archive \
    "$IREE_BUILD/runtime/src/iree/runtime/libiree_runtime_unified.a" \
  -Wl,--no-whole-archive \
  -Wl,--start-group \
    "$IREE_BUILD"/build_tools/third_party/flatcc/libflatcc_runtime.a \
    "$IREE_BUILD"/build_tools/third_party/flatcc/libflatcc_parsing.a \
  -Wl,--end-group \
  -lm -lpthread -ldl

cd ..
```

Notes:
- `--whole-archive` is required around `libiree_runtime_unified.a` so the
  HAL driver registration symbols (which are pulled in by static
  constructors) actually make it into the `.so`.
- `IREE_ALLOCATOR_SYSTEM_CTL` is gated behind a compile-time macro; the
  define above wires it to `iree_allocator_libc_ctl` (gotcha #3 in
  `IREE.md`).
- Flatcc paths can drift between IREE versions — if the `.a` files aren't
  where this command expects, `find "$IREE_BUILD" -name 'libflatcc*.a'`
  and substitute.

### 4c. Fallback: rebuild flatcc verifier with default visibility (ROCm/HIP)

On at least one ROCm 7.2 / IREE setup, the simple link in §4b produced
a `.so` whose `flatcc_verify_*` symbols ended up as `GLOBAL HIDDEN` in
the dynamic symbol table — meaning they're present in the `.text`
section but absent from `.dynsym`. When the IREE runtime later tries
to verify a `.vmfb` file at session-create time, it errors out with:

```
error while loading: undefined symbol: flatcc_verify_table_as_root
```

If you hit that, the fix is to recompile **just the flatcc verifier
source** with `-fvisibility=default` and link the resulting object
ahead of the archive (so the linker picks the visible version):

```bash
cd ffi

# 4c-i. Compile the verifier source (NOT the prebuilt .a) with default visibility.
gcc -fPIC -O2 -fvisibility=default \
  -c "$IREE_SRC/third_party/flatcc/src/runtime/verifier.c" \
  -I"$IREE_SRC/third_party/flatcc/include" \
  -o flatcc_verifier_visible.o

# 4c-ii. Link as before, but include flatcc_verifier_visible.o BEFORE
#        the archives so its definitions win over the hidden ones.
gcc -shared -o libiree_ffi.so iree_ffi.o flatcc_verifier_visible.o \
  -Wl,--whole-archive \
    "$IREE_BUILD/runtime/src/iree/runtime/libiree_runtime_unified.a" \
  -Wl,--no-whole-archive \
  -Wl,--start-group \
    "$IREE_BUILD"/build_tools/third_party/flatcc/libflatcc_runtime.a \
    "$IREE_BUILD"/build_tools/third_party/flatcc/libflatcc_parsing.a \
  -Wl,--end-group \
  -lm -lpthread -ldl

rm flatcc_verifier_visible.o
```

After this, `readelf -Ws ffi/libiree_ffi.so | grep flatcc_verify_table_as_root`
should show `GLOBAL DEFAULT` (not `GLOBAL HIDDEN`).

Why this works: the IREE third-party flatcc build compiles its sources
with `-fvisibility=hidden`. The resulting `.o` files in the archives
have `GLOBAL HIDDEN` symbols which are stripped from `.dynsym` when
linked into a shared library. We bypass that by recompiling just
`verifier.c` with default visibility and putting it on the link line
before the archive — the linker resolves to our visible copy.

## 5. Verify the result

```bash
ls -lh ffi/libiree_ffi.so                  # ~1.4 MB
nm ffi/libiree_ffi.so | grep driver_module_register
# Should print iree_hal_cuda_driver_module_register  (or _hip_, or both)
```

If you skipped `--whole-archive`, the `driver_module_register` symbol
won't be present and session creation will fail at runtime with "no
HAL driver matching 'cuda'/'hip'".

## 6. Build and run

Every trainer is self-bootstrapping: it generates its own MLIR,
calls `iree-compile`, and starts training. Just build + run:

```bash
./download_mnist.sh                        # → data/*-ubyte
lake build mnist-mlp-train             # links -liree_ffi from ./ffi
.lake/build/bin/mnist-mlp-train data   # generates vmfbs + trains
```

Expected output: 12 epochs, ~14-16 s/epoch on a modest GPU (or ~90
s/epoch on CPU), final accuracy ≈ 97.9%.

For a bigger smoke test with Imagenette:

```bash
./download_imagenette.sh                   # → data/imagenette/
lake build resnet34-train
./run.sh resnet34                          # sets IREE_BACKEND + GPU
```

The first run spends ~10-15 min in `iree-compile` generating the
train-step vmfb; subsequent runs hit the cache and start training
immediately.

If you see `error while loading shared libraries: libiree_ffi.so`, it's
an rpath issue — the lakefile sets `-Wl,-rpath,./ffi`, so run the binary
from the repo root (not from `.lake/build/bin/`).

**Note:** the historical `mnist-mlp-train` (without the `-f32` suffix)
in `historical/MainMlpTrain.lean` is a pre-codegen artifact that still
uses hand-authored MLIR and a custom FFI. It's kept for reference but
is not the recommended path. Use `mnist-mlp-train` instead — it
uses the same unified `spec.train` loop as every other architecture.

## Common failure modes

| Symptom | Cause | Fix |
|---|---|---|
| `cannot find -liree_ffi` at link time | `ffi/libiree_ffi.so` missing | redo §4 |
| `undefined reference to iree_allocator_system` | forgot `-DIREE_ALLOCATOR_SYSTEM_CTL=...` | add the define in §4a |
| `undefined reference to flatcc_verify_*` | flatcc archives missing or wrong order | add both `libflatcc_*.a` inside `--start-group` |
| `no HAL driver matching 'cuda'` at runtime | `--whole-archive` was dropped | redo §4b with the wrap intact |
| `--no-allow-shlib-undefined` errors when linking the Lean trainer | Lean's bundled lld is strict about transitive glibc symbols | already handled — every trainer in `lakefile.lean` passes `-Wl,--allow-shlib-undefined` |
| `iree-compile` not found | venv not active in the shell that runs `lake build` | `source .venv/bin/activate` first |
| `undefined symbol: flatcc_verify_table_as_root` *at runtime, not link time* | flatcc symbols built with `-fvisibility=hidden`, stripped from `.dynsym` | use the §4c fallback (rebuild `verifier.c` with `-fvisibility=default`) |
| Trainer crashes with `device target "cuda"` error on a HIP machine (or vice versa) | `IREE_BACKEND` env var not propagated through tmux into the binary | set it via a shell wrapper script (see `run_*.sh` in the repo root for examples) — `IREE_BACKEND=cuda` is the default |
| Eval call fails with `module 'foo_eval' not registered` after epoch 10 | the trainer's eval string is misspelled vs the sanitized spec name | sanitize lowercases and replaces non-alphanum with `_` — "EfficientNet V2-S" → `efficientnet_v2_s_eval`, NOT `efficient_net_v2_s_eval` |

## What this gets you

Once `libiree_ffi.so` exists in `ffi/`, every other target in
`lakefile.lean` that links `-liree_ffi` (mnist, cifar, resnet, mobilenet,
efficientnet, vit, vgg, …) builds without further setup. The runtime
library is shared across all of them; only the `.vmfb` files differ.

## Running a real trainer (operational notes)

Building `libiree_ffi.so` is the hard part; running the bigger trainers
afterwards has a few non-obvious gotchas worth knowing up front.

**1. Use a shell wrapper to set env vars.** The trainers respect
`IREE_BACKEND` (default: `cuda`; set to `rocm` for AMD) and
`HIP_VISIBLE_DEVICES` / `CUDA_VISIBLE_DEVICES`. Setting these inline in
`tmux send-keys "FOO=bar binary"` works *most* of the time but we hit
cases where the env didn't propagate cleanly. The repo includes
`run_effnet.sh`, `run_vit.sh`, `run_mnv4.sh` etc as examples — short
shell scripts that `export` then `exec` the binary. Use one for any
trainer you care about.

```bash
# run_resnet34.sh
#!/bin/bash
export HIP_VISIBLE_DEVICES=0      # or CUDA_VISIBLE_DEVICES
export IREE_BACKEND=rocm          # or omit / set to cuda
exec .lake/build/bin/resnet34-train 2>&1 | tee resnet34.log
```

**2. The first run compiles vmfbs.** Each trainer generates its
StableHLO MLIR on launch and shells out to `iree-compile`. For
ResNet-sized models the train step is 500 KB-2 MB of MLIR and IREE
takes ~5-15 minutes to lower it to a `.vmfb`. You'll see:
```
Generating train step MLIR...
  517912 chars
Compiling vmfbs...
  forward compiled
  eval forward compiled
  compiled
```
…and then training starts. The vmfbs are written to `.lake/build/`.
Subsequent runs of the same binary regenerate the MLIR fresh each
time (because the trainer always calls `generateTrainStep`), so
expect the compile delay every launch unless you rip out that step.

**3. First val eval is at epoch 10.** If the eval forward MLIR is
wrong (or the eval call uses a misspelled module name) the trainer
runs for ~10 epochs of training and then crashes. Watch the first
val eval output before walking away from a long run.

**4. Running BN stats are critical for eval accuracy.** Don't be
surprised if epoch-10 val accuracy looks bad initially — running BN
EMA needs ~1 epoch to stabilize. The numbers in `RESULTS.md` are
real.

**5. Big models eat a long time.** EfficientNetV2-S is 38M params
and takes ~9 min/epoch on a single 7900 XTX, so 80 epochs = ~12 hours.
Plan accordingly.