# Verified VJP Proofs

Machine-checked proofs that the backward pass (VJP) of every layer
matches its forward-pass Jacobian. Zero `sorry`s. If `lake build`
succeeds, every theorem is correct.

## Foundation: Mathlib's `fderiv`

Earlier drafts of this suite axiomatized the entire calculus
foundation — chain rule, sum rule, product rule, identity, reindex —
as eight opaque facts. The current foundation **flips that**:
`pdiv` is *defined* in terms of Mathlib's Fréchet derivative
`fderiv`, the structural rules are *theorems* proved from Mathlib's
API, and every downstream chapter threads a `Differentiable`
hypothesis through its compositions.

Many later-chapter axioms have been pruned the same way. Where
earlier drafts axiomatized `conv2d`, `maxPool2`, `depthwiseConv2d`,
or `geluScalar` as opaque functions with stated Jacobians, the
current version defines them concretely and proves their
gradient-related lemmas from the foundation.

The diff-threading branch closed out every remaining "provable but
deferred" Jacobian: `pdivMat_rowIndep`, `pdiv_softmax`,
`softmaxCE_grad`, `pdiv_gelu`, `pdiv_bnIstdBroadcast`, the BN
inverse-stddev smoothness, the row-wise softmax smoothness, and all
seven transformer-level composition chains.

The progression: **30 → 0 project axioms.** See `planning/VJP.md` (foundation
flip and per-chapter migration) and `planning/pdiv.md` (final 4-axiom retirement)
for the full elimination history.

## Dependency graph

```
Tensor.lean                    ← pdiv (def via fderiv) + VJP framework
  │
  │  pdiv_comp (chain rule)         ← theorem
  │  pdiv_add  (sum rule)           ← theorem
  │  pdiv_mul  (product rule)       ← theorem
  │  pdiv_id   (identity)           ← theorem
  │  vjp_comp  (VJP composition)    ← theorem
  │  biPath    (additive fan-in)    ← theorem
  │  elemwiseProduct                ← theorem
  │  pdivMat_rowIndep               ← theorem (was the last surviving Mat-axiom)
  │
  ├── MLP.lean                 dense (proved both sides) + ReLU (pdiv_relu proved,
  │                            relu/mlp _has_vjp = canonical-witness defs)
  │                            + softmax CE (proved, lives in Attention.lean)
  │
  ├── CNN.lean                 conv2d (def) + maxPool (def) + weight/bias grads (theorems)
  │                            conv2d_has_vjp3 (theorem); maxPool2_has_vjp3
  │                            (canonical-witness def — codegen substitutes argmax)
  │
  ├── BatchNorm.lean           BN (every axiom proved from foundation)
  │
  ├── Residual.lean            skip connections (biPath; zero new axioms)
  │
  ├── Depthwise.lean           depthwise conv (def) + weight/bias grads (theorems)
  │                            depthwise_has_vjp3 (theorem)
  │
  ├── SE.lean                  squeeze-and-excitation (elemwiseProduct; zero new axioms)
  │
  ├── LayerNorm.lean           LayerNorm (proved) + GELU (gelu Jacobian proved)
  │
  └── Attention.lean           softmax (proved) + SDPA (proved) + MHSA (proved)
                               + ViT body chains (proved) + patchEmbed (proved)
```

## Whole-network VJPs

Two forms, set by the architecture's activations:

- **Unconditional** (ViT, ConvNeXt, EfficientNet) — only smooth ops
  (GELU/Swish/sigmoid, softmax, LayerNorm, convolution; no ReLU, no
  max-pool), so `vit_full_has_vjp` / `convnext_has_vjp` /
  `efficientnet_has_vjp` are global `HasVJP`: correct at *every* input, with
  the `0 < ε` LayerNorm/BatchNorm positivity as the only side condition.

- **Conditional + concretely instantiated** (MLP, MNIST-CNN, ResNet,
  MobileNetV2) — ReLU/ReLU6/max-pool have genuine kinks, so the generic
  whole-network VJP is pointwise (`*_has_vjp_at`, under per-site
  off-the-kink hypotheses). Each is instantiated on a concrete small net
  with every hypothesis discharged (`MlpConcrete`, `Spatial`/`Mini`,
  `CnnConcrete`, `MobileNetV2Concrete`), proving the bundle is jointly
  satisfiable — not vacuous.

Conditionality is intrinsic to the math, not a formalization gap: it enters
exactly at the non-smooth operators and is *recovered* by the
smooth-activation nets. Two honesty notes on the concrete witnesses:
`MobileNetV2Concrete` is degenerate (constant output — ReLU6's two-sided
kink admits no cheap live witness, since pinning every input into `(0,6)`
forces constant activations), whereas `CnnConcrete` has a genuinely
injective stem; and all concrete nets are deliberately tiny.

## Axioms (0 project)

Pure-Mathlib closure on every theorem. `#print axioms vit_full_has_vjp`
shows only `propext`, `Classical.choice`, `Quot.sound` (Lean core).

The earlier 4-axiom floor was retired in Phase 7 (Apr 2026):

- `relu_has_vjp`, `mlp_has_vjp`, `maxPool2_has_vjp3` — converted from
  `axiom` to `noncomputable def` with the canonical pdiv-derived
  witness. `HasVJP.correct` holds by `rfl` since `pdiv` is a `def`
  over `fderiv` (post-foundation-flip). At non-smooth points the
  canonical backward is `fderiv`'s junk default of `0`; the codegen
  substitutes the standard subgradient/argmax convention — see
  "Codegen trust boundary" below.
- `pdiv_relu` — proved via local-diagonal-CLM transport
  (~80 LOC). At a smooth point (`∀ k, x k ≠ 0`), ReLU agrees with the
  diagonal indicator CLM `Π k, (if x k > 0 then proj k else 0)` on
  `Metric.ball x (min |x k|)` (every coordinate keeps its sign).
  `HasFDerivAt.congr_of_eventuallyEq` transports the CLM's self-fderiv
  to ReLU; direct evaluation at `basisVec i` reads off the entry.

**Tensor.lean** — calculus foundation: **0 axioms.** `pdiv` is a
`noncomputable def` over `fderiv`; every structural rule is a
theorem. `pdivMat_rowIndep` (the last surviving Mat-level axiom in
prior drafts) is now a theorem proved via the row-projection
`ContinuousLinearMap` and the chain rule, given a `Differentiable`
hypothesis on the per-row function.

**MLP.lean** — dense layers: **0 axioms.**

> `pdiv_dense`, `pdiv_dense_W`, `dense_weight_grad_correct`,
> `dense_bias_grad_correct`, and `pdiv_relu` are theorems.
> `relu_has_vjp` and `mlp_has_vjp` are `def`s over the canonical
> pdiv-derived witness. `softmaxCE_grad` is a theorem (relocated to
> `Attention.lean` next to `pdiv_softmax`).

**CNN.lean** — convolution and pooling: **0 axioms.**

> `conv2d` and `maxPool2` are concrete `def`s. The weight-grad and
> bias-grad VJPs (`conv2d_weight_grad_has_vjp`,
> `conv2d_bias_grad_has_vjp`) are theorems proved from foundation
> via `unfold + fun_prop`. `conv2d_has_vjp3` is a theorem (Phase 1,
> Apr 2026) — proved via `pdiv_finset_sum` × 3 +
> `pdiv_const_mul_pi_pad_eval` per-summand + Σ_(c, kh, kw) collapse.
> `maxPool2_has_vjp3` is a `def` over the canonical pdiv-derived
> witness.

**BatchNorm.lean** — the hard one: **0 axioms.**

> Every BN Jacobian is now a theorem. `pdiv_bnAffine` and
> `pdiv_bnCentered` were proved in Stage 1; `pdiv_bnIstdBroadcast`
> and the smoothness lemma `bnIstdBroadcast_diff` were the last to
> fall in the diff-threading branch — the centering CLM,
> `HasFDerivAt.sqrt` (under `bnVar + ε > 0`), and
> `(hasDerivAt_inv).comp_hasFDerivAt` close the chain. Every BN
> proof now carries a `(hε : 0 < ε)` hypothesis.

**Residual.lean** — skip connections: **0 axioms.** Pure composition
over `biPath_has_vjp` + `identity_has_vjp` from `Tensor.lean`.

**Depthwise.lean** — depthwise conv: **0 axioms.**

> `depthwiseConv2d` is now a concrete `def`, weight and bias
> gradients are theorems via `unfold + fun_prop`. `depthwise_has_vjp3`
> is now a theorem (Phase 2, Apr 2026) — same recipe as conv2d with
> one fewer Σ level (no cross-channel mixing in depthwise).

**SE.lean** — squeeze-and-excitation: **0 axioms.** Pure composition
over `elemwiseProduct_has_vjp` + `dense_has_vjp` + `identity_has_vjp`.

**LayerNorm.lean** — layer norm and GELU: **0 axioms.**

> `geluScalar` and `geluScalarDeriv` are now concrete `def`s using
> the standard `tanh`-approximation formula. `pdiv_gelu` is a
> theorem proved via `fderiv_apply` + chain rule with
> `geluScalar ∘ ContinuousLinearMap.proj j`, then
> `fderiv_eq_smul_deriv` to convert scalar `fderiv` ↔ `deriv`. A
> new `Real.differentiable_tanh` `@[fun_prop]` lemma (derived from
> `Real.tanh_eq_sinh_div_cosh` + `Real.cosh_pos`) carries the
> smoothness through. `layerNorm_has_vjp` reuses the BN proof
> template on a different axis.

**Attention.lean** — softmax, attention, ViT: **0 axioms.**

> `pdiv_softmax`, `softmaxCE_grad`, the three `sdpa_back_*_correct`
> theorems, `rowSoftmax_flat_diff`, and **every** transformer-level
> chain (`transformerMlp_has_vjp_mat`,
> `transformerAttnSublayer_has_vjp_mat`,
> `transformerMlpSublayer_has_vjp_mat`,
> `transformerBlock_has_vjp_mat`,
> `transformerTower_has_vjp_mat`, `vit_body_has_vjp_mat`,
> `mhsa_has_vjp_mat`, `mhsa_layer_flat_diff`,
> `classifier_flat_has_vjp`, `vit_full_has_vjp`) are theorems.
> `patchEmbed_flat`, `patchEmbed_flat_diff`, and
> `patchEmbed_flat_has_vjp` were the last to fall: Phase 6a (Apr 2026)
> de-opaqued the forward and proved Diff via `differentiableAt_pad_eval`;
> Phase 6b (Apr 2026) proved the closed-form input-VJP via the same
> recipe used for `conv2d_has_vjp3`/`depthwise_has_vjp3`, with one new
> wrinkle: split `Σ n : Fin (N+1)` into the n=0 (CLS row, zero img-grad
> contribution) and `Σ p : Fin N` (n = p.succ, conv projection).

Plus three Lean core axioms (`propext`, `Classical.choice`,
`Quot.sound`) present in every nontrivial Lean program.

**Total: 0 project axioms across all nine content modules.**

## Codegen trust boundary

`HasVJP.correct` certifies the *canonical* backward
`backward x dy i = ∑ j, pdiv f x i j * dy j`. Where `f` is everywhere
differentiable, this is the true Jacobian-vector product (and the
`_diff` theorems on each layer carry that hypothesis through).

For the two non-smooth ops — ReLU at `x i = 0`, MaxPool at argmax
ties — the canonical backward is `fderiv`'s junk default of `0`,
because Mathlib's `fderiv` returns `0` at non-differentiable points
by convention. The codegen (`MlirCodegen.lean`) does **not** emit the
canonical backward at the kinks. Instead it emits the standard ML-
framework subgradient conventions:

- ReLU: `if x > 0 then dy else 0` (the `relu'(0) := 0` convention).
- MaxPool: **tile-compare-select** — the gradient is tiled to the
  input shape, compared against the pooled output with `stablehlo.compare EQ`,
  and `stablehlo.select`-ed through the resulting mask. The codegen
  avoids `stablehlo.select_and_scatter` because IREE does not support
  it (see `MlirCodegen.lean` near the maxPool backward case). At
  argmax ties, the EQ-mask routes the gradient to *every* tied input
  cell, matching the PyTorch/JAX semantics.

These match the canonical Lean witness at smooth points and differ
only at the kinks. The verification gap is intrinsic to backward
passes through non-smooth ops — every ML framework lives with the
same gap. The numerical FD checks in `check_jacobians.py` and the
end-to-end oracles in `tests/vjp_oracle/` cover the codegen-emitted
formula at the kinks.

## The three rules

All of backpropagation:

```
vjp_comp              f ∘ g  →  back_f(x, back_g(f(x), dy))
biPath_has_vjp        f + g  →  back_f(x, dy) + back_g(x, dy)
elemwiseProduct_has_vjp  f * g  →  back_f(x, g·dy) + back_g(x, f·dy)
```

## The five Jacobian tricks

Every layer's backward pass is one of:

1. **Diagonal** — activations (ReLU, GELU): one multiply
2. **Sparse Toeplitz** — conv: reversed kernel convolution
3. **Binary selection** — max pool: route to argmax
4. **Rank-1 correction** — batch/layer norm, softmax: closed-form 3-term formula
5. **Outer product** — dense/matmul: input ⊗ grad

## Numerical gradient checks

`check_jacobians.py` runs 25 finite-difference checks. They cover the
codegen-emitted backward formulas (where the trust gap actually
lives — see "Codegen trust boundary" above), particularly at the
ReLU and MaxPool kinks where the codegen substitutes a subgradient
convention for the canonical Lean witness. Typical max-error is
~1e-11 in float64.

## Independent kernel re-check (comparator)

`tests/comparator/` runs
[leanprover/comparator](https://github.com/leanprover/comparator) on
51 theorems spanning the foundation rules, every chapter's headline
Jacobian, the public `*_has_vjp_correct` wrappers, and the five
whole-network VJPs (ViT, ResNet, MobileNetV2, ConvNeXt, EfficientNet).
comparator
re-runs Lean's kernel typechecker independently
of the elaborator and verifies the transitive axiom closure of each
proof. The configured allowlist is exactly
`{propext, Quot.sound, Classical.choice}` (Lean core); any project
axiom in the closure would fail the run. See
`tests/comparator/README.md` for the prereq + run instructions.

## Verify

```bash
lake build LeanMlir          # the library root imports every proof module
```

The `LeanMlir` root imports the whole proof suite (transitively), so this
one build type-checks all of it — no module list to keep in sync.

If it builds, it's correct. That's the point.