# LoRA Finetuning for BitNet — Development Guide

## Architecture Overview

```
┌──────────────────────────────────────────────────────────────────┐
│                    Training Pipeline                              │
│                                                                  │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│  │ Alpaca   │───▶│ Forward  │───▶│  Loss    │───▶│ Backward │  │
│  │ Dataset  │    │ + LoRA   │    │ (CE)     │    │ (AD)     │  │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘  │
│                       │                               │          │
│                       ▼                               ▼          │
│              ┌──────────────┐                ┌──────────────┐   │
│              │ GPU Kernels  │                │ Verified AD  │   │
│              │ (ShaderM)    │                │ (DiffOp)     │   │
│              └──────────────┘                └──────────────┘   │
│                                                      │          │
│                                              ┌──────────────┐   │
│                                              │ Numerical    │   │
│                                              │ Gradient     │   │
│                                              │ Check        │   │
│                                              └──────────────┘   │
└──────────────────────────────────────────────────────────────────┘
```

## File Structure

```
Hesper/
├── AD/
│   ├── Reverse.lean          # CPU scalar AD (tape-based, existing)
│   └── Verified.lean         # Verified AD framework (DiffOp, numerical VJP)
├── LoRA/
│   ├── Types.lean            # Config, Weight, Adapter, SavedActivations
│   ├── Init.lean             # Kaiming/zero initialization, RNG
│   ├── Forward.lean          # GPU kernels: A@x, B@h, fused add
│   ├── Backward.lean         # GPU kernels: dA, dB, dInput
│   ├── IO.lean               # Binary save/load of LoRA weights
│   └── Inference.lean        # LoRA-aware generate, batched forward+backward
├── Training/
│   ├── Loss.lean             # Cross-entropy forward/backward + GPU accumulation
│   ├── AlpacaDataset.lean    # JSON parser, prompt templating
│   ├── TrainLoop.lean        # Training utilities, gradient management
│   ├── VerifiedBackward.lean # CPU backward specs with numerical checks
│   └── AttentionBackward.lean# GPU attention backward kernels
├── Optimizer/
│   └── AdamGPU.lean          # GPU-accelerated Adam (has NaN issues, use SGD)
├── Layers/
│   ├── Attention.lean        # Modified: optional LoRA injection via loraOpt
│   └── TransformerBlock.lean # Modified: pass-through LoRA to attention
Examples/
├── Training/
│   └── AlpacaFinetune.lean   # End-to-end finetuning CLI
Tests/
├── BackwardVerification.lean # Run backward spec checks
├── VerifiedAD.lean           # Run verified AD checks
└── WrongBackwardTest.lean    # Prove checker detects wrong backwards
docs/
├── VERIFIED_AD.md            # How to add verified operations
└── LORA_FINETUNING.md        # This file
```

## Development Workflow

### 1. Define Spec → 2. Verify → 3. Implement GPU → 4. Test

This workflow ensures correctness at each step.

#### Step 1: CPU Spec (Pure Function)

Define forward and backward as pure Lean functions in `Hesper/AD/Verified.lean`:

```lean
def myOpFwd (x : Array Float) : Array Float := ...
def myOpBwd (x dy : Array Float) : Array Float := ...
```

**Key rule**: These are the **source of truth**. GPU kernels must match them.

#### Step 2: Numerical Verification

Register as `DiffOp` and verify:

```lean
def myOp : DiffOp := {
  name := "MyOp", forward := myOpFwd, backward := myOpBwd,
  testInput := #[...], testGradOutput := #[...]
}
-- In runVerification: verifyOp myOp → should PASS
```

Also test that wrong implementations are detected:

```lean
-- Intentionally wrong backward
def wrongOp : DiffOp := { myOp with backward := fun _ _ => #[0.0, ...] }
-- verifyOp wrongOp → should FAIL
```

#### Step 3: GPU Kernel (ShaderM)

Implement the kernel using `ShaderM` DSL:

```lean
def myOpBackwardKernel (n : Nat) : ShaderM Unit := do
  let gid ← ShaderM.globalId
  let i := Exp.vec3X gid
  ShaderM.if_ (Exp.lt i (Exp.litU32 n)) (do
    -- ... compute gradient matching CPU spec ...
  ) (pure ())
```

**Critical patterns for GPU kernels**:
- Always use `ShaderM.if_` guard (not `Exp.select` + write) to prevent OOB writes
- Use `ShaderM.var` + `ShaderM.assign` for mutable accumulators in loops
- Use `Exp.var varName` to read a mutable variable (not the initial binding)
- Match buffer names exactly between `declareInputBuffer`/`declareOutputBuffer` and `readBuffer`/`writeBuffer`

#### Step 4: Integration Test

Run GPU kernel and compare output to CPU spec:

```lean
-- Upload test data to GPU
-- Run GPU kernel
-- Download result
-- Compare to CPU spec output
-- Assert match within tolerance
```

## Lessons Learned

### Float64 → Float32 Conversion

Lean's `Float` is Float64. GPU buffers are Float32. Converting via `f.toBits` gives
Float64 bits — you MUST convert to Float32 format manually:

```lean
private def float64ToFloat32Bits (f : Float) : UInt32 := ...
```

The lower 4 bytes of Float64 bits are NOT Float32 bits. This caused astronomically
large initialization values that corrupted training.

### WGSL Reserved Keywords

`target` is a reserved keyword in WGSL. Buffer names must avoid reserved words.
Our cross-entropy kernel originally used `"target"` → renamed to `"target_id"`.

### WebGPU Buffer Aliasing

WebGPU does not allow the same buffer to be bound as both input and output in a
single dispatch. For in-place operations (like gradient clipping), use a single
`declareOutputBuffer` and read/write through it:

```lean
let _data ← ShaderM.declareOutputBuffer "data" (.array (.scalar .f32) n)
let val ← ShaderM.readBuffer (ty := .scalar .f32) (n := n) "data" i
ShaderM.writeBuffer (ty := .scalar .f32) "data" i (clamp val)
```

### Out-of-Bounds GPU Writes

Never use `Exp.select inBounds result (Exp.litF32 0.0)` followed by an unconditional
`writeBuffer`. Out-of-bounds threads will write 0.0 to memory beyond the buffer,
causing NaN corruption. Always use `ShaderM.if_` to guard:

```lean
-- WRONG: writes 0.0 to OOB indices
let result := Exp.select inBounds val (Exp.litF32 0.0)
ShaderM.writeBuffer "buf" i result

-- CORRECT: skips OOB threads entirely
ShaderM.if_ (Exp.lt i (Exp.litU32 n)) (do
  ShaderM.writeBuffer "buf" i val
) (pure ())
```

### Gradient Signal Strength

Without full attention backward, the gradient signal from `dLogits → dHidden` through
the LM head is too weak for effective LoRA training. The attention backward chain
(apply → softmax → scores → RoPE → dQ) amplifies the gradient to the correct
magnitude. Without it, `lr` must be impractically large.

### SavedActivations vs Gradient Checkpointing

The forward pass uses shared buffers (`layerBufs.normedBuf`) that get overwritten
each layer. For backward, you must either:
1. **Save activations** per layer during forward (memory-expensive)
2. **Recompute** activations in backward (compute-expensive but memory-efficient)

Current approach: recompute `h = A @ normedBuf` in backward using the last layer's
normedBuf. This is approximate — only the last layer's gradient is accurate.

### GPU Batching

The biggest performance win is batching GPU dispatches:

```lean
-- SLOW: 20 GPU syncs per token
forwardSingleToken ...    -- 1 sync
crossEntropyForward ...   -- 1 sync
crossEntropyBackward ...  -- 1 sync
...18 more dispatches...  -- 18 syncs

-- FAST: 1 GPU sync per token
beginBatch device
  forwardSingleToken ...       -- recorded
  crossEntropyForward ...      -- recorded
  crossEntropyBackward ...     -- recorded
  ...all dispatches...         -- recorded
endBatch device                -- 1 sync
```

Loss accumulation on GPU avoids per-token `mapBufferRead` (CPU←GPU sync).

## Known Issues

### Adam Optimizer NaN

The GPU Adam kernel (`AdamGPU.lean`) produces NaN when:
- `v_hat` becomes negative due to floating point (impossible mathematically but happens)
- Large gradient × large lr causes overflow

**Workaround**: Use SGD (`param += -lr * grad`) which is stable.

**Fix needed**: Clamp `v_hat` to non-negative before `sqrt`, clip update magnitude.

### Loss Plateau

With only last-layer attention backward, loss decreases very slowly because:
- Only layer 29's LoRA Q gets correct gradients
- Other layers' LoRA weights remain at initialization
- The gradient signal diminishes through many layers

**Fix needed**: Full multi-layer backward with per-layer saved activations.

### Speed

Current: ~4 seconds per example (1 token ≈ 30ms forward + backward).
Bottleneck: `writeBuffer` for token upload is per-token and causes GPU queue flush.

**Fix needed**: Pre-upload all tokens, use token index buffer.

## Running Tests

```bash
# Verified AD (backward correctness)
lake build verified-ad && ./.lake/build/bin/verified-ad

# Backward spec verification
lake build backward-verify && ./.lake/build/bin/backward-verify

# Wrong backward detection
lake build wrong-backward-test && ./.lake/build/bin/wrong-backward-test

# Training (small test)
lake build alpaca-finetune
./.lake/build/bin/alpaca-finetune \
  --model data/gguf/ggml-model-i2_s.gguf \
  --data data/alpaca_test.json \
  --epochs 5 --rank 8 --lr 1e-3

# Inference with LoRA
lake build bitnet-complete
./.lake/build/bin/bitnet-complete \
  data/gguf/ggml-model-i2_s.gguf \
  "Your prompt" 50 --lora lora_weights.bin
```

## Next Steps

### Priority 1: Full Multi-Layer Backward

Currently only the last layer gets correct attention backward gradients.
To fix:
1. Save `normedBuf` per layer during forward (30 × 2560 × 4 = 300KB)
2. In backward, iterate layers in reverse, using saved normedBuf
3. This gives all 30 layers correct gradients

### Priority 2: Differentiable Typeclass Integration

Use `Hesper.Core.Differentiable` to formally link forward/backward:

```lean
instance : Differentiable SoftmaxOp (Array Float) (Array Float) where
  forward _ := softmaxFwd
  backward _ := softmaxBwd
```

Then compose with verified chain rule:

```lean
instance [Differentiable f I M] [Differentiable g M O] :
    Differentiable (g ∘ f) I O where
  forward op x := g.forward op.2 (f.forward op.1 x)
  backward op x dy := f.backward op.1 x (g.backward op.2 (f.forward op.1 x) dy)
```

### Priority 3: GPU ↔ CPU Spec Consistency Test

For each GPU backward kernel, download GPU output and compare to CPU spec:

```lean
def testGPUKernel (gpuResult cpuResult : Array Float) (tol : Float := 1e-4) : Bool :=
  maxRelativeError gpuResult cpuResult < tol
```

### Priority 4: Formal Lean Proofs

Graduate from numerical checks to symbolic proofs:

```lean
theorem softmax_backward_correct (x dy : Vector ℝ n) :
    softmaxBwd x dy = jacobianTranspose (softmaxFwd ·) x dy := by
  ...
```

This requires Mathlib's analysis library but provides absolute correctness.