# Intuition: what each theorem actually says The Lean files in this repo prove formal statements. This file translates each main theorem into one or two paragraphs of plain English: what the bound controls, why the constant is what it is, and what would have to be true for the theorem to be useful in practice. For exact statements see [`theorem-map.md`](./theorem-map.md). For visual dependency structure see [`diagrams.md`](./diagrams.md). For the precise shape of every assumption see [`assumptions-and-nonclaims.md`](./assumptions-and-nonclaims.md). --- ## Generalization gap ``` genGap(H, S) = sup_{h ∈ H} ( populationRisk(h) − empiricalRisk(h, S) ) ``` How wrong can a worst-case hypothesis look on the training set compared to its true risk? `genGap` is the *one-sided* worst-case discrepancy: a learner that minimizes empirical risk can select a bad hypothesis only when some `h ∈ H` looks better on `S` than it really is. Bounding `genGap` from above bounds how badly empirical risk minimization can do. ## Rademacher symmetrization — `E[genGap] ≤ 2·E[Rad]` **What it says.** If you replace the unknown population with an iid copy of the training set (a "ghost sample"), the expected generalization gap is at most twice the expected *Rademacher complexity* of the loss class on the training set. Rademacher complexity is the expected sup over `H` of the correlation between the loss and a ±1 random sign vector — a measure of how well the class can fit pure noise. **Why this halves the problem.** Generalization is a question about unobserved data. The ghost-sample trick converts it into a question about *two* observed samples, which is tractable because both are iid from the same distribution. The factor of 2 is the cost of replacing one sample with two; it is not slack and cannot be removed without different techniques. **Constant in this repo:** the bound is `2 · E[empiricalRademacherComplexity]` exactly. The factor of 2 follows from the standard symmetrization argument in Boucheron-Lugosi-Massart (2013), Lemma 4.4. ## Massart's finite-class bound — `Rad(H, S) ≤ B·√(2·log|H| / n)` **What it says.** When `H` has finitely many hypotheses, the Rademacher complexity grows only logarithmically in `|H|`. The dependence on the sample size is `1/√n`, the bounded-loss range is `B`, and the constant is exactly `√2`. **Why log of `|H|`.** The bound is a tight maximal inequality on sub-Gaussian random variables. Each hypothesis contributes a sub-Gaussian random variable with parameter `B/√n`; the maximum of `K` such variables grows like `B·√(2·log K / n)` by a union argument that is sharp up to constants. There is no `√|H|` in finite-class learning theory because the log is what governs the scaling. **Use in the proof.** Plugging Massart into symmetrization gives `E[genGap] ≤ 2B·√(2·log|H| / n)` with no further work. The price of finiteness is logarithmic; the price of `n` is square-root. ## Azuma generalization-gap tail — `P(genGap − E[genGap] ≥ ε) ≤ exp(−ε²n / 8B²)` **What it says.** The generalization gap concentrates around its expectation: deviations of size `ε` are exponentially unlikely in the sample size. The `8B²` in the denominator is the Azuma constant for a sample-exposure martingale where each step changes the gap by at most `2B/n` and there are `n` steps. **The Azuma constant is loose by 4× compared to the sharp McDiarmid constant** (which would give `2ε²n / B²`). This repo currently uses the Azuma route because it composes directly with the high-probability layer above. The [`assumptions-and-nonclaims.md`](./assumptions-and-nonclaims.md) doc records this constant choice explicitly. **What you would change to make this useful.** Bigger `n` shrinks the tail exponentially. Bigger `B` (looser loss range) loosens the tail quadratically. To get a 95% confidence interval on the gap, set the right-hand side equal to 0.05 and solve for `ε`. ## High-probability Rademacher — `P(genGap ≥ 2·E[Rad] + ε) ≤ exp(−ε²n / 8B²)` **What it says.** Compose Rademacher symmetrization (controls the expectation) with the Azuma tail (controls the deviation around the expectation): a learner's worst-case generalization gap stays within `2·E[Rad] + ε` of its expectation with probability at least `1 − exp(−ε²n / 8B²)`. This is the theorem in the chain that turns the earlier expectation work into a finite-sample, high-probability statement. Everything before this is either an expectation bound or a single concentration step. ## Finite-class high-probability — `P(genGap ≥ 2B√(2log|H|/n) + ε) ≤ exp(...)` **What it says.** Specializing to a finite hypothesis class: the gap is at most `2B√(2·log|H|/n) + ε` with high probability. This is the canonical finite-class generalization bound used in introductory learning theory texts. **How tight is it?** The leading `2B√(2·log|H|/n)` is sharp up to constant factors for the worst case. The `+ ε` slack is the price of moving from expectation to high probability via Azuma; that price is `O(B/√n)` to match the leading term. ## Sauer-Shelah polynomial bound — `Σ_{k≤d} C(n,k) ≤ (en/d)^d` **What it says.** A class with VC dimension `d` realizes at most `(en/d)^d` distinct labelings on any sample of size `n`. Combinatorially: even though there can be infinitely many hypotheses, only finitely many *behaviors* are realized on any finite sample, and this number grows polynomially in `n`, not exponentially. **Why this is the bridge.** All the finite-class machinery above goes through unchanged once you replace `|H|` with the *number of effective behaviors on the sample*. Sauer-Shelah is what gives you a polynomial ceiling on that count, which makes `log(effective class)` grow like `d · log(en/d)` — the famous `O(d·log n)` rate for VC classes. ## VC pointwise Rademacher — `Rad ≤ B·√(2d·log(en/d) / n)` **What it says.** The same Massart bound, but with `log|H|` replaced by `d·log(en/d)` via Sauer-Shelah. This is the canonical VC-dimension generalization bound: `O(√(d/n))` rate up to log factors. **The shape.** This is *pointwise* (depends on the realized sample `S`) but the bound has no `S`-dependent constant — it holds for every sample in a class with VC dimension `d`. That uniformity is what distinguishes VC theory from finite-class theory. ## VC ERM excess risk tail — `P(excess risk ≥ 2B√(2d·log(en/d)/n) + ε) ≤ 2·exp(−ε²n / 8B²)` **The capstone.** Under bounded loss, iid samples, finite VC dimension `d`, the ERM learner's excess risk over the best-in-class hypothesis is at most `2B√(2d·log(en/d)/n) + ε` with probability at least `1 − 2·exp(−ε²n / 8B²)`. **Reading off sample complexity.** Set the failure probability to `δ` and the excess risk slack to `ε_0`; solve to get `n ≳ (B²/ε_0²) · (d·log(1/ε_0) + log(1/δ))`. This is the canonical VC sample-complexity rate. **Current boundaries.** - It is *not* a low-probability bound on a specific learner's true risk; it is a bound on excess risk over the in-class optimum. - The `2·exp` (not `exp`) reflects that the bound is two-sided in the uniform-deviation sense; the original Azuma gives a one-sided tail. - Bounded loss is essential: unbounded loss requires moment or sub-Gaussian assumptions that this repo does not formalize. - iid is essential: the Azuma exposure martingale assumes independent samples. Generalizing to `β`-mixing or martingale data needs different machinery. ## Binary VC bridge — `effectiveClass(0-1 loss).card = binaryClassTrace.card` **What it says.** When the loss is the 0-1 loss, the *effective class* on a sample (the set of distinct loss-vectors realized by hypotheses) has exactly the same cardinality as the *binary class trace* (the set of labelings hypotheses produce). This is a small bridge result that lets the VC machinery for 0-1 loss reuse the effective-class machinery built for general bounded loss. **Use in the proof.** Without this bridge, the VC sample-complexity results would have to be re-proved for the 0-1 case from scratch. With it, a one-line corollary gives the classical PAC bound for binary classification. ## Finite scalar contraction — `Rad_S(φ ∘ F) ≤ L·Rad_S(F)` **What it says.** If `φ` is scalar-valued, Lipschitz, and vanishes at zero, then composing a finite real-valued function class with `φ` does not increase empirical Rademacher complexity by more than the Lipschitz constant. This is the finite-sample, finite-class scalar contraction theorem. **Use in the proof.** Contraction is the tool that lets Rademacher bounds move from raw scores to losses or margins. The formalized theorem is not the full general Talagrand contraction theorem; it is the finite scalar result that fits the current library spine. ## Linear predictors — `Rad ≤ R·n⁻¹·√(Σ‖xᵢ‖²)` **What it says.** For finite-dimensional Euclidean linear predictors with `‖w‖ ≤ R`, empirical Rademacher complexity is bounded by the norm of the signed sample sum. The public corollary gives `Rad ≤ R·B / √n` when every input satisfies `‖xᵢ‖ ≤ B`. **Use in the proof.** This theorem connects the repo to a standard ML model class rather than only an abstract finite hypothesis class. It connects the Rademacher machinery to linear prediction in finite-dimensional Euclidean spaces. ## Finite sub-Gaussian chaining **What it says.** The finite chaining layer starts with MGF control for a finite sub-Gaussian process, turns that into expected finite-supremum bounds, and composes those bounds across finite net projections. The endpoint is a finite Dudley-style entropy-budget wrapper under a dyadic radius schedule. **Use in the proof.** This is the finite infrastructure needed before a continuous Dudley integral theorem can be stated responsibly. It proves the bookkeeping for finite nets, projection pairs, finite entropy sums, and per-scale entropy budgets. --- ## Reading order If you have read a learning theory textbook chapter on VC theory, the fastest path through this repo is: 1. **`Rademacher.Symmetrization`** — see how the ghost-sample trick is formalized as a measurable kernel argument. 2. **`Rademacher.Massart`** — see the maximal-inequality argument that turns sub-Gaussian tails into the `√(2·log|H|/n)` bound. 3. **`Azuma.GenGapTail`** — see the bounded-difference martingale that gives the Azuma exponential tail. 4. **`VC.SauerShelah`** — see the `(en/d)^d` polynomial bound proved by the standard double-counting argument. 5. **`VC.SampleComplexity`** — see how all of the above compose into the capstone ERM excess-risk bound. If you have not read the corresponding textbook chapter, the bounds will be hard to motivate from the Lean files alone. Recommended companions: - Boucheron, Lugosi, Massart (2013), *Concentration Inequalities*, Ch. 4 - Mohri, Rostamizadeh, Talwalkar (2018), *Foundations of Machine Learning*, Ch. 3 - Shalev-Shwartz and Ben-David (2014), *Understanding Machine Learning*, Ch. 6, 26