# Constant-Weight Bitmap Overlap Brief This repository formalizes a finite decision-theoretic claim about constant-weight bitmap overlap admission. It proves what follows when the stated finite assumptions hold. For a developer-facing version of the same result in vector DB and RAG terms, see [`rag-pipeline-guide.md`](rag-pipeline-guide.md). ## Full Quotient Stack The bitmap theorem is the clearest end-to-end specialization, but the checked Lean stack is broader. It also formalizes finite quotient search constraints, reachable-image/kernel containment, fiber topology, product quotients for query/document targets, score-induced ranking comparisons, componentwise multi-target signatures, finite observation windows for bundles of lossy probes, and margin/profile certificates for threshold decisions. These layers do not assert that a real encoder satisfies the contracts; they state what follows if finite observations, labels, scores, and margins support them. ## The Claim The checked theorem is: ```lean OrdvecFormalization.exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail ``` It says that, for `K`-active bitmaps in dimension `D`, if the null is uniform over `K`-active bitmaps and the signal law is modeled as a canonical finite exponential tilt of that law by literal bitmap overlap, then: 1. some overlap-tail rule is Bayes-optimal among all deterministic rules on the full constant-weight bitmap observation space; 2. under the uniform constant-weight bitmap null, that same threshold event has exactly the hypergeometric upper-tail probability. The cost-sensitive version is: ```lean OrdvecFormalization.exists_uniformBitmapOverlapTail_finiteCostedBayesRisk_le_and_hypergeomTail ``` The more general positive-base theorem is: ```lean OrdvecFormalization.exists_constantWeightBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail ``` There the Bayes-optimality result holds for any positive base law, while the hypergeometric equation is a separate uniform-null tail calibration at the same cutoff. ## Glossary - `K`-active bitmap: a bitmap with exactly `K` active coordinates. - Overlap-tail rule: accept a candidate iff bitmap overlap is at least a cutoff. - Canonical overlap tilt: a signal model that exponentially reweights a base law by overlap evidence. - Uniform constant-weight null: the uniform law over all `K`-active bitmaps. - Hypergeometric tail: the exact probability, under that null, of clearing an overlap cutoff. ## Why This Is The Right Shape The formalization treats admission as a finite statistical experiment: ```text full observation Z quotient Q(Z) overlap evidence T(Q(Z)) binary admission decision ``` The proof separates four claims that are easy to blur: - symmetry: for constant-weight bitmaps, overlap classifies the orbits of the query-stabilizer action; - decision theory: if the class likelihood ratio factors through that evidence and is monotone in it, the optimal deterministic rule is a threshold; - calibrated evidence: a supplied ordered-tail equality is packaged at the same Bayes-optimal cutoff without asserting deployment-null adequacy; - null calibration: under the uniform constant-weight bitmap null, the threshold event has the hypergeometric upper-tail probability. This is a quotient-sufficiency theorem, not a representation-learning theorem. ## Assumption Checklist - finite observation space, specialized in the strongest theorem to the constant-weight bitmap subtype; - fixed `D`, `K`, and query bitmap `q`; - `q.card = K`; - uniform `K`-active bitmap null for the strongest checked theorem; - positive finite base law for the more general theorem; - signal parameter strictly larger than null parameter; - deterministic admission rules; - idealized null calibration, not a deployment-corpus guarantee. ## What Builders Get For an implementation that uses constant-weight overlap admission, the useful engineering consequence is: ```text under an empirically validated monotone-overlap decision contract, candidate admission can be a calibrated popcount threshold instead of an arbitrary accept/reject rule. ``` The Lean development supports the rule shape and the idealized finite null: - threshold form: optimal under the finite monotone evidence model; - bitmap event: the theorem is stated over literal `K`-active bitmaps; - null calibration: the threshold event has a checked hypergeometric tail under the uniform constant-weight bitmap law; - costs: asymmetric false-accept / false-reject tradeoffs change the threshold, not the rule family. ## What Reviewers Should Check The proof is intentionally decomposed: - `FiniteExperiment.lean`: quotient-form optimality theorem for finite experiments. - `FiniteQuotientSearch.lean`: finite quotient version-space constraints, sample consistency, and same-bucket falsifiers. - `FiniteQuotientImage.lean` and `QuotientKernel.lean`: reachable image quotients and the kernel-containment characterization of target factorization. - `FiniteProductQuotient.lean`: generic product quotients for pairwise targets, product-sample falsifiers, fixed-left/fixed-right specializations, componentwise multi-target signatures, score-induced ranking comparisons, and reachable image/cardinality equalities for product and comparison quotient rule spaces. - `FinitePairQuotient.lean`: retrieval-facing query/document wrapper around the generic product quotient API. - `FiniteObservationWindow.lean`: finite families/windows of lossy probes, target-invariance on joint-code fibers, longer-window refinement, and same-window-code falsifiers. - `FiniteDecision.lean`: monotone predicates on finite supports are thresholds. - `MLR.lean`: monotone likelihood ratio makes Bayes admit predicates monotone. - `BayesThreshold.lean`: Bayes thresholds minimize finite pointwise risk. - `OrdinalSufficiency.lean`: quotient evidence plus monotonicity gives a threshold. - `CalibratedEvidence.lean`: supplied ordered-tail calibration travels with the selected Bayes-optimal cutoff. - `OverlapSufficiency.lean`: specializes the quotient bridge to actual overlap coordinates. - `CanonicalTilt.lean`: finite exponential tilts make the likelihood ratio a monotone function of the evidence. - `BitmapIncidence.lean`: literal bitmap overlap evidence and bitmap-tail event bridges. - `ExponentialTilt.lean`: ordered-support exponential tilts have MLR. - `FNCH.lean`: actual-overlap FNCH weights match the shifted tilt. - `OverlapNull.lean`: overlap-null theorem wrappers and compatibility aliases. - `OverlapBayesOptimal.lean`: finite Bayes-risk and cost-sensitive wrappers for the canonical overlap-tilt theorem. - `BitmapNull.lean`: exact hypergeometric bitmap overlap null. - `BitmapSymmetry.lean`: query-stabilizer coordinate permutations preserve overlap, same-overlap equal-cardinality bitmaps lie in the same query-stabilizer orbit, and invariant constant-weight bitmap statistics factor through overlap. - `BitmapCalibration.lean`: combines the canonical signal model with the exact bitmap null in the constant-weight bitmap theorem. Run: ```sh make build make verify make check-doc-names make audit make lint ``` The expected axiom baseline is: ```text [propext, Classical.choice, Quot.sound] ``` ## What Is Not Claimed - Real encoders are not proved to satisfy quotient/factorization assumptions. - Concrete production probe families are not proved globally target-sufficient without finite target-invariance evidence. - Pairwise retrieval, ranking, and multi-target pipeline decisions are not proved preserved unless the corresponding product quotient contract is empirically supported. - Ranking-comparison preservation alone does not guarantee top-k equality, tie-policy preservation, or nDCG preservation. - Product-image equalities are theorems about the full formal product domain, not arbitrary prefiltered traces or logged candidate graphs. - Near-threshold score decisions require explicit margin evidence. - Real encoders are not proved to satisfy the query-stabilizer symmetry assumption; the symmetry theorem identifies the canonical invariant when that assumption is appropriate. - The hypergeometric null is not claimed to be the deployment null for every corpus or embedding model. - Quotient sufficiency for one decision is not representation completeness. - Full observations are not dismissed; they may be necessary for training, transformation, calibration, margins, near-ties, and targets outside the modeled decision. ## Empirical Contract The formalization gives the conditions. A benchmark harness should audit them: - Does `P(relevant | overlap = t)` increase with `t`? - Is the estimated likelihood ratio `p(t | relevant) / p(t | nonrelevant)` monotone? - Do presumed negatives follow the hypergeometric tail, or a fitted effective null? - Do same-joint-code pairs ever disagree on relevance/admission labels when multiple compressed probes are combined? - Where does the contract fail: broad queries, cross-domain admission, paraphrase, sparse signatures, or near-tie top-K instability? The theory is strongest when paired with these diagnostics: Lean proves the decision consequences, and experiments test whether modern encoders operate in the regime where the contract is a good approximation.