ADR-005: Scenario-Fitted Recall Boundary and Dual-Metric Reporting
==================================================================

Status
------
Accepted (2026-06-17)

Context
-------
The rules-only recall path under src/recall/selectors/ contains a large set of
"narrow gates" and source-order coverage rules (see src/recall/narrowGates.ts,
which describes them as "narrow, scenario-fitted query classifiers"). A subset
of these — most visibly the eventOrder.* family in
src/recall/selectors/sourceOrderRules/ — are fitted to specific benchmark
conversations: their query patterns and facet regexes match literal proper
nouns and phrasings drawn from individual benchmark cases (e.g. a rule whose
QUERY_PATTERN requires the exact phrase "personal and work-related challenges"
and whose facets match strings like "Blue Bay Resort" / "collaborating with
Greg").

Two problems follow:

1. Misleading quality signal. The headline rules-only recall number (~0.96 on
   the internal benchmark) is produced partly by these case-fitted rules, which
   in practice fire only on the benchmark inputs they were written for. A real
   user's data is very unlikely to contain "Blue Bay Resort", so the case-fitted
   rules do not fire for that user. The reported number therefore overstates
   generalization.

2. Shipped overfitting. package.json "files" publishes both src/ and dist/, so
   the case-fitted rules are shipped to npm consumers as part of the library.

The repository already has partial, reactive guardrails in
tests/unit/architecture.boundaries.test.ts:
  - DISALLOWED_SELECTOR_FILENAME_PATTERN bans some case-name filenames
    (Alexis|Greg|Kimberly|Stephen|...).
  - DISALLOWED_SELECTOR_RUNTIME_FIXTURE_PATTERN bans some fixture-literal
    contents (ashlee|laura|mason|...).
These blocklists are incomplete: they enumerate names removed in past cleanups
rather than testing for the general pattern, so new case-fitted rules can still
land.
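
To illustrate what checking "the general pattern" could mean, the sketch below
flags capitalized words that remain in a regex source once escape sequences are
stripped. It is a heuristic illustration only, not a guard that exists in
tests/unit/architecture.boundaries.test.ts; the function name
findLiteralProperNouns is hypothetical, and the heuristic would also flag
legitimate capitalized words such as weekday names.

```typescript
// Hypothetical heuristic, not part of the repository's guardrails.
// Returns capitalized words (3+ letters) found in a regex source string.
function findLiteralProperNouns(patternSource: string): string[] {
  // Drop escape sequences such as \b, \S, \W so their letters are not
  // mistaken for the start of a capitalized word.
  const withoutEscapes = patternSource.replace(/\\./g, " ");
  const matches = withoutEscapes.match(/\b[A-Z][a-z]{2,}\b/g);
  return matches === null ? [] : matches;
}

// A facet pattern built from benchmark literals is flagged ...
console.log(findLiteralProperNouns("Blue Bay Resort|collaborating with Greg"));
// ... while a pattern keyed on structural ordering words is not.
console.log(findLiteralProperNouns("\\b(?:before|after|first|then)\\b"));
```

A check of this kind trades precision for coverage: it needs an allowlist for
legitimate capitalized terms, but it does not depend on someone having already
noticed a specific name.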

The authoritative way to quantify the gap already exists:
src/recall/narrowGates.ts exposes a kill switch
(GOODMEMORY_DISABLED_NARROW_GATES) and listRegisteredNarrowGateIds();
scripts/audit-narrow-gates.ts uses them to classify each gate as dead /
case_fitted / load_bearing against the BEAM benchmark.
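
The three verdicts can be read as a function of how many benchmark cases change
outcome when a single gate is disabled. The sketch below illustrates that
reading; it is not the logic of scripts/audit-narrow-gates.ts. The names
GateVerdict and classifyGate and the example gate ids are hypothetical. The
one-case threshold for case_fitted follows the admission criteria in this ADR;
treating two or more affected cases as load_bearing is an assumption.

```typescript
// Hypothetical sketch: classify a gate by the number of benchmark cases whose
// outcome changes when only that gate is disabled.
type GateVerdict = "dead" | "case_fitted" | "load_bearing";

function classifyGate(affectedCaseCount: number): GateVerdict {
  if (!Number.isInteger(affectedCaseCount) || affectedCaseCount < 0) {
    throw new RangeError("affectedCaseCount must be a non-negative integer");
  }
  if (affectedCaseCount === 0) return "dead"; // disabling it changes nothing
  if (affectedCaseCount === 1) return "case_fitted"; // serves a single case
  return "load_bearing"; // assumption: two or more cases depend on it
}

// Made-up gate ids and counts, for illustration only.
const exampleCounts: Array<[string, number]> = [
  ["example.gateA", 0],
  ["example.gateB", 1],
  ["example.gateC", 5],
];
exampleCounts.forEach(([id, count]) => {
  console.log(id, classifyGate(count));
});
```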

Decision
--------
1. Two recall numbers, always reported together. Whenever a rules-only recall
   figure is published (README, boards, eval artifacts), it MUST be accompanied
   by a "generalization recall" figure measured with every narrow gate disabled
   (GOODMEMORY_DISABLED_NARROW_GATES set to all registered gate ids). The gap
   between the two is the scenario-fitting contribution and must be visible, not
   hidden. scripts/list-scenario-gates.ts emits the id list for the kill switch;
   see Measurement below.

2. Admission criteria for new scenario rules. A new narrow gate / source-order
   coverage rule may be added only if ALL of the following hold, recorded in the
   commit message or linked board item:
   a. A general formulation was attempted first and demonstrably could not cover
      the case without unacceptable collateral noise.
   b. The rule keys on structural/semantic signals, NOT on literal proper nouns
      or verbatim phrases copied from a benchmark transcript.
   c. The rule is justified by >= 2 distinct cases (a single-case rule is
      "case_fitted" by the audit definition and is not admissible as new work).
   d. The author ran the audit (or recorded why it could not run) and the gate
      is not "dead".

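3. Pruning is evidence-gated. Gates are removed based on audit verdicts (dead,
   or case_fitted that the team decides to drop), never on guesswork. A verdict
   from a single split is not sufficient on its own; the removal rule under
   Measured also requires that the gate fire on no split and that its removal
   break no test. Pruning requires the BEAM benchmark dataset to be present so
   the audit can run; see the Local SQLite / benchmark notes for dataset
   provenance (BEAM is CC BY-NC 4.0 and is NOT vendored into the repo).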

4. No silent growth. The scenario-rule set should not grow without going through
   (2). A follow-up guard (a file-count freeze on
   src/recall/selectors/sourceOrderRules/) is recommended once the set has been
   pruned to its load-bearing core; it is intentionally NOT added now because an
   in-flight benchmark-closure effort is still editing that directory and a hard
   freeze would break its build.
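
The admission criteria in (2) can be expressed as a small checklist. The sketch
below is one possible encoding, assuming the author records the four answers in
a structured form; the AdmissionRecord shape, its field names, and
unmetCriteria are hypothetical and do not exist in the repository.

```typescript
// Hypothetical record of criteria 2a-2d for a proposed narrow gate.
interface AdmissionRecord {
  generalFormulationInsufficient: boolean; // 2a: general approach tried, fell short
  keysOnLiteralBenchmarkText: boolean; // 2b: must be false
  distinctCaseCount: number; // 2c: must be >= 2
  auditVerdict: "dead" | "case_fitted" | "load_bearing" | "not_run"; // 2d
  auditSkipReason?: string; // 2d: required when the audit could not run
}

// Returns the criteria that are not satisfied; an empty list means admissible.
function unmetCriteria(record: AdmissionRecord): string[] {
  const unmet: string[] = [];
  if (!record.generalFormulationInsufficient) {
    unmet.push("2a: no evidence that a general formulation was tried and fell short");
  }
  if (record.keysOnLiteralBenchmarkText) {
    unmet.push("2b: rule keys on literal proper nouns or verbatim phrases");
  }
  if (record.distinctCaseCount < 2) {
    unmet.push("2c: fewer than 2 distinct cases justify the rule");
  }
  if (record.auditVerdict === "dead") {
    unmet.push("2d: audit reports the gate as dead");
  }
  if (record.auditVerdict === "not_run" && !record.auditSkipReason) {
    unmet.push("2d: audit not run and no reason recorded");
  }
  return unmet;
}

// Made-up example: a single-case rule keyed on a benchmark phrase.
console.log(
  unmetCriteria({
    generalFormulationInsufficient: true,
    keysOnLiteralBenchmarkText: true,
    distinctCaseCount: 1,
    auditVerdict: "case_fitted",
  }),
);
```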

Measurement
-----------
Generalization recall (all scenario fitting disabled), once the BEAM dataset is
available at the diagnostic's benchmark root:

  GOODMEMORY_DISABLED_NARROW_GATES="$(bun run scripts/list-scenario-gates.ts)" \
    bun run scripts/run-phase-63-beam-recall-diagnostic.ts --run-id generalization

Per-gate dead/case_fitted/load_bearing classification:

  bun run scripts/audit-narrow-gates.ts

Measured (2026-06-18, BEAM 100K, goodmemory-rules-only, 355 evidence cases)
---------------------------------------------------------------------------
  fitted (all narrow gates on):       evidence-chat recall 0.9621, 20 missed
  generalization (all 151 gates off): evidence-chat recall 0.6822, 147 missed
  gap: 27.99 points; 127 of 355 evidence cases (36%) depend on narrow gates.
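
The derived figures follow directly from the two runs. The sketch below redoes
the arithmetic with the numbers reported above; the RecallRun shape and the
dualMetric function are hypothetical names, and counting dependent cases as the
difference in missed cases assumes that disabling gates never rescues a case
the fitted run missed.

```typescript
// Hypothetical helper that derives the dual-metric summary from two runs.
interface RecallRun {
  recall: number; // evidence-chat recall, 0..1
  missed: number; // number of missed evidence cases
}

function dualMetric(fitted: RecallRun, generalization: RecallRun, totalCases: number) {
  // Gap in percentage points between the fitted and generalization recall.
  const gapPoints = (fitted.recall - generalization.recall) * 100;
  // Assumption: every case missed by the fitted run is also missed with all
  // gates off, so the extra misses are the gate-dependent cases.
  const dependentCases = generalization.missed - fitted.missed;
  const dependentShare = dependentCases / totalCases;
  return { gapPoints, dependentCases, dependentShare };
}

const report = dualMetric(
  { recall: 0.9621, missed: 20 },
  { recall: 0.6822, missed: 147 },
  355,
);
console.log(report.gapPoints.toFixed(2)); // "27.99"
console.log(report.dependentCases); // 127
console.log((report.dependentShare * 100).toFixed(1)); // "35.8"
```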

The headline 0.96 is therefore substantially produced by scenario-fitted
classifiers; with every gate off, recall is 0.68. That figure is a floor rather
than an estimate of general recall, because it disables ALL narrow gates,
including any legitimately general ones. The per-gate audit
(scripts/audit-narrow-gates.ts) splits the gates into dead / case_fitted /
load_bearing so that the genuinely overfit subset can be pruned without losing
the general ones. Reproduce with the commands under Measurement above.

Gate pruning: what a single-split "dead" verdict does and does NOT license
(final finding, 2026-06-18):
Disabling the 14 gates flagged dead on the 100K split produced ZERO case deltas
there. But "dead on one split" is NOT a deletion license. Cross-checking those
14 against ALL splits (100K/500K/1M, 1,800 questions) AND the test suite gave:
  - 4 fire on 500K/1M (load-bearing on larger splits): aggregate.accommodationCost,
    aggregate.furnitureActivity, aggregate.medicalProvider,
    updateSeries.mortgagePreapproval.
  - 7 fire on no split but ARE pinned by unit/LongMemEval tests (real, tested
    behavior that the BEAM splits do not exercise): aggregate.aquariumTank,
    aggregate.magazineSubscription, aggregate.foodDeliveryService,
    aggregate.formalEducationDuration, and the 3 updateSeries gates
    (relationshipLatestLocation / recentFamilyTrip / sharedGroceryListMethod;
    these serve LongMemEval's "personal evidence families").
  - 3 fire on no split AND have zero test coverage — genuinely dead:
    aggregate.feedWeight, aggregate.bikeService, aggregate.healthIssueOrder.

Only the last 3 were removed (verified: typecheck clean, full suite green,
BEAM recall identical at 0.9621 / 20 missed). The first attempt, which deleted
gates on the 100K-only verdict, broke 14 tests and was reverted; that mistake
is the reason this ADR exists.

Rule: a gate is removable only if it fires on NO split AND its removal breaks NO
test. A single-split "dead" verdict is an ANALYSIS signal (it explains the
dual-metric gap), never a prune list. For everything else, surface the
overfitting via the dual metric and gate new rules via the admission criteria
above.
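
The rule can be applied mechanically to the 14 gates discussed above. The
sketch below encodes the evidence as stated in this ADR; the GateEvidence
shape and the helper names are hypothetical, the "updateSeries." prefix on the
three updateSeries ids is inferred from updateSeries.mortgagePreapproval, and
test coverage for the 4 gates that fire on larger splits is not stated here, so
it is recorded as null and treated as "not removable".

```typescript
// Hypothetical encoding of the cross-check evidence for one gate.
interface GateEvidence {
  id: string;
  firesOnAnySplit: boolean; // fires on at least one of 100K/500K/1M
  pinnedByTests: boolean | null; // null = not stated in this ADR
}

// Removable only if the gate fires on no split AND no test depends on it.
function isRemovable(gate: GateEvidence): boolean {
  return !gate.firesOnAnySplit && gate.pinnedByTests === false;
}

function gates(ids: string[], firesOnAnySplit: boolean, pinnedByTests: boolean | null): GateEvidence[] {
  return ids.map((id) => ({ id, firesOnAnySplit, pinnedByTests }));
}

const flaggedDeadOn100K: GateEvidence[] = [
  // Fire on 500K/1M; test coverage not stated in this ADR.
  ...gates(
    ["aggregate.accommodationCost", "aggregate.furnitureActivity",
     "aggregate.medicalProvider", "updateSeries.mortgagePreapproval"],
    true, null),
  // Fire on no split but are pinned by unit/LongMemEval tests.
  ...gates(
    ["aggregate.aquariumTank", "aggregate.magazineSubscription",
     "aggregate.foodDeliveryService", "aggregate.formalEducationDuration",
     "updateSeries.relationshipLatestLocation", "updateSeries.recentFamilyTrip",
     "updateSeries.sharedGroceryListMethod"],
    false, true),
  // Fire on no split and have no test coverage.
  ...gates(
    ["aggregate.feedWeight", "aggregate.bikeService", "aggregate.healthIssueOrder"],
    false, false),
];

const removable = flaggedDeadOn100K.filter(isRemovable).map((gate) => gate.id);
console.log(removable);
```

Requiring pinnedByTests to be exactly false keeps the check conservative: a gate
whose test status is unknown is never proposed for removal.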

Consequences
------------
+ The published quality signal becomes honest: readers see both the fitted and
  the generalizing recall, and the delta is explicit.
+ New overfitting is gated by explicit admission criteria rather than landing
  silently.
- Wherever the BEAM dataset is not available locally, the generalization number
  cannot be recomputed; the dual-metric policy remains in force, and the live
  figure is supplied by whoever holds the dataset (CI or the maintainer).
- The full prune of the eventOrder.* family is deferred to a dataset-backed
  session; this ADR records the boundary so the set stops growing in the
  meantime.

Related
-------
- src/recall/narrowGates.ts (kill switch, listRegisteredNarrowGateIds)
- scripts/audit-narrow-gates.ts, scripts/list-scenario-gates.ts
- tests/unit/architecture.boundaries.test.ts (selector overfitting guards)
- ADR-006 (module layering and shared contracts)