# Entailer — Design Document (v0.1) *A logician's-pass linter for software artifacts. Open-source. Seeds `projects/entailer/`.* --- ## 1. One-paragraph pitch **Entailer** is a TypeScript-native engine that checks the *mathematical-logical soundness* of text artifacts at four escalating tiers — a single claim, a prompt, a Markdown document, a whole repo — by translating their load-bearing prose into propositional/first-order logic, then deterministically verifying **validity** and **consistency** and reporting them separately from **truth**. It is for engineers, prompt authors, spec writers, and AI agents who want "does this actually *follow* / are these requirements *consistent*?" answered with an auditable symbol dictionary and a reproducible certificate — not a vibe. **The one-sentence differentiation: markdownlint/Vale, but for logical validity and consistency over claims, prompts, docs, and repos — with `validity ≠ truth` as a first-class, machine-readable output contract that fallacy-classifiers and raw Z3 bridges both throw away.** --- ## 2. Name recommendation **`entailer`, npm scope `@entailer`** (both the unscoped name and the scope are confirmed free; 404 on registry probe). It names the exact relation the tool checks — entailment, `Γ ⊨ φ` — reads as an agent-noun ("the thing that entails"), and gives us a matching CLI bin and GitHub org. **Runner-up: `valens` / `@valens`** (Latin "being valid/strong", also free) — evocative but less self-explanatory, kept as the fallback if `entailer` collides on a GitHub org or `.dev` domain. --- ## 3. The core thesis The tool exists to protect one discipline the references hammer (ref 05, 99): **`validity ≠ truth ≠ faithful-formalization`**, reported as three independent fields, never collapsed. - **Validity / consistency** is what a deterministic verifier licenses (`Γ ∧ ¬φ` unsat ⇒ valid; a witnessed model ⇒ invalid/consistent). It is sound-by-construction and reproducible. - **Faithfulness** (does the formula capture the English?) is attestable only by an LLM-in-the-loop or a human; it is the *weak link* (ref 99) and the source of "rigorous, confident, wrong." - **Premise truth** is a question about the world and is explicitly out of scope. This forces a **strict two-layer architecture**: a **pluggable translator** (LLM-in-the-loop *or* deterministic rule-based) emits a typed **IR** carrying an auditable symbol dictionary; a **fully deterministic verifier core** consumes *only the IR, never raw text*, so the same checker runs identically whether the formalization came from Claude or a human. The verifier is the trusted kernel; the translator is an untrusted proposer whose product must always be shown. **A verdict without a visible formalization must not ship** (ref 99) — we make that a schema invariant, not a guideline. --- ## 4. Architecture ### 4.1 Monorepo layout pnpm workspaces + Turborepo + Changesets + tsdown (dual ESM+CJS). ~6 packages, the sweet spot for that toolchain. ``` entailer/ # projects/entailer/, later a public sibling clone ├── turbo.json pnpm-workspace.yaml tsconfig.base.json LICENSE (MIT) ├── packages/ │ ├── core/ @entailer/core # PURE. zero LLM/solver deps. The trusted kernel. │ │ ├── ir/ # FormalizedArgument + ClaimSet types + Zod + JSON Schema │ │ ├── parser/ # ASCII/Unicode logic DSL ↔ Formula AST │ │ ├── verify/ # native tableau + DPLL (propositional), refutation driver │ │ ├── lint/ # IR lint pass (coarse atoms, ∀∧ / ∃→ errors, free vars) │ │ ├── report/ # LogicReport type + toMarkdown + toSarif │ │ └── adapters/ # tier source-adapters: sentence | prompt | mdast | repo │ ├── solver/ @entailer/solver # OPTIONAL Z3/SMT bridge. z3-solver as optionalDependency. │ ├── translate/ @entailer/translate# LLM-in-the-loop translator (the ONLY LLM-touching code) │ ├── cli/ @entailer/cli # bin "entailer". depends on core (+ optional solver/translate) │ ├── mcp/ @entailer/mcp # stdio MCP server. zod-validate → call core → structuredContent │ └── viz/ @entailer/viz # view-model renderers (static SVG/HTML + React). depends on core types only ├── plugin/ # in-repo Claude Code plugin (.claude-plugin/plugin.json) │ ├── skills/formalize/ # VENDORED copy of the formalize skill (one-way sync) │ └── .mcp.json # declares @entailer/mcp via npx ├── apps/playground/ # Vite demo of viz (not published) ├── benchmarks/ # FOLIO/ProofWriter eval harness (dev-only; may host Python baselines) └── fixtures/ # golden LogicReports (ref-09 A/B/C + tier fixtures) ``` ### 4.2 How the surfaces compose without duplicating logic There is exactly **one place each concern lives**: - **Logic** (parse, verify, classify, lint, report-schema, tier-adapters) lives only in `@entailer/core`. - **The LLM translation step** lives only in `@entailer/translate` (and is re-invoked by the *plugin/skill in-context* — see §6). - **Solver escalation** lives only in `@entailer/solver`, behind a capability probe. - **Rendering pixels** live only in `@entailer/viz`; core emits typed **view-models** (JSON) and imports zero render libs. - **CLI / MCP** are thin: they validate input, call `core` (and optionally `solver`/`translate`), and shape output. They contain no logic. - **Prose/judgment knowledge** (fallacy catalog, NL→logic heuristics, caveats) lives only in the vendored `formalize` skill's `references/`. The machine-consumable parts of refs 07/09/16–22 (fallacy ids + signatures, verdict enum, severity ladder, honesty-contract string, the logic-fragment / escalation-target / flattened-kind taxonomy) are **generated** from fenced data blocks in those references into `core/src/generated/taxonomy.ts`, with a CI `gen && git diff --exit-code` gate so the TS enums can never silently drift from the prose. (Per the adversarial finding: "only types + algorithms" does *not* dodge duplication when the types *are* the taxonomy — so we generate, not hand-copy.) **Seeded now in `core/src/taxonomy.ts`** (hand-authored, ref-cited) ahead of the codegen (BUILD_PLAN T14): the deferred tail — ref 21 (`GENERALIZED_QUANTIFIER` fragment; `generalized-quantifier`/`ambiguous-reading` kinds) and ref 22 (`HOL` + `dependent-type-theory` fragments; the `PROOF_ASSISTANT` escalation target; `higher-order` kind) — is wired in, alongside the PL/FOL/modal/deontic/temporal/probabilistic/fuzzy base. ### 4.3 The internal IR Two record types in **one** schema (per the adversarial verdict on a single IR: the layer split is real but does **not** need a second schema — SMT-LIB itself unifies signature + per-argument assertions via push/pop). Propositional = the arity-0 fragment of FOL, so one Formula AST covers both. ```ts type Logic = 'PROP' | 'FOL'; type Source = 'stated' | 'supplied' | `reading:${string}`; // [supplied] enthymemes & per-reading interface SymbolEntry { kind: 'atom' | 'pred' | 'const' | 'func'; name: string; arity: number; gloss: string; // exact English — the trust anchor (ref 06 prime directive) source: Source; spans: SourceSpan[]; // provenance is a RELATION (0..n spans), not a 1:1 function } interface FormalizedArgument { logic: Logic; symbols: SymbolEntry[]; // the symbol dictionary, first-class premises: { ast: Formula; role: 'premise'|'premise-supplied'; source: Source; spans: SourceSpan[] }[]; conclusion?: { ast: Formula; source: Source; spans: SourceSpan[]; inferred?: boolean }; outOfScope: { text: string; kind: 'modal'|'deontic'|'temporal'|'probabilistic'|'causal'|'vague'|'higher-order'|'generalized-quantifier'|'ambiguous-reading'; spans: SourceSpan[] }[]; ambiguity?: { readings: FormalizedArgument[]; divergesOnVerdict: boolean }; translationConfidence: { band: 'low'|'med'|'high'; score: number; signals: string[] }; } // Corpus / cross-file layer = the SAME schema + alias edges as DATA (Tier 4 only; absent below) interface ClaimSet { args: FormalizedArgument[]; scopes: { id: string; argIds: string[] }[]; // one prompt / one doc / a cross-file cluster = a scope unification?: { a: string; b: string; basis: string; source: 'user'|'llm'|'lexical'; confidence: number }[]; } ``` The verifier reads only `args[].premises[].ast` / `conclusion.ast` and the `scopes`. Everything else (glosses, spans, confidence, ambiguity, out-of-scope) is for the human/LLM audit loop and the viz layer. --- ## 5. The five tiers One canonical pipeline; a **tier is a `SourceAdapter: (raw) => ClaimSet`** over one tier-agnostic engine. Tiers are an **additive capability ladder**, not five products. | Tier | Input | What the adapter adds | Consistency scope | Logic core change | |---|---|---|---|---| | **1 Sentence** | one claim | classification mode (tautology / contingent / contradiction via `φ` valid? `φ` sat? `¬φ` valid?) — no conclusion to recover | n/a (set of 1) | none | | **2 Prompt** | prose w/ implicit conclusion | conclusion recovery (indicator words → else LLM nominates + flags `inferred`) + enthymeme `[supplied]` recovery + argument validity | within-prompt | none | | **3 Markdown** | one `.md` | mdast extraction (remark-parse + unist-util-visit), drop `code`/`inlineCode`, heading-stack sectionPath, `position.start.line` → spans | within-doc | none | | **4 Repo** | a path | **file selection** (load-bearing prose) + per-file Tier-3 + cross-file scope | within + across clusters | none | | **5 PR** | base + head file sets (+ PR body) | **base→head delta**: cross-file consistency of both states, attributes each contradiction as introduced / fixed / pre-existing, folds the PR-description claims into the head; a configurable `gate` (head \| introduced) drives the exit code | head, vs a base baseline | none | **Tier 5 (PR) — the regression-gate view.** A PR is not "another repo": the differentiated check is a **delta**, established by prior art (SonarQube "Clean as You Code" applies only new-code conditions to PRs; ESLint bulk-suppressions / eslint-baseline / reviewdog all fail only on *newly-introduced* issues). Entailer computes the base claim-set vs the head claim-set and reports what the PR introduced / fixed on top of the head's absolute consistency, plus folds the PR description's fenced `entailer` claims into the head ("the PR says X but the code contradicts X"). Two knobs: `gate` (`head` = fail on any head inconsistency, the default; `introduced` = fail only when a new/changed claim participates in the minimal conflicting subset) and `descriptionSeverity` (`fail` | `warn` | `off`). **Honesty:** a head that is still inconsistent but not *regressed* is a legitimate PASS under `gate=introduced` — the verdict is `NO_ISSUE_FOUND` while `consistency.status` stays `UNSAT`, and a mandatory note + the `delta` field state "the head carries N pre-existing contradiction(s); this PR introduced none." Unlike the count-based CI baselines, entailer's claims have stable identity (`uri:dsl`), so attribution is **exact per-claim**, not an approximation. **Shared source-adapter model.** The engine never knows which tier it serves; only chunking, conclusion-recovery, and consistency *scope* differ. Logic strength is **independent of tier** — right-size per claim (a Tier-1 sentence may need FOL; a Tier-4 requirement may be pure propositional). **Repo-scale tractability bound.** Never the O(n²) all-pairs claim cross-product. Instead: **(1) select** load-bearing docs (deterministic glob allowlist — README, SPEC*, RFC*, ADR/decisions/*, *requirements*, *acceptance*, package manifests — *always* included without LLM gating; LLM classifier only for the ambiguous tail), defaulting to **high-recall** (a false-exclude inverts the verdict; a false-include is mere noise), and **report the selection manifest** (chosen vs skipped + why). **(2) Canonicalize predicates** into a repo-wide symbol dictionary *before* clustering (so "request" and "inbound call" collapse to one symbol). **(3) Cluster** claims by canonical-symbol overlap, run SMT consistency only within and across overlapping clusters, **(4) residual synonymy pass**: a small batched LLM judge over high-embedding-similarity cross-cluster predicate pairs, merging on a yes. Cap claims-per-solver-call and time-box each Z3 invocation. **Selection and synonym-bridging are fallible LLM steps; both are surfaced as explicit coverage caveats, never as a clean "consistent".** --- ## 5b. The concept-faithfulness lens (an orthogonal axis, v1.2.0) The five tiers ask *does the logic hold?* The **domain lens** asks a different question on an independent axis: *does the code stay faithful to its own concepts?* It reuses the infrastructure (Tier-4 selection, `path:line` provenance, the report honesty discipline, viz, MCP/CLI) but **not** the SAT/DPLL/tableau verifier, and introduces no new trusted kernel (framing ADR 0012; design ADR 0013). - **Input.** A human declares a small concept cluster on its four sides — relationships / defining rule / examples / vocabulary (`DomainSpec`). Only two relation verbs are machine-checkable: `is-a` (subsumption `A ⊑ B`) and `is-not` (disjointness `A ⊓ B = ⊥`). Subsumption is satisfiable by construction, so a `member` that *is* a `user` co-occurring is silence; only disjointness is violable, which is what makes a leak a *contradiction*. - **Detect (detect-then-adjudicate).** An honest-recursion pre-check first asks whether the declaration is even self-consistent (a pair declared both `is-a` and `is-not`; an identifier both a positive and negative example) — the cheapest, weak-link-free move, before any file is read. Then it classifies identifier sites lexically and fires **rank-1** when one identifier fuses two disjoint concepts. - **Honesty.** Only **rank-1** is a verdict, carrying a ≥2-receipt minimal conflicting subset. **Ranks 2–3 are reader hints** (`readerAssisted`): they assert nothing, can never be a blocker, and never render in the certificate register — enforced by `DomainReport`'s `.superRefine`, mirroring `LogicReport`. `noLeakFound` is the analogue of `noContradictionFound`. - **The weak link, named.** Classification ("is this token a `member`-use?") is lexical, so it is the irreducible weak link — high recall, precision contingent on a human confirming each site as the adjudication step. The tool never emits an absolute same/different judgment; only a declared-disjointness collision fires. - **Surface.** `entailer domain --lens --repo ` (`--gate introduced` reports only leaks new vs a base ref), the MCP tool `evaluate_domain`, and `domainMapView`/`domainBadge` in viz. `target.tier` stays `1..5`; the lens is a distinct `DomainReport`, not a sixth tier. Deferred past v1.2.0: the LLM-grounded rank-3 prose-vs-practice detector, execution-grounded detection, and the standing-rule adjudication store. --- ## 6. The engine pipeline ``` ┌─────────────── LLM-IN-THE-LOOP (untrusted proposer) ───────────────┐ raw ──▶ extract claims ──▶ recover premises/conclusion ──▶ formalize (NL→logic) ──▶ FormalizedArgument/ClaimSet └────────────────────────────────────────────────────────────────────┘ │ (IR is the gate) ┌───────────── DETERMINISTIC CORE (trusted kernel, no text) ──────────┐ IR ──▶ Zod/JSON-Schema validate ──▶ IR lint ──▶ verify (validity) ──▶ consistency ──▶ report └──────────────────────────────────────────────────────────────────────┘ │ (optional, opt-in) EXTERNAL SOLVER: Z3 (z3-solver WASM) — amplifier, never a gate ``` ### The deterministic / LLM boundary, drawn explicitly - **LLM-in-the-loop (required, irreducible):** claim extraction, conclusion identification, enthymeme/`[supplied]` recovery, NL→logic formalization, ambiguity detection + competing-reading generation, predicate canonicalization (Tier 4), residual synonymy judgment. This is the highest-error step (FOLIO: GPT-3.5 ~49–54%; NL2FOL ~78–80% F1 — *the solver is correct; translation is the bottleneck*). - **Deterministic core (default path, zero deps):** IR schema validation, the **IR lint pass** (coarse-atom detection, `∀`-with-`∧` / `∃`-with-`→` errors, dropped domain restrictions, free variables in sentences), propositional **semantic tableau** (primary — yields the counter-valuation for free) backed by a native **DPLL** for SAT/consistency, refutation driver (`Γ ∪ {¬φ}` unsat), counter-model extraction, classification (ref 10), report assembly, SMT-LIB emission, view-model emission. **Propositional validity/SAT is decidable (co-NP / NP) and fully owned here.** - **External solver (opt-in amplifier):** **Z3 via `z3-solver`** (official Microsoft WASM build, MIT, ~v4.16, TS bindings, `z3-solver/node`). Used for FOL / arithmetic / quantifier-free SMT validity (`unsat`⇒VALID), consistency (`get-unsat-core`⇒minimal conflict, then deletion-/QuickXplain-minimized since Z3 cores aren't guaranteed minimal), and finite counter-models. **Always also runs/shows the hand formalization; the base verdict never depends on Z3.** ### Chosen solver packages (decided) - **`z3-solver`** — the FOL/SMT escalator. **Node-only by default** (browser needs SharedArrayBuffer + COOP/COEP). In the MCP server it runs **out-of-process** in a short-lived spawned worker that does `init → assert → check → get-model/unsat-core → exit`, with a hard wall-clock SIGKILL timeout — this sidesteps the documented Z3-WASM worker-thread leak (issues #7070/#6512) and gives a clean timeout story. Lazy `dynamic import()` + full async-init try/catch capability probe; on any failure, degrade silently to the native path. `engines.node >=20.19` (avoids `ERR_REQUIRE_ASYNC_MODULE`); ship unbundled. - **Native TS tableau + DPLL** — the always-available propositional base. **No dependency on `logic-solver`** (Meteor-era, unmaintained) and **no `tau-prolog`** in core (maintenance-stale, CWA semantics silently change the logic). `logic-solver` may return later as an *optional* pluggable SAT backend if a hard CNF gets large. - **Decidable-FOL native target (deferred, see §11):** monadic predicate logic and Bernays-Schönfinkel/EPR are decidable and pure-TS-implementable; if a corpus probe shows syllogistic arguments dominate Tier 3/4, this becomes the WASM-free workhorse. Until measured, FOL escalates to Z3. ### Self-refinement (design pattern, cited not novel) Steal **Logic-LM's** loop (MIT): feed solver error / `unknown` back to the translator to repair the formalization. Cited as prior art in the README. --- ## 7. MCP tools `@modelcontextprotocol/sdk ^1.20.0`, `registerTool` with `outputSchema` + `structuredContent`, stdio, all tools **stateless**. Mirrors the house `palette-oklch` convention. **Default formalization path is one-call** (Claude formalizes *in-context* via the skill, then calls the deterministic `check_*` tools) — the two-call "instruction object" protocol is **rejected** (MCP `sampling` is unsupported in Claude Code/Desktop; an unenforced re-call convention is fragile and a careless agent treats the instruction as the answer). **Realized both sides in 1.3.0:** each plugin's `SKILL.md` now formalizes in-context and then runs a real deterministic check (MCP tools or the CLI), so the editor path is key-less end to end — the key-carrying `@entailer/translate` is only for autonomous use outside an editor. (Re-confirmed 2026-07: Claude Code still exposes no key-less LLM to a Node subprocess — no MCP `sampling`, no injected credentials, no key-less endpoint outside the browser-artifact sandbox — so the in-context skill remains the only key-less route.) | Tool | One-line schema | |---|---| | `check_validity` | `(premises: Formula[], conclusion: Formula, logic, escalate?) → { verdict: Verdict, proof? \| counterModel?, method, solverArtifact? }` | | `check_consistency` | `(formulas: Formula[], escalate?) → { status: SAT\|UNSAT\|UNKNOWN, model? \| minimalConflictingSubset?, method }` | | `classify_formula` | `(formula: Formula) → { kind: tautology\|contradiction\|contingent\|vacuous, witness? }` | | `verify_smtlib` | `(script: string) → { result: sat\|unsat\|unknown, model?, unsatCore? }` — thin `z3-solver` passthrough: the audit/escape hatch (~30 LoC) | | `render_proof` | `(checkResult, format: text\|svg\|mermaid) → { artifact: string }` (data, not pixels) | | `enumerate_repo_docs` | `(repoPath: string) → { manifest: { uri, included, role, reason }[] }` (Tier-4 selection step; server reads FS, local-only) | | `formalize_claim` | `(text, context?, formalization?) → LogicReport-needs OR validated/normalized IR` — **validation/normalization helper only**, never a blocking precondition | | `evaluate_sentence` / `evaluate_text` / `evaluate_repo` | `(target, formalization) → LogicReport` — named for clear affordances + cost expectations; thin scope-wrappers over one core `evaluate(claimSet)`; if `formalization` absent, return a "formalization needed" schema rather than guessing | Every `evaluate_*` and `check_validity` result is **refused by schema** unless it carries a non-empty symbol dictionary. `verify_smtlib` is exposed *alongside* the high-level tools (per the adversarial verdict: don't "hide" Z3 — adding the low-level escape is nearly free and the high-level tools already surface their emitted SMT-LIB, so there is no leak-with-no-exit). The differentiator is the **formalize methodology**, not the abstraction level. --- ## 8. Visualizations `@entailer/viz` consumes typed **view-models** (engine emits JSON, zero render deps). Dual-target from one model: `toSVG/toHTML` (Node, no browser — for CLI/Markdown reports) and React components (web). Static path: hand-emitted SVG with **fixed monospace glyph metrics** (DejaVu Sans Mono, embedded data-URI — exact widths for linear logic notation, **2D typeset math is out of MVP scope**), `elkjs`/`dagre` for layout, `@viz-js/viz` (Graphviz WASM, headless) for the repo graph, KaTeX `renderToString` for HTML reports. Interactive web path: `@xyflow/react`. **Mermaid is rejected as a runtime dep** (Puppeteer/Chrome); offered only as an optional `toMermaid()` string. | # | View | Data contract (sketch) | Certificate it shows | |---|---|---|---| | 1 | **TruthTableView** | `{ vars, rows: {assignment, columns, isCounterexample}[], formulaTeX, verdict }` | falsifying row ⇒ invalid | | 2 | **TableauTreeView** | `{ nodes: {id, labelUnicode, rule?, closedBy?}[], edges, openBranches: id[][], positions, verdict }` | closed tableau ⇒ valid; open branch ⇒ counter-model | | 3 | **CounterModelView** | `{ domain?, assignment, satisfies: id[], falsifies: id, provenance }` | the model that breaks the argument ⇒ invalid | | 4 | **ClaimGraphView** | `{ nodes: {id, claimText, source:'path:line', formalTeX, supplied:bool, confidence}[], edges:{from,to,kind:'entails'\|'depends'\|'conflicts'}[], minimalConflictingSubset?: id[] }` | unsat-core spanning files ⇒ inconsistent (the repo "contradiction heatmap" is just a styling mode = node fill by conflict-participation) | **Hard renderer invariant (the validity-certificate discipline, ref 04/13):** a view colors something VALID only if a proof/UNSAT certificate object is present, INVALID only if a witnessed (finite) counter-model is present; **"bounded search, inconclusive" / "no finite witness" renders as an explicit inconclusive state, never as VALID.** Nodes carry structured provenance (`literal | supplied | llm-recovered | ambiguous`) + a legend, so the picture cannot launder an LLM guess into a logical guarantee. View-models are stamped with the **formalization content-hash + model id + temperature** so the render is deterministic given a frozen formalization (layout uses a pinned seed / deterministic algorithm). *The moat is the verdict-gating contract in the core, not the pixels — proof-viz toys exist (tpg/umsu, Mace4, Nitpick), so we compete on the five-tier escalation + packaging, not on the render alone.* --- ## 9. Output / result schema Canonical store = an ergonomic, hand-authored **`LogicReport`** in `@entailer/core`. **SARIF 2.1.0 is a lossless serializer target, not the canonical form** (`toSarif(report)` via `@types/sarif` + `node-sarif-builder` for VS Code / GitHub code-scanning). The verdict-trichotomy and 4-tier severity *do* fit SARIF faithfully — `level` (error/warning/note/none) + the orthogonal `kind` axis (fail/pass/open/review, where UNKNOWN→`open`, INCONSISTENT→`fail`+ruleId) + numeric `rank` for severity — but Tier-1/2 (no file) reports are **validator-grade, not GitHub-grade** (logical-only locations are dropped by code-scanning), so SARIF emission is gated per-tier and the README documents the limit. ```ts type Verdict = | 'VALID' | 'INVALID' | 'INCONSISTENT' | 'GAP' | 'UNKNOWN' | 'NO_ISSUE_FOUND'; type Severity = 'blocker' | 'major' | 'minor' | 'note'; interface LogicReport { target: { tier: 1|2|3|4; uri?: string }; verdict: Verdict; // headline (worst-case rollup for tiers 3-4) // validity ≠ truth ≠ faithfulness — THREE independent fields, never merged: validity: { status: Verdict; method: string; proof?: Derivation; counterModel?: CounterModel }; consistency: { status: 'SAT'|'UNSAT'|'UNKNOWN'; model?: Model; minimalConflictingSubset?: ClaimRef[] }; // serialized as "noContradictionFound" when clean soundnessRisk: { note: string }[]; // per-premise: vouchable | empirical-out-of-scope | supplied-and-contested. NEVER a boolean "sound". symbolDictionary: { symbol: string; gloss: string }[]; // REQUIRED non-empty or the report fails schema validation formalization: { // VALID/INVALID ⇒ must be present id: string; source: SourceSpan[] | null; // null/[] ⇒ rendered "[supplied/derived]" english: string; formal: string; role: 'premise'|'premise-supplied'|'conclusion'; logicSystem: 'PL'|'FOL'|'HOL'|'dependent-type-theory'|'generalized-quantifier'|`flagged-${'modal'|'deontic'|'temporal'|'probabilistic'}`; // HOL/DTT escalate to a proof assistant; generalized-quantifier is flag-only (refs 21/22) translationConfidence: { band: 'low'|'med'|'high'; score: number }; }[]; flattened: { text: string; kind: 'modal'|'deontic'|'temporal'|'probabilistic'|'causal'|'vague'|'higher-order'|'generalized-quantifier'|'ambiguous-reading' }[]; verdictConfidence: { band: 'high'|'low'; reason: string }; // HIGH iff a proof/counter-model exists; drops on solver unknown/timeout findings: { severity: Severity; location: SourceSpan|null; name: string; signature: string; why: string }[]; selectionManifest?: { uri: string; included: boolean; role: string; reason: string }[]; // Tier 4 coverageCaveat?: string; // Tier 4: "verdict holds over N selected docs; M excluded; selection is heuristic" solverArtifact?: { smtlib: string; result: 'sat'|'unsat'|'unknown'; unsatCore?: string }; honestyContract: string; // REQUIRED — ref-99 disclosure verbatim } ``` Dishonesty is made **unrepresentable**: no `formalization` ⇒ no VALID/INVALID; INVALID ⇒ `counterModel` required; INCONSISTENT ⇒ `minimalConflictingSubset` required; clean consistency serializes as `noContradictionFound` (never "proven consistent"); low `translationConfidence` forces `verdict: UNKNOWN`; flagged-modal/deontic content sets `logicSystem` to the flagged variant and is excluded from binary verdicts. CI schema-validates the corpus so a dishonest report literally fails tests. --- ## 10. Eval & regression strategy Split the suite at the **formalize boundary**. - **Deterministic verifier — HARD golden + property tests** (fast, free, fully deterministic): soundness property (a VALID verdict *must* carry a derivation or unsat-core); PL agreement with an exhaustive truth-table oracle; **explosion guard** (an INCONSISTENT premise set never yields a VALID-informative verdict — consistency is checked first); **vacuity flag** (unsatisfiable antecedent ⇒ flagged vacuous, not silently VALID); the invariant **no path emits INVALID without an attached counter-model**. - **LLM formalization step — structural-equivalence scoring, never exact-match** (faithful encodings vary): run candidate *and* golden formalization through the **same deterministic checker** and assert agreement on validity + consistency — but **over a battery of probe-claims per symbol set**, not one fixture (a coarse atom can match on one probe and fail another); bucket Z3 `unknown` / quantified-timeout as a third **INDETERMINATE — human review** state (never pass/fail); score **symbol-dictionary / `[supplied]` / ambiguity recall**; assert the **low-confidence ⇒ UNKNOWN** behavior. Logical-equivalence is reported as "equivalent *under its symbol dictionary*," never as "faithful" — it is one signal, necessary-not-sufficient. Run this layer as an **opt-in, model-pinned (temperature 0)** eval, gated behind the cheap deterministic checks (promptfoo-style preflight). - **Translation-confidence** is a **multi-signal ensemble feeding an abstention gate**, never LLM self-report (documented as badly calibrated): k-sample forward-formalization *disagreement* + back-translation with a *different* model (defeats symmetric round-trip masking) + a structural-checklist round-trip (quantifier scope, necessary/sufficient direction, dropped domain restriction) + deterministic IR-lint red-flags. The UNKNOWN threshold is **tuned on a labeled dev set via the risk-coverage curve** (conformal abstention for a coverage guarantee), exposed as per-tier config, and the flood-/miss-rates are published. - **Corpus.** Anchor = ref-09's **A/B/C** as fully hand-certified expected `LogicReport` objects (A valid modus-ponens, B affirming-the-consequent + counter-model `W=F,L=T`, C inconsistent health-check spec + minimal subset `{R1,R2,R3}`). Extend tier-by-tier: **T1** sentences incl. a vacuous conditional + a flagged-deontic "shall"; **T2** a prompt with an embedded fallacy + a benign prompt (**false-positive guard** — must NOT manufacture a fallacy); **T3** a multi-claim spec `.md` asserting `path:line`; **T4** a tiny multi-doc repo where README contradicts SPEC (cross-file inconsistency + per-file locations + rollup). Plus explicit **negative/clean fixtures** to pin the false-positive rate. **FOLIO/ProofWriter** run in `benchmarks/` (dev-only CI lane, may host the Python Logic-LM/LINC baselines) and the translation-fidelity score is **published** — credibility currency in this space. --- ## 11. MVP scope (v0.1) vs later **v0.1 — Tier-1 end-to-end across core + CLI, with the IR/adapter seam built for the ladder.** 1. `@entailer/core`: Formula AST + ASCII/Unicode DSL parser; the `FormalizedArgument`/`ClaimSet` IR (Zod + JSON Schema); native propositional **tableau + DPLL**; refutation driver; **classification** (tautology/contingent/contradiction); IR lint pass; `LogicReport` type + `toMarkdown`. 2. Tier-1 **sentence** source-adapter (degenerate ClaimSet of 1) + the three-check classification mode. 3. `@entailer/cli` (`entailer`): formalization supplied as input (rule-based/JSON), `--json`/`--sarif`, exit codes incl. a distinct UNKNOWN-blocked code. 4. The ref-09 **A/B/C golden fixtures** + deterministic property tests green; schema-validation CI gate. 5. Taxonomy codegen + drift gate; vendored `formalize` skill + one-way sync script. **v0.2 — Tier 2 + the LLM seam + MCP.** `@entailer/translate` (LLM-in-the-loop, isolated); conclusion recovery + `[supplied]` enthymemes + argument validity; `@entailer/mcp` (the `check_*` + `evaluate_sentence/text` tools, one-call default); structural-equivalence eval lane. **v0.3 — Tier 3 + viz MVP.** mdast adapter (remark-parse, code-fence strip, `path:line`); `@entailer/solver` (Z3 out-of-process, opt-in); within-doc consistency; `@entailer/viz` views 1–4 static SVG; `evaluate_text`. **v0.4 — Tier 4.** Selection manifest (glob allowlist + LLM tail), predicate canonicalization, clustered cross-file consistency + residual synonymy pass, coverage caveats, repo ClaimGraph; `evaluate_repo`; public sibling split + ADRs. **Explicit NON-goals / scope cuts (from the verification):** - **No standalone Tier-1/2 single-sentence demo as the headline** — that surface *is* a TS NL2FOL re-implementation and the "distinct niche" charge is unanswerable there. **Lead every README/demo with a Tier-3/4 inconsistent-spec example** (NL2FOL structurally cannot produce a cross-file minimal conflicting subset). - **No "soundness checking"** in any copy — core does validity + consistency + satisfiability; premise truth is out of scope for *every* component. - **No full-FOL completeness promise** — guarantee only the decidable fragments (propositional now; monadic/EPR later). Quantified/full-FOL input is best-effort via Z3 quantifier instantiation and may return `unknown`; **`unknown`/timeout is a first-class UNKNOWN, never INVALID.** - **No 2D typeset math** in the viz MVP (linear Unicode notation only). - **No `tau-prolog` / `logic-solver` in core**; no in-browser Z3 by default (Node/MCP only; viz consumes solver *output*). - **No two-call MCP "formalize-then-verify" protocol** (sampling unsupported; fragile). Formalization happens in-skill, in-context. - **No Mermaid runtime dep; no Turborepo remote cache on day one** (local + `actions/cache`); LLM-eval tasks are explicitly uncached. - **No native FOL theorem prover ported to JS** (none exists, maintained); full-FOL resolution, if ever needed, is an optional out-of-process binary, clearly marked unavailable when absent. - **Public-split disclosure is a business-scope CEO call** logged as an ADR; vendoring is one-way (workshop → public). --- ## 12. Open risks 1. **PROP-vs-FOL corpus ratio is unmeasured.** If real Tier-3/4 arguments are overwhelmingly quantified, the native path returns UNKNOWN constantly and value collapses onto Z3 (re-introducing the browser-header tax for any client-side solving). *Track via a one-week corpus probe before committing the native-only browser story; ship the decidable-FOL native target only if PROP+monadic+EPR ≥ ~70% of formalizable units.* 2. **Translation faithfulness is the irreducible weak link.** ~1-in-5 formalizations carry a wrong predicate/quantifier even on clean inputs (NL2FOL 78–80%), and the deterministic core cannot detect a confident-but-wrong IR (hallucinated/unsourced premises especially). The verdict is *always* conditional on the displayed formalization; never a standalone badge, never an auto-CI gate at Tier 3/4 (those ship as "assisted review", not pass/fail), no numeric repo "soundness score." 3. **Tier-4 selection error is invisible to the engine** — a dropped load-bearing doc yields a confident verdict over an incomplete ClaimSet and can *invert* an INCONSISTENT result. Mitigated by high-recall selection + the mandatory coverage caveat + deterministic always-include globs, but recall is empirically ~67–84% (requirement-extraction proxy), so it stays a tracked residual. 4. **Cross-file synonymy can hide a real contradiction** if canonicalization mis-merges or misses an alias; the residual-synonymy pass raises the floor but offers no completeness guarantee — reported as "checked within canonical clusters", never a clean "consistent". 5. **Z3-WASM operational surface** — worker-thread leak, single-threaded serialization, large payload, `unknown` on quantifiers. Contained by out-of-process short-lived workers + hard timeouts + lazy guarded init, but it is a heavy dep the team owns. 6. **The viz moat is thin and time-limited** — proof/counter-model rendering is commodity; durability comes from the five-tier escalation + packaging + the core's verdict-gating contract, not the pixels. 7. **Differentiation is back-loaded** — it is invisible unless Tier-3/4 and cross-claim consistency are *actually built and demoed*; if they slip, the "reinventing NL2FOL" criticism lands. Front-load the Tier-3/4 inconsistent-spec demo to neutralize this. ---