---
# agentskills.io compliant frontmatter
name: clarity-gate
risk: unknown
source: community
version: 2.1.3
description: >
  Pre-ingestion verification for epistemic quality in RAG systems.
  Ensures documents are properly qualified before entering knowledge bases.
  Produces CGD (Clarity-Gated Documents) and validates SOT (Source of Truth) files.
author: Francesco Marinoni Moretto
license: CC-BY-4.0
repository: https://github.com/frmoretto/clarity-gate
triggers:
  - clarity gate
  - check for hallucination risks
  - can an LLM read this safely
  - review for equivocation
  - verify document clarity
  - pre-ingestion check
  - cgd verify
  - sot verify
capabilities:
  - document-verification
  - epistemic-quality
  - rag-preparation
  - cgd-generation
  - sot-validation
outputs:
  - type: cgd
    extension: .cgd.md
    spec: docs/CLARITY_GATE_FORMAT_SPEC.md
    spec_version: "2.1"
---

# Clarity Gate v2.1

**Purpose:** Pre-ingestion verification system that enforces epistemic quality before documents enter RAG knowledge bases. Produces Clarity-Gated Documents (CGD) compliant with the Clarity Gate Format Specification v2.1.

**Core Question:** "If another LLM reads this document, will it mistake assumptions for facts?"

**Core Principle:** *"Detection finds what is; enforcement ensures what should be. In practice: find the missing uncertainty markers before they become confident hallucinations."*

---

## What's New in v2.1

| Feature | Description |
|---------|-------------|
| **Claim Completion Status** | PENDING/VERIFIED determined by field presence (no explicit status field) |
| **Source Field Semantics** | Actionable source (PENDING) vs. what-was-found (VERIFIED) |
| **Claim ID Format Guidance** | Hash-based IDs preferred, collision analysis for scale |
| **Body Structure Requirements** | HITL Verification Record section mandatory when claims exist |
| **New Validation Codes** | E-ST10, W-ST11, W-HC01, W-HC02, E-SC06 (FORMAT_SPEC); E-TB01-07 (SOT validation) |
| **Bundled Scripts** | `claim_id.py` and `document_hash.py` for deterministic computations |

---

## Specifications

This skill implements and references:

| Specification | Version | Location |
|---------------|---------|----------|
| Clarity Gate Format (Unified) | v2.1 | docs/CLARITY_GATE_FORMAT_SPEC.md |

**Note:** v2.0 unifies CGD and SOT into a single `.cgd.md` format. SOT is now a CGD with an optional `tier:` block.

---

## Validation Codes

Clarity Gate defines validation codes for structural and semantic checks per FORMAT_SPEC v2.1:

### HITL Claim Validation (§1.3.2-1.3.3)

| Code | Check | Severity |
|------|-------|----------|
| **W-HC01** | Partial `confirmed-by`/`confirmed-date` fields | WARNING |
| **W-HC02** | Vague source (e.g., "industry reports", "TBD") | WARNING |
| **E-SC06** | Schema error in `hitl-claims` structure | ERROR |

### Body Structure (§1.2.1)

| Code | Check | Severity |
|------|-------|----------|
| **E-ST10** | Missing `## HITL Verification Record` when claims exist | ERROR |
| **W-ST11** | Table rows don't match `hitl-claims` count | WARNING |

### SOT Table Validation (§3.1)

| Code | Check | Severity |
|------|-------|----------|
| **E-TB01** | No `## Verified Claims` section | ERROR |
| **E-TB02** | Table has no data rows | ERROR |
| **E-TB03** | Required columns missing | ERROR |
| **E-TB04** | Column order wrong | ERROR |
| **E-TB05** | Empty cell in required column | ERROR |
| **E-TB06** | Invalid date format in Verified column | ERROR |
| **E-TB07** | Verified date in future (beyond 24h grace) | ERROR |

**Note:** Additional validation codes may be defined in RFC-001 (clarification document) but are not part of the normative FORMAT_SPEC.

---

## Bundled Scripts

This skill includes Python scripts for deterministic computations per FORMAT_SPEC.

### scripts/claim_id.py

Computes stable, hash-based claim IDs for HITL tracking (per §1.3.4).

```bash
# Generate claim ID (single quotes keep the shell from expanding $9 inside "$99")
python scripts/claim_id.py 'Base price is $99/mo' "api-pricing/1"
# Output: claim-75fb137a

# Run test vectors
python scripts/claim_id.py --test
```

**Algorithm:**
1. Normalize text (strip + collapse whitespace)
2. Concatenate with location using pipe delimiter
3. SHA-256 hash, take first 8 hex chars
4. Prefix with "claim-"

**Test vectors:**
- `claim_id("Base price is $99/mo", "api-pricing/1")` → `claim-75fb137a`
- `claim_id("The API supports GraphQL", "features/1")` → `claim-eb357742`

### scripts/document_hash.py

Computes document SHA-256 hash per FORMAT_SPEC §2.2-2.4 with full canonicalization.

```bash
# Compute hash
python scripts/document_hash.py my-doc.cgd.md
# Output: 7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730

# Verify existing hash
python scripts/document_hash.py --verify my-doc.cgd.md
# Output: PASS: Hash verified: 7d865e...

# Run normalization tests
python scripts/document_hash.py --test
```

**Algorithm (per §2.2-2.4):**
1. Extract content between opening `---\n` and the end marker
2. Remove `document-sha256` line from YAML frontmatter ONLY (with multiline continuation support)
3. Canonicalize:
   - Strip trailing whitespace per line
   - Collapse 3+ consecutive newlines to 2
   - Normalize final newline (exactly 1 LF)
   - UTF-8 NFC normalization
4. Compute SHA-256

**Cross-platform normalization:**
- BOM removed if present
- CRLF to LF (Windows)
- CR to LF (old Mac)
- Boundary detection (prevents hash computation on content outside CGD structure)
- Whitespace variations produce identical hashes (deterministic across platforms)

---

## The Key Distinction

Existing tools like UnScientify and HedgeHunter (CoNLL-2010) **detect** uncertainty markers already present in text ("Is uncertainty expressed?"). Clarity Gate **enforces** their presence where epistemically required ("Should uncertainty be expressed but isn't?").

| Tool Type | Question | Example |
|-----------|----------|---------|
| **Detection** | "Does this text contain hedges?" | UnScientify/HedgeHunter find "may", "possibly" |
| **Enforcement** | "Should this claim be hedged but isn't?" | Clarity Gate flags "Revenue will be $50M" |

---

## Critical Limitation

> **Clarity Gate verifies FORM, not TRUTH.**
>
> This skill checks whether claims are properly marked as uncertain—it cannot verify if claims are actually true.
>
> **Risk:** An LLM can hallucinate facts INTO a document, then "pass" Clarity Gate by adding source markers to false claims.
>
> **Solution:** HITL (Human-In-The-Loop) verification is **MANDATORY** before declaring PASS.

---

## When to Use

- Before ingesting documents into RAG systems
- Before sharing documents with other AI systems
- After writing specifications, state docs, or methodology descriptions
- When a document contains projections, estimates, or hypotheses
- Before publishing claims that haven't been validated
- When handing off documentation between LLM sessions

---

## The 9 Verification Points

### Relationship to Spec Suite

The 9 Verification Points guide **semantic review** — content quality checks that require judgment (human or AI). They answer questions like "Should this claim be hedged?" and "Are these numbers consistent?"

When review completes, output a CGD file conforming to CLARITY_GATE_FORMAT_SPEC.md. The C/S rules in CLARITY_GATE_FORMAT_SPEC.md validate **file structure**, not semantic content.

**The connection:**
1. Semantic findings (9 points) determine what issues exist
2. Issues are recorded in CGD state fields (`clarity-status`, `hitl-status`, `hitl-pending-count`)
3. State consistency is enforced by structural rules (C7-C10)

*Example: If Point 5 (Data Consistency) finds conflicting numbers, you'd mark `clarity-status: UNCLEAR` until resolved. Rule C7 then ensures you can't claim `REVIEWED` while still `UNCLEAR`.*

---

### Epistemic Checks (Core Focus: Points 1-4)

**1. HYPOTHESIS vs FACT LABELING**

Every claim must be clearly marked as validated or hypothetical.

| Fails | Passes |
|-------|--------|
| "Our architecture outperforms competitors" | "Our architecture outperforms competitors [benchmark data in Table 3]" |
| "The model achieves 40% improvement" | "The model achieves 40% improvement [measured on dataset X]" |

**Fix:** Add markers: "PROJECTED:", "HYPOTHESIS:", "UNTESTED:", "(estimated)", "~", "?"

---

**2. UNCERTAINTY MARKER ENFORCEMENT**

Forward-looking statements require qualifiers.

| Fails | Passes |
|-------|--------|
| "Revenue will be $50M by Q4" | "Revenue is **projected** to be $50M by Q4" |
| "The feature will reduce churn" | "The feature is **expected** to reduce churn" |

**Fix:** Add "projected", "estimated", "expected", "designed to", "intended to"

---

**3. ASSUMPTION VISIBILITY**

Implicit assumptions that affect interpretation must be explicit.

| Fails | Passes |
|-------|--------|
| "The system scales linearly" | "The system scales linearly [assuming <1000 concurrent users]" |
| "Response time is 50ms" | "Response time is 50ms [under standard load conditions]" |

**Fix:** Add bracketed conditions: "[assuming X]", "[under conditions Y]", "[when Z]"

---

**4. AUTHORITATIVE-LOOKING UNVALIDATED DATA**

Tables with specific percentages and checkmarks look like measured data.

**Red flag:** Tables with specific numbers (89%, 95%, 100%) without sources

**Fix:** Add "(guess)", "(est.)", "?" to numbers. Add explicit warning: "PROJECTED VALUES - NOT MEASURED"

---

### Data Quality Checks (Complementary: Points 5-7)

**5. DATA CONSISTENCY**

Scan for conflicting numbers, dates, or facts within the document.

**Red flag:** "500 users" in one section, "750 users" in another

**Fix:** Reconcile conflicts or explicitly note the discrepancy with explanation.

---

**6. IMPLICIT CAUSATION**

Claims that imply causation without evidence.

**Red flag:** "Shorter prompts improve response quality" (plausible but unproven)

**Fix:** Reframe as hypothesis: "Shorter prompts MAY improve response quality (hypothesis, not validated)"

---

**7. FUTURE STATE AS PRESENT**

Describing planned/hoped outcomes as if already achieved.

**Red flag:** "The system processes 10,000 requests per second" (when it hasn't been built)

**Fix:** Use future/conditional: "The system is DESIGNED TO process..." or "TARGET: 10,000 rps"

---

### Verification Routing (Points 8-9)

**8. TEMPORAL COHERENCE**

Document dates and timestamps must be internally consistent and plausible.

| Fails | Passes |
|-------|--------|
| "Last Updated: December 2024" (when current is 2026) | "Last Updated: January 2026" |
| v1.0.0 dated 2024-12-23, v1.1.0 dated 2024-12-20 | Versions in chronological order |

**Sub-checks:**
1. Document date vs current date
2. Internal chronology (versions, events in order)
3. Reference freshness ("current", "now", "today" claims)

**Fix:** Update dates, add "as of [date]" qualifiers, flag stale claims

---

**9. EXTERNALLY VERIFIABLE CLAIMS**

Specific numbers that could be fact-checked should be flagged for verification.

| Type | Example | Risk |
|------|---------|------|
| Pricing | "Costs ~$0.005 per call" | API pricing changes |
| Statistics | "Papers average 15-30 equations" | May be wildly off |
| Rates/ratios | "40% of researchers use X" | Needs citation |
| Competitor claims | "No competitor offers Y" | May be outdated |

**Fix options:**
1. Add source with date
2. Add uncertainty marker
3. Route to HITL or external search
4. Generalize ("low cost" instead of "$0.005")

---

## The Verification Hierarchy

```
Claim Extracted --> Does Source of Truth Exist?
                           |
           +---------------+---------------+
          YES                              NO
           |                               |
   Tier 1: Automated               Tier 2: HITL
   Consistency & Verification      Two-Round Verification
           |                               |
   PASS / BLOCK                    Round A → Round B → APPROVE / REJECT
```

### Tier 1: Automated Verification

**A. Internal Consistency**
- Figure vs. Text contradictions
- Abstract vs. Body mismatches
- Table vs. Prose conflicts
- Numerical consistency

**B. External Verification (Extension Interface)**
- User-provided connectors to structured sources
- Financial systems, Git commits, CRM, etc.

### Tier 2: Two-Round HITL Verification — MANDATORY

**Round A: Derived Data Confirmation**
- Claims from sources found in session
- Human confirms interpretation, not truth

**Round B: True HITL Verification**
- Claims needing actual verification
- No source found, human's own data, extrapolations

---

## CGD Output Format

When producing a Clarity-Gated Document, use this format per CLARITY_GATE_FORMAT_SPEC.md v2.1:

```yaml
---
clarity-gate-version: 2.1
processed-date: 2026-01-12
processed-by: Claude + Human Review
clarity-status: CLEAR
hitl-status: REVIEWED
hitl-pending-count: 0
points-passed: 1-9
rag-ingestable: true  # computed by validator - do not set manually
document-sha256: 7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
hitl-claims:
  - id: claim-75fb137a
    text: "Revenue projection is $50M"
    value: "$50M"
    source: "Q3 planning doc"
    location: "revenue-projections/1"
    round: B
    confirmed-by: Francesco
    confirmed-date: 2026-01-12
---

# Document Title

[Document body with epistemic markers applied]

Claims like "Revenue will be $50M" become "Revenue is **projected** to be $50M *(unverified projection)*"

---

## HITL Verification Record

### Round A: Derived Data Confirmation
- Claim 1 (source) ✓
- Claim 2 (source) ✓

### Round B: True HITL Verification
| # | Claim | Status | Verified By | Date |
|---|-------|--------|-------------|------|
| 1 | [claim] | ✓ Confirmed | [name] | [date] |

Clarity Gate: CLEAR | REVIEWED
```

**Required CGD Elements (per spec):**
- YAML frontmatter with all required fields:
  - `clarity-gate-version` — Tool version (no "v" prefix)
  - `processed-date` — YYYY-MM-DD format
  - `processed-by` — Processor name
  - `clarity-status` — CLEAR or UNCLEAR
  - `hitl-status` — PENDING, REVIEWED, or REVIEWED_WITH_EXCEPTIONS
  - `hitl-pending-count` — Integer ≥ 0
  - `points-passed` — e.g., `1-9` or `1-4,7,9`
  - `hitl-claims` — List of verified claims (may be empty `[]`)
- End marker (HTML comment + status line):
  ```
  Clarity Gate: [clarity-status] | [hitl-status]
  ```
- HITL verification record (if status is REVIEWED)

**Optional/Computed Fields:**
- `rag-ingestable` — **Computed by validators**, not manually set. Shows `true` only when `CLEAR | REVIEWED` with no exclusion blocks.
- `document-sha256` — Required. 64-char lowercase hex hash for integrity verification. See spec §2 for computation rules.
- `exclusions-coverage` — Optional. Fraction of body inside exclusion blocks (0.0–1.0).

**Escape Mechanism:** To write about markers like `*(estimated)*` without triggering parsing, wrap in backticks: `` `*(estimated)*` ``

### Claim Completion Status (v2.1)

Claim verification status is determined by field **presence**, not an explicit status field:

| State | `confirmed-by` | `confirmed-date` | Meaning |
|-------|----------------|------------------|----------|
| **PENDING** | absent | absent | Awaiting human verification |
| **VERIFIED** | present | present | Human has confirmed |
| *(invalid)* | present | absent | W-HC01: partial fields |
| *(invalid)* | absent | present | W-HC01: partial fields |

**Why no explicit status field?** Field presence is self-enforcing—you can't accidentally set status without providing who/when.
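
The field-presence rule in the table above can be sketched in a few lines of Python. This is an illustrative sketch, not the validator shipped with the skill: the function name `claim_state` is hypothetical, and treating an empty or null value as "absent" is an assumption the table does not spell out.

```python
from typing import Any, Mapping


def claim_state(claim: Mapping[str, Any]) -> str:
    """Classify one hitl-claims entry by which confirmation fields are present.

    Assumption: a key with an empty or null value counts as absent.
    """
    has_by = bool(claim.get("confirmed-by"))
    has_date = bool(claim.get("confirmed-date"))
    if has_by and has_date:
        return "VERIFIED"
    if not has_by and not has_date:
        return "PENDING"
    # Exactly one of the two fields is present: partial confirmation.
    return "W-HC01"


print(claim_state({"id": "claim-75fb137a", "source": "Check Q3 planning doc"}))  # PENDING
print(claim_state({"confirmed-by": "Francesco", "confirmed-date": "2026-01-12"}))  # VERIFIED
print(claim_state({"confirmed-by": "Francesco"}))  # W-HC01
```

Because the state is derived rather than stored, there is no way for a stored status to drift out of sync with the confirmation fields.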
### Source Field Semantics (v2.1)

The `source` field meaning changes based on claim state:

| State | `source` Contains | Example |
|-------|-------------------|----------|
| **PENDING** | Where to verify (actionable) | `"Check Q3 planning doc"` |
| **VERIFIED** | What was found (evidence) | `"Q3 planning doc, page 12"` |

**Vague source detection (W-HC02):** Sources like `"industry reports"`, `"research"`, `"TBD"` trigger warnings.

### Claim ID Format (v2.1)

**General pattern:** `claim-[a-z0-9._-]{1,64}` (alphanumeric, dots, underscores, hyphens)

| Approach | Pattern | Example | Use Case |
|----------|---------|---------|----------|
| **Hash-based** (preferred) | `claim-[a-f0-9]{8,}` | `claim-75fb137a` | Deterministic, collision-resistant |
| **Sequential** | `claim-[0-9]+` | `claim-1`, `claim-2` | Simple documents |
| **Semantic** | `claim-[a-z0-9-]+` | `claim-revenue-q3` | Human-friendly |

**Collision probability:** At 1,000 claims with 8-char hex IDs: ~0.012%. For >1,000 claims, use 12+ hex characters.

**Recommendation:** Use hash-based IDs generated by `scripts/claim_id.py` for consistency and collision resistance.

---

## Exclusion Blocks

When content cannot be resolved (no SME available, legacy prose, etc.), mark it as excluded rather than leaving it ambiguous:

```markdown
Legacy authentication details that require SME review...
```

**Rules:**
- IDs must match: `[A-Za-z0-9][A-Za-z0-9._-]{0,63}`
- No nesting or overlapping blocks
- Each ID used only once
- Requires `hitl-status: REVIEWED_WITH_EXCEPTIONS`
- Must document `exceptions-reason` and `exceptions-ids` in frontmatter

**Important:** Documents with exclusion blocks are **not RAG-ingestable**. They're rejected entirely (no partial ingestion).

See CLARITY_GATE_FORMAT_SPEC.md §4 for complete rules.

---

## SOT Validation

When validating a Source of Truth file, the skill checks both **format compliance** (per CLARITY_GATE_FORMAT_SPEC.md) and **content quality** (the 9 points).
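
As a rough illustration of the format-compliance half, the sketch below applies the E-TB01 to E-TB07 table checks to the `## Verified Claims` section. It is not the skill's actual validator. The names `check_sot_table`, `split_row`, and `is_separator` are hypothetical, and three behaviours are assumptions: the Verified column is parsed as `YYYY-MM-DD`, the 24h grace period is approximated as one calendar day, and a heading with no table under it is reported as E-TB02.

```python
from datetime import date, datetime, timedelta
from typing import List, Optional

REQUIRED_COLUMNS = ["Claim", "Value", "Source", "Verified"]


def split_row(line: str) -> List[str]:
    """Split one Markdown pipe-table row into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def is_separator(line: str) -> bool:
    """True for the |---|---| line that follows a table header."""
    return "-" in line and set(line) <= set("|-: ")


def check_sot_table(markdown: str, today: Optional[date] = None) -> List[str]:
    """Return the E-TB codes triggered by the Verified Claims table."""
    today = today or date.today()
    lines = [line.strip() for line in markdown.splitlines()]
    if "## Verified Claims" not in lines:
        return ["E-TB01"]

    # Collect the first contiguous run of table lines after the heading.
    table: List[str] = []
    for line in lines[lines.index("## Verified Claims") + 1:]:
        if line.startswith("|"):
            table.append(line)
        elif table or line.startswith("#"):
            break
    if not table:
        return ["E-TB02"]  # assumption: a missing table counts as "no data rows"

    header = split_row(table[0])
    rows = [split_row(line) for line in table[1:] if not is_separator(line)]
    codes: List[str] = []

    def flag(code: str) -> None:
        if code not in codes:
            codes.append(code)

    if not rows:
        flag("E-TB02")
    if any(col not in header for col in REQUIRED_COLUMNS):
        flag("E-TB03")
    elif header[0] != "Claim" or header[-1] != "Verified":
        flag("E-TB04")

    for row in rows:
        cells = dict(zip(header, row))
        if any(not cells.get(col) for col in REQUIRED_COLUMNS if col in header):
            flag("E-TB05")
        verified = cells.get("Verified", "")
        if not verified:
            continue
        try:
            verified_on = datetime.strptime(verified, "%Y-%m-%d").date()
        except ValueError:
            flag("E-TB06")
            continue
        # Date-only approximation of the 24h grace period (assumption).
        if verified_on > today + timedelta(days=1):
            flag("E-TB07")
    return codes


sample = """## Verified Claims

| Claim | Value | Source | Verified |
|-------|-------|--------|----------|
| Base price | $99/mo | Pricing page | 2026-01-12 |
"""
print(check_sot_table(sample, today=date(2026, 1, 13)))  # []
```

Passing `today` explicitly keeps the future-date check reproducible in tests. The content-quality half (the 9 points) still requires judgment and is out of scope for a structural check like this.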
### Format Compliance (Structural Rules)

SOT documents are CGDs with a `tier:` block. They require a `## Verified Claims` section with a valid table.

| Code | Check | Severity |
|------|-------|----------|
| E-TB01 | No `## Verified Claims` section | ERROR |
| E-TB02 | Table has no data rows | ERROR |
| E-TB03 | Required columns missing (Claim, Value, Source, Verified) | ERROR |
| E-TB04 | Column order wrong (Claim not first or Verified not last) | ERROR |
| E-TB05 | Empty cell in required column | ERROR |
| E-TB06 | Invalid date format in Verified column | ERROR |
| E-TB07 | Verified date in future (beyond 24h grace) | ERROR |

### Content Quality (9 Points)

The 9 Verification Points apply to SOT content:

| Point | SOT Application |
|-------|-----------------|
| 1-4 | Check claims in `## Verified Claims` are actually verified |
| 5 | Check for conflicting values across tables |
| 6 | Check claims don't imply unsupported causation |
| 7 | Check table doesn't state futures as present |
| 8 | Check dates are chronologically consistent |
| 9 | Flag specific numbers for external check |

### SOT-Specific Requirements

- **Tier block required:** SOT is a CGD with `tier:` block containing `level`, `owner`, `version`, `promoted-date`, `promoted-by`
- **Structured claims table:** `## Verified Claims` section with columns: Claim, Value, Source, Verified
- **Table outside exclusions:** The verified claims table must NOT be inside an exclusion block
- **Staleness markers:** Use `[STABLE]`, `[CHECK]`, `[VOLATILE]`, `[SNAPSHOT]` in content
  - `[STABLE]` — Safe to cite without rechecking
  - `[CHECK]` — Verify before citing
  - `[VOLATILE]` — Changes frequently; always verify
  - `[SNAPSHOT]` — Point-in-time data; include date when citing

---

## Output Format

After running Clarity Gate, report:

```
## Clarity Gate Results

**Document:** [filename]
**Issues Found:** [number]

### Critical (will cause hallucination)
- [issue + location + fix]

### Warning (could cause equivocation)
- [issue + location + fix]

### Temporal (date/time issues)
- [issue + location + fix]

### Externally Verifiable Claims
| # | Claim | Type | Suggested Verification |
|---|-------|------|------------------------|
| 1 | [claim] | Pricing | [where to verify] |

---

## Round A: Derived Data Confirmation

- [claim] ([source])

Reply "confirmed" or flag any I misread.

---

## Round B: HITL Verification Required

| # | Claim | Why HITL Needed | Human Confirms |
|---|-------|-----------------|----------------|
| 1 | [claim] | [reason] | [ ] True / [ ] False |

---

**Would you like me to produce an annotated CGD version?**

---

**Verdict:** PENDING CONFIRMATION
```

---

## Severity Levels

| Level | Definition | Action |
|-------|------------|--------|
| **CRITICAL** | LLM will likely treat hypothesis as fact | Must fix before use |
| **WARNING** | LLM might misinterpret | Should fix |
| **TEMPORAL** | Date/time inconsistency detected | Verify and update |
| **VERIFIABLE** | Specific claim that could be fact-checked | Route to HITL or external search |
| **ROUND A** | Derived from witnessed source | Quick confirmation |
| **ROUND B** | Requires true verification | Cannot pass without confirmation |
| **PASS** | Clearly marked, no ambiguity, verified | No action needed |

---

## Quick Scan Checklist

| Pattern | Action |
|---------|--------|
| Specific percentages (89%, 73%) | Add source or mark as estimate |
| Comparison tables | Add "PROJECTED" header |
| "Achieves", "delivers", "provides" | Use "designed to", "intended to" if not validated |
| Checkmarks | Verify these are confirmed |
| "100%" anything | Almost always needs qualification |
| "Last Updated: [date]" | Check against current date |
| Version numbers with dates | Verify chronological order |
| "$X.XX" or "~$X" (pricing) | Flag for external verification |
| "averages", "typically" | Flag for source/citation |
| Competitor capability claims | Flag for external verification |

---

## What This Skill Does NOT Do

- Does not classify document types (use Stream Coding for that)
- Does not restructure documents
- Does not add deep links or references
- Does not evaluate writing quality
- **Does not check factual accuracy autonomously** (requires HITL)

---

## Related Projects

| Project | Purpose | URL |
|---------|---------|-----|
| Source of Truth Creator | Create epistemically calibrated docs | github.com/frmoretto/source-of-truth-creator |
| Stream Coding | Documentation-first methodology | github.com/frmoretto/stream-coding |
| ArXiParse | Scientific paper verification | arxiparse.org |

---

## Changelog

### v2.1.3 (2026-03-02)
- **FIXED:** `document_hash.py` now implements full FORMAT_SPEC §2.1-2.4 compliance
- **FIXED:** Fence-aware end marker detection (Quine Protection per §2.3/§8.5)
- **FIXED:** All 4 deployment copies converged to single canonical implementation
- **ADDED:** `canonicalize()` function: trailing whitespace stripping, newline collapsing, NFC normalization
- **ADDED:** YAML-aware `document-sha256` removal with multiline continuation support (§2.2)
- **ADDED:** Fence-tracking test vectors (7 new tests, 15 total)

### v2.1.0 (2026-01-27)
- **ADDED:** Claim Completion Status semantics (PENDING/VERIFIED by field presence)
- **ADDED:** Source Field Semantics (actionable vs. what-was-found)
- **ADDED:** Claim ID Format guidance with collision analysis
- **ADDED:** Body Structure Requirements (HITL Verification Record mandatory when claims exist)
- **ADDED:** New validation codes: E-ST10, W-ST11, W-HC01, W-HC02, E-SC06 (FORMAT_SPEC §1.2-1.3)
- **ADDED:** Bundled scripts: `claim_id.py`, `document_hash.py`
- **UPDATED:** References to FORMAT_SPEC v2.1
- **UPDATED:** CGD output example to version 2.1

### v2.0.0 (2026-01-13)
- **ADDED:** agentskills.io compliant YAML frontmatter
- **ADDED:** Clarity Gate Format Specification v2.0 compliance (unified CGD/SOT)
- **ADDED:** SOT validation support with E-TB* error codes
- **ADDED:** Validation rules mapping (9 points → rule codes)
- **ADDED:** CGD output format template with end markers
- **ADDED:** Quine Protection note (§2.3 fence-aware marker detection)
- **ADDED:** Redacted Export feature (§8.11)
- **UPDATED:** `hitl-claims` format to v2.0 schema (id, text, value, source, location, round)
- **UPDATED:** End marker format to HTML comment style
- **UPDATED:** Unified format spec v2.0 (single `.cgd.md` extension)
- **RESTRUCTURED:** For multi-platform skill discovery

### v1.6 (2025-12-31)
- Added Two-Round HITL verification system
- Round A: Derived Data Confirmation
- Round B: True HITL Verification

### v1.5 (2025-12-28)
- Added Point 8: Temporal Coherence
- Added Point 9: Externally Verifiable Claims

### v1.4 (2025-12-23)
- Added CGD annotation output mode

### v1.3 (2025-12-21)
- Restructured points into Epistemic (1-4) and Data Quality (5-7)

### v1.2 (2025-12-21)
- Added Source of Truth request step

### v1.1 (2025-12-21)
- Added HITL Fact Verification (mandatory)

### v1.0 (2025-11)
- Initial release with 6-point verification

---

**Version:** 2.1.3
**Spec Version:** 2.1
**Author:** Francesco Marinoni Moretto
**License:** CC-BY-4.0