---
name: qa-eval
description: Run OCR and translation quality evaluations across scripts and languages. Produces research-grade reports with MCR, cross-model agreement, embedding-space hallucination detection, and corpus readiness scores.
---

# QA-Eval: Quality Evaluation Framework

Run systematic quality evaluations on OCR and translation output across all scripts and languages in Source Library. Produces structured JSON results and markdown blog posts suitable for academic publication.

Issue: #1329

## Quick Start

```bash
# Load env
set -a; source .env.production.local; set +a

# OCR consistency (run each model N times, compute Modal Consistency Rate)
node scripts/eval/qa-eval.mjs consistency --corpus=bhutan --sample=10 --models=flash,opus --runs=3

# Embedding-space evaluation (hallucination detection without ground truth)
node scripts/eval/qa-eval.mjs embedding --corpus=bhutan --sample=10

# Compare against ground truth (CER for OCR, BLEU/ROUGE for translation)
node scripts/eval/qa-eval.mjs compare --corpus=bhutan --against=ocr

# Readiness score for a corpus
node scripts/eval/qa-eval.mjs readiness bhutan

# Show all results
node scripts/eval/qa-eval.mjs report --latest

# Generate blog post from results
node scripts/eval/qa-eval.mjs report --corpus=bhutan --format=blog --save
```

## Invocation Modes

### Interactive

```
/qa-eval                                  # Show help and available corpora
/qa-eval --corpus=bhutan --sample=5       # Quick consistency check
/qa-eval --corpus=bhutan --blog           # Full eval + blog post
```

### Specific Commands

```
/qa-eval consistency --corpus=bhutan --models=flash,opus --runs=3
/qa-eval embedding --corpus=bhutan --sample=20
/qa-eval compare --corpus=bhutan --against=translation
/qa-eval matrix                           # All corpora comparison table
/qa-eval readiness bhutan                 # Quick readiness score
```

### Cost Estimation

```
/qa-eval consistency --corpus=bhutan --sample=10 --models=flash,opus --runs=3 --dry-run
```

## Available Corpora

Defined in `scripts/eval/corpus-registry.json`:

| Corpus | Script | Description |
|--------|--------|-------------|
| bhutan | Tibetan | 1,325 EAP manuscripts (dbu can + dbu med) |
| latin-alchemy | Latin | Printed alchemical texts (baseline) |
| fraktur | German | Pre-1800 Fraktur/blackletter |
| arabic | Arabic | Printed Naskh |
| hebrew | Hebrew | Hebrew + Rashi script |
| chinese-classical | CJK | Woodblock-printed classical Chinese |
| sanskrit | Devanagari | Printed Sanskrit editions |
| greek-ancient | Greek | Aldine and early printed Greek |
| bph-manuscripts | Mixed | BPH high-quality manuscript scans |

## Model Aliases

| Alias | Full Model ID |
|-------|---------------|
| flash | gemini-3-flash-preview |
| lite | gemini-3.1-flash-lite-preview |
| opus | claude-opus-4-6 |
| sonnet | claude-sonnet-4-6 |
| haiku | claude-haiku-4-5-20251001 |

## Metrics

### OCR Quality

- **MCR (Modal Consistency Rate)**: % of N runs producing the majority output at temp=0
- **Pairwise character similarity**: Levenshtein-based, 0-100%
- **Syllable similarity**: Script-aware tokenization (tsheg for Tibetan, char for CJK)
- **CER (Character Error Rate)**: Edit distance / reference length (requires ground truth)

### Translation Quality

- **BLEU-4**: N-gram overlap with brevity penalty (requires ground truth)
- **ROUGE-L**: Longest common subsequence F1 (requires ground truth)
- **Embedding distance**: Cosine distance between OCR and translation embeddings (no ground truth needed)

### Hallucination Detection

- Pages where OCR→Translation embedding distance exceeds 2σ from corpus mean are flagged
- Example: Flash Lite "translating" an astrological text as a ritual manual

### Readiness Score

- **High**: MCR ≥ 90% AND cross-model agreement ≥ 85%
- **Medium**: MCR ≥ 70% AND cross-model agreement ≥ 70%
- **Low**: Below medium thresholds

## Output

Results are saved to `scripts/eval/results/` as JSON and optionally as markdown blog posts in `docs/`.
```
scripts/eval/results/
  bhutan-consistency-2026-04-23.json
  bhutan-embedding-2026-04-23.json
  matrix-2026-04-23.json
docs/
  qa-eval-bhutan-2026-04-23.md
```

## Ground Truth

Place reference transcriptions and translations in `scripts/eval/ground-truth/` as JSON:

```json
{
  "book_id": "abc123",
  "page_number": 5,
  "script": "tibetan",
  "source": "BDRC etext",
  "source_url": "https://library.bdrc.io/...",
  "ocr_ground_truth": "...",
  "translation_ground_truth": "...",
  "translation_source": "Thurman 1994"
}
```

Sources: BDRC etexts, OpenPecha, Esukhia Derge Kangyur, Lotsawa House, scholarly editions.

### OCR vs. ctext (Chinese)

For the Chinese corpus, OCR ground truth is auto-built from [ctext.org](https://ctext.org) canonical transcriptions:

```bash
node scripts/eval/build-ctext-groundtruth.mjs           # dry run: shows alignment + which works pass the guard
node scripts/eval/build-ctext-groundtruth.mjs --write   # write pinned ground-truth files
node scripts/eval/qa-eval.mjs compare --corpus=chinese --against=ocr
```

Two things make this work where a naive CER fails:

- **Subsequence alignment** (`subsequenceCER` in `lib/metrics.mjs`). Most of our Chinese editions are *commentary* editions: the canonical main text is interleaved with small-character annotation that ctext's main-text-only transcription lacks. The reference is matched as an in-order subsequence of the OCR, and the extra (commentary) characters are skipped at no cost, so the metric scores OCR error on the canonical text only. A plain edit distance scored the Book of Odes at 6% accuracy when it was really 99%.
- **Pinned book + page = identity guard.** `compare` fetches the exact `book_id`/`page_number` from each ground-truth file (no fuzzy title matching that could grab a same-phrase decoy). The generator only writes a file when the passage aligns below `--threshold` (default 0.30); anything above is skipped as a wrong book or a divergent recension (e.g. Zhu Xi's reordered *Great Learning*, the *Shiji* with 三家注). Skipped works are reported, not counted as OCR error.

Coverage caveat: ctext holds canonical **printed** texts only. Manuscripts, tables, and rare/regional works (the actual OCR frontier) are out of scope and need MCR / cross-model / embedding checks instead.

Baseline run (2026-06-25): **98.5% character accuracy** across 7 canonical works.

## Architecture

```
scripts/eval/
  qa-eval.mjs              # CLI entrypoint
  lib/
    metrics.mjs            # All metric functions
    runners.mjs            # Gemini + Claude model execution
    sampling.mjs           # MongoDB page sampling
    report.mjs             # JSON + Markdown output
    embedding-eval.mjs     # Embedding-space evaluation
  corpus-registry.json     # Known corpora
  ground-truth/            # Reference data
  results/                 # Output
```

## Key References

- Blog post: `docs/blog-tibetan-ocr-benchmark.md`
- Prototype: `_tmp-ocr-consistency.mjs`
- Embedding model: `gemini-embedding-2-preview` (768d, matches production search)
- Related papers: GlotOCR Bench, Wang & Wang 2025, Conformal Risk Control for OCR
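
## Appendix: Metric Sketches

The definitions in the Metrics and OCR vs. ctext sections can be illustrated with a minimal sketch. This is not the code in `lib/metrics.mjs`: every function name and input shape below is hypothetical. It also assumes that "exceeds 2σ" means a one-sided cutoff above the mean using the population standard deviation, and that skipping commentary characters at no cost makes the error count equal to the reference length minus the longest common subsequence.

```javascript
// Hypothetical helpers illustrating the documented metrics.
// None of these names come from lib/metrics.mjs.

// MCR: percentage of runs whose output equals the most frequent output.
function modalConsistencyRate(outputs) {
  if (outputs.length === 0) return 0;
  const counts = new Map();
  for (const text of outputs) counts.set(text, (counts.get(text) || 0) + 1);
  return (100 * Math.max(...counts.values())) / outputs.length;
}

// Readiness tier from the thresholds in the Readiness Score section.
// Both arguments are percentages in the range 0-100.
function readinessTier(mcr, crossModelAgreement) {
  if (mcr >= 90 && crossModelAgreement >= 85) return "high";
  if (mcr >= 70 && crossModelAgreement >= 70) return "medium";
  return "low";
}

// Flag pages whose OCR-to-translation embedding distance is more than
// k standard deviations above the corpus mean.
// Assumptions: one-sided cutoff, population standard deviation,
// and pages shaped as { page, distance }.
function flagOutlierPages(pages, k = 2) {
  const n = pages.length;
  if (n === 0) return [];
  const mean = pages.reduce((sum, p) => sum + p.distance, 0) / n;
  const variance =
    pages.reduce((sum, p) => sum + (p.distance - mean) ** 2, 0) / n;
  const cutoff = mean + k * Math.sqrt(variance);
  return pages.filter((p) => p.distance > cutoff).map((p) => p.page);
}

// Subsequence-style CER: OCR characters missing from the reference
// (e.g. interleaved commentary) cost nothing, so the error count is the
// number of reference characters that cannot be matched in order:
// refLength - LCS(reference, ocr).
function subsequenceCerSketch(reference, ocr) {
  const ref = Array.from(reference); // split by code point, not UTF-16 unit
  const hyp = Array.from(ocr);
  if (ref.length === 0) return 0;
  let prev = new Array(hyp.length + 1).fill(0);
  for (let i = 1; i <= ref.length; i++) {
    const curr = new Array(hyp.length + 1).fill(0);
    for (let j = 1; j <= hyp.length; j++) {
      curr[j] =
        ref[i - 1] === hyp[j - 1]
          ? prev[j - 1] + 1 // characters match: extend the subsequence
          : Math.max(prev[j], curr[j - 1]); // skip a ref or OCR character
    }
    prev = curr;
  }
  return (ref.length - prev[hyp.length]) / ref.length;
}
```

For example, `subsequenceCerSketch("abcd", "aXbYcd")` returns 0 because the inserted `X` and `Y` are skipped, whereas a plain edit distance would count them as two errors. `readinessTier(92, 80)` returns `"medium"`, since the cross-model agreement falls short of the 85% needed for the high tier.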