---
id: ins_llm-as-judge-binary-not-likert
operator: Hamel Husain
operator_role: Independent ML consultant and Berkeley PhD researcher
source_url: https://www.lennysnewsletter.com/p/why-ai-evals-are-the-hottest-new-skill
source_type: podcast
source_title: Evals as error analysis, the benevolent dictator, LLM judges
source_date: 2026-04-28
captured_date: 2026-05-01
domain: [ai-native, engineering]
lifecycle: [ai-workflow, attribution-measurement]
maturity: applied
artifact_class: playbook
score: { originality: 4, specificity: 5, evidence: 4, transferability: 5, source: 5 }
tier: A
related: [ins_open-coding-then-axial-coding, ins_evals-are-data-analysis-on-llm-apps]
raw_ref: raw/podcasts/hamel-husain-shreya-shankar--evals-error-analysis--2026-04-28.md
---

# Build LLM-as-judge as binary true/false, one judge per failure mode, and validate against human labels

## Claim

LLM-as-judge evals should output a binary pass/fail per failure mode, not a Likert scale. Build 4–7 narrow judges total, not dozens, because most failures are fixed by prompt edits and never need a permanent eval. Always validate the judge against human-labelled data using a confusion matrix, not a single agreement number.

## Mechanism

A 1-to-5 Likert scale is "a weasel way of not making a decision": it produces averages that look reasonable while masking the cases where the judge is wrong. A binary judge forces a yes-or-no call and yields a falsifiable accuracy number. A confusion matrix surfaces the off-diagonal cells, where a judge that says "pass" 90% of the time can hide near-total failure on the long tail. Without the matrix, teams trust scaffolding that has no real signal. Two sketches at the end of this note illustrate a binary judge and the confusion-matrix check.

## Conditions

Holds when:

- Each failure mode is genuinely binary (the output either has the kill-list word or it doesn't).
- A reviewer can produce labelled data for the same traces the judge sees.
- The team will retire judges as failure modes get fixed by prompts.

Fails when:

- The failure mode is genuinely graded (severity, urgency, fluency) and binary collapses real distinctions.
- Reviewers cannot produce ground-truth labels at sufficient volume.
- The team treats judges as permanent fixtures and never prunes.

## Evidence

> "1-2-3-4-5 is a weasel way of not making a decision."

> "When people lose trust in your evals, they lose trust in you."

Operating rule Hamel and Shreya teach: most products end up with 4–7 LLM judges in total. Always look at the off-diagonal cells in the confusion matrix; agreement % is misleading on long-tail errors.

· Hamel Husain & Shreya Shankar on Lenny's Podcast, 2026-04-28

## Signals

- Judge prompts are short and binary; their accuracy is reported with confusion-matrix detail, not single-number summaries.
- The number of permanent judges stays small even as the product grows.
- Trust in the eval dashboard rises because failures map to recognisable categories.

## Counter-evidence

For ranking problems (which of these outputs is best?), pairwise comparison or graded scoring may be needed. The binary rule is conditional on detection-style judges, not preference-style ones.

## Cross-references

- `ins_open-coding-then-axial-coding`, the categories that become judges
- `ins_evals-are-data-analysis-on-llm-apps`, the broader frame
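
## Sketch: one narrow binary judge

A minimal sketch of what "one judge per failure mode, binary output" can look like, assuming an OpenAI-style chat client. The failure mode (kill-list words), the prompt wording, the helper name `judge_kill_list`, and the model name are illustrative assumptions, not the operators' exact setup.

```python
# A minimal sketch of one narrow, binary judge. Assumes the
# openai>=1.0 Python client; every other name here is illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """\
You are checking exactly one failure mode: does the reply below use
any term from the kill list, including close paraphrases?

Kill list: {kill_list}

Answer with a single word: "fail" if it does, "pass" if it does not.

Reply to check:
{output}
"""

def judge_kill_list(output: str, kill_list: list[str]) -> bool:
    """True means pass, False means fail -- no 1-to-5 scale."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                kill_list=", ".join(kill_list), output=output
            ),
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("pass")
```

The design point is that the judge checks a single failure mode and returns a boolean. A prompt that bundles several criteria, or asks for a score, is where the Likert weaseling creeps back in.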
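
## Sketch: validating the judge with a confusion matrix

A minimal validation sketch, not the operators' exact tooling. It assumes each trace carries a human label and a judge label, both `"pass"` or `"fail"`; the data layout and the function names are assumptions for illustration. The toy numbers at the end show how a single agreement figure can look healthy while the judge misses nearly every real failure.

```python
# Validate a judge against human labels with a full confusion matrix,
# not a single agreement number. Label layout is an assumption.
from collections import Counter

def confusion_matrix(human: list[str], judge: list[str]) -> Counter:
    """Count all four (human, judge) cells over the labelled traces."""
    return Counter(zip(human, judge))

def report(human: list[str], judge: list[str]) -> None:
    cells = confusion_matrix(human, judge)
    total = sum(cells.values())
    agreement = (cells[("pass", "pass")] + cells[("fail", "fail")]) / total
    print(f"agreement: {agreement:.0%}")
    # The off-diagonal cells are what the single number hides:
    print(f'human fail, judge pass (missed failures): {cells[("fail", "pass")]}')
    print(f'human pass, judge fail (false alarms):    {cells[("pass", "fail")]}')

# Illustrative numbers: 91% agreement looks fine, yet the judge misses
# 9 of the 10 real failures -- exactly the long-tail blindness the
# confusion matrix exists to surface.
human = ["pass"] * 90 + ["fail"] * 10
judge = ["pass"] * 90 + ["pass"] * 9 + ["fail"]
report(human, judge)
```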