--- date: "2025-05-10T09:35:14Z" categories: - linkedin - llms description: "Hallucinations become much less dangerous when multiple cheap models check each other, because their mistakes overlap far less than their individual error rates suggest." keywords: [hallucinations, ensemble methods, double checking, LLM reliability, classification, automation quality] --- How can we rely on unreliable LLMs?" people ask me. Double-checking with another LLM," is my top response. That's what we do with unreliable humans, anyway. LLMs feel magical until they start confidently hallucinating. When I asked 11 cheap LLMs to classify customer service messages into billing, refunds, order changes, etc. they got it wrong ~14%. Not worse than a human, but in scale-sensitive settings, that's not good enough. But different LLMs make **DIFFERENT** mistakes. When double-checking with two LLMs, they were **both** wrong only 4% of the time. With 4 LLMs, it was only 1%. Double-checking costs almost nothing. When LLMs disagree, a human can check it. Also, multiple LLMs rarely agree on the **same** wrong answer. So, instead of 100% automation at 85% quality, double-check with multiple LLMs. You can get 80% automation with 99% quality. - Full analysis: https://sanand0.github.io/llmevals/double-checking/ - Code and data: https://github.com/sanand0/llmevals/tree/main/double-checking ![](https://github.com/sanand0/llmevals/raw/main/double-checking/improvement.webp) [LinkedIn](https://www.linkedin.com/feed/update/urn%3Ali%3Ashare%3A7326902628490059776)