---
title: Gemini 3 Flash OCRs Dilbert accurately
date: '2026-02-02T18:53:15+08:00'
categories:
- llms
description: Gemini 3 Flash is accurate and cheap enough to make large-scale comic OCR practical, with a credible local-model fallback for offline use.
keywords: [OCR, Gemini 3 Flash, Dilbert, comics, transcription, benchmark]
---

[Scott Adams](https://en.wikipedia.org/wiki/Scott_Adams), the author of [Dilbert](https://en.wikipedia.org/wiki/Dilbert), passed away last month. While his work will live on, I was curious about the best way to build a Dilbert search engine. The first step is to extract the text.

[Pavan](https://github.com/pavankumart18) tested over half a dozen LLMs on ~30 Dilbert strips to see which one transcribed them best. [Here are the results](https://pavankumart18.github.io/comic-transcriptions/).

**Summary**: [Gemini 3 Flash](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-flash) does the best, and would cost ~$20 to process the entire Dilbert archive. But if you want a local solution, [Qwen 3 VL 32b](https://ollama.com/library/qwen3-vl:32b) is the best.
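Getting a transcription out of Gemini is a short script. Here's a minimal sketch, assuming the `google-genai` Python SDK (`pip install google-genai`), a `GEMINI_API_KEY` in the environment, and my own prompt wording; none of this comes from Pavan's benchmark harness:

```python
# Hypothetical transcription helper. The prompt wording is an assumption,
# not Pavan's actual harness; the model name comes from the table below.
PROMPT = (
    "Transcribe this comic strip. Number each panel, and prefix every "
    "line of dialogue with the speaker's name."
)

def transcribe(image_bytes: bytes, mime_type: str = "image/gif") -> str:
    """Send one strip image to Gemini and return its transcription."""
    # Imported lazily so the prompt above is usable without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up GEMINI_API_KEY from the environment
    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type=mime_type),
            PROMPT,
        ],
    )
    return response.text
```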
| Model | Score (%) | Text (40) | Spkr (25) | Caps (15) | Panel (10) | Halluc (10) |
|---|---:|---:|---:|---:|---:|---:|
| gemini-3-flash-preview | 99.3% | 39.9 | 24.4 | 15.0 | 10.0 | 10.0 |
| qwen3-vl-32b-instruct | 96.0% | 39.8 | 21.6 | 15.0 | 9.9 | 9.7 |
| llama-4-maverick | 85.1% | 38.5 | 16.3 | 13.2 | 9.1 | 8.1 |
| llama-4-scout | 84.1% | 39.0 | 16.4 | 12.5 | 8.7 | 7.5 |
| gemma-3-27b-it | 81.3% | 37.8 | 13.1 | 14.4 | 8.4 | 7.6 |
| nemotron-nano-12b-v2-vl-free | 81.3% | 38.6 | 13.1 | 14.4 | 8.5 | 6.6 |
| molmo-2-8b-free | 70.4% | 36.2 | 16.4 | 0.5 | 8.8 | 8.4 |
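Since the rubric weights (40 + 25 + 15 + 10 + 10) sum to 100, each model's overall score should be the plain sum of its five sub-scores. A quick sanity check on the rows above (the displayed sub-scores are rounded, so totals can drift by 0.1):

```python
# Each tuple is (model, listed score, five sub-scores) copied from the table.
rows = [
    ("gemini-3-flash-preview", 99.3, (39.9, 24.4, 15.0, 10.0, 10.0)),
    ("qwen3-vl-32b-instruct", 96.0, (39.8, 21.6, 15.0, 9.9, 9.7)),
    ("llama-4-maverick", 85.1, (38.5, 16.3, 13.2, 9.1, 8.1)),
    ("llama-4-scout", 84.1, (39.0, 16.4, 12.5, 8.7, 7.5)),
    ("gemma-3-27b-it", 81.3, (37.8, 13.1, 14.4, 8.4, 7.6)),
    ("nemotron-nano-12b-v2-vl-free", 81.3, (38.6, 13.1, 14.4, 8.5, 6.6)),
    ("molmo-2-8b-free", 70.4, (36.2, 16.4, 0.5, 8.8, 8.4)),
]

for model, listed, parts in rows:
    # Allow 0.1 of drift from rounding the displayed sub-scores.
    assert abs(sum(parts) - listed) < 0.15, model
```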
That accuracy of 99.3% is impressive. Here's the biggest error it made:

![](https://web.archive.org/web/20230301061931im_/https://assets.amuniversal.com/d999ece0979e012f2fe400163e41dd5b)

1. Dogbert: CHAPTER IV. "TIME MANAGEMENT"
2. Dogbert: "ALWAYS POSTPONE MEETINGS WITH TIME-WASTING MORONS."\
   Dilbert: "HOW DO YOU DO THAT?"
3. Dogbert: CAN I GET BACK TO YOU ON THAT?

Can you spot the error? The model attributed the text to Dogbert instead of the computer. (But you _could_ argue that Dogbert is the one typing it...)

---

Here's another error:

![](https://web.archive.org/web/20230228074232im_/https://assets.amuniversal.com/7cf00b10979d012f2fe400163e41dd5b)

1. Dilbert: I'VE DECIDED WE SHOULD OPERATE ALONG MORE CLASSIC LINES, LIKE DR. FRANKENSTEIN'S LAB.
2. Dogbert: YOU KNOW WHAT THAT MAKES YOU?
3. Dogbert: I'VE GOT A HUNCH...
4. Dilbert: LET'S PRACTICE...
5. Dilbert: DOGBERT, FETCH ME A BRAIN!\
   Dogbert: LIKE YOUR PRESENT MODEL, OR ONE THAT WORKS?

Can you spot the error? In Panel 2, it's Dilbert speaking, not Dogbert.

---

In fact, the only transcription errors Gemini 3 Flash made were writing "McDONALD'S" instead of "MCDONALD'S" ([see panel 2](https://web.archive.org/web/20230228083128im_/https://assets.amuniversal.com/3eb64cb0979e012f2fe400163e41dd5b)), and not hyphenating the line break in "PRESEN-TATION" ([see panel 4](https://web.archive.org/web/20230228231330im_/https://assets.amuniversal.com/03d47960979f012f2fe400163e41dd5b)).

Qwen 3 VL 32b made almost as few errors. The bigger gap is in speaker detection, where the other models fall off steeply.

---

This incredibly low cost + high accuracy enables a _number_ of new things. For example:

- **Infrastructure Serial Tracking:** Extract serial numbers and maintenance dates from photos of utility meters, fire hydrants, streetlights, etc. to build a live digital twin of city assets.
- **Small-Business Permit Audits:** Process photos of street-facing shop permits to flag expired licenses.
- **Evidence Label Transcription:** Annotate small-text labels on physical exhibits in legal archives, e.g. "Exhibit A" becomes "Exhibit A: Photo of the crime scene taken on 03/15/2020 at 14:32 by Officer J. Smith."

---

I spent [7 years typing out every one of the ~3,000 Calvin & Hobbes strips by hand](https://www.s-anand.net/blog/the-calvin-and-hobbes-search-takedown/). For these ~12,000 Dilbert strips, it might take a few hours and a few dollars to do the same.
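The back-of-envelope math on that, using the post's own estimates (~$20 for the whole ~12,000-strip archive):

```python
# Both figures are the rough estimates quoted above, not measured prices.
archive_cost_usd = 20
num_strips = 12_000

cost_per_strip = archive_cost_usd / num_strips
print(f"~${cost_per_strip:.4f} per strip")  # ~$0.0017: a sixth of a cent each
```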