---
title: Gemini 3 Flash OCRs Dilbert accurately
date: '2026-02-02T18:53:15+08:00'
categories:
- llms
description: Gemini 3 Flash is accurate and cheap enough to make large-scale comic OCR practical, with a credible local-model fallback for offline use.
keywords: [OCR, Gemini 3 Flash, Dilbert, comics, transcription, benchmark]
---

[Scott Adams](https://en.wikipedia.org/wiki/Scott_Adams), the author of [Dilbert](https://en.wikipedia.org/wiki/Dilbert), passed away last month. While his work will live on, I was curious about the best way to build a Dilbert search engine. The first step is to extract the text.

[Pavan](https://github.com/pavankumart18) tested over half a dozen LLMs on ~30 Dilbert strips to see which one transcribed them best. [Here are the results](https://pavankumart18.github.io/comic-transcriptions/).

**Summary**: [Gemini 3 Flash](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-flash) does the best, and would cost ~$20 to process the entire Dilbert archive. But if you want a local solution, [Qwen 3 VL 32b](https://ollama.com/library/qwen3-vl:32b) is the best.
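Getting a transcription out of Gemini is a short script. Here's a minimal sketch, assuming the `google-genai` Python SDK (`pip install google-genai`), a `GEMINI_API_KEY` in the environment, and my own prompt wording; none of this comes from Pavan's benchmark harness:

```python
# Hypothetical transcription helper. The prompt wording is an assumption,
# not Pavan's actual harness; the model name comes from the table below.
PROMPT = (
    "Transcribe this comic strip. Number each panel, and prefix every "
    "line of dialogue with the speaker's name."
)

def transcribe(image_bytes: bytes, mime_type: str = "image/gif") -> str:
    """Send one strip image to Gemini and return its transcription."""
    # Imported lazily so the prompt above is usable without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up GEMINI_API_KEY from the environment
    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type=mime_type),
            PROMPT,
        ],
    )
    return response.text
```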
| Model | Score (%) | Text (40) | Spkr (25) | Caps (15) | Panel (10) | Halluc (10) |
|---|---:|---:|---:|---:|---:|---:|
| gemini-3-flash-preview | 99.3% | 39.9 | 24.4 | 15.0 | 10.0 | 10.0 |
| qwen3-vl-32b-instruct | 96.0% | 39.8 | 21.6 | 15.0 | 9.9 | 9.7 |
| llama-4-maverick | 85.1% | 38.5 | 16.3 | 13.2 | 9.1 | 8.1 |
| llama-4-scout | 84.1% | 39.0 | 16.4 | 12.5 | 8.7 | 7.5 |
| gemma-3-27b-it | 81.3% | 37.8 | 13.1 | 14.4 | 8.4 | 7.6 |
| nemotron-nano-12b-v2-vl-free | 81.3% | 38.6 | 13.1 | 14.4 | 8.5 | 6.6 |
| molmo-2-8b-free | 70.4% | 36.2 | 16.4 | 0.5 | 8.8 | 8.4 |
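Since the rubric weights (40 + 25 + 15 + 10 + 10) sum to 100, each model's overall score should be the plain sum of its five sub-scores. A quick sanity check on the rows above (the displayed sub-scores are rounded, so totals can drift by 0.1):

```python
# Each tuple is (model, listed score, five sub-scores) copied from the table.
rows = [
    ("gemini-3-flash-preview", 99.3, (39.9, 24.4, 15.0, 10.0, 10.0)),
    ("qwen3-vl-32b-instruct", 96.0, (39.8, 21.6, 15.0, 9.9, 9.7)),
    ("llama-4-maverick", 85.1, (38.5, 16.3, 13.2, 9.1, 8.1)),
    ("llama-4-scout", 84.1, (39.0, 16.4, 12.5, 8.7, 7.5)),
    ("gemma-3-27b-it", 81.3, (37.8, 13.1, 14.4, 8.4, 7.6)),
    ("nemotron-nano-12b-v2-vl-free", 81.3, (38.6, 13.1, 14.4, 8.5, 6.6)),
    ("molmo-2-8b-free", 70.4, (36.2, 16.4, 0.5, 8.8, 8.4)),
]

for model, listed, parts in rows:
    # Allow 0.1 of drift from rounding the displayed sub-scores.
    assert abs(sum(parts) - listed) < 0.15, model
```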
That accuracy of 99.3% is impressive. Here's the biggest error it made:

![](https://web.archive.org/web/20230301061931im_/https://assets.amuniversal.com/d999ece0979e012f2fe400163e41dd5b)

1. Dogbert: CHAPTER IV. "TIME MANAGEMENT"
2. Dogbert: "ALWAYS POSTPONE MEETINGS WITH TIME-WASTING MORONS."\
   Dilbert: "HOW DO YOU DO THAT?"
3. Dogbert: CAN I GET BACK TO YOU ON THAT?

Can you spot the error? The model attributed the text to Dogbert instead of the computer. (But you _could_ argue that Dogbert is the one typing it...)

---

Here's another error:

![](https://web.archive.org/web/20230228074232im_/https://assets.amuniversal.com/7cf00b10979d012f2fe400163e41dd5b)

1. Dilbert: I'VE DECIDED WE SHOULD OPERATE ALONG MORE CLASSIC LINES, LIKE DR. FRANKENSTEIN'S LAB.
2. Dogbert: YOU KNOW WHAT THAT MAKES YOU?
3. Dogbert: I'VE GOT A HUNCH...
4. Dilbert: LET'S PRACTICE...
5. Dilbert: DOGBERT, FETCH ME A BRAIN!\
   Dogbert: LIKE YOUR PRESENT MODEL, OR ONE THAT WORKS?

Can you spot the error? In Panel 2, it's Dilbert speaking, not Dogbert.

---

In fact, the only transcription errors Gemini 3 Flash made were writing "McDONALD'S" instead of "MCDONALD'S" ([see panel 2](https://web.archive.org/web/20230228083128im_/https://assets.amuniversal.com/3eb64cb0979e012f2fe400163e41dd5b)), and not hyphenating the line break in "PRESEN-TATION" ([see panel 4](https://web.archive.org/web/20230228231330im_/https://assets.amuniversal.com/03d47960979f012f2fe400163e41dd5b)).

Qwen 3 VL 32b made almost as few errors. The bigger gap is in speaker detection, where the other models fall off steeply.

---

This incredibly low cost + high accuracy enables a _number_ of new things. For example:

- **Infrastructure Serial Tracking:** Extract serial numbers and maintenance dates from photos of utility meters, fire hydrants, streetlights, etc. to build a live digital twin of city assets.
- **Small-Business Permit Audits:** Process photos of street-facing shop permits to flag expired licenses.
- **Evidence Label Transcription:** Annotate small-text labels on physical exhibits in legal archives, e.g. "Exhibit A" becomes "Exhibit A: Photo of the crime scene taken on 03/15/2020 at 14:32 by Officer J. Smith."

---

I spent [7 years typing out every one of the ~3,000 Calvin & Hobbes strips by hand](https://www.s-anand.net/blog/the-calvin-and-hobbes-search-takedown/). For these ~12,000 Dilbert strips, it might take a few hours and a few dollars to do the same.
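The back-of-envelope math on that, using the post's own estimates (~$20 for the whole ~12,000-strip archive):

```python
# Both figures are the rough estimates quoted above, not measured prices.
archive_cost_usd = 20
num_strips = 12_000

cost_per_strip = archive_cost_usd / num_strips
print(f"~${cost_per_strip:.4f} per strip")  # ~$0.0017: a sixth of a cent each
```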