---
title: Gemini 3 Flash OCRs Dilbert accurately
date: '2026-02-02T18:53:15+08:00'
categories:
- llms
description: Gemini 3 Flash is accurate and cheap enough to make large-scale comic OCR practical, with a credible local-model fallback for offline use.
keywords: [OCR, Gemini 3 Flash, Dilbert, comics, transcription, benchmark]
---
[Scott Adams](https://en.wikipedia.org/wiki/Scott_Adams), the author of [Dilbert](https://en.wikipedia.org/wiki/Dilbert), passed away last month. While his work will live on, I was curious about the best way to build a Dilbert search engine.
The first step is to extract the text. [Pavan](https://github.com/pavankumart18) tested over half a dozen LLMs on ~30 Dilbert strips to see which one transcribed them best.
[Here are the results](https://pavankumart18.github.io/comic-transcriptions/).
**Summary**: [Gemini 3 Flash](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-flash) does the best, and would cost ~$20 to process the entire Dilbert archive. But if you want a local solution, [Qwen 3 VL 32b](https://ollama.com/library/qwen3-vl:32b) is the best.
| Model | Score (%) | Text (40) | Spkr (25) | Caps (15) | Panel (10) | Halluc (10) |
|---|---|---|---|---|---|---|
| gemini-3-flash-preview | 99.3% | 39.9 | 24.4 | 15.0 | 10.0 | 10.0 |
| qwen3-vl-32b-instruct | 96.0% | 39.8 | 21.6 | 15.0 | 9.9 | 9.7 |
| llama-4-maverick | 85.1% | 38.5 | 16.3 | 13.2 | 9.1 | 8.1 |
| llama-4-scout | 84.1% | 39.0 | 16.4 | 12.5 | 8.7 | 7.5 |
| gemma-3-27b-it | 81.3% | 37.8 | 13.1 | 14.4 | 8.4 | 7.6 |
| nemotron-nano-12b-v2-vl-free | 81.3% | 38.6 | 13.1 | 14.4 | 8.5 | 6.6 |
| molmo-2-8b-free | 70.4% | 36.2 | 16.4 | 0.5 | 8.8 | 8.4 |
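The per-criterion columns add up to each model's total, so the rubric appears to be a straight weighted sum out of 100 (Text 40, Speaker 25, Captions 15, Panels 10, Hallucination 10). A minimal sketch of that scoring, with the criterion names paraphrased from the column headers:

```python
# Weighted rubric inferred from the table headers: each criterion has a
# maximum score, and the total is a simple sum out of 100.
WEIGHTS = {"text": 40, "speaker": 25, "captions": 15, "panels": 10, "hallucination": 10}

def total_score(scores: dict[str, float]) -> float:
    """Sum per-criterion scores, checking each stays within its maximum."""
    for criterion, value in scores.items():
        assert 0 <= value <= WEIGHTS[criterion], f"{criterion} out of range"
    return round(sum(scores.values()), 1)

# Gemini 3 Flash's row from the table:
gemini = {"text": 39.9, "speaker": 24.4, "captions": 15.0, "panels": 10.0, "hallucination": 10.0}
print(total_score(gemini))  # → 99.3
```

Running the other rows through the same sum reproduces the Score column, which is a good sanity check on the table.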
That accuracy of 99.3% is impressive. Here's the biggest error it made:

1. Dogbert: CHAPTER IV. "TIME MANAGEMENT"
2. Dogbert: "ALWAYS POSTPONE MEETINGS WITH TIME-WASTING MORONS."\
Dilbert: "HOW DO YOU DO THAT?"
3. Dogbert: CAN I GET BACK TO YOU ON THAT?
Can you spot the error? The model attributed the text to Dogbert instead of the computer. (But you _could_ argue that Dogbert is the one typing it...)
---
Here's another error:

1. Dilbert: I'VE DECIDED WE SHOULD OPERATE ALONG MORE CLASSIC LINES, LIKE DR. FRANKENSTEIN'S LAB.
2. Dogbert: YOU KNOW WHAT THAT MAKES YOU?
3. Dogbert: I'VE GOT A HUNCH...
4. Dilbert: LET'S PRACTICE...
5. Dilbert: DOGBERT, FETCH ME A BRAIN!\
Dogbert: LIKE YOUR PRESENT MODEL, OR ONE THAT WORKS?
Can you spot the error? In Panel 2, it's Dilbert speaking, not Dogbert.
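The transcript format in these examples (numbered panels, `Speaker: TEXT` lines, a trailing `\` joining multiple speakers in one panel) is easy to parse downstream, e.g. for indexing by speaker. A minimal sketch, assuming that exact format holds:

```python
import re

def parse_transcript(text: str) -> list[list[tuple[str, str]]]:
    """Parse a numbered panel transcript into [[(speaker, line), ...], ...].

    Panels start with "N. "; lines without a number continue the
    previous panel (the source marks these with a trailing backslash).
    """
    panels: list[list[tuple[str, str]]] = []
    for raw in text.strip().splitlines():
        line = raw.strip().rstrip("\\").strip()
        m = re.match(r"(\d+)\.\s+(.*)", line)
        if m:                        # a new panel begins
            panels.append([])
            line = m.group(2)
        speaker, _, dialogue = line.partition(": ")
        panels[-1].append((speaker, dialogue))
    return panels

sample = """\
1. Dogbert: CHAPTER IV. "TIME MANAGEMENT"
2. Dogbert: "ALWAYS POSTPONE MEETINGS WITH TIME-WASTING MORONS."\\
Dilbert: "HOW DO YOU DO THAT?"
3. Dogbert: CAN I GET BACK TO YOU ON THAT?"""

print(len(parse_transcript(sample)))  # → 3
```

This keeps the speaker attribution attached to each line, which is exactly the field where the models differ most in the benchmark.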
---
In fact, the only transcription errors Gemini 3 Flash made were writing "McDONALD'S" instead of "MCDONALD'S" ([see panel 2](https://web.archive.org/web/20230228083128im_/https://assets.amuniversal.com/3eb64cb0979e012f2fe400163e41dd5b)), and dropping the hyphen at the line break in "PRESEN-TATION" ([see panel 4](https://web.archive.org/web/20230228231330im_/https://assets.amuniversal.com/03d47960979f012f2fe400163e41dd5b)).
Qwen 3 VL 32b made almost as few transcription errors. The bigger gap is in speaker detection, where every model below the top two falls off steeply.
---
This incredibly low cost + high accuracy enables a _number_ of new things. For example:
- **Infrastructure Serial Tracking:** Extract serial numbers and maintenance dates from photos of utility meters, fire hydrants, streetlights, etc. to build a live digital twin of city assets.
- **Small-Business Permit Audits:** Process photos of street-facing shop permits to flag expired licenses.
- **Evidence Label Transcription:** Annotate small-text labels on physical exhibits in legal archives, e.g. "Exhibit A" becomes "Exhibit A: Photo of the crime scene taken on 03/15/2020 at 14:32 by Officer J. Smith."
---
I spent [7 years typing out every one of the ~3,000 Calvin & Hobbes strips by hand](https://www.s-anand.net/blog/the-calvin-and-hobbes-search-takedown/). For these ~12,000 Dilbert strips, the same job might take a few hours and a few dollars.
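As a back-of-the-envelope check on that claim, spreading the ~$20 archive estimate from earlier evenly across ~12,000 strips (both rounded figures):

```python
# Rough cost per strip, using the post's two round numbers:
# ~$20 to OCR the whole archive, ~12,000 strips in it.
ARCHIVE_COST_USD = 20
NUM_STRIPS = 12_000

cost_per_strip = ARCHIVE_COST_USD / NUM_STRIPS
print(f"${cost_per_strip:.4f} per strip")  # → $0.0017 per strip
```

That is under a fifth of a cent per strip, which is what makes "a few dollars" for the archive plausible even with retries.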