--- title: LLMs still do not locate bounding boxes well date: "2024-11-01T16:20:03Z" lastmod: "2024-11-02T00:32:28Z" categories: - llms wp_id: 3682 description: "Vision-capable LLMs can describe scenes but still struggle to place accurate pixel-level bounding boxes, though simple aids like labeled gridlines improve them somewhat." keywords: [bounding boxes, vision models, computer vision, LLMs, object detection, evaluation] ---

I sent an image to over a dozen LLMs that support vision, asking them:

Detect objects in this 1280x720 px image and return their color and bounding boxes in pixels. Respond as a JSON object: {[label]: [color, x1, y1, x2, y2], …}

None of the models did a good-enough job. It looks like we have some time to go before LLMs become good at bounding boxes.

I've given them a subjective rating on a 1-5 scale below.

ModelPositionsSizes
gemini-1.5-flash-001🟢🟢🟢🔴🔴🟢🟢🟢🟢🔴
gemini-1.5-flash-8b🟢🟢🟢🔴🔴🟢🟢🟢🔴🔴
gemini-1.5-flash-002🟢🟢🔴🔴🔴🟢🟢🟢🔴🔴
gemini-1.5-pro-002🟢🟢🟢🔴🔴🟢🟢🟢🟢🔴
gpt-4o-mini🟢🔴🔴🔴🔴🟢🟢🔴🔴🔴
gpt-4o🟢🟢🟢🟢🔴🟢🟢🟢🟢🔴
chatgpt-4o-latest🟢🟢🟢🟢🔴🟢🟢🟢🟢🔴
claude-3-haiku-20240307🟢🔴🔴🔴🔴🟢🟢🔴🔴🔴
claude-3=5-sonnet-20241022🟢🟢🟢🔴🔴🟢🟢🟢🔴🔴
llama-3.2-11b-vision-preview🔴🔴🔴🔴🔴🔴🔴🔴🔴🔴
llama-3.2-90b-vision-preview🟢🟢🟢🔴🔴🟢🟢🟢🔴🔴
qwen-2-vl-72b-instruct🟢🟢🟢🔴🔴🟢🟢🔴🔴🔴
pixtral-12b🟢🟢🔴🔴🔴🟢🟢🟢🔴🔴

I used an app I built for this.

Here is the original image along with the individual results.

![](/blog/assets/shapes-1024x576.webp) ![](/blog/assets/gemini-1.5-flash-8b-1024x576.webp) ![](/blog/assets/gemini-1.5-flash-001-1024x576.webp) ![](/blog/assets/gemini-1.5-flash-002-1024x576.webp) ![](/blog/assets/gemini-1.5-pro-002-1024x576.webp) ![](/blog/assets/gpt-4o-mini-1024x576.webp) ![](/blog/assets/gpt-4o-1024x576.webp) ![](/blog/assets/chatgpt-4o-latest-1024x576.webp) ![](/blog/assets/claude-3-haiku-20240307-1024x576.webp) ![](/blog/assets/claude-3-5-sonnet-20241022-1024x576.webp) ![](/blog/assets/llama-3.2-11b-vision-preview-1024x576.webp) ![](/blog/assets/llama-3.2-90b-vision-preview-1024x576.webp) ![](/blog/assets/pixtral-12b-1024x576.webp) ![](/blog/assets/qwen-2-vl-72b-instruct-1024x576.webp)

Update

Adding gridlines with labeled axes helps the LLMs. (Thanks @Bijan Mishra.) Here are a few examples:

![](/blog/assets/chatgpt-4o-latest-1-1024x576.webp) ![](/blog/assets/gemini-1.5-flash-002-1-1024x576.webp) ![](/blog/assets/claude-3-5-sonnet-20241022-1-1024x576.webp)