---
name: vision-language-models
description: GPT-5/4o, Claude 4.5, Gemini 2.5/3, Grok 4 vision patterns for image analysis, document understanding, and visual QA. Use when implementing image captioning, document/chart analysis, or multi-image comparison.
context: fork
agent: multimodal-specialist
version: 1.0.0
author: OrchestKit
user-invocable: false
tags: [vision, multimodal, image, gpt-5, claude-4, gemini, grok, vlm, 2026]
---

# Vision Language Models (2026)

Integrate vision capabilities from leading multimodal models for image understanding, document analysis, and visual reasoning.

## Overview

- Image captioning and description generation
- Visual question answering (VQA)
- Document/chart/diagram analysis with OCR
- Multi-image comparison and reasoning
- Bounding box detection and region analysis
- Video frame analysis

## Model Comparison (January 2026)

| Model | Context | Strengths | Vision Input |
|-------|---------|-----------|--------------|
| **GPT-5.2** | 128K | Best general reasoning, multimodal | Up to 10 images |
| **Claude Opus 4.5** | 200K | Best coding, sustained agent tasks | Up to 100 images |
| **Gemini 2.5 Pro** | 1M+ | Longest context, video analysis | 3,600 images max |
| **Gemini 3 Pro** | 1M | Deep Think, 100% AIME 2025 | Enhanced segmentation |
| **Grok 4** | 2M | Real-time X integration, DeepSearch | Images + upcoming video |

## Image Input Methods

### Base64 Encoding (All Providers)

```python
import base64
import mimetypes

def encode_image_base64(image_path: str) -> tuple[str, str]:
    """Encode local image to base64 with MIME type."""
    mime_type, _ = mimetypes.guess_type(image_path)
    mime_type = mime_type or "image/png"

    with open(image_path, "rb") as f:
        base64_data = base64.standard_b64encode(f.read()).decode("utf-8")

    return base64_data, mime_type
```

### OpenAI GPT-5/4o Vision

```python
from openai import OpenAI

client = OpenAI()

def analyze_image_openai(image_path: str, prompt: str) -> str:
    """Analyze image using GPT-5 or GPT-4o."""
    base64_data, mime_type = encode_image_base64(image_path)

    response = client.chat.completions.create(
        model="gpt-5",  # or "gpt-4o", "gpt-4.1"
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:{mime_type};base64,{base64_data}",
                    "detail": "high"  # low, high, or auto
                }}
            ]
        }],
        # GPT-5 and o-series models take max_completion_tokens, not max_tokens
        max_completion_tokens=4096
    )
    return response.choices[0].message.content
```

### Claude 4.5 Vision (Anthropic)

```python
import anthropic

client = anthropic.Anthropic()

def analyze_image_claude(image_path: str, prompt: str) -> str:
    """Analyze image using Claude Opus 4.5 or Sonnet 4.5."""
    base64_data, media_type = encode_image_base64(image_path)

    response = client.messages.create(
        model="claude-opus-4-5-20251101",  # or claude-sonnet-4-5
        max_tokens=4096,  # required by the Messages API
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": base64_data
                    }
                },
                {"type": "text", "text": prompt}
            ]
        }]
    )
    return response.content[0].text
```

### Gemini 2.5/3 Vision (Google)

```python
import time

# Uses the legacy google-generativeai package (superseded by google-genai)
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")

def analyze_image_gemini(image_path: str, prompt: str) -> str:
    """Analyze image using Gemini 2.5 Pro or Gemini 3."""
    model = genai.GenerativeModel("gemini-2.5-pro")  # or gemini-3-pro
    image = Image.open(image_path)
    response = model.generate_content([prompt, image])
    return response.text

# For video analysis (Gemini excels here)
def analyze_video_gemini(video_path: str, prompt: str) -> str:
    """Analyze video using Gemini's native video support."""
    model = genai.GenerativeModel("gemini-2.5-pro")
    video_file = genai.upload_file(video_path)

    # Uploaded videos are processed asynchronously; wait until the file is ready
    while video_file.state.name == "PROCESSING":
        time.sleep(2)
        video_file = genai.get_file(video_file.name)

    response = model.generate_content([prompt, video_file])
    return response.text
```

### Grok 4 Vision (xAI)

```python
from openai import OpenAI

# Grok uses OpenAI-compatible API
client = OpenAI(
    api_key="YOUR_XAI_API_KEY",
    base_url="https://api.x.ai/v1"
)

def analyze_image_grok(image_path:
                       str, prompt: str) -> str:
    """Analyze image using Grok 4 with real-time capabilities."""
    base64_data, mime_type = encode_image_base64(image_path)

    response = client.chat.completions.create(
        model="grok-4",  # or grok-2-vision-1212
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:{mime_type};base64,{base64_data}"
                }}
            ]
        }]
    )
    return response.choices[0].message.content
```

## Multi-Image Analysis

```python
import anthropic

# Async client so the coroutine below does not block the event loop
async_client = anthropic.AsyncAnthropic()

async def compare_images(images: list[str], prompt: str) -> str:
    """Compare multiple images (Claude supports up to 100)."""
    content = []

    for img_path in images:
        base64_data, media_type = encode_image_base64(img_path)
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64_data
            }
        })

    # Put the question after the images it refers to
    content.append({"type": "text", "text": prompt})

    response = await async_client.messages.create(
        model="claude-opus-4-5-20251101",
        max_tokens=8192,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text
```

## Object Detection (Gemini 2.5+)

```python
import json

def detect_objects_gemini(image_path: str) -> list[dict]:
    """Detect objects with bounding boxes using Gemini 2.5+."""
    model = genai.GenerativeModel("gemini-2.5-pro")
    image = Image.open(image_path)

    response = model.generate_content(
        [
            "Detect all objects in this image. Return bounding boxes "
            "as JSON with format: {objects: [{label, box: [x1,y1,x2,y2]}]}",
            image
        ],
        # Ask for raw JSON so the reply is not wrapped in prose or a code fence
        generation_config={"response_mime_type": "application/json"}
    )

    # The prompt asks for a top-level "objects" key; return just that list
    return json.loads(response.text)["objects"]
```

## Token Cost Optimization

| Provider | Detail Level | Cost Impact |
|----------|-------------|-------------|
| OpenAI | `low` (fixed cost, 85 tokens on GPT-4o) | Use for classification |
| OpenAI | `high` (85 base + 170 tokens/tile on GPT-4o) | Use for OCR/charts |
| Gemini | 258 tokens base | Scales with resolution |
| Claude | About (width × height) / 750 tokens | Resize large images; batch for efficiency |

OpenAI per-image token counts vary by model family, so check the current pricing table for the model you deploy.

```python
from openai import OpenAI

client = OpenAI()
image_url = "https://example.com/photo.jpg"  # placeholder: HTTPS URL or base64 data URL

# Cost-optimized simple classification
response = client.chat.completions.create(
    model="gpt-4o-mini",  # Cheaper for simple tasks
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Is there a person? Reply: yes/no"},
            {"type": "image_url", "image_url": {
                "url": image_url,
                "detail": "low"  # Minimal tokens
            }}
        ]
    }]
)
```

## Image Size Limits (2026)

| Provider | Max Size | Max Images | Notes |
|----------|----------|------------|-------|
| OpenAI | 20MB | 10/request | GPT-5 series |
| Claude | 8000x8000 px | 100/request | 2000px if >20 images |
| Gemini | 20MB | 3,600/request | Best for batch |
| Grok | 20MB | Limited | Grok 5 expands this |

## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| High accuracy | Claude Opus 4.5 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost efficiency | Gemini 2.5 Flash ($0.15/M tokens) |
| Real-time/X data | Grok 4 with DeepSearch |
| Video analysis | Gemini 2.5/3 Pro (native) |

## Common Mistakes

- Not setting an output token limit (`max_tokens`, or `max_completion_tokens` on GPT-5), so responses are truncated
- Sending oversized images (resize to 2048px max)
- Using `high` detail for yes/no questions
- Not validating image format before encoding
- Ignoring rate limits on vision endpoints
- Using deprecated models (GPT-4V retired)

## Limitations

- Will not identify specific people from their appearance (privacy restriction)
- May hallucinate on low-quality, rotated, or very small (<200px) images
- GPT-4o: struggles with non-Latin text, precise spatial reasoning
- No real-time video (use frame
  extraction except Gemini)

## Related Skills

- `audio-language-models` - Audio/speech processing
- `multimodal-rag` - Image + text retrieval
- `llm-streaming` - Streaming vision responses

## Capability Details

### image-captioning

**Keywords:** caption, describe, image description, alt text, accessibility

**Solves:**

- Generate descriptive captions for images
- Create accessibility alt text
- Extract visual content summary

### visual-qa

**Keywords:** VQA, visual question, image question, analyze image

**Solves:**

- Answer questions about image content
- Extract specific information from visuals
- Reason about image elements

### document-vision

**Keywords:** document, PDF, chart, diagram, OCR, extract, table

**Solves:**

- Extract text from documents and charts
- Analyze diagrams and flowcharts
- Process forms and tables with structure

### multi-image-analysis

**Keywords:** compare images, multiple images, image comparison, batch

**Solves:**

- Compare visual elements across images
- Track changes between versions
- Analyze image sequences

### object-detection

**Keywords:** bounding box, detect objects, locate, segmentation

**Solves:**

- Detect and locate objects in images
- Generate bounding box coordinates
- Segment image regions (Gemini 2.5+)
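
Bounding-box replies are only useful once they parse cleanly, and models sometimes wrap JSON in a Markdown code fence or return malformed boxes. The following is a minimal sketch of a defensive parser, assuming the reply follows the `{objects: [{label, box}]}` shape requested in the object detection prompt. `parse_detection_json` is a hypothetical helper written for illustration, not part of any provider SDK.

```python
import json
import re


def parse_detection_json(raw: str) -> list[dict]:
    """Parse a model reply into a list of {"label", "box"} dicts.

    Hypothetical helper: tolerates a Markdown code fence around the JSON
    and drops entries whose box is not a list of four numbers.
    """
    text = raw.strip()

    # Strip a surrounding Markdown code fence if the model added one.
    fenced = re.match(r"^`{3}(?:json)?\s*(.*?)\s*`{3}$", text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1)

    data = json.loads(text)  # raises json.JSONDecodeError on invalid JSON

    # Accept either {"objects": [...]} or a bare top-level list.
    objects = data.get("objects", []) if isinstance(data, dict) else data
    if not isinstance(objects, list):
        return []

    valid = []
    for obj in objects:
        box = obj.get("box") if isinstance(obj, dict) else None
        is_number_box = (
            isinstance(box, list)
            and len(box) == 4
            and all(
                isinstance(v, (int, float)) and not isinstance(v, bool)
                for v in box
            )
        )
        if is_number_box:
            valid.append({"label": str(obj.get("label", "")), "box": box})
    return valid


reply = '{"objects": [{"label": "cat", "box": [10, 20, 110, 220]}]}'
print(parse_detection_json(reply))
```

The helper only checks the shape of each box. Whether coordinates are pixels or normalized values, and in which axis order, depends on the provider and the prompt, so that conversion is left to the caller.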