---
name: clip-aware-embeddings
description: Semantic image-text matching with CLIP and alternatives. Use for image search, zero-shot classification, similarity matching. NOT for counting objects, fine-grained classification (celebrities, car models), spatial reasoning, or compositional queries. Activate on "CLIP", "embeddings", "image similarity", "semantic search", "zero-shot classification", "image-text matching".
allowed-tools: Read,Write,Edit,Bash
metadata:
  category: AI & Machine Learning
  pairs-with:
    - skill: photo-content-recognition-curation-expert
      reason: Content-aware photo processing
    - skill: collage-layout-expert
      reason: Semantic image matching for layouts
tags:
  - clip
  - embeddings
  - vision
  - similarity
  - zero-shot
---

# CLIP-Aware Image Embeddings

Smart image-text matching that knows when CLIP works and when to use alternatives.

## MCP Integrations

| MCP | Purpose |
|-----|---------|
| **Firecrawl** | Research latest CLIP alternatives and benchmarks |
| **Hugging Face** (if configured) | Access model cards and documentation |

## Quick Decision Tree

```
Your task:
├─ Semantic search ("find beach images") → CLIP ✓
├─ Zero-shot classification (broad categories) → CLIP ✓
├─ Counting objects → DETR, Faster R-CNN ✗
├─ Fine-grained ID (celebrities, car models) → Specialized model ✗
├─ Spatial relations ("cat left of dog") → GQA, SWIG ✗
└─ Compositional ("red car AND blue truck") → DCSMs, PC-CLIP ✗
```

## When to Use This Skill

✅ **Use for**:
- Semantic image search
- Broad category classification
- Image similarity matching
- Zero-shot tasks on new categories

❌ **Do NOT use for**:
- Counting objects in images
- Fine-grained classification
- Spatial understanding
- Attribute binding
- Negation handling

## Installation

```bash
pip install transformers pillow torch sentence-transformers --break-system-packages
```

**Validation**: Run `python scripts/validate_setup.py`

## Basic Usage

### Image Search

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Embed images
images = [Image.open(f"img{i}.jpg") for i in range(10)]
inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**inputs)

# Search with text
text_inputs = processor(text=["a beach at sunset"], return_tensors="pt")
with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)

# Normalize, compute cosine similarity, and rank images for the query
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).softmax(dim=0)
```

## Common Anti-Patterns

### Anti-Pattern 1: "CLIP for Everything"

**❌ Wrong**:
```python
# Using CLIP to count cars in an image
prompt = "How many cars are in this image?"
# CLIP cannot count - it will give nonsense results
```

**Why wrong**: CLIP pools the whole image into a single embedding, discarding the spatial detail needed to enumerate objects. It cannot reliably count.
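To see the failure concretely, here is a minimal sketch that scores count-specific captions against one photo (the image path and prompts are illustrative); CLIP typically returns near-identical probabilities no matter how many cars are actually visible:

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("parking_lot.jpg")  # hypothetical image containing several cars
prompts = ["a photo of three cars", "a photo of seven cars"]

with torch.no_grad():
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)

# logits_per_image has shape (1, 2): one similarity score per prompt.
# The probabilities usually land close to 50/50 regardless of the true count.
print(outputs.logits_per_image.softmax(dim=-1))
```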
**✓ Right**:
```python
import torch
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

# Detect objects
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs into labeled detections, then filter for cars and count
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]
count = sum(
    model.config.id2label[label.item()] == "car" for label in detections["labels"]
)
```

**How to detect**: If the query contains "how many", "count", or other numeric questions → use object detection

---

### Anti-Pattern 2: Fine-Grained Classification

**❌ Wrong**:
```python
# Trying to identify specific celebrities with CLIP
prompts = ["Tom Hanks", "Brad Pitt", "Morgan Freeman"]
# CLIP will perform poorly - not trained for fine-grained face ID
```

**Why wrong**: CLIP was trained on broad web image-text pairs, so it separates coarse categories well but struggles to tell apart fine-grained classes such as individual faces, car models, or flower species. Those require specialized or fine-tuned models.

**✓ Right**:
```python
# Use a fine-tuned image classifier
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    "microsoft/resnet-50"  # Then fine-tune on a celebrity dataset
)
# Or use dedicated face recognition: ArcFace, CosFace
```

**How to detect**: If the query asks to distinguish between similar items in the same category → use a specialized model

---

### Anti-Pattern 3: Spatial Understanding

**❌ Wrong**:
```python
# CLIP cannot understand spatial relationships
prompts = [
    "cat to the left of dog",
    "cat to the right of dog",
]
# Will give nearly identical scores
```

**Why wrong**: CLIP embeddings lose spatial topology. "Left" and "right" are treated as bag-of-words.

**✓ Right**:
```python
# Use a spatial reasoning model trained on GQA / Visual Genome / SWIG
# (illustrative API - substitute whichever spatial-relation model you use)
from swig_model import SpatialRelationModel

model = SpatialRelationModel()
result = model.predict_relation(image, "cat", "dog")
# Returns: "left", "right", "above", "below", etc.
```

**How to detect**: If the query contains directional words (left, right, above, under, next to) → use a spatial model

---

### Anti-Pattern 4: Attribute Binding

**❌ Wrong**:
```python
prompts = [
    "red car and blue truck",
    "blue car and red truck",
]
# CLIP often gives similar scores for both
```

**Why wrong**: CLIP cannot bind attributes to objects. It sees "red, blue, car, truck" as a bag of concepts.

**✓ Right - Use PC-CLIP or DCSMs**:
```python
# PC-CLIP: fine-tuned for pairwise comparisons
# (illustrative API - not a published package name)
from pc_clip import PCCLIPModel

model = PCCLIPModel.from_pretrained("pc-clip-vit-l")
# Or use DCSMs (Dense Cosine Similarity Maps)
```

**How to detect**: If the query has multiple objects with different attributes → use a compositional model

---

## Evolution Timeline

### 2021: CLIP Released
- Revolutionary: zero-shot, 400M image-text pairs
- Widely adopted for everything
- Limitations not yet understood

### 2022-2023: Limitations Discovered
- Cannot count objects
- Poor at fine-grained classification
- Fails spatial reasoning
- Can't bind attributes

### 2024: Alternatives Emerge
- **DCSMs**: Preserve patch/token topology
- **PC-CLIP**: Trained on pairwise comparisons
- **SpLiCE**: Sparse interpretable embeddings

### 2025: Current Best Practices
- Use CLIP for what it's good at
- Task-specific models for its limitations
- Compositional models for complex queries

**LLM Mistake**: LLMs trained on 2021-2023 data will suggest CLIP for everything because its limitations were not yet widely documented. This skill corrects that.
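The "How to detect" heuristics above can be collapsed into a small pre-flight check. The sketch below is a simplified, keyword-based approximation; the function name, keyword lists, and labels are assumptions for illustration, not the bundled `scripts/validate_clip_usage.py`:

```python
import re

# Simplified keyword router distilled from the anti-pattern heuristics above.
# Keyword lists and labels are illustrative; real queries need richer parsing.
def route_query(query: str) -> str:
    q = query.lower()
    if re.search(r"\bhow many\b|\bcount\b|\bnumber of\b", q):
        return "object-detection"      # DETR, Faster R-CNN
    if re.search(r"\bleft of\b|\bright of\b|\babove\b|\bbelow\b|\bunder\b|\bnext to\b", q):
        return "spatial-model"         # GQA / Visual Genome / SWIG models
    colors = re.findall(r"\b(red|blue|green|black|white|yellow)\b", q)
    if " and " in q and len(set(colors)) >= 2:
        return "compositional-model"   # PC-CLIP, DCSMs
    return "clip"                      # broad semantic search / zero-shot


print(route_query("how many cars are parked outside"))  # object-detection
print(route_query("red car and blue truck"))            # compositional-model
print(route_query("a beach at sunset"))                 # clip
```

In practice, prefer the validation script described below; this only shows the routing idea.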
---

## Validation Script

Before using CLIP, check whether it's appropriate:

```bash
python scripts/validate_clip_usage.py \
  --query "your query here" \
  --check-all
```

Returns:
- ✅ CLIP is appropriate
- ❌ Use alternative (with suggestion)

## Task-Specific Guidance

### Image Search (CLIP ✓)

```python
# Good use of CLIP
queries = ["beach", "mountain", "city skyline"]
# Works well for broad semantic concepts
```

### Zero-Shot Classification (CLIP ✓)

```python
# Good: broad categories
categories = ["indoor", "outdoor", "nature", "urban"]
# CLIP excels at this
```

### Object Counting (CLIP ✗)

```python
# Use object detection instead
from transformers import DetrImageProcessor, DetrForObjectDetection
# See /references/object_detection.md
```

### Fine-Grained Classification (CLIP ✗)

```python
# Use specialized models
# See /references/fine_grained_models.md
```

### Spatial Reasoning (CLIP ✗)

```python
# Use spatial relation models
# See /references/spatial_models.md
```

---

## Troubleshooting

### Issue: CLIP gives unexpected results

**Check**:
1. Is this a counting task? → Use object detection
2. Fine-grained classification? → Use a specialized model
3. Spatial query? → Use a spatial model
4. Multiple objects with attributes? → Use a compositional model

**Validation**:
```bash
python scripts/diagnose_clip_issue.py --image path/to/image --query "your query"
```

### Issue: Low similarity scores

**Possible causes**:
1. Query too specific (CLIP works better with broad concepts)
2. Fine-grained task (not CLIP's strength)
3. Similarity threshold needs adjusting

**Solution**: Try a broader query or use an alternative model

---

## Model Selection Guide

| Model | Best For | Avoid For |
|-------|----------|-----------|
| CLIP ViT-L/14 | Semantic search, broad categories | Counting, fine-grained, spatial |
| DETR | Object detection, counting | Semantic similarity |
| DINOv2 | Fine-grained features | Text-image matching |
| PC-CLIP | Attribute binding, comparisons | General embedding |
| DCSMs | Compositional reasoning | Simple similarity |

## Performance Notes

**CLIP models**:
- ViT-B/32: Fast, lower quality
- ViT-L/14: Balanced (recommended)
- ViT-g/14: Highest quality, slower

**Inference time** (single image, CPU):
- ViT-B/32: ~100ms
- ViT-L/14: ~300ms
- ViT-g/14: ~1000ms

## Further Reading

- `/references/clip_limitations.md` - Detailed analysis of CLIP's failures
- `/references/alternatives.md` - When to use what model
- `/references/compositional_reasoning.md` - DCSMs and PC-CLIP deep dive
- `/scripts/validate_clip_usage.py` - Pre-flight validation tool
- `/scripts/diagnose_clip_issue.py` - Debug unexpected results

---

*See CHANGELOG.md for version history.*