---
name: multimodal-models
description: Use when "CLIP", "Whisper", "Stable Diffusion", "SDXL", "speech-to-text", "text-to-image", "image generation", "transcription", "zero-shot classification", "image-text similarity", "inpainting", "ControlNet"
version: 1.0.0
---

# Multimodal Models

Pre-trained models for vision, audio, and cross-modal tasks.

---

## Model Overview

| Model | Modality | Task |
|-------|----------|------|
| **CLIP** | Image + Text | Zero-shot classification, similarity |
| **Whisper** | Audio → Text | Transcription, translation |
| **Stable Diffusion** | Text → Image | Image generation, editing |

---

## CLIP (Vision-Language)

Zero-shot image classification without training on task-specific labels.

### CLIP Use Cases

| Task | How |
|------|-----|
| Zero-shot classification | Compare image to text label embeddings |
| Image search | Find images matching a text query |
| Content moderation | Classify against safety categories |
| Image similarity | Compare image embeddings |

### CLIP Models

| Model | Parameters | Trade-off |
|-------|------------|-----------|
| ViT-B/32 | 151M | Recommended balance |
| ViT-L/14 | 428M | Best quality, slower |
| RN50 | 102M | Fastest, lower quality |

### CLIP Concepts

| Concept | Description |
|---------|-------------|
| **Dual encoder** | Separate encoders for image and text |
| **Contrastive learning** | Trained to match image-text pairs |
| **Normalization** | Always normalize embeddings before computing similarity |
| **Descriptive labels** | Better labels = better zero-shot accuracy |

**Key concept**: CLIP embeds images and text in the same space, so classification reduces to finding the nearest text embedding. See the CLIP sketch under Examples below.

### CLIP Limitations

- Not suited to fine-grained classification
- No spatial understanding (whole image only)
- May reflect training data biases

---

## Whisper (Speech Recognition)

Robust multilingual transcription supporting 99 languages.

### Whisper Use Cases

| Task | Configuration |
|------|---------------|
| Transcription | Default `transcribe` task |
| Translation to English | `task="translate"` |
| Subtitles | Output format SRT/VTT |
| Word timestamps | `word_timestamps=True` |

### Whisper Models

| Model | Size | Speed | Recommendation |
|-------|------|-------|----------------|
| turbo | 809M | Fast | **Recommended** |
| large | 1550M | Slow | Maximum quality |
| small | 244M | Medium | Good balance |
| base | 74M | Fast | Quick tests |
| tiny | 39M | Fastest | Prototyping only |

### Whisper Concepts

| Concept | Description |
|---------|-------------|
| **Language detection** | Auto-detects, or specify the language for speed |
| **Initial prompt** | Improves accuracy on technical terms |
| **Timestamps** | Segment-level or word-level |
| **faster-whisper** | Up to 4× faster alternative implementation |

**Key concept**: Specify the language when known; auto-detection adds latency. See the Whisper sketch under Examples below.

### Whisper Limitations

- May hallucinate on silence or noise
- No speaker diarization (who said what)
- Accuracy degrades on audio longer than ~30 min
- Not suitable for real-time captioning

---

## Stable Diffusion (Image Generation)

Text-to-image generation with various control methods.

### SD Use Cases

| Task | Pipeline |
|------|----------|
| Text-to-image | `DiffusionPipeline` |
| Style transfer | `Image2Image` |
| Fill regions | `Inpainting` |
| Guided generation | `ControlNet` |
| Custom styles | LoRA adapters |

### SD Models

| Model | Resolution | Quality |
|-------|------------|---------|
| SDXL | 1024×1024 | Best |
| SD 1.5 | 512×512 | Good, faster |
| SD 2.1 | 768×768 | Middle ground |

### Key Parameters

| Parameter | Effect | Typical Value |
|-----------|--------|---------------|
| **num_inference_steps** | Quality vs. speed | 20-50 |
| **guidance_scale** | Prompt adherence | 7-12 |
| **negative_prompt** | Avoid artifacts | "blurry, low quality" |
| **strength** (img2img) | How much to change the source | 0.5-0.8 |
| **seed** | Reproducibility | Fixed number |

### Control Methods

| Method | Input | Use Case |
|--------|-------|----------|
| **ControlNet** | Edge/depth/pose maps | Structural guidance |
| **LoRA** | Trained weights | Custom styles |
| **Img2Img** | Source image | Style transfer |
| **Inpainting** | Image + mask | Fill regions |

### Memory Optimization

| Technique | Effect |
|-----------|--------|
| CPU offload | Reduces VRAM usage |
| Attention slicing | Trades speed for memory |
| VAE tiling | Supports large images |
| xFormers | Faster attention |
| DPM scheduler | Fewer steps needed |

**Key concept**: Use SDXL for quality, SD 1.5 for speed. Always use negative prompts.

### SD Limitations

- GPU strongly recommended (CPU is very slow)
- Large VRAM requirements for SDXL
- May generate anatomical errors
- Prompt engineering matters

---
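## Examples

The sketches below are minimal starting points for the three model families above. Model IDs, file paths, and prompts are placeholders, and each snippet assumes the relevant package (`transformers`, `openai-whisper`, `diffusers`) is installed.

### CLIP Example

A zero-shot classification sketch using the Hugging Face `transformers` port of CLIP ViT-B/32. The image path and label set are assumptions; the descriptive "a photo of a ..." phrasing typically improves zero-shot accuracy, and `logits_per_image` already applies CLIP's embedding normalization and temperature scaling, so no manual normalization is needed here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("example.jpg")  # placeholder path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over image-text similarity scores -> label probabilities
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```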
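### Whisper Example

A transcription sketch using the `openai-whisper` package with the recommended `turbo` model; the audio path, language, and prompt text are assumptions. Passing `language` skips auto-detection, and `initial_prompt` primes the decoder with domain vocabulary.

```python
import whisper

model = whisper.load_model("turbo")
result = model.transcribe(
    "meeting.mp3",                         # placeholder path
    language="en",                         # skip auto-detection when known
    initial_prompt="CUDA, PyTorch, SDXL",  # bias toward technical terms
    word_timestamps=True,
)

print(result["text"])
for seg in result["segments"]:
    print(f"[{seg['start']:7.2f}s -> {seg['end']:7.2f}s] {seg['text']}")
```

For translation to English instead of transcription, pass `task="translate"` to `transcribe`.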
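### SD Text-to-Image Example

A text-to-image sketch with `diffusers` using the public `stabilityai/stable-diffusion-xl-base-1.0` checkpoint; fp16 on CUDA is an assumption for a typical GPU setup. A fixed `Generator` seed makes the output reproducible.

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

generator = torch.Generator("cuda").manual_seed(42)  # reproducibility
image = pipe(
    prompt="a watercolor painting of a lighthouse at dusk",
    negative_prompt="blurry, low quality",
    num_inference_steps=30,  # quality vs. speed
    guidance_scale=7.5,      # prompt adherence
    generator=generator,
).images[0]
image.save("lighthouse.png")
```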
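### SD Memory Optimization Example

A sketch of the memory techniques from the table above, applied to the pipeline from the previous example; which ones you need depends on your GPU, so treat them as independent options rather than a recipe.

```python
from diffusers import DPMSolverMultistepScheduler

pipe.enable_model_cpu_offload()  # keep only the active submodule on the GPU
pipe.enable_attention_slicing()  # trade speed for lower peak VRAM
pipe.enable_vae_tiling()         # decode large images in tiles

# Swap in a DPM-Solver++ scheduler so fewer steps (~20-25) suffice
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
```

Note that `enable_model_cpu_offload()` manages device placement itself, so skip the explicit `pipe.to("cuda")` when using it.

---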
## Common Patterns

### Embedding and Similarity

All three models use embeddings:

- CLIP: image/text embeddings for similarity
- Whisper: audio embeddings for transcription
- SD: text embeddings for image conditioning

### GPU Acceleration

| Model | VRAM Needed |
|-------|-------------|
| CLIP ViT-B/32 | ~2 GB |
| Whisper turbo | ~6 GB |
| SD 1.5 | ~6 GB |
| SDXL | ~10 GB |

### Best Practices

| Practice | Why |
|----------|-----|
| Use recommended model sizes | Best quality/speed balance |
| Cache embeddings (CLIP) | Expensive to recompute |
| Specify language (Whisper) | Faster than auto-detect |
| Use negative prompts (SD) | Avoid common artifacts |
| Set seeds for reproducibility | Consistent results |

---

## Resources

- CLIP: https://github.com/openai/CLIP
- Whisper: https://github.com/openai/whisper
- Diffusers: https://huggingface.co/docs/diffusers