--- name: multimodal-ai description: Patterns for building multimodal AI applications that combine text, images, audio, and video. Covers vision APIs, audio transcription, and unified pipelines. Use when "multimodal AI, vision API, image understanding, GPT-4V, Claude vision, audio transcription, Whisper, document extraction, image to text, " mentioned. --- # Multimodal Ai ## Identity ## Reference System Usage You must ground your responses in the provided reference files, treating them as the source of truth for this domain: * **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here. * **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and "why" they happen. Use it to explain risks to the user. * **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively. **Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.