--- name: faion-multimodal-ai description: "Multimodal AI: vision, image/video generation, speech-to-text, text-to-speech, voice synthesis." user-invocable: false allowed-tools: Read, Write, Edit, Glob, Grep, Bash, Task, AskUserQuestion, TodoWrite --- > **Entry point:** `/faion-net` — invoke this skill for automatic routing to the appropriate domain. # Multimodal AI Skill **Communication: User's language. Code: English.** ## Purpose Handles multimodal AI applications. Covers vision, image generation, video generation, speech, and voice synthesis. ## Scope | Area | Coverage | |------|----------| | **Vision** | GPT-4o Vision, Gemini Vision, image understanding | | **Image Generation** | DALL-E 3, Midjourney, Stable Diffusion | | **Video Generation** | Sora, Runway, Pika | | **Speech-to-Text** | Whisper, Deepgram, AssemblyAI | | **Text-to-Speech** | OpenAI TTS, ElevenLabs, Google TTS | | **Voice** | Real-time voice, voice cloning | ## Quick Start | Task | Files | |------|-------| | Vision API | vision-basics.md → vision-applications.md | | Image generation | img-gen-basics.md → img-gen-tools.md | | Video generation | video-gen-basics.md → video-gen-tools.md | | Speech-to-text | speech-to-text-basics.md → speech-to-text-advanced.md | | Text-to-speech | tts-basics.md → tts-implementation.md | | Voice synthesis | voice-basics.md → voice-implementation.md | ## Methodologies (12) **Vision (2):** - vision-basics: Image understanding, OCR, scene analysis - vision-applications: Use cases, production patterns **Image Generation (2):** - img-gen-basics: Prompt engineering, models - img-gen-tools: DALL-E 3, Midjourney, Stable Diffusion **Video Generation (2):** - video-gen-basics: Fundamentals, prompting - video-gen-tools: Sora, Runway, Pika, Luma **Speech-to-Text (2):** - speech-to-text-basics: Whisper API, real-time - speech-to-text-advanced: Diarization, timestamps **Text-to-Speech (2):** - tts-basics: Voice selection, SSML - tts-implementation: Production patterns, streaming **Voice (2):** - voice-basics: Real-time voice, cloning - voice-implementation: Integration patterns ## Code Examples ### GPT-4o Vision ```python from openai import OpenAI client = OpenAI() response = client.chat.completions.create( model="gpt-4o", messages=[{ "role": "user", "content": [ {"type": "text", "text": "What's in this image?"}, {"type": "image_url", "image_url": {"url": "https://..."}} ] }] ) ``` ### DALL-E 3 Image Generation ```python from openai import OpenAI client = OpenAI() response = client.images.generate( model="dall-e-3", prompt="A futuristic city with flying cars", size="1024x1024", quality="hd", n=1 ) image_url = response.data[0].url ``` ### Whisper Speech-to-Text ```python from openai import OpenAI client = OpenAI() audio_file = open("speech.mp3", "rb") transcription = client.audio.transcriptions.create( model="whisper-1", file=audio_file, response_format="verbose_json", timestamp_granularities=["word"] ) print(transcription.text) ``` ### OpenAI TTS ```python from openai import OpenAI from pathlib import Path client = OpenAI() response = client.audio.speech.create( model="tts-1-hd", voice="alloy", input="Hello, this is a test of text to speech." ) response.stream_to_file("speech.mp3") ``` ### Gemini Vision ```python import google.generativeai as genai genai.configure(api_key="...") model = genai.GenerativeModel('gemini-pro-vision') image = PIL.Image.open("image.jpg") response = model.generate_content([ "Describe this image in detail", image ]) print(response.text) ``` ## Model Comparison ### Vision Models | Model | Best For | Max Image Size | |-------|----------|----------------| | GPT-4o | General vision, OCR | 20MB | | Gemini Pro Vision | High-res images | 20MB | | Claude Sonnet 4 | Document analysis | 5MB | ### Image Generation | Model | Best For | Cost | |-------|----------|------| | DALL-E 3 | Photorealistic, text | $$$ | | Midjourney | Artistic, creative | $$ | | Stable Diffusion | Custom, open-source | Free/$ | ### Speech-to-Text | Service | Best For | Languages | |---------|----------|-----------| | Whisper | General, multilingual | 99 | | Deepgram | Real-time, low latency | 30+ | | AssemblyAI | Features, diarization | 10+ | ### Text-to-Speech | Service | Best For | Voices | |---------|----------|--------| | OpenAI TTS | Quality, variety | 6 | | ElevenLabs | Cloning, realism | Custom | | Google TTS | Languages, SSML | 400+ | ## Use Cases | Use Case | Modalities | |----------|------------| | Document analysis | Vision → Text | | Video narration | Video → Speech → TTS | | Voice assistant | Speech → LLM → TTS | | Content generation | Text → Images/Video | | Accessibility | Vision → TTS, Speech → Text | ## Related Skills | Skill | Relationship | |-------|-------------| | faion-llm-integration | Provides vision APIs | | faion-ai-agents | Multimodal agents | --- *Multimodal AI v1.0 | 12 methodologies*