---
name: media-generation
description: "Generate images, videos, and audio using Google's Gemini APIs. Use for image generation/editing (Gemini 3 Pro Image), video generation (Veo 3), and speech (TBD). Trigger words - images: generate, create, draw, design, make, edit, modify image/picture. Video: generate video, create video, animate, make a video. Supports text-to-image, image-to-image editing, text-to-video, and image-to-video."
---

# Media Generation

## Image Generation

```bash
uv run ~/.claude/skills/media-generation/scripts/generate_image.py \
  --prompt "description or editing instructions" \
  --filename "output.png" \
  [--input-image "source.png"] \
  [--resolution 1K|2K|4K]
```

### Resolution

- `1K` (default): also for "low res", "1080p"
- `2K`: also for "medium", "2048"
- `4K`: also for "high res", "hi-res", "ultra"

## Video Generation

```bash
uv run ~/.claude/skills/media-generation/scripts/generate_video.py \
  --prompt "video description" \
  --filename "output.mp4" \
  [--model veo-3.0-generate-001] \
  [--negative "things to avoid"] \
  [--input-image "first-frame.png"]
```

### Models

- `veo-3.0-generate-001` (default): stable, video only
- `veo-3.0-fast-generate-001`: faster, lower cost
- `veo-3.1-generate-preview`: supports video extend, audio sync
- `veo-3.1-fast-generate-preview`: fast with extend support

### Prompting Tips

- Specify camera movements: `"slow zoom in", "pan left", "close-up"`
- Add `"no talking, no dialogue"` if the character shouldn't speak
- Describe atmosphere: `"rain outside", "purple mystical energy"`

**Note:** Veo requires a paid tier. ~$0.40/sec standard, ~$0.15/sec fast.

## Music Video from Image + Audio

### Overview

1. Start with a character image + audio track (e.g., from Suno)
2. Transcribe the audio to get timestamps
3. Generate clip 1 from the image (veo-3.1)
4. Extend each subsequent clip from the previous one (maintains continuity)
5. Stitch clips + overlay audio with ffmpeg

### Step 1: Transcribe audio for timing

```bash
whisper-ctranslate2 "song.mp3" --model large-v3 --output_dir /tmp --output_format srt
```

### Step 2: Generate first clip from image

```python
import time
from google import genai
from google.genai import types
client = genai.Client()  # picks up GEMINI_API_KEY from the environment
# Use veo-3.1 (required for extend feature); img_data holds the image's JPEG bytes
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    image=types.Image(image_bytes=img_data, mime_type="image/jpeg"),
    prompt="character description, scene action, no talking",
)
while not operation.done:  # generation is asynchronous; poll until it finishes
    time.sleep(10)
    operation = client.operations.get(operation)
video1 = operation.response.generated_videos[0]
```

### Step 3: Extend from previous clip

```python
# Reuses client and time from Step 2; previous_video is the last clip (e.g. video1)
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    video=previous_video.video,  # Pass previous video object
    prompt="next scene description, continuous action, no talking",
)
while not operation.done:  # poll until the extended clip is ready
    time.sleep(10)
    operation = client.operations.get(operation)
next_video = operation.response.generated_videos[0]
```

### Step 4: Stitch clips + add audio

Save each generated clip to disk first (`clip_01.mp4`, `clip_02.mp4`, ...).

```bash
# Create concat list (one line per clip, in filename order)
for f in clip_*.mp4; do printf "file '%s'\n" "$f"; done > concat.txt
# Stitch video clips
ffmpeg -f concat -safe 0 -i concat.txt -c copy combined.mp4
# Add audio track
ffmpeg -i combined.mp4 -i song.mp3 -c:v copy -c:a aac -map 0:v -map 1:a final.mp4
```

### Cost estimate

- ~8 sec per clip × $0.40/sec = $3.20/clip
- 4-min song ≈ 30 clips ≈ $96

## Audio Generation

- **Music:** Use Suno (external service)
- **Speech:** Gemini 2.5 TTS (Flash or Pro); script TBD

## API Key

The scripts read the `GEMINI_API_KEY` environment variable; alternatively, pass `--api-key KEY`.
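
As a rough illustration of the lookup described above, a script could resolve the key along these lines. This is a minimal sketch rather than the skill's actual code: the `resolve_api_key` helper is hypothetical, and letting `--api-key` take precedence over `GEMINI_API_KEY` is an assumption.

```python
import argparse
import os


def resolve_api_key(argv=None, env=None):
    """Return the key from --api-key, falling back to GEMINI_API_KEY.

    Hypothetical helper; flag-over-environment precedence is an assumption.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--api-key", default=None)
    # parse_known_args skips the script's other flags (--prompt, --filename, ...)
    args, _ = parser.parse_known_args(argv)
    key = args.api_key or env.get("GEMINI_API_KEY")
    if not key:
        # Fail early with a readable message instead of a later API error
        raise SystemExit("Set GEMINI_API_KEY or pass --api-key KEY")
    return key
```

Called with no arguments, the helper reads `sys.argv` and `os.environ`, so a script could pass its result straight to the client constructor.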