--- name: audio-language-models description: Gemini Live API, Grok Voice Agent, GPT-4o-Transcribe, AssemblyAI patterns for real-time voice, speech-to-text, and TTS. Use when implementing voice agents, audio transcription, or conversational AI. context: fork agent: multimodal-specialist version: 1.1.0 author: OrchestKit user-invocable: false tags: [audio, multimodal, gemini-live, grok-voice, whisper, tts, speech, voice-agent, 2026] --- # Audio Language Models (2026) Build real-time voice agents and audio processing using the latest native speech-to-speech models. ## Overview - Real-time voice assistants and agents - Live conversational AI (phone agents, support bots) - Audio transcription with speaker diarization - Multilingual voice interactions - Text-to-speech generation - Voice-to-voice translation ## Model Comparison (January 2026) ### Real-Time Voice (Speech-to-Speech) | Model | Latency | Languages | Price | Best For | |-------|---------|-----------|-------|----------| | **Grok Voice Agent** | <1s TTFA | 100+ | $0.05/min | Fastest, #1 Big Bench | | **Gemini Live API** | Low | 24 (30 voices) | Usage-based | Emotional awareness | | **OpenAI Realtime** | ~1s | 50+ | $0.10/min | Ecosystem integration | ### Speech-to-Text Only | Model | WER | Latency | Best For | |-------|-----|---------|----------| | **Gemini 2.5 Pro** | ~5% | Medium | 9.5hr audio, diarization | | **GPT-4o-Transcribe** | ~7% | Medium | Accuracy + accents | | **AssemblyAI Universal-2** | 8.4% | 200ms | Best features | | **Deepgram Nova-3** | ~18% | <300ms | Lowest latency | | **Whisper Large V3** | 7.4% | Slow | Self-host, 99+ langs | ## Grok Voice Agent API (xAI) - Fastest ```python import asyncio import websockets import json async def grok_voice_agent(): """Real-time voice agent with Grok - #1 on Big Bench Audio. Features: - <1 second time-to-first-audio (5x faster than competitors) - Native speech-to-speech (no transcription intermediary) - 100+ languages, $0.05/min - OpenAI Realtime API compatible """ uri = "wss://api.x.ai/v1/realtime" headers = {"Authorization": f"Bearer {XAI_API_KEY}"} async with websockets.connect(uri, extra_headers=headers) as ws: # Configure session await ws.send(json.dumps({ "type": "session.update", "session": { "model": "grok-4-voice", "voice": "Aria", # or "Eve", "Leo" "instructions": "You are a helpful voice assistant.", "input_audio_format": "pcm16", "output_audio_format": "pcm16", "turn_detection": {"type": "server_vad"} } })) # Stream audio in/out async def send_audio(audio_stream): async for chunk in audio_stream: await ws.send(json.dumps({ "type": "input_audio_buffer.append", "audio": base64.b64encode(chunk).decode() })) async def receive_audio(): async for message in ws: data = json.loads(message) if data["type"] == "response.audio.delta": yield base64.b64decode(data["delta"]) return send_audio, receive_audio # Expressive voice with auditory cues async def expressive_response(ws, text: str): """Use auditory cues for natural speech.""" # Supports: [whisper], [sigh], [laugh], [pause] await ws.send(json.dumps({ "type": "response.create", "response": { "instructions": "[sigh] Let me think about that... [pause] Here's what I found." } })) ``` ## Gemini Live API (Google) - Emotional Awareness ```python import google.generativeai as genai from google.generativeai import live genai.configure(api_key="YOUR_API_KEY") async def gemini_live_voice(): """Real-time voice with emotional understanding. Features: - 30 HD voices in 24 languages - Affective dialog (understands emotions) - Barge-in support (interrupt anytime) - Proactive audio (responds only when relevant) """ model = genai.GenerativeModel("gemini-2.5-flash-live") config = live.LiveConnectConfig( response_modalities=["AUDIO"], speech_config=live.SpeechConfig( voice_config=live.VoiceConfig( prebuilt_voice_config=live.PrebuiltVoiceConfig( voice_name="Puck" # or Charon, Kore, Fenrir, Aoede ) ) ), system_instruction="You are a friendly voice assistant." ) async with model.connect(config=config) as session: # Send audio async def send_audio(audio_chunk: bytes): await session.send( input=live.LiveClientContent( realtime_input=live.RealtimeInput( media_chunks=[live.MediaChunk( data=audio_chunk, mime_type="audio/pcm" )] ) ) ) # Receive audio responses async for response in session.receive(): if response.data: yield response.data # Audio bytes # With transcription async def gemini_live_with_transcript(): """Get both audio and text transcripts.""" async with model.connect(config=config) as session: async for response in session.receive(): if response.server_content: # Text transcript if response.server_content.model_turn: for part in response.server_content.model_turn.parts: if part.text: print(f"Transcript: {part.text}") if response.data: yield response.data # Audio ``` ## Gemini Audio Transcription (Long-Form) ```python import google.generativeai as genai def transcribe_with_gemini(audio_path: str) -> dict: """Transcribe up to 9.5 hours of audio with speaker diarization. Gemini 2.5 Pro handles long-form audio natively. """ model = genai.GenerativeModel("gemini-2.5-pro") # Upload audio file audio_file = genai.upload_file(audio_path) response = model.generate_content([ audio_file, """Transcribe this audio with: 1. Speaker labels (Speaker 1, Speaker 2, etc.) 2. Timestamps for each segment 3. Punctuation and formatting Format: [00:00:00] Speaker 1: First statement... [00:00:15] Speaker 2: Response...""" ]) return { "transcript": response.text, "audio_duration": audio_file.duration } ``` ## Gemini TTS (Text-to-Speech) ```python def gemini_text_to_speech(text: str, voice: str = "Kore") -> bytes: """Generate speech with Gemini 2.5 TTS. Features: - Enhanced expressivity with style prompts - Precision pacing (context-aware speed) - Multi-speaker dialogue consistency """ model = genai.GenerativeModel("gemini-2.5-flash-tts") response = model.generate_content( contents=text, generation_config=genai.GenerationConfig( response_mime_type="audio/mp3", speech_config=genai.SpeechConfig( voice_config=genai.VoiceConfig( prebuilt_voice_config=genai.PrebuiltVoiceConfig( voice_name=voice # Puck, Charon, Kore, Fenrir, Aoede ) ) ) ) ) return response.audio ``` ## OpenAI GPT-4o-Transcribe ```python from openai import OpenAI client = OpenAI() def transcribe_openai(audio_path: str, language: str = None) -> dict: """Transcribe with GPT-4o-Transcribe (enhanced accuracy).""" with open(audio_path, "rb") as audio_file: response = client.audio.transcriptions.create( model="gpt-4o-transcribe", file=audio_file, language=language, response_format="verbose_json", timestamp_granularities=["word", "segment"] ) return { "text": response.text, "words": response.words, "segments": response.segments, "duration": response.duration } ``` ## AssemblyAI (Best Features) ```python import assemblyai as aai aai.settings.api_key = "YOUR_API_KEY" def transcribe_assemblyai(audio_url: str) -> dict: """Transcribe with speaker diarization, sentiment, entities.""" config = aai.TranscriptionConfig( speaker_labels=True, sentiment_analysis=True, entity_detection=True, auto_highlights=True, language_detection=True ) transcriber = aai.Transcriber() transcript = transcriber.transcribe(audio_url, config=config) return { "text": transcript.text, "speakers": transcript.utterances, "sentiment": transcript.sentiment_analysis, "entities": transcript.entities } ``` ## Real-Time Streaming Comparison ```python async def choose_realtime_provider( requirements: dict ) -> str: """Select best real-time voice provider.""" if requirements.get("fastest_latency"): return "grok" # <1s TTFA, 5x faster if requirements.get("emotional_understanding"): return "gemini" # Affective dialog if requirements.get("openai_ecosystem"): return "openai" # Compatible tools if requirements.get("lowest_cost"): return "grok" # $0.05/min (half of OpenAI) return "gemini" # Best overall for 2026 ``` ## API Pricing (January 2026) | Provider | Type | Price | Notes | |----------|------|-------|-------| | Grok Voice Agent | Real-time | $0.05/min | Cheapest real-time | | Gemini Live | Real-time | Usage-based | 30 HD voices | | OpenAI Realtime | Real-time | $0.10/min | | | Gemini 2.5 Pro | Transcription | $1.25/M tokens | 9.5hr audio | | GPT-4o-Transcribe | Transcription | $0.01/min | | | AssemblyAI | Transcription | ~$0.15/hr | Best features | | Deepgram | Transcription | ~$0.0043/min | | ## Key Decisions | Scenario | Recommendation | |----------|----------------| | Voice assistant | Grok Voice Agent (fastest) | | Emotional AI | Gemini Live API | | Long audio (hours) | Gemini 2.5 Pro (9.5hr) | | Speaker diarization | AssemblyAI or Gemini | | Lowest latency STT | Deepgram Nova-3 | | Self-hosted | Whisper Large V3 | ## Common Mistakes - Using STT+LLM+TTS pipeline instead of native speech-to-speech - Not leveraging emotional understanding (Gemini) - Ignoring barge-in support for natural conversations - Using deprecated Whisper-1 instead of GPT-4o-Transcribe - Not testing latency with real users ## Related Skills - `vision-language-models` - Image/video processing - `multimodal-rag` - Audio + text retrieval - `streaming-api-patterns` - WebSocket patterns ## Capability Details ### real-time-voice **Keywords:** voice agent, real-time, conversational, live audio **Solves:** - Build voice assistants - Phone agents and support bots - Interactive voice response (IVR) ### speech-to-speech **Keywords:** native audio, speech-to-speech, no transcription **Solves:** - Low-latency voice responses - Natural conversation flow - Emotional voice interactions ### transcription **Keywords:** transcribe, speech-to-text, STT, convert audio **Solves:** - Convert audio files to text - Generate meeting transcripts - Process long-form audio ### voice-tts **Keywords:** TTS, text-to-speech, voice synthesis **Solves:** - Generate natural speech - Multi-voice dialogue - Expressive audio output