---
name: podcast-generation
description: Generate AI-powered podcast-style audio narratives using Azure OpenAI's GPT Realtime Mini model via WebSocket. Use when building text-to-speech features, audio narrative generation, podcast creation from content, or integrating with Azure OpenAI Realtime API for real audio output. Covers full-stack implementation from React frontend to Python FastAPI backend with WebSocket streaming.
---

# Podcast Generation with GPT Realtime Mini

Generate real audio narratives from text content using Azure OpenAI's Realtime API.

## Quick Start

1. Configure environment variables for the Realtime API
2. Connect via WebSocket to the Azure OpenAI Realtime endpoint
3. Send a text prompt; collect PCM audio chunks and transcript deltas
4. Convert the PCM audio to WAV format
5. Return base64-encoded audio to the frontend for playback

## Environment Configuration

```env
AZURE_OPENAI_AUDIO_API_KEY=your_realtime_api_key
AZURE_OPENAI_AUDIO_ENDPOINT=https://your-resource.cognitiveservices.azure.com
AZURE_OPENAI_AUDIO_DEPLOYMENT=gpt-realtime-mini
```

**Note**: The endpoint should NOT include `/openai/v1/` - just the base URL.

## Core Workflow

### Backend Audio Generation

```python
import base64
import os

from openai import AsyncOpenAI

# Credentials and deployment name from the environment variables above
endpoint = os.environ["AZURE_OPENAI_AUDIO_ENDPOINT"]
api_key = os.environ["AZURE_OPENAI_AUDIO_API_KEY"]
deployment = os.environ.get("AZURE_OPENAI_AUDIO_DEPLOYMENT", "gpt-realtime-mini")

# Convert the HTTPS endpoint to a WebSocket URL
ws_url = endpoint.replace("https://", "wss://") + "/openai/v1"

client = AsyncOpenAI(
    websocket_base_url=ws_url,
    api_key=api_key
)

audio_chunks = []
transcript_parts = []

async with client.realtime.connect(model=deployment) as conn:
    # Configure for audio-only output
    await conn.session.update(session={
        "output_modalities": ["audio"],
        "instructions": "You are a narrator. Speak naturally."
    })

    # Send the text to narrate (`prompt` is the caller-supplied text)
    await conn.conversation.item.create(item={
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": prompt}]
    })
    await conn.response.create()

    # Collect streaming events until the response completes
    async for event in conn:
        if event.type == "response.output_audio.delta":
            audio_chunks.append(base64.b64decode(event.delta))
        elif event.type == "response.output_audio_transcript.delta":
            transcript_parts.append(event.delta)
        elif event.type == "response.done":
            break

# Convert PCM to WAV (see scripts/pcm_to_wav.py and the sketch below)
pcm_audio = b''.join(audio_chunks)
wav_audio = pcm_to_wav(pcm_audio, sample_rate=24000)
```

### Frontend Audio Playback

```javascript
// Convert a base64 WAV string to a playable blob
const base64ToBlob = (base64, mimeType) => {
  const bytes = atob(base64);
  const arr = new Uint8Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) arr[i] = bytes.charCodeAt(i);
  return new Blob([arr], { type: mimeType });
};

const audioBlob = base64ToBlob(response.audio_data, 'audio/wav');
const audioUrl = URL.createObjectURL(audioBlob);
new Audio(audioUrl).play();
```

## Voice Options

| Voice | Character |
|---------|------------|
| alloy | Neutral |
| echo | Warm |
| fable | Expressive |
| onyx | Deep |
| nova | Friendly |
| shimmer | Clear |

## Realtime API Events

- `response.output_audio.delta` - Base64 audio chunk
- `response.output_audio_transcript.delta` - Transcript text
- `response.done` - Generation complete
- `error` - Handle with `event.error.message`

## Audio Format

- **Input**: Text prompt
- **Output**: PCM audio (24kHz, 16-bit, mono)
- **Storage**: Base64-encoded WAV
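
The backend snippet ends with a `pcm_to_wav` call. A minimal sketch of that conversion, assuming the 24kHz / 16-bit / mono PCM format listed above and using only the standard-library `wave` module ([scripts/pcm_to_wav.py](scripts/pcm_to_wav.py) is the canonical implementation):

```python
import io
import wave

def pcm_to_wav(pcm_data: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw little-endian 16-bit PCM bytes in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)      # mono
        wav_file.setsampwidth(sample_width)  # 16-bit = 2 bytes per sample
        wav_file.setframerate(sample_rate)   # 24kHz, matching the Realtime API output
        wav_file.writeframes(pcm_data)
    return buffer.getvalue()
```

The Realtime API streams headerless PCM, so wrapping it in a WAV container is all the conversion needed for browser playback.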
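
## Example FastAPI Endpoint

The frontend snippet reads `response.audio_data`, which implies a JSON contract with the backend. A sketch of a FastAPI endpoint honoring that contract; the route path, request model, and `generate_podcast_audio` helper (a wrapper around the Backend Audio Generation flow above) are illustrative, not part of the skill:

```python
import base64

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PodcastRequest(BaseModel):
    prompt: str

@app.post("/api/podcast")
async def create_podcast(request: PodcastRequest):
    # Hypothetical helper wrapping the Realtime API flow shown above;
    # returns WAV bytes plus the joined transcript parts.
    wav_audio, transcript = await generate_podcast_audio(request.prompt)
    return {
        "audio_data": base64.b64encode(wav_audio).decode("ascii"),
        "transcript": transcript,
    }
```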

## References

- **Full architecture**: See [references/architecture.md](references/architecture.md) for complete stack design
- **Code examples**: See [references/code-examples.md](references/code-examples.md) for production patterns
- **PCM conversion**: Use [scripts/pcm_to_wav.py](scripts/pcm_to_wav.py) for audio format conversion