---
title: Python audio passthrough with custom TTS
description: "Stream audio from your own TTS provider through an Anam avatar using Python and the audio passthrough API."
tags: [python, audio-passthrough, tts, avatar, webrtc, beginner]
date: 2026-02-19
authors: [sebvanleuven]
---

Anam's audio passthrough mode lets you drive the avatar with your own TTS audio. Instead of using Anam's orchestration layer (the STT/LLM/TTS pipeline), you send audio directly and the backend generates the matching avatar video. This is useful when you want to use a specific TTS provider, or want full control over the pipeline.

This recipe shows a script-style example: pass text on the command line, and the script connects to Anam, converts the text to speech via ElevenLabs, and streams the audio to the avatar. It receives synchronised audio and video frames from the backend and displays the avatar video in an OpenCV window.

The complete code is at [examples/python-audio-passthrough-tts](https://github.com/anam-org/anam-cookbook/tree/main/examples/python-audio-passthrough-tts).
## What you'll build

A Python script that:

- Accepts `--text` as input and converts it to speech via ElevenLabs
- Connects to Anam with audio passthrough enabled
- Sends the audio to the avatar, waits in listening mode for 5 seconds, and exits when done
- Receives synchronised audio and video frames from the backend
- Displays the avatar video in an OpenCV window and plays the audio through sounddevice

## Prerequisites

- Python 3.10+
- [uv](https://docs.astral.sh/uv/) for project management (or pip)
- An Anam API key from [lab.anam.ai](https://lab.anam.ai)
- An ElevenLabs API key from [elevenlabs.io](https://elevenlabs.io)
- An avatar ID from [lab.anam.ai/avatars](https://lab.anam.ai/avatars) (or use the default Liv avatar: `071b0286-4cce-4808-bee2-e642f1062de3`)

## Project setup

Clone the cookbook and set up the example:

```bash
git clone https://github.com/anam-org/anam-cookbook.git
cd anam-cookbook/examples/python-audio-passthrough-tts
uv sync
cp .env.example .env
```

Edit `.env`:

```bash
ANAM_API_KEY=your_anam_api_key
ANAM_AVATAR_ID=your_avatar_id
ELEVENLABS_API_KEY=your_elevenlabs_api_key
```

Never expose your API key in client-side code. The Python SDK is designed for server-side use. For client-side code, we suggest using the JavaScript SDK instead.

## Persona configuration

With audio passthrough, configure the persona with only `avatar_id` and `enable_audio_passthrough=True`. Do not set `persona_id`—that would enable Anam's built-in LLM and interfere with your custom audio pipeline.

```python
from anam.types import PersonaConfig

persona_config = PersonaConfig(
    avatar_id=avatar_id,
    enable_audio_passthrough=True,
)
```

## Connecting and displaying the avatar

Create the client, connect, and consume video and audio frames. The avatar appears as soon as the session is ready—you'll see it idling in a neutral pose. You can run this minimal setup by itself (e.g.
for 10 seconds) to verify the connection before adding any audio:

```python
from anam import AnamClient, ClientOptions

client = AnamClient(
    api_key=api_key,
    persona_config=persona_config,
    options=ClientOptions(),
)

async with client.connect() as session:

    async def consume_video():
        async for frame in session.video_frames():
            display.update(frame)

    async def consume_audio():
        async for frame in session.audio_frames():
            audio_player.add_frame(frame)

    asyncio.create_task(consume_video())
    asyncio.create_task(consume_audio())

    # Avatar idles—no audio sent yet. Run for 10s to verify, or press 'q' to quit
    await asyncio.sleep(10)
```

The script runs the async session in a background thread and the OpenCV display on the main thread (required for window handling on macOS). Audio frames are played through sounddevice.

`video_frames()` and `audio_frames()` are async iterators—consume them with `async for`. The frames are PyAV `VideoFrame` and `AudioFrame` objects: decoded WebRTC frames from aiortc, containing raw video pixels and PCM audio.

## Waiting for the session to be ready

To limit latency, the backend drops incoming audio until the pipeline is ready and the session is established. To avoid losing audio, register a handler with `@client.on(AnamEvent.SESSION_READY)` to be notified when the `SESSION_READY` event is emitted.

```python
session_ready = asyncio.Event()

@client.on(AnamEvent.SESSION_READY)
async def on_session_ready() -> None:
    session_ready.set()
```

After the WebRTC session is connected, wait until `session_ready` is set before sending TTS audio to the avatar:

```python
try:
    await asyncio.wait_for(session_ready.wait(), timeout=30.0)
except asyncio.TimeoutError:
    print("Session timeout: session did not become ready in time")
    display.stop()
    return
```

## Getting PCM audio

Use the ElevenLabs SDK with `output_format="pcm_24000"` to get 24 kHz mono PCM directly—no conversion needed.
```python
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key=api_key)
response = client.text_to_speech.convert(
    text="Hello, this is a test",
    voice_id="EXAVITQu4vr4xnSDxMaL",
    model_id="eleven_turbo_v2_5",
    output_format="pcm_24000",
)
pcm_bytes = b"".join(response)
```

Anam supports any sample rate between 16000 and 48000 Hz. For best performance, we suggest 24000 Hz, which balances latency and audio quality. Anam does not resample the TTS audio internally, so the quality you provide is the quality that is delivered. Note that for WebRTC delivery, the audio is converted to 48000 Hz stereo and compressed with Opus.

## Sending audio to the avatar

Once you have PCM bytes, create an agent audio stream and send the chunks. Call `end_sequence()` when done so the avatar finishes rendering and returns to a neutral state. The script waits for the playback duration plus 5 seconds before stopping the session:

```python
from anam.types import AgentAudioInputConfig

agent = session.create_agent_audio_input_stream(
    AgentAudioInputConfig(encoding="pcm_s16le", sample_rate=24000, channels=1)
)

chunk_size = 24000  # 500 ms of 16-bit mono audio at 24 kHz
for i in range(0, len(pcm_bytes), chunk_size):
    chunk = pcm_bytes[i : i + chunk_size]
    if chunk:
        await agent.send_audio_chunk(chunk)
        await asyncio.sleep(0.01)

await agent.end_sequence()

# Wait for playback + extra time before stopping
duration_sec = len(pcm_bytes) / (24000 * 2)
await asyncio.sleep(duration_sec + 5.0)
```

`end_sequence()` is essential for the avatar to behave naturally at the end of its turn and return to a neutral listening mode. If `end_sequence()` is not called, the pipeline assumes TTS audio is still incoming and stalls video generation. Comment out `end_sequence()` to see this effect in action.
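The chunking and playback-duration arithmetic above (24000 bytes of 16-bit mono audio at 24 kHz is 500 ms) generalises to any `pcm_s16le` stream. As an illustrative sketch—these helpers are not part of the Anam SDK:

```python
def pcm_s16le_duration(num_bytes: int, sample_rate: int, channels: int = 1) -> float:
    """Playback duration in seconds of 16-bit PCM (2 bytes per sample)."""
    bytes_per_second = sample_rate * channels * 2
    return num_bytes / bytes_per_second


def chunk_count(num_bytes: int, chunk_size: int) -> int:
    """Number of chunks the send loop above produces (ceiling division)."""
    return -(-num_bytes // chunk_size)


# 3 seconds of 24 kHz mono s16le audio: 3 * 24000 * 2 = 144000 bytes
assert pcm_s16le_duration(144000, 24000) == 3.0
# With 24000-byte (500 ms) chunks, that is 6 chunks
assert chunk_count(144000, 24000) == 6
```

The same formula is what the script uses to size its final `asyncio.sleep`: `len(pcm_bytes) / (24000 * 2)` seconds of playback, plus a 5-second listening tail.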
## Running the script

```bash
# Text via ElevenLabs
uv run python main.py --text "Hello, this is a test"

# Custom voice
uv run python main.py --text "Hi" --voice YOUR_VOICE_ID
```

## Terminology

- **Persona** – The full AI character (avatar + voice + LLM + system prompt)
- **Avatar** – Just the visual character

In audio passthrough mode, you provide the voice (TTS) yourself. The TTS audio is not parsed, nor added to LLM context or message history.