---
name: Agent
description: Use when building real-time voice and video AI agents, integrating with 25+ AI providers, deploying to production with HTTP servers, or adding computer vision and RAG capabilities to conversational applications.
metadata:
  mintlify-proj: agent
  version: "1.0"
---

# Vision Agents Skill

## Product Summary

Vision Agents is an open-source Python framework for building real-time voice and video AI applications. It provides a unified `Agent` class that orchestrates LLMs, speech services (STT/TTS), video processors, and external tools via MCP. The framework ships with Stream's global edge network for low-latency transport but is edge-agnostic.

Install with `uv add vision-agents` and add provider plugins as needed (e.g., `uv add "vision-agents[gemini,deepgram,elevenlabs]"`).

Key files: `main.py` (agent definition), `.env` (API keys), `pyproject.toml` (dependencies). CLI commands: `uv run main.py run` (console mode), `uv run main.py serve` (HTTP server).

See https://visionagents.ai for full documentation.

## When to Use

Reach for Vision Agents when:

- Building voice assistants, customer support bots, or phone agents (Twilio integration)
- Creating video AI applications with real-time frame processing or VLM analysis
- Needing to swap LLM/STT/TTS providers without rewriting agent logic
- Deploying agents to production with HTTP session management and metrics
- Adding computer vision (YOLO, Roboflow) or RAG (Gemini FileSearch, TurboPuffer) to agents
- Handling real-time events (participant joins, transcriptions, tool calls)
- Building agents that need function calling or MCP server integration

Do not use for: static chatbots, batch processing, or applications that don't require real-time audio/video.

## Quick Reference

### Agent Constructor

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, deepgram, elevenlabs, getstream

# Realtime mode (speech-to-speech, no separate STT/TTS)
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You're a helpful voice assistant.",
    llm=gemini.Realtime(),  # or openai.Realtime()
)

# Custom pipeline (STT → LLM → TTS)
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You're a helpful voice assistant.",
    llm=gemini.LLM("gemini-2.5-flash"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
    processors=[yolo_processor],   # Optional video processors
    mcp_servers=[github_server],   # Optional MCP servers
)
```

### Core Methods

| Method | Purpose |
|--------|---------|
| `await agent.join(call)` | Join a call (async context manager) |
| `await agent.simple_response(text)` | Send text to LLM for processing |
| `await agent.finish()` | Wait for call to end |
| `await agent.close()` | Clean up resources |
| `@agent.events.subscribe` | Subscribe to events (async handler) |
| `@llm.register_function()` | Register Python function for LLM calling |
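The two decorator rows above are the main extension points. Here is a minimal sketch combining them, assuming a custom-pipeline `agent` and a separate `llm` variable built as in the constructor example; `get_weather` is a hypothetical tool, and the event payload shape follows the Common Gotchas section below.

```python
# Sketch: function calling + event handling. Assumes `llm = gemini.LLM(...)`
# and `agent = Agent(..., llm=llm, ...)` were created as shown above.
# `get_weather` is a hypothetical helper, not a framework API.

@llm.register_function(description="Get the current weather for a city.")
async def get_weather(city: str) -> str:
    return f"It's sunny in {city}."  # replace with a real weather lookup

@agent.events.subscribe  # handlers must be `async def` (see Common Gotchas)
async def on_event(event):
    if event.participant.user.id == "agent":
        return  # ignore the agent's own events to avoid feedback loops
    await agent.simple_response("Greet the participant who just joined.")
```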
### Running Agents

| Mode | Command | Use Case |
|------|---------|----------|
| Console | `uv run main.py run` | Development, single agent |
| Server | `uv run main.py serve --host 0.0.0.0 --port 8000` | Production, multi-session |

### API Endpoints (Server Mode)

| Method | Endpoint | Purpose |
|--------|----------|---------|
| POST | `/sessions` | Create agent session |
| DELETE | `/sessions/{id}` | Stop session |
| GET | `/sessions/{id}/metrics` | Performance metrics |
| GET | `/health` | Liveness check |
| GET | `/ready` | Readiness check |

A hypothetical client sketch for these endpoints appears at the end of this document.

### Environment Variables

Store API keys in `.env`:

```bash
STREAM_API_KEY=your_key
STREAM_API_SECRET=your_secret
GOOGLE_API_KEY=your_key
OPENAI_API_KEY=your_key
DEEPGRAM_API_KEY=your_key
ELEVENLABS_API_KEY=your_key
```

Load with: `from dotenv import load_dotenv; load_dotenv()`

### Plugin Installation

```bash
# LLMs & Realtime
uv add "vision-agents[gemini,openai,openrouter,anthropic,xai,huggingface]"

# Speech
uv add "vision-agents[deepgram,elevenlabs,cartesia,fish,fast_whisper,kokoro]"

# Vision
uv add "vision-agents[nvidia,ultralytics,roboflow,moondream,heygen]"

# Infrastructure
uv add "vision-agents[getstream,twilio,turbopuffer]"
```

## Decision Guidance

### When to Use Realtime vs Custom Pipeline

| Aspect | Realtime Mode | Custom Pipeline |
|--------|---------------|-----------------|
| **Latency** | Lowest (native speech-to-speech) | Higher (STT → LLM → TTS) |
| **Setup** | Simplest (one LLM) | More config (3+ components) |
| **Control** | Less (model handles STT/TTS) | Full (choose each service) |
| **Providers** | OpenAI, Gemini, Qwen, AWS Nova | Any LLM + STT + TTS combo |
| **Video** | Supported (fps parameter) | Supported (frame buffering) |
| **Best for** | Fast prototypes, demos | Production, cost optimization |

### When to Use Each Video Approach

| Approach | Best For | Example |
|----------|----------|---------|
| **Realtime LLM** | Direct video streaming | `gemini.Realtime(fps=3)` |
| **VLM** | Frame analysis + understanding | `nvidia.VLM(fps=1, frame_buffer_seconds=10)` |
| **Processor** | Detection before LLM | `ultralytics.YOLOPoseProcessor()` |

### When to Use RAG

| Option | Setup | Search | Best For |
|--------|-------|--------|----------|
| **Gemini FileSearch** | Simple | Managed | Quick prototypes |
| **TurboPuffer** | More setup | Hybrid (vector + BM25) | Production, custom needs |

## Workflow

### 1. Set Up Project

```bash
mkdir my-agent && cd my-agent
uv init --python 3.12
uv add vision-agents python-dotenv
uv add "vision-agents[getstream,gemini,deepgram,elevenlabs]"
echo "STREAM_API_KEY=..." > .env
echo "GOOGLE_API_KEY=..." >> .env
```

### 2. Define Agent Factory

Create `main.py`:

```python
from dotenv import load_dotenv
from vision_agents.core import Agent, AgentLauncher, Runner, User
from vision_agents.plugins import gemini, deepgram, elevenlabs, getstream

load_dotenv()

async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Assistant", id="agent"),
        instructions="You're a helpful voice assistant.",
        llm=gemini.LLM("gemini-2.5-flash"),
        stt=deepgram.STT(),
        tts=elevenlabs.TTS(),
    )

async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response("Hello! How can I help?")
        await agent.finish()

if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```

### 3. Test Locally

```bash
uv run main.py run
```

### 4. Add Features

- **Function calling**: Use the `@llm.register_function(description="...")` decorator
- **Events**: Use `@agent.events.subscribe` to listen for transcripts, responses, etc.
- **Video**: Add `processors=[yolo_processor]` to the Agent constructor (see the sketch below)
- **RAG**: Create `gemini.GeminiFilesearchRAG()` and pass it to the LLM's tools (see the sketch below)
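Here is a hedged sketch wiring the video and RAG bullets into the step-2 factory. `ultralytics.YOLOPoseProcessor` comes from the video approach table above; the `tools=[rag]` keyword is an assumption based on "pass to LLM tools", so confirm it against the Gemini plugin docs.

```python
# Extends the step-2 factory with a video processor and RAG.
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, deepgram, elevenlabs, getstream, ultralytics

async def create_agent(**kwargs) -> Agent:
    rag = gemini.GeminiFilesearchRAG()  # managed file search (see RAG table)
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Assistant", id="agent"),
        instructions="You're a helpful voice assistant. Search files for factual questions.",
        # `tools=[rag]` is an assumption; check the Gemini plugin reference
        llm=gemini.LLM("gemini-2.5-flash", tools=[rag]),
        stt=deepgram.STT(),
        tts=elevenlabs.TTS(),
        processors=[ultralytics.YOLOPoseProcessor()],  # pose detection before the LLM
    )
```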
### 5. Deploy

```bash
# Build Docker image
docker buildx build --platform linux/amd64 -t my-agent .

# Deploy to Kubernetes with health checks
kubectl apply -f deployment.yaml

# Start server
uv run main.py serve --host 0.0.0.0 --port 8080
```

## Common Gotchas

- **Missing API keys**: Vision Agents scans `.env` automatically. If a key is missing, the plugin fails silently at runtime. Always test with `uv run main.py run` first.
- **Event handlers must be async**: Non-async handlers raise `RuntimeError`. Use `async def` for all `@agent.events.subscribe` handlers.
- **Agent event loops**: Avoid infinite loops by filtering the agent's own events: `if event.participant.user.id == "agent": return`
- **Realtime mode doesn't need STT/TTS**: Don't pass `stt=` or `tts=` when using `gemini.Realtime()` or `openai.Realtime()`.
- **Video processors need cleanup**: Always implement `stop_processing()` and `close()` to remove frame handlers and prevent memory leaks.
- **Session affinity for stateful agents**: When running multiple server replicas, enable sticky sessions or session affinity; agents are stateful per session.
- **GPU not needed for most agents**: Only use GPU instances when running local models (Roboflow, local VLMs). Most voice agents use cloud APIs.
- **Frame buffer size matters**: For VLMs, `frame_buffer_seconds=10` holds 10 seconds of frames in memory. Tune it against your memory and latency budget.
- **MCP server timeouts**: Remote MCP servers default to a 10s timeout. Increase it with `timeout=30.0` if needed.
- **Turn detection with realtime models**: Realtime models handle turn detection natively. Don't pass a `turn_detection=` parameter.

## Verification Checklist

Before submitting agent code:

- [ ] All required API keys are in `.env` and loaded with `load_dotenv()`
- [ ] Agent runs locally with `uv run main.py run` without errors
- [ ] `create_agent()` and `join_call()` are async functions
- [ ] Event handlers use `async def` (not sync functions)
- [ ] Agent filters its own events: `if event.participant.user.id == "agent": return`
- [ ] Video processors implement `stop_processing()` and `close()`
- [ ] Realtime-mode agents don't include `stt=` or `tts=` parameters
- [ ] Custom-pipeline agents include both `stt=` and `tts=`
- [ ] Registered functions have a `description=` parameter for LLM context
- [ ] Docker image builds for Linux: `docker buildx build --platform linux/amd64`
- [ ] Health checks configured: `/health` and `/ready` endpoints
- [ ] Session limits set if needed: `max_concurrent_sessions`, `max_session_duration_seconds`
- [ ] Metrics endpoint tested: `GET /sessions/{id}/metrics`

## Resources

- **Full documentation**: https://visionagents.ai/llms.txt (comprehensive page-by-page navigation)
- **Core concepts**: [Agent Class](https://visionagents.ai/core/agent-core) — orchestration, lifecycle, event system
- **Running agents**: [Running Agents as a Server](https://visionagents.ai/guides/running) — HTTP endpoints, session management, CORS, authentication
- **Deployment**: [Production Deployment](https://visionagents.ai/guides/deployment) — Docker, Kubernetes, health checks, scaling

---

> For additional documentation and navigation, see: https://visionagents.ai/llms.txt
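## Appendix: Server-Mode Client Sketch

As referenced in the Quick Reference, here is a hypothetical client for the server-mode endpoints. The paths come from the API table; the request and response payload fields (`call_type`, `call_id`, `id`) are assumptions to verify against the Running Agents guide.

```python
# Hypothetical session lifecycle against a local `serve` instance.
# Requires: uv add requests
import requests

BASE = "http://localhost:8000"

# Create a session (payload field names are assumptions)
resp = requests.post(f"{BASE}/sessions", json={"call_type": "default", "call_id": "demo-1"})
resp.raise_for_status()
session_id = resp.json()["id"]  # response field name is an assumption

# Fetch performance metrics for the session
metrics = requests.get(f"{BASE}/sessions/{session_id}/metrics").json()
print(metrics)

# Tear the session down when done
requests.delete(f"{BASE}/sessions/{session_id}")
```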