---
title: Using Gemini Vision with Anam in LiveKit
description: "Add Gemini Vision to a LiveKit voice agent with an Anam avatar so your AI can see and respond to visual input."
tags: [livekit, python, agents]
date: 2026-01-19
authors: [stukennedy, ao-anam]
---

LiveKit agents can do more than just talk—they can see. By combining Gemini's vision capabilities with Anam avatars, you can build assistants that watch the user's screen and take action on what they see.

In this cookbook, we'll build an HR onboarding assistant. The user shares their screen showing an employee form, and the assistant guides them through filling it out. When the user provides information verbally, the assistant uses function tools to fill in the form fields automatically.

The complete code is at [anam-org/anam-livekit-demo](https://github.com/anam-org/anam-livekit-demo).

## What you'll build

An onboarding assistant that:

- Displays an Anam avatar as the visual interface
- Uses Gemini Live for voice conversation and screen understanding
- Watches the user's screen share to see form fields
- Fills out forms automatically using function tools
- Runs as a LiveKit agent that joins rooms on demand

## How the pieces fit together

This demo combines three services:

- **Gemini Live** handles the conversation by listening to the user's voice, processing their screen share, and deciding what to say or do
- **Anam** generates the avatar video, synchronized to the agent's speech
- **LiveKit** ties it all together, routing audio and video between the user, the agent, and the avatar

When the user speaks, their audio goes to Gemini. Gemini can also see frames from the user's screen share. Based on what it hears and sees, Gemini responds with text, which Anam turns into avatar video. Gemini can also call function tools to interact with the page.

## Prerequisites

- Python 3.9+
- A [LiveKit Cloud](https://cloud.livekit.io) account
- A [Gemini API key](https://aistudio.google.com/apikey)
- An Anam API key from [lab.anam.ai](https://lab.anam.ai)

## Project setup

Clone the demo repository:

```bash
git clone https://github.com/anam-org/anam-livekit-demo.git
cd anam-livekit-demo/agent
```

Create a virtual environment and install dependencies:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

The key dependencies are:

- `livekit-agents` - The LiveKit agent framework
- `livekit-plugins-google` - Gemini Live integration
- `livekit-plugins-anam` - Anam avatar integration

Create a `.env` file with your credentials:

```bash
# LiveKit Cloud credentials
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your_api_key
LIVEKIT_API_SECRET=your_api_secret

# Anam
ANAM_API_KEY=your_anam_key
ANAM_AVATAR_ID=your_avatar_id

# Google Gemini
GEMINI_API_KEY=your_gemini_key
```

You can find avatar IDs at [lab.anam.ai/avatars](https://lab.anam.ai/avatars).

## Understanding the agent code

Let's walk through `agent.py`.
We'll start with the imports and setup:

```python
import asyncio
import json
import logging
import os
from pathlib import Path
from typing import Optional

from dotenv import load_dotenv

load_dotenv(Path(__file__).parent / ".env")

from livekit import rtc
from livekit.agents import (
    Agent,
    AgentSession,
    AutoSubscribe,
    JobContext,
    WorkerOptions,
    cli,
    function_tool,
)
from livekit.agents.voice import VoiceActivityVideoSampler, room_io
from livekit.plugins import anam, google
```

We import the LiveKit agent framework, the Anam and Google plugins, and the `function_tool` decorator for creating tools the agent can call.

### Function tools for browser control

The agent needs a way to interact with the frontend. We use LiveKit's data channel to send commands:

```python
_current_room: Optional[rtc.Room] = None


async def send_control_command(command: str, data: dict) -> None:
    """Send a control command to the frontend via data channel."""
    if _current_room is None:
        return

    message = json.dumps({"type": command, **data})
    await _current_room.local_participant.publish_data(
        message.encode("utf-8"),
        reliable=True,
        topic="browser-control",
    )
```

This sends JSON messages to the frontend, which listens on the `browser-control` topic and executes the commands.

Now we define the tools themselves. The `@function_tool` decorator exposes these to Gemini:

```python
@function_tool
async def fill_form_field(field_identifier: str, value: str) -> str:
    """Fill in a form field on the current page.

    Args:
        field_identifier: The field to fill (e.g. "Full Name", "Email Address")
        value: The value to enter into the field

    Returns:
        A confirmation message
    """
    await send_control_command(
        "fill_field", {"field": field_identifier, "value": value}
    )
    return "ok"


@function_tool
async def click_element(element_description: str) -> str:
    """Click a button or link on the page.

    Args:
        element_description: Button/element text (e.g. "Submit", "Next")

    Returns:
        A confirmation message
    """
    await send_control_command("click", {"element": element_description})
    return "ok"
```

The docstrings are important. Gemini uses them to understand when and how to call each tool. When the user says "My name is John Smith", Gemini sees the form on screen, understands it needs to fill the name field, and calls `fill_form_field("Full Name", "John Smith")`.

### The agent entry point

The `entrypoint` function runs when the agent joins a room:

```python
async def entrypoint(ctx: JobContext):
    await ctx.connect(auto_subscribe=AutoSubscribe.SUBSCRIBE_ALL)

    global _current_room
    _current_room = ctx.room
```

We connect to the room and subscribe to all tracks. The `SUBSCRIBE_ALL` option means we'll receive the user's screen share video, which is essential for vision.

### Agent instructions

The instructions tell Gemini how to behave and what tools are available:

```python
instructions = (
    "You are Maya, a friendly HR onboarding assistant. "
    "You can see the user's screen share.\n\n"
    "THE FORM HAS THESE 6 FIELDS (fill ALL before submitting):\n"
    "1. Full Name\n"
    "2. Email Address\n"
    "3. Phone Number\n"
    "4. Department\n"
    "5. Job Title\n"
    "6. Start Date\n\n"
    "Tools:\n"
    "- fill_form_field(field_name, value) - use EXACT field names above\n"
    "- click_element('Submit') - ONLY after ALL 6 fields are filled\n\n"
    "IMPORTANT: You MUST fill ALL 6 fields before clicking Submit."
)
```

Being explicit about field names helps Gemini use the tools correctly. The instructions also prevent premature form submission.
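The same pattern extends to other control types. If your form had a dropdown, for example, you could add a matching tool and list it in the instructions. This is a hypothetical sketch, not part of the demo; the frontend would also need a handler for the new `select_option` command on the `browser-control` topic:

```python
# Hypothetical extension (not in the demo): a tool for dropdown fields.
# It reuses send_control_command, so the frontend needs a matching
# "select_option" handler.
@function_tool
async def select_dropdown_option(field_identifier: str, option: str) -> str:
    """Select an option from a dropdown field on the current page.

    Args:
        field_identifier: The dropdown to change (e.g. "Department")
        option: The visible text of the option to select

    Returns:
        A confirmation message
    """
    await send_control_command(
        "select_option", {"field": field_identifier, "option": option}
    )
    return "ok"
```

Whatever tools you add, also mention them in the instructions so Gemini knows they exist and when to use them.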
### Creating the models

Now we set up Gemini and Anam:

```python
# Create Gemini Live realtime model
llm = google.realtime.RealtimeModel(
    api_key=os.environ.get("GEMINI_API_KEY"),
    voice="Aoede",
    instructions=instructions,
)

# Create Anam Avatar session
avatar = anam.AvatarSession(
    persona_config=anam.PersonaConfig(
        name="Maya",
        avatarId=os.environ.get("ANAM_AVATAR_ID"),
    ),
    api_key=os.environ.get("ANAM_API_KEY"),
)
```

Gemini handles the conversation logic and voice output. The `voice` parameter sets Gemini's TTS voice. Anam takes that audio and generates synchronized avatar video.

### Video sampling for vision

For screen share analysis, we configure how often to send frames to Gemini:

```python
session = AgentSession(
    llm=llm,
    video_sampler=VoiceActivityVideoSampler(
        speaking_fps=0.2,  # 1 frame every 5 seconds while speaking
        silent_fps=0.1,    # 1 frame every 10 seconds while silent
    ),
    tools=[fill_form_field, click_element],
)
```

The `VoiceActivityVideoSampler` is efficient: it samples more frequently during active conversation and less often during silence. This keeps Gemini aware of screen changes without overwhelming it with frames.

### Starting everything

Finally, we start the avatar and agent:

```python
await avatar.start(session, room=ctx.room)

await session.start(
    agent=Agent(instructions=instructions),
    room=ctx.room,
    room_input_options=room_io.RoomInputOptions(video_enabled=True),
)
```

The `video_enabled=True` option tells the agent to accept video input (the screen share). The avatar starts first so it's ready to display when the agent begins speaking.

### Running the agent

The main block starts the agent worker:

```python
if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```

## Running the demo

Start the agent in development mode:

```bash
python agent.py dev
```

The agent connects to LiveKit Cloud and waits for rooms to be created.

For the frontend, go back to the repository root and start the Next.js app:

```bash
cd ..
pnpm install
pnpm dev
```

Open [http://localhost:3000](http://localhost:3000). You'll see a demo onboarding form. Click to connect, then share your screen. The avatar will greet you and guide you through filling out the form.

Try saying things like:

- "My name is John Smith"
- "My email is john@example.com"
- "I'm starting in the Engineering department as a Senior Developer"

The assistant will fill in the fields as you provide information.

## Adapting for your use case

The onboarding form is just one example. The same pattern works for:

- **Technical support** - Watch the user's screen and guide them through troubleshooting
- **Education** - See what the student is working on and provide contextual help
- **Data entry** - Fill out complex forms based on verbal input
- **Accessibility** - Help users who have difficulty using a keyboard

To adapt the demo:

1. Update the instructions to describe your use case and available fields
2. Modify the function tools to match your frontend's expectations
3. Update the frontend to handle the control commands appropriately

## Deploying to production

For production, use the `start` command instead of `dev`:

```bash
python agent.py start
```

The repository includes a Dockerfile for containerized deployments:

```bash
docker build -t onboarding-agent .
docker run --env-file .env onboarding-agent
```

See the [LiveKit deployment docs](https://docs.livekit.io/agents/deployment/) for Kubernetes and cloud platform guides.
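One last production note: by default, a worker like this is dispatched to every new room in your LiveKit project. If you want it to join only rooms that explicitly request it, LiveKit supports explicit dispatch through a named agent. Here's a minimal sketch, assuming the `agent_name` option in `WorkerOptions`; the name `onboarding-agent` is just an example, and you should check the behavior against your `livekit-agents` version:

```python
# Sketch (assumption, not from the demo): register the worker under a name
# so it only joins rooms that explicitly dispatch "onboarding-agent".
if __name__ == "__main__":
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            agent_name="onboarding-agent",  # hypothetical name for this demo
        )
    )
```

With a name set, the worker stops auto-joining new rooms and waits for a dispatch that targets it, which is usually what you want once more than one agent runs in the same project.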