# Agents

LLM agents run as native DimOS modules. They subscribe to camera, LiDAR, odometry, and spatial memory streams, and they control the robot through skills.

## Architecture

```
Human Input ──→ Agent ──→ Skill Calls ──→ Robot
(text/voice)      │          (RPC)
                  │
           subscribes to streams:
      color_image, odom, spatial_memory
```

**Agent** (`dimos/agents/agent.py`) is a `Module` with:

- `human_input: In[str]`: receives text from `humancli`, `WebInput`, or `agent-send`
- `agent: Out[BaseMessage]`: publishes agent responses (text, tool calls, images)
- `agent_idle: Out[bool]`: signals when the agent is waiting for input

The agent uses LangGraph with a configurable LLM. The default is `gpt-4o`, which requires the `OPENAI_API_KEY` environment variable to be set. On startup, the agent discovers all `@skill`-annotated methods across deployed modules via RPC and exposes them as LangChain tools.

## Skills

Skills are methods decorated with `@skill` on any `Module`. The agent discovers them automatically at startup.

```python
from dimos.agents.annotation import skill
from dimos.core.module import Module


class MySkillContainer(Module):
    @skill
    def wave_hello(self) -> str:
        """Wave at the nearest person."""
        # ... robot control logic ...
        return "Waving!"
```

**Rules:**

- Parameters must be JSON-serializable types (`str`, `int`, `float`, `bool`, `list`, `dict`).
- Docstrings become the tool description the LLM sees. Write them clearly so the agent has sufficient context.
- The function must return a string or an image, which the agent uses to decide what to do next.

### Built-in Skills

| Skill | Module | Description |
|-------|--------|-------------|
| `relative_move(forward, left, degrees)` | `UnitreeSkillContainer` | Move robot relative to current position |
| `execute_sport_command(command_name)` | `UnitreeSkillContainer` | Unitree sport commands (sit, stand, flip, etc.) |
| `wait(seconds)` | `UnitreeSkillContainer` | Pause execution |
| `observe()` | `GO2Connection` | Capture and return current camera frame |
| `navigate_with_text(query)` | `NavigationSkillContainer` | Navigate to a location by description |
| `tag_location(name)` | `NavigationSkillContainer` | Tag current position for later recall |
| `stop_navigation()` | `NavigationSkillContainer` | Cancel current navigation goal |
| `follow_person(query)` | `PersonFollowSkill` | Visual servoing to follow a described person |
| `stop_following()` | `PersonFollowSkill` | Stop person following |
| `speak(text)` | `SpeakSkill` | Text-to-speech through robot speakers |
| `where_am_i()` | `GoogleMapsSkillContainer` | Current street/area from GPS |
| `get_gps_position_for_queries(queries)` | `GoogleMapsSkillContainer` | Look up GPS coordinates |
| `set_gps_travel_points(points)` | `GPSNavSkill` | Navigate via GPS waypoints |
| `map_query(query)` | `OsmSkill` | Search OpenStreetMap with VLM |

## MCP

There is also an MCP implementation. It replaces the `Agent` with two modules: `McpServer` and `McpClient`.

* `McpServer` exposes the methods annotated with `@skill` as MCP tools. Any external client can connect to the server to use the MCP tools.
* `McpClient` has a LangGraph LLM which calls MCP tools from `McpServer`.

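
To make the idea of exposing `@skill` methods as tools more concrete, here is a toy sketch of decorator-based discovery. It is not the DimOS implementation: the `skill` decorator, `DemoSkills`, and `discover_tools` below are hypothetical stand-ins written for illustration, and real discovery happens across modules via RPC rather than on a local object.

```python
import inspect
from typing import Callable


def skill(fn: Callable) -> Callable:
    """Toy stand-in for a skill decorator: tags the function for discovery."""
    fn._is_skill = True
    return fn


class DemoSkills:
    """Hypothetical container; a real one would subclass a DimOS Module."""

    @skill
    def wave_hello(self) -> str:
        """Wave at the nearest person."""
        return "Waving!"

    @skill
    def wait(self, seconds: float) -> str:
        """Pause execution for the given number of seconds."""
        return f"Waited {seconds} s"

    def _internal_helper(self) -> None:
        """Not tagged, so it is never exposed as a tool."""


def discover_tools(container: object) -> dict[str, dict]:
    """Collect tagged methods with their docstring and parameter names."""
    tools = {}
    for name, method in inspect.getmembers(container, predicate=inspect.ismethod):
        if getattr(method, "_is_skill", False):
            tools[name] = {
                # The docstring plays the role of the tool description.
                "description": inspect.getdoc(method) or "",
                # Bound methods already exclude `self` from the signature.
                "parameters": list(inspect.signature(method).parameters),
                "call": method,
            }
    return tools


tools = discover_tools(DemoSkills())
for tool_name, info in tools.items():
    print(tool_name, info["parameters"], "-", info["description"])
```

In this sketch only the tagged methods end up in the tool registry, and each entry carries the docstring as its description, which mirrors why clear docstrings matter for the LLM.
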
CLI access:

```bash
dimos mcp list-tools                            # List available skills
dimos mcp call relative_move --arg forward=0.5  # Call a skill
dimos mcp status                                # Server status
```

## Input Methods

| Method | How it works |
|--------|-------------|
| `humancli` | Standalone terminal: type messages, see responses |
| `dimos agent-send "text"` | One-shot CLI command via LCM |
| `WebInput` | Web interface at localhost:7779 with optional Whisper STT |

## Models

| Config | Model | Notes |
|--------|-------|-------|
| Default | `gpt-4o` | Best quality, requires `OPENAI_API_KEY` |
| `ollama:llama3.1` | Local Ollama | Requires `ollama serve` running |
| Custom | Any LangChain-compatible | Set via `AgentConfig(model="...")` |
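
The `ollama:llama3.1` entry suggests a `provider:model` naming convention. The sketch below shows one way such a string could be interpreted and checked before startup. It is an assumption for illustration, not DimOS code: `ModelSpec`, `parse_model_spec`, `missing_env`, and the `REQUIRED_ENV` mapping are hypothetical, and treating an unprefixed name as an OpenAI model is inferred only from the default row of the table.

```python
from typing import Mapping, NamedTuple


class ModelSpec(NamedTuple):
    provider: str
    model: str


# Assumed mapping, based only on the Notes column of the table.
REQUIRED_ENV = {"openai": ["OPENAI_API_KEY"]}


def parse_model_spec(spec: str, default_provider: str = "openai") -> ModelSpec:
    """Split 'provider:model'; fall back to the default provider if no prefix."""
    provider, sep, model = spec.partition(":")
    if not sep:
        return ModelSpec(default_provider, spec)
    return ModelSpec(provider, model)


def missing_env(spec: str, env: Mapping[str, str]) -> list[str]:
    """Return the environment variables the chosen provider still needs."""
    provider = parse_model_spec(spec).provider
    return [name for name in REQUIRED_ENV.get(provider, []) if not env.get(name)]


print(parse_model_spec("gpt-4o"))
print(parse_model_spec("ollama:llama3.1"))
print(missing_env("gpt-4o", env={}))
```

A check like `missing_env` could run before the agent starts, so that a missing API key is reported immediately instead of surfacing on the first LLM call.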