---
name: cua-cloud
description: Comprehensive guide for building Computer Use Agents with the CUA framework. This skill should be used when automating desktop applications, building vision-based agents, controlling virtual machines (Linux/Windows/macOS), or integrating computer-use models from Anthropic, OpenAI, or other providers. Covers Computer SDK (click, type, scroll, screenshot), Agent SDK (model configuration, composition), supported models, provider setup, and MCP integration.
---

# CUA Framework

## Overview

CUA ("koo-ah") is an open-source framework for building Computer Use Agents: AI systems that see, understand, and interact with desktop applications through vision and action. It supports Windows, Linux, and macOS automation.

**Key capabilities:**

- Vision-based UI automation via screenshot analysis
- Multi-platform desktop control (click, type, scroll, drag)
- 100+ LLM providers via LiteLLM integration
- Composed agents (grounding + planning models)
- Local and cloud execution options

## Installation

```bash
# Computer SDK - desktop control
pip install cua-computer

# Agent SDK - autonomous agents (quoted so shells such as zsh do not expand the brackets)
pip install "cua-agent[all]"

# MCP Server (optional)
pip install cua-mcp-server
```

**CLI Installation:**

```bash
# macOS/Linux
curl -LsSf https://cua.ai/cli/install.sh | sh

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://cua.ai/cli/install.ps1 | iex"
```

## Computer SDK

### Computer Class

```python
from computer import Computer
import os

os.environ["CUA_API_KEY"] = "sk_cua-api01_..."

computer = Computer(
    os_type="linux",        # "linux" | "macos" | "windows"
    provider_type="cloud",  # "cloud" | "docker" | "lume" | "windows_sandbox"
    name="sandbox-name",
)

# The awaits below must run inside an async function
# (or an environment that supports top-level await, such as a notebook).
try:
    await computer.run()
    # Use computer.interface methods here
finally:
    await computer.close()
```

### Interface Methods

**Screenshot:**

```python
screenshot = await computer.interface.screenshot()
```

**Mouse Actions:**

```python
await computer.interface.left_click(x, y)      # Left click at coordinates
await computer.interface.right_click(x, y)     # Right click
await computer.interface.double_click(x, y)    # Double click
await computer.interface.move_cursor(x, y)     # Move cursor without clicking
await computer.interface.drag(x1, y1, x2, y2)  # Click and drag
```

**Keyboard Actions:**

```python
await computer.interface.type_text("Hello!")   # Type text
await computer.interface.key_press("enter")    # Press single key
await computer.interface.hotkey("ctrl", "c")   # Key combination
```

**Scrolling:**

```python
await computer.interface.scroll(direction, amount)  # Scroll up/down/left/right
```

**File Operations:**

```python
content = await computer.interface.read_file("/path/to/file")
await computer.interface.write_file("/path/to/file", "content")
```

**Clipboard:**

```python
text = await computer.interface.get_clipboard()
await computer.interface.set_clipboard("text to copy")
```

### Supported Actions (Message Format)

**OpenAI-style:**

- `ClickAction` - button (left/right/wheel/back/forward) and x, y coordinates
- `DoubleClickAction` - same parameters as click
- `DragAction` - start and end coordinates
- `KeyPressAction` - key name
- `MoveAction` - x, y coordinates
- `ScreenshotAction` - no parameters
- `ScrollAction` - direction and amount
- `TypeAction` - text string
- `WaitAction` - duration

**Anthropic-style:**

- `LeftMouseDownAction` - x, y coordinates
- `LeftMouseUpAction` - x, y coordinates

## Agent SDK

### ComputerAgent Class

```python
from agent import ComputerAgent

agent = ComputerAgent(
    model="anthropic/claude-sonnet-4-5-20250929",
    tools=[computer],
    max_trajectory_budget=5.0,  # Cost limit in USD
)

messages = [{"role": "user", "content": "Open Firefox and go to google.com"}]

# Each result carries the messages produced since the previous one
async for result in agent.run(messages):
    for item in result["output"]:
        if item["type"] == "message":
            print(item["content"][0]["text"])
```

### Response Structure

```python
{
    "output": [AgentMessage, ...],  # List of messages
    "usage": {
        "prompt_tokens": int,
        "completion_tokens": int,
        "total_tokens": int,
        "response_cost": float
    }
}
```

**Message Types:**

- `UserMessage` - Input from user/system
- `AssistantMessage` - Text output from agent
- `ReasoningMessage` - Agent thinking/summary
- `ComputerCallMessage` - Intent to perform action
- `ComputerCallOutputMessage` - Screenshot result
- `FunctionCallMessage` - Python tool invocation
- `FunctionCallOutputMessage` - Function result

## Supported Models

### CUA VLM Router (Recommended)

```python
model="cua/anthropic/claude-sonnet-4.5"  # Recommended
model="cua/anthropic/claude-haiku-4.5"   # Faster, cheaper
```

The router provides a single API key, cost tracking, and managed infrastructure.

### Anthropic (BYOK)

```python
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."

# Pick one of the following model strings
model="anthropic/claude-sonnet-4-5-20250929"
model="anthropic/claude-haiku-4-5-20251001"
model="anthropic/claude-opus-4-20250514"
model="anthropic/claude-3-7-sonnet-20250219"
```

### OpenAI (BYOK)

```python
os.environ["OPENAI_API_KEY"] = "sk-..."

model="openai/computer-use-preview"
```

### Google Gemini

```python
model="gemini-2.5-computer-use-preview-10-2025"
```

### Local Models

```python
model="huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B"
model="ollama_chat/0000/ui-tars-1.5-7b"
```

### Composed Agents

Combine a grounding model with a planning model by joining the two model strings with `+`:

```python
model="huggingface-local/GTA1-7B+openai/gpt-4o"
model="moondream3+openai/gpt-4o"
model="omniparser+anthropic/claude-sonnet-4-5-20250929"
model="omniparser+ollama_chat/mistral-small3.2"
```

**Grounding Models:** UI-TARS, GTA, Holo, Moondream, OmniParser, OpenCUA

### Human-in-the-Loop

```python
model="human/human"  # Pause for user approval
```

## Provider Types

### Cloud (Recommended)

```python
computer = Computer(
    os_type="linux",  # linux, windows, macos
    provider_type="cloud",
    name="sandbox-name",
    api_key="sk_cua-api01_...",
)
```

Get an API key from [cloud.trycua.com](https://cloud.trycua.com).

### Docker (Local)

```python
computer = Computer(
    os_type="linux",
    provider_type="docker",
)
```

Images: `trycua/cua-xfce:latest`, `trycua/cua-ubuntu:latest`

### Lume (macOS Local)

```python
computer = Computer(
    os_type="linux",
    provider_type="lume",
)
```

Requires the Lume CLI to be installed.

### Windows Sandbox

```python
computer = Computer(
    os_type="windows",
    provider_type="windows_sandbox",
)
```

Requires `pywinsandbox` and the Windows Sandbox feature to be enabled.

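### Switching Providers

When a script needs to move between the provider types listed above, it can help to keep the keyword arguments in one place. The following is a minimal sketch, not part of the CUA SDK: `PROVIDER_PRESETS` and `computer_kwargs` are hypothetical names, and the preset values simply mirror the four snippets in this section.

```python
from typing import Any

# Hypothetical presets mirroring the provider snippets in this section.
# Keys are provider_type values; values are keyword arguments for Computer(...).
PROVIDER_PRESETS: dict[str, dict[str, Any]] = {
    "cloud": {"os_type": "linux", "provider_type": "cloud", "name": "sandbox-name"},
    "docker": {"os_type": "linux", "provider_type": "docker"},
    "lume": {"os_type": "linux", "provider_type": "lume"},
    "windows_sandbox": {"os_type": "windows", "provider_type": "windows_sandbox"},
}


def computer_kwargs(provider_type: str, **overrides: Any) -> dict[str, Any]:
    """Return a copy of the preset for provider_type with overrides applied."""
    if provider_type not in PROVIDER_PRESETS:
        known = ", ".join(sorted(PROVIDER_PRESETS))
        raise ValueError(
            f"unknown provider_type {provider_type!r}; expected one of: {known}"
        )
    kwargs = dict(PROVIDER_PRESETS[provider_type])  # copy so presets stay unchanged
    kwargs.update(overrides)
    return kwargs
```

With this in place, a call such as `Computer(**computer_kwargs("cloud", name="my-sandbox"))` would build the cloud configuration with a different sandbox name, assuming `Computer` accepts the same keyword arguments shown in the snippets above.
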
## MCP Integration

This project uses the CUA MCP Server for Claude Code integration:

```json
{
  "mcpServers": {
    "cua": {
      "type": "http",
      "url": "https://cua-mcp-server.vercel.app/mcp"
    }
  }
}
```

### MCP Tools Available

**Sandbox Management:**

- `mcp__cua__list_sandboxes` - List all sandboxes
- `mcp__cua__create_sandbox` - Create VM (os, size, region)
- `mcp__cua__start/stop/restart/delete_sandbox` - Start, stop, restart, or delete a sandbox

**Task Execution:**

- `mcp__cua__run_task` - Autonomous task execution
- `mcp__cua__describe_screen` - Vision analysis without action
- `mcp__cua__get_task_history` - Retrieve task results

## Best Practices

### Task Design

```python
# Good - specific and sequential
"Open Chrome, navigate to github.com, click the Sign In button"

# Avoid - vague
"Log into GitHub"
```

### Error Recovery

```python
async for result in agent.run(messages):
    if result.get("error"):
        # Take screenshot to understand state
        screenshot = await computer.interface.screenshot()
        # Retry with more specific instructions
```

### Resource Management

```python
try:
    await computer.run()
    # ... perform tasks
finally:
    await computer.close()  # Always cleanup
```

### Cost Control

```python
agent = ComputerAgent(
    model="cua/anthropic/claude-sonnet-4.5",
    max_trajectory_budget=5.0,  # Stop at $5 spent
)
```

## Resources

- [Documentation](https://cua.ai/docs)
- [GitHub](https://github.com/trycua/cua)
- [Cloud Dashboard](https://cloud.trycua.com)
- [Discord](https://discord.com/invite/mVnXXpdE85)
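
## Example: Summarizing Agent Output

The response structure described under Agent SDK (an `output` list plus a `usage` dict with `response_cost`) can be reduced to the printed text and a running cost total. The sketch below is an illustration under that assumption only: `summarize_result`, `collect_run`, and `fake_run` are hypothetical helpers, and `fake_run` stands in for `agent.run(messages)` with hand-written sample data so the example runs without the SDK.

```python
import asyncio
from typing import Any, AsyncIterator


def summarize_result(result: dict[str, Any]) -> tuple[list[str], float]:
    """Collect message texts and the reported cost from one result dict."""
    texts = [
        part["text"]
        for item in result.get("output", [])
        if item.get("type") == "message"
        for part in item.get("content", [])
        if "text" in part
    ]
    cost = float(result.get("usage", {}).get("response_cost", 0.0))
    return texts, cost


async def collect_run(
    results: AsyncIterator[dict[str, Any]],
) -> tuple[list[str], float]:
    """Consume an async stream of results, returning all texts and total cost."""
    all_texts: list[str] = []
    total_cost = 0.0
    async for result in results:
        texts, cost = summarize_result(result)
        all_texts.extend(texts)
        total_cost += cost
    return all_texts, total_cost


async def fake_run() -> AsyncIterator[dict[str, Any]]:
    """Stand-in for agent.run(messages); the values here are made-up samples."""
    yield {
        "output": [{"type": "message", "content": [{"text": "Opened Firefox"}]}],
        "usage": {"response_cost": 0.25},
    }
    # A non-message item (hypothetical type name) is skipped by summarize_result.
    yield {"output": [{"type": "computer_call"}], "usage": {"response_cost": 0.5}}
```

Calling `asyncio.run(collect_run(fake_run()))` exercises the helpers on the sample data; with a real agent, `collect_run(agent.run(messages))` would be awaited instead, and the total could be compared against `max_trajectory_budget`.
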