{ "cells": [ { "cell_type": "markdown", "id": "cdbd8028", "metadata": {}, "source": [ "# Computer Use Agents in Daytona Sandboxes\n", "\n", "Plenty of useful work still lives behind browser UIs with no public API: third-party dashboards, admin panels, form-heavy workflows. The [Agents SDK](https://openai.github.io/openai-agents-python/)'s Computer Use tool lets an agent see and control a desktop. In this cookbook, we use a [Daytona](https://www.daytona.io/) sandbox as the source of that desktop.\n", "\n", "The Computer Use tool needs just a handful of primitives to drive a desktop: screenshot, click, type, scroll, press keys. A Daytona sandbox wraps a Linux desktop (browser included) in a Python SDK that exposes exactly those primitives. A thin adapter implementing the Agents SDK's `AsyncComputer` interface plugs the sandbox into the tool.\n", "\n", "The agent loop runs in this notebook while the sandbox does the actual clicking and typing. As a demo, an agent fills out a web form. The form itself is served inside the sandbox on `localhost:8080`, and the whole session is recorded to an `.mp4` embedded below.\n", "\n", "The same pattern works for any task you'd describe as \"open an app, navigate somewhere, interact with the screen\": testing UI flows end-to-end, driving legacy desktop software, or any workflow that only exists as a human-facing interface.\n", "\n", "Below you can watch an agent drive the sandbox to fill out a multi-section registration form. The rest of this cookbook walks through the machinery that makes it run." 
] }, { "cell_type": "markdown", "id": "af3955f1", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "5890e753", "metadata": {}, "source": [ "## Requirements\n", "\n", "- Python 3.10+\n", "- A [Daytona](https://www.daytona.io/) account and an API key, exported as `DAYTONA_API_KEY`\n", "- An OpenAI API key, exported as `OPENAI_API_KEY`\n", "- The OpenAI Agents SDK and the Daytona Python SDK (see the install cell below)\n", "\n", "Keep both API keys in your shell environment. This notebook reads them from `os.environ` and never writes them to the sandbox.\n" ] }, { "cell_type": "markdown", "id": "94dc4267", "metadata": {}, "source": [ "## Install dependencies\n", "\n", "Clone the cookbook and move into this example directory:\n", "\n", "```bash\n", "git clone https://github.com/openai/openai-cookbook.git\n", "cd openai-cookbook/examples/agents_sdk/computer_use_with_daytona\n", "```\n", "\n", "Open `computer_use_with_daytona.ipynb` from that directory and install the dependencies below.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "aba0e5c8", "metadata": {}, "outputs": [], "source": [ "%pip install -r requirements.txt --quiet" ] }, { "cell_type": "markdown", "id": "a0d52afb", "metadata": {}, "source": [ "## Imports and environment\n", "\n", "We import from three places: the Agents SDK (`Agent`, `Runner`, `ComputerTool`, and the `AsyncComputer` / `Button` / `Environment` types we'll implement against), the Daytona SDK (`AsyncDaytona` plus `CreateSandboxFromSnapshotParams`), and the usual standard-library async/path helpers. 
`IPython.display.Video` is only needed at the very end, to play the recording inline.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "c32eb22f", "metadata": {}, "outputs": [], "source": [ "from __future__ import annotations\n", "\n", "import asyncio\n", "import logging\n", "import os\n", "from pathlib import Path\n", "from typing import Any\n", "\n", "from daytona import AsyncDaytona, CreateSandboxFromSnapshotParams\n", "\n", "from agents import Agent, AsyncComputer, Button, ComputerTool, Environment, Runner, trace\n", "\n", "from IPython.display import Video\n", "\n", "\n", "# The Daytona and OpenAI keys live in the shell environment.\n", "assert os.environ.get(\"DAYTONA_API_KEY\"), \"DAYTONA_API_KEY is not set.\"\n", "assert os.environ.get(\"OPENAI_API_KEY\"), \"OPENAI_API_KEY is not set.\"\n", "\n", "logger = logging.getLogger(\"computer_use_with_daytona\")" ] }, { "cell_type": "markdown", "id": "c216c45e", "metadata": {}, "source": [ "## The computer-use adapter\n", "\n", "The Agents SDK's [Computer Use tool](https://platform.openai.com/docs/guides/tools-computer-use) works against any object that implements the `AsyncComputer` interface: a screenshot method that returns a base64 PNG, plus `click`, `double_click`, `scroll`, `type`, `keypress`, `move`, `drag`, and `wait`. The harness drives this interface; the model never talks to Daytona directly.\n", "\n", "Daytona's desktop sandbox exposes a matching API under `sandbox.computer_use.*`: `screenshot.take_full_screen()`, `mouse.click/move/scroll/drag`, `keyboard.type/press`, plus `start()` / `stop()` for the underlying Xvfb and VNC processes. The class below is the adapter between the two." ] }, { "cell_type": "code", "execution_count": null, "id": "076cd284", "metadata": {}, "outputs": [], "source": [ "_DEFAULT_WIDTH, _DEFAULT_HEIGHT = 1024, 768\n", "\n", "# CUA emits DOM KeyboardEvent.key-style names (for example \"ArrowDown\"); Daytona\n", "# uses robotgo key names internally. 
Lowercase, then translate the few that\n", "# differ. Keys not in the table pass through unchanged.\n", "_CUA_KEY_TO_DAYTONA: dict[str, str] = {\n", " \"arrowdown\": \"down\",\n", " \"arrowleft\": \"left\",\n", " \"arrowright\": \"right\",\n", " \"arrowup\": \"up\",\n", " \"option\": \"alt\",\n", " \"super\": \"cmd\",\n", " \"win\": \"cmd\",\n", "}\n", "\n", "\n", "def _normalize_key(key: str) -> str:\n", " if len(key) > 1:\n", " key = _CUA_KEY_TO_DAYTONA.get(key.lower(), key.lower())\n", " return key\n", "\n", "\n", "class DaytonaAsyncComputer(AsyncComputer):\n", " \"\"\"AsyncComputer implementation backed by a Daytona sandbox desktop.\"\"\"\n", "\n", " def __init__(\n", " self,\n", " sandbox: Any,\n", " *,\n", " width: int = _DEFAULT_WIDTH,\n", " height: int = _DEFAULT_HEIGHT,\n", " ) -> None:\n", " self._sandbox = sandbox\n", " self._width = width\n", " self._height = height\n", "\n", " async def __aenter__(self) -> DaytonaAsyncComputer:\n", " await self._sandbox.computer_use.start()\n", " # Give Xvfb, the window manager, and the VNC server a moment to come up.\n", " await asyncio.sleep(2)\n", " return self\n", "\n", " async def __aexit__(self, exc_type: Any, exc_val: Any, exc_tb: Any) -> None:\n", " try:\n", " await self._sandbox.computer_use.stop()\n", " except asyncio.CancelledError:\n", " raise\n", " except Exception:\n", " logger.warning(\"Failed to stop computer-use processes\", exc_info=True)\n", "\n", " @property\n", " def environment(self) -> Environment:\n", " # CUA's Environment enum is {\"windows\", \"mac\", \"ubuntu\", \"browser\"} — there is\n", " # no generic \"linux\", so \"ubuntu\" is the right value for any Linux desktop\n", " # (the snapshot here is Debian) since it selects Linux-style UI conventions.\n", " return \"ubuntu\"\n", "\n", " @property\n", " def dimensions(self) -> tuple[int, int]:\n", " return (self._width, self._height)\n", "\n", " async def screenshot(self) -> str:\n", " response = await 
self._sandbox.computer_use.screenshot.take_full_screen()\n", " return response.screenshot or \"\"\n", "\n", " async def click(self, x: int, y: int, button: Button) -> None:\n", " if button not in (\"left\", \"right\"):\n", " logger.warning(\"Daytona does not support %s clicks; ignoring.\", button)\n", " return\n", " await self._sandbox.computer_use.mouse.click(x, y, button)\n", "\n", " async def double_click(self, x: int, y: int) -> None:\n", " await self._sandbox.computer_use.mouse.click(x, y, \"left\", True)\n", "\n", " async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:\n", " if scroll_y != 0:\n", " direction = \"down\" if scroll_y > 0 else \"up\"\n", " amount = max(1, abs(scroll_y) // 100)\n", " await self._sandbox.computer_use.mouse.scroll(x, y, direction, amount)\n", " if scroll_x != 0:\n", " logger.warning(\n", " \"Daytona does not support horizontal scrolling; ignoring scroll_x=%d.\",\n", " scroll_x,\n", " )\n", "\n", " async def type(self, text: str) -> None:\n", " await self._sandbox.computer_use.keyboard.type(text)\n", "\n", " async def wait(self) -> None:\n", " await asyncio.sleep(1)\n", "\n", " async def move(self, x: int, y: int) -> None:\n", " await self._sandbox.computer_use.mouse.move(x, y)\n", "\n", " async def keypress(self, keys: list[str]) -> None:\n", " if not keys:\n", " return\n", " if len(keys) == 1:\n", " await self._sandbox.computer_use.keyboard.press(_normalize_key(keys[0]))\n", " else:\n", " # Multiple keys: treat the last as the primary key, the rest as modifiers.\n", " *modifiers, key = keys\n", " await self._sandbox.computer_use.keyboard.press(\n", " _normalize_key(key), [_normalize_key(m) for m in modifiers]\n", " )\n", "\n", " async def drag(self, path: list[tuple[int, int]]) -> None:\n", " if len(path) < 2:\n", " return\n", " # Daytona drag takes start -> end; chain segments for multi-point paths.\n", " for i in range(len(path) - 1):\n", " sx, sy = path[i]\n", " ex, ey = path[i + 1]\n", " await 
self._sandbox.computer_use.mouse.drag(sx, sy, ex, ey)" ] }, { "cell_type": "markdown", "id": "6092b07c", "metadata": {}, "source": [ "## The form, the data, and the prompt\n", "\n", "The form we'll fill lives in `form.html` in this folder. It is a single-page HTML registration form with five fieldsets: personal info, professional details, conference preferences, travel/accommodation, and additional info. The fields cover text inputs, emails, phone, dates, `