# MCP server

`craftdriver` ships a [Model Context Protocol](https://modelcontextprotocol.io) server so hosted / sandboxed AI agents (Claude Desktop, Claude Code, Cursor, Windsurf, Zed, Goose, Gemini CLI, …) can drive a real browser without managing a daemon, a socket, or filesystem access.

It is a **peer to the CLI**, not a wrapper. Both share the same dispatcher and error codes, but the MCP server returns a richer post-action payload (compact a11y snapshot, diffed from the previous turn) that text models can act on directly.

```bash
# Start once via your MCP client (examples below)
npx -y craftdriver mcp
```

The server speaks JSON-RPC 2.0 on stdio. The browser launches lazily on the first tool call and shuts down when the client disconnects.

## Install snippets

### Claude Code / Claude Desktop

```bash
claude mcp add craftdriver -- npx -y craftdriver mcp
```

### Cursor / Windsurf / Zed (`.cursor/mcp.json` and similar)

```json
{
  "mcpServers": {
    "craftdriver": {
      "command": "npx",
      "args": ["-y", "craftdriver", "mcp"]
    }
  }
}
```

### Gemini CLI

```bash
gemini mcp add craftdriver npx -y craftdriver mcp
```

### Goose

```bash
goose configure   # add craftdriver as a stdio server
```

## Tools

A compact set: 14 tools, one line each. Long help lives in the schema description; clients render it in the model's context once per session.

| Tool                    | Purpose                                                                                    |
| ----------------------- | ------------------------------------------------------------------------------------------ |
| `browser_navigate`      | Go to a URL (waits for load).                                                              |
| `browser_click`         | Click an element. Auto-waits visible+enabled.                                              |
| `browser_fill`          | Fill an input/textarea/select.                                                             |
| `browser_press`         | Press a keyboard key (`Enter`, `Tab`, `Control+A`).                                        |
| `browser_hover`         | Hover over an element.                                                                     |
| `browser_find`          | Locate elements without acting (returns tag/text/visibility).                              |
| `browser_exists`        | **0-wait probe.** Returns `{exists, count}` in one BiDi roundtrip.                         |
| `browser_wait`          | Wait for selector state or load state.                                                     |
| `browser_read`          | Read `text` / `attr` / `value` / `is(visible\|enabled\|checked)`.                          |
| `browser_pages`         | List open pages (id, url, title).                                                          |
| `browser_snapshot`      | **Sanitized DOM summary with refs.** Use `ref=eN` as the selector for subsequent calls.    |
| `browser_screenshot`    | Capture PNG to a file (auto-allocated under the per-session artifact dir; never inlined).  |
| `browser_status`        | Browser up? Which URL is active?                                                           |
| `browser_advanced_eval` | Evaluate JS in the page. Last resort.                                                      |

`browser_trace` (start/stop/explain) and trace resources are slated for a future release alongside richer trace introspection.

## Selector syntax

Identical to the CLI. CSS by default; switch with a `prefix=value` form:

```
role=button[name=Submit]
text=Sign In
text*=Sign
label=Email
placeholder=Search…
testid=login-btn
alt=Logo
title=Help
xpath=//div[1]
id=submit
name=email
tag=h1
ref=e5 (← from browser_snapshot, see below)
```

## Refs — the token-efficient locator

Call `browser_snapshot` (or just navigate; the post-action diff carries refs too) and you get a sanitized accessibility-tree summary where each visible interactive element is numbered:

```
page: Login — http://…/login.html
e1: heading "Login"
e2: form "Username Password Sign in" #login-form
e3: label "Username"
e4: textbox "Username" #username
e5: label "Password"
e6: textbox "Password" #password
e7: button "Sign in" #submit
```

Use `ref=eN` as the selector for the next call:

```jsonc
{ "name": "browser_fill", "arguments": { "selector": "ref=e4", "value": "alice" } }
{ "name": "browser_fill", "arguments": { "selector": "ref=e6", "value": "hunter2" } }
{ "name": "browser_click", "arguments": { "selector": "ref=e7" } }
```

**Why this is a big deal for AI test generation**

- **No selector hallucination.** The agent picks a number, not a CSS/XPath/role expression. The element is already on the page, so there is nothing to guess wrong.
- **Token efficient.** `ref=e7` is 6 characters; `role=button[name=Sign in]` is 25.
  Over a 50-step flow that adds up.
- **Auto-waiting still works.** Internally `ref=eN` resolves to a CSS attribute selector (`[data-craftdriver-ref="eN"]`); every action takes the normal visible+enabled wait path.

**Invalidation rules**

- Refs are re-allocated on **every** `browser_snapshot` call.
- The post-action a11y diff after a mutating tool also re-runs the snapshot, so refs renumber on every turn.
- Navigating to a new URL invalidates all refs.
- A stale ref just fails with `NO_MATCH`; take a fresh snapshot.

## Post-action payload

Every tool returns a content array. Mutating tools (`navigate`, `click`, `fill`, `press`, `hover`, `advanced_eval`) additionally include a **compact a11y snapshot, diffed from the previous turn**:

```jsonc
{
  "content": [
    { "type": "text", "text": "{\"ok\":true,\"selector\":\"css selector=button[type=submit]\"}" },
    { "type": "text", "text": "page: Login — http://…/login.html\n- form \"Username Password Sign in\" #login-form\n- textbox \"Username\" #username\n- button \"Sign in\" #submit\n+ button \"Logout\" #logout" }
  ],
  "structuredContent": { "result": { "ok": true, "selector": "css selector=button[type=submit]" } }
}
```

- **First call in a session** returns the full snapshot (one line per visible interactive element: role + accessible name + locator hint).
- **Subsequent calls** return only the lines that appeared (`+`) or disappeared (`-`).
- **URL change** triggers a fresh full snapshot.
- Capped at 80 nodes / 80 chars per name so the payload stays bounded regardless of page complexity.

This is the MCP server's "killer feature" over the CLI: the agent sees what changed without a follow-up `read` call, in ~50–500 text tokens instead of 800–1500 image tokens for a screenshot.

## Artifact spilling (token efficiency)

MCP content blocks count against the model's context window on every turn.
To keep the per-call cost bounded, large payloads are **written to disk** and the inline block becomes a short preview plus the absolute path:

```
heading "Selectors Playground"
textbox "by id" #by-id
textbox "by name" #by-name
img "Logo ALT" #by-alt
button "Click me" #by-text
… (full output: /tmp/craftdriver-mcp-1234-abc/0001-snapshot.txt, 1872 bytes)
```

Applies to:

- **Screenshots**: always written to a file. If you pass `path`, that path is used; otherwise an artifact path is auto-allocated. The inline block carries the absolute path and byte count, costing **zero** image tokens.
- **A11y snapshot diffs**: spill when the rendered diff exceeds the threshold (typically only the full first-call snapshot on big pages).
- **Tool results**: `browser_read`, `browser_advanced_eval`, etc. spill when the JSON-stringified result exceeds the threshold. No more silent truncation.

Configuration:

| Env var                         | Default              | Effect                                           |
| ------------------------------- | -------------------- | ------------------------------------------------ |
| `CRAFTDRIVER_MCP_ARTIFACTS_DIR` | `os.tmpdir()`        | Root directory for the per-session artifact dir. |
| `CRAFTDRIVER_MCP_SPILL_BYTES`   | `2048` (~500 tokens) | Inline content blocks larger than this spill.    |

The per-session directory (`/craftdriver-mcp--/`) is not deleted on shutdown, because agents may still be reading past artifacts. Use `$CRAFTDRIVER_MCP_ARTIFACTS_DIR` to point at a dir with your own cleanup policy.

The `structuredContent` field is unaffected by spilling; small results still round-trip in full there for programmatic consumers.

## Errors

Errors are returned as `isError: true` content (per MCP spec), **not** as JSON-RPC errors. JSON-RPC errors are reserved for protocol-level failures (unknown method, malformed request).

```jsonc
{
  "isError": true,
  "content": [
    { "type": "text", "text": "error: click: no element matches css selector=#nope\ncode: NO_MATCH" }
  ],
  "structuredContent": {
    "error": { "code": "NO_MATCH", "message": "click: no element matches css selector=#nope" }
  }
}
```

Match on `structuredContent.error.code`; the full list is in [error-codes.md](./error-codes.md).

## Fail-fast defaults

Same rules as the CLI:

- Default per-call timeout: **5 s** (override per call with `timeout_ms`, globally with `CRAFTDRIVER_AGENT_TIMEOUT`).
- `browser_exists` is a **0-wait probe**. Call it before `browser_click` / `browser_wait` when you're guessing.
- `browser_click` / `browser_fill` reject immediately with `NO_MATCH` when the selector matches zero elements at `t=0`, so a typo does not burn the full timeout.

## When to use MCP vs. the CLI

- **MCP**: your agent runs in a hosted or sandboxed environment that can't spawn child processes per call, or you want tool discovery via `tools/list`. Schema-typed args, structured errors, snapshot diffing.
- **CLI**: your agent has a shell. Same surface, leaner per-call cost, also great for humans.
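
## Example: reacting to a tool result

The error and invalidation rules above suggest a simple client-side policy: carry on after a success, take a fresh snapshot after `NO_MATCH` (a stale ref or a typo), and stop on anything else. The following is a minimal TypeScript sketch of that idea, not part of `craftdriver`. The `ToolResult` shape mirrors the JSON examples in this document; the type names, `nextStep`, and the three-way policy are illustrative assumptions.

```typescript
// Shapes mirror the JSON examples in this document. All names are illustrative.
interface TextBlock {
  type: "text";
  text: string;
}

interface ToolResult {
  isError?: boolean;
  content: TextBlock[];
  structuredContent?: {
    result?: Record<string, unknown>;
    error?: { code: string; message: string };
  };
}

type NextStep = "continue" | "resnapshot" | "abort";

// Decide what a client loop might do after one tool call (assumed policy).
function nextStep(res: ToolResult): NextStep {
  // Tool-level failures arrive as isError content, not as JSON-RPC errors.
  if (!res.isError) return "continue";

  // Match on the structured code rather than parsing the text block.
  const code = res.structuredContent?.error?.code;

  // NO_MATCH covers stale refs and mistyped selectors: refresh refs and retry.
  if (code === "NO_MATCH") return "resnapshot";

  // Any other code: hand the decision back to the caller.
  return "abort";
}

// The error payload from the "Errors" section, as a typed value.
const staleRefError: ToolResult = {
  isError: true,
  content: [
    { type: "text", text: "error: click: no element matches css selector=#nope\ncode: NO_MATCH" },
  ],
  structuredContent: {
    error: { code: "NO_MATCH", message: "click: no element matches css selector=#nope" },
  },
};

const step: NextStep = nextStep(staleRefError); // "resnapshot"
```

Branching on `structuredContent.error.code` keeps the policy independent of the human-readable message, which may change or spill to disk. Codes other than `NO_MATCH` are listed in [error-codes.md](./error-codes.md) and would need their own branches.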