--- name: clawdcursor description: "AI desktop agent — control any app on Windows/macOS from your OpenClaw agent. Send natural language tasks to the Clawd Cursor API and it handles everything: opening apps, clicking buttons, typing text, navigating browsers, filling forms. If you can click it, your agent can too." homepage: https://clawdcursor.com --- # Clawd Cursor **If a human can do it on a screen, you can too.** No API? No integration? No problem. Every app already has a UI — Clawd Cursor gives you eyes and hands to use all of them. No task is impossible. If it's on the screen, it's within reach. > **IMPORTANT — READ THIS BEFORE ANYTHING ELSE** > > You have a desktop AI agent over GUI running on the user's machine. This means: > 1. You **CAN** open apps, click buttons, type text, fill forms > 2. You **CAN** see what's on screen (via Clawd Cursor's screenshot pipeline) > 3. You **CAN** interact with applications — browser, desktop apps, settings > 4. You **MUST** respect safety tiers — 🟢 Auto actions run freely, 🔴 Confirm actions require user approval > 5. You **MUST** ask before accessing sensitive apps (email, banking, private messages, password managers) > 6. You **SHOULD** prefer direct tools (file read, web_fetch, CLI) over desktop automation when available > > Think of Clawd Cursor as your hands and eyes on the user's computer — used with their permission. --- ## When to Use This (AGENTS: READ THIS) You have a desktop UI agent available. Use it when the user asks you to interact with their desktop, or when no other tool can accomplish the task. ### Tool vs Skill (OpenClaw terminology) - **Tool** = direct capability (API call, filesystem, shell, web fetch, browser command). - **Skill** = packaged workflow/domain logic that may call one or more tools. - **This skill (Clawd Cursor)** = GUI execution skill. Use it after OpenClaw tools/skills that can complete the same work without GUI. ### Use Clawd Cursor for (examples, not limits): Clawd Cursor can perform **any action that is visible and interactable in the GUI** (subject to safety policy). - **User-requested desktop tasks** — "open Gmail and send an email", "check my calendar" - **Read a webpage** — when web_fetch or browser tools aren't available - **Interact with desktop apps** — click buttons, fill forms, read results - **Browser tasks** — search, navigate, fill forms (when browser tool unavailable) - **Visual verification** — did the page load? what does the UI show? - **Cross-app workflows** — copy from one app, paste in another - **Settings changes** — when the user explicitly asks ### ⚠️ Sensitive App Policy **Always ask the user before** accessing: - Email clients (Gmail, Outlook) - Banking or financial apps - Private messaging (WhatsApp, Signal, Telegram) - Password managers - Admin panels or cloud consoles ### Don't use Clawd Cursor when: - You can do it with a direct API call or CLI command (faster) - The task is purely computational (math, text generation, code writing) - You can already read/write the file directly - The browser tool or web_fetch can handle it ## OpenClaw + Clawd Cursor Routing Contract (Avoid Overlap) Clawd Cursor should be treated as **OpenClaw's GUI execution layer**, not a competing planner. ### Route tasks in this order: 1. **OpenClaw native tools first** (filesystem, API, shell, provider-native skills) 2. **Browser-native automation next** (Playwright/CDP direct) for browser-only reads/clicks 3. **Clawd Cursor API task (`POST /task`)** only when desktop/UI-level interaction is required ### Practical rule - If OpenClaw already has a reliable skill/tool for the domain, use it. - Use Clawd Cursor to bridge gaps where no API/tool exists or when the user explicitly asks for GUI interaction. This keeps behavior predictable, lowers latency/cost, and avoids duplicated logic between the main OpenClaw agent and this skill. ### Universal task pattern For broad "get it done" requests, split into three phases: 1. **Plan in OpenClaw**: break work into API/CLI/browser/GUI subtasks. 2. **Execute cheap paths first**: API + CLI + browser direct. 3. **Escalate only residual UI steps** to Clawd Cursor. Think: **"OpenClaw decides, Clawd Cursor acts on GUI when needed."** ### Direct Browser Access (Fast Path) For quick page reads without a full task, connect to Chrome via Playwright CDP: ```js const pw = require('playwright'); const browser = await pw.chromium.connectOverCDP('http://127.0.0.1:9222'); const pages = browser.contexts()[0].pages(); const text = await pages[0].innerText('body'); ``` Use this when you just need page content — faster than sending a task. | Scenario | Use | Why | |----------|-----|-----| | Read page content/text | CDP Direct | Instant, free | | Fill a web form | API task (`POST /task`) | Clawd handles multi-step planning | | Check if a page loaded | CDP Direct | Just read the title/URL | | Click through a complex UI flow | API task (`POST /task`) | Clawd handles planning | | Get a list of elements on page | CDP Direct | Fast DOM query | | Interact with a desktop app | API task (`POST /task`) | CDP is browser-only | --- ## REST API Reference Base URL: `http://127.0.0.1:3847` > **Note:** On Windows PowerShell, use `curl.exe` (with .exe) or `Invoke-RestMethod`. Bare `curl` is aliased to `Invoke-WebRequest` which behaves differently. ### Pre-flight Check Before your first task, verify Clawd Cursor is running: ```bash curl.exe -s http://127.0.0.1:3847/health ``` Expected: `{"status":"ok","version":"0.6.0"}` If connection refused — **start it yourself** (don't ask the user): ```powershell # Find the skill directory and start the server Start-Process -FilePath "node" -ArgumentList "dist/index.js","start" -WorkingDirectory "" -WindowStyle Hidden Start-Sleep 3 # Verify it's running curl.exe -s http://127.0.0.1:3847/health ``` The skill directory is wherever SKILL.md lives (the parent of this file). Use that path as the working directory. ### Sending a Task (Async — Returns Immediately) `POST /task` accepts the task and returns immediately. The task runs in the background. **You must poll `/status` to know when it's done.** ```bash curl.exe -s -X POST http://127.0.0.1:3847/task -H "Content-Type: application/json" -d "{\"task\": \"YOUR_TASK_HERE\"}" ``` PowerShell: ```powershell Invoke-RestMethod -Uri http://127.0.0.1:3847/task -Method POST -ContentType "application/json" -Body '{"task": "YOUR_TASK_HERE"}' ``` ### Polling Pattern (Follow This) ``` 1. POST /task → get accepted 2. Wait 2 seconds 3. GET /status 4. If status is "idle" → done 5. If status is "waiting_confirm" → ASK THE USER, then POST /confirm based on their answer 6. If still running → wait 2 more seconds, go to step 3 7. If 60+ seconds → POST /abort and retry with clearer instructions ``` ### Checking Status ```bash curl.exe -s http://127.0.0.1:3847/status ``` ### Confirming Safety-Gated Actions Some actions (sending messages, deleting) require approval. **🔴 NEVER self-approve these.** Always ask the user for confirmation before POST /confirm. These exist to protect the user — do not bypass them. ```bash curl.exe -s -X POST http://127.0.0.1:3847/confirm -H "Content-Type: application/json" -d "{\"approved\": true}" ``` ### Aborting a Task ```bash curl.exe -s -X POST http://127.0.0.1:3847/abort ``` ### Reading Logs (Debugging) ```bash curl.exe -s http://127.0.0.1:3847/logs ``` Returns last 200 log entries. Check for `error` or `warn` entries when tasks fail. ### Response States | State | Response | What to do | |-------|----------|------------| | **Accepted** | `{"accepted": true, "task": "..."}` | Start polling | | **Running** | `{"status": "acting", "currentTask": "...", "stepsCompleted": 2}` | Keep polling | | **Waiting confirm** | `{"status": "waiting_confirm", "currentStep": "..."}` | POST /confirm | | **Done** | `{"status": "idle"}` | Task complete | | **Busy** | `{"error": "Agent is busy", "state": {...}}` | Wait or POST /abort first | --- ## CDP Direct Reference Chrome must be running with `--remote-debugging-port=9222`. ### Quick check: ```bash curl.exe -s http://127.0.0.1:9222/json/version ``` If this returns JSON, Chrome is ready. ### Connecting via Playwright: ```javascript const { chromium } = require('playwright'); const browser = await chromium.connectOverCDP('http://127.0.0.1:9222'); const context = browser.contexts()[0]; const page = context.pages()[0]; // Read page content const title = await page.title(); const url = page.url(); const text = await page.textContent('body'); // Click by role await page.getByRole('button', { name: 'Submit' }).click(); // Fill a field await page.getByLabel('Email').fill('user@example.com'); // Read specific elements const buttons = await page.$$eval('button', els => els.map(e => e.textContent)); ``` --- ## Task Writing Guidelines 1. **Be specific** — include app names, URLs, exact text to type, button names 2. **One task at a time** — wait for completion before sending the next 3. **Describe the goal, not the clicks** — say "Send an email to john@example.com about the meeting" not "click compose, click to field..." 4. **Check status** if a task seems to hang 5. **Don't include credentials in task text** — tasks are logged ## Task Examples | Goal | Task to send | |------|-------------| | **Simple navigation** | `Open Chrome and go to github.com` | | **Read screen content** | `What text is currently displayed in Notepad?` | | **Cross-app workflow** | `Copy the email address from the Chrome tab and paste it into the To field in Outlook` | | **Form filling** | `In the open Chrome tab, fill the contact form: name "John Doe", email "john@example.com"` | | **App interaction** | `Open Spotify and play the Discover Weekly playlist` | | **Settings change** | `Open Windows Settings and turn on Dark Mode` | | **Data extraction** | `Read the stock price shown in the Bloomberg tab in Chrome` | | **Complex browser** | `Open YouTube, search for "Adele Hello", and play the first video result` | | **Verification** | `Check if the deployment succeeded — look at the Vercel dashboard in Chrome` | | **Send email** | `Open Gmail, compose email to john@example.com, subject: Meeting Tomorrow, body: Confirming 2pm. Best regards.` | | **Take screenshot** | `Take a screenshot` | ## Error Recovery | Problem | Solution | |---------|----------| | Connection refused on :3847 | Start Clawd Cursor: `cd clawd-cursor && npm start` | | Connection refused on :9222 | Start Chrome with CDP: `Start-Process chrome -ArgumentList "--remote-debugging-port=9222"` | | Agent returns "busy" | Poll `/status` — wait for idle, or POST `/abort` | | Task fails with no details | Check `/logs` for error entries | | Task completes but wrong result | Rephrase with more specifics: exact app name, button text, field labels | | Same task fails repeatedly | Break into smaller tasks (one action per task) | | Safety confirmation pending | POST `/confirm` with `{"approved": true}` or `{"approved": false}` | | Task hangs > 60 seconds | POST `/abort`, then retry with simpler phrasing | --- ## How It Works — 5-Layer Pipeline | Layer | What | Speed | Cost | |-------|------|-------|------| | **0: Browser Layer** | URL detection → direct navigation | Instant | Free | | **1: Action Router + Shortcuts** | Regex + UI Automation + keyboard shortcuts | Instant | Free | | **1.5: Smart Interaction** | 1 LLM plan → CDP/UIDriver executes | ~2-5s | 1 LLM call | | **2: Accessibility Reasoner** | UI tree → text LLM decides | ~1s | Cheap | | **3: Computer Use** | Screenshot → vision LLM | ~5-8s | Expensive | Layer 1 includes keyboard shortcuts — common actions execute as direct keystrokes (0 LLM calls). 80%+ of tasks handled by Layer 0-1 (free, instant). Vision model is last resort only. ## Safety Tiers | Tier | Actions | Behavior | |------|---------|----------| | 🟢 Auto | Navigation, reading, opening apps | Runs immediately | | 🟡 Preview | Typing, form filling | Logs before executing | | 🔴 Confirm | Sending messages, deleting | Pauses — **ask the user** before POST `/confirm`. Never self-approve. | ## Security & Privacy ### Network Isolation - API binds to `127.0.0.1` only — **not network accessible**. Verify: `netstat -an | findstr 3847` should show `127.0.0.1:3847` - Screenshots stay in memory, never saved to disk (unless `--debug`) - No telemetry, no analytics, no phone-home calls ### Data Flow - **With Ollama (local)**: 100% offline — zero external network calls. No data leaves the machine. - **With cloud providers**: screenshots/text are sent to the user's chosen provider API **only**. No data goes to skill authors, ClawHub, or third parties. - **OpenClaw users**: credentials auto-discovered from local config files — no keys stored in skill directory. - The user controls data flow by choosing their provider. Ollama = fully private. ### Agent Autonomy Controls - **🟢 Auto** actions (navigation, reading, opening apps) run without prompting - **🟡 Preview** actions (typing, form filling) are logged before executing - **🔴 Confirm** actions (sending messages, deleting, purchases) **always pause for user approval** - Agents **must ask the user** before accessing sensitive apps (email, banking, messaging, passwords) - Agents **must never self-approve** 🔴 Confirm actions --- ## Setup (User Reference) Setup is handled by the user. If Clawd Cursor isn't running, **start it yourself** using the exec tool: ```powershell Start-Process -FilePath "node" -ArgumentList "dist/index.js","start" -WorkingDirectory "" -WindowStyle Hidden ``` Only ask the user if you cannot start it (e.g., node not installed, build missing). ```bash git clone https://github.com/AmrDab/clawd-cursor.git cd clawd-cursor npm install && npm run build npx clawd-cursor doctor # auto-detects and configures everything npm start # starts on port 3847 ``` **macOS:** Grant Accessibility permission to terminal: System Settings → Privacy & Security → Accessibility | Provider | Setup | Cost | |----------|-------|------| | **Ollama (free)** | `ollama pull ` | $0 (fully offline) | | **Any cloud provider** | Set `AI_API_KEY=your-key` | Varies by provider | | **OpenClaw users** | Automatic — no setup needed | Uses configured provider | --- ## Performance Optimization Proven optimizations applied to reduce task execution latency and LLM API costs. Reference files in `perf/references/patches/`. ### Applied Optimizations | # | Name | Impact | |---|------|--------| | 1 | Screenshot hash cache | 90% fewer LLM calls on static screens | | 2 | Parallel screenshot+a11y | 30-40% per-step latency cut | | 3 | A11y context cache (2s TTL) | Eliminates redundant PS spawns | | 4 | Screenshot compression | 52% smaller payload (58KB vs 120KB) | | 5 | Async debug writes | 94% less event loop blocking | | 6 | Streaming LLM responses | 1-3s faster per LLM call | | 7 | Trimmed system prompts | ~60% fewer prompt tokens | | 8 | A11y tree filtering | Interactive elements only, 3000 char cap | | 9 | Combined PS script | 1 spawn instead of 3 | | 10 | Taskbar cache (30s TTL) | Skip expensive taskbar query | | 11 | Delay reduction | 50-150ms vs 200-1500ms | ### Benchmarks (2560x1440) | Metric | v0.3 (VNC) | v0.4 (Native) | v0.4.1+ (Optimized) | |--------|------------|---------------|----------------------| | Screenshot capture | ~850ms | ~50ms | ~57ms | | Screenshot size | ~200KB | ~120KB | ~58KB | | A11y context (uncached) | N/A | ~600ms | ~462ms | | A11y context (cached) | N/A | 0ms | 0ms (2s TTL) | | Delays (per step) | N/A | 200-1500ms | 50-600ms | | System prompt tokens | N/A | ~800 | ~300 | ### Perf Tools - `perf/apply-optimizations.ps1` — apply all patches - `perf/perf-test.ts` — benchmark harness (`npx ts-node perf/perf-test.ts`)