# WebDroid Agent

中文 | English

WebDroid Agent is a browser-first Android phone agent experiment. In static deployments, it runs entirely on the frontend. It connects to an Android device from the browser through WebUSB/WebADB, captures the device screen, sends it to an OpenAI-compatible vision model, then parses, validates, and executes the model's constrained action through ADB. Docker deployments add a local Node API proxy so model requests go through the container instead of directly from the browser.

The goal is not to replace long-running human supervision. It is a local browser environment for quickly validating the vision-model-plus-phone-control loop.

```text
Chromium WebUSB -> Tango/WebADB -> Android ADB
Static deployment: browser fetch -> OpenAI-compatible /v1/chat/completions -> vision model
Docker: browser fetch -> same-origin local proxy -> OpenAI-compatible /v1/chat/completions -> vision model
```

## What It Can Do

- Run entirely on the frontend for static deployments such as Cloudflare Pages.
- Use the built-in same-origin local API proxy in Docker deployments to avoid model-provider browser CORS limits.
- Connect to an Android device with USB debugging enabled through WebADB in the browser.
- Capture the phone screen and send the screenshot, current app, device state, full installed-app list, and step history to the model.
- Explicitly choose the `webdroid_json`, `open_autoglm_function`, or `mobilerun_xml` action protocol.
- Use the canonical JSON prompt/action format, while keeping parser compatibility for Open-AutoGLM-style and mobilerun-style action outputs.
- Parse, normalize, and validate the next action returned by the model.
- Execute app launches, taps, swipes, text input, Back, Home, long press, double tap, and wait actions through ADB, including model-controlled wait duration.
- Support clear-before-type input, fixed post-action settle delay, transient model API and empty-model-response retries, and limited automatic recovery from non-sensitive execution failures.
- Support editable App Cards, local Custom Tools, and safe Secret typing that exposes only IDs/labels to the model.
- Support continuous auto-execution as well as step-by-step human confirmation.
- Send chat messages to run automatically, with one-step planning kept in advanced debug controls.
- Support sensitive-action confirmation, unrestricted mode, max-step limits, stop controls, and advanced reset/run-log export.
- Persist page settings in the local browser `localStorage`, and persist agent thread/turn history in IndexedDB.

## Good Fits

This project is a good fit for:

- Testing whether an OpenAI-compatible vision model can understand real Android UI.
- Debugging phone-agent action protocols, coordinate mapping, and auto-execution loops.
- Exploring compatibility between Open-AutoGLM-style, mobilerun-style, and more general JSON actions.
- Building Android UI automation prototypes in a local and controlled environment.

It is not a good fit for:

- Payments, checkout flows, deletions, authorization, or account settings.
- Login, captcha, password, or verification-code flows that need explicit human intervention.
- Production use cases that require a backend, long-running reliability, or multi-device orchestration.

## Flow

1. Open the app in a Chromium-based browser.
2. Connect an Android device with USB debugging enabled and authorize ADB on the phone.
3. Fill in the OpenAI-compatible `Base URL`, `API Key`, and `Model`.
4. Type a natural-language instruction in the chat, such as "Open Settings and go to Wi-Fi".
5. Sending the message captures the screen and asks the model for one action.
6. The frontend parses and validates the action, then executes safe actions automatically while sensitive actions ask for confirmation by default.
7. If a non-sensitive action execution fails, the failure feedback is added to the next model context and the model gets a small number of automatic recovery attempts.
8. The loop continues until the model returns `done`, requests `take_over`, the max step count is reached, or the user stops execution; unrestricted mode does not stop on `take_over`.

## Requirements

- A Chromium-based browser with WebUSB support, such as Chrome or Edge.
- An Android device with USB debugging enabled.
- A USB data cable.
- An OpenAI-compatible `/v1/chat/completions` API.
- A vision model that accepts `image_url` input.
- For static deployments, an API service configured to allow browser cross-origin requests. Docker deployments can avoid this requirement through the same-origin local proxy.
- A `localhost` or HTTPS environment so WebUSB can work.

## Quick Start

```bash
npm install
npm run dev
```

Then open the local URL printed by Vite in Chrome or Edge.

Common commands:

```bash
npm test
npm run lint
npm run build
npm run preview
```

## Configuration

The app stores these values in the current browser's `localStorage`:

- `Base URL`: OpenAI-compatible API endpoint, default `https://api.openai.com/v1`.
- `API Key`: model API key.
- `Model`: model name, default `gpt-5.5`.
- `Thinking depth`: `reasoning_effort` for reasoning models such as GPT-5.5. Use the provider default, or choose `none`, `minimal`, `low`, `medium`, `high`, or `xhigh`.
- `Action protocol`: model action protocol, one of `webdroid_json`, `open_autoglm_function`, or `mobilerun_xml`.
- `Max steps`: maximum auto-execution steps, default `50`.
- `Confirm sensitive actions`: whether sensitive taps require human confirmation, default on.
- `Unrestricted mode`: bypass local safety policy and sensitive confirmations, and prompt the model not to request human takeover.
- `Stream responses`: whether to use streaming responses, default off.
- `Use ADB Keyboard for text`: whether to prefer ADB Keyboard input, default off.
- `Action settle`, `Double tap interval`, `Keyboard step`: timing controls for action execution and text input.
- `App Cards`: package-name keyed editable app context cards; Chrome, Gmail, and Settings are built in by default.
- `Secrets`: local secret records; the model sees only `id` and `label`, and `type_secret` resolves the value locally at execution time.
- `Custom Tools`: local tool definitions; the model sees only tool names/descriptions, and local results are returned into later context.

The API key stays in the browser only for static deployments. Docker deployments send model requests to the same-origin local proxy first, then the container's Node service forwards them to the configured model API so the browser does not need CORS access to the model provider.

## Docker Deployment

The Docker image builds the same frontend app and enables a small local Node service:

- WebUSB/WebADB still runs in the browser.
- The frontend posts model calls to same-origin `/api/openai/chat/completions`.
- The Node service reads the request `Base URL`, `API Key`, and OpenAI-compatible payload, then forwards the request to the model API.
- Cloudflare Pages does not use this Node service and does not set the proxy build variable, so the hosted static app still calls the configured model API directly from the browser.

Build and run:

```bash
npm run docker:build
docker run --rm -p 8080:8080 webdroid-agent
```

Then open this URL in Chrome or Edge:

```text
http://localhost:8080/
```

If you use the repository Docker Compose config:

```bash
docker compose up -d --build
```

Then open:

```text
http://localhost:8083/
```

You do not need to pass the API key as an environment variable. Continue entering it in the model settings panel. Do not expose this container proxy directly to an untrusted public network because it forwards arbitrary OpenAI-compatible `Base URL` values submitted by the browser.
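
The forwarding step described in this section can be sketched as a small pure function. This is only an illustration: the request-body field names (`baseUrl`, `apiKey`, `payload`), the `buildUpstreamRequest` helper, and the host allowlist are assumptions made for the example, not the project's actual proxy code. The allowlist is one possible way to narrow the arbitrary-`Base URL` risk noted above.

```typescript
// Hypothetical shape of the JSON body the frontend posts to the local proxy.
// The field names are assumptions, not the project's real wire format.
interface ProxyRequestBody {
  baseUrl: string;
  apiKey: string;
  payload: { [key: string]: unknown };
}

// What the proxy would send to the model provider.
interface UpstreamRequest {
  url: string;
  headers: { [name: string]: string };
  body: string;
}

// Build the upstream chat-completions request, rejecting any Base URL whose
// host is not on the (assumed) allowlist.
function buildUpstreamRequest(
  body: ProxyRequestBody,
  allowedHosts: string[]
): UpstreamRequest {
  const base = new URL(body.baseUrl); // throws on a malformed URL
  if (base.protocol !== "https:" && base.protocol !== "http:") {
    throw new Error("Base URL must use http or https");
  }
  if (allowedHosts.indexOf(base.hostname) === -1) {
    throw new Error("Base URL host is not allowed: " + base.hostname);
  }
  // Drop trailing slashes so ".../v1/" and ".../v1" produce the same endpoint.
  const path = base.pathname.replace(/\/+$/, "");
  return {
    url: base.origin + path + "/chat/completions",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer " + body.apiKey,
    },
    body: JSON.stringify(body.payload),
  };
}
```

Rebuilding the URL from `origin` and `pathname` drops any embedded credentials, query string, or fragment, so a value such as `https://api.openai.com@other.example/v1` is judged by its real host rather than by its prefix.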

## Action Protocol

The model should return a single JSON object and avoid Markdown or explanatory prose:

```json
{
  "action": "tap",
  "x": 540,
  "y": 1280,
  "reason": "Click the search box"
}
```

Recommended canonical JSON actions:

| Action | Meaning |
| --- | --- |
| `launch` | Launch an app by common app name or package name |
| `tap` | Tap a screen coordinate |
| `swipe` | Swipe from one point to another |
| `input_text` | Type text; `clear:true` clears the currently focused field first |
| `type_secret` | Type a local secret; the model sends only `secretId` and never sees the value |
| `open_url` | Open a web URL or app deep link with Android `ACTION_VIEW` |
| `set_clipboard` | Set WebDroid clipboard text and best-effort sync it to the device clipboard |
| `paste` | Paste/type WebDroid clipboard text into the current focus |
| `custom_tool` | Run a locally configured Custom Tool |
| `key` | Send an Android key such as `BACK`, `HOME`, or `ENTER` |
| `back` | Navigate back |
| `home` | Return to the home screen |
| `long_press` | Long-press a coordinate |
| `double_tap` | Double-tap a coordinate |
| `wait` | Wait for a duration, preferably with `duration` in seconds |
| `take_over` | Request human takeover |
| `note` | Record an observation without touching the device |
| `done` | Mark the task as complete |

Examples:

```json
{
  "action": "launch",
  "app": "Settings",
  "reason": "Open system settings"
}
```

```json
{
  "action": "swipe",
  "fromX": 540,
  "fromY": 1700,
  "toX": 540,
  "toY": 500,
  "durationMs": 400,
  "reason": "Scroll the list down"
}
```

```json
{
  "action": "take_over",
  "message": "The user needs to enter a verification code"
}
```

```json
{
  "action": "type_secret",
  "secretId": "gmail_password",
  "clear": true,
  "reason": "Type the configured local password"
}
```

```json
{
  "action": "open_url",
  "url": "https://example.com/search?q=webdroid",
  "reason": "Open the target page directly"
}
```

The legacy compatibility layer still accepts `interact` and `call_api`, but they are not recommended real execution actions. `interact` is converted to `take_over`; `call_api` is converted to `take_over` with an unsupported-second-API-call message.

## mobilerun Compatibility

The parser also accepts common mobilerun-style actions and maps them to real WebDroid execution actions:

| mobilerun style | WebDroid execution |
| --- | --- |
| `click_at` / `tap_at` | `tap` |
| `click_area` / `tap_area` | tap the area center |
| `long_press_at` | `long_press` |
| `type_text` / `type_text_direct` | `input_text` |
| `type_secret` | `type_secret` |
| `custom_tool` | `custom_tool` |
| `system_button` / `press_button` | `key` |
| `open_app` / `open_bundle_id` | `launch` |
| `remember` | `note` |
| `complete` | `done` |

`swipe` also accepts `coordinate`, `coordinate2`, and `duration` in seconds. mobilerun-style `coordinate`, `point`, `position`, and `click_area` coordinates are screenshot pixels. Only Open-AutoGLM `element` coordinates keep using the `0-1000` relative coordinate space.

## Open-AutoGLM Compatibility

The parser also accepts Open-AutoGLM-style action names and payloads, including:

- `Launch`
- `Tap`, including `element: [x, y]` relative coordinates
- `Type`
- `Swipe`
- `Back`
- `Home`
- `Long Press`
- `Double Tap`
- `Wait`
- `Take_over`
- `Interact`, converted to `take_over`
- `Note`
- `Call_API`, converted to `take_over` with an unsupported-second-API-call message
- `type_secret(secret_id="...")`
- `custom_tool(tool="...")`

It also accepts function-style outputs such as:

```text
do(action="Launch", app="JD")
```

Open-AutoGLM coordinates use the `0-1000` relative coordinate space; canonical JSON uses screenshot pixel coordinates. The app maps them back to native device coordinates before execution.

## Device Control Details

- Launching apps: prefer the device's installed-app list, and also support built-in common app-name mappings or direct Android package names.
- Tap and swipe: coordinates are validated against the screen bounds before execution.
- Screen tree: each step best-effort reads `uiautomator dump --compressed` and injects visible text, descriptions, resource ids, clickable state, and bounds into the model context.
- Opening URLs: uses Android `am start -a android.intent.action.VIEW -d ` for web URLs and registered deep links.
- Long press: simulated with Android `input swipe x y x y duration`.
- Double tap: two taps with a configurable delay in between.
- Text input: simple ASCII text uses Android `input text`.
- Clear-before-type, Chinese, and complex text: use ADB Keyboard or AutoGLM Keyboard broadcast input.
- Clipboard: `set_clipboard` stores a local WebDroid clipboard and tries `cmd clipboard set`; `paste` prefers the local clipboard through the current text-input channel.
- ADB Keyboard mode requires `com.android.adbkeyboard/.AdbIME` to be installed and enabled on the device; the device panel provides install and enable controls.
- After every device action, the app waits according to the `Action settle` setting so the next step is less likely to run during animation or page loading.

## Safety Boundaries

The frontend tries to constrain and confirm actions before execution:

- Model output must parse into a supported action.
- Coordinates are checked against the screen bounds.
- Text input is length-limited and control characters are rejected.
- `type_secret` receives only a local secret ID from the model; real secret values are not sent to model requests or result summaries.
- Auto-execution has a maximum step count.
- The user can stop the run at any time.
- Sensitive taps can require human confirmation; unrestricted mode skips those confirmations.
- `take_over`, `note`, and `done` do not directly control the device; legacy `interact` and `call_api` are converted to human takeover unless unrestricted mode is enabled.
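
The first three checks in the list above (parse into a supported action, bounds-check coordinates, limit text) might look roughly like the sketch below. It is a simplified illustration, not the project's `actionParser.ts`: the `checkModelOutput` helper, the reduced action subset, the `text` field name for `input_text`, and the 500-character limit are all assumptions chosen for the example.

```typescript
// Illustrative validator; every name and limit here is an assumption.
interface CheckResult {
  ok: boolean;
  error?: string;
}

interface ScreenSize {
  width: number;
  height: number;
}

// A small subset of the canonical actions, enough to show the idea.
const SUPPORTED_ACTIONS = ["tap", "input_text", "take_over", "note", "done"];
const MAX_TEXT_LENGTH = 500; // hypothetical limit

function checkModelOutput(raw: string, size: ScreenSize): CheckResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (e) {
    return { ok: false, error: "output is not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, error: "output is not a JSON object" };
  }
  const obj = parsed as { [key: string]: unknown };
  const action = obj["action"];
  if (typeof action !== "string" || SUPPORTED_ACTIONS.indexOf(action) === -1) {
    return { ok: false, error: "unsupported action" };
  }
  if (action === "tap") {
    const x = obj["x"];
    const y = obj["y"];
    if (typeof x !== "number" || typeof y !== "number" || !isFinite(x) || !isFinite(y)) {
      return { ok: false, error: "tap needs numeric x and y" };
    }
    // Screenshot pixel coordinates must fall inside the screen.
    if (x < 0 || y < 0 || x >= size.width || y >= size.height) {
      return { ok: false, error: "tap is outside the screen bounds" };
    }
  }
  if (action === "input_text") {
    const text = obj["text"]; // assumed field name
    if (typeof text !== "string") {
      return { ok: false, error: "input_text needs a text string" };
    }
    if (text.length > MAX_TEXT_LENGTH) {
      return { ok: false, error: "text is too long" };
    }
    // Reject ASCII control characters, including DEL.
    if (/[\u0000-\u001f\u007f]/.test(text)) {
      return { ok: false, error: "text contains control characters" };
    }
  }
  return { ok: true };
}
```

Returning a result object instead of throwing makes it straightforward to feed the error message back into the next model context, in the spirit of the failure-feedback loop described earlier.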

It is still strongly recommended to avoid letting the agent handle account login, payments, checkout, deletions, authorization, verification codes, or privacy-sensitive pages. By default, when the model returns `take_over`, auto-execution stops and waits for a human; unrestricted mode does not stop on takeover requests.

## Project Structure

```text
src/
  adapters/
    adbKeyboard.ts            # ADB Keyboard install, detection, and encoding helpers
    appPackages.ts            # common app-name to package-name mappings
    deviceCommands.ts         # device command compatibility exports
    deviceParsers.ts          # dumpsys and screenshot byte parsing
    deviceRetry.ts            # device-read retry and delay helpers
    deviceTiming.ts           # device execution timing defaults
    deviceTypes.ts            # shared device backend types and errors
    inputCommands.ts          # ADB input command building
    installedApps.ts          # installed-app parsing, search, and display names
    sensitiveActions.ts       # sensitive action confirmation
    screenshotPreprocess.ts   # screenshot preprocessing
    stayAwakeCommands.ts      # stay-awake commands while ADB is connected
    webAdbBackend.ts          # WebADB/WebUSB implementation
  components/
    AgentStepCard.tsx         # agent step card
    AppTopbar.tsx             # brand and status topbar
    ChatHistorySidebar.tsx    # chat-history sidebar
    ChatPanel.tsx             # chat transcript and composer shell
    ConfigRail.tsx            # collapsed configuration shortcuts
    ConfigSidebar.tsx         # device and model configuration sidebar composition
    ConversationPanel.tsx     # chat, history, and pending action view
    DeviceOptionsSection.tsx  # device input, confirmation, and timing options
    DevicePanel.tsx           # device connection and execution settings panel
    DirectCommandsSection.tsx # direct ADB action panel
    InstalledAppsSection.tsx  # installed-app search and launch controls
    LazyDetails.tsx           # lazily rendered collapsible sections
    MarkdownContent.tsx       # chat message Markdown rendering
    ModelPanel.tsx            # model configuration panel
    PendingActionCard.tsx     # pending action confirmation card
    PhoneStage.tsx            # phone screenshot and action overlay
    RunLog.tsx                # run log view
    ScreenshotLightbox.tsx    # screenshot preview modal
    SettingsDialog.tsx        # app settings, repository info, and editable resources
    TutorialPanel.tsx         # quick-start tutorial expanded from the topbar
  hooks/
    useAgentRunController.ts       # auto-run and pending-action control
    useAgentSessionHistory.ts      # session restore, persistence, and history state
    useBusyTask.ts                 # busy-task and error state management
    useConfigTargetScroll.ts       # config sidebar target scrolling
    useDeviceBackendPreferences.ts # device backend preference sync
    useDeviceController.ts         # device connection, screenshot, and direct action state
    useDocumentPreferences.ts      # document theme and language attribute sync
    useLatestValue.ts              # ref for reading latest values inside async callbacks
    usePersistedSettings.ts        # settings persistence on changes
    useRepositoryStats.ts          # GitHub repository stats loading for settings
    useRunLog.ts                   # run-log state management
    useStorageEstimate.ts          # local storage quota estimate
  lib/
    actionDefaults.ts          # common screenshot action defaults
    actionParser.ts            # action parsing, normalization, and validation
    actionPreview.ts           # action preview text formatting
    actionProtocol.ts          # explicit action protocol enum
    actionSafetyPolicy.ts      # local action safety policy
    actionTypes.ts             # action types and validation error definitions
    actions.ts                 # action module compatibility barrel
    agentResources.ts          # local Secret and Custom Tool resources
    agent.ts                   # agent loop orchestration
    agentThread.ts             # persistent agent thread/turn/event model
    appCards.ts                # editable app context cards
    appCopy.ts                 # localized copy aggregation and locale resolution
    appCopy.en-US.ts           # English UI copy
    appCopy.zh-CN.ts           # Chinese UI copy
    busyTask.ts                # in-page busy task identifiers
    contextBuilder.ts          # model context building and compaction
    deviceDoctor.ts            # device and model configuration diagnostics
    deviceState.ts             # device state display formatting
    interactionStream.ts       # combined chat message and agent step display stream
    openAiClient.ts            # OpenAI-compatible network client
    openAiErrors.ts            # OpenAI client error types
    openAiPayload.ts           # OpenAI-compatible request payload building
    openAiResponse.ts          # OpenAI-compatible response reading and error formatting
    openAiRuntimeConfig.ts     # OpenAI request runtime configuration
    openAiTypes.ts             # OpenAI client and message types
    promptContextFormatting.ts # model context formatting helpers
    prompts.ts                 # prompts and action rules
    repository.ts              # repository links and GitHub stats parsing
    runLogEntries.ts           # run-log entry and screenshot-view formatting
    screenshot/                # screenshot coordinates, context, and retention policy
      coordinates.ts
      index.ts
      retention.ts
    settings.ts                # local settings persistence
    threadStore.ts             # agent thread persistent storage
    toolRegistry.ts            # agent action tool registration and execution
  styles/                      # styles split by page area
    agent-step-card.css        # agent step-card styles
    chat-composer.css          # chat composer styles
    chat-history.css           # chat-history sidebar styles
    chat-panel.css             # chat panel styles
    compact-section.css        # collapsible tool-section styles
    config-panel.css           # device and model configuration panel styles
    config-rail.css            # collapsed config rail styles
    controls.css               # forms, buttons, and shared control styles
    conversation-panel.css     # conversation-panel shell and pending-action styles
    device-doctor.css          # device doctor result styles
    device-options.css         # device execution option styles
    device-panel.css           # device connection section styles
    direct-commands.css        # direct command panel styles
    index.css                  # global style entrypoint
    installed-apps.css         # installed-app list styles
    layout.css                 # page layout and panel frames
    markdown-content.css       # Markdown content styles
    model-panel.css            # model configuration section styles
    phone-stage.css            # phone preview and action-overlay styles
    responsive.css             # responsive layout adjustments
    run-log.css                # run-log styles
    screenshot-lightbox.css    # screenshot preview modal styles
    settings-dialog.css        # settings dialog styles
    theme.css                  # theme tokens and base reset
    tutorial-panel.css         # tutorial panel styles
  App.tsx                      # page state, workflow logic, and component composition
  main.tsx                     # React entrypoint and global style loading
server/
  index.js                     # static-file and API proxy server for Docker
  openAiProxy.js               # local OpenAI-compatible proxy handler
```

## Verification

```bash
npm test
npm run lint
npm run build
```

The current tests mainly cover:

- Action parsing and action safety validation.
- OpenAI-compatible request payload construction, response parsing, and network client errors.
- Single-step and continuous agent execution.
- Failure feedback, transient model API and empty-model-response retries, and limited automatic recovery.
- Settings persistence and compatibility migration.
- Agent thread/turn persistence.
- Installed-app parsing, matching, and full context injection.
- Screenshot coordinate mapping.
- The run-log, screenshot preview, and main layout components.

Real-device control still needs manual verification with an Android device.

## Deploying to Cloudflare Pages

The project is already set up on Cloudflare Pages:

- Historical live site (legacy Pages hostname): https://webadb-autoglm.pages.dev/
- Deployment method: automatic deployment from GitHub

Redeploy:

```bash
git push origin main
```

You can also verify the build locally first:

```bash
npm run build
```

Cloudflare Pages should keep using plain `npm run build` without `VITE_OPENAI_PROXY_URL`. That keeps the hosted static app on browser-direct model API requests and avoids depending on the Docker-only local Node proxy.

## License

This project is open source under the [MIT License](./LICENSE). You may use, copy, modify, distribute, and build on the code, provided that the original copyright notice and license text are retained.

Third-party dependencies remain subject to their own licenses. Please review dependency license terms before redistribution or commercial use.

## Related Projects and Community

- [Tango / WebADB](https://github.com/yume-chan/ya-webadb): the browser-side ADB/WebUSB foundation.
- Open-AutoGLM: an important reference for mobile GUI agent action protocols.
- Linux.do: an active Chinese tech community centered on AI, software development, resource sharing, and current industry discussion.