# baremobile — Agent Integration Guide

Use this file as context when building agents that control Android/iOS devices via baremobile.

## TypeScript types

baremobile ships generated `.d.ts` declarations, so you get full autocomplete and type-checking with **zero TypeScript setup** — install and import. Both entry points are typed:

```js
import { connect } from 'baremobile';                   // Android — types/index.d.ts
import { connect as connectIos } from 'baremobile/ios'; // iOS — types/ios.d.ts
```

The source is plain ESM JavaScript with JSDoc; the `.d.ts` are generated from that JSDoc by `tsc`, so the types always match the shipped code. baremobile itself has **no production dependencies** (`typescript` is dev-only and never installed by adopters).

## Core Loop

Every agent interaction follows observe-think-act:

```js
import { connect } from 'baremobile';

const page = await connect();          // auto-detect device (auto-reconnects WiFi if needed)
let snapshot = await page.snapshot();  // observe
// Agent reads snapshot, picks action
await page.tap(5);                     // act
snapshot = await page.snapshot();      // observe again
```

Always snapshot after every action. Refs reset per snapshot — never cache them.

## Snapshot Format

```
- ScrollView [ref=1]
  - Group
    - Text "Settings"
    - Group [ref=2]
      - Text "Search settings"
  - List
    - Group [ref=3]
      - Text "Wi-Fi"
      - Switch [ref=4] (Wi-Fi) [checked]
    - Group [ref=5] [disabled]
      - Text "Airplane mode"
```

**What to read:**
- `[ref=N]` — interactive element, use with tap/type/scroll
- `"quoted text"` — visible text on screen
- `(parenthesized)` — contentDesc / accessibility label
- `[checked]`, `[selected]`, `[focused]`, `[disabled]` — element state
- Indentation = nesting (parent-child)

**Roles:** Text, TextInput, Button, Image, ImageButton, CheckBox, Switch, Radio, Toggle, Slider, Progress, Select, List, ScrollView, Group, TabList, Tab. Unknown classes show their short Java class name.
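
The reading rules above can be turned into a small parser when an agent wants structured data rather than raw text. This is a minimal sketch, not part of baremobile: `parseRefs` is a hypothetical helper, and it assumes one element per line with the markers exactly as described in "What to read". It only looks at the element's own line, so text held by a child `Text` node is not attached to its parent.

```js
// Hypothetical helper (not a baremobile API): list the interactive elements in a snapshot.
function parseRefs(snapshot) {
  const elements = [];
  for (const line of snapshot.split('\n')) {
    const ref = line.match(/\[ref=(\d+)\]/);
    if (!ref) continue; // no [ref=N] means the element is not interactive
    const role = line.match(/^\s*-\s*(\w+)/);
    const text = line.match(/"([^"]*)"/);
    const desc = line.match(/\(([^)]*)\)/);
    const states = [...line.matchAll(/\[(checked|selected|focused|disabled)\]/g)]
      .map((m) => m[1]);
    elements.push({
      ref: Number(ref[1]),
      role: role ? role[1] : null,
      text: text ? text[1] : null,   // visible text on the same line, if any
      desc: desc ? desc[1] : null,   // contentDesc / accessibility label
      states,
      depth: line.search(/\S/),      // leading spaces indicate nesting depth
    });
  }
  return elements;
}
```

Because refs reset per snapshot, the result should be rebuilt from a fresh `page.snapshot()` before every action rather than stored.
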

## Page Methods

### Navigation

```js
await page.launch('com.android.settings');                 // open app by package
await page.intent('android.settings.BLUETOOTH_SETTINGS');  // deep nav via intent
await page.back();                                         // press back
await page.home();                                         // press home
await page.press('recent');                                // app switcher
```

### Reading

```js
const yaml = await page.snapshot();    // pruned YAML with refs
const png = await page.screenshot();   // PNG buffer
```

### Interaction

```js
await page.tap(ref);                         // tap element
await page.tapXY(540, 1200);                 // tap by pixel coordinates
await page.tapGrid('C5');                    // tap by grid cell
await page.type(ref, 'text');                // type into field
await page.type(ref, 'new', {clear: true});  // clear field first, then type
await page.press('enter');                   // press key
await page.scroll(ref, 'down');              // scroll within element
await page.longPress(ref);                   // long press
await page.swipe(x1, y1, x2, y2, 300);       // raw swipe — coords must be numbers (non-numeric throws InvalidArgument)
```

### Waiting

```js
const afterText = await page.waitForText('Bluetooth', 10000);    // poll until text appears
const afterState = await page.waitForState(3, 'checked', 10000); // poll until state matches
// States: 'enabled', 'disabled', 'checked', 'unchecked', 'focused', 'selected'
```

### Keys for press()

back, home, enter, delete, tab, escape, up, down, left, right, space, power, volup, voldown, recent

## Common Patterns

### Type into a field

```
Snapshot shows: TextInput [ref=3] "Search settings" [focused]
```

- If `[focused]` — just type, no extra tap needed: `page.type(3, 'wifi')`
- If not focused — `page.type(3, 'wifi')` will tap first automatically
- To replace existing text: `page.type(3, 'new text', {clear: true})`

### Navigate a list

```
Snapshot shows: ScrollView [ref=1] → List → Group [ref=2] "Wi-Fi" ...
```

- Tap an item: `page.tap(2)`
- Scroll for more: `page.scroll(1, 'down')` then snapshot again
- Items at the bottom may not be visible — scroll and re-snapshot

### Handle a dialog

```
Snapshot shows: Text "Allow access?" → Button [ref=5] "Allow" → Button [ref=6] "Deny"
```

- Read dialog text, decide, tap the appropriate button
- Dialogs always have their buttons in the snapshot with refs

### Open an app

```js
await page.launch('com.android.settings');
await new Promise(r => setTimeout(r, 2000));  // wait for app to load
const snapshot = await page.snapshot();
```

Common packages: `com.android.settings`, `com.android.chrome`, `com.google.android.apps.messaging`, `com.google.android.dialer`, `com.android.contacts`

### Deep navigation with intents

```js
await page.intent('android.settings.BLUETOOTH_SETTINGS');
await page.intent('android.settings.WIFI_SETTINGS');
await page.intent('android.settings.DISPLAY_SETTINGS');
await page.intent('android.settings.SOUND_SETTINGS');
await page.intent('android.settings.LOCATION_SOURCE_SETTINGS');
await page.intent('android.settings.AIRPLANE_MODE_SETTINGS');
await page.intent('android.settings.APPLICATION_SETTINGS');

// With extras:
await page.intent('android.intent.action.VIEW', { url: 'https://example.com' });
```

Skip multi-step navigation when you know the intent action.

### Vision fallback (when ARIA tree fails)

```js
const png = await page.screenshot();  // get visual
const grid = await page.grid();       // get grid info
console.log(grid.text);               // "Screen: 1080×2400, Grid: 10 cols (A-J) × 22 rows..."
// Send screenshot + grid.text to vision model
// Model responds: "tap C5"
await page.tapGrid('C5');             // or page.tapXY(x, y)
```

Use this when Flutter apps crash uiautomator, WebView content is invisible, or the snapshot seems wrong.

### Send a message (multi-step)

1. `launch('com.google.android.apps.messaging')`
2. Snapshot → find "Start chat" button → `tap(ref)`
3. Snapshot → find TextInput for "To:" → `type(ref, '5551234567')`
4. Snapshot → find suggestion like "Send to (555) 123-4567" → `tap(ref)`
5. Snapshot → find compose TextInput → `type(ref, 'Hello!')`
6. Snapshot → find "Send SMS" button → `tap(ref)`

Each step: snapshot, read, decide, act. The agent adapts to whatever the UI shows.

### Pick an emoji

1. In compose view, find emoji button (contentDesc contains "emoji") → `tap(ref)`
2. Snapshot → emoji grid appears, each emoji is `View [ref=N] (😀)` with name in contentDesc
3. Tap the emoji ref → it inserts into the TextInput
4. Press back or tap outside to close emoji panel

### Attach a file

1. Find attach/`+` button (contentDesc "Show attach" or "Show more options") → `tap(ref)`
2. Snapshot → options appear: Gallery, Files, Location, etc. → `tap(ref)` for Files
3. System file picker opens → snapshot shows folders and files with refs
4. Navigate to file → `tap(ref)` to select

### Unlock the screen

```js
await page.press('power');                   // wake
await page.swipe(540, 1800, 540, 800, 300);  // swipe up
await page.type(ref, '1234');                // PIN (if needed); ref comes from a fresh snapshot
await page.press('enter');
```

## Gotchas

### Core ADB + Termux ADB (screen control)

**Refs reset every snapshot.** Never store a ref and use it after another snapshot. Always re-read.

**Snapshot takes 1-5 seconds.** uiautomator dump is slow, especially on emulators. Don't snapshot in a tight loop.

**Wait after actions.** UI needs time to settle. Wait 500ms-2s after taps, 2-3s after launching apps.

**Some list items aren't clickable.** Android file picker drawer items and some system UI elements don't have `clickable=true`, so they don't get refs. Use a raw `swipe()` to their coordinates as a fallback.

**WebView content is invisible.** uiautomator can't see inside WebViews. If the snapshot looks empty/shallow in a browser or hybrid app, that's why. Future: CDP bridge.

**Switch/toggle may disappear when off.** Android sometimes removes unchecked Switch/Toggle elements from the accessibility tree. On the Bluetooth page, when BT is off the Switch disappears — only `Text "Use Bluetooth"` remains. No switch present = off. Don't look for `Switch [unchecked]`.

**Toggles have transitional states.** After tapping a system toggle (Bluetooth, WiFi), it briefly shows `[disabled]` while the hardware state changes. Use `waitForText()` or `waitForState()` instead of fixed delays to confirm the action completed.

**HTML entities in text.** Decoded at parse time. `&amp;` → `&`, `&lt;` → `<`, etc. Snapshots show clean text.

**Emojis show as entities in contentDesc.** `View [ref=8] (&#128512;)` means the emoji 😀. The agent can read the Unicode code point or just tap by ref position in the grid.

**type() is word-by-word.** On API 35+, `adb shell input text` is broken for spaces. baremobile splits text into words and injects KEYCODE_SPACE between them. This means typing is slower for long strings. Shell special characters (`& | ; $ ~ # % ^ * { } [ ] ! ?` and quotes) are escaped automatically.

### Termux ADB only

**Wireless debugging drops on reboot.** Must re-enable in Developer Options and re-pair after every device restart. The connection is not persistent.

**Pairing port differs from connect port.** The port shown when tapping "Pair device with pairing code" is NOT the port for `adb connect`. The connect port is shown on the main Wireless debugging screen.

### Termux:API only

**No screen control.** Termux:API cannot read the screen, take snapshots, or tap elements. It provides direct Android API access only (SMS, calls, location, etc.). Use Termux ADB for screen control.

**Commands are blocking.** `termux-*` commands run synchronously. `location()` can take several seconds waiting for a GPS fix. `cameraPhoto()` blocks until capture completes.

**Some commands need a real device.** `smsSend()`, `call()`, `location()` require hardware (SIM card, GPS) that emulators don't have. `batteryStatus()`, `clipboardGet/Set()`, `volumeGet()`, `wifiInfo()`, `vibrate()` work on emulators.


**Termux:API addon must be installed separately.** The `termux-api` package (CLI tools) AND the Termux:API Android app (F-Droid) are both required. Missing the app causes silent failures.

## Termux Setup (on-device control)

baremobile can run inside [Termux](https://termux.dev/) on the phone itself — no USB, no host machine.

### Termux + ADB (full screen control)

```bash
# In Termux:
pkg install android-tools nodejs-lts

# On the phone: Settings → Developer options → Wireless debugging → ON
# Tap "Pair device with pairing code" — note the port + code
adb pair localhost:PORT CODE

# Note the connect port (shown on Wireless debugging screen, different from pairing port)
adb connect localhost:PORT

# Verify
adb devices   # should show localhost:PORT device
```

Then in Node.js:

```js
import { connect } from 'baremobile';

const page = await connect({ termux: true });  // or auto-detects
const snap = await page.snapshot();
```

**Limitations:** Wireless debugging must be re-enabled after every reboot. The pairing code is one-time but the connection drops on reboot.

### Termux:API (direct Android APIs, no ADB)

Install Termux:API addon from F-Droid, then:

```bash
pkg install termux-api
```

```js
import * as api from 'baremobile/src/termux-api.js';

// Check availability
if (await api.isAvailable()) {
  await api.smsSend('5551234', 'Hello from baremobile!');
  const inbox = await api.smsList({ limit: 5, type: 'inbox' });
  await api.call('5551234');
  const loc = await api.location({ provider: 'network' });
  const battery = await api.batteryStatus();
  await api.clipboardSet('copied text');
  const text = await api.clipboardGet();
  await api.notify('Agent', 'Task complete', { sound: true });
  await api.torch(true);  // flashlight on
  await api.vibrate({ duration: 500 });
}
```

Termux:API is **not** screen control — it's direct Android API access. Use it for SMS, calls, location, camera, clipboard. Faster and more reliable than tapping through the UI.
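
Since calls such as `location()` can take several seconds to return, an agent may want to bound how long it waits. The sketch below is one way to do that, assuming the `api.*` functions return promises (as the `await` usage above suggests). `withTimeout` is a hypothetical helper, not part of baremobile, and it only stops waiting: it does not cancel the underlying `termux-*` command.

```js
// Hypothetical helper (not a baremobile API): reject if `promise` takes longer than `ms`.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Possible usage with the module imported above (assumed to be promise-based):
// const loc = await withTimeout(api.location({ provider: 'network' }), 15000, 'location');
```
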

## iOS (WDA-based)

Same `snapshot()` / `tap(ref)` pattern as Android. WDA XML is translated into the shared prune/format pipeline, producing identical YAML output.

### Quick start

```js
import { connect } from 'baremobile/ios';

const page = await connect();
console.log(await page.snapshot());
await page.tap(1);
await page.type(2, 'hello');
await page.launch('com.apple.Preferences');
await page.back();
await page.screenshot();
page.close();
```

### iOS Page Methods

| Method | What it does |
|--------|-------------|
| `page.snapshot()` | Hierarchical YAML (same format as Android) |
| `page.tap(ref)` | Coordinate tap at bounds center |
| `page.type(ref, text, opts)` | Tap to focus + WDA keys. `{clear: true}` to clear first |
| `page.scroll(ref, direction)` | Swipe within element bounds (up/down/left/right) |
| `page.swipe(x1, y1, x2, y2, duration)` | Raw swipe between coordinates |
| `page.longPress(ref)` | Long press at bounds center (1s) |
| `page.tapXY(x, y)` | Tap by pixel coordinates |
| `page.back()` | Find back button in refMap, fallback to swipe-from-left-edge |
| `page.home()` | WDA homescreen |
| `page.launch(bundleId)` | Launch app by bundle ID |
| `page.screenshot()` | PNG buffer |
| `page.waitForText(text, timeout)` | Poll snapshot until text appears |
| `page.press(key)` | `home`, `volumeup`, `volumedown` only |
| `page.unlock(passcode)` | Unlock device (throws if wrong passcode) |
| `page.close()` | Close connection and clean up |

### Key differences from Android

- **Bundle IDs, not package names** — `com.apple.Preferences` not `com.android.settings`
- **No intents** — use `page.launch(bundleId)` for app navigation
- **No grid/tapGrid** — coordinate tap from bounds is reliable
- **Back is semantic** — searches refMap for back button, falls back to swipe gesture
- **press() is limited** — only `home`, `volumeup`, `volumedown`. Use `tap(ref)` for UI buttons.
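
An agent that targets both platforms can encode these differences as data instead of branching throughout its logic. The sketch below is illustrative only: `PRESS_KEYS`, `SETTINGS_APP`, `canPress`, and `openSettings` are hypothetical names, and the key lists are copied from the `press()` documentation in this guide rather than read from baremobile at runtime.

```js
// Hypothetical capability table built from the key lists in this guide (not a baremobile export).
const PRESS_KEYS = {
  android: ['back', 'home', 'enter', 'delete', 'tab', 'escape', 'up', 'down',
    'left', 'right', 'space', 'power', 'volup', 'voldown', 'recent'],
  ios: ['home', 'volumeup', 'volumedown'],
};

// Settings app identifier per platform: package name on Android, bundle ID on iOS.
const SETTINGS_APP = {
  android: 'com.android.settings',
  ios: 'com.apple.Preferences',
};

// True when press(key) is documented for the platform; otherwise prefer tap(ref).
function canPress(platform, key) {
  return (PRESS_KEYS[platform] || []).includes(key);
}

// launch() exists on both page objects, so only the identifier differs.
async function openSettings(page, platform) {
  await page.launch(SETTINGS_APP[platform]);
}
```

With this in place, an agent would check `canPress('ios', 'enter')` before pressing a key and fall back to tapping the on-screen button from a fresh snapshot when it returns `false`.
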

### Requirements

- **WDA on device** — signed with free Apple ID (7-day cert, re-sign weekly)
- **pymobiledevice3** — setup only (tunnel, DDI mount, WDA launch). Python 3.12.
- **USB cable required** — WiFi tunnel needs Mac/Xcode, not possible on Linux
- **Developer Mode on iPhone** — required for developer services

### Setup

```bash
baremobile setup          # interactive wizard — Android (emulator/USB/WiFi/Termux) + iOS
baremobile ios resign     # re-sign WDA when cert expires (every 7 days)
baremobile ios teardown   # kill tunnel/WDA processes
```

## MCP Server

MCP server (`mcp-server.js`) for Claude Code and other MCP clients.

```bash
claude mcp add baremobile -- node /path/to/baremobile/mcp-server.js
```

### Tools (11, dual-platform)

All tools accept optional `platform: "android" | "ios"` (default: android).

| Tool | Params | Returns |
|------|--------|---------|
| `snapshot` | `maxChars?`, `platform?` | YAML tree (or file path if >30K chars) |
| `tap` | `ref`, `platform?` | `'ok'` |
| `type` | `ref`, `text`, `clear?`, `platform?` | `'ok'` |
| `press` | `key`, `platform?` | `'ok'` |
| `scroll` | `ref`, `direction`, `platform?` | `'ok'` |
| `swipe` | `x1`, `y1`, `x2`, `y2`, `duration?`, `platform?` | `'ok'` |
| `long_press` | `ref`, `platform?` | `'ok'` |
| `launch` | `pkg`, `platform?` | `'ok'` |
| `screenshot` | `platform?` | base64 PNG |
| `back` | `platform?` | `'ok'` |
| `find_by_text` | `text`, `platform?` | ref number or null |

Action tools return `'ok'` — call `snapshot` to observe the result. Large snapshots are saved to `.baremobile/screen-{timestamp}.yml` when they exceed `maxChars` (default 30,000). On iOS, a cert warning is prepended to the first snapshot if the cert is more than 6 days old.

## CLI

Session-based control for shell scripting and automation.

### Session lifecycle

```bash
baremobile open [--device=SERIAL] [--platform=android|ios]
baremobile status
baremobile close
```

### Commands

```bash
# Screen
baremobile snapshot      # -> .baremobile/screen-*.yml
baremobile screenshot    # -> .baremobile/screenshot-*.png
baremobile grid          # screen grid info (for vision fallback)

# Interaction
baremobile tap
baremobile tap-xy
baremobile tap-grid
baremobile type [--clear]
baremobile press
baremobile scroll
baremobile swipe [--duration=N]
baremobile long-press
baremobile launch
baremobile intent [--extra-string key=val ...]
baremobile back
baremobile home

# Waiting
baremobile wait-text [--timeout=N]
baremobile wait-state [--timeout=N]

# iOS management
baremobile setup
baremobile ios resign
baremobile ios teardown

# Logging
baremobile logcat [--filter=TAG] [--clear]
```

### Output conventions

All output goes to `.baremobile/` in the current directory. Action commands print `ok`. File-producing commands print the file path. Errors go to stderr with non-zero exit.

### JSON mode (`--json`)

```bash
baremobile open --json       # {"ok":true,"pid":1234,"port":40049}
baremobile snapshot --json   # {"ok":true,"file":"/path/.baremobile/screen-*.yml"}
baremobile tap 4 --json      # {"ok":true}
baremobile status --json     # {"ok":false,"error":"No session found."}
```

Every response has `ok: true|false`. File-producing commands include `file`. Errors include `error`.

## Error Recovery

If an action doesn't seem to work:

1. **waitForText** — use `waitForText('expected text', 5000)` instead of guessing delays
2. **Snapshot again** — the UI may have changed during the action
3. **Screenshot + vision** — `screenshot()` + `grid()` if the ARIA tree looks wrong
4. **Press back** — if stuck in an unexpected state, back out and retry
5. **Home + relaunch** — nuclear option to reset to known state
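
The escalation steps above could be sketched as a single helper. This is an illustration, not baremobile code: `recover` is a hypothetical function, and it assumes `waitForText()` either rejects or resolves to a falsy value when the timeout elapses, which this guide does not specify. The vision fallback (step 3) is left to the caller because it needs a vision model.

```js
// Hypothetical recovery ladder (not a baremobile API).
// Assumption: waitForText() rejects or resolves falsy when `timeout` elapses.
async function recover(page, pkg, expected, timeout = 5000) {
  const seen = async () => {
    try {
      return Boolean(await page.waitForText(expected, timeout));
    } catch {
      return false;
    }
  };

  if (await seen()) return 'ok';        // steps 1-2: wait instead of guessing delays
  await page.back();                    // step 4: back out of an unexpected state
  if (await seen()) return 'back';
  await page.home();                    // step 5: reset to a known state
  await page.launch(pkg);
  if (await seen()) return 'relaunch';
  return 'failed';                      // caller may try screenshot() + grid() next
}
```

The return value tells the agent how far it had to escalate, so it can decide whether to repeat the original action or give up.
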