--- name: agent-browser description: AI代理的浏览器自动化CLI。在用户需要与网站交互时使用，包括导航页面、填写表单、单击按钮、截屏、提取数据、测试Web应用或自动执行任何浏览器任务。触发器包括“打开网站”、“填写表单”、“单击按钮”、“截屏”、“从页面中抓取数据”、“测试此Web应用”、“登录到网站”、“自动浏览器操作”或任何需要编程Web交互的任务。 allowed-tools: Bash(npx agent-browser:*), Bash(agent-browser:*) --- # 使用代理浏览器实现浏览器自动化 ## Core Workflow Every browser automation follows this pattern: 1. **Navigate**: `agent-browser open ` 2. **Snapshot**: `agent-browser snapshot -i` (get element refs like `@e1`, `@e2`) 3. **Interact**: Use refs to click, fill, select 4. **Re-snapshot**: After navigation or DOM changes, get fresh refs ```bash agent-browser open https://example.com/form agent-browser snapshot -i # Output: @e1 [input type="email"], @e2 [input type="password"], @e3 [button] "Submit" agent-browser fill @e1 "user@example.com" agent-browser fill @e2 "password123" agent-browser click @e3 agent-browser wait --load networkidle agent-browser snapshot -i # Check result ``` ## Command Chaining Commands can be chained with `&&` in a single shell invocation. The browser persists between commands via a background daemon, so chaining is safe and more efficient than separate calls. ```bash # Chain open + wait + snapshot in one call agent-browser open https://example.com && agent-browser wait --load networkidle && agent-browser snapshot -i # Chain multiple interactions agent-browser fill @e1 "user@example.com" && agent-browser fill @e2 "password123" && agent-browser click @e3 # Navigate and capture agent-browser open https://example.com && agent-browser wait --load networkidle && agent-browser screenshot page.png ``` **When to chain:** Use `&&` when you don't need to read the output of an intermediate command before proceeding (e.g., open + wait + screenshot). Run commands separately when you need to parse the output first (e.g., snapshot to discover refs, then interact using those refs). ## Essential Commands ```bash # Navigation agent-browser open # Navigate (aliases: goto, navigate) agent-browser close # Close browser # Snapshot agent-browser snapshot -i # Interactive elements with refs (recommended) agent-browser snapshot -i -C # Include cursor-interactive elements (divs with onclick, cursor:pointer) agent-browser snapshot -s "#selector" # Scope to CSS selector # Interaction (use @refs from snapshot) agent-browser click @e1 # Click element agent-browser click @e1 --new-tab # Click and open in new tab agent-browser fill @e2 "text" # Clear and type text agent-browser type @e2 "text" # Type without clearing agent-browser select @e1 "option" # Select dropdown option agent-browser check @e1 # Check checkbox agent-browser press Enter # Press key agent-browser scroll down 500 # Scroll page # Get information agent-browser get text @e1 # Get element text agent-browser get url # Get current URL agent-browser get title # Get page title # Wait agent-browser wait @e1 # Wait for element agent-browser wait --load networkidle # Wait for network idle agent-browser wait --url "**/page" # Wait for URL pattern agent-browser wait 2000 # Wait milliseconds # Capture agent-browser screenshot # Screenshot to temp dir agent-browser screenshot --full # Full page screenshot agent-browser screenshot --annotate # Annotated screenshot with numbered element labels agent-browser pdf output.pdf # Save as PDF # Diff (compare page states) agent-browser diff snapshot # Compare current vs last snapshot agent-browser diff snapshot --baseline before.txt # Compare current vs saved file agent-browser diff screenshot --baseline before.png # Visual pixel diff agent-browser diff url # Compare two pages agent-browser diff url --wait-until networkidle # Custom wait strategy agent-browser diff url --selector "#main" # Scope to element ``` ## Common Patterns ### Form Submission ```bash agent-browser open https://example.com/signup agent-browser snapshot -i agent-browser fill @e1 "Jane Doe" agent-browser fill @e2 "jane@example.com" agent-browser select @e3 "California" agent-browser check @e4 agent-browser click @e5 agent-browser wait --load networkidle ``` ### Authentication with State Persistence ```bash # Login once and save state agent-browser open https://app.example.com/login agent-browser snapshot -i agent-browser fill @e1 "$USERNAME" agent-browser fill @e2 "$PASSWORD" agent-browser click @e3 agent-browser wait --url "**/dashboard" agent-browser state save auth.json # Reuse in future sessions agent-browser state load auth.json agent-browser open https://app.example.com/dashboard ``` ### Session Persistence ```bash # Auto-save/restore cookies and localStorage across browser restarts agent-browser --session-name myapp open https://app.example.com/login # ... login flow ... agent-browser close # State auto-saved to ~/.agent-browser/sessions/ # Next time, state is auto-loaded agent-browser --session-name myapp open https://app.example.com/dashboard # Encrypt state at rest export AGENT_BROWSER_ENCRYPTION_KEY=$(openssl rand -hex 32) agent-browser --session-name secure open https://app.example.com # Manage saved states agent-browser state list agent-browser state show myapp-default.json agent-browser state clear myapp agent-browser state clean --older-than 7 ``` ### Data Extraction ```bash agent-browser open https://example.com/products agent-browser snapshot -i agent-browser get text @e5 # Get specific element text agent-browser get text body > page.txt # Get all page text # JSON output for parsing agent-browser snapshot -i --json agent-browser get text @e1 --json ``` ### Parallel Sessions ```bash agent-browser --session site1 open https://site-a.com agent-browser --session site2 open https://site-b.com agent-browser --session site1 snapshot -i agent-browser --session site2 snapshot -i agent-browser session list ``` ### Connect to Existing Chrome ```bash # Auto-discover running Chrome with remote debugging enabled agent-browser --auto-connect open https://example.com agent-browser --auto-connect snapshot # Or with explicit CDP port agent-browser --cdp 9222 snapshot ``` ### Visual Browser (Debugging) ```bash agent-browser --headed open https://example.com agent-browser highlight @e1 # Highlight element agent-browser record start demo.webm # Record session agent-browser profiler start # Start Chrome DevTools profiling agent-browser profiler stop trace.json # Stop and save profile (path optional) ``` ### Local Files (PDFs, HTML) ```bash # Open local files with file:// URLs agent-browser --allow-file-access open file:///path/to/document.pdf agent-browser --allow-file-access open file:///path/to/page.html agent-browser screenshot output.png ``` ### iOS Simulator (Mobile Safari) ```bash # List available iOS simulators agent-browser device list # Launch Safari on a specific device agent-browser -p ios --device "iPhone 16 Pro" open https://example.com # Same workflow as desktop - snapshot, interact, re-snapshot agent-browser -p ios snapshot -i agent-browser -p ios tap @e1 # Tap (alias for click) agent-browser -p ios fill @e2 "text" agent-browser -p ios swipe up # Mobile-specific gesture # Take screenshot agent-browser -p ios screenshot mobile.png # Close session (shuts down simulator) agent-browser -p ios close ``` **Requirements:** macOS with Xcode, Appium (`npm install -g appium && appium driver install xcuitest`) **Real devices:** Works with physical iOS devices if pre-configured. Use `--device ""` where UDID is from `xcrun xctrace list devices`. ## Diffing (Verifying Changes) Use `diff snapshot` after performing an action to verify it had the intended effect. This compares the current accessibility tree against the last snapshot taken in the session. ```bash # Typical workflow: snapshot -> action -> diff agent-browser snapshot -i # Take baseline snapshot agent-browser click @e2 # Perform action agent-browser diff snapshot # See what changed (auto-compares to last snapshot) ``` For visual regression testing or monitoring: ```bash # Save a baseline screenshot, then compare later agent-browser screenshot baseline.png # ... time passes or changes are made ... agent-browser diff screenshot --baseline baseline.png # Compare staging vs production agent-browser diff url https://staging.example.com https://prod.example.com --screenshot ``` `diff snapshot` output uses `+` for additions and `-` for removals, similar to git diff. `diff screenshot` produces a diff image with changed pixels highlighted in red, plus a mismatch percentage. ## Timeouts and Slow Pages The default Playwright timeout is 60 seconds for local browsers. For slow websites or large pages, use explicit waits instead of relying on the default timeout: ```bash # Wait for network activity to settle (best for slow pages) agent-browser wait --load networkidle # Wait for a specific element to appear agent-browser wait "#content" agent-browser wait @e1 # Wait for a specific URL pattern (useful after redirects) agent-browser wait --url "**/dashboard" # Wait for a JavaScript condition agent-browser wait --fn "document.readyState === 'complete'" # Wait a fixed duration (milliseconds) as a last resort agent-browser wait 5000 ``` When dealing with consistently slow websites, use `wait --load networkidle` after `open` to ensure the page is fully loaded before taking a snapshot. If a specific element is slow to render, wait for it directly with `wait ` or `wait @ref`. ## Session Management and Cleanup When running multiple agents or automations concurrently, always use named sessions to avoid conflicts: ```bash # Each agent gets its own isolated session agent-browser --session agent1 open site-a.com agent-browser --session agent2 open site-b.com # Check active sessions agent-browser session list ``` Always close your browser session when done to avoid leaked processes: ```bash agent-browser close # Close default session agent-browser --session agent1 close # Close specific session ``` If a previous session was not closed properly, the daemon may still be running. Use `agent-browser close` to clean it up before starting new work. ## Ref Lifecycle (Important) Refs (`@e1`, `@e2`, etc.) are invalidated when the page changes. Always re-snapshot after: - Clicking links or buttons that navigate - Form submissions - Dynamic content loading (dropdowns, modals) ```bash agent-browser click @e5 # Navigates to new page agent-browser snapshot -i # MUST re-snapshot agent-browser click @e1 # Use new refs ``` ## Annotated Screenshots (Vision Mode) Use `--annotate` to take a screenshot with numbered labels overlaid on interactive elements. Each label `[N]` maps to ref `@eN`. This also caches refs, so you can interact with elements immediately without a separate snapshot. ```bash agent-browser screenshot --annotate # Output includes the image path and a legend: # [1] @e1 button "Submit" # [2] @e2 link "Home" # [3] @e3 textbox "Email" agent-browser click @e2 # Click using ref from annotated screenshot ``` Use annotated screenshots when: - The page has unlabeled icon buttons or visual-only elements - You need to verify visual layout or styling - Canvas or chart elements are present (invisible to text snapshots) - You need spatial reasoning about element positions ## Semantic Locators (Alternative to Refs) When refs are unavailable or unreliable, use semantic locators: ```bash agent-browser find text "Sign In" click agent-browser find label "Email" fill "user@test.com" agent-browser find role button click --name "Submit" agent-browser find placeholder "Search" type "query" agent-browser find testid "submit-btn" click ``` ## JavaScript Evaluation (eval) Use `eval` to run JavaScript in the browser context. **Shell quoting can corrupt complex expressions** -- use `--stdin` or `-b` to avoid issues. ```bash # Simple expressions work with regular quoting agent-browser eval 'document.title' agent-browser eval 'document.querySelectorAll("img").length' # Complex JS: use --stdin with heredoc (RECOMMENDED) agent-browser eval --stdin <<'EVALEOF' JSON.stringify( Array.from(document.querySelectorAll("img")) .filter(i => !i.alt) .map(i => ({ src: i.src.split("/").pop(), width: i.width })) ) EVALEOF # Alternative: base64 encoding (avoids all shell escaping issues) agent-browser eval -b "$(echo -n 'Array.from(document.querySelectorAll("a")).map(a => a.href)' | base64)" ``` **Why this matters:** When the shell processes your command, inner double quotes, `!` characters (history expansion), backticks, and `$()` can all corrupt the JavaScript before it reaches agent-browser. The `--stdin` and `-b` flags bypass shell interpretation entirely. **Rules of thumb:** - Single-line, no nested quotes -> regular `eval 'expression'` with single quotes is fine - Nested quotes, arrow functions, template literals, or multiline -> use `eval --stdin <<'EVALEOF'` - Programmatic/generated scripts -> use `eval -b` with base64 ## Configuration File Create `agent-browser.json` in the project root for persistent settings: ```json { "headed": true, "proxy": "http://localhost:8080", "profile": "./browser-data" } ``` Priority (lowest to highest): `~/.agent-browser/config.json` < `./agent-browser.json` < env vars < CLI flags. Use `--config ` or `AGENT_BROWSER_CONFIG` env var for a custom config file (exits with error if missing/invalid). All CLI options map to camelCase keys (e.g., `--executable-path` -> `"executablePath"`). Boolean flags accept `true`/`false` values (e.g., `--headed false` overrides config). Extensions from user and project configs are merged, not replaced. ## Deep-Dive Documentation | Reference | When to Use | |-----------|-------------| | [references/commands.md](references/commands.md) | Full command reference with all options | | [references/snapshot-refs.md](references/snapshot-refs.md) | Ref lifecycle, invalidation rules, troubleshooting | | [references/session-management.md](references/session-management.md) | Parallel sessions, state persistence, concurrent scraping | | [references/authentication.md](references/authentication.md) | Login flows, OAuth, 2FA handling, state reuse | | [references/video-recording.md](references/video-recording.md) | Recording workflows for debugging and documentation | | [references/profiling.md](references/profiling.md) | Chrome DevTools profiling for performance analysis | | [references/proxy-support.md](references/proxy-support.md) | Proxy configuration, geo-testing, rotating proxies | ## Ready-to-Use Templates | Template | Description | |----------|-------------| | [templates/form-automation.sh](templates/form-automation.sh) | Form filling with validation | | [templates/authenticated-session.sh](templates/authenticated-session.sh) | Login once, reuse state | | [templates/capture-workflow.sh](templates/capture-workflow.sh) | Content extraction with screenshots | ```bash ./templates/form-automation.sh https://example.com/form ./templates/authenticated-session.sh https://app.example.com/login ./templates/capture-workflow.sh https://example.com ./output ```