---
name: gui-automation
description: >
  Agent S integration for autonomous computer control.
  Triggers: GUI, desktop automation, computer use, screen control,
  mouse/keyboard automation, visual task, screenshot-based.
---

# GUI Automation Skill (Agent S Integration)

## Overview

Agent S is the SOTA framework for autonomous computer control:

- **72.6%** on OSWorld (exceeds human ~72%)
- Best Paper Award @ ICLR 2025
- Supports: Linux, macOS, Windows, Android

## Quick Setup

```bash
# Install
pip install gui-agents

# Set API keys
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Optional: OCR server for better accuracy
export OCR_SERVER_ADDRESS="http://localhost:8000"
```

## Basic Usage

```python
import pyautogui
import io
from gui_agents.s3.agents.agent_s import AgentS3
from gui_agents.s3.agents.grounding import OSWorldACI

# Setup grounding agent
grounding_agent = OSWorldACI(
    engine_type="anthropic",
    model="claude-sonnet-4-20250514",
    grounding_width=1920,
    grounding_height=1080
)

# Setup main agent
agent = AgentS3(
    engine_type="anthropic",
    model="claude-opus-4-20250514",
    grounding_agent=grounding_agent,
    enable_reflection=True,
    enable_memory=True
)

# Capture screen
screenshot = pyautogui.screenshot()
buffered = io.BytesIO()
screenshot.save(buffered, format="PNG")
obs = {"screenshot": buffered.getvalue()}

# Execute task
instruction = "Open VS Code and create new Python file"
info, action = agent.predict(instruction=instruction, observation=obs)

# SECURITY: Never use exec(action[0]) directly - it runs arbitrary code!
# Use the safe Agent S runner instead:
agent.execute_action(action[0])  # Sandboxed execution
```

## CLI Usage

```bash
# Interactive mode
agent_s

# Single task
agent_s --task "Find weather in Prague"
```

## Architecture

```
┌─────────────────────────────────────────────────────┐
│                      Agent S3                       │
├─────────────────────────────────────────────────────┤
│  Planning       │  Memory Search  │  Reflection     │
│  (GPT-5/Opus)   │  (Past tasks)   │  (Self-fix)     │
├─────────────────────────────────────────────────────┤
│            Grounding Agent (OSWorldACI)             │
│        Converts instructions → screen coords        │
├─────────────────────────────────────────────────────┤
│  PyAutoGUI  │  Selenium  │  OCR Server  │  Vision   │
└─────────────────────────────────────────────────────┘
```

## Model Configuration

| Component | Recommended | Alternative |
|-----------|-------------|-------------|
| Main Agent | claude-opus-4 | gpt-5 |
| Grounding | UI-TARS-1.5-7B | claude-sonnet |
| Reflection | claude-sonnet | gpt-4o |

## Security

⚠️ **Critical warnings:**

- Agent executes Python code with YOUR permissions
- Use in VM/sandbox for untrusted tasks
- Single monitor only
- Don't run on production machines

## Gotchas

1. **Conda breaks pyatspi on Linux** - install without virtual env
2. **Tesseract required** - `brew install tesseract` (macOS)
3. **Screen resolution matters** - set grounding_width/height correctly
4. **Rate limits** - OpenAI/Anthropic limits apply

## Local Code Execution

```python
# Enable local code agent (dangerous!)
agent = AgentS3(
    # ... other params
    enable_local_code=True  # Allows Python/Bash execution
)
```

Use for:

- Data processing (CSV, Excel)
- File operations
- System automation
- Code development

## For Details

- `references/agent-s-api.md` - Full API reference
- `scripts/setup_agent_s.sh` - Installation script
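## Sketch: Vetting an Action Before Running It

The security notes above warn that the agent's output is Python code that runs with your permissions. One way to add a guard is to parse the action string before running it and reject anything that is not a plain `pyautogui` call. The sketch below is an illustration only and is not part of Agent S. It assumes the action is Python source such as `pyautogui.click(100, 200)`. The function name `is_safe_action` and the `ALLOWED_CALLS` allowlist are hypothetical. Real action strings may contain other statements (imports, waits), so the allowlist would need adjusting to the format you actually observe.

```python
import ast

# Hypothetical allowlist: pyautogui functions this sketch treats as acceptable.
ALLOWED_CALLS = {"click", "doubleClick", "moveTo", "write", "press", "hotkey", "scroll"}


def is_safe_action(source: str) -> bool:
    """Return True only if `source` consists solely of allowlisted pyautogui calls.

    Assumption: the action is Python source like "pyautogui.click(100, 200)".
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False

    # An empty action gives nothing to vet, so treat it as unsafe.
    if not tree.body:
        return False

    for stmt in tree.body:
        # Every top-level statement must be a bare call expression.
        if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call)):
            return False
        call = stmt.value
        func = call.func
        # The callee must be exactly pyautogui.<allowlisted name>.
        if not (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "pyautogui"
            and func.attr in ALLOWED_CALLS
        ):
            return False
        # Arguments must be plain literals, so nested calls such as
        # pyautogui.write(open(...).read()) are rejected.
        values = list(call.args) + [kw.value for kw in call.keywords]
        for value in values:
            try:
                ast.literal_eval(value)
            except (ValueError, TypeError):
                return False
    return True


if __name__ == "__main__":
    for candidate in (
        "pyautogui.click(100, 200)",
        "import os; os.system('ls')",
    ):
        print(candidate, "->", is_safe_action(candidate))
```

A caller would run an action only when `is_safe_action` returns `True` and skip or log it otherwise. A check like this narrows what can run but does not replace the VM/sandbox advice in the Security section.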