--- name: omni-vu description: Screen capture, AI vision analysis, and GUI automation for macOS. Use when you need to see what's on screen, analyze UI state, detect changes, or automate mouse/keyboard actions. --- # omni.vu - Visual Understanding & Automation ## Overview omni.vu gives you eyes and hands on the user's macOS screen. Use it to: - **See** what the user sees (screen capture) - **Understand** UI state with AI vision - **Detect** changes and wait for events - **Act** with mouse and keyboard automation ## When to Use This Skill ### Proactively Use omni.vu When: 1. **Debugging UI issues** - "The button isn't working" → capture and analyze 2. **Verifying changes** - After modifying UI code, check if it rendered correctly 3. **Waiting for operations** - Build finishing, deployment completing, tests running 4. **Understanding context** - User describes something on screen you can't see 5. **Automating repetitive tasks** - Clicking through UI flows, filling forms 6. **Documentation** - Capturing screenshots of features ### Do NOT Use When: - Reading/writing files (use Read/Write tools) - Running terminal commands (use Bash) - Making API calls (use appropriate tools) - User hasn't granted screen recording permission ## Tool Reference ### Capture Tools #### `vu` - Full Screen Capture ``` vu(monitor=0, save_to_history=True) ``` Captures the entire screen. Returns base64 image. **Use when**: You need to see everything on screen. #### `vu_window` - Window Capture ``` vu_window(window_id=None, title="VS Code", include_frame=False) ``` Captures a specific window by ID or title (partial match). **Use when**: You only need one application's content. **Workflow**: 1. Call `vu_list_windows` to see available windows 2. Find the window_id or use title matching 3. Call `vu_window` with that ID/title #### `vu_region` - Region Capture ``` vu_region(x=100, y=100, width=500, height=300) ``` Captures a specific rectangle of the screen. **Use when**: You need a precise area (error message, specific component). #### `vu_list_windows` - List Windows ``` vu_list_windows(filter_app="Chrome") ``` Lists all visible windows with metadata. **Returns**: List of `{window_id, title, owner, x, y, width, height}` #### `vu_list_monitors` - List Displays ``` vu_list_monitors() ``` Lists connected displays with resolution and scale factor. ### Vision Tools #### `vu_describe` - AI Vision Analysis ``` vu_describe( prompt="What errors are visible?", provider="claude", # or openai, gemini, ollama max_tokens=1024, capture_first=True ) ``` Captures screen and analyzes with AI. **Best prompts**: - "Describe what you see on screen" - "Are there any error messages visible?" - "What is the state of the build/test output?" - "Is the login form filled correctly?" - "What color is the status indicator?" **Providers**: | Provider | Speed | Quality | Cost | |----------|-------|---------|------| | claude | Medium | Excellent | $$ | | openai | Fast | Very Good | $$ | | gemini | Fast | Good | $ | | ollama | Slow | Varies | Free | ### Utility Tools #### `vu_diff` - Change Detection ``` vu_diff(threshold=0.02, monitor=0) ``` Detects if screen changed since last capture. **Returns**: `{changed: bool, diff_percentage: float}` **Use for polling patterns**: ```python # Wait for build to finish while True: result = vu_diff(threshold=0.05) if result["changed"]: # Screen changed, check what happened analysis = vu_describe(prompt="Did the build succeed or fail?") break # Wait before checking again ``` #### `vu_history` - Capture History ``` vu_history(limit=10, capture_type="full_screen") ``` Gets recent capture metadata. #### `vu_last` - Last Capture ``` vu_last() ``` Returns the most recent capture with image data. **Use when**: You need to re-analyze without re-capturing. #### `vu_status` - System Status ``` vu_status() ``` Returns system info: monitors, providers, safety settings. ### Automation Tools #### `vu_click` - Mouse Click ``` vu_click(x=500, y=300, button="left", count=1) ``` Clicks at screen coordinates. **Buttons**: `left`, `right`, `middle` **Count**: 1 (single), 2 (double), 3 (triple) **Safety**: Coordinates validated against screen bounds. #### `vu_move` - Move Cursor ``` vu_move(x=500, y=300, duration_ms=0) ``` Moves cursor to position. Use `duration_ms` for animated movement. #### `vu_drag` - Drag Operation ``` vu_drag(start_x=100, start_y=100, end_x=500, end_y=300, duration_ms=500) ``` Drags from start to end position. **Safety**: Maximum 2000px drag distance. #### `vu_scroll` - Scroll ``` vu_scroll(direction="down", amount=3, x=None, y=None) ``` Scrolls at current or specified position. **Directions**: `up`, `down`, `left`, `right` **Amount**: Lines to scroll (1-20) #### `vu_type` - Type Text ``` vu_type(text="Hello World", delay_between_ms=0) ``` Types text character by character. **Note**: Click on target field first! #### `vu_hotkey` - Keyboard Shortcut ``` vu_hotkey(keys="cmd+s") ``` Executes keyboard shortcuts. **Format**: `modifier+key` (e.g., `cmd+c`, `ctrl+shift+s`, `cmd+opt+i`) **Modifiers**: `cmd`, `ctrl`, `alt`/`opt`, `shift` **Blocked hotkeys** (for safety): - `cmd+q` (quit) - `cmd+shift+q` (logout) - `cmd+opt+esc` (force quit) ## Common Workflows ### 1. Debug UI Issue ``` User: "The submit button doesn't do anything when I click it" 1. vu_describe(prompt="Describe the form and submit button state") 2. Analyze: Is button disabled? Is there validation error? 3. If needed: vu_window(title="Chrome") for focused capture 4. Check console: vu_describe(prompt="Are there any errors in the developer console?") ``` ### 2. Verify Build/Deploy ``` User: "Run the build and let me know when it's done" 1. Run: npm run build (via Bash) 2. Loop: vu_diff(threshold=0.05) 3. When changed: vu_describe(prompt="What is the build status? Did it succeed or fail?") 4. Report result ``` ### 3. Fill a Form ``` User: "Fill out the registration form with test data" 1. vu_describe(prompt="What form fields are visible?") 2. vu_click(x=field_x, y=field_y) # Click first field 3. vu_type(text="testuser@example.com") 4. vu_hotkey(keys="tab") # Move to next field 5. vu_type(text="Test User") 6. Continue for each field... 7. vu_click(x=submit_x, y=submit_y) # Submit 8. vu_describe(prompt="Was the form submitted successfully?") ``` ### 4. Navigate UI ``` User: "Open the settings panel in VS Code" 1. vu_list_windows(filter_app="Code") 2. vu_window(title="Code") # Capture VS Code 3. vu_hotkey(keys="cmd+,") # Open settings 4. vu_diff() # Wait for settings to open 5. vu_describe(prompt="Are the VS Code settings now visible?") ``` ### 5. Monitor for Changes ``` User: "Watch the dashboard and tell me when the status changes" 1. vu() # Initial capture 2. Loop every 5 seconds: - vu_diff(threshold=0.02) - If changed: vu_describe(prompt="What changed on the dashboard?") 3. Report changes to user ``` ## Best Practices ### 1. Capture Before Acting Always capture and analyze before clicking/typing: ``` ❌ Bad: vu_click(x=500, y=300) # Hope it's the right spot ✅ Good: 1. vu_describe(prompt="Where is the submit button?") 2. Parse coordinates from description 3. vu_click(x=parsed_x, y=parsed_y) ``` ### 2. Use Appropriate Capture Scope ``` Full screen → General context, multi-app workflows Window → Single app focus, cleaner output Region → Specific element, faster/smaller ``` ### 3. Specific Vision Prompts ``` ❌ Vague: "What do you see?" ✅ Specific: - "Is there an error message visible? If so, what does it say?" - "What is the status of the test runner? Passing, failing, or running?" - "List all form fields visible and their current values" ``` ### 4. Rate Limit Automation Don't spam clicks. Wait between actions: ``` vu_click(x=100, y=100) # Wait for UI response vu_diff(threshold=0.01) vu_click(x=200, y=200) ``` ### 5. Verify After Actions Always verify automation succeeded: ``` vu_type(text="hello@example.com") vu_describe(prompt="What text is now in the email field?") ``` ## Troubleshooting ### "Permission denied" or blank captures → Grant Screen Recording permission: System Settings > Privacy > Screen Recording ### Automation not working → Grant Accessibility permission: System Settings > Privacy > Accessibility ### Vision analysis failing → Check API keys in environment variables → Try different provider: `vu_describe(provider="openai")` ### Coordinates seem off → Retina display? Coordinates are in logical points, not pixels → Multi-monitor? Check `vu_list_monitors()` for display arrangement ### Hotkey blocked → Some dangerous hotkeys (cmd+q) are blocked for safety → Unblock in config if absolutely necessary ## Configuration Environment variables: ```bash OMNI_VU_ANTHROPIC_API_KEY=sk-ant-... # For Claude vision OMNI_VU_OPENAI_API_KEY=sk-... # For GPT-4 vision OMNI_VU_DEFAULT_PROVIDER=claude # Default AI provider OMNI_VU_SAFETY_LEVEL=medium # low/medium/high ``` History stored at: `~/.omni.vu/captures/`