# OCR Text Extraction

The `ocr_screenshot` tool extracts all visible text from a screenshot with tap-ready coordinates. This is useful when accessibility labels are missing or when you need to find text that isn't exposed in the accessibility tree.

> **Note:** Many iOS interaction tools (swipe, text input, accessibility queries) require [IDB](https://github.com/facebook/idb). See the [Platform Setup](../README.md#platform-setup) section for installation instructions.

## Why OCR?

| Approach | Pros | Cons |
|----------|------|------|
| Accessibility tree (`find_element`) | Fast, reliable, low token usage | Only finds elements with accessibility labels |
| Screenshot + Vision | Visual layout understanding | High token usage, slow |
| **OCR** | Works on ANY visible text, returns tap coordinates | Requires text to be visible, may miss small text |

## Usage

```
ocr_screenshot with platform="ios"
```

Returns all visible text with tap-ready coordinates:

```json
{
  "platform": "ios",
  "engine": "cloud",
  "processingTimeMs": 550,
  "elementCount": 24,
  "elements": [
    { "text": "Settings", "confidence": 95, "tapX": 195, "tapY": 52 },
    { "text": "Login", "confidence": 95, "tapX": 187, "tapY": 420 }
  ]
}
```

Then tap the element:

```
tap with x=187 y=420
```

## OCR Engine

OCR uses **Google Cloud Vision API** via a cloud proxy for fast, accurate text recognition (~97%+ accuracy, ~0.5s processing time). This works out of the box with no local dependencies.

Screenshots are sent over HTTPS to our cloud endpoint for processing and immediately deleted after recognition; no images are stored.

## Offline Fallback (EasyOCR)

If the cloud endpoint is unreachable (no internet, timeout), OCR falls back to local EasyOCR (Python-based). This requires Python 3.6+:

```bash
# macOS
brew install python@3.11

# Ubuntu/Debian
sudo apt install python3
```

EasyOCR and its Python dependencies are installed automatically by `node-easyocr`.
The local fallback is slower (~2-3s) and less accurate (~85-90%) but works offline.

## OCR Language Configuration

Google Cloud Vision automatically detects and recognizes text in most languages without configuration. For the offline EasyOCR fallback, set `EASYOCR_LANGUAGES` to add language support:

```bash
EASYOCR_LANGUAGES=es,fr
```

## Recommended Workflow

1. **Use unified `tap`** - Handles fallback chain automatically
2. **Fall back to OCR** - When `tap` suggests using coordinates
3. **Use screenshot** - For visual debugging or layout verification

```
# Simplest approach: tap handles everything
tap with text="Submit"

# If tap fails, use OCR to find coordinates
ocr_screenshot with platform="android"

# Then tap using coordinates from OCR result
tap with x=540 y=1200
```
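
If you post-process the `ocr_screenshot` response in your own script, the lookup step between OCR and `tap` might look roughly like the sketch below. It is only an illustration: the `OcrElement` and `OcrResult` types are inferred from the sample response in the Usage section, `findTapTarget` is a hypothetical helper (not part of this tool), and both the 0-100 confidence scale and the default threshold of 80 are assumptions.

```typescript
// Response shape inferred from the sample JSON in the Usage section.
interface OcrElement {
  text: string;
  confidence: number; // assumed to be on a 0-100 scale, as in the sample
  tapX: number;
  tapY: number;
}

interface OcrResult {
  platform: string;
  engine: string;
  processingTimeMs: number;
  elementCount: number;
  elements: OcrElement[];
}

// Hypothetical helper: case-insensitive substring match on the recognized
// text, skipping low-confidence hits and preferring the most confident one.
function findTapTarget(
  result: OcrResult,
  query: string,
  minConfidence = 80, // arbitrary illustrative threshold
): { x: number; y: number } | null {
  const needle = query.trim().toLowerCase();
  if (needle === "") return null;

  let best: OcrElement | null = null;
  for (const el of result.elements) {
    if (el.confidence < minConfidence) continue;
    if (!el.text.toLowerCase().includes(needle)) continue;
    if (best === null || el.confidence > best.confidence) best = el;
  }
  return best ? { x: best.tapX, y: best.tapY } : null;
}

// Sample data copied from the Usage section (only two elements are listed there).
const sample: OcrResult = {
  platform: "ios",
  engine: "cloud",
  processingTimeMs: 550,
  elementCount: 24,
  elements: [
    { text: "Settings", confidence: 95, tapX: 195, tapY: 52 },
    { text: "Login", confidence: 95, tapX: 187, tapY: 420 },
  ],
};

const target = findTapTarget(sample, "login");
if (target) {
  // Build the follow-up command in the same notation used in this document.
  console.log(`tap with x=${target.x} y=${target.y}`);
}
```

Substring matching is deliberately forgiving, since OCR output may include surrounding characters; use an exact comparison instead if several on-screen labels share the same word.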