---
name: silicon-paddle-ocr
description: OCR skill using PaddleOCR model via SiliconFlow API. This skill should be used when the user asks to "recognize text from an image", "extract text from a photo", "OCR this image", "read text from screenshot", or mentions "PaddleOCR", "image text recognition", "text extraction from images".
license: MIT
metadata:
  author: aotenjou
  version: "1.0.0"
---

# OCR - Image Text Recognition

Use PaddleOCR to extract text content from images. Supports single image or batch processing.

## Overview

This skill provides optical character recognition (OCR) capabilities using the PaddlePaddle/PaddleOCR-VL-1.5 model via the SiliconFlow API. Extract text from JPG, PNG, WebP, BMP, and GIF images.

## When to Use

Invoke this skill when:
- User wants to extract text from an image
- User asks to OCR a screenshot or photo
- User needs to read text from an image file
- User mentions text recognition from images

## How to Use

### Prerequisites

Ensure the `SILICONFLOW_API_KEY` environment variable is set:
```bash
export SILICONFLOW_API_KEY="your_api_key"
```

### Basic Usage

Execute the OCR script:
```bash
python3 scripts/ocr_skill.py [options] image_path
```

### Arguments

| Argument | Description |
|----------|-------------|
| `images` | Image file path(s) or glob pattern (required) |
| `-k, --api-key` | API key (default: from SILICONFLOW_API_KEY env) |
| `-m, --model` | OCR model name (default: PaddlePaddle/PaddleOCR-VL-1.5) |
| `-p, --prompt` | Recognition prompt for custom behavior |
| `-j, --json` | Output results in JSON format |
| `-o, --output` | Save results to specified file |
| `--max-tokens` | Maximum tokens in response (default: 2000) |

### Examples

Single image:
```bash
python3 scripts/ocr_skill.py /path/to/image.jpg
```

Multiple images with glob:
```bash
python3 scripts/ocr_skill.py /path/to/images/*.png
```

JSON output format:
```bash
python3 scripts/ocr_skill.py --json /path/to/image.jpg
```

Custom prompt for table extraction:
```bash
python3 scripts/ocr_skill.py -p "Please identify and format table content as Markdown" /path/to/table.jpg
```

Save to file:
```bash
python3 scripts/ocr_skill.py --json --output results.json /path/to/images/*.jpg
```

### Output Format

**Text output** (default):
```
--- image.jpg ---
识别到的文字内容
识别到 X 处文字区域
```

**JSON output**:
```json
{
  "image.jpg": {
    "image_path": "/path/to/image.jpg",
    "image_size": [width, height],
    "texts": [
      {
        "text": "识别的文字",
        "box": [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]
      }
    ],
    "full_text": "所有文本的组合"
  },
  "image2.png": { ... }
}
```

**Coordinates Explanation:**
- LOC values are normalized coordinates converted to pixel coordinates
- Conversion: pixel = LOC × (image_size / LOC_max_value)
- LOC max_value is approximately 972 (may vary by model/image)
- The `box` field provides the four corner coordinates of each text region in pixel format

## Supported Image Formats

- JPG/JPEG
- PNG
- WebP
- BMP
- GIF

## Error Handling

If processing fails:
- Check that the image file exists
- Verify the SILICONFLOW_API_KEY is valid
- Ensure the API endpoint is reachable

Images that fail to process will show an error message, and other images will continue processing.

## Additional Resources

### Reference Files

- **`references/api-configuration.md`** - API configuration details

### Example Files

- **`examples/sample-usage.sh`** - Example usage script

### Scripts

- **`scripts/ocr_skill.py`** - The main OCR implementation