--- name: web-scraper description: Web scraping and data extraction from websites. Use when users need to extract data from web pages, scrape multiple URLs, gather structured information from websites, or collect data in bulk from online sources. Supports single-page scraping, multi-page crawling, and handling dynamic JavaScript-rendered content. --- # Web Scraper Skill Extract data from websites using deterministic Python scripts. ## Quick Start 1. **Single URL**: Run `scripts/scrape_single.py` with a URL 2. **Multiple URLs**: Run `scripts/scrape_batch.py` with a file of URLs 3. **Dynamic content**: Use `--browser` flag for JavaScript-rendered pages ## Scripts ### scrape_single.py Extract content from a single URL. ```bash python scripts/scrape_single.py "https://example.com" --output .tmp/result.json ``` Options: - `--output`: Output file path (JSON or CSV) - `--selector`: CSS selector to target specific content - `--browser`: Use headless browser for JavaScript content - `--wait`: Seconds to wait for page load (browser mode) ### scrape_batch.py Scrape multiple URLs from a file. ```bash python scripts/scrape_batch.py urls.txt --output .tmp/results.json --delay 1 ``` Options: - `--output`: Output file path - `--delay`: Delay between requests (seconds) - `--browser`: Use headless browser - `--max-concurrent`: Max parallel requests (default: 5) ## Output Format Results are saved as JSON with this structure: ```json { "url": "https://example.com", "title": "Page Title", "text": "Extracted text content...", "links": ["https://..."], "metadata": {"scraped_at": "2024-01-01T12:00:00Z"} } ``` ## Error Handling - **Rate limits**: Scripts auto-retry with exponential backoff - **Timeouts**: Default 30s, configurable via `--timeout` - **Failed URLs**: Logged to `{output}_errors.json` ## When to Create New Scripts Create a new script when: - Existing scripts don't support required output format - Site requires specific authentication flow - Complex multi-step navigation is needed Follow the self-annealing pattern: fix issues, update scripts, update this skill.