---
name: web-scraper
description: Scrape web pages and save as HTML or Markdown (with text and images). Minimal dependencies - only requests and beautifulsoup4. Use when the user provides a URL and wants to download/archive the content locally.
homepage: https://requests.readthedocs.io/
metadata: { "openclaw": { "emoji": "🕷️", "requires": { "bins": ["python3"], "env": [] } } }
---

# Web Scraper

Fetch web page content (text + images) and save it as HTML or Markdown locally.

**Minimal dependencies**: Only requires `requests` and `beautifulsoup4` - no browser automation.

**Default behavior**: Downloads images to a local `images/` directory automatically.

## Quick start

### Single page

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format html --output /tmp/page.html
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --output /tmp/page.md
```

### Recursive (follow links)

```bash
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --max-depth 2 --output ~/Downloads/docs-archive
```

## Setup

Requires Python 3.8+ and minimal dependencies:

```bash
cd {baseDir}
pip install -r requirements.txt
```

Or install manually:

```bash
pip install requests beautifulsoup4
```

**Note**: No browser or driver needed - uses pure HTTP requests.

## Inputs to collect

### Single page mode

- **URL**: The web page to scrape (required)
- **Format**: `html` or `md` (default: `html`)
- **Output path**: Where to save the file (default: current directory with an auto-generated name)
- **Images**: Downloads images by default (use `--no-download-images` to disable)

### Recursive mode (`--recursive`)

- **URL**: Starting point for recursive scraping
- **Format**: `html` or `md`
- **Output directory**: Where to save all scraped pages
- **Max depth**: How many levels deep to follow links (default: 2)
- **Max pages**: Maximum total pages to scrape (default: 50)
- **Domain filter**: Whether to stay within the same domain (default: yes)
- **Images**: Downloads images by default

## Conversation Flow

1. Ask the user for the URL to scrape
2. Ask for the preferred output format (HTML or Markdown)
   - Note: Both formats include text and images by default
   - HTML: Preserves the original structure with downloaded images
   - Markdown: Clean text format with downloaded images in an `images/` folder
3. For recursive mode: ask for max depth and max pages (optional; sensible defaults apply)
4. Ask where to save (or suggest a default path like `/tmp/` or `~/Downloads/`)
5. Run the script and confirm success
6. Show the saved file/directory path

## Examples

### Single Page Scraping

#### Save as HTML

```bash
{baseDir}/scripts/scrape.py --url "https://docs.openclaw.ai/start/quickstart" --format html --output ~/Downloads/openclaw-quickstart.html
```

#### Save as Markdown (with images, default)

```bash
{baseDir}/scripts/scrape.py --url "https://en.wikipedia.org/wiki/Web_scraping" --format md --output ~/Documents/web-scraping.md
```

**Result**: Creates `web-scraping.md` plus an `images/` folder containing all downloaded images.

#### Without downloading images (optional)

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --no-download-images
```

**Result**: Text only; image references keep their original remote URLs (nothing is downloaded locally).
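For reference, the default image handling in Markdown mode follows a standard `requests` + `beautifulsoup4` pattern: download each `<img>`, save it under `images/`, and rewrite the `src` to a relative path, keeping the remote URL if the download fails. The sketch below illustrates that pattern only; `localize_images` is a hypothetical helper, not the script's actual API.

```python
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def localize_images(page_url: str, html: str, out_dir: str = "images") -> str:
    """Download each <img> and rewrite its src to a relative local path."""
    soup = BeautifulSoup(html, "html.parser")
    os.makedirs(out_dir, exist_ok=True)
    for img in soup.find_all("img", src=True):
        absolute = urljoin(page_url, img["src"])  # resolve relative URLs
        name = os.path.basename(urlparse(absolute).path) or "image"
        local_path = os.path.join(out_dir, name)
        try:
            resp = requests.get(absolute, timeout=30)
            resp.raise_for_status()
            with open(local_path, "wb") as f:
                f.write(resp.content)
            img["src"] = local_path  # point at the local copy
        except requests.RequestException:
            pass  # download failed: keep the original remote URL
    return str(soup)
```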
#### Auto-generate filename

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format html
# Saves to: example-com-{timestamp}.html
```

### Recursive Scraping

#### Basic recursive crawl (depth 2, same domain, with images)

```bash
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --output ~/Downloads/docs-archive
```

**Output structure** (text + images for all pages):

```
docs-archive/
├── index.md
├── getting-started.md
├── api/
│   ├── authentication.md
│   └── endpoints.md
└── images/              # Shared images from all pages
    ├── logo.png
    └── diagram.svg
```

#### Deep crawl with custom limits

```bash
{baseDir}/scripts/scrape.py \
  --url "https://blog.example.com" \
  --format html \
  --recursive \
  --max-depth 3 \
  --max-pages 100 \
  --output ~/Archives/blog-backup
```

#### Ignore robots.txt (use with caution)

```bash
{baseDir}/scripts/scrape.py \
  --url "https://example.com" \
  --format md \
  --recursive \
  --no-respect-robots \
  --rate-limit 1.0
```

#### Faster scraping (shorter delay between requests)

```bash
{baseDir}/scripts/scrape.py \
  --url "https://yoursite.com" \
  --format md \
  --recursive \
  --rate-limit 0.2
```

## Features

### Single Page Mode

- **HTML output**: Preserves the original page structure
  - ✅ Clean, readable HTML document
  - ✅ All images downloaded to an `images/` folder
  - ✅ Suitable for offline viewing
- **Markdown output**: Extracts clean text content
  - ✅ **Auto-downloads images** to a local `images/` directory (default)
  - ✅ Converts image URLs to relative paths
  - ✅ Clean, readable format for archiving
  - ✅ Falls back to original URLs if a download fails
  - Use the `--no-download-images` flag to keep original URLs only
- **Simple and fast**: Pure HTTP requests, no browser needed
- **Auto filename**: Generates a safe filename from the URL if none is specified

### Recursive Mode (`--recursive`)

- **✅ Intelligent link discovery**: Automatically follows all links on crawled pages (see the sketch after this list)
- **✅ Depth control**: `--max-depth` limits how many levels deep to crawl (default: 2)
- **✅ Page limit**: `--max-pages` caps total pages to prevent runaway crawls (default: 50)
- **✅ Domain filtering**: `--same-domain` keeps the crawl within the starting domain (default: on)
- **✅ robots.txt compliance**: Respects the site's crawling rules by default
- **✅ Rate limiting**: `--rate-limit` adds a delay between requests (default: 0.5s)
- **✅ Smart URL filtering**: Skips images, scripts, CSS, and duplicate URLs
- **✅ Progress tracking**: Real-time console output with success/fail/skip counts
- **✅ Organized output**: Preserves the URL structure in the directory hierarchy
- **✅ Efficient crawling**: Sequential with rate limiting to respect servers
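Conceptually, recursive mode is a breadth-first crawl with politeness controls: a robots.txt check, a same-domain filter, and a delay between requests. The following is a minimal sketch of that technique; `crawl` and its internals are illustrative assumptions, not the script's actual implementation.

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_depth=2, max_pages=50, rate_limit=0.5):
    """Breadth-first crawl: same-domain, robots.txt-aware, rate-limited."""
    domain = urlparse(start_url).netloc
    robots = RobotFileParser(urljoin(start_url, "/robots.txt"))
    robots.read()

    queue = deque([(start_url, 0)])  # (url, depth)
    seen = {start_url}
    pages = {}

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # disallowed by robots.txt: skip
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # count as a failed page and move on
        pages[url] = resp.text
        time.sleep(rate_limit)  # politeness delay between requests

        if depth >= max_depth:
            continue  # don't follow links past the depth limit
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]  # resolve + drop fragment
            if link.lower().endswith((".png", ".jpg", ".css", ".js")):
                continue  # skip obvious non-HTML resources
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

Sequential fetching with a fixed delay is deliberately simple: it keeps the load on the target server predictable, which is what the rate-limiting default is for.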
## Guardrails

### Single Page Mode

- Respect robots.txt and the site's terms of service
- Some sites may block automated access; this tool uses standard HTTP requests
- Large pages with many images may take time to download

### Recursive Mode

- **Start small**: Test with `--max-depth 1 --max-pages 10` first
- **Respect robots.txt**: On by default; only use `--no-respect-robots` on sites you own
- **Rate limiting**: The 0.5s default is polite; don't go below 0.2s on public sites
- **Same domain**: Strongly recommended to keep `--same-domain` enabled
- **Monitor progress**: Watch for high fail rates (they may indicate blocking)
- **Storage**: Recursive crawls can generate many files; ensure sufficient disk space
- **Legal**: Ensure you have permission to crawl and archive the target site

## Troubleshooting

- **Connection errors**: Check your internet connection and the validity of the URL
- **403/blocked**: Some sites block scrapers; the tool uses realistic User-Agent headers to reduce this (see the sketch below)
- **Timeout**: Increase the `--timeout` flag (value in seconds) for slow-loading pages
- **Image download fails**: Failed images fall back to their original URLs
- **Missing images**: Some sites load images dynamically with JavaScript (not supported)
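For context on the 403 note above, "realistic User-Agent headers" means browser-like request headers sent with each fetch. A minimal sketch of that pattern with `requests` follows; the header values are illustrative assumptions, not the script's actual defaults.

```python
import requests

# Illustrative browser-like headers; the script's actual defaults may differ.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

resp = requests.get("https://example.com", headers=HEADERS, timeout=30)
resp.raise_for_status()
print(resp.status_code, len(resp.text))
```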