---
name: web-scraper
description: >-
  Web scraper with SPA/JavaScript rendering, page interaction, and JS
  execution. Two-tier engine (HTTP → Playwright browser). Smart discovery,
  batch fetch, interactive content extraction, OpenAPI parsing. Use when
  read_url_content fails, SPA rendering needed, or page interaction required.
metadata:
  version: 0.2.0
---

# Web Scraper

> `./scrape [options]` — Run `./scrape help` for full option reference.

## Content Precision Levels

The `fetch` command outputs filtered content by default. Three precision levels:

| Level | Flag | Behavior | When to use |
|-------|------|----------|-------------|
| **Default** | *(none)* | PruningContentFilter removes boilerplate | Most scenarios — good enough |
| **Precision** | `--selector ".css"` | Extract only the matched container | Production-quality docs, zero noise |
| **Raw** | `--raw` | Full page, no filtering | Debugging, page structure analysis |

Precision workflow: fetch a sample page → inspect remaining noise → identify content container via CSS selector → re-fetch all pages with `--selector`.

## Engine Architecture

```
L0: Pure HTTP (httpx + selectolax + markdownify)
    fetch → strip boilerplate → markdownify → clean content

L1: Browser (crawl4ai + Playwright, networkidle wait)
    fetch → [JS interaction] → PruningContentFilter → fit_markdown → clean content
    Handles: SPAs, anti-bot, JS-rendered content, folded/lazy content

Auto mode: L0 probe → detect SPA/empty → fallback to L1

Smart Discovery (3-tier):
    1. sitemap.xml → 2. Nav DOM extraction → 3. Browser deep crawl
```

## Workflow Patterns

### Single page fetch

```bash
# Auto-detects if browser needed
./scrape fetch https://docs.example.com/api/create

# Force browser for known SPA sites
./scrape fetch https://open.feishu.cn/document/... --engine cdp

# Precision extraction with CSS selector
./scrape fetch https://developer.work.weixin.qq.com/document/path/90196 --engine cdp --selector ".ep-doc-area"
```

### Interactive content extraction (SPA/folded content)

```bash
# Auto-expand all collapsed/folded sections (universal preset)
./scrape fetch https://open.feishu.cn/document/... --expand-all --wait 5000

# Scroll through entire page to trigger lazy loading
./scrape fetch https://example.com/infinite-scroll --scroll-full

# Custom JavaScript before extraction
./scrape fetch URL --js "document.querySelectorAll('.expand-btn').forEach(e => e.click())"

# JS from file (for complex interactions)
./scrape fetch URL --js-file /path/to/interact.js --wait 5000

# Wait for specific element before extraction
./scrape fetch URL --wait-for ".api-response-table"

# Combine: expand + custom JS + wait
./scrape fetch URL --expand-all --js "extraAction()" --wait-for ".loaded" -o /tmp/docs/
```

### Execute JavaScript on a page

```bash
# Execute JS and return metadata (no markdown conversion)
./scrape exec URL --js "return document.title"

# Extract data from page's JavaScript state
./scrape exec URL --js "return JSON.stringify(window.__NEXT_DATA__)"

# Execute JS then fetch page as markdown
./scrape exec URL --js "expandAll()" --then-fetch -o /tmp/result.md
```

### Batch-download documentation

```bash
./scrape fetch https://docs.example.com --auto -o /tmp/docs/
./scrape fetch --from-file urls.txt -o /tmp/docs/
./scrape fetch --from-file urls.txt --merge -o /tmp/docs/all.md
```

Files are automatically named by page title (e.g., `读取成员.md`, `Getting_Started.md`).

### Build organized local documentation

1. **Discover** site structure: `./scrape discover --json` → get URLs + titles
2. **Filter** relevant URLs (Agent selects subset based on user's needs)
3. **Write** filtered URLs to a file
4. **Fetch** to output directory: `./scrape fetch --from-file urls.txt -o /path/to/docs/`
5. **Organize** (Agent moves/renames files into logical folder structure)

Steps 1-4 are sketched as a pipeline below.
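A minimal sketch of that pipeline, assuming `jq` is available and that `discover --json` emits an array of objects with a `url` field (the JSON shape and field name are assumptions, not documented output):

```bash
# Step 1: discover site structure as JSON (assumed shape: [{"url": ..., "title": ...}, ...])
./scrape discover https://docs.example.com --json > /tmp/pages.json

# Steps 2-3: filter to the relevant subset, one URL per line
# (the /api/ pattern is illustrative; the Agent picks the real filter)
jq -r '.[].url' /tmp/pages.json | grep '/api/' > /tmp/urls.txt

# Step 4: fetch the filtered subset into the docs directory
./scrape fetch --from-file /tmp/urls.txt -o /path/to/docs/
```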
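### Multi-step session interaction

The `--session` and `--js-only` options (see Key Options below) keep a browser session alive across `exec` calls. A hedged sketch; whether the flags combine exactly this way is an assumption, so confirm with `./scrape help`:

```bash
# Open the page once under an arbitrary session ID
./scrape exec URL --session docs1 --js "document.querySelector('.expand-btn')?.click()"

# Reuse the session: skip navigation, run further JS on the same page
./scrape exec URL --session docs1 --js-only --js "return document.readyState"

# Final step: convert the now-expanded page to markdown
./scrape exec URL --session docs1 --js-only --js "window.scrollTo(0, document.body.scrollHeight)" --then-fetch -o /tmp/result.md
```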
### Analyze site structure

```bash
./scrape discover https://docs.example.com
./scrape discover https://docs.example.com --engine cdp --deep --max-pages 100
```

### Extract OpenAPI specs

```bash
./scrape openapi https://api.example.com/v3/api-docs -o /tmp/api.md
./scrape openapi https://api.example.com/swagger-ui/ -o /tmp/api.md
```

## Key Options

| Option | Commands | Description |
|--------|----------|-------------|
| `--engine MODE` | discover, fetch | `auto` (default), `http` (L0 only), `cdp` (L1 only) |
| `--selector CSS` | fetch, exec | CSS selector for content area (precision mode) |
| `--raw` | fetch, exec | Output full page (skip content filtering) |
| `--js CODE` | fetch, exec | JavaScript to execute before extraction |
| `--js-file PATH` | fetch, exec | JavaScript file to execute before extraction |
| `--wait-for CSS` | fetch | Wait for CSS selector to appear before extraction |
| `--expand-all` | fetch | Auto-expand collapsed/folded sections (universal preset) |
| `--scroll-full` | fetch | Scroll entire page to trigger lazy loading |
| `--then-fetch` | exec | After JS execution, also fetch page as markdown |
| `--session ID` | exec | Reuse browser session for multi-step interactions |
| `--js-only` | exec | Skip navigation, execute JS on existing session |
| `--auto` | fetch | Auto-discover pages then fetch all |
| `--from-file FILE` | fetch | Read URLs from file (one per line) |
| `-o PATH` | fetch, exec, openapi | Output directory or file path |
| `--merge` | fetch | Merge all pages into single markdown |
| `--json` | discover, fetch | Machine-readable JSON output |
| `--summary-only` | fetch | Only show URL, title, char count |
| `--max-pages N` | discover, fetch | Limit pages (discover: 200, fetch: 50) |
| `--deep` | discover | Follow links for browser-based BFS crawl |

## Requirements

- **uv**: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Dependencies and Playwright Chromium auto-installed on first run
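To verify the setup, a trivial first fetch doubles as a smoke test (the target URL is just a placeholder):

```bash
# First run installs dependencies, then fetches one page
./scrape fetch https://example.com -o /tmp/smoke.md
```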