---
name: web-scraping
description: This skill activates for web scraping and Actor development. It proactively discovers APIs via traffic interception, recommends the optimal strategy (traffic interception, sitemap, API, DOM scraping, or hybrid), and implements iteratively. For production, it guides TypeScript Actor creation via the Apify CLI.
license: MIT
---

# Web Scraping with Intelligent Strategy Selection

## When This Skill Activates

Activate automatically when the user requests:

- "Scrape [website]"
- "Extract data from [site]"
- "Get product information from [URL]"
- "Find all links/pages on [site]"
- "I'm getting blocked" or "Getting 403 errors" (loads `strategies/anti-blocking.md`)
- "Make this an Apify Actor" (loads `apify/` subdirectory)
- "Productionize this scraper"

## Input Parsing

Determine reconnaissance depth from the user request:

| User Says | Mode | Phases Run |
|-----------|------|------------|
| "quick recon", "just check", "what framework" | Quick | Phase 0 only |
| "scrape X", "extract data from X" (default) | Standard | Phases 0-3 + 5; Phase 4 only if protection signals detected |
| "full recon", "deep scan", "production scraping" | Full | All phases (0-5), including protection testing |

Default is Standard mode. Escalate to Full if protection signals appear during any phase.

## Adaptive Reconnaissance Workflow

This skill uses an adaptive phased workflow with quality gates. Each gate asks **"Do I have enough?"** and continues to the next phase only when the answer is no.

**See**: `strategies/framework-signatures.md` for the framework detection tables referenced throughout.

### Phase 0: QUICK ASSESSMENT (curl, no browser)

Gather maximum intelligence at minimum cost: a single HTTP request.

**Step 0a: Fetch raw HTML and headers**

```bash
curl -s -D- -L "https://target.com/page" -o response.html
```

**Step 0b: Check response headers**

- Match headers against `strategies/framework-signatures.md` → Response Header Signatures table
- Note `Server`, `X-Powered-By`, `X-Shopify-Stage`, `Set-Cookie` (protection markers)
- Check the HTTP status code (200 = accessible, 403 = protected, 3xx = redirects)

**Step 0c: Check the Known Major Sites table**

- Match the domain against `strategies/framework-signatures.md` → Known Major Sites
- If matched: use the specified data strategy and skip generic pattern scanning

**Step 0d: Detect the framework from HTML**

- Search the raw HTML for signatures in `strategies/framework-signatures.md` → HTML Signatures table
- Look for `__NEXT_DATA__`, `__NUXT__`, `ld+json`, `/wp-content/`, `data-reactroot`

**Step 0e: Search for target data points**

- For each data point the user wants: search the raw HTML for that content
- Track which data points are found vs. missing
- Check for sitemaps: `curl -s https://[site]/robots.txt | grep -i Sitemap`

**Step 0f: Note protection signals**

- 403/503 status, Cloudflare challenge HTML, CAPTCHA elements, `cf-ray` header
- Record for the Phase 4 decision

**See**: `strategies/cheerio-vs-browser-test.md` for the Cheerio viability assessment

> **QUALITY GATE A**: All target data points found in raw HTML + no protection signals?
> → YES: Skip to Phase 3 (Validate Findings). No browser needed.
> → NO: Continue to Phase 1.
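To make Steps 0a-0d concrete, here is a minimal sketch of the quick assessment as a standalone script, assuming Node 18+ with built-in `fetch`; the `TARGET` URL and the trimmed signature table are illustrative placeholders, not the full tables in `strategies/framework-signatures.md`.

```javascript
// Minimal Phase 0 sketch: one request, then header + HTML signature checks.
// Run as an ES module on Node 18+ (e.g. `node phase0.mjs`). TARGET is a placeholder.
const TARGET = 'https://target.com/page';

const res = await fetch(TARGET, { redirect: 'follow' });
const html = await res.text();

// Step 0b: status and notable headers (protection markers included)
console.log('status:', res.status); // 200 = accessible, 403 = protected
for (const name of ['server', 'x-powered-by', 'x-shopify-stage', 'set-cookie', 'cf-ray']) {
  const value = res.headers.get(name);
  if (value) console.log(`${name}: ${value}`);
}

// Step 0d: framework signatures in raw HTML (small subset of the full table)
const signatures = {
  'Next.js': '__NEXT_DATA__',
  'Nuxt': '__NUXT__',
  'WordPress': '/wp-content/',
  'React': 'data-reactroot',
  'JSON-LD': 'ld+json',
};
for (const [framework, marker] of Object.entries(signatures)) {
  if (html.includes(marker)) console.log('signature hit:', framework);
}
```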
### Phase 1: BROWSER RECONNAISSANCE (only if Phase 0 needs it)

Launch a browser only for data points missing from the raw HTML, or when JavaScript rendering is required.

**Step 1a: Initialize browser session**

- `proxy_start()` → Start traffic interception proxy
- `interceptor_chrome_launch(url, stealthMode: true)` → Launch Chrome with anti-detection
- `interceptor_chrome_devtools_attach(target_id)` → Attach DevTools bridge
- `interceptor_chrome_devtools_screenshot()` → Capture visual state

**Step 1b: Capture traffic and rendered DOM**

- `proxy_list_traffic()` → Review all traffic from the page load
- `proxy_search_traffic(query: "application/json")` → Find JSON responses
- `interceptor_chrome_devtools_list_network(resource_types: ["xhr", "fetch"])` → XHR/fetch calls
- `interceptor_chrome_devtools_snapshot()` → Accessibility tree (rendered DOM)

**Step 1c: Search the rendered DOM for missing data points**

- For each data point NOT found in Phase 0: search the rendered DOM
- Use the framework-specific search strategy from `strategies/framework-signatures.md` → Framework → Search Strategy table
- Only search patterns relevant to the detected framework

**Step 1d: Inspect discovered endpoints**

- `proxy_get_exchange(exchange_id)` → Full request/response for promising endpoints
- Document: method, headers, auth, response structure, pagination

> **QUALITY GATE B**: All target data points now covered (raw HTML + rendered DOM + traffic)?
> → YES: Skip to Phase 3 (Validate Findings). No deep scan needed.
> → NO: Continue to Phase 2 for the missing data points only.

### Phase 2: DEEP SCAN (only for missing data points)

Targeted investigation for data points not yet found. Only search for what's missing.

**Step 2a: Test interactions for missing data**

- `proxy_clear_traffic()` before each action → Isolate API calls
- `humanizer_click(target_id, selector)` → Trigger dynamic content loads
- `humanizer_scroll(target_id, direction, amount)` → Trigger lazy loading / infinite scroll
- `humanizer_idle(target_id, duration_ms)` → Wait for delayed content
- After each action: `proxy_list_traffic()` → Check for new API calls

**Step 2b: Sniff APIs (framework-aware)**

- Search only the patterns relevant to the detected framework:
  - Next.js → `proxy_list_traffic(url_filter: "/_next/data/")`
  - WordPress → `proxy_list_traffic(url_filter: "/wp-json/")`
  - GraphQL → `proxy_search_traffic(query: "graphql")`
  - Generic → `proxy_list_traffic(url_filter: "/api/")` + `proxy_search_traffic(query: "application/json")`
- Skip patterns that don't apply to the detected framework

**Step 2c: Test pagination and filtering**

- Only if pagination is a missing data point or needed for coverage assessment
- `proxy_clear_traffic()` → click next page → `proxy_list_traffic(url_filter: "page=")`
- Document the pagination type (URL-based, API offset, cursor, infinite scroll)

> **QUALITY GATE C**: Enough data points covered for a useful report?
> → YES: Go to Phase 3.
> → NO: Document the gaps and go to Phase 3 anyway (the report will note missing data in the self-critique).
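One concrete case of the framework-specific search in Step 1c: on a Next.js site, the rendered data usually sits in the `__NEXT_DATA__` script tag, so missing data points can often be searched as JSON rather than DOM. A minimal sketch, assuming `cheerio` is installed; the URL, the wanted keys, and the `findPaths` helper are hypothetical.

```javascript
// Sketch: framework-specific search for Next.js pages (Step 1c).
// Run as an ES module on Node 18+; `cheerio` must be installed.
import * as cheerio from 'cheerio';

const html = await (await fetch('https://target.com/products/123')).text();
const $ = cheerio.load(html);

// Next.js embeds the page's props payload as JSON in this script tag.
const raw = $('script#__NEXT_DATA__').html();
if (!raw) throw new Error('No __NEXT_DATA__: page may not be Next.js');
const nextData = JSON.parse(raw);

// Recursively collect JSON paths whose key matches a wanted data point.
function findPaths(node, wanted, path = '', hits = []) {
  if (node && typeof node === 'object') {
    for (const [key, value] of Object.entries(node)) {
      const p = path ? `${path}.${key}` : key;
      if (wanted.some((w) => key.toLowerCase().includes(w))) hits.push([p, value]);
      findPaths(value, wanted, p, hits);
    }
  }
  return hits;
}

// e.g. the user wants price and title
console.log(findPaths(nextData.props, ['price', 'title']).slice(0, 10));
```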
### Phase 3: VALIDATE FINDINGS

Every claimed extraction method must be verified. A data point is not "found" until its extraction path is specified and tested.

**See**: `strategies/cheerio-vs-browser-test.md` for the validation methodology

**Step 3a: Validate CSS selectors**

- For each Cheerio/selector-based method: confirm the selector matches the actual HTML
- Test against the raw HTML (curl output) or the rendered DOM (snapshot)
- Confirm the selector extracts the correct value, not a different element

**Step 3b: Validate JSON paths**

- For each JSON extraction (e.g., `__NEXT_DATA__`, API response): confirm the path resolves
- Parse the JSON, follow the path, and verify it returns the expected data type and value

**Step 3c: Validate API endpoints**

- For each discovered API: replay the request (curl or `proxy_get_exchange`)
- Confirm: response status 200, expected data structure, correct values
- Test pagination if claimed (at least page 1 and page 2)

**Step 3d: Downgrade or re-investigate failures**

- If a selector doesn't match: try alternative selectors, or downgrade to PARTIAL confidence
- If an API returns 403: note the protection requirement and flag it for Phase 4
- If a JSON path is wrong: re-examine the JSON structure and correct the path

### Phase 4: PROTECTION TESTING (conditional)

**See**: `strategies/proxy-escalation.md` for the complete skip/run decision logic

**Skip Phase 4 when ALL are true**:

- No protection signals detected in Phases 0-2
- All data points have validated extraction methods
- The user didn't request "full recon"

**Run Phase 4 when ANY is true**:

- A 403/challenge page was observed during any phase
- Known high-protection domain
- High-volume or production intent
- The user explicitly requested it

**If running**:

**Step 4a: Test raw HTTP access**

```bash
curl -s -o /dev/null -w "%{http_code}" "https://target.com/page"
```

- 200 → Cheerio viable; no browser needed for accessible endpoints
- 403/503 → Escalate to a stealth browser

**Step 4b: Test with a stealth browser** (if needed)

- Already running from Phase 1: check whether pages loaded without challenges
- `interceptor_chrome_devtools_list_cookies(domain_filter: "cloudflare")` → Protection cookies
- `interceptor_chrome_devtools_list_storage_keys(storage_type: "local")` → Fingerprint markers
- `proxy_get_tls_fingerprints()` → TLS fingerprint analysis

**Step 4c: Test with an upstream proxy** (if needed)

- `proxy_set_upstream("http://user:pass@proxy-provider:port")`
- Re-test blocked endpoints through the proxy
- Document the minimum access level for each data point

**Step 4d: Document the protection profile**

- What protections exist, what worked to bypass them, and what production scrapers will need
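Steps 3b and 3c can be mechanized with a short replay script. The sketch below uses `got-scraping` (the same client used in the patterns later in this file) to re-fetch a discovered endpoint and check a claimed JSON path; the endpoint, path, and `resolvePath` helper are hypothetical examples, not part of the skill's tooling.

```javascript
// Sketch: validate a discovered API endpoint and a claimed JSON path (Steps 3b-3c).
// Assumes `got-scraping` is installed; URL and path are hypothetical.
import { gotScraping } from 'got-scraping';

// Follow a dotted path like "product.price.amount" through a parsed response.
function resolvePath(obj, path) {
  return path.split('.').reduce((node, key) => node?.[key], obj);
}

const claim = {
  url: 'https://api.example.com/products/123', // from the recon report
  path: 'product.price.amount',                // claimed extraction path
  expectedType: 'number',
};

const response = await gotScraping({ url: claim.url, responseType: 'json' });
const value = resolvePath(response.body, claim.path);

if (response.statusCode !== 200) {
  console.log('NO: endpoint returned', response.statusCode); // 403 → flag for Phase 4
} else if (typeof value !== claim.expectedType) {
  console.log('PARTIAL: path resolved to', value); // re-examine the JSON structure
} else {
  console.log('YES: validated', claim.path, '=', value);
}
```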
### Phase 5: REPORT + SELF-CRITIQUE

Generate the intelligence report, then critically review it for gaps.

**See**: `reference/report-schema.md` for the complete report format

**Step 5a: Generate the report**

- Follow the `reference/report-schema.md` schema (Sections 1-6)
- Include a `Validated?` status for every strategy (YES / PARTIAL / NO)
- Include all discovered endpoints with full specs

**Step 5b: Self-critique**

- Write Section 7 (Self-Critique) per `reference/report-schema.md`:
  - **Gaps**: Data points not found: why, and what would find them
  - **Skipped steps**: Which phases were skipped, with the quality-gate reasoning
  - **Unvalidated claims**: Anything marked PARTIAL or NO
  - **Assumptions**: Things not verified (e.g., "consistent layout across categories")
  - **Staleness risk**: Geo-dependent prices, A/B layouts, session-specific content
  - **Recommendations**: Targeted next steps (not "re-run everything")

**Step 5c: Fix gaps with targeted re-investigation**

- If the self-critique reveals fixable gaps: go back to the specific phase/step, not a full re-run
- Example: "Price selector untested" → run one curl + parse; don't re-launch the browser
- Update the report with the results

**Step 5d: Record the session** (if a browser was used)

- `proxy_session_start(name)` → `proxy_session_stop(session_id)` → `proxy_export_har(session_id, path)`
- The HAR file captures all traffic for replay. See `strategies/session-workflows.md`

---

### IMPLEMENTATION (after reconnaissance)

After the reconnaissance report is accepted, implement the scraper iteratively.

**Core Pattern** (see the sketch after the productionization section below):

1. Implement the recommended approach (minimal code)
2. Test with a small batch (5-10 items)
3. Validate data quality
4. Scale to the full dataset, or fall back
5. Handle blocking if encountered
6. Add robustness (error handling, retries, logging)

**See**: `workflows/implementation.md` for complete implementation patterns and code examples

### PRODUCTIONIZATION (on request)

Convert the scraper to a production-ready Apify Actor.

**Activation triggers**: "Make this an Apify Actor", "Productionize this", "Deploy to Apify"

**Core Pattern**:

1. Confirm TypeScript preference (STRONGLY RECOMMENDED)
2. Initialize with the `apify create` command (CRITICAL)
3. Port the scraping logic to Actor format
4. Test locally and deploy

**Note**: During development, proxy-mcp provides reconnaissance and traffic analysis. For production Actors, use Crawlee crawlers (CheerioCrawler/PlaywrightCrawler) on Apify infrastructure.

**See**: `workflows/productionization.md` for the complete workflow and `apify/` for Actor development guides
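As promised above, a minimal sketch of steps 1-3 of the implementation core pattern, assuming the recon report recommended `CheerioCrawler`; `maxRequestsPerCrawl` caps the trial batch, and the `h1` selector and URLs are placeholders for whatever the report validated.

```javascript
// Sketch: implementation steps 1-3, small batch before scaling.
// Assumes `crawlee` is installed; selector and URLs are placeholders.
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 10, // step 2: small trial batch; raise or remove to scale (step 4)
  async requestHandler({ $, request, log }) {
    const item = {
      url: request.url,
      title: $('h1').text().trim(), // selector from the validated recon report
    };
    // Step 3: validate quality before scaling, rather than silently storing junk.
    if (!item.title) {
      log.warning(`Empty title at ${request.url}: selector may be wrong`);
      return;
    }
    await Dataset.pushData(item);
  },
});

await crawler.run(['https://example.com/products/1', 'https://example.com/products/2']);
```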
## Quick Reference

| Task | Pattern/Command | Documentation |
|------|----------------|---------------|
| **Reconnaissance** | **Adaptive Phases 0-5** | **`workflows/reconnaissance.md`** |
| Framework detection | Header + HTML signature matching | `strategies/framework-signatures.md` |
| Cheerio vs Browser | Three-way test + early exit | `strategies/cheerio-vs-browser-test.md` |
| Traffic analysis | `proxy_list_traffic()` + `proxy_get_exchange()` | `strategies/traffic-interception.md` |
| Protection testing | Conditional escalation | `strategies/proxy-escalation.md` |
| Report format | Sections 1-7 with self-critique | `reference/report-schema.md` |
| Find sitemaps | `RobotsFile.find(url)` | `strategies/sitemap-discovery.md` |
| Filter sitemap URLs | `RequestList` + regex | `reference/regex-patterns.md` |
| Discover APIs | Traffic capture (automatic) | `strategies/api-discovery.md` |
| DOM scraping | DevTools bridge + humanizer | `strategies/dom-scraping.md` |
| HTTP scraping | `CheerioCrawler` | `strategies/cheerio-scraping.md` |
| Hybrid approach | Sitemap + API | `strategies/hybrid-approaches.md` |
| Handle blocking | Stealth mode + upstream proxies | `strategies/anti-blocking.md` |
| Session recording | `proxy_session_start()` / `proxy_export_har()` | `strategies/session-workflows.md` |
| Proxy-MCP tools | Complete reference | `reference/proxy-tool-reference.md` |
| Fingerprint configs | Stealth + TLS presets | `reference/fingerprint-patterns.md` |
| Create Apify Actor | `apify create` | `apify/cli-workflow.md` |
| Template selection | Cheerio vs Playwright | `workflows/productionization.md` |
| Input schema | `.actor/input_schema.json` | `apify/input-schemas.md` |
| Deploy actor | `apify push` | `apify/deployment.md` |

## Common Patterns

### Pattern 1: Sitemap-Based Scraping

```javascript
import { RobotsFile, CheerioCrawler, Dataset } from 'crawlee';

// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();

const crawler = new CheerioCrawler({
  async requestHandler({ $, request }) {
    const data = {
      title: $('h1').text().trim(),
      // ... extract data
    };
    await Dataset.pushData(data);
  },
});

await crawler.addRequests(urls);
await crawler.run();
```

See `examples/sitemap-basic.js` for a complete example.

### Pattern 2: API-Based Scraping

```javascript
import { gotScraping } from 'got-scraping';

const productIds = [123, 456, 789];

for (const id of productIds) {
  const response = await gotScraping({
    url: `https://api.example.com/products/${id}`,
    responseType: 'json',
  });
  console.log(response.body);
}
```

See `examples/api-scraper.js` for a complete example.

### Pattern 3: Hybrid (Sitemap + API)

```javascript
import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get URLs from the sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();

// Extract IDs from URLs
const productIds = urls
  .map(url => url.match(/\/products\/(\d+)/)?.[1])
  .filter(Boolean);

// Fetch data via the API
for (const id of productIds) {
  const data = await gotScraping({
    url: `https://api.shop.com/v1/products/${id}`,
    responseType: 'json',
  });
  // Process data
}
```

See `examples/hybrid-sitemap-api.js` for a complete example.
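### Pattern 4: Cheerio Viability Early Exit

A sketch of the early-exit test referenced in the Quick Reference: one cheap HTTP request decides whether Cheerio suffices before any browser is launched. This illustrates the idea, not the contents of `strategies/cheerio-vs-browser-test.md`; the URL and `h1` selector are placeholders.

```javascript
import * as cheerio from 'cheerio';
import { gotScraping } from 'got-scraping';

// One cheap HTTP request, no browser
const { body, statusCode } = await gotScraping({ url: 'https://example.com/product/1' });

if (statusCode !== 200) {
  console.log('Blocked or challenged: consider Phase 4 protection testing');
} else {
  const $ = cheerio.load(body);
  const title = $('h1').text().trim();
  // Early exit: if the target data is in raw HTML, Cheerio is viable; skip the browser.
  console.log(title ? `Cheerio viable: ${title}` : 'Data missing: JS rendering likely required');
}
```

See `strategies/cheerio-vs-browser-test.md` for the full three-way test with early exit.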
## Directory Navigation

This skill uses **progressive disclosure**: detailed information is organized in subdirectories and loaded only when needed.

### Workflows (Implementation Patterns)

**For**: Step-by-step workflow guides for each phase

- `workflows/reconnaissance.md` - **Adaptive reconnaissance, Phases 0-5 (CRITICAL)**
- `workflows/implementation.md` - Iterative implementation patterns (after reconnaissance)
- `workflows/productionization.md` - Apify Actor creation workflow (on request)

### Strategies (Deep Dives)

**For**: Detailed guides on specific scraping approaches

- `strategies/framework-signatures.md` - **Framework detection lookup tables (Phases 0/1)**
- `strategies/cheerio-vs-browser-test.md` - **Cheerio vs Browser decision test with early exit**
- `strategies/proxy-escalation.md` - **Protection testing skip/run conditions (Phase 4)**
- `strategies/traffic-interception.md` - Traffic interception via MITM proxy
- `strategies/sitemap-discovery.md` - Complete sitemap guide (4 patterns)
- `strategies/api-discovery.md` - Finding and using APIs
- `strategies/dom-scraping.md` - DOM scraping via DevTools bridge
- `strategies/cheerio-scraping.md` - HTTP-only scraping
- `strategies/hybrid-approaches.md` - Combining strategies
- `strategies/anti-blocking.md` - Multi-layer anti-detection (stealth, humanizer, proxies, TLS)
- `strategies/session-workflows.md` - Session recording, HAR export, replay

### Examples (Runnable Code)

**For**: Working code to reference or execute

**JavaScript Learning Examples** (simple standalone scripts):

- `examples/sitemap-basic.js` - Simple sitemap scraper
- `examples/api-scraper.js` - Pure API approach
- `examples/traffic-interception-basic.js` - Proxy-based reconnaissance
- `examples/hybrid-sitemap-api.js` - Combined approach
- `examples/iterative-fallback.js` - Try traffic interception → sitemap → API → DOM scraping, in order

**TypeScript Production Examples** (complete Actors):

- `apify/examples/basic-scraper/` - Sitemap + Playwright
- `apify/examples/anti-blocking/` - Fingerprinting + proxies
- `apify/examples/hybrid-api/` - Sitemap + API (optimal)

### Reference (Quick Lookup)

**For**: Quick patterns and troubleshooting

- `reference/report-schema.md` - **Intelligence report format (Sections 1-7 + self-critique)**
- `reference/proxy-tool-reference.md` - Proxy-MCP tool reference (all 80+ tools)
- `reference/regex-patterns.md` - Common URL regex patterns
- `reference/fingerprint-patterns.md` - Stealth mode + TLS fingerprint presets
- `reference/anti-patterns.md` - What NOT to do

### Apify (Production Deployment)

**For**: Creating production Apify Actors

- `apify/README.md` - When and how to use Apify
- `apify/typescript-first.md` - **Why TypeScript for Actors**
- `apify/cli-workflow.md` - **`apify create` workflow (CRITICAL)**
- `apify/initialization.md` - Complete setup guide
- `apify/input-schemas.md` - Input validation patterns
- `apify/configuration.md` - actor.json setup
- `apify/deployment.md` - Testing and deployment
- `apify/templates/` - TypeScript boilerplate

**Note**: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.

## Core Principles

### 1. Assess Before Committing Resources

Start cheap (curl) and escalate only when needed:

- Phase 0 (curl) before Phase 1 (browser) before Phase 2 (deep scan)
- Quality gates skip phases when the data is sufficient
- Never launch a browser if curl gives you everything

### 2. Detect First, Then Search Relevant Patterns

Use framework detection to focus searches (see the sketch after this list):

- Match against `strategies/framework-signatures.md` before scanning
- Skip patterns that don't apply (no `__NEXT_DATA__` on Amazon)
- Known major sites get a direct strategy lookup
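As referenced above, a tiny dispatch-table sketch of how framework detection narrows the search space; the filter strings mirror Step 2b, while the `searchPlans` shape and `planFor` helper are hypothetical.

```javascript
// Sketch: detected framework → the only traffic filters worth searching (Principle 2 / Step 2b).
const searchPlans = {
  nextjs:    { urlFilters: ['/_next/data/'] },
  wordpress: { urlFilters: ['/wp-json/'] },
  graphql:   { urlFilters: [], queries: ['graphql'] },
  generic:   { urlFilters: ['/api/'], queries: ['application/json'] },
};

function planFor(framework) {
  // Unknown frameworks fall back to generic patterns; known ones skip everything else.
  return searchPlans[framework] ?? searchPlans.generic;
}

console.log(planFor('nextjs'));  // search only /_next/data/ traffic
console.log(planFor('unknown')); // fall back to generic /api/ + JSON search
```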
### 3. Validate, Don't Assume

Every claimed extraction method must be tested:

- "Found text in HTML" is not enough; you need a working selector or path
- Phase 3 validates every finding before the report
- Unvalidated claims are marked PARTIAL or NO in the report

### 4. Iterative Implementation

Build incrementally:

- Small test batch first (5-10 items)
- Validate quality
- Scale or fall back
- Add robustness last

### 5. Production-Ready Code

When productionizing:

- Use TypeScript (strongly recommended)
- Use `apify create` (never set up manually)
- Add proper error handling
- Include logging and monitoring

---

**Remember**: Traffic interception first, sitemaps second, APIs third, DOM scraping last!

For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.