--- name: web-scraping description: This skill activates for web scraping and Actor development. It proactively discovers sitemaps/APIs, recommends optimal strategy (sitemap/API/Playwright/hybrid), and implements iteratively. For production, it guides TypeScript Actor creation via Apify CLI. license: MIT --- # Web Scraping with Intelligent Strategy Selection ## When This Skill Activates Activate automatically when user requests: - "Scrape [website]" - "Extract data from [site]" - "Get product information from [URL]" - "Find all links/pages on [site]" - "I'm getting blocked" or "Getting 403 errors" (loads `strategies/anti-blocking.md`) - "Make this an Apify Actor" (loads `apify/` subdirectory) - "Productionize this scraper" ## Proactive Workflow This skill follows a systematic 5-phase approach to web scraping, always starting with interactive reconnaissance and ending with production-ready code. ### Phase 1: INTERACTIVE RECONNAISSANCE (Critical First Step) When user says "scrape X", **immediately start with hands-on reconnaissance** using MCP tools: **DO NOT jump to automated checks or implementation** - reconnaissance prevents wasted effort and discovers hidden APIs. #### Use Playwright MCP & Chrome DevTools MCP: **1. Open site in real browser** (Playwright MCP) - Navigate like a real user - Observe page loading behavior (SSR? SPA? Loading states?) - Take screenshots for reference - Test basic interactions **2. Monitor network traffic** (Chrome DevTools via Playwright) - Watch XHR/Fetch requests in real-time - **Find API endpoints** returning JSON (10-100x faster than HTML scraping!) - Analyze request/response patterns - Document headers, cookies, authentication tokens - Extract pagination parameters **3. Test site interactions** - **Pagination**: URL-based? API? Infinite scroll? - **Filtering and search**: How do they work? - **Dynamic content loading**: Triggers and patterns - **Authentication flows**: Required? Optional? **4. Assess protection mechanisms** - Cloudflare/bot detection - CAPTCHA requirements - Rate limiting behavior (test with multiple requests) - Fingerprinting scripts **5. Generate Intelligence Report** - Site architecture (framework, rendering method) - **Discovered APIs/endpoints** with full specs - Protection mechanisms and required countermeasures - **Optimal extraction strategy** (API > Sitemap > HTML) - Time/complexity estimates **See**: `workflows/reconnaissance.md` for complete reconnaissance guide with MCP examples **Why this matters**: Reconnaissance discovers hidden APIs (eliminating need for HTML scraping), identifies blockers before coding, and provides intelligence for optimal strategy selection. **Never skip this step.** ### Phase 2: AUTOMATIC DISCOVERY (Validate Reconnaissance) After Phase 1 reconnaissance, **validate findings with automated checks**: #### 1. Check for Sitemaps ```bash # Automatically check these locations curl -s https://[site]/robots.txt | grep -i Sitemap curl -I https://[site]/sitemap.xml curl -I https://[site]/sitemap_index.xml ``` **Log findings clearly**: - ✓ "Found sitemap at /sitemap.xml with ~1,234 URLs" - ✓ "Found sitemap index with 5 sub-sitemaps" - ✗ "No sitemap detected at common locations" **Why this matters**: Sitemaps provide instant URL discovery (60x faster than crawling) #### 2. Investigate APIs **Prompt user**: ``` Should I check for JSON APIs first? (Highly recommended) Benefits of APIs vs HTML scraping: • 10-100x faster execution • More reliable (structured JSON vs fragile HTML) • Less bandwidth usage • Easier to maintain Check for APIs? [Y/n] ``` **If yes**, guide user: 1. Open browser DevTools → Network tab 2. Navigate the target website 3. Look for XHR/Fetch requests 4. Check for endpoints: `/api/`, `/v1/`, `/v2/`, `/graphql`, `/_next/data/` 5. Analyze request/response format (JSON, GraphQL, REST) **Log findings**: - ✓ "Found API: GET /api/products/{id} (returns JSON)" - ✓ "Found GraphQL endpoint: /graphql" - ✗ "No obvious public APIs detected" #### 3. Analyze Site Structure **Automatically assess**: - JavaScript-heavy? (Look for React, Vue, Angular indicators) - Authentication required? (Login walls, auth tokens) - Page count estimate (from sitemap or site exploration) - Rate limiting indicators (robots.txt directives) ### Phase 3: STRATEGY RECOMMENDATION Based on Phases 1-2 findings, present 2-3 options with clear reasoning: #### Example Output Template: ``` ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 📊 Analysis of example.com ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Phase 1 Intelligence (Reconnaissance): ✓ API discovered via DevTools: GET /api/products?page=N&limit=100 ✓ Framework: Next.js (SSR + CSR hybrid) ✓ Protection: Cloudflare detected, rate limit ~60/min ✗ No authentication required Phase 2 Validation: ✓ Sitemap found: 1,234 product URLs (validates API total) ✓ Static HTML fallback available if needed ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Recommended Approaches: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⭐ Option 1: Hybrid (Sitemap + API) [RECOMMENDED] ✓ Use sitemap to get all 1,234 product URLs instantly ✓ Extract product IDs from URLs ✓ Fetch data via API (fast, reliable JSON) Estimated time: 8-12 minutes Complexity: Low-Medium Data quality: Excellent Speed: Very Fast ⚡ Option 2: Sitemap + Playwright ✓ Use sitemap for URLs ✓ Scrape HTML with Playwright Estimated time: 15-20 minutes Complexity: Medium Data quality: Good Speed: Fast 🔧 Option 3: Pure API (if sitemap fails) ✓ Discover product IDs through API exploration ✓ Fetch all data via API Estimated time: 10-15 minutes Complexity: Medium Data quality: Excellent Speed: Fast ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ My Recommendation: Option 1 (Hybrid) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Reasoning: • Sitemap gives us complete URL list (instant discovery) • API provides clean, structured data (no HTML parsing) • Combines speed of sitemap with reliability of API • Best of both worlds Proceed with Option 1? [Y/n] ``` **Key principles**: - Always recommend the SIMPLEST approach that works - Sitemap > API > Playwright (in terms of simplicity) - Show time estimates and complexity - Explain reasoning clearly ### Phase 4: ITERATIVE IMPLEMENTATION Implement scraper incrementally, starting simple and adding complexity only as needed. **Core Pattern**: 1. Implement recommended approach (minimal code) 2. Test with small batch (5-10 items) 3. Validate data quality 4. Scale to full dataset or fallback 5. Handle blocking if encountered 6. Add robustness (error handling, retries, logging) **See**: `workflows/implementation.md` for complete implementation patterns and code examples ### Phase 5: PRODUCTIONIZATION (On Request) Convert scraper to production-ready Apify Actor. **Activation triggers**: - "Make this an Apify Actor" - "Productionize this scraper" - "Deploy to Apify" - "Create an actor from this" **Core Pattern**: 1. Confirm TypeScript preference (STRONGLY RECOMMENDED) 2. Initialize with `apify create` command (CRITICAL) 3. Port scraping logic to Actor format 4. Test locally and deploy **See**: `workflows/productionization.md` for complete productionization workflow and `apify/` directory for all Actor development guides ## Quick Reference | Task | Pattern/Command | Documentation | |------|----------------|---------------| | **Reconnaissance** | **Playwright + DevTools MCP** | **`workflows/reconnaissance.md`** | | Find sitemaps | `RobotsFile.find(url)` | `strategies/sitemap-discovery.md` | | Filter sitemap URLs | `RequestList + regex` | `reference/regex-patterns.md` | | Discover APIs | DevTools → Network tab | `strategies/api-discovery.md` | | Playwright scraping | `PlaywrightCrawler` | `strategies/playwright-scraping.md` | | HTTP scraping | `CheerioCrawler` | `strategies/cheerio-scraping.md` | | Hybrid approach | Sitemap + API | `strategies/hybrid-approaches.md` | | Handle blocking | fingerprint-suite + proxies | `strategies/anti-blocking.md` | | Fingerprint configs | Quick patterns | `reference/fingerprint-patterns.md` | | Create Apify Actor | `apify create` | `apify/cli-workflow.md` | | Template selection | Cheerio vs Playwright | `workflows/productionization.md` | | Input schema | `.actor/input_schema.json` | `apify/input-schemas.md` | | Deploy actor | `apify push` | `apify/deployment.md` | ## Common Patterns ### Pattern 1: Sitemap-Based Scraping ```javascript import { RobotsFile, PlaywrightCrawler, Dataset } from 'crawlee'; // Auto-discover and parse sitemaps const robots = await RobotsFile.find('https://example.com'); const urls = await robots.parseUrlsFromSitemaps(); const crawler = new PlaywrightCrawler({ async requestHandler({ page, request }) { const data = await page.evaluate(() => ({ title: document.title, // ... extract data })); await Dataset.pushData(data); }, }); await crawler.addRequests(urls); await crawler.run(); ``` See `examples/sitemap-basic.js` for complete example. ### Pattern 2: API-Based Scraping ```javascript import { gotScraping } from 'got-scraping'; const productIds = [123, 456, 789]; for (const id of productIds) { const response = await gotScraping({ url: `https://api.example.com/products/${id}`, responseType: 'json', }); console.log(response.body); } ``` See `examples/api-scraper.js` for complete example. ### Pattern 3: Hybrid (Sitemap + API) ```javascript // Get URLs from sitemap const robots = await RobotsFile.find('https://shop.com'); const urls = await robots.parseUrlsFromSitemaps(); // Extract IDs from URLs const productIds = urls .map(url => url.match(/\/products\/(\d+)/)?.[1]) .filter(Boolean); // Fetch data via API for (const id of productIds) { const data = await gotScraping({ url: `https://api.shop.com/v1/products/${id}`, responseType: 'json', }); // Process data } ``` See `examples/hybrid-sitemap-api.js` for complete example. ## Directory Navigation This skill uses **progressive disclosure** - detailed information is organized in subdirectories and loaded only when needed. ### Workflows (Implementation Patterns) **For**: Step-by-step workflow guides for each phase - `workflows/reconnaissance.md` - **Phase 1 interactive reconnaissance (CRITICAL)** - `workflows/implementation.md` - Phase 4 iterative implementation patterns - `workflows/productionization.md` - Phase 5 Apify Actor creation workflow ### Strategies (Deep Dives) **For**: Detailed guides on specific scraping approaches - `strategies/sitemap-discovery.md` - Complete sitemap guide (4 patterns) - `strategies/api-discovery.md` - Finding and using APIs - `strategies/playwright-scraping.md` - Browser-based scraping - `strategies/cheerio-scraping.md` - HTTP-only scraping - `strategies/hybrid-approaches.md` - Combining strategies - `strategies/anti-blocking.md` - Fingerprinting & proxies for blocked sites ### Examples (Runnable Code) **For**: Working code to reference or execute **JavaScript Learning Examples** (Simple standalone scripts): - `examples/sitemap-basic.js` - Simple sitemap scraper - `examples/api-scraper.js` - Pure API approach - `examples/playwright-basic.js` - Basic Playwright scraper - `examples/hybrid-sitemap-api.js` - Combined approach - `examples/iterative-fallback.js` - Try sitemap→API→Playwright **TypeScript Production Examples** (Complete Actors): - `apify/examples/basic-scraper/` - Sitemap + Playwright - `apify/examples/anti-blocking/` - Fingerprinting + proxies - `apify/examples/hybrid-api/` - Sitemap + API (optimal) ### Reference (Quick Lookup) **For**: Quick patterns and troubleshooting - `reference/regex-patterns.md` - Common URL regex patterns - `reference/selector-guide.md` - Playwright selector strategies - `reference/fingerprint-patterns.md` - Common fingerprint configurations - `reference/anti-patterns.md` - What NOT to do ### Apify (Production Deployment) **For**: Creating production Apify Actors - `apify/README.md` - When and how to use Apify - `apify/typescript-first.md` - **Why TypeScript for actors** - `apify/cli-workflow.md` - **apify create workflow (CRITICAL)** - `apify/initialization.md` - Complete setup guide - `apify/input-schemas.md` - Input validation patterns - `apify/configuration.md` - actor.json setup - `apify/deployment.md` - Testing and deployment - `apify/templates/` - TypeScript boilerplate **Note**: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed. ## Core Principles ### 1. Progressive Enhancement Start with the simplest approach that works: - Sitemap > API > Playwright - Static > Dynamic - HTTP > Browser ### 2. Proactive Discovery Always investigate before implementing: - Check for sitemaps automatically - Look for APIs (ask user to check DevTools) - Analyze site structure ### 3. Iterative Implementation Build incrementally: - Small test batch first (5-10 items) - Validate quality - Scale or fallback - Add robustness last ### 4. Production-Ready Code When productionizing: - Use TypeScript (strongly recommended) - Use `apify create` (never manual setup) - Add proper error handling - Include logging and monitoring --- **Remember**: Sitemaps first, APIs second, scraping last! For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.