`, long paragraphs, date/author | Extract main article body | | **Product Page** | Price, "Add to Cart", specs | Extract title, price, description, specs | | **Documentation** | Code blocks, headings hierarchy | Preserve structure and code | | **List/Search** | Repeated item patterns | Extract as structured list | | **Landing Page** | Hero section, CTAs | Extract key messaging | ### Step 2: Filter Noise **ALWAYS REMOVE these elements from output:** - Navigation menus and breadcrumbs - Footer content (copyright, links) - Sidebars (ads, related articles, social links) - Cookie banners and popups - Comments section (unless specifically requested) - Share buttons and social widgets - Login/signup prompts ### Step 3: Structure the Content **For Articles:** ```markdown # [Title] **Source:** [URL] **Date:** [if available] **Author:** [if available] --- [Main content in clean markdown] ``` **For Product Pages:** ```markdown # [Product Name] **Price:** [price] **Availability:** [in stock/out of stock] ## Description [product description] ## Specifications | Spec | Value | |------|-------| | ... | ... | ``` --- ## Output Formats ### Markdown (default) Clean, readable markdown with proper headings, lists, and formatting. ### JSON ```json { "url": "https://...", "title": "Page Title", "type": "article|product|docs|list", "content": { "main": "...", "metadata": {} }, "extracted_at": "ISO timestamp" } ``` ### Text Plain text with minimal formatting, suitable for further processing. --- ## Error Handling ### Navigation Errors | Error | Detection | Action | |-------|-----------|--------| | **Timeout** | Page doesn't load in 30s | Report error, suggest retry | | **404 Not Found** | "404" in title/content | Report "Page not found" | | **403 Forbidden** | "403", "Access Denied" | Report access restriction | | **CAPTCHA** | "captcha", "verify you're human" | Report CAPTCHA detected, cannot proceed | | **Paywall** | "subscribe", "premium content" | Extract visible content, note paywall | ### Recovery Actions ``` If page load fails: 1. Report the specific error to user 2. Suggest: "Try again?" or "Different URL?" 3. Close browser cleanly If content is blocked: 1. Report what was detected (CAPTCHA/paywall/geo-block) 2. Extract any available preview content 3. Suggest alternatives if applicable ``` --- ## Advanced Scenarios ### Single Page Applications (SPA) ``` 1. Navigate to URL 2. Wait longer (3-5 seconds) for JS hydration 3. Use browser_wait_for with specific text if known 4. Then snapshot ``` ### Infinite Scroll Pages ``` 1. Navigate 2. Execute scroll loop (see Phase 1) 3. Snapshot after scrolling completes ``` ### Pages with Click-to-Reveal Content ``` 1. Snapshot first to identify clickable elements 2. Use browser_click on "Read more" / "Show all" buttons 3. Wait briefly 4. Snapshot again for full content ``` ### Multi-page Articles ``` 1. Scrape first page 2. Identify "Next" or pagination links 3. Ask user: "Article has X pages. Scrape all?" 4. If yes, iterate through pages and combine ``` --- ## Performance Guidelines | Metric | Target | How | |--------|--------|-----| | **Speed** | < 15 seconds | Minimal waits, parallel where possible | | **Token Usage** | < 5000 tokens | Smart extraction, not full DOM | | **Reliability** | > 95% success | Proper error handling | --- ## Security Notes - Never execute arbitrary JavaScript from the page - Don't follow redirects to suspicious domains - Don't submit forms or click login buttons - Don't scrape pages that require authentication (unless user provides credentials flow) - Respect robots.txt when mentioned by user --- ## Quick Reference **Minimum viable scrape (4 tool calls):** ``` 1. browser_navigate → 2. browser_wait_for → 3. browser_snapshot → 4. browser_close ``` **Full-featured scrape (with scroll + screenshot):** ``` 1. browser_navigate 2. browser_wait_for 3. browser_evaluate (scroll) 4. browser_snapshot 5. browser_take_screenshot 6. browser_close ``` Remember: The goal is to deliver **clean, useful content** to the user, not raw HTML/DOM dumps.