--- name: web-scraping description: Extract clean content from URLs via MCP. Use when asked to read web pages, extract links, or get page metadata. --- # Web Scraping via MCP Use this skill to extract clean, readable content from any URL. Returns markdown text, links, and metadata. Free alternative to Firecrawl. ## Available Tools | Tool | What it does | | ------------------ | ----------------------------------------------------------- | | `scrape_url` | Extract clean text content from a URL (Readability-powered) | | `extract_links` | Get all links with href and anchor text | | `extract_metadata` | Get title, description, OG tags, canonical, favicon | | `search_page` | Search for a query string within the page content | | `scrape_multiple` | Batch scrape multiple URLs, get title + excerpt per URL | ## Workflow 1. `scrape_url` for reading a single page (docs, blog post, article) 2. `extract_links` to discover linked resources from a page 3. `extract_metadata` for SEO analysis or link preview data 4. `scrape_multiple` to survey multiple pages at once ## Key Patterns - Uses Mozilla Readability (Firefox Reader View engine) — works best with server-rendered content - Does NOT handle JavaScript-heavy SPAs (React apps, dashboards) — use a browser MCP for those - `scrape_multiple` returns title + excerpt per URL, not full content — use for surveying - `search_page` searches within the extracted content, not raw HTML ## Limitations - No headless browser — won't execute JavaScript - Best for: documentation, blogs, articles, news, wikis - Won't work for: login-gated content, SPAs, dynamically loaded content