---
name: website-scraper
description: >
  Scrapes structured data from any website and exports it to a clean CSV file.
  Use this skill whenever the user provides a URL and wants to extract specific
  information from it — even if they just say "scrape this site", "extract all
  the contacts from this page", "get me the pricing from this URL", "pull all
  the job listings", or "grab the data from this website into a CSV". Handles
  single pages, paginated results, sitemaps, and multi-page scraping. Always
  produces a downloadable CSV file as output.
---

# Website Scraper

You are an expert web scraper. The user will provide a URL and a description of the data they want to extract. Your job is to fetch the page, locate the right elements, extract clean structured data, and export it to a CSV.

Always respond in the user's language.

---

## Phase 1 — Clarify Before Starting

Check what you already know from the conversation. Ask only what is missing — in a single message.

### What to confirm if not provided

**1. Target URL(s)**
- Single page, list of pages, or a domain to crawl?
- If multiple pages: is there a pattern? (e.g., `/page/1`, `/page/2` or `?p=1`)
- If a domain: how deep to crawl? (just the homepage, all blog posts, all product pages?)

**2. Fields to extract**
Ask the user to list exactly what they want. Examples:
- Company name, website, email, phone, LinkedIn URL
- Product name, price, description, availability
- Job title, location, salary, apply link
- Article title, author, date, URL, excerpt

**3. Output filename**
Default: `scraped-data.csv` — ask only if they seem to care about naming.

If the user already specified the URL and fields clearly, skip Phase 1 and proceed.

---

## Phase 2 — Fetch & Explore the Page

### Step 1 — Fetch the page

Use the `web_fetch` tool to retrieve the target URL.
If the page returns an error or appears empty:
- Try adding `User-Agent` simulation via Python requests (see script below)
- If the page is JavaScript-rendered and returns empty HTML → flag to user and use the fallback Python approach with `requests` + `BeautifulSoup`

### Step 2 — Explore the structure

Before writing the scraper, analyze the page:
- Identify the HTML elements that contain the target data
- Look for repeating patterns (list items, table rows, card components)
- Check for pagination indicators (`next` button, page numbers, infinite scroll signal)
- Check for anti-scraping signals (Cloudflare, CAPTCHA, login wall)

### Step 3 — Report findings to user

Briefly confirm what you found before scraping:

> "Found 47 items on this page structured as `<div class="card">` blocks.
> I can extract: name, email, website, location. Pagination detected — 8 pages total.
> Starting extraction."

---

## Phase 3 — Scraping Strategy

Choose the right strategy based on the page type.

### Strategy A — Direct web_fetch (simple pages)

Use for: static HTML pages, simple lists, single pages.
Fetch with `web_fetch`, parse the returned markdown/text, extract fields using pattern matching and structure inference.

### Strategy B — Python requests + BeautifulSoup (complex pages)

Use for: paginated sites, sites requiring headers, structured HTML with CSS classes.

```python
import requests
from bs4 import BeautifulSoup
import csv
import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

def scrape_page(url):
    response = requests.get(url, headers=headers, timeout=15)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

# Pagination loop example
results = []
page = 1
while True:
    url = f"https://example.com/listings?page={page}"
    soup = scrape_page(url)
    items = soup.select(".item-class")  # adapt selector
    if not items:
        break
    for item in items:
        # select_one returns None when nothing matches — fall back to ""
        field1 = item.select_one(".field1")
        field2 = item.select_one(".field2")
        results.append({
            "field1": field1.get_text(strip=True) if field1 else "",
            "field2": field2.get_text(strip=True) if field2 else "",
        })
    page += 1
    time.sleep(1)  # polite delay between requests
```

### Strategy C — Sitemap crawl

Use for: extracting all pages/posts from a domain.
```python
import requests
from bs4 import BeautifulSoup

def get_sitemap_urls(domain):
    sitemap_candidates = [
        f"{domain}/sitemap.xml",
        f"{domain}/sitemap_index.xml",
        f"{domain}/sitemap-0.xml",
    ]
    for url in sitemap_candidates:
        r = requests.get(url, timeout=10)
        if r.status_code == 200:
            soup = BeautifulSoup(r.content, "xml")  # "xml" parser requires lxml
            return [loc.text for loc in soup.find_all("loc")]
    return []
```

### Strategy D — API / JSON endpoint (best case)

Some sites load data via an internal API. Before scraping HTML:

1. Check the page source for API calls (look for `fetch(`, `axios.get(`, XHR requests)
2. If found → call the JSON endpoint directly with `web_fetch` or `requests`
3. Parse JSON → much cleaner than HTML scraping

```python
import requests

# headers dict from Strategy B
response = requests.get("https://example.com/api/listings?page=1", headers=headers)
data = response.json()
items = data.get("results", [])
```

---

## Phase 4 — Data Extraction & Cleaning

### Extraction rules

- Strip all HTML tags from text values
- Strip leading/trailing whitespace from every field
- Normalize URLs: ensure they start with `https://` (add domain if relative path)
- Normalize emails: lowercase, strip spaces
- Normalize phone numbers: keep as-is (don't reformat — different countries differ)
- For missing fields: use empty string `""` — never use `null`, `None`, or `N/A`
- For numeric fields (price, count): strip currency symbols and units, keep number only

### Deduplication

Before writing to CSV:
- Remove exact duplicate rows (all fields identical)
- If a unique identifier exists (URL, email, ID): deduplicate on that field
- Report: "Removed X duplicates — Y unique records written to CSV."
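The cleaning and deduplication rules above can be sketched as stdlib-only helpers. This is a minimal sketch: the `base_url` argument and the choice of unique key are assumptions to adapt per dataset, and phone numbers are deliberately left untouched, per the rules.

```python
import re
from urllib.parse import urljoin

def clean_url(value, base_url):
    """Make relative paths absolute against base_url; keep "" for missing."""
    value = value.strip()
    return urljoin(base_url, value) if value else ""

def clean_email(value):
    """Lowercase and strip all spaces."""
    return value.strip().replace(" ", "").lower()

def clean_number(value):
    """Strip currency symbols and units; keep digits and separators only."""
    return re.sub(r"[^\d.,]", "", value.strip())

def dedupe(rows, key=None):
    """Drop exact-duplicate rows; if a unique field is known, dedupe on it."""
    seen, unique = set(), []
    for row in rows:
        if key and row.get(key):
            marker = row[key]
        else:  # no key, or key value missing — compare the whole row
            marker = tuple(sorted(row.items()))
        if marker in seen:
            continue
        seen.add(marker)
        unique.append(row)
    return unique
```

For the dedup report, `len(rows) - len(dedupe(rows, key="url"))` gives the "Removed X duplicates" count.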
### Data quality check

After extraction, run a quick QA:
- Count empty values per field → flag fields with >30% empty as "sparse"
- Check for encoding issues (garbled characters) → re-fetch with `utf-8` if needed
- Check for truncated values (text ending with `…`) → flag for user

---

## Phase 5 — CSV Export

### Write the CSV

```python
import csv

output_path = "/mnt/user-data/outputs/scraped-data.csv"

if not results:
    raise SystemExit("No records extracted — nothing to write.")

# utf-8-sig adds a BOM for Excel compatibility
with open(output_path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=list(results[0].keys()))
    writer.writeheader()
    writer.writerows(results)

print(f"Wrote {len(results)} rows to {output_path}")
```

### CSV formatting rules

- Encoding: `utf-8-sig` (BOM included for Excel compatibility)
- Delimiter: `,` (comma) by default; use `;` if the user's locale writes decimals with a comma (France and much of Europe)
- Quote all string fields that may contain commas
- First row: headers (snake_case, lowercase, no spaces)
- No index column unless explicitly requested

### Header naming conventions

| Raw name | CSV header |
|---|---|
| Company Name | `company_name` |
| Email Address | `email` |
| Phone Number | `phone` |
| LinkedIn URL | `linkedin_url` |
| Job Title | `job_title` |
| Date Published | `published_date` |

---

## Phase 6 — Deliver & Summarize

After writing the CSV, call `present_files` with the output path. Then provide a short summary:

```
Scrape complete.
URL(s): [list]
Records extracted: X
Duplicates removed: Y
Fields: [list of column names]
Sparse fields (>30% empty): [list or "none"]
[Any warnings — e.g., "Page 3 returned a 403 — data from that page may be missing."]
```

---

## Anti-Scraping Handling

### Cloudflare / bot protection

If the page returns a Cloudflare challenge or bot detection page:

> "This site uses bot protection (Cloudflare / CAPTCHA) that prevents automated
> scraping. Options: (1) try at a different time, (2) use the site's official API
> if available, (3) export data manually from the site's UI."
Never attempt to bypass CAPTCHAs.

### Login walls

If the page requires authentication:

> "This page requires a login. If you can export data from within the platform
> (CSV export button), that's the cleanest option. Alternatively, share the
> exported file and I'll structure it into the format you need."

### Rate limiting (429)

If a 429 is received:
- Back off: wait 5 seconds, retry once
- If still 429: increase delay to 30 seconds, retry once more
- If still blocked: notify user and stop

```python
import time

def fetch_with_retry(url, waits=(5, 30)):
    """Retry on 429 using the schedule above: wait 5s, then 30s, then stop."""
    for wait in (*waits, None):  # None marks the final attempt — no retry left
        r = requests.get(url, headers=headers, timeout=15)
        if r.status_code == 429 and wait is not None:
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)
            continue
        r.raise_for_status()  # raises on the final 429 — notify user and stop
        return r
```

### Robots.txt compliance

Before scraping, check `robots.txt`:

```python
from urllib.robotparser import RobotFileParser

# domain and target_url come from the user's request
rp = RobotFileParser()
rp.set_url(f"{domain}/robots.txt")
rp.read()
allowed = rp.can_fetch("*", target_url)
```

If `robots.txt` disallows the target path → inform the user:

> "The site's robots.txt asks crawlers not to access this path. Proceeding may
> violate the site's terms of service. Do you want to continue anyway?"

Wait for explicit confirmation before proceeding.
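If the `robots.txt` body has already been fetched (for example with `web_fetch`), the same parser can evaluate its rules offline, avoiding a second request. A sketch; the rules string and URLs below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, target_url: str, agent: str = "*") -> bool:
    """Check a URL against robots.txt content that was already fetched."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, target_url)

# Hypothetical rules: everything allowed except /admin/ for all agents
rules = """\
User-agent: *
Disallow: /admin/
"""
```

`is_allowed(rules, "https://example.com/admin/users")` returns `False` for the rules above, so the user would be asked for explicit confirmation before that path is touched.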
---

## Common Scraping Patterns

Note: `select_one` returns `None` when nothing matches, so each field falls back to `""` explicitly, per the extraction rules.

### Directory / listing page (companies, people, jobs)

```python
items = soup.select(".listing-card")  # adapt selector
for item in items:
    name = item.select_one(".name")
    link = item.select_one("a")
    loc = item.select_one(".location")
    results.append({
        "name": name.get_text(strip=True) if name else "",
        "url": link.get("href", "") if link else "",
        "location": loc.get_text(strip=True) if loc else "",
    })
```

### Article / blog index

```python
articles = soup.select("article")
for a in articles:
    title = a.select_one("h2")
    date = a.select_one("time")
    link = a.select_one("a")
    author = a.select_one(".author")
    results.append({
        "title": title.get_text(strip=True) if title else "",
        "date": date.get("datetime", "") if date else "",
        "url": link.get("href", "") if link else "",
        "author": author.get_text(strip=True) if author else "",
    })
```

### E-commerce product list

```python
import re

products = soup.select(".product-item")
for p in products:
    price_el = p.select_one(".price")
    price_raw = price_el.get_text(strip=True) if price_el else ""
    price_clean = re.sub(r"[^\d.,]", "", price_raw)
    name = p.select_one(".product-title")
    link = p.select_one("a")
    img = p.select_one("img")
    results.append({
        "product_name": name.get_text(strip=True) if name else "",
        "price": price_clean,
        "url": link.get("href", "") if link else "",
        "image_url": img.get("src", "") if img else "",
    })
```

### Paginated results (URL pattern)

```python
page = 1
while True:
    url = BASE_URL.format(page=page)
    soup = scrape_page(url)
    items = soup.select(".item")
    if not items:
        break
    # extract items...
    page += 1
    time.sleep(1)
```

### Paginated results ("Next" button)

```python
url = START_URL
while url:
    soup = scrape_page(url)
    # extract items...
    next_btn = soup.select_one("a[rel='next'], .pagination-next a")
    url = next_btn.get("href") if next_btn else None
    if url and not url.startswith("http"):
        url = BASE_DOMAIN + url
    time.sleep(1)
```

---

## Dependencies

Install before running any script:

```bash
pip install requests beautifulsoup4 lxml --break-system-packages
```

`lxml` is required for parsing XML sitemaps (`BeautifulSoup(r.content, "xml")`).
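As a closing reference, the CSV rules from Phase 5 can be wrapped in one stdlib-only helper. This is a sketch, not part of the skill's required interface; the `write_csv` name and the `european` flag are illustrative:

```python
import csv

def write_csv(rows, path, european=False):
    """Write rows per the Phase 5 rules: utf-8-sig BOM for Excel, quoted
    fields, and a ';' delimiter for locales that use the decimal comma."""
    if not rows:
        raise ValueError("No records extracted — nothing to write.")
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=list(rows[0].keys()),
            delimiter=";" if european else ",",
            quoting=csv.QUOTE_ALL,  # safe when field values contain commas
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The return value feeds the "Records extracted" line of the Phase 6 summary, e.g. `count = write_csv(results, "/mnt/user-data/outputs/scraped-data.csv")`.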