---
name: website-scraper
description: >
  Scrapes structured data from any website and exports it to a clean CSV file.
  Use this skill whenever the user provides a URL and wants to extract specific
  information from it — even if they just say "scrape this site", "extract all
  the contacts from this page", "get me the pricing from this URL", "pull all
  the job listings", or "grab the data from this website into a CSV". Handles
  single pages, paginated results, sitemaps, and multi-page scraping. Always
  produces a downloadable CSV file as output.
---

# Website Scraper

You are an expert web scraper. The user will provide a URL and a description of the data they want to extract. Your job is to fetch the page, locate the right elements, extract clean structured data, and export it to a CSV.

Always respond in the user's language.

---

## Phase 1 — Clarify Before Starting

Check what you already know from the conversation. Ask only what is missing — in a single message.

### What to confirm if not provided

**1. Target URL(s)**
- Single page, list of pages, or a domain to crawl?
- If multiple pages: is there a pattern? (e.g., `/page/1`, `/page/2` or `?p=1`)
- If a domain: how deep to crawl? (just the homepage, all blog posts, all product pages?)

**2. Fields to extract**
Ask the user to list exactly what they want. Examples:
- Company name, website, email, phone, LinkedIn URL
- Product name, price, description, availability
- Job title, location, salary, apply link
- Article title, author, date, URL, excerpt

**3. Output filename**
Default: `scraped-data.csv` — ask only if they seem to care about naming.

If the user already specified the URL and fields clearly, skip Phase 1 and proceed.

---

## Phase 2 — Fetch & Explore the Page

### Step 1 — Fetch the page

Use the `web_fetch` tool to retrieve the target URL.
If the page returns an error or appears empty:
- Try adding `User-Agent` simulation via Python requests (see script below)
- If the page is JavaScript-rendered and returns empty HTML → flag to user and use the fallback Python approach with `requests` + `BeautifulSoup`

### Step 2 — Explore the structure

Before writing the scraper, analyze the page:
- Identify the HTML elements that contain the target data
- Look for repeating patterns (list items, table rows, card components)
- Check for pagination indicators (`next` button, page numbers, infinite scroll signal)
- Check for anti-scraping signals (Cloudflare, CAPTCHA, login wall)

### Step 3 — Report findings to user

Briefly confirm what you found before scraping:

> "Found 47 items on this page structured as `<div class="card">` blocks.
> I can extract: name, email, website, location. Pagination detected — 8 pages total.
> Starting extraction."

---

## Phase 3 — Scraping Strategy

Choose the right strategy based on the page type.

### Strategy A — Direct web_fetch (simple pages)

Use for: static HTML pages, simple lists, single pages.
Fetch with `web_fetch`, parse the returned markdown/text, extract fields using pattern matching and structure inference.

### Strategy B — Python requests + BeautifulSoup (complex pages)

Use for: paginated sites, sites requiring headers, structured HTML with CSS classes.

```python
import requests
from bs4 import BeautifulSoup
import csv
import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

def scrape_page(url):
    response = requests.get(url, headers=headers, timeout=15)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

# Pagination loop example
results = []
page = 1
while True:
    url = f"https://example.com/listings?page={page}"
    soup = scrape_page(url)
    items = soup.select(".item-class")  # adapt selector
    if not items:
        break
    for item in items:
        # select_one returns None when nothing matches — fall back to ""
        field1 = item.select_one(".field1")
        field2 = item.select_one(".field2")
        results.append({
            "field1": field1.get_text(strip=True) if field1 else "",
            "field2": field2.get_text(strip=True) if field2 else "",
        })
    page += 1
    time.sleep(1)  # polite delay between requests
```

### Strategy C — Sitemap crawl

Use for: extracting all pages/posts from a domain.
```python
import requests
from bs4 import BeautifulSoup

def get_sitemap_urls(domain):
    sitemap_candidates = [
        f"{domain}/sitemap.xml",
        f"{domain}/sitemap_index.xml",
        f"{domain}/sitemap-0.xml",
    ]
    for url in sitemap_candidates:
        r = requests.get(url, timeout=10)
        if r.status_code == 200:
            soup = BeautifulSoup(r.content, "xml")  # "xml" parser requires lxml
            return [loc.text for loc in soup.find_all("loc")]
    return []
```

### Strategy D — API / JSON endpoint (best case)

Some sites load data via an internal API. Before scraping HTML:

1. Check the page source for API calls (look for `fetch(`, `axios.get(`, XHR requests)
2. If found → call the JSON endpoint directly with `web_fetch` or `requests`
3. Parse JSON → much cleaner than HTML scraping

```python
import requests

# headers dict from Strategy B
response = requests.get("https://example.com/api/listings?page=1", headers=headers)
data = response.json()
items = data.get("results", [])
```

---

## Phase 4 — Data Extraction & Cleaning

### Extraction rules

- Strip all HTML tags from text values
- Strip leading/trailing whitespace from every field
- Normalize URLs: ensure they start with `https://` (add domain if relative path)
- Normalize emails: lowercase, strip spaces
- Normalize phone numbers: keep as-is (don't reformat — different countries differ)
- For missing fields: use empty string `""` — never use `null`, `None`, or `N/A`
- For numeric fields (price, count): strip currency symbols and units, keep number only

### Deduplication

Before writing to CSV:
- Remove exact duplicate rows (all fields identical)
- If a unique identifier exists (URL, email, ID): deduplicate on that field
- Report: "Removed X duplicates — Y unique records written to CSV."
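The cleaning and deduplication rules above can be sketched as stdlib-only helpers. This is a minimal sketch: the `base_url` argument and the choice of unique key are assumptions to adapt per dataset, and phone numbers are deliberately left untouched, per the rules.

```python
import re
from urllib.parse import urljoin

def clean_url(value, base_url):
    """Make relative paths absolute against base_url; keep "" for missing."""
    value = value.strip()
    return urljoin(base_url, value) if value else ""

def clean_email(value):
    """Lowercase and strip all spaces."""
    return value.strip().replace(" ", "").lower()

def clean_number(value):
    """Strip currency symbols and units; keep digits and separators only."""
    return re.sub(r"[^\d.,]", "", value.strip())

def dedupe(rows, key=None):
    """Drop exact-duplicate rows; if a unique field is known, dedupe on it."""
    seen, unique = set(), []
    for row in rows:
        if key and row.get(key):
            marker = row[key]
        else:  # no key, or key value missing — compare the whole row
            marker = tuple(sorted(row.items()))
        if marker in seen:
            continue
        seen.add(marker)
        unique.append(row)
    return unique
```

For the dedup report, `len(rows) - len(dedupe(rows, key="url"))` gives the "Removed X duplicates" count.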
### Data quality check

After extraction, run a quick QA:
- Count empty values per field → flag fields with >30% empty as "sparse"
- Check for encoding issues (garbled characters) → re-fetch with `utf-8` if needed
- Check for truncated values (text ending with `…`) → flag for user

---

## Phase 5 — CSV Export

### Write the CSV

```python
import csv

output_path = "/mnt/user-data/outputs/scraped-data.csv"

if not results:
    raise SystemExit("No records extracted — nothing to write.")

# utf-8-sig adds a BOM for Excel compatibility
with open(output_path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=list(results[0].keys()))
    writer.writeheader()
    writer.writerows(results)

print(f"Wrote {len(results)} rows to {output_path}")
```

### CSV formatting rules

- Encoding: `utf-8-sig` (BOM included for Excel compatibility)
- Delimiter: `,` (comma) by default; use `;` if the user's locale writes decimals with a comma (France and much of Europe)
- Quote all string fields that may contain commas
- First row: headers (snake_case, lowercase, no spaces)
- No index column unless explicitly requested

### Header naming conventions

| Raw name | CSV header |
|---|---|
| Company Name | `company_name` |
| Email Address | `email` |
| Phone Number | `phone` |
| LinkedIn URL | `linkedin_url` |
| Job Title | `job_title` |
| Date Published | `published_date` |

---

## Phase 6 — Deliver & Summarize

After writing the CSV, call `present_files` with the output path. Then provide a short summary:

```
Scrape complete.
URL(s): [list]
Records extracted: X
Duplicates removed: Y
Fields: [list of column names]
Sparse fields (>30% empty): [list or "none"]
[Any warnings — e.g., "Page 3 returned a 403 — data from that page may be missing."]
```

---

## Anti-Scraping Handling

### Cloudflare / bot protection

If the page returns a Cloudflare challenge or bot detection page:

> "This site uses bot protection (Cloudflare / CAPTCHA) that prevents automated
> scraping. Options: (1) try at a different time, (2) use the site's official API
> if available, (3) export data manually from the site's UI."
Never attempt to bypass CAPTCHAs.

### Login walls

If the page requires authentication:

> "This page requires a login. If you can export data from within the platform
> (CSV export button), that's the cleanest option. Alternatively, share the
> exported file and I'll structure it into the format you need."

### Rate limiting (429)

If a 429 is received:
- Back off: wait 5 seconds, retry once
- If still 429: increase delay to 30 seconds, retry once more
- If still blocked: notify user and stop

```python
import time

def fetch_with_retry(url, waits=(5, 30)):
    """Retry on 429 using the schedule above: wait 5s, then 30s, then stop."""
    for wait in (*waits, None):  # None marks the final attempt — no retry left
        r = requests.get(url, headers=headers, timeout=15)
        if r.status_code == 429 and wait is not None:
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)
            continue
        r.raise_for_status()  # raises on the final 429 — notify user and stop
        return r
```

### Robots.txt compliance

Before scraping, check `robots.txt`:

```python
from urllib.robotparser import RobotFileParser

# domain and target_url come from the user's request
rp = RobotFileParser()
rp.set_url(f"{domain}/robots.txt")
rp.read()
allowed = rp.can_fetch("*", target_url)
```

If `robots.txt` disallows the target path → inform the user:

> "The site's robots.txt asks crawlers not to access this path. Proceeding may
> violate the site's terms of service. Do you want to continue anyway?"

Wait for explicit confirmation before proceeding.
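If the `robots.txt` body has already been fetched (for example with `web_fetch`), the same parser can evaluate its rules offline, avoiding a second request. A sketch; the rules string and URLs below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, target_url: str, agent: str = "*") -> bool:
    """Check a URL against robots.txt content that was already fetched."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, target_url)

# Hypothetical rules: everything allowed except /admin/ for all agents
rules = """\
User-agent: *
Disallow: /admin/
"""
```

`is_allowed(rules, "https://example.com/admin/users")` returns `False` for the rules above, so the user would be asked for explicit confirmation before that path is touched.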
---

## Common Scraping Patterns

Note: `select_one` returns `None` when nothing matches, so each field falls back to `""` explicitly, per the extraction rules.

### Directory / listing page (companies, people, jobs)

```python
items = soup.select(".listing-card")  # adapt selector
for item in items:
    name = item.select_one(".name")
    link = item.select_one("a")
    loc = item.select_one(".location")
    results.append({
        "name": name.get_text(strip=True) if name else "",
        "url": link.get("href", "") if link else "",
        "location": loc.get_text(strip=True) if loc else "",
    })
```

### Article / blog index

```python
articles = soup.select("article")
for a in articles:
    title = a.select_one("h2")
    date = a.select_one("time")
    link = a.select_one("a")
    author = a.select_one(".author")
    results.append({
        "title": title.get_text(strip=True) if title else "",
        "date": date.get("datetime", "") if date else "",
        "url": link.get("href", "") if link else "",
        "author": author.get_text(strip=True) if author else "",
    })
```

### E-commerce product list

```python
import re

products = soup.select(".product-item")
for p in products:
    price_el = p.select_one(".price")
    price_raw = price_el.get_text(strip=True) if price_el else ""
    price_clean = re.sub(r"[^\d.,]", "", price_raw)
    name = p.select_one(".product-title")
    link = p.select_one("a")
    img = p.select_one("img")
    results.append({
        "product_name": name.get_text(strip=True) if name else "",
        "price": price_clean,
        "url": link.get("href", "") if link else "",
        "image_url": img.get("src", "") if img else "",
    })
```

### Paginated results (URL pattern)

```python
page = 1
while True:
    url = BASE_URL.format(page=page)
    soup = scrape_page(url)
    items = soup.select(".item")
    if not items:
        break
    # extract items...
    page += 1
    time.sleep(1)
```

### Paginated results ("Next" button)

```python
url = START_URL
while url:
    soup = scrape_page(url)
    # extract items...
    next_btn = soup.select_one("a[rel='next'], .pagination-next a")
    url = next_btn.get("href") if next_btn else None
    if url and not url.startswith("http"):
        url = BASE_DOMAIN + url
    time.sleep(1)
```

---

## Dependencies

Install before running any script:

```bash
pip install requests beautifulsoup4 lxml --break-system-packages
```

`lxml` is required for parsing XML sitemaps (`BeautifulSoup(r.content, "xml")`).
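As a closing reference, the CSV rules from Phase 5 can be wrapped in one stdlib-only helper. This is a sketch, not part of the skill's required interface; the `write_csv` name and the `european` flag are illustrative:

```python
import csv

def write_csv(rows, path, european=False):
    """Write rows per the Phase 5 rules: utf-8-sig BOM for Excel, quoted
    fields, and a ';' delimiter for locales that use the decimal comma."""
    if not rows:
        raise ValueError("No records extracted — nothing to write.")
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=list(rows[0].keys()),
            delimiter=";" if european else ",",
            quoting=csv.QUOTE_ALL,  # safe when field values contain commas
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The return value feeds the "Records extracted" line of the Phase 6 summary, e.g. `count = write_csv(results, "/mnt/user-data/outputs/scraped-data.csv")`.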