# MarkCrawl Recipes Runnable examples for common crawl + extract + RAG tasks. All commands use the public `markcrawl` CLI installed by `pip install markcrawl`. See [../README.md](../README.md) for installation and core concepts. ## Table of contents - [Single-page scrapes](#single-page-scrapes) - [Scrape a single page](#scrape-a-single-page) - [Scrape a single JS-rendered page](#scrape-a-single-js-rendered-page) - [Crawling a whole site](#crawling-a-whole-site) - [Crawl a docs site](#crawl-a-docs-site) - [Crawl a subsection without sitemap wandering](#crawl-a-subsection-without-sitemap-wandering) - [Back up a blog before it shuts down](#back-up-a-blog-before-it-shuts-down) - [Resume an interrupted crawl](#resume-an-interrupted-crawl) - [Filtering URLs](#filtering-urls) - [Skip junk pages](#skip-junk-pages) - [Crawl only specific sections](#crawl-only-specific-sections) - [Preview URLs before committing to a long crawl](#preview-urls-before-committing-to-a-long-crawl) - [Safe crawl of a job board](#safe-crawl-of-a-job-board) - [Smart-sample a large site](#smart-sample-a-large-site) - [Extraction and dedup](#extraction-and-dedup) - [Choose an extraction backend](#choose-an-extraction-backend) - [Crawl high-value pages first](#crawl-high-value-pages-first) - [Skip pages you've already crawled](#skip-pages-youve-already-crawled) - [Binary downloads and assets](#binary-downloads-and-assets) - [Download images alongside content](#download-images-alongside-content) - [Selectively download referenced binaries (PDF, DOCX)](#selectively-download-referenced-binaries-pdf-docx) - [Capture page screenshots](#capture-page-screenshots) - [Multi-site workflows](#multi-site-workflows) - [Discover seed URLs and fan out across sites](#discover-seed-urls-and-fan-out-across-sites) - [End-to-end use cases](#end-to-end-use-cases) - [Competitive analysis (crawl 3 competitors, extract pricing)](#competitive-analysis-crawl-3-competitors-extract-pricing) - [Docs site → RAG chatbot](#docs-site--rag-chatbot) - [API docs → code generation prompt](#api-docs--code-generation-prompt) --- ## Single-page scrapes ### Scrape a single page ```bash markcrawl --base https://example.com/pricing --no-sitemap --max-pages 1 ``` ### Scrape a single JS-rendered page For React, Vue, YouTube, and similar JS-rendered pages: ```bash markcrawl --base "https://www.youtube.com/@channel/videos" \ --no-sitemap --max-pages 1 --render-js # → outputs one .md file with video titles, view counts, and dates ``` For infinite-scroll pages like YouTube, this captures the first ~28 videos from the initial render. --- ## Crawling a whole site ### Crawl a docs site ```bash markcrawl --base https://docs.example.com --max-pages 500 --concurrency 5 --show-progress ``` ### Crawl a subsection without sitemap wandering Large sites (YouTube, GitHub, etc.) have sitemaps with thousands of unrelated pages. Use `--no-sitemap` to crawl only from your target URL: ```bash markcrawl --base https://docs.example.com/guides \ --no-sitemap --max-pages 50 --show-progress ``` ### Back up a blog before it shuts down ```bash markcrawl --base https://engineering.example.com/blog \ --no-sitemap --max-pages 1000 --concurrency 5 --out ./blog-archive --show-progress # → every post saved as clean Markdown with citations and access dates ``` ### Resume an interrupted crawl ```bash markcrawl --base https://docs.example.com --out ./docs --resume --show-progress ``` --- ## Filtering URLs ### Skip junk pages Exclude job listings, login walls, SEO spam: ```bash markcrawl --base https://example.com \ --exclude-path "/job/*" --exclude-path "/careers/*" --exclude-path "/login" \ --max-pages 500 --out ./output --show-progress ``` ### Crawl only specific sections Blog + pricing, ignore everything else: ```bash markcrawl --base https://example.com \ --include-path "/blog/*" --include-path "/pricing" \ --max-pages 200 --out ./output --show-progress ``` ### Preview URLs before committing to a long crawl ```bash markcrawl --base https://example.com --dry-run # → prints every URL that would be crawled (from sitemap), then exits # Pipe to wc -l to get a count, or grep to check for junk patterns markcrawl --base https://example.com --dry-run | wc -l markcrawl --base https://example.com --dry-run | grep "/job/" ``` ### Safe crawl of a job board Dry-run first, then exclude: ```bash # Step 1: see what you'd get markcrawl --base https://tealhq.com --dry-run | head -50 # Step 2: exclude the job listings, crawl just the content pages markcrawl --base https://tealhq.com \ --exclude-path "/job/*" --exclude-path "/resume-examples/*" \ --max-pages 200 --out ./tealhq --show-progress ``` ### Smart-sample a large site For e-commerce, job boards, real estate — sample N pages per templated cluster instead of crawling thousands: ```bash # Preview the pattern clusters first markcrawl --base https://bigsite.com --dry-run --smart-sample --show-progress # Crawl with sampling — 5 pages per templated cluster instead of thousands markcrawl --base https://bigsite.com --out ./bigsite \ --smart-sample --sample-size 5 --sample-threshold 20 --show-progress ``` --- ## Extraction and dedup ### Choose an extraction backend ```bash # Default (BS4 + markdownify) — fastest, good for most sites markcrawl --base https://docs.example.com --out ./output --show-progress # Ensemble — runs default + trafilatura, picks best per page markcrawl --base https://docs.example.com --out ./output --extractor ensemble --show-progress # ReaderLM-v2 — ML-based extraction (uses the bundled torch + transformers stack since v0.10.1) markcrawl --base https://docs.example.com --out ./output --extractor readerlm --show-progress ``` ### Crawl high-value pages first Link prioritization scores discovered links by predicted content value: ```bash markcrawl --base https://docs.example.com --out ./docs \ --prioritize-links --max-pages 100 --show-progress # Prioritizes content-rich pages (guides, docs) over low-value ones (legal, login) ``` ### Skip pages you've already crawled Cross-crawl dedup — only fetches new or changed pages: ```bash # First crawl markcrawl --base https://docs.example.com --out ./docs --show-progress # Later — only fetches new/changed pages markcrawl --base https://docs.example.com --out ./docs --cross-dedup --show-progress ``` --- ## Binary downloads and assets ### Download images alongside content Photography blogs, product pages: ```bash # Crawl a photography blog and save images from the content area markcrawl --base https://photography-blog.example.com --out ./photos \ --download-images --max-pages 50 --show-progress # Output: # ./photos/assets/mountain-abc123.jpg # ./photos/assets/sunset-def456.png # ./photos/post-1__a1b2c3.md ← Markdown with ![alt](assets/filename.ext) refs # ./photos/pages.jsonl ← index includes "images" array per page # Adjust minimum image size to skip thumbnails (default: 5000 bytes) markcrawl --base https://example.com/gallery --out ./gallery \ --download-images --min-image-size 20000 --show-progress ``` ### Selectively download referenced binaries (PDF, DOCX) *New in v0.11.0.* Stream-download referenced files with size + content-type guards and a pre-fetch filter callback: ```python # Crawl an aggregator site and harvest only the resume templates # (skips privacy policies, ToS, marketing PDFs by anchor + URL signal). from markcrawl import crawl from markcrawl.filters import is_likely_resume result = crawl( base_url="https://example.com/templates", out_dir="./resumes", download_types=["pdf", "docx"], # opt-in; default None = no downloads download_filter=is_likely_resume, # pre-fetch — rejected URLs never fetched download_max_files=200, # cap per crawl download_max_size_mb=25, # per-file cap (streaming) ) print(f"Saved {result.downloads_count} files, {result.downloads_bytes/1e6:.1f} MB") # Output: # ./resumes/downloads/cv-template-1__a1b2c3.pdf # ./resumes/downloads/cover-letter-2__d4e5f6.docx # ./resumes/pages.jsonl ← each page's row gets "downloads": [{url, path, ...}] ``` Filters are pure functions of `DownloadCandidate(url, anchor_text, parent_url, parent_title, extension)` — compose your own with the bundled starters: ```python from markcrawl.filters import is_likely_resume, exclude_legal_boilerplate # Combine a positive selector with the negative one: def my_filter(c): return is_likely_resume(c) and exclude_legal_boilerplate(c) and "spam" not in c.url crawl(..., download_types=["pdf"], download_filter=my_filter) ``` Bundled filters (`markcrawl.filters`): `is_likely_resume`, `is_likely_paper`, `exclude_legal_boilerplate`. Best-effort heuristics, not classifiers — test against your real corpus before relying on them. ### Capture page screenshots Dashboards, data visualisations, JS-rendered charts: ```bash # Full-page screenshot of every crawled page (auto-enables --render-js) markcrawl --base https://steamcharts.com/top --out ./dash \ --screenshot --max-pages 5 --show-progress # Output: # ./dash/screenshots/top-abc123def456.png ← 1920-wide full-page PNG # ./dash/pages.jsonl ← each row gets "screenshot": "screenshots/..." # Crop to just the dashboard region, JPEG for smaller files, longer wait for slow charts markcrawl --base https://example.com/dashboards --out ./dash \ --screenshot --screenshot-selector ".dashboard-main" \ --screenshot-format jpeg --screenshot-wait-ms 3000 --show-progress ``` The screenshot path loads with `wait_until="load"` and then pauses `--screenshot-wait-ms` (default 1500 ms) before capturing, so canvas/SVG charts have time to render. (`networkidle` is deliberately avoided — many real sites never idle due to analytics pings.) Failures are recorded in the JSONL row as `screenshot_error` rather than aborting the crawl. --- ## Multi-site workflows ### Discover seed URLs and fan out across sites Use a bundled curated seed pack, then crawl every site: ```bash # Use a bundled curated seed pack, then crawl every site with screenshots markcrawl discover --pack game-dashboards | \ markcrawl --seed-file - --out ./dashboards \ --screenshot --max-pages-per-site 5 --show-progress # List available packs markcrawl discover --list-packs ``` Output is organised per-site: `./dashboards//pages.jsonl` plus `screenshots/` under each. See the full recipe (including a YouTube frame-extraction path using `yt-dlp` + `ffmpeg`) at [recipes/game-dashboards.md](recipes/game-dashboards.md). --- ## End-to-end use cases ### Competitive analysis (crawl 3 competitors, extract pricing) ```bash markcrawl --base https://competitor-one.com/pricing --no-sitemap --max-pages 1 --out ./comp1 markcrawl --base https://competitor-two.com/pricing --no-sitemap --max-pages 1 --out ./comp2 markcrawl --base https://competitor-three.com/pricing --no-sitemap --max-pages 1 --out ./comp3 markcrawl-extract \ --jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl ./comp3/pages.jsonl \ --fields pricing_tiers features free_trial --show-progress # → extracted.jsonl with structured pricing data across all three ``` ### Docs site → RAG chatbot Full pipeline: crawl, embed, query — **$0 in API charges** thanks to the local embedder default since v0.10.1: ```bash pip install markcrawl markcrawl[upload] # base install bundles the local embedder markcrawl --base https://docs.example.com --out ./docs --max-pages 500 --concurrency 5 --show-progress markcrawl-upload --jsonl ./docs/pages.jsonl --show-progress # → pages chunked + embedded locally with mxbai-embed-large-v1, uploaded to Supabase/pgvector # Wire your chatbot to query the vector table — see docs/SUPABASE.md ``` To use OpenAI embeddings instead (e.g. for parity with an existing index), set `MARKCRAWL_EMBEDDER=text-embedding-3-small` or pass `embedding_model=...` to `upload(...)` / `markcrawl-upload`. ### API docs → code generation prompt ```bash markcrawl --base https://api.example.com/docs --out ./api-docs --max-pages 200 --show-progress # Feed the output to an LLM: # "Using the API documentation in ./api-docs/pages.jsonl, generate a # typed Python client with methods for each endpoint." ```