# Crawl benchmark A **multi-page crawl** benchmark: turbo-surf's `Crawler` vs other open-source crawlers on a real, paginated, live site. For each engine we measure **throughput** (pages/s) and **correctness** (items extracted) on the *same* target, the *same* page cap, and counted with the *same* selector — so a fast crawler that misses content is exposed, not rewarded. > Distinct from `harness/competitive/`, which runs one Playwright *routine* across > engines and scores parity. This one is a bulk **crawl** race. ```sh npm run crawl-bench # both sets, 20 pages, 3 iters npm run crawl-bench:js # JS set only node harness/crawlers/run.mjs --set=nojs # non-JS set only node harness/crawlers/run.mjs --pages=10 --iters=5 ``` Needs **live network**. The page cap is hard-limited to ≤20 with a ~150ms per-request delay so we don't hammer toscrape.com. ## Two sets **Set A — non-JS** (`--set=nojs`): crawlers that fetch + parse HTML without running page JS. Target: `https://books.toscrape.com/` — a server-rendered, paginated catalog. Item metric: product titles (`.product_pod h3 a`). Compared against **turbo-surf (no-js)**. **Set B — JS** (`--set=js`): crawlers that execute page JS in a real engine. Target: `https://quotes.toscrape.com/js/` (+ `/js/page/N/`) — quotes are built client-side via `document.write` + jQuery, so a **non-JS crawler extracts ~0 quotes** while a JS crawler gets 10/page. Item metric: `.quote .text`. turbo-surf runs the page JS in its V8 isolate. ## Engines (auto-detected) turbo-surf runs whenever the native addon is built (`cargo build --release -p turbo-surf-napi`). Every competitor is lazy-loaded; if its package isn't installed the row reads `skipped (not installed)`. | engine | set | needs | |---|---|---| | `turbo-surf (no-js)` | nojs | built addon | | `spider-rs (Rust)` | nojs | `@spider-rs/spider-rs` + `cheerio` | | `Scrapy (Python)` | nojs | `scrapy` on PATH (CLI subprocess) | | `Colly (Go)` | nojs | `go` on PATH (CLI subprocess) | | `crawlee CheerioCrawler` | nojs | `crawlee` | | `got + cheerio` (hand-rolled BFS) | nojs | `got` + `cheerio` | | `node-crawler (crawler)` | nojs | `crawler` | | `x-ray` | nojs | `x-ray` | | `turbo-surf (js)` | js | built addon | | `crawlee PlaywrightCrawler` | js | `crawlee` + `playwright` + browser | | `crawlee PuppeteerCrawler` | js | `crawlee` + `puppeteer` + browser | | `puppeteer-cluster` | js | `puppeteer-cluster` + browser | ## Installing the optional competitors We deliberately do **not** vendor these — install what you want to race against: ```sh # non-JS Node crawlers (incl. spider-rs — the Rust "fastest crawler" claimant) npm i -D @spider-rs/spider-rs crawlee cheerio got crawler x-ray # JS crawlers (also need a browser binary) npm i -D crawlee playwright puppeteer puppeteer-cluster npx playwright install chromium ``` ### Cross-language speed kings (CLI subprocess) The two crawlers most often cited as the absolute fastest — **Scrapy** (Python) and **Colly** (Go) — aren't `import`able into Node, so the harness shells out to them via a thin CLI wrapper (`scrapy_spider.py`, `colly_crawler.go`). Each runs the SAME same-host BFS, counts the SAME CSS selector, and prints `{pages,items}` on stdout; the harness times the wall-clock and detects them by probing the tool on `PATH` (skipped if absent). ```sh # Scrapy — isolated CLI on PATH (pipx keeps it off the system Python) brew install pipx && pipx install scrapy # Colly — Go toolchain; first run downloads colly into the module cache brew install go ( cd harness/crawlers && go mod tidy ) # one-time: fetch colly + write go.sum ``` > Subprocess engines pay a fixed per-invocation startup (Python import / Go > link) that dominates at small page caps — read their **pages/s at higher > `--pages`**, not the raw ms on a 4-page run. ## Output A table per set: `crawler | pages | items | median ms | pages/s`, turbo-surf rows flagged with `»`. Each engine is warmed once (untimed), then run `--iters` times; the reported time is the median. ## How it's wired - `targets.mjs` — the two targets + the shared `itemSelector`/`itemAttr` so every crawler counts items identically. - `crawlers.mjs` — the registry. Each entry is `{ name, set, available(), crawl(target, opts) → { pages, items, ms } }`. turbo-surf entries run the whole crawl in Rust via the napi addon (`native.crawl` for no-js; a JS BFS over the render tier for the `js` set). before extraction. - `run.mjs` — the CLI/runner: detect, warm, time, report. Add a competitor: append an entry to `CRAWLERS` in `crawlers.mjs` with an `available()` probe and a `crawl()` that returns `{ pages, items, ms }`.