<!-- AUTO-GENERATED by sync_markcrawl.py — do not edit manually -->
# MarkCrawl Benchmarks

> **Summary:** Across 6 open-source crawlers tested on 8 sites, MarkCrawl produces the cleanest output (53 words of nav pollution vs 500+ for others) and the lowest total RAG pipeline cost at every scale tested.
>
> **Where MarkCrawl is not first:** Speed is 2nd (2.7 pages/sec). Answer quality is 6th (3.77/5, crawl4ai leads at 4.72). Retrieval Hit@5 is 6th (42% vs 87% for crawl4ai-raw). Content recall is 6th (22% vs 70% for crawlee).

*Last run: May 2026. Reproducible via [llm-crawler-benchmarks](https://github.com/AIMLPM/llm-crawler-benchmarks).*

---

## Speed

| Tool | Pages/sec |
|---|---|
| scrapy+md | 5.0 |
| **markcrawl** | **2.7** |
| playwright | 2.5 |
| crawl4ai | 1.4 |
| crawlee | 1.3 |
| colly+md | 1.0 |

MarkCrawl uses native async I/O (httpx) with concurrent fetching and process-pool HTML extraction. Playwright-based tools (crawl4ai, crawlee) are inherently slower due to full browser rendering per page.
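The pattern described above (concurrent fetching, with extraction handed to a worker pool) can be sketched roughly as follows. This is a minimal illustration and not MarkCrawl's actual code: `fake_fetch`, `extract_text`, and `crawl` are hypothetical names, the fetcher only simulates latency instead of calling httpx, and the tag-stripping regex stands in for real HTML extraction.

```python
import asyncio
import re
from concurrent.futures import Executor
from typing import Awaitable, Callable, Optional

TAG_RE = re.compile(r"<[^>]+>")


def extract_text(html: str) -> str:
    """CPU-bound stand-in for HTML extraction: strip tags, collapse whitespace."""
    return " ".join(TAG_RE.sub(" ", html).split())


async def fake_fetch(url: str) -> str:
    """Hypothetical fetcher; a real crawler would issue an async HTTP request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"<html><body><h1>{url}</h1><p>Hello</p></body></html>"


async def crawl(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    executor: Optional[Executor] = None,
    max_concurrency: int = 8,
) -> list[str]:
    """Fetch pages concurrently, then extract each one off the event loop."""
    loop = asyncio.get_running_loop()
    limit = asyncio.Semaphore(max_concurrency)

    async def handle(url: str) -> str:
        async with limit:  # cap the number of in-flight requests
            html = await fetch(url)
        # executor=None falls back to the loop's default thread pool.
        return await loop.run_in_executor(executor, extract_text, html)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(handle(u) for u in urls))


print(asyncio.run(crawl(["a", "b", "c"], fake_fetch)))
```

To move extraction into separate processes, as the text describes, pass a `concurrent.futures.ProcessPoolExecutor` as `executor`; the extraction function must then be defined at module level so it can be pickled.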

## Output cleanliness

| Tool | Nav pollution (words) | Recall |
|---|---|---|
| **markcrawl** | **53** | **22%** |
| scrapy+md | 500 | 11% |
| crawl4ai | 545 | 56% |
| playwright | 1995 | 68% |
| crawlee | 2326 | 70% |
| colly+md | 2424 | 50% |

Nav pollution = boilerplate words (navigation, footer, cookie banners) that leak into extracted content. Lower is better — less junk means cleaner embeddings and fewer wasted tokens.
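As a rough illustration of the metric, the sketch below counts how many words of an extracted page come from known boilerplate phrases. It assumes boilerplate is detected by case-insensitive phrase matching; the benchmark's actual detector may work differently, and `count_nav_pollution` is a hypothetical helper, not part of MarkCrawl.

```python
import re


def count_nav_pollution(extracted: str, boilerplate_phrases: list[str]) -> int:
    """Count words in `extracted` that come from known boilerplate phrases.

    Assumption: boilerplate is identified by case-insensitive phrase match;
    the benchmark's real detector may work differently.
    """
    text = extracted.lower()
    polluted_words = 0
    for phrase in boilerplate_phrases:
        tokens = phrase.lower().split()
        if not tokens:
            continue
        # Match on word boundaries, tolerating any whitespace between words.
        pattern = r"\b" + r"\s+".join(re.escape(t) for t in tokens) + r"\b"
        polluted_words += len(tokens) * len(re.findall(pattern, text))
    return polluted_words


page = "Home About Us Log in\nInstall the package with pip.\nAccept cookies"
phrases = ["home", "about us", "log in", "accept cookies"]
print(count_nav_pollution(page, phrases))  # 7 of the 12 words are boilerplate
```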

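The tradeoff: crawlee captures 70% of page content but includes ~2,326 words of boilerplate per page. MarkCrawl captures 22% with 53 words of pollution. For RAG pipelines, the cleaner output means fewer wasted tokens per chunk, but the lower recall likely also means some relevant content never reaches the index, which would be consistent with MarkCrawl's lower Hit@5 in the RAG answer quality table below.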

## RAG answer quality

| Tool | Chunks | Answer Quality (/5) | Hit@5 | Hit@20 |
|---|---|---|---|---|
| crawl4ai | 24,400 | 4.72 | 86% | 95% |
| **markcrawl** | **27,193** | **3.77** | **42%** | **45%** |
| scrapy+md | 46,141 | 3.68 | 21% | 22% |
| playwright | 56,855 | 4.48 | 87% | 94% |
| crawlee | 58,912 | 4.68 | 87% | 94% |
| colly+md | 59,078 | 4.36 | 51% | 56% |

*FireCrawl's self-hosted version did not complete crawls on all sites across multiple attempts. Its scores are on a reduced set and are not directly comparable to tools that completed all sites.*

**Reading this table:**
- **Chunks** — total chunks across all sites. Fewer = less redundancy, lower embedding costs.
- **Answer Quality** — LLM-judged score for answers generated from retrieved chunks.
- **Hit@5 / Hit@20** — what percentage of queries find a relevant chunk in the top 5 or 20 results.
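
Hit@k as defined in the list above can be computed along these lines. This is a minimal sketch: the chunk IDs and relevance labels are made-up illustration data, and `hit_at_k` is a hypothetical helper rather than the benchmark's own scoring code.

```python
def hit_at_k(ranked_results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k results."""
    if not ranked_results:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for ranked, rel in zip(ranked_results, relevant)
        if any(chunk_id in rel for chunk_id in ranked[:k])
    )
    return hits / len(ranked_results)


# Illustration data: one ranked list of chunk IDs per query,
# plus the set of chunk IDs judged relevant for that query.
ranked = [["c1", "c9", "c4"], ["c7", "c2"], ["c3"], ["c8", "c5", "c6"]]
relevant = [{"c4"}, {"c2"}, {"c0"}, {"c6"}]

for k in (1, 2, 3):
    print(f"Hit@{k} = {hit_at_k(ranked, relevant, k):.0%}")
```

A query counts as a hit as soon as one relevant chunk appears in the top k, so Hit@20 can never be lower than Hit@5 for the same results.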

**Fewer chunks = lower cost.** Each chunk requires an embedding call and vector storage. MarkCrawl produces 2.2x fewer chunks than colly+md (27,193 vs 59,078) from the same sites, cutting embedding and storage costs accordingly.

## Total cost of ownership

Annual cost estimate for a complete RAG pipeline: crawling + embedding + vector storage + query-time retrieval.

| Tool | 1K pages, 100 q/day | 100K pages, 1K q/day | 1M pages, 10K q/day |
|---|---|---|---|
| **markcrawl** | **$341** | **$4,505** | **$45,055** |
| scrapy+md | $409 | $5,464 | $54,640 |
| crawl4ai | $513 | $6,960 | $69,596 |
| colly+md | $516 | $7,213 | $72,129 |
| playwright | $517 | $7,320 | $73,202 |
| crawlee | $518 | $7,467 | $74,673 |

MarkCrawl's cost advantage comes from chunk efficiency: fewer, cleaner chunks from the same sites mean fewer embedding API calls and less vector storage. The total cost difference between the cheapest (markcrawl) and most expensive (crawlee) tools is $2,962/year at 100K pages.
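
The gap quoted above can be re-derived from the table. The sketch below copies the table's annual figures and reports the cheapest tool, the most expensive tool, and the difference at each scale; `cost_gap` is a hypothetical helper used only for this arithmetic check.

```python
# Annual cost figures (USD) copied from the table above.
SCALES = ["1K pages, 100 q/day", "100K pages, 1K q/day", "1M pages, 10K q/day"]
ANNUAL_COST = {
    "markcrawl": [341, 4505, 45055],
    "scrapy+md": [409, 5464, 54640],
    "crawl4ai": [513, 6960, 69596],
    "colly+md": [516, 7213, 72129],
    "playwright": [517, 7320, 73202],
    "crawlee": [518, 7467, 74673],
}


def cost_gap(scale_index: int) -> tuple[str, str, int]:
    """Return (cheapest tool, most expensive tool, dollar gap) at one scale."""
    by_cost = sorted(ANNUAL_COST, key=lambda tool: ANNUAL_COST[tool][scale_index])
    cheapest, priciest = by_cost[0], by_cost[-1]
    gap = ANNUAL_COST[priciest][scale_index] - ANNUAL_COST[cheapest][scale_index]
    return cheapest, priciest, gap


for i, scale in enumerate(SCALES):
    cheapest, priciest, gap = cost_gap(i)
    print(f"{scale}: {cheapest} is ${gap:,}/year cheaper than {priciest}")
```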

## Why these numbers matter

For a RAG pipeline, the crawler is stage 1 — everything downstream (chunking, embedding, retrieval, LLM generation) depends on the quality of what the crawler produces.

- **Fewer chunks per page** = lower embedding costs, less vector DB storage, faster retrieval
- **Less nav pollution** = cleaner embeddings that match user queries instead of "Home | About | Login"
- **Higher answer quality** = the LLM gets better source material and produces more accurate answers

## Methodology

All benchmarks run on the same hardware, same sites, same queries, with reproducible scripts. No tool receives special treatment or configuration beyond its defaults. The full methodology, raw data, and reproduction instructions are in the [llm-crawler-benchmarks](https://github.com/AIMLPM/llm-crawler-benchmarks) repo.