Turn websites into clean markdown, JSON, and LLM-ready context. CLI, MCP server, REST API, and SDKs for AI agents and RAG pipelines.
---
Most web scraping tools give your agent one of two bad outputs:
- a blocked page, login wall, or empty app shell
- raw HTML full of nav, scripts, styling, ads, and duplicated boilerplate
[webclaw.io](https://webclaw.io) is the hosted web extraction API for webclaw. This repo contains the open-source CLI, MCP server, extraction engine, and self-hostable server.
webclaw turns a URL into clean content your tools can actually use.
```bash
webclaw https://example.com --format markdown
```
```md
# Example Domain
This domain is for use in illustrative examples in documents.
You may use this domain in literature without prior coordination or asking for permission.
```
Use it from the terminal, wire it into Claude/Cursor through MCP, call the hosted API from your app, or self-host the OSS server.
---
## Install
### Agent setup
The fastest way to connect webclaw to Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, and other MCP-compatible tools:
```bash
npx create-webclaw
```
The installer detects supported clients and configures the MCP server for you.
### Homebrew
```bash
brew tap 0xMassi/webclaw
brew install webclaw
```
### Prebuilt binaries
Download macOS and Linux binaries from [GitHub Releases](https://github.com/0xMassi/webclaw/releases).
### Docker
```bash
docker run --rm ghcr.io/0xmassi/webclaw https://example.com
```
### Cargo
```bash
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-cli
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-mcp
```
If building from source fails because native build tools are missing, install the platform prerequisites:
| OS | Command |
| --- | --- |
| Debian / Ubuntu | `sudo apt install -y pkg-config libssl-dev cmake clang git build-essential` |
| Fedora / RHEL | `sudo dnf install -y pkg-config openssl-devel cmake clang git make gcc` |
| Arch | `sudo pacman -S pkg-config openssl cmake clang git base-devel` |
| macOS | `xcode-select --install` |
---
## Quick Start
### Scrape one page
```bash
webclaw https://stripe.com --format markdown
```
### Return LLM-optimized text
```bash
webclaw https://docs.anthropic.com --format llm
```
### Keep only the main content
```bash
webclaw https://example.com/blog/post --only-main-content
```
### Include or exclude selectors
```bash
webclaw https://example.com \
  --include "article, main, .content" \
  --exclude "nav, footer, .sidebar, .ad"
```
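
When webclaw is driven from a script rather than typed by hand, it can help to assemble the argument list programmatically. The following is a minimal Python sketch that uses only flags shown in this README (`--format`, `--include`, `--exclude`). The helper names `build_scrape_command` and `run_scrape` are hypothetical, and combining `--format` with the selector flags in one call is an assumption, not documented behavior.

```python
import shlex
import subprocess


def build_scrape_command(url, fmt=None, include=None, exclude=None):
    """Build an argv list for the webclaw CLI (hypothetical helper)."""
    cmd = ["webclaw", url]
    if fmt:
        cmd += ["--format", fmt]
    if include:
        # Selectors are joined into one comma-separated value, as in the example above.
        cmd += ["--include", ", ".join(include)]
    if exclude:
        cmd += ["--exclude", ", ".join(exclude)]
    return cmd


def run_scrape(url, **options):
    """Run webclaw and return its stdout; requires webclaw on PATH."""
    result = subprocess.run(
        build_scrape_command(url, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


cmd = build_scrape_command(
    "https://example.com",
    include=["article", "main", ".content"],
    exclude=["nav", "footer", ".sidebar", ".ad"],
)
print(shlex.join(cmd))  # shell-quoted form of the same command
```

Passing a list to `subprocess.run` avoids shell quoting problems with selectors that contain spaces or commas.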
### Crawl a documentation site
```bash
webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
```
### Workflow examples
- [HTML to Markdown for RAG](examples/html-to-markdown-rag/)
- [Firecrawl-compatible API](examples/firecrawl-compatible-api/)
- [MCP web scraping](examples/mcp-web-scraping/)
- [Proxy-backed crawling](examples/proxy-backed-crawling/)
- [Cloudflare diagnostics](examples/cloudflare-diagnostics/)
### Extract brand assets
```bash
webclaw https://github.com --brand
```
### Compare a page over time
```bash
webclaw https://example.com/pricing --format json > pricing-old.json
webclaw https://example.com/pricing --diff-with pricing-old.json
```
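
To illustrate what comparing two snapshots means in principle, here is a small Python sketch built on the standard library's `difflib`. It is not webclaw's diff algorithm or output format, which this README does not describe. The pricing text is made-up sample data and `diff_snapshots` is a hypothetical helper.

```python
import difflib


def diff_snapshots(old_text, new_text):
    """Return unified-diff lines between two text snapshots of a page."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


# Made-up snapshots standing in for two extractions of the same page.
old = "# Pricing\nPro plan: $20/month\nTeam plan: $50/month"
new = "# Pricing\nPro plan: $25/month\nTeam plan: $50/month"

for line in diff_snapshots(old, new):
    print(line)
```

Diffing extracted content instead of raw HTML keeps markup-only changes, such as rotated class names or script tags, out of the result.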
---
## MCP Server
webclaw ships with an MCP server for AI agents.
```bash
npx create-webclaw
```
Manual config:
```json
{
  "mcpServers": {
    "webclaw": {
      "command": "~/.webclaw/webclaw-mcp"
    }
  }
}
```
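
Some MCP clients start the configured command without going through a shell, in which case `~` may not be expanded. If the server fails to start with the config above, one option is to write the absolute path instead. This Python sketch generates the same config with the path expanded. The `mcp_config` helper is hypothetical, and the default install location is taken from the manual config above.

```python
import json
from pathlib import Path


def mcp_config(binary="~/.webclaw/webclaw-mcp"):
    """Return the manual MCP config with the binary path made absolute."""
    command = str(Path(binary).expanduser())
    return {"mcpServers": {"webclaw": {"command": command}}}


# Paste the printed JSON into your MCP client's config file.
print(json.dumps(mcp_config(), indent=2))
```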
Then ask your agent things like:
```text
Scrape these competitor pricing pages and summarize the differences.
```
```text
Crawl this documentation site and prepare clean context for a RAG index.
```
```text
Extract the brand colors, fonts, and logos from this company website.
```
---
## Tools
| Tool | What it does | Local |
| --- | --- | :-: |
| `scrape` | Extract one URL as markdown, text, JSON, LLM format, or HTML | Yes |
| `crawl` | Follow same-origin links and extract discovered pages | Yes |
| `map` | Discover URLs without extracting every page | Yes |
| `batch` | Scrape multiple URLs in parallel | Yes |
| `extract` | Convert page content into structured data | Yes, with local or configured LLM |
| `summarize` | Summarize a page | Yes, with local or configured LLM |
| `diff` | Compare page content snapshots | Yes |
| `brand` | Extract colors, fonts, logos, and metadata | Yes |
| `search` | Search the web and scrape results | Hosted API |
| `research` | Multi-source research workflow | Hosted API |
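
The `crawl` row mentions same-origin links. This README does not spell out how webclaw decides that, so the Python sketch below assumes the usual web definition: two URLs share an origin when scheme, host, and port all match. The `origin` and `same_origin_links` helpers are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def origin(url):
    """Return (scheme, host, port) with default ports filled in."""
    parts = urlsplit(url)
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    return (parts.scheme, parts.hostname, parts.port or default_port)


def same_origin_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep those with the same origin."""
    base = origin(page_url)
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if origin(absolute) == base and absolute not in kept:
            kept.append(absolute)
    return kept


links = same_origin_links(
    "https://docs.example.com/guide/",
    [
        "intro",                             # relative link, kept
        "/api",                              # root-relative link, kept
        "https://docs.example.com:443/faq",  # explicit default port, kept
        "https://example.com/",              # different host, dropped
        "mailto:hi@example.com",             # different scheme, dropped
    ],
)
print(links)
```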
---
## SDKs
```bash
npm install @webclaw/sdk
pip install webclaw
go get github.com/0xMassi/webclaw-go
```
TypeScript
```ts
import { Webclaw } from "@webclaw/sdk";

const client = new Webclaw({ apiKey: process.env.WEBCLAW_API_KEY! });

const page = await client.scrape({
  url: "https://example.com",
  formats: ["markdown"],
  only_main_content: true,
});

console.log(page.markdown);
```
Python
```python
from webclaw import Webclaw

client = Webclaw(api_key="wc_your_key")

page = client.scrape(
    "https://example.com",
    formats=["markdown"],
    only_main_content=True,
)

print(page.markdown)
```
cURL
```bash
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'
```
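
If you prefer not to install an SDK, the cURL request above can be mirrored with Python's standard library. This sketch only builds the request, using the endpoint, headers, and body fields from the cURL example. The `build_scrape_request` helper is hypothetical, and the response shape is not documented here, so the send step is left as a comment.

```python
import json
import os
import urllib.request

API_URL = "https://api.webclaw.io/v1/scrape"  # endpoint from the cURL example


def build_scrape_request(url, api_key, formats=("markdown",), only_main_content=True):
    """Build (but do not send) the POST request shown in the cURL example."""
    payload = {
        "url": url,
        "formats": list(formats),
        "only_main_content": only_main_content,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_scrape_request(
    "https://example.com",
    api_key=os.environ.get("WEBCLAW_API_KEY", "wc_your_key"),
)
print(request.get_method(), request.full_url)

# To send it (needs a valid key and network access):
# with urllib.request.urlopen(request, timeout=30) as response:
#     data = json.load(response)
```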
---
## Output Formats
| Format | Use it when you need |
| --- | --- |
| `markdown` | Clean page content with structure preserved |
| `llm` | Compact context for agents and RAG pipelines |
| `text` | Plain text with minimal formatting |
| `json` | Structured metadata, links, images, and extracted fields |
| `html` | Cleaned HTML for custom processing |
---
## Local First, Hosted When Needed
The CLI and MCP server work locally without an account for the core extraction path.
Use the hosted API at [webclaw.io](https://webclaw.io) when you need:
- protected-site access without managing infrastructure
- JavaScript rendering
- async crawl and research jobs
- web search
- watches and production usage tracking
- SDKs for application code
```bash
export WEBCLAW_API_KEY=wc_your_key
webclaw https://example.com --cloud
```
---
## What You Can Build
| Use case | Example |
| --- | --- |
| AI agent web access | Give Claude, Cursor, or another MCP client clean page context |
| RAG ingestion | Crawl docs, help centers, blogs, and knowledge bases |
| Competitor monitoring | Track pricing pages, changelogs, docs, and product pages |
| Structured extraction | Turn messy pages into typed JSON for automations |
| Research workflows | Search, scrape, summarize, and cite multiple sources |
| Brand intelligence | Extract logos, colors, fonts, and social metadata |
---
## Architecture
```text
webclaw/
  crates/
    webclaw-core     HTML to markdown, text, JSON, and LLM-ready output
    webclaw-fetch    Fetching, crawling, batching, and mapping
    webclaw-llm      Local and hosted LLM provider support
    webclaw-pdf      PDF text extraction
    webclaw-mcp      MCP server for AI agents
    webclaw-cli      Command-line interface
`webclaw-core` is pure extraction logic: no network I/O, small surface area, and usable independently from the fetching layer.
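
The value of keeping extraction free of network I/O is that it becomes a plain function from an HTML string to cleaned output, which is easy to test with fixtures. The toy Python sketch below illustrates that shape only. It is not the crate's implementation or algorithm, and `SKIPPED_TAGS`, `MainTextExtractor`, and `extract_text` are hypothetical names.

```python
from html.parser import HTMLParser

# Tags whose contents this toy extractor treats as boilerplate.
SKIPPED_TAGS = {"nav", "script", "style", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Pure function: HTML string in, cleaned text out, no network access."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


html = """
<html><body>
<nav><a href="/">Home</a></nav>
<main><h1>Example Domain</h1><p>This domain is for use in examples.</p></main>
<script>track()</script>
<footer>Copyright</footer>
</body></html>
"""
print(extract_text(html))
```

Because the function never fetches anything, a saved HTML file is enough to reproduce an extraction problem.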
---
## Configuration
| Variable | Description |
| --- | --- |
| `WEBCLAW_API_KEY` | Hosted API key |
| `OLLAMA_HOST` | Ollama URL for local LLM features |
| `OPENAI_API_KEY` | OpenAI-compatible LLM provider key |
| `OPENAI_BASE_URL` | OpenAI-compatible base URL |
| `ANTHROPIC_API_KEY` | Anthropic-compatible LLM provider key |
| `ANTHROPIC_BASE_URL` | Anthropic-compatible base URL |
| `WEBCLAW_PROXY` | Single proxy URL |
| `WEBCLAW_PROXY_FILE` | Proxy pool file |
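
When debugging a setup, it can be useful to see which of these variables are set without printing secrets. The Python sketch below reads the variable names from the table above and masks key values. The helpers `mask` and `config_report` are hypothetical. It does not reflect how webclaw itself reads or prioritizes these variables.

```python
import os

# Variable names copied from the table above.
CONFIG_VARS = [
    "WEBCLAW_API_KEY",
    "OLLAMA_HOST",
    "OPENAI_API_KEY",
    "OPENAI_BASE_URL",
    "ANTHROPIC_API_KEY",
    "ANTHROPIC_BASE_URL",
    "WEBCLAW_PROXY",
    "WEBCLAW_PROXY_FILE",
]


def mask(name, value):
    """Hide all but the last four characters of key-like values."""
    if name.endswith("_KEY") and len(value) > 4:
        return "*" * (len(value) - 4) + value[-4:]
    return value


def config_report(env=None):
    """Map each known variable to a masked value, or None when unset."""
    env = os.environ if env is None else env
    return {
        name: mask(name, env[name]) if env.get(name) else None
        for name in CONFIG_VARS
    }


for name, value in config_report().items():
    print(f"{name}: {value if value is not None else '(unset)'}")
```

Only `*_KEY` values are masked here. A proxy URL can also embed credentials, so review the output before sharing it.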
---
## Contributing
The most useful contributions right now are practical and small:
- add examples for real agent and RAG workflows
- improve SDK snippets
- report pages that extract poorly
- add failing fixtures for messy HTML
- improve docs for MCP clients and local setup
- test the CLI on more Linux/macOS environments
Good first places to start:
- [Good first issues](https://github.com/0xMassi/webclaw/issues?q=label%3A%22good+first+issue%22)
- [Open a bug report](https://github.com/0xMassi/webclaw/issues/new)
- [Start a discussion](https://github.com/0xMassi/webclaw/discussions)
If a page extracts badly, include:
```text
URL:
Command or API request:
Expected output:
Actual output:
Format used: markdown / llm / text / json / html
CLI, MCP, SDK, or API:
```
Please remove secrets, cookies, private tokens, and customer data from logs before posting.
---
## Infrastructure Partner
ColdProxy supports webclaw as an Infrastructure Partner, providing residential IPv4,
residential IPv6, and datacenter IPv6 proxy infrastructure across 195+ countries for public data
collection, regional testing, monitoring, and web scraping workflows. Explore
ColdProxy's latest plans and available offers directly on the website.
---
## Studio Partners
Quantum Proxies provides fast, reliable residential and ISP proxy infrastructure for developers running large-scale extraction workloads.
Get 20% off any plan with code WEBCLAW20 at quantumproxies.net.

Proxy-Seller maintains a global network of residential and datacenter proxies optimized for web extraction at scale.
The service supports high-volume concurrent scraping, geographic rotation, and integration with web extraction tools.
Use code WBC15 for 15% off IPv4, IPv6, ISP, and Residential proxies, and 10% off Mobile at proxy-seller.com.

RapidProxy delivers fast, reliable proxy infrastructure for large-scale data collection.
With 90M+ residential IPs, smart rotation, high concurrency, AI-powered CAPTCHA bypass, and non-expiring traffic, it helps keep scraping workflows stable at scale.
Use code webclaw for 10% off, or try it free.
---
## Community Plugins
Third-party plugins that integrate webclaw with AI agent platforms:
| Plugin | Platform | What it does |
|---|---|---|
| [openclaw-webclaw](https://github.com/jal-co/openclaw-webclaw) | [OpenClaw](https://openclaw.ai) | Native webclaw v1 API plugin with 9 tools: scrape, search, crawl, extract, summarize, diff, map, batch, brand |
| [hermes-webclaw](https://github.com/jal-co/hermes-webclaw) | [Hermes Agent](https://github.com/NousResearch/hermes-agent) | Web search provider and 9 dedicated tools for the full v1 API surface. Install with `hermes plugins install jal-co/hermes-webclaw` |
Built a webclaw integration? [Open a PR](https://github.com/0xMassi/webclaw/pulls) to add it here.
---
## Contributors
Thanks to everyone improving webclaw through issues, examples, docs, bug reports, and pull requests.
---
## License
[AGPL-3.0](LICENSE)