--- name: crawl4ai description: "Comprehensive guide for Crawl4AI - open-source LLM-friendly web crawler and scraper with 62k+ GitHub stars. Covers CLI commands (crwl), Python API, Docker API, deep crawling strategies (BFS, DFS, best-first), content extraction (CSS selectors, LLM-based extraction), markdown/json output, anti-bot bypass, and advanced features for RAG pipelines, data extraction, and batch URL processing." --- # Crawl4AI - Open Source Web Crawler > **Full CLI docs:** [docs.crawl4ai.com/core/cli/](https://docs.crawl4ai.com/core/cli/) > **Ready-to-use scripts:** See [`scripts/`](scripts/) folder > **Detailed SDK reference:** See [`references/`](references/) folder Crawl4AI is an open-source (Apache 2.0), LLM-friendly web crawler that transforms websites into clean Markdown or structured JSON. 62k+ GitHub stars. --- ## Verify Availability Assume Crawl4AI is already installed. Verify the interface you plan to use instead of repeating installation steps. ```bash # CLI present on PATH which crwl # CLI responds (there is no top-level --version flag) crwl crawl -h # Python package version when using the pipx interpreter ~/.local/pipx/venvs/crawl4ai/bin/python - <<'PY' from importlib.metadata import version print(version('crawl4ai')) PY # Docker health and version curl http://localhost:11235/health docker ps --filter "name=crawl4ai" --format "{{.Image}} {{.Status}}" ``` **Verified in session:** CLI at `/Users/Y_ALTAY1/.local/bin/crwl`, Python package `0.8.6` in the pipx interpreter, Docker health endpoint reporting version `0.8.5`. **Docker endpoints:** Dashboard `http://localhost:11235/dashboard` · Playground `http://localhost:11235/playground` · API `http://localhost:11235/crawl` ### Default Choice Use the interface that matches the job: | Situation | Default | |-----------|---------| | Agent automation, custom scripts, extraction pipelines | **Python SDK** | | Quick ad-hoc scrape from terminal | **CLI** | | Shared service, remote access, dashboard, batch API | **Docker API** | **Recommended default:** prefer the **Python SDK** when you are building repeatable automation or agent workflows. It gives more control than the CLI and avoids the operational overhead of Docker. ### Python Import Caveat If Crawl4AI was installed with `pipx`, plain `python3` usually will **not** import `crawl4ai` because `pipx` keeps the package in its own virtual environment. ```bash # CLI still works crwl https://example.com # But system python3 may not see the package python3 -c "import crawl4ai" # Use the pipx interpreter for SDK scripts ~/.local/pipx/venvs/crawl4ai/bin/python your_script.py ``` Use `python3` normally only when `crawl4ai` is installed in that exact Python environment. For this skill, prefer the pipx interpreter path when using the SDK. 
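If you want these checks in one place, the sketch below runs all three from Python. It assumes only the defaults shown above (port `11235`, the `/health` path); the filename is just a suggestion.

```python
# check_crawl4ai.py - a minimal sketch of the availability checks above
import shutil
import urllib.request
from importlib import metadata, util

# 1. CLI on PATH (same as `which crwl`)
print("crwl CLI:", shutil.which("crwl") or "not found")

# 2. Python package visible to *this* interpreter (see the pipx caveat above)
if util.find_spec("crawl4ai"):
    print("crawl4ai package:", metadata.version("crawl4ai"))
else:
    print("crawl4ai package: not importable from this interpreter")

# 3. Docker health endpoint (same as `curl http://localhost:11235/health`)
try:
    with urllib.request.urlopen("http://localhost:11235/health", timeout=5) as resp:
        print("Docker API:", resp.read().decode())
except OSError as exc:
    print("Docker API not reachable:", exc)
```

Run it with the pipx interpreter (`~/.local/pipx/venvs/crawl4ai/bin/python check_crawl4ai.py`) so step 2 reflects the environment you will actually use for SDK scripts.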
--- ## Ready-to-Use Scripts The [`scripts/`](scripts/) folder contains ready-to-use Python scripts: | Script | Usage | |--------|-------| | [`basic_crawler.py`](scripts/basic_crawler.py) | Simple markdown extraction | | [`batch_crawler.py`](scripts/batch_crawler.py) | Batch process multiple URLs | | [`extraction_pipeline.py`](scripts/extraction_pipeline.py) | Data extraction with auto-schema generation | ```bash # Run scripts python scripts/basic_crawler.py https://example.com python scripts/batch_crawler.py urls.txt python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products" ``` --- ## Docker API Features The Docker API provides advanced capabilities beyond the CLI: | Feature | Description | |---------|-------------| | **Batch processing** | Crawl 100s of URLs in a single request | | **Async queue** | Submit job, get task ID, check results later | | **Dashboard UI** | Visual monitoring at http://localhost:11235 | | **Remote deployment** | Run on server, access via API | | **Priority queuing** | Set priority for important jobs first | | **Better scaling** | Multiple containers, load balancing | | **Centralized caching** | Shared cache across requests | ### Docker API Endpoints ```bash # Health check curl http://localhost:11235/health # Submit crawl job curl -X POST http://localhost:11235/crawl \ -H "Content-Type: application/json" \ -d '{"urls": ["https://example.com"], "priority": 10}' # Get task status (if async) curl http://localhost:11235/crawl/{task_id} # Batch crawl multiple URLs curl -X POST http://localhost:11235/crawl \ -H "Content-Type: application/json" \ -d '{ "urls": [ "https://altin.doviz.com/gram-altin", "https://kur.doviz.com/serbest-piyasa/amerikan-dolari" ] }' ``` ### When to Use Docker vs CLI | Use Case | Best Choice | |----------|-------------| | Quick one-off scrape | CLI (`crwl`) | | Production monitoring | Docker API | | Batch URLs (100s) | Docker API | | Local dev/testing | CLI | | Schedule recurring jobs | Docker API | | LLM extraction with schema | Docker API | --- ## Quick CLI Reference ```bash # Basic crawl crwl https://example.com # Output formats: all, json, markdown, md-fit (markdown-fit) crwl https://example.com -o json crwl https://example.com -o markdown-fit # Output to file crwl https://example.com -o json -O output.json # Bypass cache (force fresh crawl) crwl https://example.com -bc # Deep crawl: bfs, dfs, best-first crwl https://docs.site.com --deep-crawl bfs --max-pages 10 # LLM Q&A - ask questions about page content crwl https://example.com -q "What is the main topic?" # LLM extraction - extract structured data with AI crwl https://example.com -j "extract all prices and dates" # JSON schema extraction - extract with CSS selectors crwl https://example.com -s schema.json -o json # With config files crwl https://example.com -B browser.yml -C crawler.yml # Browser parameters inline crwl https://example.com -b "headless=true,timeout=30000" # Crawler parameters (user-agent, etc.) crwl https://example.com -c "user_agent_mode=random" # Anti-bot bypass for protected sites (Reddit, etc.) crwl https://www.reddit.com/r/politics/hot/ -o markdown -c "user_agent_mode=random" ``` --- ## Python API **Use this as the default programmable interface.** It worked in session for both normal pages and protected Reddit URLs/search pages when using `user_agent_mode="random"`. 
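For batch URL lists from the SDK (the same job handled by `batch_crawler.py` above and the Docker `urls` array), the call is `arun_many`. A minimal sketch, with the assumptions stated: it assumes `arun_many` accepts a URL list plus a shared config and returns one `CrawlResult` per URL; verify against your installed version.

```python
# batch_sdk.py - a sketch of batch crawling via the SDK; arun_many behavior as assumed above
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

URLS = [
    "https://example.com",
    "https://example.org",
]

async def main():
    browser = BrowserConfig(headless=True, user_agent_mode="random")
    run_cfg = CrawlerRunConfig(wait_until="networkidle")
    async with AsyncWebCrawler(config=browser) as crawler:
        results = await crawler.arun_many(urls=URLS, config=run_cfg)
        for result in results:
            print(result.url, result.success)

asyncio.run(main())
```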
### Basic Usage ```python import asyncio from crawl4ai import AsyncWebCrawler async def main(): async with AsyncWebCrawler() as crawler: result = await crawler.arun(url="https://example.com") print(result.markdown) asyncio.run(main()) ``` ### With Configuration ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode async def main(): browser = BrowserConfig(headless=True, user_agent_mode="random") run_cfg = CrawlerRunConfig( cache_mode=CacheMode.BYPASS, wait_until="networkidle", word_count_threshold=100, ) async with AsyncWebCrawler(config=browser) as crawler: result = await crawler.arun(url="https://example.com", config=run_cfg) print(result.success) print(result.metadata.get("title")) asyncio.run(main()) ``` ### Recommended Default Pattern ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig async def main(): browser = BrowserConfig(headless=True, user_agent_mode="random") run_cfg = CrawlerRunConfig(wait_until="networkidle") async with AsyncWebCrawler(config=browser) as crawler: result = await crawler.arun( url="https://www.reddit.com/search/?q=gold%20price%20drop&sort=relevance", config=run_cfg, ) print(result.success) print(result.metadata.get("title")) print(result.markdown[:1000]) asyncio.run(main()) ``` Use this pattern first, then add extraction, proxies, hooks, or deep crawling as needed. Run SDK scripts with the pipx interpreter when `python3` cannot import the package: ```bash ~/.local/pipx/venvs/crawl4ai/bin/python your_script.py ``` ### CrawlResult Properties | Property | Description | |----------|-------------| | `url` | Crawled URL | | `markdown` | Clean markdown content | | `html` | Raw HTML | | `success` | Boolean success status | | `links` | Internal/external links dict | | `metadata` | Page metadata | | `extracted_content` | Structured data (if extraction enabled) | --- ## Deep Crawling For detailed deep crawling documentation, see [`references/deep-crawling.md`](references/deep-crawling.md). ### Strategies | Strategy | Description | Best For | |----------|-------------|----------| | `BFSDeepCrawlStrategy` | Breadth-first (level by level) | Comprehensive coverage | | `DFSDeepCrawlStrategy` | Depth-first (follows one path) | Documentation sites | | `BestFirstCrawlingStrategy` | Priority-based with scorers | Focused relevance | See [`references/deep-crawling.md`](references/deep-crawling.md) for streaming, filtering, crash recovery, and prefetch mode. --- ## Extraction Strategies For detailed extraction docs, see [`references/extraction.md`](references/extraction.md). | Strategy | Use Case | |----------|----------| | **CSS-based** | Fast, no AI needed - define JSON schema | | **LLM-based** | Complex pages, AI extracts structured data | ```python # CSS extraction example schema = {"name": "Products", "baseSelector": ".product", "fields": [{"name": "title", "selector": "h3", "type": "text"}]} config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema)) ``` --- ## Advanced Features For complete advanced features (proxy, session, hooks, anti-bot), see [`references/config.md`](references/config.md). 
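The CSS extraction snippet in the previous section is a fragment; a self-contained sketch looks like the following. Assumptions: `JsonCssExtractionStrategy` is importable from the top-level package (otherwise import it from `crawl4ai.extraction_strategy`), and `extracted_content` comes back as a JSON string, as the CrawlResult table suggests. The URL and selectors are placeholders.

```python
# css_extraction.py - a sketch; URL and selectors are placeholders
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonCssExtractionStrategy

schema = {
    "name": "Products",
    "baseSelector": ".product",  # repeated container element
    "fields": [
        {"name": "title", "selector": "h3", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
    ],
}

async def main():
    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://shop.example.com", config=config)
        if result.success and result.extracted_content:
            items = json.loads(result.extracted_content)  # assumed to be a JSON string
            print(json.dumps(items, indent=2))

asyncio.run(main())
```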
### Proxy Configuration ```python from crawl4ai.async_configs import ProxyConfig config = CrawlerRunConfig( proxy_config=[ ProxyConfig.DIRECT, ProxyConfig(server="http://proxy:8080", username="user", password="pass"), ], max_retries=2 ) ``` ### Session Management ```python browser_config = BrowserConfig( user_data_dir="/path/to/profile", use_persistent_context=True ) ``` ### Page Interaction (JavaScript) ```python config = CrawlerRunConfig( js_code=["(async () => { await document.querySelector('button').click(); })()"], wait_until="networkidle" ) ``` ### Hooks ```python async def on_page_created(page, context, **kwargs): await context.route("**/*.png", lambda r: r.abort()) return page config = CrawlerRunConfig(hooks={"on_page_context_created": on_page_created}) ``` ### Anti-Bot (v0.8.5+) ```python config = CrawlerRunConfig( proxy_config=[ProxyConfig.DIRECT, ProxyConfig(server="http://residential-proxy:8080")], max_retries=2, flatten_shadow_dom=True ) ``` --- ## Docker API ```python import requests # Single URL crawl resp = requests.post("http://localhost:11235/crawl", json={ "urls": ["https://example.com"], "priority": 10 }) result = resp.json() # Batch multiple URLs (recommended for parallel crawling) resp = requests.post("http://localhost:11235/crawl", json={ "urls": [ "https://www.reddit.com/r/politics/hot/", "https://www.reddit.com/r/politics/new/", "https://www.reddit.com/r/politics/top/?t=week" ], "options": { "output_format": "markdown", "user_agent_mode": "random" } }) results = resp.json()["results"] # Array of crawl results ``` ## CLI vs Python SDK vs Docker | Aspect | CLI | Python SDK | Docker API | |--------|-----|------------|------------| | **Best for** | Quick terminal use | Programmable automation | Shared service/API | | **Speed** | Fastest for one-offs | Fast | Slightly slower (HTTP overhead) | | **Control** | Lowest | Highest | Medium | | **Batch URLs** | Limited | Full control in code | Yes (JSON array) | | **Web UI** | No | No | Dashboard + Playground | | **REST API** | No | No | Yes | | **Deep crawl** | Simple flags | Full strategy objects | Request payload | | **Anti-bot tuning** | Basic | Better | Better | | **Operational overhead** | Lowest | Low | Highest | ### When to Use Which | Use Case | Choose | |----------|--------| | Quick one-off scrape | CLI | | Scripted automation | Python SDK | | Build custom extraction pipeline | Python SDK | | Repeatable agent workflow | Python SDK | | API for other services | Docker API | | Dashboard / playground / remote service | Docker API | | Batch parallel crawl from another app | Docker API | | Fast terminal debugging | CLI | --- ## Common Use Cases | Use Case | Approach | |----------|----------| | **RAG Pipeline** | `crwl -o markdown-fit` | | **News Aggregation** | Batch crawl multiple sources | | **Documentation** | Deep crawl with BFS | | **Site Mapping** | Prefetch mode | | **Price Monitoring** | Scheduled crawl + extraction | | **Lead Generation** | LLM extraction for contact info | --- ## Best Practices 1. **Start simple** - test single page before deep crawl 2. **Use filters** - target specific content 3. **Limit scope** - use `max_pages` to control costs 4. **Handle blocks** - add proxies for protected sites 5. 
**Monitor** - use the dashboard for production runs

---

## Troubleshooting

| Issue | Solution |
|-------|----------|
| `python3` cannot import `crawl4ai` | If installed via `pipx`, use `~/.local/pipx/venvs/crawl4ai/bin/python` or install into a project venv |
| No output | Check that the browser can reach the page, try `CacheMode.BYPASS` |
| Anti-bot blocks (Reddit, etc.) | Use `user_agent_mode=random` (CLI: `-c "user_agent_mode=random"`, Docker: `"user_agent_mode": "random"`) |
| Need a default interface | Python SDK for automation, CLI for quick manual runs, Docker API for service-style usage |
| Reddit shows a "prove you're human" page | Add `user_agent_mode=random` - worked reliably in session |
| Slow performance | Use prefetch mode, enable caching |
| JS not rendering | Increase the timeout, check `wait_until` |
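For the "No output" and "JS not rendering" rows, the knobs usually get combined in a single run config. A sketch under these assumptions: `page_timeout` (milliseconds) is the current name of the timeout parameter; the other fields are the ones used earlier in this document.

```python
# stubborn_page.py - a sketch combining the fixes from the troubleshooting table
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser = BrowserConfig(headless=True, user_agent_mode="random")  # anti-bot row
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,  # "No output": skip any stale cached result
        wait_until="networkidle",     # "JS not rendering": wait for the page to settle
        page_timeout=90_000,          # assumed parameter name; longer timeout in ms
    )
    async with AsyncWebCrawler(config=browser) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_cfg)
        print(result.success)
        print(result.markdown[:300])

asyncio.run(main())
```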