# Crawl-MCP: Unofficial MCP Server for crawl4ai

> **⚠️ Important**: This is an **unofficial** MCP server implementation for the excellent [crawl4ai](https://github.com/unclecode/crawl4ai) library.
> **Not affiliated** with the original crawl4ai project.

A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library with advanced AI capabilities. Extract and analyze content from **any source**: web pages, PDFs, Office documents, YouTube videos, and more. Features intelligent summarization to dramatically reduce token usage while preserving key information.

## 🌟 Key Features

- **πŸ” Google Search Integration**: 7 optimized search genres using Google's official search operators
- **πŸ” Advanced Web Crawling**: JavaScript support, deep site mapping, entity extraction
- **🌐 Universal Content Extraction**: Web pages, PDFs, Word docs, Excel, PowerPoint, ZIP archives
- **πŸ€– AI-Powered Summarization**: Smart token reduction (up to 88.5%) while preserving essential information
- **🎬 YouTube Integration**: Extract video transcripts and summaries without API keys
- **⚑ Production Ready**: 19 specialized tools with comprehensive error handling

## πŸš€ Quick Start

### Prerequisites (Required First)

- Python 3.11 or later (FastMCP requires Python 3.11+)

**Install system dependencies for Playwright:**

**Ubuntu 24.04 LTS (manual setup required):**

```bash
# Manual setup required due to the t64 library transition
sudo apt update && sudo apt install -y \
  libnss3 libatk-bridge2.0-0 libxss1 libasound2t64 \
  libgbm1 libgtk-3-0t64 libxshmfence-dev libxrandr2 \
  libxcomposite1 libxcursor1 libxdamage1 libxi6 \
  fonts-noto-color-emoji fonts-unifont python3-venv python3-pip

python3 -m venv venv && source venv/bin/activate
pip install playwright==1.55.0 && playwright install chromium
sudo playwright install-deps
```

**Other Linux/macOS:**

```bash
sudo bash scripts/prepare_for_uvx_playwright.sh
```

**Windows (as Administrator):**

```powershell
scripts/prepare_for_uvx_playwright.ps1
```
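After the platform setup above, you can optionally verify that the headless browser launches before installing the server. A minimal smoke-test sketch using the standard Playwright Python API, assuming the Playwright package is available in the current environment (the Ubuntu steps above install it into the venv):

```python
from playwright.sync_api import sync_playwright

# Launch headless Chromium and fetch a page title as a smoke test.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())  # "Example Domain" if everything is wired up
    browser.close()
```

If this fails with missing-library errors, re-run the dependency installation for your platform before continuing.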
### Installation

**UVX (recommended, easiest):**

```bash
# After the system preparation above, that's it!
uvx --from git+https://github.com/walksoda/crawl-mcp crawl-mcp
```

**Docker (production-ready):**

```bash
# Clone the repository
git clone https://github.com/walksoda/crawl-mcp
cd crawl-mcp

# Build and run with Docker Compose (STDIO mode)
docker-compose up --build

# Or build and run HTTP mode on port 8000
docker-compose --profile http up --build crawl4ai-mcp-http

# Or build manually
docker build -t crawl4ai-mcp .
docker run -it crawl4ai-mcp
```

**Docker Features:**

- πŸ”§ **Multi-Browser Support**: Chromium, Firefox, and WebKit headless browsers
- 🐧 **Google Chrome**: Additional Chrome Stable for compatibility
- ⚑ **Optimized Performance**: Pre-configured browser flags for Docker
- πŸ”’ **Security**: Non-root user execution
- πŸ“¦ **Complete Dependencies**: All required libraries included

### Claude Desktop Setup

**UVX Installation:**

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "stdio",
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/walksoda/crawl-mcp",
        "crawl-mcp"
      ],
      "env": {
        "CRAWL4AI_LANG": "en"
      }
    }
  }
}
```

**Docker HTTP Mode:**

```json
{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "http",
      "baseUrl": "http://localhost:8000"
    }
  }
}
```

**For a Japanese interface:**

```json
"env": {
  "CRAWL4AI_LANG": "ja"
}
```

## πŸ“– Documentation

| Topic | Description |
|-------|-------------|
| **[Installation Guide](docs/INSTALLATION.md)** | Complete installation instructions for all platforms |
| **[API Reference](docs/API_REFERENCE.md)** | Full tool documentation and usage examples |
| **[Configuration Examples](docs/CONFIGURATION_EXAMPLES.md)** | Platform-specific setup configurations |
| **[HTTP Integration](docs/HTTP_INTEGRATION.md)** | HTTP API access and integration methods |
| **[Advanced Usage](docs/ADVANCED_USAGE.md)** | Power-user techniques and workflows |
| **[Development Guide](docs/DEVELOPMENT.md)** | Contributing and development setup |

### Language-Specific Documentation

- **English**: [docs/](docs/) directory
- **ζ—₯本θͺž**: [docs/ja/](docs/ja/) directory

## πŸ› οΈ Tool Overview

### Web Crawling (3)

- `crawl_url` - Extract web page content with JavaScript support
- `deep_crawl_site` - Crawl multiple pages from a site with configurable depth
- `crawl_url_with_fallback` - Crawl with fallback strategies for anti-bot sites

### Data Extraction (3)

- `intelligent_extract` - Extract specific data from web pages using an LLM
- `extract_entities` - Extract entities (emails, phone numbers, etc.) from web pages
- `extract_structured_data` - Extract structured data using CSS selectors or an LLM

### YouTube (4)

- `extract_youtube_transcript` - Extract YouTube transcripts with timestamps
- `batch_extract_youtube_transcripts` - Extract transcripts from multiple YouTube videos (max 3)
- `get_youtube_video_info` - Get YouTube video metadata and transcript availability
- `extract_youtube_comments` - Extract YouTube video comments with pagination

### Search (4)

- `search_google` - Search Google with genre filtering
- `batch_search_google` - Perform multiple Google searches (max 3)
- `search_and_crawl` - Search Google and crawl the top results
- `get_search_genres` - Get available search genres

### File Processing (3)

- `process_file` - Convert PDF, Word, Excel, PowerPoint, and ZIP files to markdown
- `get_supported_file_formats` - Get supported file formats and capabilities
- `enhanced_process_large_content` - Process large content with chunking and BM25 filtering

### Batch Operations (2)

- `batch_crawl` - Crawl multiple URLs with fallback (max 3 URLs)
- `multi_url_crawl` - Multi-URL crawl with pattern-based config (max 5 URL patterns)
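Beyond Claude Desktop, the tools above can be exercised programmatically. A minimal sketch using the official MCP Python SDK (`pip install mcp`); the uvx launch arguments mirror the stdio configuration shown earlier, and the exact result payload depends on the tool being called:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the server over stdio exactly as Claude Desktop would.
server = StdioServerParameters(
    command="uvx",
    args=["--from", "git+https://github.com/walksoda/crawl-mcp", "crawl-mcp"],
    env={"CRAWL4AI_LANG": "en"},
)

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print(sorted(t.name for t in tools.tools))  # the 19 tools above
            result = await session.call_tool(
                "crawl_url", {"url": "https://example.com"}
            )
            # Tool results arrive as MCP content items; print any text parts.
            for item in result.content:
                if getattr(item, "text", None):
                    print(item.text[:200])

asyncio.run(main())
```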
## πŸ’Ύ Persist Large Results to Disk (token-saver)

All information-gathering tools accept an optional `output_path` parameter that writes the full fetched content straight to disk and returns a slim, metadata-only response. This lets an LLM fetch huge pages, long YouTube transcripts, or whole batches without blowing its context budget, reading from the saved file only when needed.

**How it works:**

- Single-file tools (e.g. `crawl_url`, `extract_youtube_transcript`) write one `.md` file (or `.json` for JSON-producing tools). Pass an absolute file path; the extension is auto-added if omitted. An existing regular file at that path is rejected unless `overwrite=true`.
- Batch tools (`batch_crawl`, `multi_url_crawl`, `deep_crawl_site`, `search_and_crawl`, `batch_extract_youtube_transcripts`) expect an absolute **directory** path and write one `.md` per URL plus an `index.json`. Any non-existent path is treated as a directory and created, including names containing dots such as `/tmp/run.v1`. If the path already exists as a regular file, the call is rejected. `batch_crawl` and `multi_url_crawl` keep their `list` return shape and embed an `output_file` key on each success item.
- Request-dict tools (`search_google`, `batch_search_google`, `search_and_crawl`, `batch_extract_youtube_transcripts`) read the persistence keys directly from their request dict.
- Common parameters: `output_path` (absolute; `None` or `""` skips persistence), `include_content_in_response` (default `false`; when `true`, content is included in the response too, **still subject to any `content_limit`/`content_offset`/`max_content_per_page` slicing**), and `overwrite` (default `false`).
- Writes are atomic per file (temp file + `os.replace`), and parent directories are auto-created. The full unsliced payload is persisted **before** any slicing or tool-internal truncation, so the on-disk copy is always complete even when the response is sliced.
- Batch dict tools (`deep_crawl_site`, `search_and_crawl`, `batch_extract_youtube_transcripts`) skip per-item persistence for items that report `success=false`; these still appear in `index.json` with `file: null` so callers can reason about the attempt list.

**Markdown single-file example:**

```json
{
  "tool": "crawl_url",
  "arguments": {
    "url": "https://example.com/long-article",
    "output_path": "/tmp/crawl_out/article.md"
  }
}
```

**JSON structured extraction (extension auto-added):**

```json
{
  "tool": "extract_structured_data",
  "arguments": {
    "url": "https://example.com/products",
    "extraction_type": "css",
    "css_selectors": {"price": ".price", "name": "h1"},
    "output_path": "/tmp/crawl_out/products"
  }
}
```

**Batch directory mode:**

```json
{
  "tool": "batch_crawl",
  "arguments": {
    "urls": ["https://a.example", "https://b.example"],
    "output_path": "/tmp/crawl_out/batch_run1"
  }
}
```

Each persisted markdown file begins with a YAML frontmatter block containing `url`, `title`, `fetched_at`, and `source_tool`, so the artifact is self-describing.
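A consumer can later read those artifacts back without re-fetching anything. A hypothetical sketch (standard library only) that walks a batch run directory; the per-entry key names in `index.json` (`url`, `file`) are assumptions inferred from the description above, not a documented schema:

```python
import json
from pathlib import Path

# Directory that was passed as output_path to batch_crawl.
run_dir = Path("/tmp/crawl_out/batch_run1")

# Hypothetical: treat index.json as a list of per-URL entries with
# "url" and "file" keys ("file" is null for failed items, per the docs).
entries = json.loads((run_dir / "index.json").read_text(encoding="utf-8"))

for entry in entries:
    if entry.get("file") is None:
        print("failed, nothing persisted:", entry.get("url"))
        continue
    text = (run_dir / entry["file"]).read_text(encoding="utf-8")
    # Each file starts with YAML frontmatter delimited by "---" lines,
    # carrying url, title, fetched_at, and source_tool.
    _, frontmatter, body = text.split("---", 2)
    print(frontmatter.strip())
    print("content chars:", len(body.strip()))
```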
## 🎯 Common Use Cases

**Content Research:**

```bash
search_and_crawl β†’ extract_structured_data β†’ analysis
```

**Documentation Mining:**

```bash
deep_crawl_site β†’ batch processing β†’ extraction
```

**Media Analysis:**

```bash
extract_youtube_transcript β†’ summarization workflow
```

**Site Mapping:**

```bash
batch_crawl β†’ multi_url_crawl β†’ comprehensive data
```

## 🚨 Quick Troubleshooting

**Installation Issues:**

1. Re-run the setup scripts with proper privileges
2. Try the development installation method
3. Check that the browser dependencies are installed

**Performance Issues:**

- Use `wait_for_js: true` for JavaScript-heavy sites
- Increase the timeout for slow-loading pages
- Use `extract_structured_data` for targeted extraction

**Configuration Issues:**

- Check the JSON syntax in `claude_desktop_config.json`
- Verify that file paths are absolute
- Restart Claude Desktop after configuration changes

## πŸ—οΈ Project Structure

- **Original Library**: [crawl4ai](https://github.com/unclecode/crawl4ai) by unclecode
- **MCP Wrapper**: This repository (walksoda)
- **Implementation**: Unofficial third-party integration

## πŸ“„ License

This project is an unofficial wrapper around the crawl4ai library. Please refer to the original [crawl4ai license](https://github.com/unclecode/crawl4ai) for the underlying functionality.

## 🀝 Contributing

See our [Development Guide](docs/DEVELOPMENT.md) for contribution guidelines and development setup instructions.

## πŸ”— Related Projects

- [crawl4ai](https://github.com/unclecode/crawl4ai) - The underlying web crawling library
- [Model Context Protocol](https://modelcontextprotocol.io/) - The standard this server implements
- [Claude Desktop](https://docs.anthropic.com/claude/docs/claude-desktop) - Primary client for MCP servers