# Crawl-MCP: Unofficial MCP Server for crawl4ai

> **⚠️ Important**: This is an **unofficial** MCP server implementation for the excellent [crawl4ai](https://github.com/unclecode/crawl4ai) library.
> **Not affiliated** with the original crawl4ai project.

A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library with advanced AI capabilities. Extract and analyze content from **any source**: web pages, PDFs, Office documents, YouTube videos, and more. Features intelligent summarization to dramatically reduce token usage while preserving key information.

## 🌟 Key Features

- **πŸ” Google Search Integration**: 7 optimized search genres using Google's official search operators
- **πŸ” Advanced Web Crawling**: JavaScript support, deep site mapping, entity extraction
- **🌐 Universal Content Extraction**: Web pages, PDFs, Word docs, Excel, PowerPoint, ZIP archives
- **πŸ€– AI-Powered Summarization**: Smart token reduction (up to 88.5%) while preserving essential information
- **🎬 YouTube Integration**: Extract video transcripts and summaries without API keys
- **⚑ Production Ready**: 19 specialized tools with comprehensive error handling

## πŸš€ Quick Start

### Prerequisites (Required First)

- Python 3.11 or later (FastMCP requires Python 3.11+)

**Install system dependencies for Playwright:**

**Ubuntu 24.04 LTS (manual setup required):**

```bash
# Manual setup required due to the t64 library transition
sudo apt update && sudo apt install -y \
  libnss3 libatk-bridge2.0-0 libxss1 libasound2t64 \
  libgbm1 libgtk-3-0t64 libxshmfence-dev libxrandr2 \
  libxcomposite1 libxcursor1 libxdamage1 libxi6 \
  fonts-noto-color-emoji fonts-unifont python3-venv python3-pip

python3 -m venv venv && source venv/bin/activate
pip install playwright==1.55.0 && playwright install chromium
sudo playwright install-deps
```

**Other Linux/macOS:**

```bash
sudo bash scripts/prepare_for_uvx_playwright.sh
```

**Windows (as Administrator):**

```powershell
scripts/prepare_for_uvx_playwright.ps1
```
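After the platform setup above, you can optionally verify that the headless browser launches before installing the server. A minimal smoke-test sketch using the standard Playwright Python API, assuming the Playwright package is available in the current environment (the Ubuntu steps above install it into the venv):

```python
from playwright.sync_api import sync_playwright

# Launch headless Chromium and fetch a page title as a smoke test.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())  # "Example Domain" if everything is wired up
    browser.close()
```

If this fails with missing-library errors, re-run the dependency installation for your platform before continuing.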
### Installation

**UVX (recommended, easiest):**

```bash
# After the system preparation above, that's it!
uvx --from git+https://github.com/walksoda/crawl-mcp crawl-mcp
```

**Docker (production-ready):**

```bash
# Clone the repository
git clone https://github.com/walksoda/crawl-mcp
cd crawl-mcp

# Build and run with Docker Compose (STDIO mode)
docker-compose up --build

# Or build and run HTTP mode on port 8000
docker-compose --profile http up --build crawl4ai-mcp-http

# Or build manually
docker build -t crawl4ai-mcp .
docker run -it crawl4ai-mcp
```

**Docker Features:**

- πŸ”§ **Multi-Browser Support**: Chromium, Firefox, and WebKit headless browsers
- 🐧 **Google Chrome**: Additional Chrome Stable for compatibility
- ⚑ **Optimized Performance**: Pre-configured browser flags for Docker
- πŸ”’ **Security**: Non-root user execution
- πŸ“¦ **Complete Dependencies**: All required libraries included

### Claude Desktop Setup

**UVX Installation:**

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "stdio",
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/walksoda/crawl-mcp",
        "crawl-mcp"
      ],
      "env": {
        "CRAWL4AI_LANG": "en"
      }
    }
  }
}
```

**Docker HTTP Mode:**

```json
{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "http",
      "baseUrl": "http://localhost:8000"
    }
  }
}
```

**For a Japanese interface:**

```json
"env": {
  "CRAWL4AI_LANG": "ja"
}
```

## πŸ“– Documentation

| Topic | Description |
|-------|-------------|
| **[Installation Guide](docs/INSTALLATION.md)** | Complete installation instructions for all platforms |
| **[API Reference](docs/API_REFERENCE.md)** | Full tool documentation and usage examples |
| **[Configuration Examples](docs/CONFIGURATION_EXAMPLES.md)** | Platform-specific setup configurations |
| **[HTTP Integration](docs/HTTP_INTEGRATION.md)** | HTTP API access and integration methods |
| **[Advanced Usage](docs/ADVANCED_USAGE.md)** | Power-user techniques and workflows |
| **[Development Guide](docs/DEVELOPMENT.md)** | Contributing and development setup |

### Language-Specific Documentation

- **English**: [docs/](docs/) directory
- **ζ—₯本θͺž**: [docs/ja/](docs/ja/) directory

## πŸ› οΈ Tool Overview

### Web Crawling (3)

- `crawl_url` - Extract web page content with JavaScript support
- `deep_crawl_site` - Crawl multiple pages from a site with configurable depth
- `crawl_url_with_fallback` - Crawl with fallback strategies for anti-bot sites

### Data Extraction (3)

- `intelligent_extract` - Extract specific data from web pages using an LLM
- `extract_entities` - Extract entities (emails, phone numbers, etc.) from web pages
- `extract_structured_data` - Extract structured data using CSS selectors or an LLM

### YouTube (4)

- `extract_youtube_transcript` - Extract YouTube transcripts with timestamps
- `batch_extract_youtube_transcripts` - Extract transcripts from multiple YouTube videos (max 3)
- `get_youtube_video_info` - Get YouTube video metadata and transcript availability
- `extract_youtube_comments` - Extract YouTube video comments with pagination

### Search (4)

- `search_google` - Search Google with genre filtering
- `batch_search_google` - Perform multiple Google searches (max 3)
- `search_and_crawl` - Search Google and crawl the top results
- `get_search_genres` - Get available search genres

### File Processing (3)

- `process_file` - Convert PDF, Word, Excel, PowerPoint, and ZIP files to markdown
- `get_supported_file_formats` - Get supported file formats and capabilities
- `enhanced_process_large_content` - Process large content with chunking and BM25 filtering

### Batch Operations (2)

- `batch_crawl` - Crawl multiple URLs with fallback (max 3 URLs)
- `multi_url_crawl` - Multi-URL crawl with pattern-based config (max 5 URL patterns)
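Beyond Claude Desktop, the tools above can be exercised programmatically. A minimal sketch using the official MCP Python SDK (`pip install mcp`); the uvx launch arguments mirror the stdio configuration shown earlier, and the exact result payload depends on the tool being called:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the server over stdio exactly as Claude Desktop would.
server = StdioServerParameters(
    command="uvx",
    args=["--from", "git+https://github.com/walksoda/crawl-mcp", "crawl-mcp"],
    env={"CRAWL4AI_LANG": "en"},
)

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print(sorted(t.name for t in tools.tools))  # the 19 tools above
            result = await session.call_tool(
                "crawl_url", {"url": "https://example.com"}
            )
            # Tool results arrive as MCP content items; print any text parts.
            for item in result.content:
                if getattr(item, "text", None):
                    print(item.text[:200])

asyncio.run(main())
```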
## πŸ’Ύ Persist Large Results to Disk (token-saver)

All information-gathering tools accept an optional `output_path` parameter that writes the full fetched content straight to disk and returns a slim, metadata-only response. This lets an LLM fetch huge pages, long YouTube transcripts, or whole batches without blowing its context budget, reading from the saved file only when needed.

**How it works:**

- Single-file tools (e.g. `crawl_url`, `extract_youtube_transcript`) write one `.md` file (or `.json` for JSON-producing tools). Pass an absolute file path; the extension is auto-added if omitted. An existing regular file at that path is rejected unless `overwrite=true`.
- Batch tools (`batch_crawl`, `multi_url_crawl`, `deep_crawl_site`, `search_and_crawl`, `batch_extract_youtube_transcripts`) expect an absolute **directory** path and write one `.md` per URL plus an `index.json`. Any non-existent path is treated as a directory and created, including names containing dots such as `/tmp/run.v1`. If the path already exists as a regular file, the call is rejected. `batch_crawl` and `multi_url_crawl` keep their `list` return shape and embed an `output_file` key on each success item.
- Request-dict tools (`search_google`, `batch_search_google`, `search_and_crawl`, `batch_extract_youtube_transcripts`) read the persistence keys directly from their request dict.
- Common parameters: `output_path` (absolute; `None` or `""` skips persistence), `include_content_in_response` (default `false`; when `true`, content is included in the response too, **still subject to any `content_limit`/`content_offset`/`max_content_per_page` slicing**), and `overwrite` (default `false`).
- Writes are atomic per file (temp file + `os.replace`), and parent directories are auto-created. The full unsliced payload is persisted **before** any slicing or tool-internal truncation, so the on-disk copy is always complete even when the response is sliced.
- Batch dict tools (`deep_crawl_site`, `search_and_crawl`, `batch_extract_youtube_transcripts`) skip per-item persistence for items that report `success=false`; these still appear in `index.json` with `file: null` so callers can reason about the attempt list.

**Markdown single-file example:**

```json
{
  "tool": "crawl_url",
  "arguments": {
    "url": "https://example.com/long-article",
    "output_path": "/tmp/crawl_out/article.md"
  }
}
```

**JSON structured extraction (extension auto-added):**

```json
{
  "tool": "extract_structured_data",
  "arguments": {
    "url": "https://example.com/products",
    "extraction_type": "css",
    "css_selectors": {"price": ".price", "name": "h1"},
    "output_path": "/tmp/crawl_out/products"
  }
}
```

**Batch directory mode:**

```json
{
  "tool": "batch_crawl",
  "arguments": {
    "urls": ["https://a.example", "https://b.example"],
    "output_path": "/tmp/crawl_out/batch_run1"
  }
}
```

Each persisted markdown file begins with a YAML frontmatter block containing `url`, `title`, `fetched_at`, and `source_tool`, so the artifact is self-describing.
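A consumer can later read those artifacts back without re-fetching anything. A hypothetical sketch (standard library only) that walks a batch run directory; the per-entry key names in `index.json` (`url`, `file`) are assumptions inferred from the description above, not a documented schema:

```python
import json
from pathlib import Path

# Directory that was passed as output_path to batch_crawl.
run_dir = Path("/tmp/crawl_out/batch_run1")

# Hypothetical: treat index.json as a list of per-URL entries with
# "url" and "file" keys ("file" is null for failed items, per the docs).
entries = json.loads((run_dir / "index.json").read_text(encoding="utf-8"))

for entry in entries:
    if entry.get("file") is None:
        print("failed, nothing persisted:", entry.get("url"))
        continue
    text = (run_dir / entry["file"]).read_text(encoding="utf-8")
    # Each file starts with YAML frontmatter delimited by "---" lines,
    # carrying url, title, fetched_at, and source_tool.
    _, frontmatter, body = text.split("---", 2)
    print(frontmatter.strip())
    print("content chars:", len(body.strip()))
```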
## 🎯 Common Use Cases

**Content Research:**

```bash
search_and_crawl β†’ extract_structured_data β†’ analysis
```

**Documentation Mining:**

```bash
deep_crawl_site β†’ batch processing β†’ extraction
```

**Media Analysis:**

```bash
extract_youtube_transcript β†’ summarization workflow
```

**Site Mapping:**

```bash
batch_crawl β†’ multi_url_crawl β†’ comprehensive data
```

## 🚨 Quick Troubleshooting

**Installation Issues:**

1. Re-run the setup scripts with proper privileges
2. Try the development installation method
3. Check that the browser dependencies are installed

**Performance Issues:**

- Use `wait_for_js: true` for JavaScript-heavy sites
- Increase the timeout for slow-loading pages
- Use `extract_structured_data` for targeted extraction

**Configuration Issues:**

- Check the JSON syntax in `claude_desktop_config.json`
- Verify that file paths are absolute
- Restart Claude Desktop after configuration changes

## πŸ—οΈ Project Structure

- **Original Library**: [crawl4ai](https://github.com/unclecode/crawl4ai) by unclecode
- **MCP Wrapper**: This repository (walksoda)
- **Implementation**: Unofficial third-party integration

## πŸ“„ License

This project is an unofficial wrapper around the crawl4ai library. Please refer to the original [crawl4ai license](https://github.com/unclecode/crawl4ai) for the underlying functionality.

## 🀝 Contributing

See our [Development Guide](docs/DEVELOPMENT.md) for contribution guidelines and development setup instructions.

## πŸ”— Related Projects

- [crawl4ai](https://github.com/unclecode/crawl4ai) - The underlying web crawling library
- [Model Context Protocol](https://modelcontextprotocol.io/) - The standard this server implements
- [Claude Desktop](https://docs.anthropic.com/claude/docs/claude-desktop) - Primary client for MCP servers