# Crawl4AI MCP Server πŸ•·οΈ **A lightweight Model Context Protocol (MCP) server that exposes [Crawl4AI](https://docs.crawl4ai.com/) web scraping and crawling capabilities as tools for AI agents.** Similar to Firecrawl's API but self-hosted and free. Perfect for integrating web scraping into your AI workflows with OpenAI Agents SDK, Cursor, Claude Code, and other MCP-compatible tools. ## Features - **πŸ”§ MCP Tools**: Exposes 4 powerful tools: `scrape`, `crawl`, `crawl_site`, `crawl_sitemap` via stdio MCP server - **🌐 Web Scraping**: Single-page scraping with markdown extraction - **πŸ•·οΈ Web Crawling**: Multi-page breadth-first crawling with depth control - **🧠 Adaptive Crawling**: Smart crawling that stops when sufficient content is gathered - **πŸ›‘οΈ Safety**: Blocks internal networks, localhost, and private IPs - **πŸ“± Agent Ready**: Works with OpenAI Agents SDK, Cursor, and Claude Code - **⚑ Fast**: Powered by Playwright and Crawl4AI's optimized extraction ## πŸš€ Quick Start Choose between **Docker** (recommended) or **manual installation**: ### Option A: Docker (Recommended) 🐳 Docker eliminates all setup complexity and provides a consistent environment: #### Option A1: Use Pre-built Image (Fastest) ⚑ ```bash # No setup required! Just pull and run the published image docker pull uysalsadi/crawl4ai-mcp-server:latest # Test it works python test-config.py # Use directly in MCP configurations (see examples below) ``` #### Option A2: Build Yourself ```bash # Clone the repository git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git cd crawl4ai-mcp-server # Quick build and test (simplified) docker build -f Dockerfile.simple -t crawl4ai-mcp . echo '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test", "version": "1.0"}}}' | docker run --rm -i crawl4ai-mcp # Or use helper script (full build with Playwright) ./docker-run.sh build ./docker-run.sh test ./docker-run.sh run ``` **Docker Quick Commands:** - `./docker-run.sh build` - Build the image - `./docker-run.sh run` - Run MCP server (stdio mode) - `./docker-run.sh test` - Run smoke tests - `./docker-run.sh dev` - Development mode with shell access ### Option B: Manual Installation ```bash # Clone and setup git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git cd crawl4ai-mcp-server python -m venv .venv source .venv/bin/activate # On Windows: .venv\Scripts\activate pip install -r requirements.txt # Install Playwright browsers python -m playwright install chromium # Test basic functionality python -m crawler_agent.smoke_client # Test adaptive crawling python test_adaptive.py ``` ### Use with OpenAI Agents SDK ```bash # Set your OpenAI API key export OPENAI_API_KEY="your-key-here" # Docker: Run the example agent docker-compose run --rm -e OPENAI_API_KEY crawl4ai-mcp python -m crawler_agent.agents_example # Manual: Run the example agent python -m crawler_agent.agents_example ``` ## πŸ› οΈ Tools Reference The MCP server exposes 4 production-ready tools with content hiding features: ### `scrape` Fetch a single URL and return markdown content. **Arguments:** - `url` (required): The URL to scrape - `output_dir` (optional): If provided, persists content to disk and returns metadata only - `crawler`: Optional Crawl4AI crawler config overrides - `browser`: Optional Crawl4AI browser config overrides - `script`: Optional C4A-Script for page interaction - `timeout_sec`: Request timeout (default: 45s, max: 600s) **Returns (without output_dir):** ```json { "url": "https://example.com", "markdown": "# Page Title\n\nContent...", "links": ["https://example.com/page1", "..."], "metadata": {} } ``` **Returns (with output_dir):** ```json { "run_id": "scrape_20250815_122811_4fb188", "file_path": "/path/to/output/pages/example.com_index.md", "manifest_path": "/path/to/output/manifest.json", "bytes_written": 230 } ``` ### `crawl` Multi-page breadth-first crawling with filtering and adaptive stopping. **Arguments:** - `seed_url` (required): Starting URL for the crawl - `max_depth`: Maximum link depth to follow (default: 1, max: 4) - `max_pages`: Maximum pages to crawl (default: 5, max: 100) - `same_domain_only`: Stay within the same domain (default: true) - `include_patterns`: Regex patterns URLs must match - `exclude_patterns`: Regex patterns to exclude URLs - `adaptive`: Enable adaptive crawling (default: false) - `output_dir` (optional): If provided, persists content to disk and returns metadata only - `crawler`, `browser`, `script`, `timeout_sec`: Same as scrape **Returns (without output_dir):** ```json { "start_url": "https://example.com", "pages": [ { "url": "https://example.com/page1", "markdown": "Content...", "links": ["..."] } ], "total_pages": 3 } ``` **Returns (with output_dir):** ```json { "run_id": "crawl_20250815_122828_464944", "pages_ok": 3, "pages_failed": 0, "manifest_path": "/path/to/output/manifest.json", "bytes_written": 690 } ``` ### `crawl_site` Comprehensive site crawling with persistence (always requires output_dir). **Arguments:** - `entry_url` (required): Starting URL for site crawl - `output_dir` (required): Directory to persist results - `max_depth`: Maximum crawl depth (default: 2, max: 6) - `max_pages`: Maximum pages to crawl (default: 200, max: 5000) - Additional config options for filtering and performance **Returns:** ```json { "run_id": "site_20250815_122851_0e2455", "output_dir": "/path/to/output", "manifest_path": "/path/to/output/manifest.json", "pages_ok": 15, "pages_failed": 2, "bytes_written": 45672 } ``` ### `crawl_sitemap` Sitemap-based crawling with persistence (always requires output_dir). **Arguments:** - `sitemap_url` (required): URL to sitemap.xml - `output_dir` (required): Directory to persist results - `max_entries`: Maximum sitemap entries to process (default: 1000) - Additional config options for filtering and performance **Returns:** ```json { "run_id": "sitemap_20250815_123006_667d71", "output_dir": "/path/to/output", "manifest_path": "/path/to/output/manifest.json", "pages_ok": 25, "pages_failed": 0, "bytes_written": 123456 } ``` ## πŸ’‘ Usage Examples ### Standalone MCP Server ```python from mcp import ClientSession, StdioServerParameters from mcp.client.stdio import stdio_client params = StdioServerParameters( command="python", args=["-m", "crawler_agent.mcp_server"] ) async with stdio_client(params) as (read, write): async with ClientSession(read, write) as session: await session.initialize() # Scrape a single page result = await session.call_tool("scrape", { "url": "https://example.com" }) # Crawl with adaptive stopping result = await session.call_tool("crawl", { "seed_url": "https://docs.example.com", "max_pages": 10, "adaptive": True }) ``` ### OpenAI Agents SDK Integration ```python from agents import Agent, Runner from agents.mcp.server import MCPServerStdio async with MCPServerStdio( params={ "command": "python", "args": ["-m", "crawler_agent.mcp_server"] }, cache_tools_list=True ) as server: agent = Agent( name="Research Assistant", instructions="Use scrape and crawl tools to research topics.", mcp_servers=[server], mcp_config={"convert_schemas_to_strict": True} ) result = await Runner.run( agent, "Research the latest AI safety papers" ) ``` ### Cursor/Claude Code Integration This project supports both **Cursor** and **Claude Code** with synchronized configuration: #### Cursor Setup - Uses `.cursorrules` for AI assistant behavior and project guidance - Automatic MCP tool detection and integration - Project-specific rules and workflow guidance #### Claude Code Setup - Uses `CLAUDE.md` for comprehensive project context and memory bank - Native MCP integration via `mcp__crawl4ai-mcp__*` tool calls - Global config: `~/.claude/claude_desktop_config.json` or project config: `.mcp.json` #### Docker Integration for Both Environments Both Cursor and Claude Code can use the Dockerized MCP server: ```json // .mcp.json (project-scoped configuration) { "mcpServers": { "crawl4ai-mcp": { "command": "docker", "args": [ "run", "--rm", "-i", "--volume", "./crawls:/app/crawls", "uysalsadi/crawl4ai-mcp-server:latest" ], "env": { "CRAWL4AI_MCP_LOG": "INFO" } } } } ``` #### Dual Environment Features - **Synchronized Documentation**: Changes to documentation are maintained across both environments - **Shared MCP Configuration**: Both use the same MCP server and tool schemas - **Docker Compatibility**: Consistent environment across both Cursor and Claude Code - **Cross-Compatible**: Projects work seamlessly whether using Cursor or Claude Code ## 🐳 Docker Usage ### Quick Start with Docker The Docker approach eliminates all manual setup and provides a consistent environment: ```bash # 1. Clone and build git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git cd crawl4ai-mcp-server ./docker-run.sh build # 2. Test the installation ./docker-run.sh test # 3. Run MCP server ./docker-run.sh run ``` ### Docker Commands Reference | Command | Description | |---------|-------------| | `./docker-run.sh build` | Build the Docker image | | `./docker-run.sh run` | Run MCP server in stdio mode | | `./docker-run.sh test` | Run smoke tests | | `./docker-run.sh dev` | Development mode with shell access | | `./docker-run.sh stop` | Stop running containers | | `./docker-run.sh clean` | Remove containers and images | | `./docker-run.sh logs` | Show container logs | ### Docker Compose Services The `docker-compose.yml` provides multiple service configurations: - **`crawl4ai-mcp`**: Production MCP server - **`crawl4ai-mcp-dev`**: Development container with shell access - **`crawl4ai-mcp-test`**: Test runner for smoke tests ### MCP Integration with Docker #### Global MCP Configuration (Claude Code) Add to `~/.claude/claude_desktop_config.json`: ```json { "mcpServers": { "crawl4ai-mcp": { "command": "docker", "args": [ "run", "--rm", "-i", "--volume", "/tmp/crawl4ai-crawls:/app/crawls", "uysalsadi/crawl4ai-mcp-server:latest" ], "env": { "CRAWL4AI_MCP_LOG": "INFO" } } } } ``` **βœ… Copy this exact config** - it uses the published Docker image! #### Global MCP Configuration (Cursor) Add to `~/.cursor/mcp.json`: ```json { "mcpServers": { "crawl4ai-mcp": { "command": "docker", "args": [ "run", "--rm", "-i", "--volume", "/tmp/crawl4ai-crawls:/app/crawls", "uysalsadi/crawl4ai-mcp-server:latest" ], "env": { "CRAWL4AI_MCP_LOG": "INFO" } } } } ``` **βœ… This configuration is tested and working!** #### Project-Scoped Configuration Add to your project's `.mcp.json`: ```json { "mcpServers": { "crawl4ai-mcp": { "command": "docker", "args": [ "run", "--rm", "-i", "--volume", "./crawls:/app/crawls", "uysalsadi/crawl4ai-mcp-server:latest" ], "env": { "CRAWL4AI_MCP_LOG": "INFO" } } } } ``` **βœ… This project already includes this configuration** - see `.mcp.json` #### Configuration Validation After setting up any MCP configuration: 1. **Test the Docker image works**: ```bash python test-config.py ``` 2. **Restart your editor** (Cursor/Claude Code) to reload MCP configuration 3. **Verify tools are available**: - Look for `crawl4ai-mcp` in the MCP tools panel - Should see 4 tools: `scrape`, `crawl`, `crawl_site`, `crawl_sitemap` If tools don't appear, check: - Docker is running and image is accessible - MCP configuration file syntax is valid JSON - Editor has been restarted after config changes ### Docker Advantages βœ… **Zero Setup**: No need for Python venv, pip, or Playwright installation βœ… **Consistent Environment**: Same behavior across all platforms βœ… **Isolated Dependencies**: No conflicts with your system Python βœ… **Easy Updates**: `docker pull` to get latest version βœ… **Portable**: Works anywhere Docker runs βœ… **Volume Persistence**: Crawl outputs saved to host filesystem ### Environment Variables Set these when running Docker containers: ```bash # Using docker-compose OPENAI_API_KEY=your-key docker-compose up crawl4ai-mcp # Using docker run directly docker run --rm -i \ -e OPENAI_API_KEY=your-key \ -e CRAWL4AI_MCP_LOG=DEBUG \ -v ./crawls:/app/crawls \ crawl4ai-mcp-server:latest ``` ## βš™οΈ Configuration ### Environment Variables - `TARGET_URL`: Default URL for smoke testing (default: https://modelcontextprotocol.io/docs) - `RESEARCH_TASK`: Custom research task for agents example - `OPENAI_API_KEY`: Required for OpenAI Agents SDK ### Safety Settings The server blocks these URL patterns by default: - `localhost`, `127.0.0.1`, `::1` - Private IP ranges (RFC 1918) - `file://` schemes - `.local`, `.internal`, `.lan` domains ## πŸš€ Advanced Features ### Adaptive Crawling When `adaptive: true` is set, the crawler uses a simple content-based stopping strategy: - Stops when total content exceeds 5,000 characters - Prevents over-crawling for information gathering tasks - More sophisticated LLM-based strategies can be added ### Custom Filtering Use regex patterns to control which URLs are crawled: ```python await session.call_tool("crawl", { "seed_url": "https://docs.example.com", "include_patterns": [r"/docs/", r"/api/"], "exclude_patterns": [r"/old/", r"\.pdf$"] }) ``` ### Browser Configuration Pass custom browser/crawler settings: ```python await session.call_tool("scrape", { "url": "https://example.com", "browser": {"headless": True, "viewport": {"width": 1280, "height": 720}}, "crawler": {"verbose": True} }) ``` ## πŸ—οΈ Architecture ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ AI Agent │───▢│ MCP Server │───▢│ Crawl4AI β”‚ β”‚ (Cursor/Agents) β”‚ β”‚ (stdio/stdio) β”‚ β”‚ (Playwright) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Safety Guards β”‚ β”‚ (URL validation) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` ## πŸ“¦ Publishing to Docker Hub For maintainers who want to publish updates to the Docker registry: ### Publishing Process ```bash # 1. Login to Docker Hub (one time setup) ./docker-push.sh login # 2. Build, push, and test everything ./docker-push.sh all # Or do steps individually: ./docker-push.sh build # Build and tag image ./docker-push.sh push # Push to Docker Hub ./docker-push.sh test # Test the published image ``` ### Docker Hub Repository The image is published at: **`uysalsadi/crawl4ai-mcp-server`** - Latest: `uysalsadi/crawl4ai-mcp-server:latest` - Versioned: `uysalsadi/crawl4ai-mcp-server:v1.0.0` ### Usage Statistics Users can pull and use the image without any local setup: ```bash docker pull uysalsadi/crawl4ai-mcp-server:latest ``` ## πŸ”§ Development ### Project Structure ``` crawler_agent/ β”œβ”€β”€ __init__.py β”œβ”€β”€ mcp_server.py # Main MCP server β”œβ”€β”€ safety.py # URL safety validation β”œβ”€β”€ adaptive_strategy.py # Adaptive crawling logic β”œβ”€β”€ smoke_client.py # Basic testing client └── agents_example.py # OpenAI Agents SDK example ``` ### Testing ```bash # Test MCP server python -m crawler_agent.smoke_client # Test adaptive crawling python test_adaptive.py # Test with agents (requires OPENAI_API_KEY) python -m crawler_agent.agents_example ``` ### Contributing 1. **Environment Setup**: Always activate `.venv` before development work 2. **Documentation**: Follow the Documentation Memory Rule - synchronize changes across CLAUDE.md, .cursorrules, and README.md 3. **Code Style**: Follow `.cursorrules` for Cursor or `CLAUDE.md` for Claude Code 4. **Testing**: Use `python -m crawler_agent.smoke_client` for validation 5. **Safety**: Ensure security guards are maintained for all new features 6. **Dual Compatibility**: Verify changes work in both Cursor and Claude Code environments ## πŸ“š References - [Crawl4AI Documentation](https://docs.crawl4ai.com/) - [Model Context Protocol](https://modelcontextprotocol.io/introduction) - [OpenAI Agents SDK](https://openai.github.io/openai-agents-python/mcp/) - [MCP Tools Reference](https://openai.github.io/openai-agents-python/tools/) ## πŸ“„ License MIT License - see [LICENSE](LICENSE) file for details. ### πŸ™ Acknowledgments This project uses [Crawl4AI](https://github.com/unclecode/crawl4ai) by [UncleCode](https://github.com/unclecode) for web scraping capabilities. Crawl4AI is an excellent open-source LLM-friendly web crawler and scraper. ## 🀝 Contributing Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change. ## ⭐ Support If this project helped you, please give it a star! It helps others discover the project. ## πŸ› Issues Found a bug or have a feature request? Please [open an issue](https://github.com/uysalsadi/crawl4ai-mcp-server/issues).