---
name: crawl4ai
description: This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.
---

# Crawl4AI

## Overview

Crawl4AI provides web crawling and data extraction capabilities. This skill supports both the **CLI** (recommended for quick tasks) and the **Python SDK** (for programmatic control).

**Choose your interface:**
- **CLI** (`crwl`) - Quick, scriptable commands: [CLI Guide](references/cli-guide.md)
- **Python SDK** - Full programmatic control: [SDK Guide](references/sdk-guide.md)

---

## Quick Start

### Installation

```bash
pip install crawl4ai
crawl4ai-setup

# Verify installation
crawl4ai-doctor
```

### CLI (Recommended)

```bash
# Basic crawling - returns markdown
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache

# See more examples
crwl --example
```

### Python SDK

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])

asyncio.run(main())
```

For SDK configuration details: [SDK Guide - Configuration](references/sdk-guide.md#configuration) (lines 61-150)

---

## Core Concepts

### Configuration Layers

Both CLI and SDK use the same underlying configuration:

| Concept | CLI | SDK |
|---------|-----|-----|
| Browser settings | `-B browser.yml` or `-b "param=value"` | `BrowserConfig(...)` |
| Crawl settings | `-C crawler.yml` or `-c "param=value"` | `CrawlerRunConfig(...)` |
| Extraction | `-e extract.yml -s schema.json` | `extraction_strategy=...` |
| Content filter | `-f filter.yml` | `markdown_generator=...` |

### Key Parameters

**Browser Configuration:**
- `headless`: Run with/without GUI
- `viewport_width` / `viewport_height`: Browser dimensions
- `user_agent`: Custom user agent
- `proxy_config`: Proxy settings

**Crawler Configuration:**
- `page_timeout`: Max page load time (ms)
- `wait_for`: CSS selector or JS condition to wait for
- `cache_mode`: bypass, enabled, disabled
- `js_code`: JavaScript to execute
- `css_selector`: Focus on specific element

For complete parameters: [CLI Config](references/cli-guide.md#configuration) | [SDK Config](references/sdk-guide.md#configuration)

### Output Content

Every crawl returns:
- **markdown** - Clean, formatted markdown
- **html** - Raw HTML
- **links** - Internal and external links discovered
- **media** - Images, videos, audio found
- **extracted_content** - Structured data (if extraction configured)

---

## Markdown Generation (Primary Use Case)

Crawl4AI excels at generating clean, well-formatted markdown:

### CLI

```bash
# Basic markdown
crwl https://docs.example.com -o markdown

# Filtered markdown (removes noise)
crwl https://docs.example.com -o markdown-fit

# With content filter
crwl https://docs.example.com -f filter_bm25.yml -o markdown-fit
```

**Filter configuration:**

```yaml
# filter_bm25.yml (relevance-based)
type: "bm25"
query: "machine learning tutorials"
threshold: 1.0
```

### Python SDK

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # BM25 keeps only the content blocks relevant to the query
    bm25_filter = BM25ContentFilter(user_query="machine learning", bm25_threshold=1.0)
    md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
    config = CrawlerRunConfig(markdown_generator=md_generator)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.example.com", config=config)
        print(result.markdown.fit_markdown)  # Filtered
        print(result.markdown.raw_markdown)  # Original

asyncio.run(main())
```

For content filters: [Content Processing](references/complete-sdk-reference.md#content-processing) (lines 2481-3101)

---

## Data Extraction

### 1. Schema-Based CSS Extraction (Most Efficient)

**No LLM required** - fast, deterministic, cost-free.

**CLI:**

```bash
# Generate schema once (uses LLM)
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

# Use schema for extraction (no LLM)
crwl https://shop.com -e extract_css.yml -s product_schema.json -o json
```

**Schema format:**

```json
{
  "name": "products",
  "baseSelector": ".product-card",
  "fields": [
    {"name": "title", "selector": "h2", "type": "text"},
    {"name": "price", "selector": ".price", "type": "text"},
    {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
  ]
}
```

### 2. LLM-Based Extraction

For complex or irregular content:

**CLI:**

```yaml
# extract_llm.yml
type: "llm"
provider: "openai/gpt-4o-mini"
instruction: "Extract product names and prices"
api_token: "your-token"
```

```bash
crwl https://shop.com -e extract_llm.yml -o json
```

For extraction details: [Extraction Strategies](references/complete-sdk-reference.md#extraction-strategies) (lines 4522-5429)

---

## Advanced Patterns

### Dynamic Content (JavaScript-Heavy Sites)

**CLI:**

```bash
crwl https://example.com -c "wait_for=css:.ajax-content,scan_full_page=true,page_timeout=60000"
```

**Crawler config:**

```yaml
# crawler.yml
wait_for: "css:.ajax-content"
scan_full_page: true
page_timeout: 60000
delay_before_return_html: 2.0
```

### Multi-URL Processing

**CLI (sequential):**

```bash
for url in url1 url2 url3; do crwl "$url" -o markdown; done
```

**Python SDK (concurrent):**

```python
# Assumes `crawler` and `config` are set up as in the earlier SDK examples
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = await crawler.arun_many(urls, config=config)
```

For batch processing: [arun_many() Reference](references/complete-sdk-reference.md#arunmany-reference) (lines 1057-1224)

### Session & Authentication

**CLI:**

```yaml
# login_crawler.yml
session_id: "user_session"
js_code: |
  document.querySelector('#username').value = 'user';
  document.querySelector('#password').value = 'pass';
  document.querySelector('#submit').click();
wait_for: "css:.dashboard"
```

```bash
# Login
crwl https://site.com/login -C login_crawler.yml

# Access protected content (session reused)
crwl https://site.com/protected -c "session_id=user_session"
```

For session management: [Advanced Features](references/complete-sdk-reference.md#advanced-features) (lines 5429-5940)

### Anti-Detection & Proxies

**CLI:**

```yaml
# browser.yml
headless: true
proxy_config:
  server: "http://proxy:8080"
  username: "user"
  password: "pass"
user_agent_mode: "random"
```

```bash
crwl https://example.com -B browser.yml
```

---

## Common Use Cases

### Documentation to Markdown

```bash
crwl https://docs.example.com -o markdown > docs.md
```

### E-commerce Product Monitoring

```bash
# Generate schema once
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

# Monitor (no LLM costs)
crwl https://shop.com -e extract_css.yml -s schema.json -o json
```

### News Aggregation

```bash
# Multiple sources with filtering
for url in news1.com news2.com news3.com; do
    crwl "https://$url" -f filter_bm25.yml -o markdown-fit
done
```

### Interactive Q&A

```bash
# First view content
crwl https://example.com -o markdown

# Then ask questions
crwl https://example.com -q "What are the main conclusions?"
crwl https://example.com -q "Summarize the key points"
```

---

## Resources

### Provided Scripts

- **scripts/extraction_pipeline.py** - Schema generation and extraction
- **scripts/basic_crawler.py** - Simple markdown extraction
- **scripts/batch_crawler.py** - Multi-URL processing

### Reference Documentation

| Document | Purpose |
|----------|---------|
| [CLI Guide](references/cli-guide.md) | Command-line interface reference |
| [SDK Guide](references/sdk-guide.md) | Python SDK quick reference |
| [Complete SDK Reference](references/complete-sdk-reference.md) | Full API documentation (5900+ lines) |

---

## Best Practices

1. **Start with CLI** for quick tasks, SDK for automation
2. **Use schema-based extraction** - 10-100x more efficient than LLM
3. **Enable caching during development** - `--bypass-cache` only when needed
4. **Set appropriate timeouts** - 30s normal, 60s+ for JS-heavy sites
5. **Use content filters** for cleaner, focused markdown
6. **Respect rate limits** - Add delays between requests

---

## Troubleshooting

### JavaScript Not Loading

```bash
crwl https://example.com -c "wait_for=css:.dynamic-content,page_timeout=60000"
```

### Bot Detection Issues

```bash
crwl https://example.com -B browser.yml
```

```yaml
# browser.yml
headless: false
viewport_width: 1920
viewport_height: 1080
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
```

### Content Not Extracted

```bash
# Debug: see full output
crwl https://example.com -o all -v

# Try different wait strategy
crwl https://example.com -c "wait_for=js:document.querySelector('.content')!==null"
```

### Session Issues

```bash
# Verify session
crwl https://site.com -c "session_id=test" -o all | grep -i session
```

---

For comprehensive API documentation, see [Complete SDK Reference](references/complete-sdk-reference.md).
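
The sequential CLI loop under Multi-URL Processing and the "Respect rate limits" advice under Best Practices could be combined in a small wrapper script. The sketch below is illustrative only: `build_crwl_command` and `crawl_politely` are hypothetical helper names (not part of Crawl4AI), the 2-second default delay is an arbitrary assumption, and only the `-o` and `-c "key=value,..."` flags shown in this document are used. It defaults to a dry run that prints the commands instead of executing them.

```python
import shlex
import subprocess
import time


def _format_value(value):
    """Render booleans as lowercase true/false, as in the -c examples."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_crwl_command(url, crawler_params=None, output="markdown"):
    """Return an argument list for `crwl` (hypothetical helper, not part of Crawl4AI)."""
    cmd = ["crwl", url, "-o", output]
    if crawler_params:
        # -c takes comma-separated key=value pairs; values that themselves
        # contain commas are not handled by this sketch.
        pairs = ",".join(f"{k}={_format_value(v)}" for k, v in crawler_params.items())
        cmd += ["-c", pairs]
    return cmd


def crawl_politely(urls, crawler_params=None, output="markdown",
                   delay_seconds=2.0, dry_run=True):
    """Run `crwl` once per URL, pausing between requests.

    With dry_run=True the commands are only printed and returned.
    """
    commands = []
    for index, url in enumerate(urls):
        cmd = build_crwl_command(url, crawler_params, output)
        commands.append(cmd)
        if dry_run:
            print(shlex.join(cmd))
            continue
        if index > 0:
            time.sleep(delay_seconds)  # spacing between consecutive requests
        subprocess.run(cmd, check=False)
    return commands


if __name__ == "__main__":
    crawl_politely(
        ["https://example.com/a", "https://example.com/b"],
        crawler_params={"wait_for": "css:.dynamic-content", "page_timeout": 60000},
    )
```

Passing `dry_run=False` would actually invoke `crwl`, so it is worth checking the printed commands first. For concurrent crawling, the SDK's `arun_many()` remains the better fit.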