# CrawlForge Skill Reference > Auto-generated from `src/skills/agent-skills/*/SKILL.md`. To regenerate, run `npm run skills:gen` (calls `concatenateSkills()` from `src/skills/installer.js`). > > This file is the canonical capabilities reference for AI agents using CrawlForge MCP tools. --- # CrawlForge Batch & Automation Scale up scraping and drive interactive pages. Use `batch_scrape` for many URLs, `get_batch_results` to page through async output, `scrape_with_actions` to interact before scraping, and `generate_llms_txt` to produce a site's AI policy file. ## When to use - "Scrape these 30 URLs" / "batch-scrape this list" → `batch_scrape` - "Collect dozens of product / news / competitor pages" → `batch_scrape` (async) - "Get the rest of the results from that batch" → `get_batch_results` - "Click / type / scroll / wait before scraping" / "log in then extract" → `scrape_with_actions` - "Generate an llms.txt for this site" → `generate_llms_txt` ## batch_scrape — many URLs in parallel (cost: 5) Sync mode (results returned immediately), good for up to ~25 URLs: ```json { "tool": "batch_scrape", "params": { "urls": ["https://a.com", "https://b.com", "https://c.com"], "formats": ["markdown"], "mode": "sync", "maxConcurrency": 5 } } ``` Async mode with a webhook for large batches (returns a `batchId` immediately): ```json { "tool": "batch_scrape", "params": { "urls": ["https://a.com", "https://b.com"], "formats": ["json"], "mode": "async", "webhook": { "url": "https://my-site.com/hook", "events": ["batch_completed", "batch_failed"] } } } ``` - `urls` accepts plain strings OR objects `{url, selectors, headers, timeout, metadata}` for per-URL config. 1–50 URLs. - `formats`: `markdown`, `html`, `json`, `text`. - `extractionSchema` applies structured extraction to every URL. - `maxConcurrency` 1–20 (default 10); `delayBetweenRequests` throttles. - Sync batches over ~25 URLs trigger a confirmation prompt (elicitation) — use async for large jobs. CLI: `crawlforge batch urls.txt --format markdown --concurrency 10`. ## get_batch_results — page through output (cost: 1) ```json { "tool": "get_batch_results", "params": { "batchId": "batch_1234567890_abc", "page": 2, "pageSize": 25 } } ``` Use the `batchId` from `batch_scrape` to retrieve paginated results for a completed or in-progress job. Cheap (1 credit) because the batch was already paid for. Completed jobs are also exposed as `crawlforge://job/{jobId}` resources. ## scrape_with_actions — interact, then scrape (cost: 5) For SPAs, login-gated content, or multi-step flows. Run an ordered list of browser actions before extraction. ```json { "tool": "scrape_with_actions", "params": { "url": "https://app.com/login", "actions": [ { "type": "type", "selector": "#email", "text": "user@a.com" }, { "type": "type", "selector": "#password", "text": "secret" }, { "type": "click", "selector": "#login" }, { "type": "wait", "duration": 1500 } ], "formats": ["markdown"] } } ``` Allowed action types: `wait`, `click`, `type`, `press`, `scroll`, `screenshot`, `executeJavaScript`. `executeJavaScript` is disabled unless the deploy sets `ALLOW_JAVASCRIPT_EXECUTION=true`. 1–20 actions per call. Screenshots are stored as `crawlforge://screenshot/{actionId}` resources. Full action schemas: [actions](references/actions.md). CLI: `crawlforge actions https://example.com --script login.json --screenshot`. ## generate_llms_txt — AI policy file (cost: 5) ```json { "tool": "generate_llms_txt", "params": { "url": "https://example.com", "format": "both" } } ``` Generates `llms.txt` (and optionally `llms-full.txt`) describing how AI models should interact with a site. `format`: `both`, `llms-txt`, `llms-full-txt`. Tune `analysisOptions` (`maxDepth`, `maxPages`, `respectRobots`) and `outputOptions` (`contactEmail`, `organizationName`, custom guidelines). CLI: `crawlforge llmstxt https://example.com --include-full > llms.txt`. ## Sync vs async decision - **≤ ~25 URLs, need results now** → `batch_scrape` `mode:"sync"`. - **Large list or long-running** → `mode:"async"` + `webhook`, then `get_batch_results` (or read the `crawlforge://job/{jobId}` resource). ## Cost note `batch_scrape`, `scrape_with_actions`, `generate_llms_txt` = 5 credits each; `get_batch_results` = 1 (retrieval of an already-paid batch). Async batches are billed once at submission — paging results afterward stays cheap. --- # CrawlForge Change Tracking Detect when a web page changes over time using the `track_changes` tool. Useful for competitor pricing, regulation updates, product availability, and any page you need to watch for edits. ## When to use - "Track changes to this page" / "watch this URL" - "Tell me when competitor pricing changes" - "Detect when this content updates" - "Diff this page against last week's version" - "Notify me when product availability / a regulation changes" ## Core workflow `track_changes` is one tool driven by an `operation` parameter (cost: 3 credits per call). 1. **Create a baseline** the first time: ```json { "tool": "track_changes", "params": { "url": "https://example.com/pricing", "operation": "create_baseline" } } ``` 2. **Compare** later to get the change percentage + diff: ```json { "tool": "track_changes", "params": { "url": "https://example.com/pricing", "operation": "compare" } } ``` `compare` is the default operation. It returns a change percentage and a structured diff against the stored baseline. ## Scoping to part of a page Use `trackingOptions` to ignore noise and focus on what matters: ```json { "tool": "track_changes", "params": { "url": "https://example.com/pricing", "operation": "compare", "trackingOptions": { "granularity": "element", "customSelectors": [".price", ".plan-name"], "excludeSelectors": [".timestamp", ".ad"], "ignoreWhitespace": true } } } ``` `granularity`: `page`, `section` (default), `element`, `text`. Toggle `trackText`, `trackStructure`, `trackLinks`, `trackImages`. Set `significanceThresholds` (`minor`/`moderate`/`major`) to classify change size. CLI: `crawlforge track https://example.com --selector ".price" --threshold 1`. ## Scheduled monitoring & notifications Run continuous monitoring with webhooks instead of polling manually: ```json { "tool": "track_changes", "params": { "url": "https://example.com/pricing", "operation": "create_scheduled_monitor", "monitoringOptions": { "enabled": true, "interval": 300000, "notificationThreshold": "moderate", "enableWebhook": true, "webhookUrl": "https://my-site.com/notify" } } } ``` CLI (runs until Ctrl+C): `crawlforge monitor https://example.com --interval 60 --webhook https://my-site.com/notify`. ## Other operations | Operation | Purpose | |-----------|---------| | `create_baseline` | Save the first snapshot to diff against. | | `compare` (default) | Diff current content vs baseline → % change + diff. | | `monitor` | One monitoring pass. | | `get_history` | Retrieve past change records (`queryOptions`). | | `get_stats` | Summary statistics for a tracked URL. | | `create_scheduled_monitor` / `stop_scheduled_monitor` | Manage cron-style monitors. | | `get_dashboard` | Aggregate status, recent alerts, trends. | | `export_history` | Export change history as `json` or `csv`. | | `create_alert_rule` | Conditional alerts (webhook / email / slack). | | `generate_trend_report` | Trend analysis over time. | | `get_monitoring_templates` | List built-in monitoring presets. | You can also pass `content` or `html` directly to compare pre-fetched content against the baseline without re-fetching. ## Cost note `track_changes` = 3 credits per call. A typical watch is one `create_baseline` plus periodic `compare` calls; scheduled monitors run server-side and notify via webhook/Slack so you don't pay for manual polling loops. --- # CrawlForge Deep Research Answer questions and produce reports from many web sources using the CrawlForge MCP server. Three tools, from lightest to heaviest: `search_web` (ranked results), `agent` (autonomous plan-and-answer), `deep_research` (exhaustive multi-source synthesis). ## When to use - "Search the web for X" / "find pages about X" → `search_web` - "Answer this question from the web" / "what are the top 5 X" (no URLs given) → `agent` - "Research this topic in depth" / "compare competitors" / "give me a cited report" / "gather facts with sources" → `deep_research` Do NOT use these for a single known URL — use the **crawlforge-web-scraping** skill (`scrape` / `extract_content`) instead. ## Tool selection | Need | Tool | Cost | |------|------|------| | A ranked list of result URLs + snippets | `search_web` | 5 | | A direct answer, agent decides what to read | `agent` | 8 (scales) | | A synthesized, multi-source, cited report | `deep_research` | 10+ (scales) | A common pipeline: `search_web` to find sources → `batch_scrape` (see **crawlforge-batch-automation**) to read them → `deep_research` to synthesize. ## search_web (cost: 5) ```json { "tool": "search_web", "params": { "query": "best MCP servers 2025", "limit": 10, "time_range": "month" } } ``` Returns titles, URLs, snippets. Supports `lang`, `site` (domain filter), `file_type`, `time_range` (`day`/`week`/`month`/`year`/`all`), `expand_query`, `enable_ranking`, and `enable_deduplication`. CLI: `crawlforge search "MCP server tutorial" --limit 5`. ## agent — autonomous answer (cost: 8, scales with maxUrls) No URLs required. The agent plans search queries, fetches and filters relevant pages, then returns a prose or structured answer. ```json { "tool": "agent", "params": { "prompt": "What are the top 5 MCP servers in 2025 and who maintains them?", "maxUrls": 10 } } ``` - `model:"default"` runs a SamplingClient loop (no LLM keys needed). - `model:"pro"` runs the full ResearchOrchestrator (deeper, multi-source); it asks for confirmation (elicitation) before running. - Pass `schema` for structured JSON output, `urls` for optional seed URLs. - Hard limits enforced by the orchestrator: `maxSteps` ≤ 10, `maxUrls` ≤ 20, 120s wall-clock. Output is degraded-but-useful if no LLM keys/Ollama exist. ## deep_research — exhaustive report (cost: 10+, scales) ```json { "tool": "deep_research", "params": { "topic": "React vs Vue vs Angular in 2025", "maxUrls": 20, "researchApproach": "comparative", "outputFormat": "comprehensive" } } ``` How it works: query expansion → parallel fetch (deduped) → content extraction → conflict detection → synthesis. `researchApproach`: `broad`, `focused`, `academic`, `current_events`, `comparative`. `outputFormat`: `comprehensive`, `summary`, `citations_only`, `conflicts_focus`. Results are stored as a `crawlforge://research/{sessionId}` resource — re-read them without re-running. CLI: `crawlforge research "B2B SaaS pricing trends" --depth deep --max-urls 30`. ### Cost control `deep_research` starts at 10 credits and grows with sources. **`maxUrls > 50` triggers a confirmation prompt (elicitation).** Cap cost with `maxUrls` (e.g. `maxUrls:10` ≈ 12 credits). `timeLimit` (default 120000 ms) bounds runtime. ## Synthesis & LLM fallback The server ships with **no LLM key by design**. The fallback chain is: Ollama (local) → OpenAI key → Anthropic key → MCP Sampling → raw evidence. With no LLM configured, `deep_research` returns structured **raw evidence** for the calling assistant (e.g. Claude Code) to synthesize — this is expected, do not suggest adding API keys. See [research workflows](references/workflows.md) for pipelines, depth tiers, and parameter detail. --- # CrawlForge: Getting Started CrawlForge is an MCP server with **26 tools** for web scraping, crawling, extraction, research, change tracking, and AI-compliance. This skill orients you and routes each request to the right specialized skill. ## Setup 1. Get an API key at https://crawlforge.dev/signup (1,000 free credits). 2. Provide it to the server: ```bash export CRAWLFORGE_API_KEY="your_api_key_here" # or, for the CLI, run: crawlforge init (detects key, installs skills, merges MCP config) ``` Every tool is metered and requires a key — there is no free tier. User config is stored at `~/.crawlforge/config.json`. ## Route to the right skill | The user wants to... | Use skill | |----------------------|-----------| | Scrape a page, get markdown/text/links/metadata, map or crawl a site | **crawlforge-web-scraping** | | Research a topic, search the web, get a cited report, autonomous Q&A | **crawlforge-deep-research** | | Get past 403/CAPTCHA/Cloudflare, or emulate a region/locale | **crawlforge-stealth-browsing** | | Extract JSON/fields, parse a PDF, summarize, analyze sentiment | **crawlforge-structured-extraction** | | Watch a page for changes / monitor pricing | **crawlforge-change-tracking** | | Scrape many URLs, run browser actions, generate llms.txt | **crawlforge-batch-automation** | ## The 26 tools at a glance - **Basic (5):** fetch_url, extract_text, extract_links, extract_metadata, scrape_structured - **Unified (1):** scrape (multi-format single fetch) - **Search & research (3):** search_web, deep_research, agent - **Crawl (2):** crawl_deep, map_site - **Extract & analyze (7):** extract_content, process_document, summarize_content, analyze_content, extract_structured, extract_with_llm, list_ollama_models - **Batch & automation (4):** batch_scrape, get_batch_results, scrape_with_actions, generate_llms_txt - **Stealth & locale (2):** stealth_mode, localization - **Templates & tracking (2):** scrape_template, track_changes ## MCP tools vs CLI - **MCP tools** — call inline within an AI assistant session (Claude Code, Cursor, etc.). This is the default in chat. - **CLI** (`crawlforge `) — for scripts, CI, and pipelines. 15 tool commands + 2 skill commands. See [cli](references/cli.md). Both hit the same backend and consume the same credits. ## LLM fallback chain (Ollama-first) LLM-backed tools (`extract_with_llm`, `extract_structured`, abstractive `summarize_content`, `deep_research`, `agent`) resolve a provider in this order: ``` Ollama (local, default, no key) → OpenAI key → Anthropic key → MCP Sampling → raw evidence ``` The server ships with **no cloud LLM key by design**. With none configured, research tools return raw evidence for the calling assistant to synthesize. Do not suggest adding API keys — local Ollama is the intended zero-cost default. ## When a tool fails — fallbacks | Symptom | Try next | |---------|----------| | `scrape` / `fetch_url` returns 403, 429, CAPTCHA, or empty JS shell | `stealth_mode` (crawlforge-stealth-browsing) | | No template for a known site | `scrape_structured` → `extract_structured` → `extract_with_llm` | | LLM extraction unavailable (no Ollama/keys) | `scrape_structured` with CSS selectors | | Single page too slow / many pages | `batch_scrape` (async + webhook) | | Wrong region / currency shown | `localization` | | Need a big report but cost is high | lower `maxUrls` on `deep_research` | ## Credits Costs range from 1 credit (fetch_url, extract_text, scrape_template, get_batch_results...) up to 10+ for `deep_research`. Full table: [credits](references/credits.md). Prefer the cheapest tool that does the job, and cap dynamic tools (`deep_research`, `agent`, `crawl_deep`) with their max parameters. --- # CrawlForge Stealth Browsing Get past bot-detection systems and geo-blocks. Use `stealth_mode` when a normal scrape is blocked, and `localization` when you need region-specific content, pricing, or locale emulation. ## When to escalate to stealth_mode Escalate from a normal `scrape` / `fetch_url` (see crawlforge-web-scraping) when: - The site returns **403 or 429** on a regular fetch. - You get a **CAPTCHA** or a "please enable JavaScript" interstitial. - Content comes back **empty** or only a shell (JS-rendered SPA). - The site uses **Cloudflare, DataDome, PerimeterX**, or similar protection. `stealth_mode` drives a real browser with randomized fingerprints, human behavior simulation, and WebRTC/canvas/WebGL spoofing. ## stealth_mode (cost: 5) `stealth_mode` is operation-based. Typical flow: create a context, then create a page that navigates to the target URL. ```json { "tool": "stealth_mode", "params": { "operation": "create_context", "stealthConfig": { "level": "advanced", "simulateHumanBehavior": true }, "engine": "playwright" } } ``` Then use the returned `contextId`: ```json { "tool": "stealth_mode", "params": { "operation": "create_page", "contextId": "", "urlToTest": "https://protected-site.com" } } ``` Operations: `configure`, `enable`, `disable`, `create_context`, `create_page`, `get_stats`, `cleanup`. `stealthConfig.level` is `basic` / `medium` (default) / `advanced`. Always run `cleanup` when done to release the browser. ### Engine: playwright vs camoufox - `engine:"playwright"` (default) — Chromium with stealth patches. Fast, good for most basic bot detection. - `engine:"camoufox"` — Firefox-based with native anti-detection (no patches). Scores higher against DataDome / Cloudflare / PerimeterX and on CreepJS. Use for heavily protected, financial, or e-commerce sites. Full decision table: [engine selection](references/engine-selection.md). ### CLI ```bash crawlforge stealth https://protected-site.com crawlforge stealth https://protected-site.com --engine camoufox --wait 3000 --screenshot ``` The CLI exposes a one-shot form (`--engine`, `--wait `, `--screenshot`). Force the engine globally with `export CRAWLFORGE_STEALTH_ENGINE=camoufox`. ## localization (cost: 2) Emulate a country/language/timezone/currency for region-specific content and geo-blocked pages. ```json { "tool": "localization", "params": { "operation": "configure_country", "countryCode": "DE", "language": "de", "currency": "EUR" } } ``` Operations: `configure_country`, `localize_search`, `localize_browser`, `generate_timezone_spoof`, `handle_geo_blocking`, `auto_detect`, `get_stats`, `get_supported_countries`. `countryCode` is ISO 3166-1 alpha-2; `currency` is ISO 4217. Supports proxy routing and GPS geolocation emulation. CLI: `crawlforge localize https://shop.example.com --locale en-GB --country GB --currency GBP`. ## stealth_mode vs localization - **Blocked / bot-detected** → `stealth_mode`. - **Wrong region / language / currency, but not blocked** → `localization`. - **Geo-blocked AND bot-protected** → `localization` to set region context, then `stealth_mode` for the actual fetch. ## Cost note `stealth_mode` = 5 credits per call (screenshots add a small extra cost); `localization` = 2 credits. Try the cheaper `scrape` (2 credits) first and only escalate to stealth when you see the block signals above. --- # CrawlForge Structured Extraction & Analysis Turn pages and documents into structured data, and run NLP analysis. Pick the extraction method by how predictable the page is and whether an LLM is needed. ## Tool selection | You have / want | Tool | Cost | |-----------------|------|------| | A well-known site (Amazon, GitHub, LinkedIn...) | `scrape_template` | 1 | | Exact CSS selectors for the fields | `scrape_structured` | 2 | | A JSON schema to fill (LLM, CSS fallback) | `extract_structured` | 3 | | A natural-language extraction instruction | `extract_with_llm` | 3 | | A PDF / DOCX / TXT to parse | `process_document` | 2 | | A summary of long text | `summarize_content` | 4 | | Sentiment / entities / keywords / readability | `analyze_content` | 3 | | To list local LLMs available for extraction | `list_ollama_models` | 1 | Cheapest-first rule: try `scrape_template` → `scrape_structured` (deterministic) before reaching for the LLM tools. ## scrape_template — known sites, zero selectors (cost: 1) ```json { "tool": "scrape_template", "params": { "template": "github-repo", "url": "https://github.com/user/repo" } } ``` Pass `template:"list"` to enumerate templates. Supports `amazon-product`, `linkedin-profile`, `github-repo`, `youtube-video`, `tweet`, `reddit-thread`, `hacker-news-front-page`, `producthunt-launch`, `stackoverflow-question`, `npm-package`. Full field lists: [templates](references/templates.md). CLI: `crawlforge template github-repo https://github.com/owner/repo`. ## scrape_structured — CSS selectors (cost: 2) ```json { "tool": "scrape_structured", "params": { "url": "https://shop.com/products", "selectors": { "price": ".price", "name": ".product-title", "link": "a.product@href" } } } ``` Most reliable for consistent markup. Append `@attr` to a selector to read an attribute (e.g. `a.link@href`, `img@src`). `max_results` caps matches per field. ## extract_structured — schema-driven (cost: 3) ```json { "tool": "extract_structured", "params": { "url": "https://jobs.example.com/post/123", "schema": { "properties": { "title": { "type": "string" }, "salary": { "type": "string" } }, "required": ["title"] } } } ``` Uses an LLM by default and falls back to CSS selectors when no LLM is configured (`fallbackToSelectors:true`). Provide `selectorHints` to guide the fallback. A schema with >3 required fields and no LLM configured triggers a confirmation prompt. CLI: `crawlforge extract --schema schema.json`. ## extract_with_llm — natural-language prompt (cost: 3) ```json { "tool": "extract_with_llm", "params": { "url": "https://example.com/article", "prompt": "extract title, author, date, and a one-line summary", "provider": "ollama" } } ``` Defaults to a **local Ollama** model (`http://localhost:11434`, no API key). Run `list_ollama_models` first to see installed models, then pass one via `model` (e.g. `"llama3.2"`). Use `provider:"openai"` or `"anthropic"` with the matching key for a cloud model. CLI: `crawlforge extract --prompt "..." --model llama3.2`. ## process_document — PDF / DOCX / TXT (cost: 2) ```json { "tool": "process_document", "params": { "source": "https://example.com/report.pdf", "sourceType": "pdf_url" } } ``` `sourceType`: `url`, `pdf_url`, `file`, `pdf_file`. Returns structured sections, metadata, and word count. `options` passes through `maxPages`, `pageRange:{start,end}`, `password`, `outputFormat`, etc. ## summarize_content & analyze_content ```json { "tool": "summarize_content", "params": { "text": "..long article..", "options": { "summaryLength": "short", "summaryType": "abstractive" } } } ``` ```json { "tool": "analyze_content", "params": { "text": "..article text..", "options": { "extractTopics": true, "includeSentiment": true } } } ``` `summarize_content` (cost 4) does extractive or abstractive summaries. `analyze_content` (cost 3) returns language, sentiment, entities, topics, keyword density, and readability. Both take raw `text` — feed them output from `extract_content` / `extract_text`. ## LLM fallback chain For LLM-backed extraction (extract_structured, extract_with_llm, abstractive summarize): **Ollama (local, default) → OpenAI key → Anthropic key → MCP Sampling**. No cloud key is required; local Ollama is the zero-cost default. ## Cost note `scrape_template` & `list_ollama_models` = 1 · `scrape_structured`, `process_document` = 2 · `extract_structured`, `extract_with_llm`, `analyze_content` = 3 · `summarize_content` = 4. Cloud LLM usage is billed separately by your provider, on top of credits. --- # CrawlForge Web Scraping Fetch and clean web content with the CrawlForge MCP server. This skill covers single-page scraping (one URL → Markdown / HTML / text / links / metadata), site discovery (sitemaps and URL maps), and whole-site crawling. ## When to use - "Scrape this URL" / "get the markdown of this page" / "read this article" → `scrape` - "Just give me the raw HTML / JSON / headers" → `fetch_url` - "Give me the clean article text, no nav or ads" → `extract_content` - "Get the plain text / markdown body" → `extract_text` - "List every link on this page" → `extract_links` - "Get the title / meta / Open Graph / SEO tags" → `extract_metadata` - "Map all the URLs on this site" / "generate a sitemap" → `map_site` - "Crawl the whole docs site" / "index every page" → `crawl_deep` ## Tool selection (fastest path first) 1. **`scrape`** is the default for single pages. One fetch returns many formats at once — no N-request fan-out. Ask for exactly the formats you need. 2. Use the **single-purpose basic tools** (`fetch_url`, `extract_text`, `extract_links`, `extract_metadata`) when you only want one cheap thing. 3. **`extract_content`** when you specifically want Readability-cleaned article body (strips ads, nav, footers) — better than `extract_text` for articles. 4. **`map_site`** before a crawl to estimate scope / find section URLs. 5. **`crawl_deep`** to walk an entire site and (optionally) extract content. If a page returns 403/429, a CAPTCHA, or empty "enable JavaScript" content, switch to the **crawlforge-stealth-browsing** skill (`stealth_mode`). ## scrape — unified multi-format (cost: 2) Get markdown + links + metadata in a single call: ```json { "tool": "scrape", "params": { "url": "https://example.com/article", "formats": ["markdown", "links", "metadata"], "onlyMainContent": true } } ``` `formats` accepts: `"markdown"`, `"html"`, `"rawHtml"`, `"text"`, `"links"`, `"metadata"`, `"screenshot"`, or an object `{ "type": "json", "schema": {...}, "prompt": "..." }` for LLM-structured extraction. Partial success is supported: a failing format adds a `warnings[]` entry instead of failing the whole call. `onlyMainContent` (default `true`) strips boilerplate via Readability. CLI equivalent: ```bash crawlforge scrape https://example.com/article --extract --format markdown ``` ## fetch_url — raw HTTP (cost: 1) ```json { "tool": "fetch_url", "params": { "url": "https://example.com", "timeout": 15000 } } ``` Returns headers + body. Good for JSON/XML APIs or as a first step before `extract_text` / `extract_content`. Supports `headers` (e.g. auth tokens). ## extract_content vs extract_text ```json { "tool": "extract_content", "params": { "url": "https://blog.example.com/post" } } ``` ```json { "tool": "extract_text", "params": { "url": "https://example.com/article", "output_format": "markdown" } } ``` `extract_content` (cost 2) = Readability-cleaned article body, best for RAG. `extract_text` (cost 1) = faster, strips tags; pass `output_format:"markdown"` for RAG-friendly output. ## extract_links / extract_metadata (cost: 1 each) ```json { "tool": "extract_links", "params": { "url": "https://example.com", "filter_external": true } } ``` ```json { "tool": "extract_metadata", "params": { "url": "https://example.com" } } ``` ## map_site — discover URLs (cost: 2) ```json { "tool": "map_site", "params": { "url": "https://example.com", "include_sitemap": true, "max_urls": 500 } } ``` Reads `sitemap.xml` when available. Pass `search:"pricing"` to rank discovered URLs by relevance and get a `ranked_urls:[{url,score}]` array (default output is unchanged when `search` is omitted). CLI: `crawlforge map https://example.com --pretty` (add `--format xml` for a sitemap file). ## crawl_deep — walk a whole site (cost: 4, scales with pages) ```json { "tool": "crawl_deep", "params": { "url": "https://docs.example.com", "max_depth": 3, "max_pages": 200, "extract_content": true } } ``` BFS crawl up to depth 5 / 1000 pages. Use `include_patterns` / `exclude_patterns` (regex) to scope, `respect_robots` to honor robots.txt (on by default). Crawls projected over ~500 pages trigger a confirmation prompt (elicitation). CLI: `crawlforge crawl https://docs.example.com --depth 3 --max-pages 200`. ## Cost note Cheapest first: `fetch_url`, `extract_text`, `extract_links`, `extract_metadata` = 1 credit · `scrape`, `extract_content`, `map_site` = 2 · `crawl_deep` = 4 (grows with `max_pages`). Prefer one `scrape` call over several single-format calls when you need multiple formats. See [tool reference](references/tool-reference.md) for the full parameter list.