--- name: firecrawl-research-patterns description: "Programmatic Firecrawl usage, self-hosted operations, academic paper routing, recursive deep research, and raw corpus persistence. TRIGGERS - firecrawl search, firecrawl scrape, academic paper, arxiv, deep research, recursive search, research pattern, corpus persistence, firecrawl, self-hosted scraping, web scrape, scraper wrapper, littleblack, ZeroTier scraping." allowed-tools: Read, Write, Edit, Bash, Grep, Glob --- # Firecrawl Research Patterns Programmatic patterns for using self-hosted Firecrawl in research workflows — search, scrape, route academic papers, run recursive deep research, and persist raw results for future re-analysis. Also covers self-hosted deployment, health checks, and recovery. For archiving AI chat conversations (ChatGPT/Gemini shares), see `Skill(gh-tools:research-archival)`. --- ## FIRST — TodoWrite Task Templates **MANDATORY**: Select and load the appropriate template before any research work. ### Template A — Single Firecrawl Search + Persist ``` 1. Health check — GET http://172.25.236.1:3002/v1/health (fallback: test search) 2. Execute search — POST /v1/search with query, limit, scrapeOptions 3. Persist raw results — save each result page to docs/research/corpus/ with frontmatter 4. Update corpus index — append entries to docs/research/corpus-index.jsonl 5. Extract findings — summarize key learnings from raw corpus files ``` ### Template B — Academic Paper Retrieval + Persist ``` 1. Identify source — classify URL/DOI per academic-paper-routing.md decision tree 2. Route to scraper — arxiv direct HTML, Semantic Scholar API, Firecrawl, or Jina Reader 3. Scrape content — execute fetch with appropriate method and timeout 4. Persist raw result — save to docs/research/corpus/ with academic-specific frontmatter 5. Update corpus index — append entry to corpus-index.jsonl 6. Summarize paper — extract key claims, methods, results from raw corpus file ``` ### Template C — Full Recursive Deep Research with Corpus ``` 1. Health check — verify Firecrawl reachable at 172.25.236.1:3002 2. Initialize parameters — set breadth (default 4), depth (default 2), concurrency (default 2) 3. Generate search queries — LLM generates N queries from topic + prior learnings 4. Execute searches — Firecrawl /v1/search for each query via p-limit(concurrency) 5. Persist raw results — save ALL scraped pages to docs/research/corpus/ with provenance 6. Extract learnings — LLM extracts key findings + follow-up questions per result set 7. Recurse — for each follow-up, recurse with breadth=ceil(breadth/2), depth=depth-1 8. Base case — depth=0, return accumulated learnings 9. Synthesize report — LLM generates final markdown from all learnings 10. Write session report — save to docs/research/sessions/ with corpus file references 11. Update corpus index — append all new entries to corpus-index.jsonl ``` ### Template D — Corpus Review / Re-Analysis ``` 1. Inventory corpus — read docs/research/corpus-index.jsonl, filter by session/topic/date 2. Read raw files — load matching corpus files from docs/research/corpus/ 3. Re-analyze — extract new insights with current context/questions 4. Update session report — amend or create new session report in docs/research/sessions/ ``` --- ## Section 1 — Programmatic Firecrawl Usage **Instance**: Self-hosted at `http://172.25.236.1:3002` via ZeroTier. No API key needed. ### Why `fetch()` Instead of `@mendable/firecrawl-js` SDK The official SDK uses `jiti` for dynamic imports, which is incompatible with Bun's module resolution. Direct `fetch()` calls are simpler, more reliable, and have zero dependencies. ### Two Endpoints | Endpoint | Purpose | When to Use | | ----------------- | --------------------- | ------------------------------------------------- | | `POST /v1/search` | Search + scrape combo | Research queries — returns multiple scraped pages | | `POST /v1/scrape` | Single URL scrape | Known URL — extract markdown from one page | See [api-endpoint-reference.md](./references/api-endpoint-reference.md) for full request/response contracts. ### Quick Examples **Search** (returns multiple results with markdown): ```typescript const res = await fetch("http://172.25.236.1:3002/v1/search", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ query: "mixture of experts scaling laws", limit: 5, scrapeOptions: { formats: ["markdown"] }, }), }); const { data } = await res.json(); // data: [{ url, markdown, metadata }] ``` **Scrape** (single URL): ```typescript const res = await fetch("http://172.25.236.1:3002/v1/scrape", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ url: "https://arxiv.org/abs/2401.12345", formats: ["markdown"], waitFor: 3000, // ms — for JS-heavy pages }), }); const { data } = await res.json(); // data: { markdown, metadata } ``` ### Error Handling ```typescript // Always set a timeout const controller = new AbortController(); const timeoutId = setTimeout(() => controller.abort(), 15_000); try { const res = await fetch(url, { ...opts, signal: controller.signal }); if (!res.ok) throw new Error(`Firecrawl: ${res.status} ${res.statusText}`); const json = await res.json(); if (!json.data || (Array.isArray(json.data) && json.data.length === 0)) { // Empty results — not an error, but no content to process } } finally { clearTimeout(timeoutId); } ``` ### Health Check ```typescript // Quick health check before starting a research session const res = await fetch("http://172.25.236.1:3002/v1/health"); if (!res.ok) throw new Error( "Firecrawl unhealthy — see self-hosted-operations.md and self-hosted-troubleshooting.md references", ); ``` --- ## Section 2 — Academic Paper Routing Route paper retrieval to the most effective method based on source. Full decision tree in [academic-paper-routing.md](./references/academic-paper-routing.md). ### Quick Reference | Source | Best Method | Fallback | | ----------------- | ------------------------------------- | ------------------------- | | arxiv.org | Direct HTML (`/html/ID`) | Firecrawl `/v1/scrape` | | Semantic Scholar | API (`api.semanticscholar.org`) | Firecrawl search by title | | ACL Anthology | Firecrawl `/v1/scrape` | Direct PDF download | | NeurIPS/ICML/ICLR | Firecrawl `/v1/scrape` with `waitFor` | Search by title | | IEEE Xplore | Firecrawl with `waitFor: 3000` | Author's website | | ACM DL | Firecrawl with `waitFor: 3000` | Author's website | | Author blogs | Jina Reader (`r.jina.ai`) | Firecrawl `/v1/scrape` | | Google Scholar | Firecrawl `/v1/search` | Direct search query | ### DOI Resolution ```typescript // DOI → publisher URL → route to appropriate scraper const res = await fetch(`https://doi.org/${doi}`, { redirect: "follow" }); const publisherUrl = res.url; // e.g., https://dl.acm.org/doi/10.1145/... // Then route publisherUrl through the decision tree above ``` --- ## Section 3 — Recursive Research Protocol The iterative search → extract → recurse → synthesize pattern. Full step-by-step protocol in [recursive-research-protocol.md](./references/recursive-research-protocol.md). ### Algorithm Overview ``` deepResearch(topic, breadth=4, depth=2, concurrency=2): 1. Generate N search queries (N = breadth) from topic + prior learnings 2. For each query (via p-limit concurrency): a. Firecrawl /v1/search → get results b. PERSIST each raw result to docs/research/corpus/ c. Extract learnings + follow-up questions 3. For each follow-up question: → Recurse with breadth=ceil(breadth/2), depth=depth-1 4. Base case: depth=0 → return accumulated learnings 5. Synthesize final report from all learnings 6. Write session report to docs/research/sessions/ ``` ### Default Parameters (from working implementation) | Parameter | Default | Max | Rationale | | ------------- | ------- | --- | ------------------------------------------------------- | | `breadth` | 4 | — | Number of parallel search queries per level | | `depth` | 2 | 5 | Recursion levels (depth > 5 yields diminishing returns) | | `concurrency` | 2 | — | Parallel Firecrawl requests (self-hosted, be gentle) | | `limit` | 5 | — | Results per search query | | `timeout` | 15000ms | — | Per-search timeout | ### Token Budget Each search returns up to 5 pages. Trim each page to ~25,000 tokens before LLM processing: ```typescript function trimToTokenLimit(text: string, maxTokens: number): string { if (!text) return ""; const estimatedTokens = Math.ceil(text.length / 3.5); if (estimatedTokens <= maxTokens) return text; const maxChars = Math.floor(maxTokens * 3.5 * 0.8); return text.slice(0, maxChars); } ``` ### Partial Failure Principle **Partial results are better than total failure.** If a query fails, log it and continue with remaining queries. Never abort the entire research session because one query timed out. --- ## Section 4 — Raw Corpus Persistence **Critical principle**: Every Firecrawl-scraped page must be persisted in its **original raw markdown** with provenance metadata. Synthesized reports reference these originals but never replace them. Full format specification in [corpus-persistence-format.md](./references/corpus-persistence-format.md). ### Directory Layout ``` {project-root}/ ├── docs/research/ │ ├── corpus/ # Raw scraped pages (committed) │ │ └── YYYY-MM-DD-{slug}.md # One file per scraped URL │ ├── sessions/ # Research session reports (committed) │ │ └── YYYY-MM-DD-{topic-slug}.md # Synthesized report with corpus refs │ └── corpus-index.jsonl # Append-only registry (committed) ``` ### Corpus File Frontmatter ```yaml --- source_url: https://arxiv.org/html/2401.12345 scraped_at: "2026-02-25T14:30:00Z" scraper: firecrawl firecrawl_endpoint: /v1/search search_query: "mixture of experts scaling" result_index: 2 research_session: "2026-02-25-moe-scaling" depth_level: 1 claude_code_uuid: SESSION_UUID content_tokens_approx: 4200 --- [RAW MARKDOWN FROM FIRECRAWL — NEVER MODIFIED] ``` ### Key Rules 1. Content below `---` is the **exact markdown Firecrawl returned** — no summarization, trimming, or reformatting 2. One file per URL per scrape — if the same URL is scraped in multiple sessions, each gets its own timestamped file 3. File naming: `YYYY-MM-DD-{slug}.md` where slug is kebab-case from page title or URL path (max 60 chars) 4. Session reports in `docs/research/sessions/` reference corpus files by relative path ### Corpus Index (JSONL) ```json { "url": "https://arxiv.org/html/2401.12345", "file": "corpus/2026-02-25-moe-scaling-arxiv-2401-12345.md", "scraped_at": "2026-02-25T14:30:00Z", "session": "2026-02-25-moe-scaling", "tokens": 4200, "scraper": "firecrawl" } ``` ### Why This Matters - **LLM re-analysis**: Future sessions can re-read raw corpus files and extract different insights with better prompts or newer models - **No information loss**: Synthesis drops details; raw files preserve everything Firecrawl captured - **Deduplication awareness**: The JSONL index lets agents skip URLs already in the corpus - **Git-friendly**: Markdown files diff cleanly, JSONL is append-only --- ## Section 5 — Self-Hosted Operations The Firecrawl instance runs on **littleblack** (172.25.236.1) via ZeroTier. No API key needed. | Port | Service | Type | Purpose | | ---- | --------------- | ------ | -------------------------- | | 3002 | Firecrawl API | Docker | Core scraping engine | | 3003 | Scraper Wrapper | Bun | Saves to file, returns URL | | 8080 | Caddy | Binary | Serves saved markdown | For architecture diagrams, health checks, recovery commands, and deployment details, see: - [Self-Hosted Operations](./references/self-hosted-operations.md) — Architecture, health checks, recovery commands - [Self-Hosted Bootstrap Guide](./references/self-hosted-bootstrap-guide.md) — Fresh installation (7 steps) - [Self-Hosted Best Practices](./references/self-hosted-best-practices.md) — Docker restart policies, monitoring - [Self-Hosted Troubleshooting](./references/self-hosted-troubleshooting.md) — Symptom-based diagnosis --- ## Anti-Patterns | # | Anti-Pattern | Why It Fails | Correct Approach | | --- | ------------------------------------------- | ------------------------------------------------ | ----------------------------------------------- | | 1 | Using `@mendable/firecrawl-js` SDK | `jiti` dynamic imports break in Bun | Direct `fetch()` calls | | 2 | Searching paywalled sites without `waitFor` | JS SPAs return empty shell | Use `waitFor: 3000` for IEEE, ACM DL | | 3 | Setting depth > 5 | Exponential query explosion, diminishing returns | Cap at depth 5 (`clampDepth()`) | | 4 | No timeout on `fetch()` | Hangs indefinitely on unreachable pages | Always use `AbortController` with 15s timeout | | 5 | Not trimming long page content | Exceeds LLM context window | `trimToTokenLimit(text, 25_000)` per page | | 6 | Aborting on partial failure | Loses all completed work | Log failures, continue with remaining queries | | 7 | Not checking Firecrawl health first | Wastes time on queries that all fail | `GET /v1/health` or test search before starting | | 8 | Saving only synthesis without raw originals | Loses source material, prevents re-analysis | Always persist raw Firecrawl markdown to corpus | --- ## References - [API Endpoint Reference](./references/api-endpoint-reference.md) — `/v1/search` and `/v1/scrape` contracts - [Academic Paper Routing](./references/academic-paper-routing.md) — Decision tree for paper sources - [Recursive Research Protocol](./references/recursive-research-protocol.md) — Step-by-step recursive pattern - [Corpus Persistence Format](./references/corpus-persistence-format.md) — Raw content archival format + directory layout - [Self-Hosted Operations](./references/self-hosted-operations.md) — Architecture, health checks, recovery - [Self-Hosted Bootstrap Guide](./references/self-hosted-bootstrap-guide.md) — Fresh installation guide - [Self-Hosted Best Practices](./references/self-hosted-best-practices.md) — Docker restart policies, monitoring - [Self-Hosted Troubleshooting](./references/self-hosted-troubleshooting.md) — Symptom-based diagnosis and recovery