--- name: web-research-workflow description: Unified decision tree for web research. Auto-selects WebFetch, Tavily, or agent-browser based on target site characteristics and available API keys. Use when researching web content, scraping, extracting raw markdown, or capturing documentation. context: fork agent: web-research-analyst version: 1.2.0 author: OrchestKit AI Agent Hub tags: [research, browser, webfetch, tavily, automation, scraping, content-extraction] user-invocable: false allowedTools: [Bash, Read, Write, WebFetch] complexity: low --- # Web Research Workflow Unified approach for web content research that automatically selects the right tool for each situation. ## Quick Decision Tree ``` URL to research │ ▼ ┌─────────────────┐ │ 1. Try WebFetch │ ← Fast, free, no overhead │ (always try) │ └─────────────────┘ │ Content OK? ──Yes──► Parse and return │ No (empty/partial/<500 chars) │ ▼ ┌───────────────────────┐ │ 2. TAVILY_API_KEY set?│ └───────────────────────┘ │ │ Yes No ──► Skip to step 3 │ ▼ ┌───────────────────────────┐ │ Tavily search/extract/ │ ← Raw markdown, batch URLs │ crawl/research │ └───────────────────────────┘ │ Content OK? ──Yes──► Parse and return │ No (JS-rendered/auth-required) │ ▼ ┌─────────────────────┐ │ 3. Use agent-browser │ ← Full browser, last resort └─────────────────────┘ │ ├─ SPA (react/vue/angular) ──► wait --load networkidle ├─ Login required ──► auth flow + state save ├─ Dynamic content ──► wait --text "Expected" └─ Multi-page ──► crawl pattern ``` ## Tavily Enhanced Research When `TAVILY_API_KEY` is set, Tavily provides a powerful middle tier between WebFetch and agent-browser. It returns raw markdown content, supports batch URL extraction, and offers semantic search with relevance scoring. **When to use Tavily over WebFetch:** - WebFetch returned <500 chars (likely incomplete) - You need raw markdown content (not Haiku-summarized) - Batch extracting content from multiple URLs - Semantic search with relevance scoring - Site discovery/crawling (map API) **When to skip Tavily and go to agent-browser:** - Content requires JavaScript rendering (SPAs) - Authentication/login is required - Interactive elements need clicking - Content is behind CAPTCHAs ### Tavily Search (Semantic Web Search) Returns relevance-scored results with raw markdown content: ```bash curl -s -X POST 'https://api.tavily.com/search' \ -H 'Content-Type: application/json' \ -H "Authorization: Bearer $TAVILY_API_KEY" \ -d '{ "query": "your search query", "search_depth": "advanced", "max_results": 5, "include_raw_content": "markdown" }' | python3 -m json.tool ``` Options: - `search_depth`: `"basic"` (fast) or `"advanced"` (thorough, 2x cost) - `topic`: `"general"` (default) or `"news"` or `"finance"` - `include_domains`: `["example.com"]` — restrict to specific sites - `exclude_domains`: `["reddit.com"]` — filter out sites - `days`: `3` — limit to recent results (news/finance) - `include_raw_content`: `"markdown"` — get full page content ### Tavily Extract (Batch URL Content) Extract raw content from up to 20 URLs at once: ```bash curl -s -X POST 'https://api.tavily.com/extract' \ -H 'Content-Type: application/json' \ -H "Authorization: Bearer $TAVILY_API_KEY" \ -d '{ "urls": [ "https://docs.example.com/guide", "https://competitor.com/pricing" ] }' | python3 -m json.tool ``` Returns markdown content for each URL. Use when you have specific URLs and need full content. ### Tavily Map (Site Discovery) Discover all URLs on a site before extracting: ```bash curl -s -X POST 'https://api.tavily.com/map' \ -H 'Content-Type: application/json' \ -H "Authorization: Bearer $TAVILY_API_KEY" \ -d '{ "url": "https://docs.example.com", "max_depth": 2, "limit": 50 }' | python3 -m json.tool ``` Useful for documentation sites and competitor sitemaps. Combine with extract for full crawl. ### Tavily Crawl (Multi-Page Content Extraction) Crawl an entire site and extract content from all discovered pages in one call. Replaces the manual map→extract two-step workflow: ```bash curl -s -X POST 'https://api.tavily.com/crawl' \ -H 'Content-Type: application/json' \ -H "Authorization: Bearer $TAVILY_API_KEY" \ -d '{ "url": "https://docs.example.com", "max_depth": 2, "limit": 50, "include_raw_content": "markdown" }' | python3 -m json.tool ``` Options: - `max_depth`: `2` — how many link-hops from the seed URL - `limit`: `50` — max pages to crawl - `include_raw_content`: `"markdown"` — get full page content (not just snippets) - `exclude_paths`: `["/blog/*"]` — skip certain URL patterns - `include_paths`: `["/docs/*"]` — restrict to certain URL patterns **When to use Crawl vs Map+Extract:** - Use **Crawl** when you want content from an entire site section (docs, changelog, pricing) - Use **Map** when you only need URL discovery without content - Use **Extract** when you already have specific URLs ### Tavily Research (Deep Research — Beta) Multi-step research agent that searches, reads, and synthesizes across multiple sources. Returns a comprehensive report with citations: ```bash curl -s -X POST 'https://api.tavily.com/research' \ -H 'Content-Type: application/json' \ -H "Authorization: Bearer $TAVILY_API_KEY" \ -d '{ "query": "comparison of vector databases for RAG in 2026", "max_results": 10, "report_type": "research_report" }' | python3 -m json.tool ``` Options: - `report_type`: `"research_report"` (default) | `"outline_report"` | `"detailed_report"` - `max_results`: number of sources to analyze (more = deeper but slower) **Note:** The `/research` endpoint is in beta. Falls back gracefully to `/search` + `/extract` if unavailable. Best for deep competitive analysis, market research, and technical comparisons. ### Escalation Heuristic ```bash # Auto-escalate from WebFetch to Tavily CONTENT=$(WebFetch url="$URL" prompt="Extract main content") CHAR_COUNT=${#CONTENT} if [ "$CHAR_COUNT" -lt 500 ] && [ -n "$TAVILY_API_KEY" ]; then # WebFetch returned thin content — try Tavily extract RESULT=$(curl -s -X POST 'https://api.tavily.com/extract' \ -H 'Content-Type: application/json' \ -H "Authorization: Bearer $TAVILY_API_KEY" \ -d "{\"urls\":[\"$URL\"]}") # Parse result... fi ``` ### Cost Awareness | API | Cost | Notes | |-----|------|-------| | Tavily Search | 1 credit/search | `advanced` depth = 2 credits | | Tavily Extract | 1 credit/5 URLs | Batch up to 20 URLs | | Tavily Map | 1 credit/10 pages | Good for site discovery | | Tavily Crawl | 1 credit/5 pages | Full site extraction in one call | | Tavily Research | 5 credits/query | Deep multi-source synthesis (beta) | | WebFetch | Free | Always try first | | agent-browser | Free (compute) | Slowest, most capable | ### Graceful Degradation If `TAVILY_API_KEY` is not set, the 3-tier tree collapses to the original 2-tier (WebFetch → agent-browser). No configuration needed — agents check for the env var before attempting Tavily calls. ## When to Use What | Scenario | Tool | Why | |----------|------|-----| | Static HTML page | WebFetch | Fast, no browser needed | | Public API docs | WebFetch | Usually server-rendered | | GitHub README | WebFetch | Static content | | Batch URL extraction (2-20 URLs) | Tavily Extract | Parallel, raw markdown | | Semantic search with content | Tavily Search | Relevance-scored results | | Site crawl/discovery (URLs only) | Tavily Map | Finds all URLs on a domain | | Full site content extraction | Tavily Crawl | Crawl + extract in one call | | Competitor page deep content | Tavily Extract | Full markdown, not summarized | | Deep multi-source research | Tavily Research | Synthesized report with citations (beta) | | React/Vue/Angular app | agent-browser | Needs JS execution | | Interactive pricing page | agent-browser | Dynamic content | | Login-protected content | agent-browser | Needs session state | | Swagger UI | agent-browser | Client-rendered | | Documentation with sidebar nav | agent-browser | Client-side routing | ## Pattern 1: Auto-Fallback Try WebFetch first, fall back to browser if needed: ```bash # Step 1: Try WebFetch WebFetch(url="https://example.com", prompt="Extract main content") # If result is empty, partial, or contains "Loading..." indicators: # Step 2: Fall back to browser agent-browser open https://example.com agent-browser wait --load networkidle agent-browser get text body ``` ### Detection Heuristics Content likely needs browser if WebFetch returns: - Empty or very short content (< 500 chars) - Contains `