---
name: web-search
description: Web search and content extraction skill for AI coding agents. Zero API keys required. Decision tree with fallback chains across 5 tools.
---

# Web Search

Web search, scraping, and content extraction for AI coding agents. Zero API keys required. Five tools organized in fallback chains: WebSearch and Crawl4AI as primary, Jina as secondary, duckduckgo-search and WebFetch as fallbacks. Use when your agent needs web information -- finding pages, extracting content, or conducting research.

Terminology used in this file:
- **Playwright:** Browser automation framework used by Crawl4AI for JavaScript-rendered pages.
- **SPA:** Single-page application; content is rendered dynamically in JavaScript.
- **MCP:** Model Context Protocol, a standard for exposing tool servers to AI agents.

## Setup

```bash
python3 -m pip install crawl4ai duckduckgo-search
crawl4ai-setup
```

- **Claude Code:** copy this skill folder into `.claude/skills/web-search/`
- **Codex CLI:** append this SKILL.md content to your project's root `AGENTS.md`

For the full installation walkthrough (prerequisites, verification, troubleshooting), see [references/installation-guide.md](references/installation-guide.md).

## Staying Updated

This skill ships with an `UPDATES.md` changelog and `UPDATE-GUIDE.md` for your AI agent.

After installing, tell your agent: "Check `UPDATES.md` in the web-search skill for any new features or changes."

When updating, tell your agent: "Read `UPDATE-GUIDE.md` and apply the latest changes from `UPDATES.md`."

Follow `UPDATE-GUIDE.md` so customized local files are diffed before any overwrite.

---

## Quick Start

Run this minimal fallback-safe sequence:

```bash
# 1) Find candidate pages
python3 -c "from duckduckgo_search import DDGS; import json; print(json.dumps(DDGS().text('your query', max_results=5), indent=2))"

# 2) Extract one page quickly (no local deps)
curl -s "https://r.jina.ai/http://example.com/article" | head -80

# 3) Escalate to Crawl4AI if JS rendering is needed
crwl https://example.com/app --f markdown --bypass-cache
```

Use this routing rule: search with `WebSearch` first, extract with Jina/WebFetch for simple pages, escalate to Crawl4AI for JS-heavy targets.

## Decision Tree

```text
Need info from the web?
  |
  +-- Need to SEARCH for pages/answers?
  |     +-- Default first choice --> WebSearch (built-in, zero setup)
  |     +-- WebSearch unavailable? --> Jina s.jina.ai (no key needed)
  |     +-- Both fail? --> duckduckgo-search Python lib (emergency fallback)
  |
  +-- Need to EXTRACT content from a known URL?
  |     +-- JS-heavy SPA, dynamic content? --> Crawl4AI crwl (full browser rendering)
  |     +-- Simple text page (article, docs, blog)? --> Jina r.jina.ai (fast, no install)
  |     +-- Jina/Crawl4AI unavailable? --> WebFetch (built-in, AI-summarized)
  |     +-- Need structured data extraction? --> Crawl4AI with extraction strategy
  |     +-- Multiple URLs in batch? --> Crawl4AI batch mode
  |
  +-- Need DEEP RESEARCH (search + extract + combine)?
        --> WebSearch to find URLs --> Crawl4AI/Jina extract each --> synthesize

Rule of thumb: WebSearch for finding, Jina for reading, Crawl4AI for rendering.
```

---

## Tool Reference

### WebSearch (Built-in) -- Primary Search

**What:** Claude Code built-in web search tool. Returns search results with links and snippets.
**Install required:** None (built-in to Claude Code)
**Strengths:** Zero setup, zero API keys, integrated into agent workflow, always available
**Weaknesses:** No direct SDK/CLI access (tool-only), results are search-result blocks not raw JSON

```text
# Invoked as a Claude Code tool:
WebSearch(query="your search query")

# Supports domain filtering:
WebSearch(query="your query", allowed_domains=["docs.python.org"])
WebSearch(query="your query", blocked_domains=["pinterest.com"])
```

**Returns:** Search result blocks with titles, URLs, and content snippets.

### WebFetch (Built-in) -- Fallback URL Extraction

**What:** Claude Code built-in URL fetcher. Fetches page content, converts HTML to markdown, processes with AI.
**Install required:** None (built-in to Claude Code)
**Strengths:** Zero setup, AI-processed output, handles redirects, 15-min cache
**Weaknesses:** Cannot handle authenticated/private URLs, may summarize large content

```text
# Invoked as a Claude Code tool:
WebFetch(url="https://example.com/page", prompt="Extract the main content")
```

**Limitations:**
- Will fail for authenticated URLs (Google Docs, Confluence, Jira, private GitHub)
- HTTP auto-upgraded to HTTPS
- Large content may be summarized rather than returned in full
- When redirected to a different host, returns redirect URL instead of content

### Crawl4AI -- JS-Rendering Web Scraper

**What:** Open-source scraper with full Playwright browser rendering. Outputs LLM-friendly markdown.
**Install required:** `pip install crawl4ai && crawl4ai-setup`
**Strengths:** Full JS rendering, handles SPAs, batch crawling, structured extraction
**Weaknesses:** Requires Playwright install, heavier than Jina

```bash
# CLI (simplest)
crwl https://example.com
crwl https://example.com -o markdown

# Python API
from crawl4ai import AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url='https://example.com')
    print(result.markdown)
```

### Jina Reader/Search -- Zero-Install Extraction & Search

**What:** URL-to-markdown converter and search via HTTP API. No install needed -- just curl.
**API key:** Not required. `JINA_API_KEY` is optional and only increases rate limits.
**Strengths:** Zero install, fast (~1s), works everywhere curl works, search + extract in one service
**Weaknesses:** No JS rendering, rate limited without API key

```bash
# Read a URL (returns markdown)
curl -s 'https://r.jina.ai/https://example.com'

# Search (returns search results)
curl -s 'https://s.jina.ai/your+search+query'

# With API key (higher rate limits, optional)
curl -s -H "Authorization: Bearer $JINA_API_KEY" 'https://r.jina.ai/https://example.com'
```

### duckduckgo-search -- Emergency Search Fallback

**What:** Python library for DuckDuckGo search. Zero API keys, zero registration.
**Install required:** `pip install duckduckgo-search`
**Strengths:** Completely free, no API key, no rate limit concerns, reliable fallback
**Weaknesses:** Less AI-optimized results than WebSearch, Python-only

```python
from duckduckgo_search import DDGS
results = DDGS().text("your query", max_results=5)
for r in results:
    print(r['title'], r['href'], r['body'])
```

```bash
# One-liner from CLI
python3 -c "from duckduckgo_search import DDGS; import json; print(json.dumps(DDGS().text('your query', max_results=5), indent=2))"
```

---

## Core Workflows

### Pattern 1: Quick Web Search

When: Need factual answers or find relevant pages

1. Use WebSearch: `WebSearch(query="your query here")`
2. Parse results: each result has title, URL, and content snippet
3. Fallback: `curl -s 'https://s.jina.ai/your+query+here'`
4. Emergency: `python3 -c "from duckduckgo_search import DDGS; ..."`

### Pattern 2: URL Content Extraction

When: Have a URL, need its content as clean text/markdown

a) JS-heavy site: `crwl URL` (Crawl4AI, full rendering)
b) Lightweight static page: `curl -s 'https://r.jina.ai/URL'` (Jina)
c) Both fail: `WebFetch(url="URL", prompt="Extract the main content")`

Decision: Is it a SPA/JS-heavy? Use Crawl4AI. Static content? Use Jina first. If output is empty/broken, escalate.

### Pattern 3: Deep Research

When: Need comprehensive research on a topic with multiple sources

1. WebSearch to find relevant pages
2. Pick top 3-5 URLs from results
3. Extract each with Crawl4AI or Jina
4. If any extraction fails (JS site), use the other tool
5. Synthesize extracted content into research summary

Token budget: ~5K per extracted page, budget 25K total for 5 pages

### Pattern 4: Batch URL Scraping

When: Need content from multiple URLs (5+)

```python
import asyncio
from crawl4ai import AsyncWebCrawler
urls = ['url1', 'url2', 'url3']

async def batch():
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            print(f'--- {url} ---')
            print(result.markdown[:2000])

asyncio.run(batch())
```

### Pattern 5: Fallback Chain

When: Primary tool fails

Search chain: WebSearch (built-in) --> Jina s.jina.ai --> duckduckgo-search

Extract chain: Crawl4AI crwl --> Jina r.jina.ai --> WebFetch (built-in)

Always try the primary tool first, escalate on failure.

---

## MCP Configuration

Jina MCP (optional enhancement, not required):

```json
{
  "jina-reader": {
    "command": "npx",
    "args": ["-y", "jina-ai-reader-mcp"]
  }
}
```

MCP (Model Context Protocol) is optional. Your agent can use CLI/Python/built-in tools directly.

---

## Environment Setup

**Zero API keys required.** All tools work out of the box.

Optional:
- `JINA_API_KEY` (get from https://jina.ai) -- increases rate limits, not required

```bash
export JINA_API_KEY='jina_...'  # optional
```

Install:
- `pip install crawl4ai duckduckgo-search`
- `crawl4ai-setup`  # installs Playwright browsers

Built-in tools (WebSearch, WebFetch) require no installation.

Verify: `./scripts/search-check.sh`

---

## Anti-Patterns

| Do NOT | Do instead |
|---|---|
| Use Crawl4AI for simple text pages | Use Jina `r.jina.ai` (zero overhead) |
| Use Jina for JS-heavy SPAs | Use Crawl4AI (Jina has no JS rendering) |
| Skip the fallback chain | Always have a backup: WebSearch->Jina->duckduckgo, Crawl4AI->Jina->WebFetch |
| Extract full pages when you need one fact | Use WebSearch (returns relevant snippets directly) |
| Batch with Jina for 10+ URLs | Use Crawl4AI batch mode (designed for it) |
| Forget rate limits | Jina without API key has stricter limits |
| Use WebFetch for authenticated URLs | It will fail; use browser-ops skill or direct API access |

---

## Error Handling

| Symptom | Tool | Cause | Fix |
|---|---|---|---|
| No results returned | WebSearch | Query too specific or topic too niche | Broaden query, try Jina s.jina.ai or duckduckgo-search |
| Redirect notification | WebFetch | URL redirects to different host | Make a new WebFetch request with the provided redirect URL |
| Auth failure | WebFetch | Authenticated/private URL | Use browser-ops skill or direct API access instead |
| Content summarized | WebFetch | Page content too large | Use Jina r.jina.ai or Crawl4AI for full content |
| 429 Too Many Requests | Jina | Rate limit hit | Add `JINA_API_KEY` header, or add delay between requests |
| Empty/truncated output | Jina | JS-rendered content not captured | Escalate to Crawl4AI: `crwl URL` |
| crwl: command not found | Crawl4AI | Not installed or not on PATH | `pip install crawl4ai && crawl4ai-setup` |
| Playwright browser not found | Crawl4AI | `crawl4ai-setup` not run | Run: `crawl4ai-setup` |
| TimeoutError | Crawl4AI | Page too slow or blocking | Add timeout parameter, check if site blocks bots |
| SSL certificate error | Any | Expired or self-signed cert | Retry; for Crawl4AI add `ignore_https_errors=True` |
| 403 Forbidden | Jina/Crawl4AI | Site blocking automated access | Try different tool from fallback chain |
| ImportError: duckduckgo_search | duckduckgo-search | Package not installed | `pip install duckduckgo-search` |
| RatelimitException | duckduckgo-search | Too many requests too fast | Add 1-2s delay between calls, or switch to WebSearch |

---

## Bundled Resources Index

| Path | What | When to load |
|---|---|---|
| `./UPDATES.md` | Structured changelog for AI agents | When checking for new features or updates |
| `./UPDATE-GUIDE.md` | Instructions for AI agents performing updates | When updating this skill |
| `./references/installation-guide.md` | Detailed install walkthrough for Claude Code and Codex CLI | First-time setup or environment repair |
| `./references/tool-comparison.md` | Side-by-side comparison: latency, cost, JS support, accuracy | When choosing between tools for a specific use case |
| `./references/error-patterns.md` | Detailed failure modes and recovery per tool | When debugging a failed extraction or search |
| `./scripts/search-check.sh` | Health check: verifies all tools are available | Before first web search task in a session |
| `./scripts/setup.sh` | One-shot installer for all dependencies | First-time setup or after environment reset |