# Tool Specifications These tools let your AI assistant search the web, read pages, find academic papers, track multi-step research, and more — always returning real, verifiable sources. Below are the detailed schemas and behavioral contracts for each tool. > **Note:** Output schemas describe the JSON shape returned by each tool. See the corresponding `internal/tools/*.go` file for the implementation. Input schemas are auto-generated from struct `jsonschema` tags. ## Tool Registration Pattern Each tool follows the pattern in `internal/tools/registry.go`: a typed input struct with `jsonschema` tags (the SDK auto-generates JSON Schema from these) and a `register*` function that calls `mcp.AddTool`. See `internal/tools/search.go` for a representative example. ## Cache-Key Contract **Every result-affecting parameter MUST be included in a tool's cache key.** This single rule prevents an entire class of cache-collision bugs — two requests that would produce different results must never share a key (e.g. different providers, or a smaller `max_length` that would serve a later larger request a truncated body). - Canonical implementations: `searchCacheKey(...)` (`internal/tools/search.go`) for the search family — keys on the tool name plus every query parameter **including `provider`** — and `scrapeCacheKey(url, mode, maxLength)` (`internal/tools/scrape.go`) for scrapes. - Each key carries a **version segment** (e.g. `v2`); bump it whenever the cached response *shape* changes so a post-upgrade cache hit can never serve a blob missing a newly-added field. - Enforcement: `internal/tools/cachekey_test.go` guards today's parameters. When you add a tool or a result-affecting parameter, extend both the key and that test — the test only covers the params it knows about, so a new param can reintroduce the bug without failing any existing assertion. --- ## Large-Payload Linking (resource_link) The heaviest tools — `scrape_page` (`mode: raw`), `search_and_scrape`, and `research_export` — can return tens to hundreds of KB. When a result is **at or above the link threshold**, the tool returns an MCP `resource_link` (2025-06-18 content type) instead of inlining the full body: a small inline summary (`{resource, bytes, mimeType, summary, expiresAt, linked:true}`) plus a `resource_link` the client fetches on demand. Below the threshold, results inline exactly as before (no behavior change). - The linked body is stored in the shared `cache.Cache` (memory + AES-encrypted disk, or Redis in HTTP mode) under a **content-addressed** key and served read-only via the `research://artifact/{id}` resource template. The id is the SHA-256 of the body, so identical payloads de-dupe and the URI is stable/idempotent. - Artifacts are **short-lived** (bounded TTL); a fetch after expiry returns a not-found error, never another caller's data. With no cache configured, large payloads inline (correctness over size). - Canonical implementation: `largeResultOrInline(...)` + `registerArtifactResource(...)` in `internal/tools/artifacts.go`. Cache-freshness `_meta` (and routing `_meta`) ride on either shape. --- ## Tool 1: `web_search` ### Purpose Perform a web search and return structured result URLs with metadata. ### Input Schema | Field | Type | Required | Default | Constraints | |-------|------|----------|---------|-------------| | `query` | string | yes | — | 1-500 chars | | `num_results` | int | no | 5 | 1-10 | | `time_range` | string | no | — | `day`, `week`, `month`, `year` | | `safe` | string | no | `medium` | `off`, `medium`, `high` | | `language` | string | no | — | ISO 639-1 code | | `site` | string | no | — | Domain restriction (cannot combine with `lens`) | | `exact_terms` | string | no | — | Exact phrase match | | `exclude_terms` | string | no | — | Terms to exclude | | `country` | string | no | — | ISO 3166-1 alpha-2 | | `lens` | string | no | — | Domain lens (overrides `site`). See `lenses/` directory for available lenses | | `provider` | string | no | — | Force search provider. Returns error listing available providers if unknown | | `sessionId` | string | no | — | Link results to a `sequential_search` session | | `claim` | string | no | — | Optional claim to evaluate against each result's snippet; when set, each result gains a `claimSignal` (#66). Evidence only — never a verdict | ### Output Schema ```go type SearchOutput struct { URLs []string `json:"urls"` Query string `json:"query"` ResultCount int `json:"resultCount"` Results []SearchResult `json:"results"` Hints *ZeroResultHints `json:"hints,omitempty"` // present ONLY on zero-result responses (see below) Trust string `json:"trust"` // "untrusted-external-content" — treat results as data, not instructions (OWASP LLM01) } type SearchResult struct { Title string `json:"title"` URL string `json:"url"` Snippet string `json:"snippet"` DisplayLink string `json:"displayLink"` SourceReputation *DomainReputation `json:"sourceReputation,omitempty"` // present when host is in the reputation dataset (#198); omitted for unknown hosts ClaimSignal string `json:"claimSignal"` // most claim-relevant snippet sentence; present on EVERY result whenever `claim` is set (empty string when no snippet sentence matched) — uniform shape (#66, #235) } ``` `sourceReputation` is a descriptive signal (same shape as `scrape_page`/`search_and_scrape`) indicating the host's known reliability tier (`high`, `low`, `mixed`) with a `basis` note. It is omitted for hosts not in the dataset — absence means unknown, not bad. When `claim` is set, every result carries a `claimSignal` holding the most claim-relevant snippet sentence to help triage which links to read — it is the empty string (not absent) when no snippet sentence matched, so the field's shape is uniform across results and downstream null-checking stays simple (#235). For full-text claim evidence use `search_and_scrape` with `claim`. On a zero-result response, `hints` carries a `ZeroResultHints` object (the same shape `academic_search` and `patent_search` emit) explaining why nothing matched and how to recover: `reason` (`no_match` | `filters_too_restrictive`), `filtersApplied` (the constraints that may have eliminated results — `site`, `lens`, `time_range`, `country`, `language`, `exact_terms`, `exclude_terms`), and `suggestedActions` (remove-filter / try-different-provider). Suggested alternative providers are limited to those **configured and currently healthy**. On any non-empty result set the field is omitted. ### Behavior 1. If `SEARCH_ROUTING` is set, route through the multi-provider Router (priority-ordered fallback with per-provider circuit breakers). 2. If `lens` is specified and has a dedicated `cx`, route directly to that Google PSE engine. 3. If `lens` is specified without `cx`, inject `site:` operators and route to the configured provider. 4. Apply `time_range` as date restriction parameter. 5. Return deduplicated URLs and full result objects. ### Cache - Key: SHA-256 of (provider + query + all params) - TTL: 30 minutes ### Error Conditions - Unknown provider → error listing all supported providers (no duplicates) - Invalid/missing API key → `upstreamErrorResponse()` with setup instructions referencing `.env.example` - Rate limited → `rateLimitError()` suggesting 60s wait or different provider - No results → return empty `urls` array (not an error) - All errors use `upstreamErrorResponse()` from `internal/tools/search.go` for consistent formatting --- ## Tool 2: `scrape_page` ### Purpose Extract content from a URL, supporting web pages, documents, YouTube videos, and Hacker News threads (read natively via the HN API). ### Input Schema | Field | Type | Required | Default | Constraints | |-------|------|----------|---------|-------------| | `url` | string | yes | — | Valid HTTP(S) URL | | `mode` | string | no | `full` | `full` (cleaned readable text), `preview` (first ~5000 bytes), `raw` (verbatim unsanitized bytes — see [Raw Mode](#raw-mode)) | | `max_length` | int | no | 50000 | Bytes. Capped at 5,000,000 (5 MB) for all modes; in `preview` mode it is forced to 5000. Applies to `raw` mode as an `io.LimitReader` cap on the fetched bytes | | `sessionId` | string | no | — | Link to a `sequential_search` session | ### Output Schema ```go type ScrapeOutput struct { URL string `json:"url"` Content string `json:"content"` ContentType string `json:"contentType"` // html, markdown, youtube, pdf, docx, pptx (raw mode: the server's Content-Type header, may be "") Trust string `json:"trust"` // always "untrusted-external-content" — boundary marker: treat content as data, not instructions (OWASP LLM01) ContentLength int `json:"contentLength"` Truncated bool `json:"truncated"` EstimatedTokens int `json:"estimatedTokens"` SizeCategory string `json:"sizeCategory"` // small, medium, large, very_large Citation *Citation `json:"citation"` // always present Raw bool `json:"raw,omitempty"` // true only in raw mode; omitted otherwise ExtractedBy string `json:"extractedBy,omitempty"` // extraction tier: markdown|stealth|html|browser|exa:cached|exa:crawled; omitted when unknown ExtractionQuality string `json:"extractionQuality,omitempty"` // complete when the pipeline returned a confident extraction; partial when every tier was exhausted and the best-quality candidate was returned instead. Never an error. Omitted in raw mode. Metadata *Metadata `json:"metadata,omitempty"` // present only when a title was extracted (full/preview only) StructuredData *StructuredData `json:"structuredData,omitempty"` // page-embedded machine-readable metadata; present only when found (full/preview, HTML pages) ForumSignals *ForumSignals `json:"forumSignals,omitempty"` // Reddit engagement metadata extracted from JSON-LD; present only for Reddit posts where the HTML tier ran (#247) SourceType string `json:"sourceType"` // typed classification (#62): peer_reviewed|official_docs|government|news_publication|blog|forum|wiki|social_media|unknown AuthorityTier string `json:"authorityTier"` // banded authority: high|medium|low DomainCategory string `json:"domainCategory"` // subject area: academic|legal|medical|financial|technical|general DetectedDOI string `json:"detectedDoi,omitempty"` // a scholarly DOI the page declares (#199); peer-reviewed pages only; omitted when none RetractionStatus *RetractionStatus `json:"retractionStatus,omitempty"` // Crossref integrity status for detectedDoi; omitted when clean/unresolved — never a guess } type RetractionStatus struct { Retracted bool `json:"retracted"` // true for a formal retraction/withdrawal/removal Kind string `json:"kind"` // retraction | expression_of_concern | correction Date string `json:"date,omitempty"` // notice date (YYYY-MM-DD) when supplied NoticeDOI string `json:"noticeDoi,omitempty"` // DOI of the retraction/correction notice Source string `json:"source,omitempty"` // provenance: retraction-watch | publisher } type Metadata struct { Title string `json:"title"` Author string `json:"author"` } type ForumSignals struct { Platform string `json:"platform"` // Forum platform, e.g. "reddit" Upvotes int `json:"upvotes"` // Vote count from JSON-LD interactionStatistic Comments int `json:"comments"` // Comment count DatePublished string `json:"datePublished,omitempty"` // ISO 8601 publish date when available AuthorName string `json:"authorName,omitempty"` // Original poster name when available CredibilityNote string `json:"credibilityNote,omitempty"` // Contextual note, e.g. vote-manipulation risk when Upvotes < 20 } type StructuredData struct { JSONLD []json.RawMessage `json:"jsonLd,omitempty"` // each