# SmartWebFetchTool

AI-powered web content fetching and summarization tool with intelligent caching and safety features. Fetches web pages, converts HTML to Markdown, and uses AI to extract relevant information based on a user prompt.

**Features:**

- HTML to Markdown conversion for clean content processing
- 15-minute TTL cache for faster repeated access to the same URLs
- Automatic retry with exponential backoff on network failures and 5xx errors
- Optional domain safety checking via Claude's domain info API
- Configurable content length limits with automatic truncation
- Fail-open/fail-closed security modes for safety check errors
- Proper charset detection and handling
- Thread-safe concurrent cache access

## Overview

The `SmartWebFetchTool` retrieves content from URLs and processes it using an AI model for intelligent summarization. Unlike simple HTTP clients, it:

1. Fetches HTML content using HTTP GET
2. Converts HTML to clean Markdown format
3. Uses AI to extract information relevant to your specific prompt
4. Caches results for 15 minutes to avoid redundant requests
5. Automatically retries on transient failures

This tool implements `AutoCloseable` for proper resource cleanup.

## Basic Usage

```java
// Build with required ChatClient
SmartWebFetchTool webFetch = SmartWebFetchTool.builder(chatClient).build();

// Fetch and summarize web content
String result = webFetch.webFetch(
    "https://docs.spring.io/spring-ai/reference/",
    "What are the key features of Spring AI?"
);

System.out.println(result);
// Output: "Spring AI provides integration with various AI models including..."
```

## Builder Configuration

### Required Parameters

**`chatClient`** - The ChatClient instance used for AI-powered summarization

```java
SmartWebFetchTool.builder(chatClient)
    .build();
```

### Optional Parameters

| Option | Default | Description |
|--------|---------|-------------|
| `maxContentLength` | 100,000 | Maximum characters to process; longer content is truncated and a warning is logged |
| `domainSafetyCheck` | true | Enable domain safety verification before fetching |
| `failOpenOnSafetyCheckError` | true | Allow the fetch (true) or block it (false) when the safety check itself errors |
| `maxCacheSize` | 100 | Maximum number of URL+prompt combinations to cache |
| `maxRetries` | 2 | Maximum retry attempts for transient network failures |

**Example with all options:**

```java
SmartWebFetchTool webFetch = SmartWebFetchTool.builder(chatClient)
    .maxContentLength(150_000)          // Process up to 150,000 characters
    .domainSafetyCheck(true)            // Check domain safety
    .failOpenOnSafetyCheckError(true)   // Allow fetch if safety check errors
    .maxCacheSize(200)                  // Cache up to 200 entries
    .maxRetries(3)                      // Retry up to 3 times
    .build();
```

## Configuration Details

### Max Content Length

Controls the maximum number of characters processed from the fetched content. Content exceeding this limit is truncated and a warning is logged.

```java
SmartWebFetchTool.builder(chatClient)
    .maxContentLength(50_000)   // For small articles
    .build();

SmartWebFetchTool.builder(chatClient)
    .maxContentLength(200_000)  // For long documentation
    .build();
```

**Use Cases:**

- Smaller limits (50K-100K): Blog posts, news articles
- Medium limits (100K-150K): Technical documentation
- Larger limits (150K-200K): Comprehensive guides, API references

### Domain Safety Check

Verifies domain safety using Claude's domain info API before fetching content.

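
As a rough illustration of how a pre-fetch gate of this kind could combine the check result with the `domainSafetyCheck` and `failOpenOnSafetyCheckError` settings, here is a minimal sketch. It is not the tool's actual code: `SafetyGateSketch`, `mayFetch`, and the `Predicate` standing in for the remote lookup are hypothetical names chosen for this example.

```java
import java.util.function.Predicate;

/** Hypothetical sketch of a pre-fetch safety gate; not the tool's real implementation. */
public class SafetyGateSketch {

    /**
     * @param checkEnabled mirrors the domainSafetyCheck option
     * @param failOpen     mirrors the failOpenOnSafetyCheckError option
     * @param domainCheck  stand-in for the remote lookup: returns true if the domain
     *                     may be fetched, throws if the lookup itself fails
     * @param domain       the domain extracted from the URL
     */
    static boolean mayFetch(boolean checkEnabled, boolean failOpen,
                            Predicate<String> domainCheck, String domain) {
        if (!checkEnabled) {
            return true; // checks disabled: always allow the fetch
        }
        try {
            return domainCheck.test(domain); // an explicit verdict always wins
        } catch (RuntimeException e) {
            return failOpen; // lookup error: allow only in fail-open mode
        }
    }

    public static void main(String[] args) {
        // Simulates the safety service being unavailable
        Predicate<String> broken = d -> { throw new IllegalStateException("service down"); };

        System.out.println(mayFetch(true, true, broken, "example.com"));
        System.out.println(mayFetch(true, false, broken, "example.com"));
        System.out.println(mayFetch(true, true, d -> false, "example.com"));
    }
}
```

Note that in this sketch an explicit "unsafe" verdict blocks the fetch regardless of the fail-open setting; only errors while performing the check are affected by it.
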
```java
// Enable safety checks (default)
SmartWebFetchTool.builder(chatClient)
    .domainSafetyCheck(true)
    .build();

// Disable for trusted internal URLs
SmartWebFetchTool.builder(chatClient)
    .domainSafetyCheck(false)
    .build();
```

**When to disable:**

- Internal company documentation
- Localhost development servers
- Known trusted domains in controlled environments

### Fail-Open vs Fail-Closed

Controls behavior when the domain safety check itself encounters an error (that is, the check could not be performed, as opposed to the check completing and reporting the domain as unsafe).

```java
// Fail-open: Allow fetch if safety check errors (default, more permissive)
SmartWebFetchTool.builder(chatClient)
    .failOpenOnSafetyCheckError(true)
    .build();

// Fail-closed: Block fetch if safety check errors (more secure)
SmartWebFetchTool.builder(chatClient)
    .failOpenOnSafetyCheckError(false)
    .build();
```

**Security Trade-offs:**

- **Fail-open (true)**: Better availability, accepts risk if safety service is down
- **Fail-closed (false)**: Better security, blocks all fetches if safety service fails

### Max Retries

Configures automatic retry attempts for network failures and 5xx server errors with exponential backoff.

```java
SmartWebFetchTool.builder(chatClient)
    .maxRetries(0)  // No retries, fail immediately
    .build();

SmartWebFetchTool.builder(chatClient)
    .maxRetries(2)  // Default: retry twice (3 total attempts)
    .build();

SmartWebFetchTool.builder(chatClient)
    .maxRetries(5)  // Aggressive retries for unreliable networks
    .build();
```

**Backoff Strategy:**

- Attempt 1: Immediate
- Attempt 2: Wait 1 second
- Attempt 3: Wait 2 seconds
- Attempt 4: Wait 4 seconds
- Attempt N (N ≥ 2): Wait 2^(N-2) seconds

## Caching Behavior

The tool caches results in memory to improve performance and avoid redundant network requests.

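
To make the caching idea concrete, here is a minimal sketch of a thread-safe TTL cache keyed on URL and prompt. It is an illustration under assumptions, not the tool's implementation: `TtlCacheSketch`, its `get` method, and the injectable clock are hypothetical names, size-based eviction (`maxCacheSize`) is left out for brevity, and Java 17 is assumed for the `record` syntax. The key layout follows the documented `url::prompt::promptHashCode` format.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical TTL cache sketch; the tool's real cache may differ. */
public class TtlCacheSketch {

    /** A cached value together with the time at which it stops being valid. */
    private record Entry(String value, long expiresAtMillis) {}

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so expiry can be tested

    TtlCacheSketch(Duration ttl, LongSupplier clock) {
        this.ttlMillis = ttl.toMillis();
        this.clock = clock;
    }

    /** Builds a key in the documented url::prompt::promptHashCode layout. */
    static String key(String url, String prompt) {
        return url + "::" + prompt + "::" + prompt.hashCode();
    }

    /** Returns the cached value, or loads and stores it if absent or expired. */
    String get(String url, String prompt, Supplier<String> loader) {
        long now = clock.getAsLong();
        // compute() is atomic per key, so concurrent callers see a consistent entry
        Entry entry = entries.compute(key(url, prompt), (k, old) ->
            (old != null && old.expiresAtMillis() > now)
                ? old
                : new Entry(loader.get(), now + ttlMillis));
        return entry.value();
    }
}
```

A caller would construct it with something like `new TtlCacheSketch(Duration.ofMinutes(15), System::currentTimeMillis)` and pass the fetch-and-summarize step as the `loader`. A production cache would likely avoid running a slow loader inside `compute`, since that holds up other updates to the same map bin.
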
### Cache Key Structure

Cache keys include **both** the URL and the prompt:

```
url::prompt::promptHashCode
```

**Example:**

```java
// These create DIFFERENT cache entries
webFetch.webFetch("https://example.com", "What is the main topic?");
webFetch.webFetch("https://example.com", "List all features");

// This reuses the FIRST cache entry (same URL + prompt)
webFetch.webFetch("https://example.com", "What is the main topic?");
```

### Time-To-Live (TTL)

- **TTL**: 15 minutes per cache entry
- **Cleanup**: Automatic when cache size exceeds `maxCacheSize`
- **Thread Safety**: Concurrent access is safe

### Cache Management

```java
// Configure cache size
SmartWebFetchTool webFetch = SmartWebFetchTool.builder(chatClient)
    .maxCacheSize(500)  // Cache up to 500 URL+prompt combinations
    .build();

// Cache is automatically cleared on close
try (SmartWebFetchTool tool = SmartWebFetchTool.builder(chatClient).build()) {
    // Use tool
}  // Cache cleared here
```

## Error Handling

The tool reports failures as descriptive messages returned from `webFetch`.

### Common Error Scenarios

**Invalid URL:**

```java
webFetch.webFetch("not-a-url", "Summarize");
// Returns: "Error: Invalid URL format. Please provide a fully-formed URL (e.g., https://example.com)"
```

**Empty URL:**

```java
webFetch.webFetch("", "Summarize");
// Returns: "Error: URL cannot be empty or null"
```

**Network Error:**

```java
webFetch.webFetch("https://nonexistent-domain-xyz123.com", "Summarize");
// Returns: "Error fetching URL: Network error while fetching URL: ..."
```

**HTTP Error:**

```java
webFetch.webFetch("https://example.com/404-page", "Summarize");
// Returns: "Error: Failed to fetch URL. HTTP status code: 404"
```

**Domain Safety Failure:**

```java
SmartWebFetchTool webFetch = SmartWebFetchTool.builder(chatClient)
    .domainSafetyCheck(true)
    .build();

webFetch.webFetch("https://unsafe-domain.com", "Summarize");
// Returns: "Domain safety check failed for URL 'https://unsafe-domain.com': The domain is not safe to fetch content from."
```

### Retry Behavior

The tool automatically retries on:

- Network errors (IOException)
- Server errors (5xx status codes)

**It does NOT retry on:**

- 4xx client errors (bad request, not found, unauthorized, etc.)
- Invalid URL format
- Failed domain safety checks
- Interrupted requests

## Resource Management

The tool implements `AutoCloseable` for proper cleanup.

### Try-with-Resources (Recommended)

```java
try (SmartWebFetchTool webFetch = SmartWebFetchTool.builder(chatClient).build()) {
    String result = webFetch.webFetch(url, prompt);
    System.out.println(result);
}  // Cache automatically cleared, resources released
```

### Manual Cleanup

```java
SmartWebFetchTool webFetch = SmartWebFetchTool.builder(chatClient).build();
try {
    String result = webFetch.webFetch(url, prompt);
} finally {
    webFetch.close();  // Clear cache
}
```

## Integration Examples

### Spring Boot Configuration

```java
@Configuration
public class ToolsConfig {

    @Bean
    public SmartWebFetchTool smartWebFetchTool(ChatClient.Builder chatClientBuilder) {
        ChatClient chatClient = chatClientBuilder.build();

        return SmartWebFetchTool.builder(chatClient)
            .maxContentLength(150_000)
            .domainSafetyCheck(true)
            .failOpenOnSafetyCheckError(true)
            .maxCacheSize(100)
            .maxRetries(2)
            .build();
    }
}
```

### ChatClient Integration

```java
// Build a separate client for the tool's own summarization calls;
// a variable cannot be referenced inside its own initializer
ChatClient summarizerClient = chatClientBuilder.clone().build();

ChatClient chatClient = chatClientBuilder
    .defaultTools(SmartWebFetchTool.builder(summarizerClient)
        .domainSafetyCheck(false)  // Disable for internal docs
        .maxRetries(3)             // More retries for reliability
        .build())
    .build();

// AI can now use web fetch automatically
String response = chatClient.prompt()
    .user("Search for Spring AI documentation and tell me about vector stores")
    .call()
    .content();
```

### Custom Prompts

```java
SmartWebFetchTool webFetch = SmartWebFetchTool.builder(chatClient).build();

// Extract specific information
String features = webFetch.webFetch(
    "https://spring.io/projects/spring-ai",
    "List all supported AI model providers"
);

// Extract pricing details
String comparison = webFetch.webFetch(
    "https://example.com/product-a",
    "What are the pricing tiers and features for each tier?"
);

// Technical analysis
String analysis = webFetch.webFetch(
    "https://github.com/spring-projects/spring-ai",
    "What programming languages and frameworks are used in this project?"
);
```

## Advanced Use Cases

### Multiple Sources

```java
SmartWebFetchTool webFetch = SmartWebFetchTool.builder(chatClient)
    .maxCacheSize(500)  // Large cache for multiple URLs
    .build();

String[] urls = {
    "https://docs.spring.io/spring-ai/reference/",
    "https://docs.spring.io/spring-boot/reference/",
    "https://docs.spring.io/spring-framework/reference/"
};

for (String url : urls) {
    String summary = webFetch.webFetch(url, "What are the main features?");
    System.out.println("Summary for " + url + ":\n" + summary + "\n");
}
```

### Different Prompts Same URL

```java
String url = "https://example.com/api-docs";

// Cache miss - fetches and caches
String overview = webFetch.webFetch(url, "Provide an overview");

// Cache miss - different prompt, fetches again
String endpoints = webFetch.webFetch(url, "List all API endpoints");

// Cache hit - same URL and prompt
String overview2 = webFetch.webFetch(url, "Provide an overview");
```

### Internal Documentation

```java
// Optimized for internal trusted sources
SmartWebFetchTool internalWebFetch = SmartWebFetchTool.builder(chatClient)
    .domainSafetyCheck(false)   // Skip safety for internal URLs
    .maxRetries(1)              // Fewer retries for fast network
    .maxContentLength(200_000)  // Large docs expected
    .build();

String docs = internalWebFetch.webFetch(
    "http://internal-docs.company.local/api-spec",
    "Summarize the authentication requirements"
);
```

## Security Considerations

### Domain Safety API

The tool uses Claude's domain info API (`https://claude.ai/api/web/domain_info`) to verify domain safety before fetching.

**Safety Check Process:**

1. Extract domain from URL
2. Query Claude's API with domain
3. Receive `can_fetch` boolean response
4. Block or allow based on response and configuration

**Disable if:**

- Fetching from trusted internal domains
- Behind corporate firewall with controlled access
- Using allowlist of known-safe domains

### Read-Only Operations

The tool only performs HTTP GET requests and does not:

- Modify any local files
- Send data to fetched URLs (except HTTP headers)
- Execute JavaScript or active content
- Store credentials or sensitive data

### User-Agent and Headers

Standard browser headers are sent for compatibility:

- User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)...
- Accept: text/html,application/xhtml+xml,application/xml
- Accept-Language: en-US,en;q=0.5

## Performance Tips

### Optimize Cache Size

```java
// Small application, limited URLs
SmartWebFetchTool.builder(chatClient).maxCacheSize(50).build();

// Large application, many URLs/prompts
SmartWebFetchTool.builder(chatClient).maxCacheSize(500).build();
```

### Content Length Limits

```java
// Fast responses for short content
SmartWebFetchTool.builder(chatClient).maxContentLength(50_000).build();

// Comprehensive extraction for long content
SmartWebFetchTool.builder(chatClient).maxContentLength(200_000).build();
```

### Retry Strategy

```java
// Fast-fail for time-sensitive operations
SmartWebFetchTool.builder(chatClient).maxRetries(0).build();

// Resilient for unreliable networks
SmartWebFetchTool.builder(chatClient).maxRetries(5).build();
```

## Limitations

- **Read-only**: Only HTTP GET requests supported
- **No authentication**: Basic auth, OAuth, or API keys not supported in headers
- **No cookies**: Stateless requests, no session management
- **No JavaScript**: Static HTML only, no dynamic content rendering
- **No redirects to different hosts**: Automatically follows same-host redirects only
- **Text content**: Optimized for HTML/text, binary content not supported
- **English-focused**: AI summarization works best with English content

## Troubleshooting

### "Domain safety check failed"

- Disable safety checks if fetching internal/trusted URLs
- Set `failOpenOnSafetyCheckError(true)` to allow fetch on check errors

### "Content too long, truncating"

- Increase `maxContentLength` if you need more content
- Or refine your prompt to extract specific information

### "Failed after N attempts"

- Check network connectivity
- Verify URL is accessible
- Increase `maxRetries` for unreliable connections

### Cache not working as expected

- Remember that the cache key includes both the URL AND the prompt
- Check if 15-minute TTL has expired
- Verify cache hasn't exceeded `maxCacheSize` (causing eviction)

## See Also

- [FileSystemTools](FileSystemTools.md) - For file operations
- [ShellTools](ShellTools.md) - For shell command execution
- [BraveWebSearchTool](BraveWebSearchTool.md) - For web search capabilities