# MCP Best Practices for Service Status Communication

This document outlines best practices for communicating API service status to AI clients through the Model Context Protocol (MCP).

## Overview

According to MCP documentation:

> "Tool errors should be reported within the result object, not as MCP protocol-level errors. This allows the LLM to see and potentially take corrective action or request human intervention."

The goal is to ensure AI clients can:

1. **Detect** when a service is offline or degraded
2. **Understand** the nature and scope of the issue
3. **Communicate** the problem clearly to end users
4. **Take action** to resolve or work around the issue

## MCP Error Handling Patterns

### Pattern 1: Tool-Level Error Responses (Current Implementation)

**When to use:** For errors that occur during tool execution

**Implementation:**

```typescript
try {
  const result = await someAPICall();
  return {
    content: [{ type: 'text', text: formatResult(result) }]
  };
} catch (error) {
  // `error` is `unknown` under strict TypeScript, so narrow it before reading `.message`
  const message = error instanceof Error ? error.message : String(error);
  return {
    content: [{
      type: 'text',
      text: `Error: ${message}\n\n` +
            `Check service status: https://status.example.com`
    }],
    isError: true // Critical: tells AI this is an error
  };
}
```

**Why this works:**

- The `isError: true` flag signals to the AI that something went wrong
- The error message is visible to the AI for interpretation
- The AI can decide whether to retry, check status, or inform the user

### Pattern 2: Enhanced Tool Descriptions

**When to use:** Always - guides AI behavior proactively

**Implementation:**

```typescript
{
  name: 'get_weather',
  description: 'Get weather data for a location. ' +
               'If this tool returns an error, check the error message ' +
               'for status page links and consider using check_service_status ' +
               'to verify API availability.',
  inputSchema: { /* ... */ }
}
```

**Why this works:**

- AI reads tool descriptions before calling tools
- Provides guidance on error handling strategies
- Encourages proactive status checking

### Pattern 3: Dedicated Status Check Tool (Recommended)

**When to use:** When you want AI to proactively verify service health

**Implementation:**

```typescript
{
  name: 'check_service_status',
  description: 'Check the operational status of weather APIs. ' +
               'Use this when experiencing errors or to proactively verify ' +
               'service availability before making weather data requests.',
  inputSchema: {
    type: 'object',
    properties: {},
    required: []
  }
}
```

**Why this works:**

- AI can call this tool without user prompting when errors occur
- Provides structured status information
- Enables intelligent error recovery (e.g., "Service is down, I'll check back later")

### Pattern 4: Resource-Based Status (Advanced - Optional)

**When to use:** For real-time status that AI can check without calling a tool

**Implementation:**

```typescript
import {
  ListResourcesRequestSchema,
  ReadResourceRequestSchema
} from '@modelcontextprotocol/sdk/types.js';

// `server`, `noaaService`, and `openMeteoService` are assumed to be created elsewhere.

// Advertise the status resource so clients can discover it
server.setRequestHandler(ListResourcesRequestSchema, async () => {
  return {
    resources: [
      {
        uri: 'status://api-health',
        name: 'API Health Status',
        description: 'Real-time operational status of weather APIs',
        mimeType: 'application/json'
      }
    ]
  };
});

// Return the current status of both services as JSON
server.setRequestHandler(ReadResourceRequestSchema, async (request) => {
  if (request.params.uri === 'status://api-health') {
    // The two checks are independent, so run them concurrently
    const [noaaStatus, openMeteoStatus] = await Promise.all([
      noaaService.checkServiceStatus(),
      openMeteoService.checkServiceStatus()
    ]);

    return {
      contents: [{
        uri: 'status://api-health',
        mimeType: 'application/json',
        text: JSON.stringify({
          noaa: noaaStatus,
          openMeteo: openMeteoStatus,
          timestamp: new Date().toISOString()
        }, null, 2)
      }]
    };
  }

  throw new Error('Resource not found');
});
```

**Why this works:**

- Status is exposed as readable data, so no tool invocation is needed
- Clients that support resources can check status without explicit user commands
- Enables background monitoring and proactive error handling

## Error Message Best Practices

### 1. Structure Your Error Messages

**Good:**

```
NOAA API server error: Service temporarily unavailable

The NOAA Weather API may be experiencing an outage.

Check service status:
- Planned outages: https://weather-gov.github.io/api/planned-outages
- Service notices: https://www.weather.gov/notification
- Report issues: nco.ops@noaa.gov or (301) 683-1518
```

**Why it's good:**

- Clear problem statement
- Context about what might be wrong
- Actionable links
- Multiple ways to get information

**Bad:**

```
Error: 503 Service Unavailable
```

**Why it's bad:**

- No context
- No guidance on next steps
- AI can't communicate this effectively to users

### 2. Include Structured Information

Even though MCP errors are text-based, you can include structured information that AI can parse:

```typescript
const errorMessage = [
  `Error Type: SERVICE_UNAVAILABLE`,
  `Service: NOAA Weather API`,
  ``,
  `The NOAA Weather API is currently experiencing issues.`,
  ``,
  `Status Page: https://weather-gov.github.io/api/planned-outages`,
  `Estimated Resolution: Check status page for updates`,
  ``,
  `Alternative: For historical data, Open-Meteo API may still be available.`
].join('\n');
```

### 3. Categorize Errors

Help AI understand the severity and nature of errors:

```typescript
enum ErrorCategory {
  TEMPORARY = 'TEMPORARY', // Retry might work
  PERMANENT = 'PERMANENT', // Won't work without changes
  DEGRADED = 'DEGRADED',   // Partial functionality
  EXTERNAL = 'EXTERNAL'    // Outside our control
}

const errorMessage =
  `Error Category: ${ErrorCategory.EXTERNAL}\n\n` +
  `NOAA API is experiencing server issues...`;
```

### 4. Suggest Alternatives

When one service is down, guide AI to alternatives:

```typescript
const errorMessage =
  `NOAA API Error: Service temporarily unavailable\n\n` +
  `Suggestion: For historical weather data (>7 days old), ` +
  `try using get_historical_weather which uses Open-Meteo API ` +
  `(global coverage, independent service).\n\n` +
  `Status: https://weather-gov.github.io/api/planned-outages`;
```

## AI Client Behavior Expectations

Based on MCP design, here's how AI clients typically respond to errors:

### 1. Error Detection

```
Tool returns { isError: true, content: [...] }
→ AI knows something went wrong
```

### 2. Error Analysis

```
AI reads the error message text
→ Looks for patterns: "service unavailable", "timeout", "rate limit"
→ Identifies status page links
```

### 3. Response Generation

```
Good AI response:
"I tried to get the weather forecast, but the NOAA Weather API is currently
experiencing an outage. According to their status page
(https://weather-gov.github.io/api/planned-outages), they're aware of the issue.
Would you like me to check again in a few minutes, or try getting historical
data instead?"

Poor AI response:
"Error: Service unavailable."
```

### 4. Proactive Recovery

```
If tool description mentions check_service_status:
→ AI may automatically call it after errors
→ Can make informed decisions about retry timing
→ Can communicate ETA to users if status page provides it
```

## Implementation Checklist

- [x] **Tool errors use `isError: true` flag**
- [x] **Error messages include clear problem descriptions**
- [x] **Error messages contain status page links**
- [x] **Error messages suggest concrete next steps**
- [x] **Tool descriptions mention error handling strategies**
- [x] **Dedicated `check_service_status` tool available**
- [ ] **Consider: Resources for real-time status (optional)**
- [ ] **Consider: Error categorization (TEMPORARY, PERMANENT, etc.)**
- [ ] **Consider: Alternative service suggestions in errors**

## Real-World Example Flow

**Scenario:** NOAA API is down

1. **User asks:** "What's the weather in New York?"

2. **AI calls:** `get_current_conditions(lat: 40.7128, lon: -74.0060)`

3. **MCP Server returns:**

   ```json
   {
     "content": [{
       "type": "text",
       "text": "NOAA API server error: Service temporarily unavailable\n\nThe NOAA Weather API may be experiencing an outage.\n\nCheck service status:\n- Planned outages: https://weather-gov.github.io/api/planned-outages\n- Service notices: https://www.weather.gov/notification\n- Report issues: nco.ops@noaa.gov or (301) 683-1518"
     }],
     "isError": true
   }
   ```

4. **AI (good behavior):**
   - Reads error message
   - Sees it's a service outage
   - Notes the status page link
   - Decides to check service status

5. **AI calls:** `check_service_status()`

6. **MCP Server returns:** Status showing NOAA is down, Open-Meteo is up

7. **AI responds to user:**

   > "I'm unable to get the current weather for New York right now because the NOAA Weather API is experiencing an outage. According to their status page, they're aware of the issue.
   >
   > However, I can get historical weather data for New York if you'd like to know what the weather was like in recent days or weeks. Would that be helpful?"

## Benefits of This Approach

1. **Graceful Degradation:** AI can work around failures
2. **User Communication:** AI explains issues clearly in natural language
3. **Reduced Frustration:** Users get context, not cryptic errors
4. **Self-Service:** AI can check status and retry without user intervention
5. **Intelligent Recovery:** AI knows when to retry vs. when to give up

## Common Pitfalls to Avoid

❌ **Don't:** Throw protocol-level exceptions for API errors

```typescript
// Bad
throw new Error('API is down'); // Surfaces as a protocol-level error instead of a tool result the AI can read
```

✅ **Do:** Return errors within the result object

```typescript
// Good
return {
  isError: true,
  content: [{ type: 'text', text: 'API is down...' }]
};
```

❌ **Don't:** Return generic error messages

```typescript
// Bad
return {
  isError: true,
  content: [{ type: 'text', text: 'Error occurred' }]
};
```

✅ **Do:** Provide context and actionable information

```typescript
// Good
return {
  isError: true,
  content: [{ type: 'text', text: 'NOAA API unavailable. Check status: https://...' }]
};
```

❌ **Don't:** Assume AI will know what to do

```typescript
// Bad - no guidance
description: 'Get weather forecast'
```

✅ **Do:** Guide AI behavior in tool descriptions

```typescript
// Good - includes guidance
description: 'Get weather forecast. If errors occur, use check_service_status to verify API availability.'
```

## Conclusion

The best practice for MCP servers is to use a **layered approach**:

1. **Tool-level errors** with `isError: true` and detailed messages
2. **Enhanced tool descriptions** that guide AI error handling
3. **Dedicated status check tools** for proactive monitoring
4. **Optional: Resources** for real-time status without tool calls

Our current implementation uses patterns 1-3, which aligns with MCP best practices and enables AI clients to handle service outages intelligently.
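
To illustrate how the error-message practices in this document (categorization, structured fields, status links, and suggested alternatives) might be combined with tool-level `isError: true` results, here is a minimal sketch. The names `ServiceErrorInfo`, `ToolErrorResult`, `buildErrorResult`, and `withErrorResult` are hypothetical helpers invented for this example; they are not part of the MCP SDK, and the result shape is a simplified local type rather than the SDK's own.

```typescript
// Hypothetical helpers for illustration only; none of these names come from the MCP SDK.
type ErrorCategory = 'TEMPORARY' | 'PERMANENT' | 'DEGRADED' | 'EXTERNAL';

interface ServiceErrorInfo {
  category: ErrorCategory; // severity/nature of the failure
  service: string;         // human-readable service name
  summary: string;         // one-line problem statement
  statusUrl?: string;      // optional status page link
  alternative?: string;    // optional suggestion for a fallback tool or service
}

// Simplified local shape of a tool-level error result (assumption, not the SDK type).
interface ToolErrorResult {
  content: { type: 'text'; text: string }[];
  isError: true;
}

// Build a structured, text-based error message and wrap it in a tool result.
function buildErrorResult(info: ServiceErrorInfo): ToolErrorResult {
  const lines: string[] = [
    `Error Category: ${info.category}`,
    `Service: ${info.service}`,
    '',
    info.summary,
  ];
  if (info.statusUrl) lines.push('', `Status Page: ${info.statusUrl}`);
  if (info.alternative) lines.push('', `Alternative: ${info.alternative}`);
  return { content: [{ type: 'text', text: lines.join('\n') }], isError: true };
}

// Run a handler body and convert any thrown error into a tool-level error result
// instead of letting it escape as a protocol-level error.
async function withErrorResult<T>(
  run: () => Promise<T>,
  describe: (error: unknown) => ServiceErrorInfo,
): Promise<T | ToolErrorResult> {
  try {
    return await run();
  } catch (error) {
    return buildErrorResult(describe(error));
  }
}

// Example: the message an AI client would see for a NOAA outage.
const example = buildErrorResult({
  category: 'EXTERNAL',
  service: 'NOAA Weather API',
  summary: 'Service temporarily unavailable.',
  statusUrl: 'https://weather-gov.github.io/api/planned-outages',
  alternative: 'For historical data, try get_historical_weather (Open-Meteo).',
});
console.log(example.content.map((c) => c.text).join('\n'));
```

Keeping the message construction in one helper means every tool reports failures in the same field order, which should make it easier for an AI client to pick out the category, the status link, and the suggested fallback.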