---
agent_id: "agent-mesh"
display_name: "Agent Mesh"
version: "1.0.0"
description: "Multi-agent orchestration mesh for complex workflows"
type: "orchestrator"
confidence_threshold: 0.9
---

# agent-mesh — Agent Development Guide

## What this is

This document defines the agent interaction model, skill definitions, and development patterns for building AI agents that integrate with the `agent-mesh` orchestrator. It complements `ARCHITECTURE.md` (which covers the orchestrator's internal design) by focusing specifically on how to build, configure, and deploy agents that participate in the multi-agent system.

**Target audience:** Engineers building MCP-compliant AI agents for enterprise orchestration platforms, platform teams integrating agents into multi-agent systems, and SREs deploying agent infrastructure at scale.

---

## Monorepo Structure

agent-mesh is organized as a pnpm monorepo with 10 packages published under the `@reaatech` scope, plus a reference deployment example.

| Package | npm Name | Purpose |
|---------|----------|---------|
| `packages/core` | [`@reaatech/agent-mesh`](https://www.npmjs.com/package/@reaatech/agent-mesh) | Core domain types, Zod schemas, env config, constants |
| `packages/registry` | [`@reaatech/agent-mesh-registry`](https://www.npmjs.com/package/@reaatech/agent-mesh-registry) | Agent YAML loader, SIGHUP hot-reload |
| `packages/session` | [`@reaatech/agent-mesh-session`](https://www.npmjs.com/package/@reaatech/agent-mesh-session) | Firestore-backed multi-turn session management |
| `packages/classifier` | [`@reaatech/agent-mesh-classifier`](https://www.npmjs.com/package/@reaatech/agent-mesh-classifier) | Gemini Flash intent classification |
| `packages/confidence` | [`@reaatech/agent-mesh-confidence`](https://www.npmjs.com/package/@reaatech/agent-mesh-confidence) | Confidence-gated routing decision tree |
| `packages/router` | [`@reaatech/agent-mesh-router`](https://www.npmjs.com/package/@reaatech/agent-mesh-router) | MCP-based agent dispatch |
| `packages/gateway` | [`@reaatech/agent-mesh-gateway`](https://www.npmjs.com/package/@reaatech/agent-mesh-gateway) | Express middleware and request handler |
| `packages/mcp-server` | [`@reaatech/agent-mesh-mcp-server`](https://www.npmjs.com/package/@reaatech/agent-mesh-mcp-server) | MCP server exposing orchestrator |
| `packages/utils` | [`@reaatech/agent-mesh-utils`](https://www.npmjs.com/package/@reaatech/agent-mesh-utils) | Circuit breaker with Firestore persistence |
| `packages/observability` | [`@reaatech/agent-mesh-observability`](https://www.npmjs.com/package/@reaatech/agent-mesh-observability) | Logging, metrics, tracing, audit |

**Toolchain:** pnpm workspaces + Turbo + Changesets + tsup + Biome + Vitest.

---

## Architecture Overview

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   AI Client     │────▶│   Orchestrator   │────▶│  Agent Registry │
│  (Claude, etc)  │     │   (agent-mesh)   │     │  (YAML configs) │
└─────────────────┘     │                  │     └─────────────────┘
                        │  ┌────────────┐  │              │
                        │  │    Rate    │  │              │
                        │  │  Limiter   │  │              │
                        │  └────────────┘  │              │
                        │  ┌────────────┐  │              ▼
                        │  │  Session   │  │     ┌─────────────────┐
                        │  │  Manager   │  │     │   Agent Pool    │
                        │  └────────────┘  │     │  (MCP servers)  │
                        │  ┌────────────┐  │     └─────────────────┘
                        │  │ Classifier │  │              ▲
                        │  │  (Gemini)  │  │              │
                        │  └────────────┘  │              │
                        │  ┌────────────┐  │              │
                        │  │ Confidence │  │              │
                        │  │    Gate    │  │              │
                        │  └────────────┘  │              │
                        │  ┌────────────┐  │              │
                        │  │  Circuit   │  │              │
                        │  │  Breaker   │  │──────────────┘
                        │  └────────────┘  │
                        └──────────────────┘
```

### Key Components

| Component | Package | Purpose |
|-----------|---------|---------|
| **Agent Registry** | `@reaatech/agent-mesh-registry` | YAML agent definitions with SIGHUP hot-reload |
| **Rate Limiter** | `@reaatech/agent-mesh-gateway` | Token bucket per-client rate limiting |
| **Session Manager** | `@reaatech/agent-mesh-session` | Firestore-backed multi-turn state |
| **Classifier** | `@reaatech/agent-mesh-classifier` | Gemini Flash intent classification |
| **Confidence Gate** | `@reaatech/agent-mesh-confidence` | Route/clarify/fallback decision tree |
| **Circuit Breaker** | `@reaatech/agent-mesh-utils` | Per-agent resilience pattern |
| **MCP Router** | `@reaatech/agent-mesh-router` | Agent dispatch via MCP protocol |
| **Gateway** | `@reaatech/agent-mesh-gateway` | Express middleware, entry handler, auth |
| **MCP Server** | `@reaatech/agent-mesh-mcp-server` | Exposes orchestrator as MCP-compliant agent |
| **Observability** | `@reaatech/agent-mesh-observability` | Winston logging, OTel tracing/metrics, audit |

---

## Agent Configuration

Agents participating in the orchestrator must be registered via YAML configuration files. The registry is loaded at startup and reloaded on `SIGHUP` without restart.

### agent.yaml Schema

```yaml
# agents/my-agent.yaml — Agent registration for orchestrator
agent_id: "my-agent"
display_name: "My Agent"
description: >-
  Detailed description of agent capabilities. This text is injected
  verbatim into the Gemini classifier prompt, so be specific about
  what this agent handles and what it doesn't.
endpoint: "${MY_AGENT_ENDPOINT:-http://localhost:8081}"
type: mcp
is_default: false
confidence_threshold: 0.7
clarification_required: false

# Shown to users when the orchestrator asks for clarification
clarification_context: >-
  I can help you with X, Y, and Z. What specifically do you need?

# Few-shot examples for the classifier — more examples = better routing
examples:
  - "Example query that should route to this agent"
  - "Another query this agent should handle"
  - "A third example showing the agent's domain"
```

### Schema Reference

| Field | Required | Type | Description |
|-------|----------|------|-------------|
| `agent_id` | yes | string | Unique identifier (lowercase, hyphens allowed) |
| `display_name` | yes | string | Human-readable name for UI and prompts |
| `description` | yes | string | Injected into classifier prompt — be precise |
| `endpoint` | yes | string | MCP server URL (must be valid HTTP/HTTPS) |
| `type` | yes | string | Always `"mcp"` |
| `is_default` | yes | boolean | Exactly one agent must be default |
| `confidence_threshold` | yes | number | 0.0–1.0, default agent must be 0.0 |
| `clarification_required` | yes | boolean | Whether to ask clarifying questions |
| `clarification_context` | no | string | Shown when clarification is needed |
| `examples` | yes | string[] | Few-shot examples for classifier |

### Invariants Enforced at Load Time

1. **Exactly one default agent** — if multiple agents have `is_default: true`, the entire reload is aborted and the old registry remains active.
2. **Default agent threshold must be 0** — the default agent always accepts fallback traffic, so its threshold is enforced to 0.0.
3. **Unique agent IDs** — duplicate IDs cause reload abort.
4. **Valid endpoint URLs** — localhost and private IP ranges (10.x, 172.16.x, 192.168.x) are rejected to prevent SSRF vulnerabilities.
5. **File size limits** — YAML files exceeding 1MB are skipped.

### Registration Process

1. Create your agent YAML file in the registry directory:

   ```bash
   cp agents/my-agent.yaml /path/to/orchestrator/agents/
   ```

2. Set the environment variable for your endpoint:

   ```bash
   export MY_AGENT_ENDPOINT=https://my-agent.cloudrun.app
   ```

3. Send `SIGHUP` to the orchestrator process to hot-reload:

   ```bash
   kill -HUP $(pgrep -f orchestrator)
   ```

4. Verify the agent is loaded:

   ```bash
   curl http://localhost:8080/health/deep | jq .agents
   ```

---

## MCP Protocol Contract

Agents must implement the MCP (Model Context Protocol) server interface. The orchestrator communicates with agents via `StreamableHTTP` transport.

### Request Format

The orchestrator sends a `handle_message` tool call with the following structure:

```json
{
  "jsonrpc": "2.0",
  "id": "request-123",
  "method": "tools/call",
  "params": {
    "name": "handle_message",
    "arguments": {
      "session_id": "550e8400-e29b-41d4-a716-446655440000",
      "request_id": "660e8400-e29b-41d4-a716-446655440001",
      "employee_id": "emp123",
      "display_name": "John Doe",
      "raw_input": "Reset my password",
      "intent_summary": "User needs password reset assistance",
      "entities": { "account_type": "okta" },
      "detected_language": "en",
      "turn_history": [
        { "role": "user", "content": "First message", "timestamp": "2026-04-15T22:00:00Z" },
        { "role": "agent", "content": "Response", "timestamp": "2026-04-15T22:00:05Z" }
      ],
      "workflow_state": {}
    }
  }
}
```

### Response Format

Agents must return a response matching this schema:

```json
{
  "jsonrpc": "2.0",
  "id": "request-123",
  "result": {
    "content": [
      {
        "type": "text",
        "text": "I can help you reset your password. Please visit..."
      }
    ]
  }
}
```

The orchestrator validates the response against this Zod schema (from `@reaatech/agent-mesh`):

```typescript
import { AgentResponseSchema } from '@reaatech/agent-mesh';

// Schema shape:
// z.object({
//   content: z.string().min(1, 'content is required'),
//   workflow_complete: z.boolean(),
//   workflow_state: z.record(z.string(), z.unknown()).optional(),
// });
```

### Response Fields

| Field | Required | Type | Description |
|-------|----------|------|-------------|
| `content` | yes | string | Human-readable response for the user |
| `workflow_complete` | yes | boolean | `true` = close session, `false` = keep open |
| `workflow_state` | no | object | Agent-managed state for multi-turn context |

### Workflow State

For multi-turn agents, use `workflow_state` to persist context across turns:

```json
{
  "content": "I've started your password reset. What's your email?",
  "workflow_complete": false,
  "workflow_state": {
    "step": "awaiting_email",
    "reset_token": "abc123"
  }
}
```

The orchestrator passes this state back on subsequent turns, allowing the agent to maintain context without the orchestrator understanding the content.

---

## Skill System

Skills are the atomic unit of agent capability. Each skill maps to one or more MCP tools and is described by a `skills/{skill-id}/skill.md` file.

### Skill File Structure

```markdown
# {skill-display-name}

## Capability
One-sentence description of what this skill enables.

## MCP Tools
| Tool | Input Schema | Output | Rate Limit |
|------|-------------|--------|------------|
| `tool_name` | Zod schema summary | Return type | RPM |

## Usage Examples
### Example 1: Basic usage
- User intent
- Tool call
- Expected response

## Error Handling
- Known failure modes
- Recovery strategies
- Escalation paths

## Security Considerations
- PII handling
- Permission requirements
- Audit logging
```

### Built-in Skills

The `agent-mesh` orchestrator exposes these skills for agent use:

| Skill ID | File | Description |
|----------|------|-------------|
| `routing` | `skills/routing/skill.md` | Intent classification and agent routing |
| `circuit-breaker` | `skills/circuit-breaker/skill.md` | Agent resilience and failure isolation |
| `session-management` | `skills/session-management/skill.md` | Multi-turn conversation state |
| `clarification` | `skills/clarification/skill.md` | User clarification when confidence is low |

### Adding a New Skill

1. **Create the skill definition:**

   ```bash
   mkdir -p skills/my-skill
   touch skills/my-skill/skill.md
   ```

2. **Implement the MCP tools** in your agent server.
3. **Update this document** with the new skill in the table above.

---

## Confidence-Gated Routing

The orchestrator uses a confidence gate to decide whether to route to your agent, ask for clarification, or fall back to the default agent.

### Decision Tree

```
1. Unknown agent_id → route to default agent
2. Default agent → always route directly (no threshold check)
3. Confidence ≥ threshold AND not ambiguous → route to your agent
4. clarification_required → generate clarification question
5. Otherwise → fall back to default agent
```

### Configuring Your Threshold

Set `confidence_threshold` in your agent YAML based on how specific your domain is:

| Threshold | Use Case |
|-----------|----------|
| 0.0–0.3 | Broad, general-purpose agent (like the default) |
| 0.4–0.6 | Moderately specific domain (IT helpdesk, HR) |
| 0.7–0.9 | Narrow, well-defined domain (password reset, expense reports) |
| 1.0 | Never recommended — you'll never receive traffic |

### Clarification Flow

If your agent has `clarification_required: true` and confidence is below threshold, the orchestrator generates a clarification question using Gemini:

1. Orchestrator detects low confidence for your agent
2. Gemini generates a targeted question in the user's language
3. Question is shown to the user
4. User's response is re-classified with the additional context
5. If confidence improves, route to your agent; otherwise fall back

This allows users to confirm their intent before being routed to specialized agents.

---

## Circuit Breaker Integration

The orchestrator implements per-agent circuit breakers to prevent cascading failures. Your agent's health directly affects whether it receives traffic.
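To make the breaker's effect on your agent's traffic concrete, here is a minimal in-memory sketch of a three-state circuit breaker (closed, open, half-open). It is an illustration under stated assumptions, not the `@reaatech/agent-mesh-utils` implementation: the `CircuitBreaker` class, its option names, and the injected clock are hypothetical, and Firestore persistence is left out.

```typescript
// Hypothetical sketch: class, option names, and clock injection are
// illustrative and are not the @reaatech/agent-mesh-utils API.
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

interface BreakerOptions {
  failureThreshold: number; // consecutive failures before the circuit opens
  resetTimeoutMs: number;   // how long to stay open before probing again
  halfOpenMaxCalls: number; // test calls admitted (and needed) in half-open
}

class CircuitBreaker {
  private state: CircuitState = 'CLOSED';
  private failures = 0;
  private halfOpenCalls = 0;
  private halfOpenSuccesses = 0;
  private openedAt = 0;
  private readonly opts: BreakerOptions;
  private readonly now: () => number;

  constructor(opts: BreakerOptions, now: () => number = () => Date.now()) {
    this.opts = opts;
    this.now = now;
  }

  /** Returns true if a request may be dispatched to the agent right now. */
  canRequest(): boolean {
    if (this.state === 'CLOSED') return true;
    if (this.state === 'OPEN') {
      // Stay open until the reset timeout has elapsed, then start probing.
      if (this.now() - this.openedAt < this.opts.resetTimeoutMs) return false;
      this.state = 'HALF_OPEN';
      this.halfOpenCalls = 0;
      this.halfOpenSuccesses = 0;
    }
    // HALF_OPEN: admit only a limited number of test calls.
    if (this.halfOpenCalls < this.opts.halfOpenMaxCalls) {
      this.halfOpenCalls += 1;
      return true;
    }
    return false;
  }

  recordSuccess(): void {
    if (this.state === 'HALF_OPEN') {
      this.halfOpenSuccesses += 1;
      // Close again once every admitted test call has succeeded.
      if (this.halfOpenSuccesses >= this.opts.halfOpenMaxCalls) this.close();
    } else {
      this.failures = 0;
    }
  }

  recordFailure(): void {
    if (this.state === 'HALF_OPEN') {
      this.trip(); // any failed probe reopens the circuit
      return;
    }
    this.failures += 1;
    if (this.failures >= this.opts.failureThreshold) this.trip();
  }

  getState(): CircuitState {
    return this.state;
  }

  private trip(): void {
    this.state = 'OPEN';
    this.openedAt = this.now();
  }

  private close(): void {
    this.state = 'CLOSED';
    this.failures = 0;
  }
}
```

In this sketch a single failed probe in the half-open state reopens the circuit, which is one common design choice; the orchestrator's actual recovery policy may differ.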
### Circuit States

| State | Behavior | When |
|-------|----------|------|
| **CLOSED** | Normal — requests pass through | Default state, agent is healthy |
| **OPEN** | Requests rejected immediately | Agent has exceeded failure threshold |
| **HALF_OPEN** | Limited test requests allowed | Testing if agent has recovered |

### Configuration

The orchestrator's circuit breaker is configured via environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `CIRCUIT_BREAKER_FAILURE_THRESHOLD` | 5 | Failures before opening |
| `CIRCUIT_BREAKER_RESET_TIMEOUT_MS` | 30000 | Time before recovery attempt |
| `CIRCUIT_BREAKER_HALF_OPEN_MAX_CALLS` | 3 | Test calls in half-open state |
| `CIRCUIT_BREAKER_HALF_OPEN_TIMEOUT_MS` | 60000 | Max time in half-open state |

### Best Practices for Agents

1. **Return proper error responses** — don't just crash or timeout
2. **Implement health checks** — the orchestrator may probe your `/health` endpoint
3. **Use timeouts** — don't let requests hang indefinitely
4. **Return `workflow_complete: true`** for terminal states to close sessions
5. **Handle retries gracefully** — the orchestrator may retry on transient failures

### State Persistence

Circuit breaker state is persisted to Firestore and survives Cloud Run restarts. This means if your agent is unhealthy, it won't immediately receive traffic when the orchestrator restarts — it must prove it has recovered first.

---

## Session Management

The orchestrator manages multi-turn sessions in Firestore. Understanding the session lifecycle helps you build agents that maintain context correctly.
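From the agent's side, maintaining context mostly means reading the `workflow_state` it is handed and returning an updated one. The sketch below shows one way a two-step handler could do that. The `handleMessage` function, the `step` values, and the trimmed-down request type are hypothetical; only the `content`, `workflow_complete`, `workflow_state`, `raw_input`, and `turn_history` field names come from this guide.

```typescript
// Hypothetical agent-side handler: a sketch, not part of any agent-mesh package.
interface TurnEntry {
  role: 'user' | 'agent';
  content: string;
  timestamp: string; // ISO-8601
}

interface AgentRequest {
  raw_input: string;
  turn_history: TurnEntry[];
  workflow_state: Record<string, unknown>;
}

interface AgentResponse {
  content: string;
  workflow_complete: boolean;
  workflow_state?: Record<string, unknown>;
}

function handleMessage(req: AgentRequest): AgentResponse {
  // An empty state bag means this is the first turn of the workflow.
  const rawStep = req.workflow_state['step'];
  const step = typeof rawStep === 'string' ? rawStep : 'start';

  if (step === 'start') {
    return {
      content: "I've started your password reset. What's your email?",
      workflow_complete: false, // keep the session open for the next turn
      workflow_state: { step: 'awaiting_email' },
    };
  }

  if (step === 'awaiting_email') {
    const email = req.raw_input.trim();
    if (!email.includes('@')) {
      // Stay on the same step so the next turn is handled here again.
      return {
        content: 'That does not look like an email address. Please try again.',
        workflow_complete: false,
        workflow_state: { step: 'awaiting_email' },
      };
    }
    return {
      content: 'A reset link has been sent to the address you provided.',
      workflow_complete: true, // terminal state: the session can be closed
    };
  }

  // Unknown step: end the workflow rather than leave the session open.
  return { content: 'Something went wrong. Please start over.', workflow_complete: true };
}
```

Keeping the handler a pure function of the request makes it easy to unit test and safe to run concurrently, since all per-conversation state travels in `workflow_state`.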
### Session Lifecycle

```
createSession   → status: 'active', ttl: now + 30m
appendTurn      → arrayUnion (transaction), ttl refreshed
updateWorkflow  → workflow_state replaced
closeSession    → status: COMPLETED | ABANDONED | ERROR
                  ttl field deleted (Firestore TTL policy GCs document)
resumeSession   → new session_id, prior turn_history carried forward
```

### Session Bypass

If a user has an active session (`status == 'active'` AND `ttl > now()`), the orchestrator **skips classification** and routes directly to the session's agent. This is a hard requirement — mid-turn messages are never re-classified.

### Turn History

Each turn is stored as:

```typescript
interface TurnEntry {
  role: 'user' | 'agent';
  content: string;
  timestamp: string; // ISO-8601
  intent_summary?: string; // For bypass-path context
}
```

Your agent receives the full turn history on each request, allowing you to maintain conversational context.

### Workflow State

Use `workflow_state` to persist agent-specific context:

```json
{
  "workflow_state": {
    "step": "collecting_info",
    "collected_fields": ["name", "email"],
    "pending_action": "password_reset"
  }
}
```

The orchestrator passes this through without interpreting it — it's your agent's private state bag.

---

## Security Model

### Input Validation

All tool inputs are validated against Zod schemas before reaching your agent. The orchestrator also sanitizes string inputs for prompt-injection patterns.

### PII Handling

- **Never log raw user input** — the orchestrator's logger redacts PII automatically
- **Never return PII in error messages** — use generic error text
- **Use the orchestrator's audit logging** for compliance-critical events

### Authentication

The orchestrator validates API keys on all inbound requests. Your agent should trust that authentication has already been performed — don't re-validate unless you have specific agent-level permissions.

### SSRF Protection

Agent endpoint URLs are validated to reject localhost and private IP ranges.
This prevents malicious agent registrations from accessing internal services.

---

## Observability

### Structured Logging

The orchestrator logs all events with `request_id` and `service` context. When building agents, follow the same pattern:

```typescript
import { logger } from '@reaatech/agent-mesh-observability';

logger.info({
  request_id: context.requestId,
  agent_id: 'my-agent',
  action: 'handle_message',
  duration_ms: elapsed,
}, 'Agent request completed');
```

### Tracing

Each tool call is traced as an OpenTelemetry span. Add custom attributes:

```typescript
import { trace } from '@opentelemetry/api';

const span = trace.getActiveSpan();
span?.setAttribute('agent.action', 'password_reset');
span?.setAttribute('agent.step', 'collecting_email');
```

### Metrics

The orchestrator exposes these default metrics:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `agent.dispatch.duration_ms` | Histogram | `agent_id` | Agent response latency |
| `agent.dispatch.errors` | Counter | `agent_id`, `error_type` | Agent error rate |
| `circuit_breaker.state` | Gauge | `agent_id` | Circuit breaker state |

Add custom metrics for your agent's key operations.

---

## Testing

### Contract Tests

The orchestrator includes contract tests that validate:

1. **Registry Contract** — YAML schema validation, invariant enforcement
2. **Protocol Contract** — MCP request/response schema compliance
3. **Routing Contract** — Decision tree correctness

Run these tests to ensure your agent is compatible:

```bash
pnpm test
```

### Agent Testing

Test your agent's MCP server independently:

```typescript
import { describe, it, expect } from 'vitest';
import { AgentResponseSchema } from '@reaatech/agent-mesh';
// Assumption: your agent exports its tool implementation from this
// (hypothetical) module path; adjust it to match your project layout.
import { handle_message } from './handle-message';

describe('my-agent', () => {
  it('should handle password reset request', async () => {
    const result = await handle_message({
      session_id: 'test-session',
      request_id: 'test-request',
      employee_id: 'emp123',
      raw_input: 'Reset my password',
      intent_summary: 'Password reset request',
      turn_history: [],
      workflow_state: {},
    });

    // Fail fast if the response does not match the orchestrator's schema.
    AgentResponseSchema.parse(result);
    expect(result.workflow_complete).toBe(false);
    expect(result.content).toContain('password');
  });
});
```

### Integration Tests

Test the full routing flow with the orchestrator:

```typescript
import { describe, it, expect } from 'vitest';

// Requires an orchestrator listening on localhost:8080 with 'test-key' accepted.
describe('Multi-agent routing', () => {
  it('should route password reset queries to my-agent', async () => {
    const response = await fetch('http://localhost:8080/v1/request', {
      method: 'POST',
      headers: {
        'x-api-key': 'test-key',
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        input: 'I need to reset my password',
        employee_id: 'emp123',
        entry_point: 'ui',
      }),
    });

    const result = await response.json();
    expect(result.agent_id).toBe('my-agent');
  });
});
```

---

## Deployment

### Environment Variables

| Variable | Required | Default | Purpose |
|----------|----------|---------|---------|
| `PORT` | no | `8080` | HTTP listen port |
| `NODE_ENV` | no | `development` | Environment name (development/production/test) |
| `GOOGLE_CLOUD_PROJECT` | yes | — | GCP project ID |
| `GOOGLE_CLOUD_REGION` | no | `us-central1` | GCP region |
| `FIRESTORE_DATABASE` | no | `(default)` | Firestore database ID |
| `VERTEX_AI_LOCATION` | no | `us-central1` | Vertex AI region |
| `VERTEX_AI_MODEL` | no | `gemini-2.0-flash` | Classification model |
| `API_KEY` | yes | — | API key for authentication |
| `API_KEY_SECRET_NAME` | no | — | Secret Manager secret name for API key |
| `SLACK_BOT_TOKEN` | no | — | Slack bot token for profile resolution |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | no | — | OTel collector endpoint |
| `LOG_LEVEL` | no | `info` | Log level (debug/info/warn/error) |
| `SESSION_TTL_MINUTES` | no | `30` | Session TTL in minutes |
| `SESSION_MAX_TURNS` | no | `100` | Maximum turns per session |
| `ENABLE_SESSION_BYPASS` | no | `true` | Enable session bypass for mid-turn messages |
| `ENABLE_CLARIFICATION` | no | `true` | Enable clarification questions when confidence is low |
| `ENABLE_CIRCUIT_BREAKER` | no | `true` | Enable per-agent circuit breakers |
| `ENABLE_RATE_LIMITING` | no | `true` | Enable rate limiting middleware |
| `RATE_LIMIT_WINDOW_MS` | no | `900000` | Rate limit window in ms (15 min default) |
| `RATE_LIMIT_MAX_REQUESTS` | no | `100` | Max requests per window |
| `CIRCUIT_BREAKER_FAILURE_THRESHOLD` | no | `5` | Failures before opening circuit |
| `CIRCUIT_BREAKER_RESET_TIMEOUT_MS` | no | `30000` | Time before recovery attempt (ms) |
| `CIRCUIT_BREAKER_HALF_OPEN_MAX_CALLS` | no | `3` | Test calls in half-open state |
| `CIRCUIT_BREAKER_HALF_OPEN_TIMEOUT_MS` | no | `60000` | Max time in half-open state (ms) |
| `CB_SYNC_INTERVAL_MS` | no | `5000` | Circuit breaker Firestore sync interval |
| `CB_LEADER_LEASE_MS` | no | `15000` | Leader lease duration (ms) |
| `AGENT_REGISTRY_DIR` | no | `./agents` | Directory containing agent YAML files |
| `MCP_REQUEST_TIMEOUT_MS` | no | `30000` | MCP request timeout (ms) |
| `MCP_MAX_RETRIES` | no | `3` | Max retries for failed MCP requests |

### Local Development

```bash
git clone https://github.com/reaatech/agent-mesh.git
cd agent-mesh
pnpm install
pnpm build
pnpm --filter @reaatech/agent-mesh-orchestrator build
GOOGLE_CLOUD_PROJECT=my-project API_KEY=dev-key node examples/orchestrator/dist/index.js
```

### Docker

```bash
docker build -t agent-mesh .
docker run -p 8080:8080 -e GOOGLE_CLOUD_PROJECT=my-project agent-mesh
```

### GCP Cloud Run

```bash
gcloud run deploy agent-mesh \
  --image gcr.io/my-project/agent-mesh:latest \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars GOOGLE_CLOUD_PROJECT=my-project
```

### Register with Orchestrator

1. Deploy your agent and get the URL
2. Create agent YAML:

   ```yaml
   agent_id: my-agent
   display_name: My Agent
   description: "My agent's capabilities..."
   endpoint: https://my-agent-xyz.run.app
   type: mcp
   is_default: false
   confidence_threshold: 0.7
   clarification_required: false
   examples:
     - "Example query"
   ```

3. Copy to orchestrator's agent registry
4. Send `SIGHUP` to reload

---

## Checklist: Production Readiness

Before deploying an agent to production:

- [ ] Agent implements MCP `handle_message` tool
- [ ] Response schema matches `AgentResponseSchema` (content, workflow_complete, workflow_state)
- [ ] Agent handles all error cases gracefully (no unhandled exceptions)
- [ ] Agent respects timeouts (configurable, default 30s)
- [ ] Agent returns `workflow_complete: true` for terminal states
- [ ] Agent uses `workflow_state` for multi-turn context
- [ ] Agent YAML is valid and passes schema validation
- [ ] Agent endpoint is reachable (not localhost/private IP)
- [ ] Agent has health check endpoint (`GET /health`)
- [ ] Agent logs are structured JSON with `request_id`
- [ ] Agent uses OpenTelemetry for tracing
- [ ] Agent contract tests pass
- [ ] Agent integration tests pass
- [ ] No PII in logs or error messages
- [ ] Agent handles concurrent requests safely

---

## References

- **ARCHITECTURE.md** — Orchestrator system design deep dive
- **README.md** — Quick start and overview
- **MCP Specification** — https://modelcontextprotocol.io/
- **skills/** — Skill definitions for orchestrator capabilities