--- name: ai-agents description: Production-grade AI agent patterns with MCP integration, agentic RAG, handoff orchestration, multi-layer guardrails, observability, token economics, ROI frameworks, and build-vs-not decision guidance (modern best practices) --- # AI Agents Development — Production Skill Hub **Modern Best Practices (January 2026)**: deterministic control flow, bounded tools, auditable state, MCP-based tool integration, handoff-first orchestration, multi-layer guardrails, OpenTelemetry tracing, and human-in-the-loop controls (OWASP LLM Top 10: https://owasp.org/www-project-top-10-for-large-language-model-applications/). This skill provides **production-ready operational patterns** for designing, building, evaluating, and deploying AI agents. It centralizes **procedures**, **checklists**, **decision rules**, and **templates** used across RAG agents, tool-using agents, OS agents, and multi-agent systems. No theory. No narrative. Only operational steps and templates. --- ## When to Use This Skill Codex should activate this skill whenever the user asks for: - Designing an agent (LLM-based, tool-based, OS-based, or multi-agent). - Scoping capability maturity and rollout risk for new agent behaviors. - Creating action loops, plans, workflows, or delegation logic. - Writing tool definitions, MCP tools, schemas, or validation logic. - Generating RAG pipelines, retrieval modules, or context injection. - Building memory systems (session, long-term, episodic, task). - Creating evaluation harnesses, observability plans, or safety gates. - Preparing CI/CD, rollout, deployment, or production operational specs. - Producing any template in `/references/` or `/assets/`. - Implementing MCP servers or integrating Model Context Protocol. - Setting up agent handoffs and orchestration patterns. - Configuring multi-layer guardrails and safety controls. - **Evaluating whether to build an agent** (build vs not decision). - **Calculating agent ROI**, token costs, or cost/benefit analysis. - **Assessing hallucination risk** and mitigation strategies. - **Deciding when to kill** an agent project (kill triggers). - For prompt scaffolds, retrieval tuning, or security depth, see Scope Boundaries below. ## Scope Boundaries (Use These Skills for Depth) - **Prompt scaffolds & structured outputs** → [ai-prompt-engineering](../ai-prompt-engineering/SKILL.md) - **RAG retrieval & chunking** → [ai-rag](../ai-rag/SKILL.md) - **Search tuning (BM25/HNSW/hybrid)** → [ai-rag](../ai-rag/SKILL.md) - **Security/guardrails** → [ai-mlops](../ai-mlops/SKILL.md) - **Inference optimization** → [ai-llm-inference](../ai-llm-inference/SKILL.md) ## Default Workflow (Production) - Pick an architecture with the Decision Tree (below); default to **workflow/FSM/DAG** for production. - Draft an agent spec with [`assets/core/agent-template-standard.md`](assets/core/agent-template-standard.md) (or [`assets/core/agent-template-quick.md`](assets/core/agent-template-quick.md)). - Specify tools and handoffs with JSON Schema using [`assets/tools/tool-definition.md`](assets/tools/tool-definition.md) and [`references/api-contracts-for-agents.md`](references/api-contracts-for-agents.md). - Add retrieval only when needed; start with [`assets/rag/rag-basic.md`](assets/rag/rag-basic.md) and scale via [`assets/rag/rag-advanced.md`](assets/rag/rag-advanced.md) + [`references/rag-patterns.md`](references/rag-patterns.md). - Add eval + telemetry early via [`references/evaluation-and-observability.md`](references/evaluation-and-observability.md). - Run the go/no-go gate with [`assets/checklists/agent-safety-checklist.md`](assets/checklists/agent-safety-checklist.md). - Plan deploy/rollback and safety controls via [`references/deployment-ci-cd-and-safety.md`](references/deployment-ci-cd-and-safety.md). --- ## Quick Reference | Agent Type | Core Control Flow | Interfaces | MCP/A2A | When to Use | |------------|-----------|------------|---------|-------------| | **Workflow Agent (FSM/DAG)** | Explicit state transitions | State store, tool allowlist | MCP | Deterministic, auditable flows | | **Tool-Using Agent** | Route → call tool → observe | Tool schemas, retries/timeouts | MCP | External actions (APIs, DB, files) | | **RAG Agent** | Retrieve → answer → cite | Retriever, citations, ACLs | MCP | Knowledge-grounded responses | | **Planner/Executor** | Plan → execute steps with caps | Planner prompts, step budget | MCP (+A2A) | Multi-step problems with bounded autonomy | | **Multi-Agent (Orchestrated)** | Delegate → merge → validate | Handoff contracts, eval gates | A2A | Specialization with explicit handoffs | | **OS Agent** | Observe UI → act → verify | Sandbox, UI grounding | MCP | Desktop/browser control under strict guardrails | | **Code/SWE Agent** | Branch → edit → test → PR | Repo access, CI gates | MCP | Coding tasks with review/merge controls | ### Framework Selection (2026) | Framework | Architecture | Best For | Ease | |-----------|--------------|----------|------| | **LangGraph** | Graph-based, stateful | Enterprise, compliance, auditability | Medium | | **OpenAI Agents SDK** | Tool-centric, lightweight | Fast prototyping, OpenAI ecosystem | Easy | | **Google ADK** | Code-first, multi-language | Gemini/Vertex AI, polyglot teams | Medium | | **Pydantic AI** | Type-safe, graph FSM | Production Python, type safety | Medium | | **CrewAI** | Role-based crews | Team workflows, content generation | Easiest | | **AutoGen** | Conversational | Code generation, research | Medium | | **AWS Bedrock Agents** | Managed infrastructure | Enterprise AWS, knowledge bases | Easy | See [`references/modern-best-practices.md`](references/modern-best-practices.md) for detailed framework comparison and selection guide. --- ## Decision Tree: Choosing Agent Architecture ```text What does the agent need to do? ├─ Answer questions from knowledge base? │ ├─ Simple lookup? → RAG Agent (LangChain/LlamaIndex + vector DB) │ └─ Complex multi-step? → Agentic RAG (iterative retrieval + reasoning) │ ├─ Perform external actions (APIs, tools, functions)? │ ├─ 1-3 tools, linear flow? → Tool-Using Agent (LangGraph + MCP) │ └─ Complex workflows, branching? → Planning Agent (ReAct/Plan-Execute) │ ├─ Write/modify code autonomously? │ ├─ Single file edits? → Tool-Using Agent with code tools │ └─ Multi-file, issue resolution? → Code/SWE Agent (HyperAgent pattern) │ ├─ Delegate tasks to specialists? │ ├─ Fixed workflow? → Multi-Agent Sequential (A → B → C) │ ├─ Manager-Worker? → Multi-Agent Hierarchical (Manager + Workers) │ └─ Dynamic routing? → Multi-Agent Group Chat (collaborative) │ ├─ Control desktop/browser? │ └─ OS Agent (Anthropic Computer Use + MCP for system access) │ └─ Hybrid (combination of above)? └─ Planning Agent that coordinates: - Tool-using for actions (MCP) - RAG for knowledge (MCP) - Multi-agent for delegation (A2A) - Code agents for implementation ``` **Protocol Selection**: - Use **MCP** for: Tool access, data retrieval, single-agent integration - Use **A2A** for: Agent-to-agent handoffs, multi-agent coordination, task delegation --- ## Core Concepts (Vendor-Agnostic) ### Control Flow Options - **Reactive**: direct tool routing per user request (fast, brittle if unbounded). - **Workflow (FSM/DAG)**: explicit states and transitions (default for deterministic production). - **Planner/Executor**: plan with strict budgets, then execute step-by-step (use when branching is unavoidable). - **Orchestrated multi-agent**: separate roles with validated handoffs (use when specialization is required). ### Memory Types (Tradeoffs) - **Short-term (session)**: cheap, ephemeral; best for conversational continuity. - **Episodic (task)**: scoped to a case/ticket; supports audit and replay. - **Long-term (profile/knowledge)**: high risk; requires consent, retention limits, and provenance. ### Failure Handling (Production Defaults) - **Classify errors**: retriable vs fatal vs needs-human. - **Bound retries**: max attempts, backoff, jitter; avoid retry storms. - **Fallbacks**: degraded mode, smaller model, cached answers, or safe refusal. ## Do / Avoid **Do** - Do keep state explicit and serializable (replayable runs). - Do enforce tool allowlists, scopes, and idempotency for side effects. - Do log traces/metrics for model calls and tool calls (OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/). **Avoid** - Avoid runaway autonomy (unbounded loops or step counts). - Avoid hidden state (implicit memory that cannot be audited). - Avoid untrusted tool outputs without validation/sanitization. ## Navigation: Economics & Decision Framework ### Should You Build an Agent? - **Build vs Not Decision Framework** - [`references/build-vs-not-decision.md`](references/build-vs-not-decision.md) - 10-second test (volume, cost, error tolerance) - Red flags and immediate disqualifiers - Alternatives to agents (usually better) - Full decision tree with stage gates - Kill triggers during development and post-launch - Pre-build validation checklist ### Agent ROI & Token Economics - **Agent Economics** - [`references/agent-economics.md`](references/agent-economics.md) - Token pricing by model (January 2026) - Cost per task by agent type - ROI calculation formula and tiers - Hallucination cost framework and mitigation ROI - Investment decision matrix - Monthly tracking dashboard --- ## Navigation: Core Concepts & Patterns ### Governance & Maturity - **Agent Maturity & Governance** - [`references/agent-maturity-governance.md`](references/agent-maturity-governance.md) - Capability maturity levels (L0-L4) - Identity & policy enforcement - Fleet control and registry management - Deprecation rules and kill switches ### Modern Best Practices - **Modern Best Practices** - [`references/modern-best-practices.md`](references/modern-best-practices.md) - Model Context Protocol (MCP) - Agent-to-Agent Protocol (A2A) - Agentic RAG (Dynamic Retrieval) - Multi-layer guardrails - LangGraph over LangChain - OpenTelemetry for agents ### Context Management - **Context Engineering** - [`references/context-engineering.md`](references/context-engineering.md) - Progressive disclosure - Session management - Memory provenance - Retrieval timing - Multimodal context ### Core Operational Patterns - **Operational Patterns** - [`references/operational-patterns.md`](references/operational-patterns.md) - Agent loop pattern (PLAN → ACT → OBSERVE → UPDATE) - OS agent action loop - RAG pipeline pattern - Tool specification - Memory system pattern - Multi-agent workflow - Safety & guardrails - Observability - Evaluation patterns - Deployment & CI/CD --- ## Navigation: Protocol Implementation - **MCP Practical Guide** - [`references/mcp-practical-guide.md`](references/mcp-practical-guide.md) Building MCP servers, tool integration, and standardized data access - **MCP Server Builder** - [`references/mcp-server-builder.md`](references/mcp-server-builder.md) End-to-end checklist for workflow-focused MCP servers (design → build → test) - **A2A Handoff Patterns** - [`references/a2a-handoff-patterns.md`](references/a2a-handoff-patterns.md) Agent-to-agent communication, task delegation, and coordination protocols - **Protocol Decision Tree** - [`references/protocol-decision-tree.md`](references/protocol-decision-tree.md) When to use MCP vs A2A, decision framework, and selection criteria --- ## Navigation: Agent Capabilities - **Agent Operations** - [`references/agent-operations-best-practices.md`](references/agent-operations-best-practices.md) Action loops, planning, observation, and execution patterns - **RAG Patterns** - [`references/rag-patterns.md`](references/rag-patterns.md) Contextual retrieval, agentic RAG, and hybrid search strategies - **Memory Systems** - [`references/memory-systems.md`](references/memory-systems.md) Session, long-term, episodic, and task memory architectures - **Tool Design & Validation** - [`references/tool-design-specs.md`](references/tool-design-specs.md) Tool schemas, validation, error handling, and MCP integration ### Skill Packaging & Sharing - **Skill Lifecycle** - [`references/skill-lifecycle.md`](references/skill-lifecycle.md) Scaffold, validate, package, and share skills with teams (Slack-ready) - **API Contracts for Agents** - [`references/api-contracts-for-agents.md`](references/api-contracts-for-agents.md) Request/response envelopes, safety gates, streaming/async patterns, error taxonomy - **Multi-Agent Patterns** - [`references/multi-agent-patterns.md`](references/multi-agent-patterns.md) Manager-worker, sequential, handoff, and group chat orchestration - **OS Agent Capabilities** - [`references/os-agent-capabilities.md`](references/os-agent-capabilities.md) Desktop automation, UI grounding, and computer use patterns - **Code/SWE Agents** - [`references/code-swe-agents.md`](references/code-swe-agents.md) SE 3.0 paradigm, autonomous coding patterns, SWE-Bench, HyperAgent architecture --- ## Navigation: Production Operations - **Evaluation & Observability** - [`references/evaluation-and-observability.md`](references/evaluation-and-observability.md) OpenTelemetry GenAI, metrics, LLM-as-judge, and monitoring - **Deployment, CI/CD & Safety** - [`references/deployment-ci-cd-and-safety.md`](references/deployment-ci-cd-and-safety.md) Multi-layer guardrails, HITL controls, NIST AI RMF, production checklists --- ## Navigation: Templates (Copy-Paste Ready) ### Checklists - **Agent Design & Safety Checklist** - [`assets/checklists/agent-safety-checklist.md`](assets/checklists/agent-safety-checklist.md) Go/No-Go safety gate: permissions, HITL triggers, eval gates, observability, rollback ### Core Agent Templates - **Standard Agent Template** - [`assets/core/agent-template-standard.md`](assets/core/agent-template-standard.md) Full production spec: memory, tools, RAG, evaluation, observability, safety - **Specialized Agent Template** - [`assets/core/agent-template-specialized.md`](assets/core/agent-template-specialized.md) Domain-specific agents with custom capabilities and constraints - **Quick Agent Template** - [`assets/core/agent-template-quick.md`](assets/core/agent-template-quick.md) Minimal viable agent for rapid prototyping ### RAG Templates - **Basic RAG** - [`assets/rag/rag-basic.md`](assets/rag/rag-basic.md) Simple retrieval-augmented generation pipeline - **Advanced RAG** - [`assets/rag/rag-advanced.md`](assets/rag/rag-advanced.md) Contextual retrieval, reranking, and agentic RAG patterns - **Hybrid Retrieval** - [`assets/rag/hybrid-retrieval.md`](assets/rag/hybrid-retrieval.md) Semantic + keyword search with BM25 fusion ### Tool Templates - **Tool Definition** - [`assets/tools/tool-definition.md`](assets/tools/tool-definition.md) MCP-compatible tool schemas with validation and error handling - **Tool Validation Checklist** - [`assets/tools/tool-validation-checklist.md`](assets/tools/tool-validation-checklist.md) Testing, security, and production readiness checks ### Multi-Agent Templates - **Manager-Worker Template** - [`assets/multi-agent/manager-worker-template.md`](assets/multi-agent/manager-worker-template.md) Orchestration pattern with task delegation and result aggregation - **Evaluator-Router Template** - [`assets/multi-agent/evaluator-router-template.md`](assets/multi-agent/evaluator-router-template.md) Dynamic routing with quality assessment and domain classification ### Service Layer Templates - **FastAPI Agent Service** - [`../dev-api-design/assets/fastapi/fastapi-complete-api.md`](../dev-api-design/assets/fastapi/fastapi-complete-api.md) Auth, pagination, validation, error handling; extend with model lifespan loads, SSE, background tasks --- ## External Sources Metadata - **Curated References** - [`data/sources.json`](data/sources.json) Authoritative sources spanning standards, protocols, and production agent frameworks --- ## Shared Utilities (Centralized patterns — extract, don't duplicate) - [../software-clean-code-standard/utilities/llm-utilities.md](../software-clean-code-standard/utilities/llm-utilities.md) — Token counting, streaming, cost estimation - [../software-clean-code-standard/utilities/error-handling.md](../software-clean-code-standard/utilities/error-handling.md) — Effect Result types, correlation IDs - [../software-clean-code-standard/utilities/resilience-utilities.md](../software-clean-code-standard/utilities/resilience-utilities.md) — p-retry v6, circuit breaker for API calls - [../software-clean-code-standard/utilities/logging-utilities.md](../software-clean-code-standard/utilities/logging-utilities.md) — pino v9 + OpenTelemetry integration - [../software-clean-code-standard/utilities/observability-utilities.md](../software-clean-code-standard/utilities/observability-utilities.md) — OpenTelemetry SDK, tracing, metrics - [../software-clean-code-standard/utilities/testing-utilities.md](../software-clean-code-standard/utilities/testing-utilities.md) — Test factories, fixtures, mocks - [../software-clean-code-standard/references/clean-code-standard.md](../software-clean-code-standard/references/clean-code-standard.md) — Canonical clean code rules (`CC-*`) for citation --- ## Trend Awareness Protocol **IMPORTANT**: When users ask recommendation questions about AI agents, you MUST use WebSearch to check current trends before answering. If WebSearch is unavailable, use `data/sources.json` + any available web browsing tools, and explicitly state what you verified vs assumed. ### Trigger Conditions - "What's the best agent framework for [use case]?" - "What should I use for [multi-agent/tool use/orchestration]?" - "What's the latest in AI agents?" - "Current best practices for [agent architecture/MCP/A2A]?" - "Is [LangGraph/CrewAI/AutoGen] still relevant in 2026?" - "[Agent framework A] vs [Agent framework B]?" - "Best way to build [coding agent/RAG agent/OS agent]?" - "What MCP servers are available?" ### Required Searches 1. Search: `"AI agent frameworks best practices 2026"` 2. Search: `"[LangGraph/CrewAI/AutoGen/Semantic Kernel] comparison 2026"` 3. Search: `"AI agent trends January 2026"` 4. Search: `"MCP servers available 2026"` ### What to Report After searching, provide: - **Current landscape**: What agent frameworks are popular NOW - **Emerging trends**: New patterns gaining traction (MCP, A2A, agentic coding) - **Deprecated/declining**: Frameworks or patterns losing relevance - **Recommendation**: Based on fresh data, not just static knowledge ### Example Topics (verify with fresh search) - Agent frameworks (LangGraph, CrewAI, AutoGen, Semantic Kernel, Pydantic AI) - MCP ecosystem (available servers, new integrations) - Agentic coding (Codex CLI, Claude Code, Cursor, Windsurf, Cline) - Multi-agent patterns (hierarchical, collaborative, competitive) - Tool use protocols (MCP, function calling) - Agent evaluation (SWE-Bench, AgentBench, GAIA) - OS/computer use agents (computer-use APIs, browser automation) --- ## Related Skills This skill integrates with complementary skills: ### Core Dependencies - [`../ai-llm/`](../ai-llm/SKILL.md) - LLM patterns, prompt engineering, and model selection for agents - [`../ai-rag/`](../ai-rag/SKILL.md) - Deep RAG implementation: chunking, embedding, reranking - [`../ai-prompt-engineering/`](../ai-prompt-engineering/SKILL.md) - System prompt design, few-shot patterns, reasoning strategies ### Production & Operations - [`../qa-observability/`](../qa-observability/SKILL.md) - OpenTelemetry, metrics, distributed tracing - [`../software-security-appsec/`](../software-security-appsec/SKILL.md) - OWASP Top 10, input validation, secure tool design - [`../ops-devops-platform/`](../ops-devops-platform/SKILL.md) - CI/CD pipelines, deployment strategies, infrastructure ### Supporting Patterns - [`../dev-api-design/`](../dev-api-design/SKILL.md) - REST/GraphQL design for agent APIs and tool interfaces - [`../ai-mlops/`](../ai-mlops/SKILL.md) - Model deployment, monitoring, drift detection - [`../qa-debugging/`](../qa-debugging/SKILL.md) - Agent debugging, error analysis, root cause investigation **Usage pattern**: Start here for agent architecture, then reference specialized skills for deep implementation details. --- ## Usage Notes - **Modern Standards**: Default to MCP for tools, agentic RAG for retrieval, handoff-first for multi-agent - **Lightweight SKILL.md**: Use this file for quick reference and navigation - **Drill-down resources**: Reference detailed resources for implementation guidance - **Copy-paste templates**: Use templates when the user asks for structured artifacts - **External sources**: Reference `data/sources.json` for authoritative documentation links - **No theory**: Never include theoretical explanations; only operational steps --- ## Key Modern Migrations **Traditional → Modern**: - Custom APIs → Model Context Protocol (MCP) - Static RAG → Agentic RAG with contextual retrieval - Ad-hoc handoffs → Versioned handoff APIs with JSON Schema - Single guardrail → Multi-layer defense (5+ layers) - LangChain agents → LangGraph stateful workflows - Custom observability → OpenTelemetry GenAI standards - Model-centric → Context engineering-centric --- ## AI-Native SDLC Template - Use [`assets/agent-template-ainative-sdlc.md`](assets/agent-template-ainative-sdlc.md) for the Delegate → Review → Own runbook (guardrails + outputs checklist).