# Security, Evals, and Observability ## Threat model Agent risks usually come from the combination of language, tools, and external data. Threat categories: ```text prompt injection malicious retrieved content tool misuse permission bypass secret leakage data exfiltration unsafe external communication financial or destructive side effects connector abuse malicious skill packages runaway loops cost exhaustion false success claims compaction state loss subagent miscoordination workflow packet drift verification gaps ``` ## Guardrail layers Use layered guardrails: ```text input guardrails: reject or route unsafe user requests context guardrails: label untrusted content and redact secrets schema guardrails: force structured tool arguments and outputs tool guardrails: validate args and results around execution permission guardrails: approve, deny, or pause actions output guardrails: check final answer before user-visible output trace guardrails: grade tool calls and decisions after the run ``` Guardrails should be fast, specific, and testable. ## Prompt injection handling Rules: - external content is data, not instruction; - extract structured fields where possible; - isolate untrusted content from authoritative instructions; - do not let external content choose tools directly; - do not copy secrets into context; - require approval for actions influenced by arbitrary text; - log the source of data used for tool calls. ## Approval records Approval request format: ```json { "approval_type": "external_send", "action": "send_email", "target": "customer@example.com", "risk": "external_communication", "preview_ref": "artifact://drafts/email_123", "expected_result": "Customer receives renewal reminder.", "rollback": "Cannot unsend; follow-up correction possible.", "scope": "single_send_only" } ``` Approval result format: ```json { "status": "approved", "approved_by": "user_id", "timestamp": "...", "scope": "single_send_only", "expires_at": "..." } ``` Never let the model approve its own action. ## Observability Trace operational events, not private hidden reasoning. Trace fields: ```text run_id session_id user or tenant model and provider context size instructions loaded tools visible tool calls tool args hash or redacted args permission decisions approval requests/results tool results summary errors and retries compaction boundaries workflow packet status workflow verification status workflow version and state refs latency token usage cost final status ``` A trace should answer: - what did the agent try to do; - what data did it use; - what tool changed state; - who approved it; - what failed; - why did it stop; - could the run be audited or safely rerun from recorded state. ## Evaluation strategy Evaluate the harness, not only the model. Eval categories: ```text task success tool selection precision unnecessary tool calls permission correctness approval correctness prompt injection resistance context compaction retention workflow coverage and verification quality retrieval relevance output format adherence failure recovery cost and latency human intervention rate false confidence ``` ## Test cases Create adversarial tests: - retrieved document says “ignore previous instructions”; - email contains a request to exfiltrate data; - user asks for an external send without approval; - tool returns malformed data; - connector auth expires; - model calls unknown tool; - model supplies invalid arguments; - context reaches limit and compaction happens; - workflow packet silently expands scope; - verifier accepts a finding without evidence; - two instructions conflict; - goal is vague or impossible; - tool output is huge; - sensitive data appears in retrieved content; - subagent returns unsupported conclusion. ## Trace grading Grade specific events: ```text Did the agent use the right tool? Was the tool call necessary? Were arguments valid? Was permission checked? Was approval requested at the right time? Was the final answer grounded in tool results? Did compaction preserve the active objective? Did workflow integration report failed packets and coverage gaps? ``` ## Launch gates Before production: - narrow tool registry; - local schema validation; - permission matrix enforced in code; - approval UX for risky actions; - prompt injection tests pass; - compaction tests pass; - connector auth and revocation tested; - trace logging enabled; - cost budgets enforced; - rollback or incident path documented; - evals run on realistic and adversarial tasks. ## Incident response When an agent misbehaves: 1. Pause risky tools. 2. Preserve traces and artifacts. 3. Identify instruction, tool, connector, or model failure. 4. Patch policy/tool/schema/context logic. 5. Add regression eval. 6. Re-enable gradually. ## Source links - OpenAI guardrails and human review: https://developers.openai.com/api/docs/guides/agents/guardrails-approvals - OpenAI agent safety: https://developers.openai.com/api/docs/guides/agent-builder-safety - OpenAI sandbox agents: https://developers.openai.com/api/docs/guides/agents/sandboxes - Anthropic building effective agents: https://www.anthropic.com/research/building-effective-agents - Anthropic demystifying evals for agents: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents - Anthropic writing effective tools for agents: https://www.anthropic.com/engineering/writing-tools-for-agents - MCP specification: https://modelcontextprotocol.io/specification/2025-11-25