# Enterprise Guardrails, Evaluation & Safety

> Sources: OpenAI "A Practical Guide to Building Agents" (2025), Google "Agents Companion v2" (Feb 2025)

## Guardrails — Layered Defense

No single guardrail provides sufficient protection. Use **multiple specialized guardrails** together.

### Types of Guardrails

| Type | Purpose | Implementation |
|---|---|---|
| **Relevance classifier** | Flag off-topic queries | LLM-based classifier checking scope |
| **Safety classifier** | Detect jailbreaks/prompt injections | Input validation model |
| **PII filter** | Prevent PII exposure | Regex + NER on outputs |
| **Moderation** | Flag hate/harassment/violence | Content classifier on inputs |
| **Tool safeguards** | Risk-rate each tool | Low/Medium/High based on reversibility, permissions, financial impact |
| **Rules-based** | Block known threats | Blocklists, input length limits, regex filters |
| **Output validation** | Ensure brand alignment | Prompt engineering + content checks |
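
As a rough illustration of layering, the sketch below chains a few cheap rules-based checks on input and a regex PII check on output. The limit, the blocked phrases, the email pattern, and all function names are assumptions made for this example; a real deployment would add the model-based classifiers from the table alongside them.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    guardrail: str
    reason: str


# Hypothetical limit and patterns, chosen only for illustration.
MAX_PROMPT_CHARS = 4000
BLOCKLIST = ("ignore previous instructions", "reveal your system prompt")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def length_limit(text: str) -> Optional[Verdict]:
    if len(text) > MAX_PROMPT_CHARS:
        return Verdict("rules:length", f"input exceeds {MAX_PROMPT_CHARS} characters")
    return None


def blocklist(text: str) -> Optional[Verdict]:
    lowered = text.lower()
    for phrase in BLOCKLIST:
        if phrase in lowered:
            return Verdict("rules:blocklist", f"matched blocked phrase: {phrase!r}")
    return None


def pii_filter(text: str) -> Optional[Verdict]:
    # Regex only; the table also suggests NER, which is out of scope here.
    if EMAIL_RE.search(text):
        return Verdict("pii:email", "text contains an email address")
    return None


Guardrail = Callable[[str], Optional[Verdict]]


def run_guardrails(text: str, guardrails: List[Guardrail]) -> List[Verdict]:
    """Run every guardrail and collect all violations (empty list means pass)."""
    verdicts = []
    for check in guardrails:
        verdict = check(text)
        if verdict is not None:
            verdicts.append(verdict)
    return verdicts


INPUT_GUARDRAILS = [length_limit, blocklist]
OUTPUT_GUARDRAILS = [pii_filter]
```

Collecting every verdict, rather than stopping at the first, makes it easier to see which layers fire together when reviewing real failures.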

### Tool Risk Assessment
Rate every tool the agent can access:
- **Low risk**: Read-only, no side effects (fetch data, search)
- **Medium risk**: Write access but reversible (update record, create draft)
- **High risk**: Irreversible or financial impact (send email, process payment, delete data)

High-risk actions should require:
- Human approval
- Additional confirmation steps
- Audit logging
- Rate limiting
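
One way to enforce these requirements is a gate that every tool call passes through. The sketch below is a minimal version under stated assumptions: the tool names, the `ToolGate` class, and the `approve` callback (standing in for a human approval step) are all hypothetical, and rate limiting is left out for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical tool names, rated using the three levels described above.
TOOL_RISK: Dict[str, Risk] = {
    "search_docs": Risk.LOW,
    "update_record": Risk.MEDIUM,
    "process_payment": Risk.HIGH,
}


@dataclass
class ToolGate:
    approve: Callable[[str, dict], bool]  # assumed human-approval hook
    audit_log: List[dict] = field(default_factory=list)

    def call(self, name: str, args: dict, tool: Callable[..., object]) -> object:
        # Unrated tools are treated as high risk rather than silently allowed.
        risk = TOOL_RISK.get(name, Risk.HIGH)
        if risk is Risk.HIGH:
            approved = self.approve(name, args)
            # Record the decision whether or not the call goes ahead.
            self.audit_log.append({"tool": name, "args": args, "approved": approved})
            if not approved:
                raise PermissionError(f"{name} was not approved by a human reviewer")
        return tool(**args)
```

Defaulting unknown tools to high risk means a newly added tool cannot bypass approval just because nobody rated it yet.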

### Building Guardrails — Heuristic
1. Start with data privacy and content safety
2. Add guardrails based on real-world edge cases and failures
3. Optimize for both security AND user experience
4. Iterate as your agent evolves

## Evaluation Framework

### Three Levels of Evaluation

**Level 1 — Agent Capabilities**
- Can it understand instructions correctly?
- Can it reason logically?
- Can it use tools appropriately?
- Benchmark with public leaderboards (BFCL, τ-bench, PlanBench, AgentBench)

**Level 2 — Trajectory (Steps Taken)**
- Did it take the right steps?
- Were the tool calls correct and efficient?
- 6 metrics: Exact match, In-order match, Any-order match, Precision, Recall, Single-tool use
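
The six trajectory metrics can be sketched over lists of tool names. This is one plausible reading of them, not a reference implementation: the function names are made up here, and the handling of duplicates and empty trajectories is an assumption.

```python
from collections import Counter
from typing import List


def exact_match(predicted: List[str], reference: List[str]) -> bool:
    """Predicted trajectory is identical to the reference."""
    return predicted == reference


def in_order_match(predicted: List[str], reference: List[str]) -> bool:
    """Reference steps appear in order; extra predicted steps are allowed."""
    remaining = iter(predicted)
    # `step in remaining` advances the iterator, so order is enforced.
    return all(step in remaining for step in reference)


def any_order_match(predicted: List[str], reference: List[str]) -> bool:
    """All reference steps appear, in any order; extras are allowed."""
    missing = Counter(reference) - Counter(predicted)
    return not missing


def precision(predicted: List[str], reference: List[str]) -> float:
    """Share of predicted calls that also occur in the reference."""
    if not predicted:
        return 0.0  # assumption: an empty trajectory scores zero
    wanted = set(reference)
    return sum(step in wanted for step in predicted) / len(predicted)


def recall(predicted: List[str], reference: List[str]) -> float:
    """Share of reference calls that the agent actually made."""
    if not reference:
        return 0.0  # assumption: an empty reference scores zero
    made = set(predicted)
    return sum(step in made for step in reference) / len(reference)


def single_tool_use(predicted: List[str], tool: str) -> bool:
    """Whether one specific tool was called at all."""
    return tool in predicted
```

For example, with reference `["search", "summarize", "email"]` and predicted `["search", "lookup", "summarize", "email"]`, exact match fails, in-order and any-order match pass, precision is 0.75, and recall is 1.0.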

**Level 3 — Final Response**
- Does the output achieve the goal?
- Is it accurate, relevant, correct?
- Use an **autorater** (LLM-as-judge) with defined criteria

### Human-in-the-Loop Evaluation
Human review remains essential. Its key benefits:
- **Subjectivity**: Humans can judge qualities that are hard to quantify, such as creativity, common sense, and nuance
- **Context**: Humans can weigh the broader implications of agent actions
- **Iterative improvement**: Human feedback gives rich signal for refinement
- **Evaluator calibration**: Human judgments are the reference for tuning autoraters

Methods: Direct assessment, comparative evaluation, user studies
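
Evaluator calibration can be made concrete by comparing autorater labels with human labels on the same samples. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance); the function names and the pass/fail labels are assumptions, and kappa is one common choice rather than something the sources prescribe.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(human: Sequence[str], auto: Sequence[str]) -> float:
    """Fraction of samples where the autorater matches the human label."""
    if len(human) != len(auto) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == a for h, a in zip(human, auto)) / len(human)


def cohens_kappa(human: Sequence[str], auto: Sequence[str]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = agreement_rate(human, auto)
    h_counts, a_counts = Counter(human), Counter(auto)
    # Chance agreement: probability both raters pick the same label at random.
    p_e = sum(h_counts[label] * a_counts[label] for label in h_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

A low kappa despite high raw agreement usually means one label dominates the sample, which is a hint to rebalance the calibration set before trusting the autorater.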

### Evaluation Method Comparison

| Method | Strengths | Weaknesses |
|---|---|---|
| **Human** | Captures nuance, considers human factors | Subjective, expensive, doesn't scale |
| **LLM-as-Judge** | Scalable, efficient, consistent | May miss intermediate steps, limited by the judge model's capabilities |
| **Automated Metrics** | Objective, scalable, efficient | May not capture full capabilities, gameable |

## Success Metrics

### Metric Hierarchy
1. **Business KPIs** (north star) — revenue, engagement, conversion
2. **Goal completion rate** — agent achieves its objective
3. **Critical task success** — key milestones completed
4. **Application telemetry** — latency, errors, throughput
5. **Human feedback** — 👍👎, surveys, in-context feedback
6. **Trace/observability** — full audit trail of agent decisions
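
The middle of this hierarchy can be computed from per-session records. The sketch below assumes a hypothetical `Session` record with the fields shown; business KPIs and full traces live in other systems and are not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Session:
    # Hypothetical per-session record; field names are illustrative.
    goal_completed: bool
    milestones_done: int
    milestones_total: int
    latency_ms: float
    thumbs_up: Optional[bool] = None  # None when the user left no feedback


def summarize(sessions: List[Session]) -> Dict[str, Optional[float]]:
    if not sessions:
        raise ValueError("need at least one session")
    n = len(sessions)
    rated = [s.thumbs_up for s in sessions if s.thumbs_up is not None]
    total_milestones = sum(s.milestones_total for s in sessions)
    return {
        "goal_completion_rate": sum(s.goal_completed for s in sessions) / n,
        "critical_task_success": (
            sum(s.milestones_done for s in sessions) / total_milestones
            if total_milestones
            else None
        ),
        "mean_latency_ms": sum(s.latency_ms for s in sessions) / n,
        # Only sessions with explicit feedback count toward this rate.
        "thumbs_up_rate": sum(rated) / len(rated) if rated else None,
    }
```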

### Multi-Agent Metrics (Additional)
- **Cooperation & Coordination**: How well do agents work together?
- **Planning & Task Assignment**: Was the plan right, and did the agents stick to it?
- **Agent Utilization**: How effectively do agents select the right peer?
- **Scalability**: Does quality improve as more agents are added?

## Human Intervention Design

### When to Escalate to Human
1. **Exceeding failure thresholds**: Set limits on retries/actions; escalate when exceeded
2. **High-risk actions**: Sensitive, irreversible, or high-stakes operations
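
Both triggers fit in a small policy object that the agent loop consults before each step. The thresholds and names below are assumptions for illustration; appropriate limits depend on the workflow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EscalationPolicy:
    max_retries: int = 3   # hypothetical threshold
    max_actions: int = 20  # hypothetical threshold

    def reason_to_escalate(
        self, retries: int, actions: int, high_risk: bool
    ) -> Optional[str]:
        """Return why a human should take over, or None to let the agent continue."""
        if high_risk:
            return "high-risk action"
        if retries > self.max_retries:
            return "retry limit exceeded"
        if actions > self.max_actions:
            return "action limit exceeded"
        return None
```

Returning a reason string rather than a bare boolean gives the human reviewer context and makes escalations easy to count by cause.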

### Intervention Principles
- Intervention is most critical early in deployment, when it helps surface failures and edge cases
- It establishes an evaluation feedback cycle
- As confidence in the agent grows, reduce intervention frequency
- Always keep an option for the user to regain control

## Enterprise Security (OpenAI)

- **Data ownership**: Enterprise retains full ownership
- **Encryption**: In transit and at rest; compliance attestations include SOC 2 Type 2 and CSA STAR
- **Access controls**: Granular, role-based (who sees/manages data)
- **Flexible retention**: Adjust logging/storage to organizational policies
- **Model isolation**: Customer data is kept separate from model training data

## Practical Guardrail Checklist for Aurion Studio

### Input Guardrails
- [x] Zod validation on all API inputs (already done)
- [ ] Input length limits on user prompts
- [x] Rate limiting per user/IP (already done, but in-memory, so limits reset on restart and are not shared across instances)
- [x] CSRF origin check (already done)
- [x] Sanitize HTML inputs (already done)

### Output Guardrails
- [x] Sanitize generated code before preview (sanitizeForPreview)
- [ ] CSP headers for preview iframe
- [ ] Output length limits on AI responses
- [ ] Content moderation on generated output
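
For the two unchecked items on CSP and output length, a starting point might look like the sketch below. Everything here is an assumption: the directive values are a deliberately strict baseline that presumes previews need only inline scripts and styles, the 20,000-character cap is arbitrary, and the helper names are hypothetical. The real policy depends on what generated previews must load.

```python
from typing import Dict, List

# Hypothetical strict baseline for the preview document's CSP header.
PREVIEW_CSP_DIRECTIVES: Dict[str, List[str]] = {
    "default-src": ["'none'"],
    "script-src": ["'unsafe-inline'"],  # generated previews use inline scripts
    "style-src": ["'unsafe-inline'"],
    "img-src": ["data:", "https:"],
    "connect-src": ["'none'"],          # no network calls from the preview
    "frame-ancestors": ["'self'"],      # only our own pages may embed it
}

MAX_OUTPUT_CHARS = 20_000  # arbitrary illustrative cap


def build_csp(directives: Dict[str, List[str]]) -> str:
    """Serialize directives into a Content-Security-Policy header value."""
    return "; ".join(f"{name} {' '.join(values)}" for name, values in directives.items())


def limit_output(text: str, max_chars: int = MAX_OUTPUT_CHARS) -> str:
    """Hard-truncate an AI response that exceeds the cap."""
    return text[:max_chars]
```

The header value would be sent as `Content-Security-Policy` on the response that serves the preview document; `frame-ancestors` is not honored when a policy is delivered through a `<meta>` tag.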

### Tool Guardrails
- [ ] Risk-rate all 46 API routes (read-only vs write vs deploy)
- [ ] Human confirmation for deploy actions
- [ ] Audit logging for sensitive operations
- [ ] API key rotation strategy

### Evaluation
- [x] Automated tests for API routes (201 tests)
- [ ] E2E tests with Playwright (TODO)
- [ ] Goal completion tracking for generated apps
- [ ] User feedback mechanism (thumbs up/down)
- [ ] Error tracing and observability