# Multi-Judge Evaluation Guide **Version**: v1.0.0-beta.1 ## Table of Contents - [Overview](#overview) - [Prerequisites](#prerequisites) - [Quick Start](#quick-start) - [Configuration](#configuration) - [Common Tasks](#common-tasks) - [Configure Multiple Judges](#configure-multiple-judges) - [Choose Aggregation Strategy](#choose-aggregation-strategy) - [Set Execution Mode](#set-execution-mode) - [Create Eval Suite](#create-eval-suite) - [Examples](#examples) - [Example 1: CLI Multi-Judge Evaluation](#example-1-cli-multi-judge-evaluation) - [Example 2: YAML Eval Suite](#example-2-yaml-eval-suite) - [Troubleshooting](#troubleshooting) ## Overview Evaluate agent responses across multiple dimensions (quality, safety, cost) using different judges with configurable aggregation strategies. ## Prerequisites - Loom server running: `looms serve` - At least one agent registered - Judges configured ## Quick Start Evaluate with multiple judges: ```bash looms judge evaluate \ --agent=sql-agent \ --prompt="Generate a query to find top customers by revenue" \ --response="SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id ORDER BY SUM(amount) DESC LIMIT 10" \ --judges=quality-judge,safety-judge,cost-judge \ --aggregation=weighted-average ``` Expected output: ``` Evaluating agent output... Agent: sql-agent Judges: quality-judge, safety-judge, cost-judge Aggregation: weighted-average Judge Results: quality-judge: 92/100 (PASS) safety-judge: 95/100 (PASS) cost-judge: 78/100 (PASS) Overall: PASS (score: 88.3/100) Weighted Average: 88.3 Min Score: 78.0 Max Score: 95.0 ``` ## Configuration ### Judge Configuration Create `config/judges.yaml`: ```yaml judges: - name: quality-judge criteria: "Evaluate SQL query accuracy and completeness" dimensions: - JUDGE_DIMENSION_QUALITY weight: 2.0 min_passing_score: 80 criticality: JUDGE_CRITICALITY_CRITICAL - name: safety-judge criteria: "Check for SQL injection and unsafe operations" dimensions: - JUDGE_DIMENSION_SAFETY weight: 3.0 min_passing_score: 90 criticality: JUDGE_CRITICALITY_SAFETY_CRITICAL - name: cost-judge criteria: "Evaluate token efficiency" dimensions: - JUDGE_DIMENSION_COST weight: 1.0 min_passing_score: 75 criticality: JUDGE_CRITICALITY_NON_CRITICAL ``` ### Aggregation Strategies | Strategy | Description | Use Case | |----------|-------------|----------| | `weighted-average` | Weighted average of scores | General evaluation | | `all-must-pass` | All judges must pass | Safety-critical | | `majority-pass` | More than 50% must pass | Consensus | | `any-pass` | Any judge passing works | Experimental | | `min-score` | Use lowest score | Conservative | | `max-score` | Use highest score | Optimistic | ### Execution Modes | Mode | Description | |------|-------------| | `SYNCHRONOUS` | All judges run in parallel, block until complete | | `ASYNCHRONOUS` | All judges run in background | | `HYBRID` | Critical judges sync, non-critical async | ### Judge Dimensions | Dimension | What It Measures | |-----------|------------------| | `JUDGE_DIMENSION_QUALITY` | Accuracy, completeness | | `JUDGE_DIMENSION_SAFETY` | Security, compliance | | `JUDGE_DIMENSION_COST` | Token efficiency | | `JUDGE_DIMENSION_DOMAIN` | Domain-specific rules | | `JUDGE_DIMENSION_PERFORMANCE` | Latency, throughput | | `JUDGE_DIMENSION_USABILITY` | Clarity, user experience | ## Common Tasks ### Configure Multiple Judges Register judges from YAML: ```bash looms judge register config/judges/quality-judge.yaml looms judge register config/judges/safety-judge.yaml looms judge register config/judges/cost-judge.yaml ``` ### Choose Aggregation Strategy For safety-critical systems: ```bash looms judge evaluate \ --judges=quality-judge,safety-judge \ --aggregation=all-must-pass ``` For general evaluation: ```bash looms judge evaluate \ --judges=quality-judge,cost-judge \ --aggregation=weighted-average ``` ### Set Execution Mode Configure in eval suite YAML: ```yaml multi_judge: execution_mode: EXECUTION_MODE_HYBRID judges: - name: safety-judge criticality: JUDGE_CRITICALITY_SAFETY_CRITICAL # Runs sync - name: cost-judge criticality: JUDGE_CRITICALITY_NON_CRITICAL # Runs async ``` ### Create Eval Suite Create `eval-suites/sql-evaluation.yaml`: ```yaml apiVersion: loom/v1 kind: EvalSuite metadata: name: "SQL Agent Evaluation" version: "1.0" spec: agent_id: "sql-agent" multi_judge: execution_mode: EXECUTION_MODE_HYBRID aggregation: AGGREGATION_STRATEGY_WEIGHTED_AVERAGE timeout_seconds: 30 judges: - name: "safety-judge" weight: 2.5 min_passing_score: 90 criticality: JUDGE_CRITICALITY_SAFETY_CRITICAL - name: "quality-judge" weight: 2.0 min_passing_score: 85 criticality: JUDGE_CRITICALITY_CRITICAL - name: "cost-judge" weight: 1.0 min_passing_score: 70 criticality: JUDGE_CRITICALITY_NON_CRITICAL test_cases: - name: "simple_aggregation" input: "Calculate monthly revenue by product" expected_output_contains: - "GROUP BY" - "SUM" - name: "join_query" input: "Show customers with their orders" expected_output_contains: - "JOIN" expected_output_not_contains: - "DROP" ``` Run the eval suite: ```bash looms eval run eval-suites/sql-evaluation.yaml --store ./results.db ``` ## Examples ### Example 1: CLI Multi-Judge Evaluation ```bash looms judge evaluate \ --agent=sql-agent \ --prompt="Generate a query to find top customers by revenue" \ --response="SELECT customer_id, SUM(amount) as revenue FROM orders GROUP BY customer_id ORDER BY revenue DESC LIMIT 10" \ --judges=quality-judge,safety-judge,cost-judge \ --aggregation=weighted-average \ --export-to-hawk ``` ### Example 2: YAML Eval Suite Create `examples/eval-suites/comprehensive.yaml`: ```yaml apiVersion: loom/v1 kind: EvalSuite metadata: name: "Comprehensive SQL Evaluation" spec: agent_id: "sql-agent" multi_judge: execution_mode: EXECUTION_MODE_HYBRID aggregation: AGGREGATION_STRATEGY_ALL_MUST_PASS timeout_seconds: 60 fail_fast: true judges: - name: "safety-judge" criteria: "No SQL injection, no DROP/DELETE without WHERE" weight: 3.0 min_passing_score: 90 criticality: JUDGE_CRITICALITY_SAFETY_CRITICAL dimensions: - JUDGE_DIMENSION_SAFETY - name: "quality-judge" criteria: "Syntactically correct, proper joins" weight: 2.0 min_passing_score: 85 criticality: JUDGE_CRITICALITY_CRITICAL dimensions: - JUDGE_DIMENSION_QUALITY test_cases: - name: "aggregation" input: "Monthly revenue by category" max_cost_usd: 0.05 - name: "join" input: "Customers with orders" - name: "filter" input: "Orders from last week" ``` Run: ```bash looms eval run examples/eval-suites/comprehensive.yaml ``` View results: ```bash sqlite3 ./results.db "SELECT test_name, passed, cost_usd FROM test_case_results" ``` ## Troubleshooting ### All Judges Failing 1. Check judge criteria match your domain 2. Verify agent output format 3. Lower `min_passing_score` during calibration ### Timeout Errors 1. Increase `timeout_seconds` in config 2. Use ASYNC execution for slow judges 3. Simplify judge prompts ### Hawk Export Fails Results still save to SQLite. Check Hawk endpoint configuration: ```bash looms config show | grep hawk ```