---
id: ins_outcomes-grader-agent-evaluation
operator: Anthropic
operator_role: AI safety research company and Claude developer
co_operators: []
source_url: unknown
source_type: talk
source_title: Code with Claude
source_date: 2026-05-06
captured_date: 2026-05-07
domain: [ai-product, evals]
lifecycle: [growth-loops, retention]
maturity: frontier
artifact_class: framework
score: { originality: 4, specificity: 4, evidence: 4, transferability: 4, source: 3 }
tier: A
related: [ins_traces-need-feedback-to-learn, ins_error-analysis-highest-leverage-eval-step, ins_evals-are-data-analysis-on-llm-apps, ins_benevolent-dictator-not-committee, ins_dreaming-cross-session-memory-curation]
raw_ref:
---

# A separate grader agent in its own context window closes the output verification loop at production scale

## Claim

Deploy a separate grader agent in its own context window to evaluate whether an output meets a defined success rubric. The grader runs independently of the generator and sees only the output, never the generator's reasoning chain.

## Mechanism

When a grader shares context with the generator, it inherits the generator's blind spots. A separate context window forces independent evaluation against the rubric: the rubric, not the generator, determines pass or fail. This creates a closed feedback loop. The generator produces, the grader measures, and the delta drives improvement without manual review scaling linearly with volume.
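A minimal sketch of the mechanism, assuming the Anthropic Python SDK; the model id, rubric text, prompts, and JSON verdict schema are illustrative assumptions, not the Outcomes API. The key property is that `grade` opens a fresh conversation and receives only the rubric, the reference material, and the candidate output:

```python
# Sketch of the generator/grader split. Not the Outcomes API: rubric,
# prompts, and verdict schema are assumptions for illustration.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-20250514"  # placeholder; substitute your model id

RUBRIC = """Pass only if the summary (1) is under 100 words,
(2) names every party in the source, and (3) contains no claims
absent from the source."""  # hypothetical rubric for illustration

def generate(task: str) -> str:
    """Generator: produces the output in its own context window."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": task}],
    )
    return response.content[0].text

def grade(output: str, source: str) -> dict:
    """Grader: a fresh conversation that sees only the rubric, the
    reference material, and the output, never the generator's prompt
    or reasoning chain."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                f"Rubric:\n{RUBRIC}\n\nSource:\n{source}\n\n"
                f"Candidate output:\n{output}\n\n"
                'Reply with only JSON: {"pass": true|false, "failures": []}'
            ),
        }],
    )
    # A production grader would harden this parse against non-JSON replies.
    return json.loads(response.content[0].text)

source_doc = "..."  # ground-truth material the grader evaluates against
output = generate(f"Summarize for an executive audience:\n{source_doc}")
verdict = grade(output, source_doc)  # the delta feeds the improvement loop
```

Keeping the generator's prompt and reasoning out of `grade` is the design choice that prevents the grader from inheriting the generator's blind spots.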
## Conditions

Holds when:

- The task has a defined, measurable success rubric.
- The output format is consistent enough for the grader to evaluate.
- The grader can access the necessary ground truth or reference material.

Fails when:

- Success criteria are vague or subjective.
- The grader and generator share overlapping context that smuggles in confirmation bias.
- The rubric itself is wrong.

## Evidence

Announced at Code with Claude, May 6, 2026, as part of the Claude Managed Agents public beta. Internal testing showed a +8.4% improvement on docx file generation and a +10.1% improvement on pptx file generation after Outcomes was added.

> "Agents do their best work when they know what 'good' looks like."

## Signals

- Output quality scores improve week over week without manual review hours scaling proportionally
- The grader catches the same error class repeatedly, pointing to a concrete training target for the generator
- Task success rate and human-rated quality converge over time, indicating the grader is calibrated (a quick agreement check is sketched below)
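One way to check the calibration signal is to measure agreement between grader verdicts and a small human-labeled sample; the function name and sample data here are hypothetical:

```python
# Quick calibration check: fraction of outputs where the grader's
# pass/fail verdict matches an independent human rating.
def grader_agreement(grader_verdicts: list[bool], human_labels: list[bool]) -> float:
    assert len(grader_verdicts) == len(human_labels)
    matches = sum(g == h for g, h in zip(grader_verdicts, human_labels))
    return matches / len(human_labels)

# Example: spot-check five graded outputs against human review.
agreement = grader_agreement(
    [True, False, True, True, False],
    [True, False, True, False, False],
)
print(f"grader/human agreement: {agreement:.0%}")  # 80%
```

If agreement plateaus below an acceptable threshold, the rubric, not the generator, is usually the component to revise.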
## Counter-evidence

For tasks without a verifiable rubric, adding a grader adds latency and cost with no quality signal. The grader itself can be miscalibrated if the rubric is underspecified.

## Cross-references

- `ins_traces-need-feedback-to-learn` (Harrison Chase): traces without outcome feedback are incomplete raw material
- `ins_error-analysis-highest-leverage-eval-step` (Hamel Husain): error analysis is the highest-leverage eval step most teams skip
- `ins_dreaming-cross-session-memory-curation` (Anthropic): paired feature; Outcomes closes the output loop, Dreaming closes the memory loop