# Autoevals scorer reference

Complete reference for all scorers available in Autoevals, including parameters, score ranges, and usage examples.

## Table of contents

- [LLM-as-a-judge scorers](#llm-as-a-judge-scorers)
- [RAG (Retrieval-Augmented Generation) scorers](#rag-retrieval-augmented-generation-scorers)
- [Heuristic scorers](#heuristic-scorers)
- [JSON scorers](#json-scorers)
- [List scorers](#list-scorers)

---

## LLM-as-a-judge scorers

These scorers use language models to evaluate outputs based on semantic understanding.

### Factuality

Evaluates whether the output is factually consistent with the expected answer.

**Parameters:**

- `input` (string): The input question or prompt
- `output` (string, required): The generated answer to evaluate
- `expected` (string, required): The ground truth answer
- `model` (string, optional): Model to use (default: configured via `init()` or "gpt-5-mini")
- `client` (Client, optional): Custom OpenAI client

**Score Range:** 0-1

- `1.0` = Output is factually accurate
- `0.0` = Output contains factual errors

**Example:**

```typescript
import { Factuality } from "autoevals";

const result = await Factuality({
  input: "What is the capital of France?",
  output: "Paris",
  expected: "The capital of France is Paris",
});
// Score: 1.0 (factually correct)
```

### Battle

Compares two outputs and determines which one is better.

**Parameters:**

- `input` (string): The input question or prompt
- `output` (string, required): First answer to compare
- `expected` (string, required): Second answer to compare
- `model` (string, optional): Model to use
- `client` (Client, optional): Custom OpenAI client

**Score Range:** 0-1

- `1.0` = Output is significantly better than expected
- `0.5` = Both outputs are roughly equal
- `0.0` = Expected is significantly better than output

**Example:**

```python
from autoevals.llm import Battle

evaluator = Battle()
result = evaluator.eval(
    input="Explain photosynthesis",
    output="Plants use sunlight to make food from CO2 and water",
    expected="Photosynthesis is a process"
)
# Score: ~1.0 (first answer is more detailed)
```

### ClosedQA

Evaluates answers to closed-ended questions where there's a clear correct answer.

**Parameters:**

- `input` (string): The question
- `output` (string, required): The generated answer
- `expected` (string, required): The correct answer
- `model` (string, optional): Model to use
- `criteria` (string, optional): Custom evaluation criteria

**Score Range:** 0-1

- `1.0` = Answer is correct
- `0.0` = Answer is incorrect

### Humor

Evaluates whether the output is humorous.

**Parameters:**

- `input` (string): The context or setup
- `output` (string, required): The text to evaluate for humor
- `model` (string, optional): Model to use

**Score Range:** 0-1

- `1.0` = Very humorous
- `0.0` = Not humorous

### Security

Evaluates whether the output contains security vulnerabilities or unsafe content.

**Parameters:**

- `output` (string, required): The content to evaluate
- `model` (string, optional): Model to use

**Score Range:** 0-1

- `1.0` = No security concerns
- `0.0` = Contains security vulnerabilities

### Moderation

Evaluates content for policy violations using OpenAI's moderation API.

**Parameters:**

- `output` (string, required): The content to moderate
- `client` (Client, optional): Custom OpenAI client

**Score Range:** 0-1

- `1.0` = Content is safe
- `0.0` = Content violates policies

**Categories Checked:**

- Sexual content
- Hate speech
- Harassment
- Self-harm
- Violence
- Sexual content involving minors

### Sql

Evaluates SQL query correctness and quality.

**Parameters:**

- `input` (string): The natural language question
- `output` (string, required): The generated SQL query
- `expected` (string, optional): The correct SQL query
- `model` (string, optional): Model to use

**Score Range:** 0-1

### Summary

Evaluates the quality of text summaries.

**Parameters:**

- `input` (string): The original text
- `output` (string, required): The generated summary
- `expected` (string, optional): A reference summary
- `model` (string, optional): Model to use

**Score Range:** 0-1

- `1.0` = Excellent summary (accurate, concise, complete)
- `0.0` = Poor summary

### Translation

Evaluates translation quality.

**Parameters:**

- `input` (string): The source text
- `output` (string, required): The generated translation
- `expected` (string, optional): A reference translation
- `model` (string, optional): Model to use

**Score Range:** 0-1

- `1.0` = Excellent translation
- `0.0` = Poor translation

---

## RAG (Retrieval-Augmented Generation) scorers

These scorers evaluate RAG systems by assessing both context retrieval and answer generation quality.

All RAG scorers support passing `context` through the `metadata` parameter when used with Braintrust Eval. See the [RAGAS module documentation](js/ragas.ts) for examples.

### ContextRelevancy

Evaluates how relevant the retrieved context is to the input question.

**Parameters:**

- `input` (string, required): The question
- `output` (string, required): The generated answer
- `context` (string[] | string, required): Retrieved context passages
- `model` (string, optional): Model to use (default: "gpt-5-nano")

**Score Range:** 0-1

- `1.0` = All context is highly relevant
- `0.0` = Context is irrelevant

**Example:**

```python
from autoevals.ragas import ContextRelevancy

scorer = ContextRelevancy()
result = scorer.eval(
    input="What is the capital of France?",
    output="Paris",
    context=[
        "Paris is the capital of France.",
        "Berlin is the capital of Germany."
    ]
)
# Score: ~0.5 (only first context item is relevant)
```

### ContextRecall

Measures how well the context supports the expected answer.

**Parameters:**

- `input` (string): The question
- `expected` (string, required): The ground truth answer
- `context` (string[] | string, required): Retrieved context passages
- `model` (string, optional): Model to use

**Score Range:** 0-1

- `1.0` = Context fully supports the expected answer
- `0.0` = Context doesn't support the expected answer

### ContextPrecision

Measures the precision of retrieved context - whether relevant context appears before irrelevant context.

**Parameters:**

- `input` (string, required): The question
- `expected` (string, required): The ground truth answer
- `context` (string[] | string, required): Retrieved context passages (order matters)
- `model` (string, optional): Model to use

**Score Range:** 0-1

- `1.0` = All relevant context appears first
- `0.0` = Relevant context is buried under irrelevant context

### ContextEntityRecall

Measures how well the context contains entities from the expected answer.

**Parameters:**

- `expected` (string, required): The ground truth answer
- `context` (string[] | string, required): Retrieved context passages
- `model` (string, optional): Model to use

**Score Range:** 0-1

- `1.0` = All entities from expected answer are in context
- `0.0` = No entities from expected answer are in context

### Faithfulness

Evaluates whether the generated answer's claims are supported by the context.

**Parameters:**

- `input` (string): The question
- `output` (string, required): The generated answer
- `context` (string[] | string, required): Retrieved context passages
- `model` (string, optional): Model to use

**Score Range:** 0-1

- `1.0` = All claims in the answer are supported by context
- `0.0` = None of the claims in the answer are supported by context (hallucinations)

**Example:**

```typescript
import { Faithfulness } from "autoevals";

const result = await Faithfulness({
  input: "What is photosynthesis?",
  output:
    "Photosynthesis is how plants make food using sunlight and also they can fly.",
  context: [
    "Photosynthesis is the process by which plants use sunlight to synthesize foods.",
  ],
});
// Score: ~0.5 (first claim supported, "can fly" is not)
```

### AnswerRelevancy

Measures how relevant the answer is to the question.

**Parameters:**

- `input` (string, required): The question
- `output` (string, required): The generated answer
- `context` (string[] | string, optional): Retrieved context passages
- `model` (string, optional): Model to use
- `embedding_model` (string, optional): Model for embeddings (default: "text-embedding-3-small")

**Score Range:** 0-1

- `1.0` = Answer directly addresses the question
- `0.0` = Answer is off-topic

### AnswerSimilarity

Compares semantic similarity between the generated answer and expected answer using embeddings.

**Parameters:**

- `output` (string, required): The generated answer
- `expected` (string, required): The ground truth answer
- `model` (string, optional): Embedding model to use (default: "text-embedding-3-small")

**Score Range:** 0-1

- `1.0` = Answers are semantically identical
- `0.0` = Answers are completely different

### AnswerCorrectness

Combines factual correctness and semantic similarity to evaluate answers.

**Parameters:**

- `input` (string, required): The question
- `output` (string, required): The generated answer
- `expected` (string, required): The ground truth answer
- `model` (string, optional): Model for factuality checking
- `embedding_model` (string, optional): Model for similarity (default: "text-embedding-3-small")
- `factuality_weight` (number, optional): Weight for factuality (default: 0.75)
- `answer_similarity_weight` (number, optional): Weight for similarity (default: 0.25)

**Score Range:** 0-1

**Formula:** `score = (factuality_weight × factuality_score + answer_similarity_weight × similarity_score) / (factuality_weight + answer_similarity_weight)`

---

## Heuristic scorers

Fast scorers that don't use an LLM as a judge. All are deterministic local computations except EmbeddingSimilarity, which calls an embedding model.

### Levenshtein

Calculates Levenshtein (edit) distance between strings, normalized to 0-1.

**Parameters:**

- `output` (string, required): The generated text
- `expected` (string, required): The expected text

**Score Range:** 0-1

- `1.0` = Strings are identical
- `0.0` = Strings are completely different

**Example:**

```python
from autoevals.string import Levenshtein

scorer = Levenshtein()
result = scorer.eval(output="hello", expected="helo")
# Score: ~0.8 (1 character difference)
```

### ExactMatch

Binary scorer that checks whether the output exactly equals the expected value.

**Parameters:**

- `output` (any, required): The generated value
- `expected` (any, required): The expected value

**Score Range:** 0 or 1

- `1` = Values are exactly equal
- `0` = Values differ

### NumericDiff

Evaluates numeric differences with configurable thresholds.

**Parameters:**

- `output` (number, required): The generated number
- `expected` (number, required): The expected number
- `max_diff` (number, optional): Maximum acceptable difference (default: 0)
- `relative` (boolean, optional): Use relative difference (default: false)

**Score Range:** 0-1

- `1.0` = Numbers are equal
- `0.0` = Difference meets or exceeds `max_diff` (absolute) or `|expected|` (relative)

**Formula (absolute):** `score = max(0, 1 - |output - expected| / max_diff)` (when max_diff > 0)

**Formula (relative):** `score = max(0, 1 - |output - expected| / |expected|)`

**Example:**

```typescript
import { NumericDiff } from "autoevals";

// Absolute difference
const result1 = await NumericDiff({
  output: 10.5,
  expected: 10.0,
  maxDiff: 1.0,
});
// Score: 0.5 (difference of 0.5 out of max 1.0)

// Relative difference
const result2 = await NumericDiff({
  output: 100,
  expected: 110,
  relative: true,
});
// Score: ~0.91 (difference of 10 relative to expected 110, about 9%)
```

### EmbeddingSimilarity

Compares semantic similarity using text embeddings (cosine similarity).

**Parameters:**

- `output` (string, required): The generated text
- `expected` (string, required): The expected text
- `model` (string, optional): Embedding model (default: "text-embedding-3-small")
- `client` (Client, optional): Custom OpenAI client

**Score Range:** -1 to 1 (typically 0-1 for text)

- `1.0` = Semantically identical
- `0.0` = Unrelated
- `-1.0` = Opposite meanings (rare)

---

## JSON scorers

Scorers for evaluating JSON outputs.

### JSONDiff

Recursively compares JSON objects with customizable string and number comparison.

**Parameters:**

- `output` (any, required): The generated JSON
- `expected` (any, required): The expected JSON
- `string_scorer` (Scorer, optional): Scorer for string values (default: Levenshtein)
- `number_scorer` (Scorer, optional): Scorer for numeric values (default: NumericDiff)
- `preserve_strings` (boolean, optional): Don't auto-parse JSON strings (default: false)

**Score Range:** 0-1

- `1.0` = JSON structures are identical
- `0.0` = JSON structures are completely different

**Example:**

```python
from autoevals.json import JSONDiff

scorer = JSONDiff()
result = scorer.eval(
    output={"name": "John", "age": 30},
    expected={"name": "John", "age": 31}
)
# Score: ~0.5 (name matches, age differs slightly)
```

### ValidJSON

Validates JSON syntax and optionally checks against a JSON Schema.

**Parameters:**

- `output` (any, required): The value to validate
- `schema` (object, optional): JSON Schema to validate against

**Score Range:** 0 or 1

- `1` = Valid JSON (and matches schema if provided)
- `0` = Invalid JSON or doesn't match schema

**Example:**

```typescript
import { ValidJSON } from "autoevals";

const schema = {
  type: "object",
  properties: {
    name: { type: "string" },
    age: { type: "number" },
  },
  required: ["name", "age"],
};

const result = await ValidJSON({
  output: '{"name": "John", "age": 30}',
  schema,
});
// Score: 1 (valid JSON matching schema)
```

---

## List scorers

Scorers for evaluating lists and arrays.

### ListContains

Checks if all expected items are present in the output list.

**Parameters:**

- `output` (any[], required): The generated list
- `expected` (any[], required): Items that should be present
- `scorer` (Scorer, optional): Scorer for comparing individual items

**Score Range:** 0-1

- `1.0` = All expected items are present
- `0.0` = None of the expected items are present

**Example:**

```python
from autoevals.list import ListContains

scorer = ListContains()
result = scorer.eval(
    output=["apple", "banana", "cherry"],
    expected=["apple", "banana"]
)
# Score: 1.0 (both expected items present)
```

---

## Custom scorers

You can create custom scorers for domain-specific evaluation needs. See:

- [JSON scorer examples](py/autoevals/json.py) - Combining validators and comparators
- [Creating custom scorers](README.md#creating-custom-scorers) - Basic custom scorer pattern

---

## Score interpretation

General guidelines for interpreting scores:

- **1.0**: Perfect match or complete correctness
- **0.8-0.99**: Very good, minor differences
- **0.6-0.79**: Acceptable, some issues
- **0.4-0.59**: Moderate quality, significant issues
- **0.2-0.39**: Poor quality, major issues
- **0.0-0.19**: Unacceptable or completely wrong

Note: Interpretation varies by scorer type. Binary scorers (ExactMatch, ValidJSON) only return 0 or 1.

---

## Common parameters

Many scorers share these common parameters:

- `model` (string): LLM model to use for evaluation (default: configured via `init()` or "gpt-5-mini")
- `client` (Client): Custom OpenAI-compatible client
- `use_cot` (boolean): Enable chain-of-thought reasoning for LLM scorers (default: true)
- `temperature` (number): LLM temperature setting
- `max_tokens` (number): Maximum tokens for LLM response

Parameter names in this reference are written in snake_case, as in the Python examples. The TypeScript examples use camelCase forms instead (for example `maxDiff` and `defaultModel`).

## Configuring defaults

Use `init()` to configure default settings for all scorers:

```typescript
import { init } from "autoevals";
import OpenAI from "openai";

init({
  client: new OpenAI({ apiKey: "..." }),
  defaultModel: "gpt-5-mini",
});
```

```python
from autoevals import init
from openai import OpenAI

init(OpenAI(api_key="..."), default_model="gpt-5-mini")
```
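
## Formula sketches

The formulas given in the NumericDiff and AnswerCorrectness sections can be checked by hand with a few lines of plain Python. This is a minimal sketch of the documented arithmetic only, not the library's implementation: the function names `numeric_diff_score` and `answer_correctness_score` are hypothetical, and the handling of `max_diff = 0` and `expected = 0` (treated here as an exact-match check) is an assumption, since the reference only defines the formulas for nonzero denominators.

```python
def numeric_diff_score(
    output: float,
    expected: float,
    max_diff: float = 0.0,
    relative: bool = False,
) -> float:
    """Hypothetical helper mirroring the NumericDiff formulas in this reference."""
    diff = abs(output - expected)
    if relative:
        # Assumption: with expected == 0 the relative formula is undefined,
        # so fall back to an exact-match check.
        if expected == 0:
            return 1.0 if diff == 0 else 0.0
        return max(0.0, 1.0 - diff / abs(expected))
    # Assumption: max_diff == 0 (the documented default) means exact match only.
    if max_diff <= 0:
        return 1.0 if diff == 0 else 0.0
    return max(0.0, 1.0 - diff / max_diff)


def answer_correctness_score(
    factuality_score: float,
    similarity_score: float,
    factuality_weight: float = 0.75,
    answer_similarity_weight: float = 0.25,
) -> float:
    """Hypothetical helper mirroring the AnswerCorrectness weighted average."""
    total_weight = factuality_weight + answer_similarity_weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = (
        factuality_weight * factuality_score
        + answer_similarity_weight * similarity_score
    )
    return weighted_sum / total_weight


# Same inputs as the NumericDiff example in this reference.
print(numeric_diff_score(10.5, 10.0, max_diff=1.0))            # 0.5
print(round(numeric_diff_score(100, 110, relative=True), 2))   # 0.91

# Fully factual answer (1.0) with moderate embedding similarity (0.5).
print(answer_correctness_score(1.0, 0.5))                      # 0.875
```

Because the AnswerCorrectness formula divides by the sum of the weights, the weights do not need to add up to 1; passing `factuality_weight=3` and `answer_similarity_weight=1` gives the same result as the defaults.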