---
name: evaluation-quality
description: Instrument evaluation metrics, quality scores, and feedback loops
triggers:
  - "agent evaluation"
  - "quality metrics"
  - "response scoring"
  - "evals instrumentation"
  - "feedback tracking"
priority: 2
---

# Evaluation and Quality Instrumentation

Instrument evaluation systems to track agent quality and enable continuous improvement.

## Core Principle

Quality observability answers:

1. **How good** are agent responses?
2. **What types** of errors occur (hallucination, refusal, etc.)?
3. **Is quality improving** over time?
4. **What correlates** with high/low quality?
5. **How does human feedback** compare to automated evals?

## Evaluation Types

### Automated Evals

LLM-as-judge or heuristic scoring:

```python
span.set_attribute("eval.type", "automated")
span.set_attribute("eval.method", "llm_judge")
span.set_attribute("eval.model", "claude-3-opus")
span.set_attribute("eval.criteria", "helpfulness")
span.set_attribute("eval.score", 0.85)
span.set_attribute("eval.confidence", 0.92)
```

### Human Feedback

User ratings and corrections:

```python
span.set_attribute("eval.type", "human")
span.set_attribute("eval.feedback_type", "thumbs")
span.set_attribute("eval.score", 1)  # 1 = thumbs up, 0 = thumbs down
span.set_attribute("eval.user_id", "user_hash")
span.set_attribute("eval.latency_to_feedback_ms", 45000)
```

### Ground Truth Comparison

Compare to known correct answers:

```python
span.set_attribute("eval.type", "ground_truth")
span.set_attribute("eval.metric", "exact_match")
span.set_attribute("eval.score", 1.0)
span.set_attribute("eval.test_case_id", "test_123")
```

## Quality Dimensions

### Correctness

```python
span.set_attribute("quality.factual_accuracy", 0.95)
span.set_attribute("quality.hallucination_detected", False)
span.set_attribute("quality.source_grounded", True)
span.set_attribute("quality.citation_count", 3)
```

### Helpfulness

```python
span.set_attribute("quality.task_completion", 1.0)
span.set_attribute("quality.answered_question", True)
span.set_attribute("quality.actionable", True)
span.set_attribute("quality.conciseness", 0.8)
```

### Safety

```python
span.set_attribute("quality.safety_score", 1.0)
span.set_attribute("quality.refused", False)
span.set_attribute("quality.pii_detected", False)
span.set_attribute("quality.harmful_content", False)
```

### Relevance

```python
span.set_attribute("quality.relevance", 0.9)
span.set_attribute("quality.on_topic", True)
span.set_attribute("quality.context_used", 0.85)
```

## Eval Span Attributes

```python
# Eval metadata (P0)
span.set_attribute("eval.id", str(uuid4()))
span.set_attribute("eval.name", "helpfulness_v2")
span.set_attribute("eval.version", "2.1")
span.set_attribute("eval.timestamp", datetime.utcnow().isoformat())

# Input reference (P0)
span.set_attribute("eval.trace_id", original_trace_id)
span.set_attribute("eval.span_id", original_span_id)
span.set_attribute("eval.agent_name", "researcher")

# Scores (P0)
span.set_attribute("eval.score", 0.85)
span.set_attribute("eval.pass", True)
span.set_attribute("eval.threshold", 0.7)

# Details (P1)
span.set_attribute("eval.reasoning", "Response was accurate and helpful")
span.set_attribute("eval.issues", ["slightly_verbose"])
span.set_attribute("eval.latency_ms", 1500)
```
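
These attributes are typically recorded on a dedicated eval span that points back at the trace and span being judged. A minimal sketch using the OpenTelemetry API (the `record_eval` helper and the `EvalResult` container are illustrative assumptions, not part of any SDK):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import uuid4

from opentelemetry import trace

tracer = trace.get_tracer("evals")


@dataclass
class EvalResult:
    """Illustrative container for one eval outcome."""
    name: str
    version: str
    score: float
    threshold: float
    reasoning: str = ""
    issues: list[str] = field(default_factory=list)


def record_eval(result: EvalResult, original_trace_id: str, original_span_id: str) -> None:
    """Emit an eval result as its own span, linked back to the evaluated trace."""
    with tracer.start_as_current_span(f"eval.{result.name}") as span:
        # Eval metadata (P0)
        span.set_attribute("eval.id", str(uuid4()))
        span.set_attribute("eval.name", result.name)
        span.set_attribute("eval.version", result.version)
        span.set_attribute("eval.timestamp", datetime.now(timezone.utc).isoformat())

        # Reference to the evaluated trace/span (P0)
        span.set_attribute("eval.trace_id", original_trace_id)
        span.set_attribute("eval.span_id", original_span_id)

        # Scores (P0)
        span.set_attribute("eval.score", result.score)
        span.set_attribute("eval.threshold", result.threshold)
        span.set_attribute("eval.pass", result.score >= result.threshold)

        # Details (P1)
        if result.reasoning:
            span.set_attribute("eval.reasoning", result.reasoning)
        if result.issues:
            span.set_attribute("eval.issues", result.issues)
```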
## Feedback Collection Pattern

```python
from langfuse import Langfuse
from langfuse.decorators import observe
from opentelemetry.trace import get_current_span  # assumes OTel spans are active alongside Langfuse

langfuse = Langfuse()


@observe(name="feedback.collect")
def collect_feedback(
    trace_id: str,
    score: int,
    feedback_type: str = "thumbs",
    comment: str | None = None,
):
    span = get_current_span()
    span.set_attribute("feedback.trace_id", trace_id)
    span.set_attribute("feedback.type", feedback_type)
    span.set_attribute("feedback.score", score)
    if comment:
        span.set_attribute("feedback.has_comment", True)
        span.set_attribute("feedback.comment_length", len(comment))

    # Store feedback as a Langfuse score on the original trace
    langfuse.score(
        trace_id=trace_id,
        name=feedback_type,
        value=score,
        comment=comment,
    )
```

## LLM-as-Judge Pattern

```python
@observe(name="eval.llm_judge")
def evaluate_response(
    question: str,
    response: str,
    criteria: str,
) -> float:
    span = get_current_span()
    span.set_attribute("eval.method", "llm_judge")
    span.set_attribute("eval.criteria", criteria)

    judge_prompt = f"""
    Evaluate this response on {criteria} (0-1 scale):

    Question: {question}
    Response: {response}

    Score:
    """

    # judge_llm and parse_score are project-specific: any LLM client plus a
    # numeric-extraction helper will do
    result = judge_llm.invoke(judge_prompt)
    score = parse_score(result)

    span.set_attribute("eval.score", score)
    span.set_attribute("eval.judge_tokens", result.usage.total_tokens)
    return score
```

## Hallucination Detection

```python
@observe(name="eval.hallucination_check")
def check_hallucination(
    response: str,
    sources: list[str],
) -> dict:
    span = get_current_span()
    span.set_attribute("eval.type", "hallucination")

    # Check each claim against sources
    claims = extract_claims(response)
    span.set_attribute("eval.claims_count", len(claims))

    grounded = 0
    for claim in claims:
        if is_grounded(claim, sources):
            grounded += 1

    score = grounded / len(claims) if claims else 1.0
    span.set_attribute("eval.grounded_claims", grounded)
    span.set_attribute("eval.hallucination_score", 1 - score)
    span.set_attribute("eval.pass", score >= 0.9)

    return {"score": score, "grounded": grounded, "total": len(claims)}
```

## Framework Integration

### Langfuse Scores

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Score a trace
langfuse.score(
    trace_id=trace_id,
    name="helpfulness",
    value=0.85,
    comment="Accurate and complete response",
)

# Score a specific generation
langfuse.score(
    trace_id=trace_id,
    observation_id=generation_id,
    name="factual_accuracy",
    value=1.0,
)
```

### Braintrust Evals

```python
from braintrust import Eval

Eval(
    "agent_quality",
    data=[{"input": q, "expected": a} for q, a in test_cases],
    task=run_agent,
    scores=[
        # Scorer classes, e.g. from autoevals or custom implementations
        Factuality(),
        Helpfulness(),
        SafetyCheck(),
    ],
)
```

## Aggregation & Trends

Track quality over time:

```python
# Per-eval run
span.set_attribute("eval.run_id", run_id)
span.set_attribute("eval.test_count", 100)
span.set_attribute("eval.pass_rate", 0.92)
span.set_attribute("eval.avg_score", 0.87)
span.set_attribute("eval.p50_score", 0.89)
span.set_attribute("eval.p10_score", 0.65)

# Comparison to baseline
span.set_attribute("eval.baseline_score", 0.82)
span.set_attribute("eval.improvement", 0.05)
span.set_attribute("eval.regression", False)
```

## Anti-Patterns

- No evals on production data (lab-only testing)
- Missing baseline comparison (regressions go undetected)
- Evals without trace linking (failures can't be debugged)
- Only thumbs up/down feedback (no granular insight)
- No eval versioning (scores can't be compared over time)

## Related Skills

- `human-in-the-loop` - Feedback collection
- `llm-call-tracing` - Generation tracking
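
The run-level aggregates in the Aggregation & Trends section can be derived from raw per-test scores before they are written to the eval-run span. A minimal sketch of that computation; `summarize_run` and its argument names are illustrative assumptions, not part of any framework:

```python
import statistics


def summarize_run(scores: list[float], threshold: float, baseline_avg: float) -> dict:
    """Turn per-test eval scores into the run-level attributes shown above."""
    # quantiles(n=10) returns the nine decile cut points: index 0 is p10, index 4 is p50
    deciles = statistics.quantiles(scores, n=10, method="inclusive")
    avg = statistics.fmean(scores)
    return {
        "eval.test_count": len(scores),
        "eval.pass_rate": sum(s >= threshold for s in scores) / len(scores),
        "eval.avg_score": avg,
        "eval.p50_score": deciles[4],
        "eval.p10_score": deciles[0],
        "eval.baseline_score": baseline_avg,
        "eval.improvement": avg - baseline_avg,
        "eval.regression": avg < baseline_avg,
    }


# Attach the aggregates to the eval-run span (assumes `run_scores` and `span` are in scope)
for key, value in summarize_run(run_scores, threshold=0.7, baseline_avg=0.82).items():
    span.set_attribute(key, value)
```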