---
name: eng-langfuse-eval-runner
description: Use when setting up or operating automated quality evaluation of legal AI skill outputs using Langfuse. Covers the evaluation dataset structure for legal domains, judge-model configuration, scoring rubrics specific to legal drafting and analysis, how to run batch evals across skill versions, and how to connect eval results to feature-flag promotion decisions. Engineering skill for legal AI quality assurance.
license: MIT
metadata:
  id: eng.langfuse-eval-runner
  category: eng
  jurisdictions: [__multi__]
  priority: P2
  intent: [eval, quality-assurance, langfuse, scoring, testing, LLM-evaluation]
  related:
    - eng-langfuse-trace-inspector
    - eng-feature-flag-rollout-skills
    - eng-latency-slo-by-skill
    - eng-cost-per-message-tracker
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"
---

# Langfuse Eval Runner

## What it does

The eval runner executes systematic quality evaluations of skill outputs, using Langfuse as the orchestration layer. For each skill under test, it:

1. Takes a dataset of reference input/output pairs (golden set).
2. Runs the current skill version against the inputs.
3. Scores the outputs against the golden outputs using a judge model or custom rubric.
4. Writes scores back to Langfuse traces.
5. Computes aggregated pass rates and quality metrics.
6. Feeds results into the feature-flag promotion decision ([[eng-feature-flag-rollout-skills]]).

In a legal AI product, where the cost of a wrong output is potential malpractice exposure, systematic evaluation is non-optional for any skill that produces client-facing legal content.

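The steps above can be sketched as a plain Python loop. This is a minimal sketch under stated assumptions, not the runner's actual implementation: `GoldenItem`, `run_eval_loop`, `run_skill`, and `judge` are hypothetical names, the skill and the judge are passed in as plain callables, and both the write-back to Langfuse (step 4) and the promotion hand-off (step 6) are left out. The default `pass_threshold` of 0.75 mirrors the value used in the judge configuration later in this document.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenItem:
    """Hypothetical in-memory view of one golden-set item."""
    item_id: str
    input: dict
    expected_output: dict


def run_eval_loop(
    items: list[GoldenItem],
    run_skill: Callable[[dict], dict],
    judge: Callable[[dict, dict], float],
    pass_threshold: float = 0.75,
) -> dict:
    """Run the skill on each golden input, score it, and aggregate."""
    scores: list[float] = []
    failures: list[dict] = []
    for item in items:
        output = run_skill(item.input)                # step 2: run the skill
        score = judge(output, item.expected_output)   # step 3: score in [0, 1]
        scores.append(score)                          # step 4 would persist this
        if score < pass_threshold:
            failures.append({"item_id": item.item_id, "score": score})
    n = len(scores)
    return {                                          # step 5: aggregate metrics
        "n_items": n,
        "pass_rate": (n - len(failures)) / n if n else 0.0,
        "avg_score": sum(scores) / n if n else 0.0,
        "failures": failures,
    }


# Usage with stub callables (assumptions for illustration only).
items = [
    GoldenItem("a1", {"guess": "CONCERN"}, {"result": "CONCERN"}),
    GoldenItem("a2", {"guess": "CLEAN"}, {"result": "CLEAN"}),
    GoldenItem("a3", {"guess": "CLEAN"}, {"result": "CONFLICT"}),
    GoldenItem("a4", {"guess": "CONFLICT"}, {"result": "CONFLICT"}),
]
summary = run_eval_loop(
    items,
    run_skill=lambda inp: {"result": inp["guess"]},
    judge=lambda out, exp: 1.0 if out["result"] == exp["result"] else 0.5,
)
```

Passing the skill and judge as callables keeps the loop testable without network access; a real runner would replace the stubs with calls to the skill under test and the judge model.
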
## Evaluation dataset structure

A Langfuse dataset item for a legal skill:

```json
{
  "dataset_id": "efirm-conflict-check-eval-v1",
  "item_id": "ulid",
  "input": {
    "skill_id": "efirm-conflict-check",
    "context": {
      "new_client": "Al Baraka Holdings Ltd",
      "counterparties": ["Noor Capital SAOC", "Ali Hassan (individual)"],
      "matter_description": "Acquisition of minority stake in a KSA-incorporated company"
    },
    "user_message": "Run a conflict check for this new matter."
  },
  "expected_output": {
    "result": "CONCERN",
    "dimension_flagged": "former_client_check",
    "description": "Noor Capital SAOC was represented by the firm in matter 2023-UAE-0081 which may be substantially related"
  },
  "metadata": {
    "skill_version": "1.0",
    "practice_area": "corporate",
    "jurisdiction": "KSA",
    "difficulty": "medium",
    "created_by": "legal-qa-team",
    "reviewed_by": "partner-id"
  }
}
```

## Rubrics by skill category

### Conflict check rubrics

| Dimension | Scoring criterion | Weight |
|---|---|---|
| Correct result classification | CLEAN / CONCERN / CONFLICT matches expected | 40% |
| Correct dimension identification | Right conflict dimension flagged | 25% |
| No false negatives | Did not miss a flagged party | 25% |
| Description quality | Actionable, precise description | 10% |

### Drafting skill rubrics (engagement letter, NDA, fee quote)

| Dimension | Scoring criterion | Weight |
|---|---|---|
| Completeness | All required sections present | 25% |
| Accuracy | No fabricated statute numbers, incorrect jurisdiction references | 25% |
| Auto-population | All available fields correctly populated | 20% |
| Plain language | Jargon explained; sentence length within standard | 15% |
| Format compliance | Correct headings, numbering, version stamp | 15% |

### Advisory / analysis skill rubrics

| Dimension | Scoring criterion | Weight |
|---|---|---|
| Issue identification | Key legal issues identified | 30% |
| Legal accuracy | No invented legal rules; correct jurisdiction reference | 35% |
| Actionability | Concrete recommendations, not hedged generalities | 20% |
| Appropriate scope | Does not exceed what can be reliably stated | 15% |

## Judge model configuration

For automated scoring (as opposed to human review), the judge model evaluates each output:

```yaml
judge_config:
  model: claude-sonnet-4-6
  system_prompt: |
    You are a senior legal quality reviewer. Evaluate the AI output against the
    reference output and the rubric. Score each dimension 0–1 (continuous).
    Return JSON: {dimension: score, overall: float, pass: bool, notes: string}.
    Legal accuracy is paramount. A single fabricated statute number or incorrect
    jurisdiction claim is an automatic fail regardless of other scores.
  pass_threshold: 0.75            # overall score ≥ 0.75 = pass
  legal_accuracy_hard_fail: true  # any legal_accuracy < 0.9 = fail regardless
```

The judge model should never be the same model as the model being evaluated; use a different or more capable model as judge.

## Running an eval

### Single skill, current version

```python
# `langfuse` is assumed here to be a project-level wrapper exposing run_eval(),
# rather than a method of the official Langfuse SDK client.
langfuse.run_eval(
    dataset_name="efirm-conflict-check-eval-v1",
    skill_id="efirm-conflict-check",
    skill_version="current",
    judge_config="legal-qa-rubric",
    run_name="conflict-check-v1-2025-05-14"
)
```

### A/B comparison (two skill versions)

```python
# Compare two skill versions on the same dataset.
langfuse.run_eval(
    dataset_name="efirm-engagement-letter-eval-v2",
    experiments=[
        {"skill_version": "1.0", "name": "control"},
        {"skill_version": "1.1-test", "name": "treatment"}
    ],
    judge_config="drafting-rubric",
    significance_threshold=0.05
)
```

## Eval result schema (written back to Langfuse)

```json
{
  "run_id": "run_xxx",
  "skill_id": "efirm-conflict-check",
  "skill_version": "1.0",
  "dataset_id": "efirm-conflict-check-eval-v1",
  "n_items": 50,
  "pass_rate": 0.88,
  "avg_score": 0.83,
  "scores_by_dimension": {
    "result_classification": 0.92,
    "dimension_identification": 0.86,
    "no_false_negatives": 0.84,
    "description_quality": 0.79
  },
  "failures": [
    {"item_id": "xxx", "score": 0.61, "notes": "Missed corporate group adversity"}
  ],
  "promotion_recommendation": "PASS: meets 0.75 threshold",
  "run_timestamp": "ISO-8601"
}
```

## Promotion gate

Integrate with [[eng-feature-flag-rollout-skills]]:

- If `pass_rate ≥ 0.85` AND `avg_score ≥ 0.80`: auto-promote to next rollout stage.
- If `pass_rate < 0.75` OR any legal-accuracy hard fail: block promotion; alert product team.
- If `pass_rate` is between 0.75 and 0.85: manual review by legal QA partner before promotion.

## Golden set maintenance

- Golden sets must be reviewed and updated by a legal practitioner (not just engineering) at least every 6 months.
- When new legal developments occur (new legislation, regulatory guidance), update affected golden items.
- Target: minimum 30 items per skill for statistically meaningful results; 50–100 items for P0 skills.

## Related skills

- [[eng-langfuse-trace-inspector]]
- [[eng-feature-flag-rollout-skills]]
- [[eng-latency-slo-by-skill]]
- [[eng-cost-per-message-tracker]]