---
name: enhance-evals
description: Improve the eval system quality -- add better cases, fix broken cases, improve the LLM judge, add new metrics, or refine pass/fail criteria based on eval run data. Use when the user wants to make the eval system itself better.
disable-model-invocation: true
argument-hint: [area to improve]
allowed-tools: Bash, Read, Write, Edit, Glob, Grep, Agent
effort: high
---

# Enhance the Eval System

You are improving the boredgame.lol recommendation engine evaluation system. The user wants the eval system itself to be better -- more accurate tests, better failure detection, smarter judging, more useful output.

## Full Context You MUST Know

### System Architecture

- `evals/runner.ts` (626 lines) - Core runner: parallel execution, LLM judge, constraint checker, regression tracking, persistent logging
- `evals/types.ts` (218 lines) - Type definitions for cases, results, runs, metrics
- `evals/metrics.ts` (109 lines) - NDCG@K, Precision@K, MRR, Hit Rate calculations
- `evals/llm-judge.ts` (105 lines) - GPT-4o-mini rates overall result quality 0-10
- `evals/constraint-checker.ts` (110 lines) - Detects player count, time, complexity, and game type violations
- `evals/compare-runs.ts` (139 lines) - Side-by-side run comparison with regression detection
- `evals/summary.ts` (109 lines) - Run summary viewer with history mode
- `evals/analyze-failures.ts` (254 lines) - Failure pattern categorization + most-missing-game tracking
- `evals/generate-cases.ts` - 130 hand-curated base cases
- `evals/generate-expanded-cases.ts` - +177 systematic variations
- `evals/generate-massive.ts` - Thousands of LLM-generated cases (GPT-4o-mini, batched)
- `evals/cases.json` - Currently ~3,028 cases across 16 categories

### Known Weaknesses in the Eval System

1. **Pass/fail criteria may be too strict**: A case fails if ANY ideal game (relevance >= 2) is missing. But the results might be excellent games the eval case didn't anticipate. The LLM judge catches this (a 7.14/10 average vs a 68.4% pass rate means many "failing" cases have good results).
2. **idealGames are sometimes wrong**: Some generated cases reference games that don't exist in the DB, or use slightly wrong names (e.g., "Castles of Burgundy" vs "The Castles of Burgundy").
3. **LLM judge uses a 0-10 scale**: Research (arXiv 2411.15594) shows 0-2 scales with per-dimension scoring are more reliable. The current judge gives a single holistic score.
4. **No serendipity metric**: Research shows users want accuracy + novelty. We don't measure whether recommendations include unexpected-but-delightful games (see the metric sketches after this list).
5. **No familiarity balance metric**: Optimal is 20-30% recognizable games, 70-80% discovery. We don't track this (also sketched below).
6. **Catalog coverage is only 0.5%**: Only ~400-500 unique games appear across all recommendations. This could mean test cases are too similar, or that the engine has narrow candidate fetching.
7. **Generated cases may have quality issues**: LLM-generated cases (GPT-4o-mini at temp 0.9) may include unrealistic queries, wrong game names, or contradictory constraints.
8. **No confidence intervals**: We report point estimates without standard errors. With 3,000+ cases we should use statistical significance testing (see the Wilson interval sketch below).
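To make weaknesses 4 and 5 concrete, here is a minimal TypeScript sketch of both metrics. The `RecommendedGame` shape, the `numRatings` popularity proxy, and the 10,000-ratings familiarity cutoff are all illustrative assumptions, not fields that exist in `evals/types.ts` today:

```typescript
// Hypothetical shapes -- the real types live in evals/types.ts and will differ.
interface RecommendedGame {
  name: string;
  numRatings: number;    // assumed popularity proxy (e.g., BGG ratings count)
  relevant: boolean;     // assumed: counts as relevant for the case (relevance >= 2)
  obviousMatch: boolean; // assumed: the game is an expected/obvious answer for the query
}

// Familiarity balance: share of results a typical user would recognize.
// The research target above is ~20-30% familiar, 70-80% discovery.
export function familiarityRatio(
  results: RecommendedGame[],
  cutoff = 10_000, // arbitrary illustrative threshold
): number {
  if (results.length === 0) return 0;
  return results.filter((g) => g.numRatings >= cutoff).length / results.length;
}

// Serendipity@K: fraction of the top K that are relevant but NOT obvious
// matches -- the "unexpected-but-delightful" games weakness 4 describes.
export function serendipityAtK(results: RecommendedGame[], k: number): number {
  const topK = results.slice(0, k);
  if (topK.length === 0) return 0;
  return topK.filter((g) => g.relevant && !g.obviousMatch).length / topK.length;
}
```

Either metric could then be wired into `computeCaseMetrics()` alongside the existing NDCG@K and Precision@K calculations.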
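For weakness 8, a Wilson score interval is a standard way to put error bars on a pass rate. This helper is a sketch that could slot into `evals/metrics.ts`; the function is hypothetical, and the example counts are back-calculated from the ~68.4% pass rate:

```typescript
// Wilson score interval for a binomial proportion (e.g., pass rate).
// Behaves better than the normal approximation near 0% and 100%.
export function wilsonInterval(
  passes: number,
  total: number,
  z = 1.96, // 95% confidence
): { lower: number; upper: number } {
  if (total === 0) return { lower: 0, upper: 0 };
  const p = passes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = p + z2 / (2 * total);
  const margin = z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total));
  return { lower: (center - margin) / denom, upper: (center + margin) / denom };
}

// Illustrative: ~68.4% of ~3,028 cases passing gives roughly a +/-1.7-point
// 95% interval -- tight enough to detect real regressions between runs.
const ci = wilsonInterval(2071, 3028);
console.log(`pass rate 95% CI: ${(ci.lower * 100).toFixed(1)}%-${(ci.upper * 100).toFixed(1)}%`);
```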
### Research-Backed Improvements to Consider

From `docs/research/recommendation-eval-methodology.md` and `evals/RECOMMENDATIONS.md`:

- **Pairwise LLM comparison** instead of absolute scoring (more reliable)
- **Per-dimension scoring**: Rate mechanic match, theme match, and constraint satisfaction separately
- **Chain-of-thought reasoning** in the LLM judge (explain BEFORE scoring)
- **Trust buster detection**: Flag obviously-wrong individual results (e.g., Chess for a party game query)
- **Constraint violation breakdown** by type (time vs player count vs complexity)
- **Serendipity@K**: relevant AND dissimilar to the obvious matches
- **Familiarity-discovery ratio**: % of results that are well-known vs obscure

### The 16 Categories and Their Current Performance

| Category | Cases | Pass Rate | Notes |
|----------|-------|-----------|-------|
| mechanic-focused | 530 | ~32% | Weakest. BGG mechanic alias gap is the root cause. |
| multi-constraint | 384 | ~36% | Combined constraints are hard to satisfy. |
| theme-focused | 356 | ~86% | Strong. Theme matching works well. |
| video-game | 262 | ~100% | Perfect. No cross-contamination. |
| similar-to | 212 | ~83% | Good. "Like X" queries work. |
| mood-vibe | 189 | ~29% | Weak. Missing Patchwork and Jaipur for "chill" queries. |
| player-count | 177 | ~60% | Moderate. Some constraint violations. |
| time-constraint | 164 | ~50% | Moderate. Time violations persist. |
| free-text-intent | 159 | ~73% | Good. Natural language handling is decent. |
| edge-case | 153 | ~100% | Perfect. Handles garbage input gracefully. |
| negative-preference | 123 | ~73% | Good. Respects exclusions. |
| designer-search | 116 | ~42% | Weak. Non-designer games get mixed in. |
| complexity | 112 | ~44% | Moderate. Misses gateway games. |
| real-user-feedback | 78 | ~33% | Weak. BGG user issues persist. |
| regression | 9 | ~89% | Good. Past bugs mostly stay fixed. |
| party-game | 4 | ~50% | Too few cases to judge. |

## What To Do

Based on what the user asks (or $ARGUMENTS), pick the right enhancement:

### If they want to fix broken/inaccurate test cases:

1. Read `evals/cases.json` and look for cases with wrong game names, unrealistic queries, or contradictory constraints
2. Cross-reference idealGames against the actual database -- query Supabase directly or use the validate script, `scripts/validate-eval-cases.ts` (a sketch of this kind of check appears after the judge section below)
3. Fix the cases in the appropriate generator (`generate-cases.ts`, `generate-expanded-cases.ts`, or `generate-massive.ts`) and regenerate

### If they want better metrics:

1. Read `evals/metrics.ts` and `evals/types.ts`
2. Add new metric calculations (serendipity, familiarity balance, catalog coverage per run -- the first two are sketched under Known Weaknesses above)
3. Update `computeCaseMetrics()` and `computeAggregateMetrics()` in the runner
4. Update the report formatter to display the new metrics

### If they want a better LLM judge:

1. Read `evals/llm-judge.ts`
2. Implement per-dimension scoring (mechanic match, theme match, and constraint satisfaction as separate 0-2 scores); see the sketch after this list
3. Add a chain-of-thought reasoning requirement (the judge explains before it scores)
4. Consider a pairwise comparison mode for A/B testing system versions
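A minimal sketch of what the per-dimension, chain-of-thought judge could look like. It assumes the judge keeps using the OpenAI SDK and GPT-4o-mini as `llm-judge.ts` does today; `judgeCase` and the `DimensionScores` shape are hypothetical names, not the current exports:

```typescript
import OpenAI from "openai";

// Hypothetical result shape: three 0-2 dimensions instead of one 0-10 score,
// with the reasoning emitted BEFORE the scores (chain-of-thought).
interface DimensionScores {
  reasoning: string;
  mechanicMatch: 0 | 1 | 2;
  themeMatch: 0 | 1 | 2;
  constraintSatisfaction: 0 | 1 | 2;
}

const client = new OpenAI();

export async function judgeCase(
  query: string,
  resultNames: string[],
): Promise<DimensionScores> {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini",
    response_format: { type: "json_object" }, // force parseable output
    messages: [
      {
        role: "system",
        content:
          "You judge board game recommendations. Write your reasoning FIRST, " +
          "then score each dimension 0-2 (0 = miss, 1 = partial, 2 = strong). " +
          'Reply as JSON: {"reasoning": string, "mechanicMatch": 0|1|2, ' +
          '"themeMatch": 0|1|2, "constraintSatisfaction": 0|1|2}.',
      },
      {
        role: "user",
        content: `Query: ${query}\nResults:\n${resultNames.join("\n")}`,
      },
    ],
  });
  return JSON.parse(res.choices[0].message.content ?? "{}") as DimensionScores;
}
```

Putting `reasoning` first in the schema is what makes this chain-of-thought: the model generates its explanation before committing to scores. A pairwise mode (step 4) would instead pass two result lists and ask which is better, which the research above finds more reliable than absolute scoring.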
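Both the broken-cases workflow above and the case-generation workflow below end with validating idealGames names against the database. The existing `scripts/validate-eval-cases.ts` presumably does something like this already; this sketch just shows the shape of the check, assuming a supabase-js client and a `games` table with a `name` column (table, column, and env var names are assumptions):

```typescript
import { createClient } from "@supabase/supabase-js";

// Assumed env var names -- use whatever .env.local actually defines.
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!,
);

// Return the idealGames names with no exact match in the DB -- candidates
// for the "Castles of Burgundy" vs "The Castles of Burgundy" near-miss fixes.
export async function findMissingGames(idealGames: string[]): Promise<string[]> {
  const { data, error } = await supabase
    .from("games")
    .select("name")
    .in("name", idealGames);
  if (error) throw error;
  const found = new Set((data ?? []).map((row) => row.name));
  return idealGames.filter((name) => !found.has(name));
}
```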
### If they want more/better test cases:

1. Identify which categories are underrepresented (party-game has only 4 cases!)
2. Add hand-curated cases in `generate-cases.ts` for the weakest categories
3. Run `npm run eval:generate-massive` to fill gaps with LLM-generated cases
4. Validate the new cases against the database

### If they want better reporting/visualization:

1. Read `evals/summary.ts`, `evals/compare-runs.ts`, and `evals/analyze-failures.ts`
2. Add new views (e.g., trend over time, per-game analysis, category drill-down)
3. Consider adding an HTML report generator for richer visualization

## Important Rules

- READ the relevant files thoroughly before making changes
- ALWAYS regenerate cases.json after modifying generators (run the appropriate generate script)
- ALWAYS run a quick eval after changes to verify nothing broke: `source .env.local && npx tsx evals/runner.ts --quick --no-judge`
- Document what you changed in `evals/EVAL-WORKLOG.md`
- Do NOT change engine code -- this skill is about the eval system only