# Benchmark And Evaluation Notes

This document records the current demo-level measurements and explains what they mean. The numbers are intentionally reproducible from local scripts rather than presented as production claims.

## How To Reproduce

Start the backend:

```bash
cd backend
mvn spring-boot:run
```

Seed and evaluate the demo:

```bash
./scripts/quick-demo.sh
```

Useful endpoints:

```bash
GET /api/workflows/summary
POST /api/evaluations/rag/run
GET /api/document-index/600519/count
GET /api/intelligence/600519/graph
```

## Current Demo Snapshot

| Signal | Example Result |
| --- | --- |
| Workflow tasks | `1/1 succeeded`, `0 failed/dead-letter` |
| RAG evaluation | `85 / 100`, `2/3 cases passed` |
| Evidence index | `6 documents`, `6 chunks` for `600519` |
| Intelligence graph | `20 events`, `36 entities`, `47 relations` |
| Deterministic test status | `mvn test` passes; Docker-only smoke test skips when Docker is unavailable |

## RAG Evaluation Metrics

FinSight currently scores each evaluation case with:

- `ragHitRate`: share of retrieved evidence chunks that match required evidence keywords.
- `evidenceCoverage`: coverage of required evidence keywords across retrieved evidence.
- `answerCoverage`: coverage of required answer keywords in the final answer.
- `hallucinationRisk`: heuristic penalty for unsupported or overconfident claims.
- `conclusionConsistency`: whether risk and positive conclusions are expressed coherently.
- `confidenceCalibration`: whether confidence follows grounding quality.
- `latencyMillis`: response latency captured in the RAG trace.

The goal is not to claim perfect financial reasoning. The goal is to create a regression loop for evidence-grounded AI output.

## Workflow Reliability Checks

The demo verifies that the workflow API exposes:

- total task count;
- status distribution;
- stage distribution;
- failed/dead-letter count;
- latest created time.

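
As a rough illustration of the five summary fields listed above, the following Python sketch aggregates a list of task records. It is not FinSight's backend code: the `Task` record, its field names, the output keys, and the `FAILED`/`DEAD_LETTER` status strings are all assumptions made for the example, and the real response shape of `GET /api/workflows/summary` may differ.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Task:
    # Hypothetical task record; field names are assumptions, not the real schema.
    status: str
    stage: str
    created_at: datetime


# Assumed status names for the failed/dead-letter bucket.
FAILED_STATUSES = {"FAILED", "DEAD_LETTER"}


def summarize(tasks: list[Task]) -> dict:
    """Aggregate tasks into the five summary fields (hypothetical key names)."""
    status_counts = Counter(t.status for t in tasks)
    stage_counts = Counter(t.stage for t in tasks)
    # max() with default=None handles an empty task list without raising.
    latest: Optional[datetime] = max((t.created_at for t in tasks), default=None)
    return {
        "total": len(tasks),
        "byStatus": dict(status_counts),
        "byStage": dict(stage_counts),
        # Counter returns 0 for statuses that never occurred.
        "failedOrDeadLetter": sum(status_counts[s] for s in FAILED_STATUSES),
        "latestCreatedAt": latest,
    }
```

For the demo snapshot of one succeeded task, a summary of this shape would report a total of 1 and a failed/dead-letter count of 0.
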

This gives the dashboard enough signal to show whether the research pipeline is progressing, stuck, or recoverable.

## Cache Trust Checks

AI report reuse is tied to:

- `dataSnapshotHash`;
- `contextHash`;
- `reportVersion`.

When quote data, metrics, risks, or evidence changes, the snapshot hash changes and the report is regenerated instead of silently reusing stale conclusions.

## Next Benchmarks

- Add a Redis single-flight concurrency test that fires parallel report-generation requests and proves one owner wins the lease.
- Add a workflow timeout recovery test with a deliberately stale `RUNNING` task.
- Add p50/p95 timing for retrieval, AI fallback generation, report cache hit, and report cache miss.
- Track RAG evaluation trend history across commits.
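
For the p50/p95 timing item, one possible way to reduce a list of latency samples to percentiles is the nearest-rank method, sketched below in Python. The function name and parameter names are hypothetical, and the project may prefer a different percentile definition (for example, an interpolated one).

```python
import math


def nearest_rank_percentile(samples_ms: list[float], pct: float) -> float:
    """Return the pct-th percentile of samples_ms using the nearest-rank definition."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    # Nearest rank: the smallest sample with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

Calling `nearest_rank_percentile(latencies, 50)` and `nearest_rank_percentile(latencies, 95)` on the same sample list would give the two figures per stage. Nearest-rank always returns an observed sample, which keeps small demo runs easy to audit by hand.
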