# Goldset — Architecture
## 1. Component overview
```mermaid
graph TB
subgraph "User repo"
Eval[evals/*.eval.ts]
Cases[Test cases]
LLM[llm function]
end
subgraph "Goldset core"
Gold[goldenDataset runner
Levenshtein]
Judge[llmJudge runner
rubric scoring]
Struct[structural runner
schema + regex + contains]
Agg[Aggregator]
end
subgraph "GitHub Action"
Run[Run eval]
Compare[Compare to base branch]
Comment[Post PR comment]
Gate[Fail check on regression]
end
Eval --> Cases
Cases --> Gold
Cases --> Judge
Cases --> Struct
LLM --> Gold
LLM --> Judge
LLM --> Struct
Gold --> Agg
Judge --> Agg
Struct --> Agg
Agg --> Run
Run --> Compare
Compare --> Comment
Compare --> Gate
classDef core fill:#dcfce7,stroke:#16a34a
classDef action fill:#dbeafe,stroke:#2563eb
class Gold,Judge,Struct,Agg core
class Run,Compare,Comment,Gate action
```
---
## 2. Eval run sequence (in CI)
```mermaid
sequenceDiagram
autonumber
participant PR as Pull Request
participant GHA as GitHub Action
participant G as Goldset CLI
participant LLM as Your LLM
participant Judge as Judge LLM
participant Cmt as PR Comment
PR->>GHA: open / push
GHA->>G: npx goldset run evals/*.eval.ts
G->>LLM: invoke for each test case
LLM-->>G: outputs
G->>Judge: score outputs (llmJudge runner only)
Judge-->>G: rubric scores
G->>G: aggregate results
G-->>GHA: dist/eval-results.json
GHA->>GHA: download base branch results
GHA->>GHA: compute delta
alt regression detected
GHA->>Cmt: upsert PR comment with red row
GHA->>PR: fail check
else clean or improvement
GHA->>Cmt: upsert PR comment with green status
GHA->>PR: pass check
end
```
---
## 3. Runner internals
### goldenDataset
```mermaid
graph LR
Input[Test case
input + expected] --> LLMCall[Call your LLM]
LLMCall --> Output[Actual output]
Output --> Norm[Normalize
lowercase, trim, etc.]
Expected[Expected] --> Norm2[Normalize]
Norm --> Distance[Levenshtein distance]
Norm2 --> Distance
Distance --> Sim[Similarity = 1 - dist / max_len]
Sim --> Compare{≥ threshold?}
Compare -->|yes| Pass[pass]
Compare -->|no| Fail[fail + report similarity]
```
### llmJudge
```mermaid
graph LR
Input[Test case
input] --> LLMCall[Call your LLM]
LLMCall --> Output[Actual output]
Output --> Prompt[Compose judge prompt
rubric + input + output]
Prompt --> JudgeCall[Call judge LLM]
JudgeCall --> Parse[Parse score 0-5]
Parse --> Compare{≥ passThreshold?}
Compare -->|yes| Pass[pass + record score]
Compare -->|no| Fail[fail + record score]
```
### structural
```mermaid
graph LR
Input[Test case
input] --> LLMCall[Call your LLM]
LLMCall --> Output[Actual output]
Output --> Loop{For each assertion}
Loop --> JSON[json-schema?]
Loop --> Regex[regex match?]
Loop --> Contains[contains substring?]
Loop --> Tool[tool-call-shape?]
JSON --> Result[true/false]
Regex --> Result
Contains --> Result
Tool --> Result
Result --> All{All true?}
All -->|yes| Pass[pass]
All -->|no| Fail[fail + first failing assertion]
```
---
## 4. Result format
`dist/eval-results.json`:
```json
{
"version": 1,
"timestamp": "2026-05-26T17:00:00Z",
"commit": "abc123",
"branch": "feat/refund-flow",
"runners": {
"goldenDataset": {
"cases": [
{"id":"refund-q","passed":true,"similarity":0.93},
{"id":"shipping-q","passed":false,"similarity":0.61,"threshold":0.85}
],
"summary": {"passed":1,"failed":1,"passRate":0.5}
},
"llmJudge": {
"cases": [
{"id":"tone-helpful","passed":true,"score":4.2,"passThreshold":3}
],
"summary": {"passed":1,"failed":0,"avgScore":4.2}
},
"structural": {
"cases": [
{"id":"output-shape","passed":false,"failedAssertion":"json-schema","reason":"missing 'intent'"}
],
"summary": {"passed":0,"failed":1}
}
}
}
```
The GitHub Action diffs this file against the base branch's same-named file and renders the PR comment.
---
## 5. Design decisions
**Why three runners, not one?**
Because real AI apps fail in three orthogonal ways: drifted facts (golden), drifted tone (judge), drifted shape (structural). One runner that does all three would be a god-object. Three small runners compose.
**Why Levenshtein over embedding similarity for goldenDataset?**
Levenshtein is deterministic, dependency-free, and the failure mode it catches (canonical-answer drift) is character-level. Embedding similarity adds a model dependency and makes the threshold harder to reason about. We expose a hook to swap it in for cases where you need semantic match.
**Why call it `llmJudge` and not `aiScore`?**
Because "judge" makes the LLM-grading-LLM pattern explicit. Engineers reading this for the first time should immediately know what they're looking at.
**Why no UI?**
A UI is a different product. Goldset is for engineers who want evals next to their code. The PR comment IS the UI.
**Why provider-agnostic?**
Locking you to OpenAI was Vercel Eval's mistake. Your `llm: (input) => Promise` is the only interface. Goldset doesn't care if it's GPT-4o, Claude, Llama 3 on Ollama, or a stub function.