# Eval Guide

Internal tooling for measuring how reliably a model + backend combo navigates multi-step tool-calling workflows. Not a test suite — run manually against a live backend.

## Eval Harness

### Quick Start

```bash
# Ollama — all scenarios, 10 runs each
python -m tests.eval.eval_runner --backend ollama --model "ministral-3:8b-instruct-2512-q4_K_M" --runs 10 --stream --verbose

# llama-server — start server in one terminal, run eval in another
llama-server --jinja -m path/to/Ministral-3-14B-Instruct-2512-Q4_K_M.gguf -ngl 999 --port 8080
python -m tests.eval.eval_runner --backend llamafile --llamafile-mode native --gguf path/to/Ministral-3-14B-Instruct-2512-Q4_K_M.gguf --runs 10 --stream --verbose

# Anthropic API
python -m tests.eval.eval_runner --backend anthropic --model claude-haiku-4-5-20251001 --runs 5 --stream --verbose
```

### eval_runner Flags

| Flag | Values | Default | Description |
|------|--------|---------|-------------|
| `--backend` | `ollama`, `llamafile`, `anthropic` | `ollama` | Backend to target |
| `--model` | string | *(required for ollama/anthropic)* | Model name (Ollama-style or Anthropic model ID). Rejected for llamafile (use `--gguf`). |
| `--gguf` | path | *(required for llamafile)* | Path to GGUF / llamafile model file. Rejected for ollama/anthropic (use `--model`). |
| `--runs` | int | `10` | Runs per scenario |
| `--stream` | flag | off | Use streaming mode |
| `--verbose`, `-v` | flag | off | Print live per-message trace |
| `--tags` | `plumbing`, `model_quality`, `advanced_reasoning`, `compaction`, `stateful`, `reasoning`, `error_recovery` | all | Filter scenarios by tag |
| `--scenario` | name(s) | all | Run specific scenario(s) by name |
| `--llamafile-mode` | `native`, `prompt`, `auto` | `auto` | FC mode for llamafile/llama-server backend |
| `--think` | `true`, `false`, `auto` | `auto` | Thinking mode. Ollama: controls `think` param. Llamafile: captures `[THINK]` tags and `reasoning_content` |
| `--budget-mode` | `backend`, `manual`, `forge-full`, `forge-fast` | `forge-full` | Context budget strategy. Compaction scenarios always override with their own budget |
| `--num-ctx` | int | none | Exact token budget (requires `--budget-mode manual`) |
| `--no-history` | flag | off | Disable message history collection (lighter, fewer metrics) |
| `--probe` | flag | off | Print resolved budget from backend and exit (no eval run) |
| `--base-url` | URL | none | Override backend base URL |
| `--ablation` | `reforged`, `no_rescue`, `no_nudge`, `no_steps`, `no_recovery`, `no_compact`, `bare` | `reforged` | Ablation preset: selectively disable guardrails |
| `--tool-choice` | `auto`, `any` | none | Anthropic `tool_choice` type. `any` forces tool calls |
| `--no-cache-prompt` | flag | off | Disable llama-server prompt caching |
| `--compact-strategy` | `tiered`, `sliding`, `none` | auto | Override compaction strategy for all scenarios |
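
The table implies a few cross-flag rules: `--model` and `--gguf` are mutually exclusive per backend, and `--num-ctx` only makes sense with `--budget-mode manual`. The sketch below restates those rules as a standalone check. It is not the runner's actual parser; `build_parser` and `validate` are hypothetical names, and only the flags involved in the rules are modeled.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the flags the rules below depend on."""
    parser = argparse.ArgumentParser(prog="eval_runner_sketch")
    parser.add_argument("--backend", choices=["ollama", "llamafile", "anthropic"], default="ollama")
    parser.add_argument("--model")
    parser.add_argument("--gguf")
    parser.add_argument(
        "--budget-mode",
        choices=["backend", "manual", "forge-full", "forge-fast"],
        default="forge-full",
    )
    parser.add_argument("--num-ctx", type=int)
    return parser


def validate(args: argparse.Namespace) -> list[str]:
    """Return a list of problems; an empty list means the combination is valid."""
    problems = []
    if args.backend == "llamafile":
        # llamafile / llama-server take a GGUF path, never a model name.
        if args.gguf is None:
            problems.append("--gguf is required for llamafile")
        if args.model is not None:
            problems.append("--model is rejected for llamafile (use --gguf)")
    else:
        # ollama and anthropic take a model name, never a GGUF path.
        if args.model is None:
            problems.append(f"--model is required for {args.backend}")
        if args.gguf is not None:
            problems.append(f"--gguf is rejected for {args.backend} (use --model)")
    if args.num_ctx is not None and args.budget_mode != "manual":
        problems.append("--num-ctx requires --budget-mode manual")
    return problems


if __name__ == "__main__":
    bad = build_parser().parse_args(["--backend", "llamafile", "--model", "x", "--num-ctx", "8192"])
    for problem in validate(bad):
        print(problem)
```

Collecting every problem before reporting, rather than failing on the first, makes it easier to fix a long command line in one pass.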

### Scenarios

30 scenarios across five categories. The 26 non-compaction scenarios split into two difficulty tiers, **OG-18** (baseline) and **advanced_reasoning** (hard); the dashboard's Suite scope filter switches between them.

**Plumbing** (does forge's tool-calling loop work?):
- `basic_2step`, `sequential_3step`, `error_recovery`

**Model quality** (does the model reason correctly?):
- `tool_selection`, `argument_fidelity`, `sequential_reasoning`, `conditional_routing`, `data_gap_recovery`, `relevance_detection`

**Advanced reasoning** (top-tier separators — designed to weed out 8B-class winners after sampling-defaults closed the OG-18 gap):
- `data_gap_recovery_extended`, `argument_transformation`, `inconsistent_api_recovery`, `grounded_synthesis`

**Compaction chain** (multi-phase compaction retention):
- `compaction_chain_baseline`, `compaction_chain_p1`, `compaction_chain_p2`, `compaction_chain_p3`

**Stateful variants** (state carries between calls — wrong arguments cascade):
- Every scenario above (except the compaction chain) has a `_stateful` counterpart: `basic_2step_stateful`, `sequential_3step_stateful`, `error_recovery_stateful`, `tool_selection_stateful`, `argument_fidelity_stateful`, `sequential_reasoning_stateful`, `conditional_routing_stateful`, `data_gap_recovery_stateful`, `relevance_detection_stateful`, `data_gap_recovery_extended_stateful`, `argument_transformation_stateful`, `inconsistent_api_recovery_stateful`, `grounded_synthesis_stateful`.

**Lambda vs stateful:** Lambda scenarios use hardcoded echo tools — tool arguments don't affect the result. Stateful scenarios use backend classes where arguments matter and state carries between calls. The delta between lambda and stateful scores for the same model isolates model reasoning quality from forge correctness.
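
One way to read that delta is to pair each lambda scenario with its `_stateful` counterpart and subtract the success rates. The sketch below assumes you already have per-scenario success rates in a plain dict; `stateful_delta` is a hypothetical helper, and the numbers are illustrative, not measured results.

```python
def stateful_delta(rates: dict[str, float]) -> dict[str, float]:
    """Map each lambda scenario to (lambda rate - stateful rate).

    Scenarios without a `_stateful` counterpart (e.g. the compaction chain)
    are left out of the result.
    """
    deltas = {}
    for name, rate in rates.items():
        if name.endswith("_stateful"):
            continue
        twin = f"{name}_stateful"
        if twin in rates:
            deltas[name] = round(rate - rates[twin], 4)
    return deltas


# Illustrative numbers only, not measured results.
example_rates = {
    "basic_2step": 1.0,
    "basic_2step_stateful": 0.9,
    "argument_fidelity": 0.8,
    "argument_fidelity_stateful": 0.5,
    "compaction_chain_p1": 0.7,
}

if __name__ == "__main__":
    for scenario, delta in stateful_delta(example_rates).items():
        print(f"{scenario}: {delta:+.2f}")
```

A large positive delta suggests the loop itself completes, but the model's arguments stop holding up once they affect tool results.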

**OG-18 vs advanced_reasoning:** OG-18 is the 18-scenario baseline (plumbing + model_quality + their stateful pairs). advanced_reasoning is the 8 scenarios tagged for top-tier-only batching. Most published results split aggregates across the two; see [MODEL_GUIDE.md](MODEL_GUIDE.md#difficulty-tiers) for context.
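
The suite arithmetic can be checked from the scenario names listed above. This is a sketch for sanity-checking counts and building `--scenario` arguments; the constants and `with_stateful` are hypothetical names, not part of the harness.

```python
PLUMBING = ["basic_2step", "sequential_3step", "error_recovery"]
MODEL_QUALITY = [
    "tool_selection",
    "argument_fidelity",
    "sequential_reasoning",
    "conditional_routing",
    "data_gap_recovery",
    "relevance_detection",
]
ADVANCED = [
    "data_gap_recovery_extended",
    "argument_transformation",
    "inconsistent_api_recovery",
    "grounded_synthesis",
]
COMPACTION = [
    "compaction_chain_baseline",
    "compaction_chain_p1",
    "compaction_chain_p2",
    "compaction_chain_p3",
]


def with_stateful(names: list[str]) -> list[str]:
    """Return the lambda scenarios followed by their `_stateful` counterparts."""
    return names + [f"{name}_stateful" for name in names]


OG_18 = with_stateful(PLUMBING + MODEL_QUALITY)   # 9 lambda + 9 stateful
ADVANCED_REASONING = with_stateful(ADVANCED)      # 4 lambda + 4 stateful
ALL_SCENARIOS = OG_18 + ADVANCED_REASONING + COMPACTION

if __name__ == "__main__":
    print(len(OG_18), len(ADVANCED_REASONING), len(ALL_SCENARIOS))
    # Ready to paste after `--scenario` on the command line.
    print(" ".join(ADVANCED_REASONING))
```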

### Examples

```bash
# Filter by tag
python -m tests.eval.eval_runner --backend ollama --model "qwen3:8b-q4_K_M" --runs 5 --tags plumbing

# Specific scenarios
python -m tests.eval.eval_runner --backend ollama --model "ministral-3:8b-instruct-2512-q4_K_M" --runs 10 --scenario basic_2step sequential_3step

# Qwen3 with thinking on llama-server
llama-server --jinja -m path/to/Qwen3-8B-Q4_K_M.gguf -ngl 999 --port 8080 --reasoning-format auto
python -m tests.eval.eval_runner --backend llamafile --llamafile-mode native --gguf path/to/Qwen3-8B-Q4_K_M.gguf --runs 10 --stream --think true

# Probe budget without running eval
python -m tests.eval.eval_runner --backend ollama --model "ministral-3:8b-instruct-2512-q4_K_M" --probe

# Ablation — bare (all guardrails off)
python -m tests.eval.eval_runner --backend anthropic --model claude-haiku-4-5-20251001 --runs 5 --stream --ablation bare
```

All OG-18 non-stateful scenarios (copy-paste friendly):

```
--scenario basic_2step sequential_3step error_recovery tool_selection argument_fidelity sequential_reasoning conditional_routing data_gap_recovery relevance_detection
```

All OG-18 stateful scenarios:

```
--scenario basic_2step_stateful sequential_3step_stateful error_recovery_stateful tool_selection_stateful argument_fidelity_stateful sequential_reasoning_stateful conditional_routing_stateful data_gap_recovery_stateful relevance_detection_stateful
```

All advanced_reasoning scenarios (lambda + stateful, 8 total):

```
--scenario data_gap_recovery_extended argument_transformation inconsistent_api_recovery grounded_synthesis data_gap_recovery_extended_stateful argument_transformation_stateful inconsistent_api_recovery_stateful grounded_synthesis_stateful
```

Or via tag (equivalent):

```
--tags advanced_reasoning
```

---

## Batch Eval

Run large-scale model comparisons across all backends. Results append to JSONL with automatic resume. Ollama auto-loads models; llama-server is auto-managed (start, stop, and health check per GGUF); llamafile binaries require a manually started server.

### batch_eval Flags

| Flag | Values | Default | Description |
|------|--------|---------|-------------|
| `--config` | `all`, `ollama`, `llamaserver`, `llamafile`, `llamaserver-native`, `llamaserver-prompt`, `anthropic`, `anthropic-any`, `haiku`, `sonnet`, `opus`, `haiku-any`, `sonnet-any`, `opus-any` | `all` | Config set to run |
| `--runs` | int | `50` | Runs per scenario |
| `--output` | path | `eval_results.jsonl` | JSONL output path |
| `--scenario` | name(s) | all | Run specific scenario(s) |
| `--tags` | tag(s) | all | Filter scenarios by tag |
| `--budget-mode` | `backend`, `manual`, `forge-full`, `forge-fast` | `forge-full` | Context budget strategy |
| `--num-ctx` | int | none | Exact token budget (requires `--budget-mode manual`) |
| `--ablation` | preset name | `reforged` | Ablation preset |
| `--model` | substring | none | Filter configs to models containing this substring |
| `--dry-run` | flag | off | Show what would run without executing |
| `--verbose`, `-v` | flag | off | Print per-run details |

### Examples

```bash
# Ollama (11 models, fully unattended)
python -m tests.eval.batch_eval --config ollama --runs 50

# llama-server (auto-managed, starts/stops per GGUF)
python -m tests.eval.batch_eval --config llamaserver --runs 50

# Anthropic (costs money)
python -m tests.eval.batch_eval --config anthropic --runs 50

# Dry run
python -m tests.eval.batch_eval --config all --runs 50 --dry-run

# Filter to specific model
python -m tests.eval.batch_eval --config llamaserver --model 8b-reasoning --runs 20

# Specific scenarios only
python -m tests.eval.batch_eval --config ollama --runs 50 --scenario basic_2step sequential_reasoning
```

Resume is automatic: re-run the same command and it skips completed scenarios.
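
Conceptually, resume on an append-only JSONL file comes down to reading back what is already recorded and skipping it. The sketch below illustrates that idea only; it is not `batch_eval`'s implementation, and the `config` and `scenario` record keys are assumptions about the file layout.

```python
import json
from pathlib import Path


def completed_pairs(path: Path) -> set[tuple[str, str]]:
    """Collect (config, scenario) pairs already present in the JSONL file.

    The `config` and `scenario` keys are assumed for illustration.
    """
    done: set[tuple[str, str]] = set()
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
                done.add((record["config"], record["scenario"]))
            except (json.JSONDecodeError, KeyError, TypeError):
                # A truncated or malformed line (e.g. from an interrupted
                # write) is ignored, so that work is simply redone.
                continue
    return done


def pending(path: Path, planned: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the planned (config, scenario) pairs that still need to run."""
    done = completed_pairs(path)
    return [pair for pair in planned if pair not in done]
```

Because records are only ever appended, an interrupted batch loses at most the scenario that was in flight.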

---

## Reports

### Committed datasets

Released datasets are versioned in the repo: `eval_results_vX.Y.Z.jsonl` (LFS-tracked). The current shipped dashboard at `docs/results/dashboard.html` reflects the latest version. To regenerate the dashboard or markdown views against a specific release:

```bash
python -m tests.eval.report eval_results_v0.7.0.jsonl --html docs/results/dashboard.html --markdown docs/results/
```

Older datasets (e.g. `eval_results_v0.6.0.jsonl`) remain in the repo for comparison and reproducibility. `batch_eval` writes to `eval_results.jsonl` by default; rename to a versioned filename before committing to the repo.

### Forge eval report

```bash
# Full table + list
python -m tests.eval.report eval_results.jsonl

# Progress (for incomplete runs)
python -m tests.eval.report eval_results.jsonl --progress

# Compact list only (phone-friendly)
python -m tests.eval.report eval_results.jsonl --list-only

# Include partially-completed configs
python -m tests.eval.report eval_results.jsonl --include-partial

# Filter by ablation
python -m tests.eval.report eval_results.jsonl --ablation reforged bare

# Filter by scenario tag
python -m tests.eval.report eval_results.jsonl --tags stateful

# Exclude specific scenarios
python -m tests.eval.report eval_results.jsonl --exclude-scenario error_recovery

# HTML dashboard (requires Node.js)
python -m tests.eval.report eval_results.jsonl --html docs/results/dashboard.html

# Markdown views
python -m tests.eval.report eval_results.jsonl --markdown docs/results/

# Both
python -m tests.eval.report eval_results.jsonl --html docs/results/dashboard.html --markdown docs/results/
```

### report Flags

| Flag | Values | Default | Description |
|------|--------|---------|-------------|
| `jsonl` | path | `eval_results.jsonl` | JSONL input file (positional, optional) |
| `--list-only` | flag | off | Skip table, show list view only |
| `--progress` | flag | off | Show progress for all configs (including incomplete) |
| `--include-partial` | flag | off | Include configs that haven't finished all scenarios |
| `--ablation` | preset name(s) | all | Filter to specific ablation preset(s) |
| `--exclude-scenario` | name(s) | none | Exclude scenario(s) from aggregates and columns |
| `--tags` | `stateful`, `lambda`, `compaction` | all | Filter to scenarios matching tag(s) |
| `--html` | path | none | Write interactive HTML dashboard |
| `--markdown` | dir | none | Write pre-filtered markdown views |

---

## BFCL Benchmark (removed)

Forge previously included a [Berkeley Function Calling Leaderboard](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard) v4 integration (11 categories, ~2,183 entries). It was removed in favor of forge's own eval harness, which measures multi-step workflow completion rather than single-call argument matching. Last commit with BFCL code: [`a9b0257`](https://github.com/antoinezambelli/forge/commit/a9b0257).

---

## Ablation Presets

Ablation selectively disables forge guardrails to isolate their contribution to model performance.

| Preset | Rescue | Retry Nudge | Step Enforcement | Error Recovery | Compaction |
|--------|--------|-------------|------------------|----------------|------------|
| `reforged` | yes | yes (5 retries) | yes | yes (2 errors) | yes |
| `no_rescue` | **no** | yes | yes | yes | yes |
| `no_nudge` | **no** | **no** | yes | yes | yes |
| `no_steps` | yes | yes | **no** | yes | yes |
| `no_recovery` | yes | yes | yes | **no** | yes |
| `no_compact` | yes | yes | yes | yes | **no** |
| `bare` | **no** | **no** | **no** | **no** | **no** |
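
To turn ablation runs into a per-guardrail estimate, subtract each preset's score from the `reforged` baseline for the same model and scenario set. A minimal sketch, assuming you have one aggregate score per preset; `guardrail_contribution` is a hypothetical helper and the numbers are illustrative, not measured results.

```python
def guardrail_contribution(scores: dict[str, float]) -> dict[str, float]:
    """Return (reforged score - preset score) for every non-baseline preset."""
    baseline = scores["reforged"]
    return {
        preset: round(baseline - score, 4)
        for preset, score in scores.items()
        if preset != "reforged"
    }


# Illustrative numbers only, not measured results.
example_scores = {"reforged": 0.9, "no_rescue": 0.7, "bare": 0.4}

if __name__ == "__main__":
    for preset, drop in guardrail_contribution(example_scores).items():
        print(f"{preset}: {drop:+.2f}")
```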

---

## Backend Notes

See [BACKEND_SETUP.md](BACKEND_SETUP.md) for installation, server launch, and verification instructions for each backend (Ollama, llama-server, llamafile).

**Key points for eval:**
- Ollama runs as a background service — no manual server launch needed
- llama-server needs `--jinja` for native function calling; use `--backend llamafile --llamafile-mode native`
- llamafile has no native FC — use `--llamafile-mode prompt`
- Anthropic needs `ANTHROPIC_API_KEY` env var; compaction scenarios are skipped (200K context)
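
The points above can be folded into a small helper that picks the right flags for each target. This is a sketch built only from flags documented in this guide; `eval_runner_args` and the `target` labels are hypothetical, not part of the harness.

```python
import os


def eval_runner_args(target: str, model_or_gguf: str, runs: int = 10) -> list[str]:
    """Build an eval_runner command line for one target.

    `target` is a hypothetical label: "ollama", "llama-server",
    "llamafile", or "anthropic".
    """
    base = ["python", "-m", "tests.eval.eval_runner"]
    if target == "ollama":
        extra = ["--backend", "ollama", "--model", model_or_gguf]
    elif target == "llama-server":
        # llama-server (started with --jinja) supports native function calling.
        extra = ["--backend", "llamafile", "--llamafile-mode", "native", "--gguf", model_or_gguf]
    elif target == "llamafile":
        # llamafile binaries have no native FC, so use prompt mode.
        extra = ["--backend", "llamafile", "--llamafile-mode", "prompt", "--gguf", model_or_gguf]
    elif target == "anthropic":
        if not os.environ.get("ANTHROPIC_API_KEY"):
            raise RuntimeError("ANTHROPIC_API_KEY is not set")
        extra = ["--backend", "anthropic", "--model", model_or_gguf]
    else:
        raise ValueError(f"unknown target: {target}")
    return base + extra + ["--runs", str(runs)]


if __name__ == "__main__":
    print(" ".join(eval_runner_args("llama-server", "path/to/model.gguf")))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with model names that contain colons.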