# Add a New Benchmark Extend SkillOpt with your own benchmark in ~200 lines of code. We will use a tiny worked example, `docfaithful`, that scores a target model on how faithfully it answers questions grounded in a small reference doc. > **Working reference.** The easiest way to copy-cargo-cult a new env is > to read [`skillopt/envs/officeqa/`](https://github.com/microsoft/SkillOpt/tree/main/skillopt/envs/officeqa). > Everything below is the same shape, simplified. ## What you need to build To add a benchmark you implement four things: 1. **A `SplitDataLoader` subclass** — knows how to load train / val / test item dicts from disk. 2. **A rollout helper** — runs the target model on a batch of items under the current skill and scores each prediction. 3. **An `EnvAdapter` subclass** — wires the loader + rollout helper into SkillOpt's lifecycle (`build_*_env`, `rollout`, `reflect`, `get_task_types`). 4. **A YAML config** — references your env name plus the standard train / optimizer / gradient knobs. Then one line in `scripts/train.py`'s `_register_builtins()` makes it discoverable. --- ## Step 1 — Create the package ```bash mkdir -p skillopt/envs/docfaithful touch skillopt/envs/docfaithful/__init__.py ``` ## Step 2 — Implement the data loader `skillopt/envs/docfaithful/dataloader.py`: ```python from __future__ import annotations import json from pathlib import Path from skillopt.datasets.base import SplitDataLoader def _normalize(raw: dict) -> dict: """Make sure every item has an ``id``. Other keys are env-specific.""" return { "id": str(raw["uid"]), "question": raw["question"], "ground_truth": raw["answer"], "reference_text": raw.get("reference", ""), "task_type": raw.get("category", "docfaithful"), } class DocFaithfulDataLoader(SplitDataLoader): """Load DocFaithful items from JSON files inside each split dir.""" def load_split_items(self, split_path: str) -> list[dict]: # split_path is e.g. data/docfaithful_split/train/ json_files = sorted(Path(split_path).glob("*.json")) if not json_files: raise FileNotFoundError(f"No .json file found in {split_path}") with json_files[0].open(encoding="utf-8") as f: raw = json.load(f) return [_normalize(item) for item in raw] ``` Only `load_split_items()` is mandatory. If you also want to support `split_mode="ratio"` (auto-split a single raw file into train/val/test), override `load_raw_items(data_path)` as well — see `skillopt/datasets/base.py` docstrings. ## Step 3 — Write the rollout helper `skillopt/envs/docfaithful/rollout.py`: ```python from __future__ import annotations import json import os from pathlib import Path from skillopt.model import chat_target def _score(prediction: str, ground_truth: str) -> tuple[int, float]: """Trivial exact-match scorer. Replace with F1 / ROUGE / LLM-judge.""" p = (prediction or "").strip().lower() g = (ground_truth or "").strip().lower() hard = int(p == g and bool(g)) soft = 1.0 if hard else 0.0 return hard, soft def _rollout_one(item: dict, skill_content: str, *, max_completion_tokens: int) -> dict: system = skill_content user = ( f"Question: {item['question']}\n\n" f"Reference:\n{item.get('reference_text', '')}\n\n" "Answer:" ) prediction, _usage = chat_target( system=system, user=user, max_completion_tokens=max_completion_tokens, ) hard, soft = _score(prediction, item.get("ground_truth", "")) return { "id": str(item["id"]), "hard": hard, "soft": soft, "predicted_answer": prediction, "question": item.get("question", ""), "reference_text": item.get("reference_text", ""), "task_type": item.get("task_type", "docfaithful"), } def run_batch(*, items: list[dict], skill_content: str, out_root: str, workers: int = 4, max_completion_tokens: int = 4096) -> list[dict]: """Run a batch of episodes sequentially or with a thread pool.""" os.makedirs(out_root, exist_ok=True) # For brevity we go sequentially — swap in concurrent.futures.ThreadPoolExecutor # when network / model latency dominates. results = [ _rollout_one(item, skill_content, max_completion_tokens=max_completion_tokens) for item in items ] Path(out_root, "rollouts.json").write_text( json.dumps(results, ensure_ascii=False, indent=2) ) return results ``` Two design points worth flagging: - **Scoring lives here, not in `EnvAdapter`.** There is no `evaluate()` method on the ABC. Whatever signal you put in `hard` (0/1, or a float in [0, 1] for smoothed reward) and `soft` (float in [0, 1]) is what the optimizer reads. - **Use `skillopt.model.chat_target`**, not raw OpenAI/Claude calls. That routes through whichever **chat** target backend the user configured (`openai_chat` / `claude_chat` / `qwen_chat` / `minimax_chat`) without your adapter caring. Exec-style backends (`codex_exec`, `claude_code_exec`) need env-specific rollout code — see `skillopt/envs/swebench/` for an example. ## Step 4 — Implement the environment adapter `skillopt/envs/docfaithful/adapter.py`: ```python from __future__ import annotations from skillopt.datasets.base import BatchSpec from skillopt.envs.base import EnvAdapter from skillopt.envs.docfaithful.dataloader import DocFaithfulDataLoader from skillopt.envs.docfaithful.rollout import run_batch class DocFaithfulAdapter(EnvAdapter): """SkillOpt adapter for the DocFaithful benchmark.""" def __init__( self, split_dir: str = "", data_path: str = "", split_mode: str = "split_dir", split_ratio: str = "2:1:7", split_seed: int = 42, split_output_dir: str = "", workers: int = 4, analyst_workers: int = 4, failure_only: bool = False, minibatch_size: int = 8, edit_budget: int = 4, seed: int = 42, limit: int = 0, max_completion_tokens: int = 4096, ) -> None: self.workers = workers self.analyst_workers = analyst_workers self.failure_only = failure_only self.minibatch_size = minibatch_size self.edit_budget = edit_budget self.max_completion_tokens = int(max_completion_tokens) self.dataloader = DocFaithfulDataLoader( split_dir=split_dir, data_path=data_path, split_mode=split_mode, split_ratio=split_ratio, split_seed=split_seed, split_output_dir=split_output_dir, seed=seed, limit=limit, ) # ── Lifecycle ─────────────────────────────────────────────────────── def setup(self, cfg: dict) -> None: super().setup(cfg) self.dataloader.setup(cfg) def get_dataloader(self): return self.dataloader # ── Env construction ──────────────────────────────────────────────── def build_env_from_batch(self, batch: BatchSpec, **kwargs): # For dataset-backed envs the "manager" is just the items list. return list(batch.payload or []) def build_train_env(self, batch_size: int, seed: int, **kwargs): batch = self.dataloader.build_train_batch( batch_size=batch_size, seed=seed, **kwargs ) return self.build_env_from_batch(batch, **kwargs) def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs): batch = self.dataloader.build_eval_batch( env_num=env_num, split=split, seed=seed, **kwargs ) return self.build_env_from_batch(batch, **kwargs) # ── The rollout method (reflect is inherited) ─────────────────────── def rollout(self, env_manager, skill_content: str, out_dir: str, **kwargs) -> list[dict]: items: list[dict] = env_manager return run_batch( items=items, skill_content=skill_content, out_root=out_dir, workers=self.workers, max_completion_tokens=self.max_completion_tokens, ) # reflect() is inherited from EnvAdapter — it delegates to # run_minibatch_reflect with your analyst_error_* / analyst_success_* # prompts. Override it only if you need custom reflection logic. def get_task_types(self) -> list[str]: seen: list[str] = [] for item in ( self.dataloader.train_items + self.dataloader.val_items + self.dataloader.test_items ): tt = str(item.get("task_type") or "docfaithful") if tt not in seen: seen.append(tt) return seen or ["docfaithful"] ``` ### What the rollout actually does Look back at `run_batch` from Step 3 — it sends each `item["question"]` to the target model with `skill_content` as the system prompt, scores the answer against `item["ground_truth"]`, and returns a list of dicts: ```python [ {"id": "ex_001", "hard": 1, "soft": 0.92, "predicted_answer": "...", "question": "...", "reference_text": item["reference_text"]}, {"id": "ex_002", "hard": 0, "soft": 0.13, "fail_reason": "...", ...}, ... ] ``` The trainer only requires `id`, `hard`, `soft`. The rest is preserved on `RolloutResult.extras` (see `skillopt/types.py`) and is what your `reflect()` consumes via `run_minibatch_reflect`. ## Step 5 — Register the adapter Edit [`scripts/train.py`](https://github.com/microsoft/SkillOpt/blob/main/scripts/train.py) and add to `_register_builtins()`: ```python try: from skillopt.envs.docfaithful.adapter import DocFaithfulAdapter _ENV_REGISTRY["docfaithful"] = DocFaithfulAdapter except ImportError: pass # docfaithful deps not installed — skip ``` There is **no `BENCHMARK_REGISTRY` dict in `skillopt/envs/__init__.py`** — the registry lives in `scripts/train.py` and is populated lazily so that optional deps don't break `--help`. ## Step 6 — Create the YAML config `configs/docfaithful/default.yaml`: ```yaml _base_: ../_base_/default.yaml # NOTE: string, not list model: reasoning_effort: medium train: batch_size: 16 accumulation: 1 num_epochs: 4 gradient: minibatch_size: 8 merge_batch_size: 8 optimizer: learning_rate: 4 env: name: docfaithful # Optional: a seed skill document. Create this file (or any markdown # file) yourself before the first run, or omit the key to let SkillOpt # start from an empty skill. skill_init: skillopt/envs/docfaithful/skills/initial.md split_mode: split_dir split_dir: data/docfaithful_split workers: 4 max_completion_tokens: 4096 limit: 0 ``` > ⚠️ `_base_` is currently parsed as a **string path**, not a list. Write > `_base_: ../_base_/default.yaml`, not `_base_: ['../_base_/default.yaml']`. > See [`skillopt/config.py`](https://github.com/microsoft/SkillOpt/blob/main/skillopt/config.py) > if you want to add list-form inheritance. ## Step 7 — Run ```bash # If you set skill_init above, create the seed skill first: # mkdir -p skillopt/envs/docfaithful/skills # echo "# DocFaithful initial skill" > skillopt/envs/docfaithful/skills/initial.md python scripts/train.py --config configs/docfaithful/default.yaml ``` If you get `ValueError: Unknown environment 'docfaithful'. Available: [...]`, you forgot Step 5. If you get `TypeError: Can't instantiate abstract class DocFaithfulAdapter`, you forgot to implement one of the four abstract methods on `EnvAdapter`: `build_train_env`, `build_eval_env`, `rollout`, `get_task_types`. ## Tips - Start with `train.batch_size: 4` and `limit: 10` while debugging. - The `evaluate` half lives **inside your `rollout`**, not as a separate method — there is no `evaluate()` in the `EnvAdapter` ABC. Score the prediction in `run_batch` and put the score on each result dict's `hard` / `soft`. - Noisy scoring kills the optimizer. Spend time on `run_batch`'s scoring before you spend time on prompts. - If your benchmark needs heavy optional deps (selenium, vllm, ...), wrap the registration block with `try / except ImportError` (Step 5) so people without those deps can still `--help`. - Copy `skillopt/envs/_template/` as a starting skeleton — it now implements the real abstract methods.