# bench_env task testing guide

> Every suite's tasks must have tests. Tests are not optional: **a task without tests is a judge nobody verified; shipping it is gambling.**
>
> Companion docs:
> - Authoring workflow: [`TASK_AUTHORING_GUIDE.md`](TASK_AUTHORING_GUIDE.md)
> - Hard code spec: [`TASK_CODE_SPEC.md`](TASK_CODE_SPEC.md)

## 1. Test tiers

Two tiers with clear responsibilities:

| Tier | Depends on | Marker | What it covers |
|---|---|---|---|
| **Offline** | Only `defaults.json` | (default, no marker) | Task definition checks + Accessor tests + Judge positive/negative matrix |
| **Live** | Simulator at `localhost:3000` | `@pytest.mark.live` | Tasks whose judge needs runtime simulator state (e.g., post-query verdicts) |

**Most tasks should be offline tests.** Live tests are needed only when the judge depends on runtime data produced by a simulator setup (e.g., `queryState.directTrains` is generated dynamically by the App's search and can't be constructed statically).

## 2. File layout

```
bench_env/tests/
├── conftest.py              # Shared fixtures and helpers
├── pytest.ini               # pytest config
├── __init__.py
├── test_railway12306.py     # Railway12306 suite tests
├── test_weather.py          # Weather suite tests
├── test_wechat.py           # WeChat suite tests
└── ...                      # One file per suite
```

**Naming convention**:
- File name: `test_.py` (matches `task//`)
- Test classes: grouped by purpose (`TestTaskDefinitions`, `TestAccessor`, `TestTaskJudgeMatrixOffline`, `TestLiveQueryTasks`)

## 3. Shared infrastructure (`conftest.py`)

`conftest.py` provides fixtures and helpers used across all suite tests:

```python
# Session-scoped MobileGymEnv fixture (used by Live tests)
@pytest_asyncio.fixture(scope="session", loop_scope="session")
async def env(request) -> MobileGymEnv: ...

# Helper: build JudgeInput from raw state dicts
def make_judge_input(init_state, curr_state, *, route=None, init_route=None, answer=None) -> JudgeInput: ...
```

**Using `make_judge_input`**:
- `route`: the **current** route after the Agent's actions (assigned to `last_obs.route`)
- `init_route`: the **initial** route before the Agent acts (assigned to `init_obs.route`, default `{}`)
- The two routes are set independently; neither overwrites the other.

```python
from bench_env.tests.conftest import make_judge_input

# Basic: care only about the current route
inp = make_judge_input(
    {"apps": {"weather": init_data}, "os": os_state},
    {"apps": {"weather": curr_data}, "os": os_state},
    route={"app": "weather", "path": "/settings"},
    answer="25°C",
)

# When you need to distinguish initial vs current route:
inp = make_judge_input(
    {"apps": {"weather": init_data}, "os": os_state},
    {"apps": {"weather": curr_data}, "os": os_state},
    init_route={"app": "weather", "path": "/"},
    route={"app": "weather", "path": "/settings"},
)
```

## 4. The four mandatory test categories

### 4.1 Task definition validation (`TestTaskDefinitions`)

**Parametrize** over every task class in the suite. Collect the classes through `TaskRegistry` rather than importing `tasks.py` directly, so tasks that live under the `defs/` layout aren't missed:

```python
from bench_env.task.registry import TaskRegistry

ALL_TASK_CLASSES = list(TaskRegistry()._load_suite_tasks("").values())
```

| Test | Verifies |
|---|---|
| `test_instantiation` | Instantiable with default params; has templates; `apps` includes this suite |
| `test_description_renders` | Templates render with no unresolved `{placeholder}` |
| `test_required_class_attrs` | scope/objective/composition/difficulty are valid |
| `test_parameter_defaults_present` | Every non-`_`-prefixed parameter has a `default` |
| `test_answer_task_has_answer_or_get_answer` | AnswerTask subclasses define `answer` or override `get_answer()` |

These tests are **highly templated**: when adding a new suite, copy the structure and update only the imports and app names.
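As an illustration of what `test_description_renders` guards against, the following is a minimal sketch of an unresolved-placeholder check. It assumes templates use `str.format`-style `{name}` fields; `render` and `find_unresolved_placeholders` are hypothetical names for this example, not part of bench_env.

```python
import re

# Matches a leftover str.format-style field such as "{city}".
_PLACEHOLDER_RE = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")


class _KeepMissing(dict):
    """dict that leaves unknown fields in place instead of raising KeyError."""

    def __missing__(self, key):
        return "{" + key + "}"


def render(template: str, params: dict) -> str:
    # Hypothetical renderer: fills known params, keeps unknown fields visible.
    return template.format_map(_KeepMissing(params))


def find_unresolved_placeholders(rendered: str) -> list[str]:
    # A definition test would assert that this list is empty.
    return _PLACEHOLDER_RE.findall(rendered)


ok = render("查询{city}的{metric}", {"city": "北京", "metric": "湿度"})
broken = render("查询{city}的{metric}", {"city": "北京"})
```

In a real suite the assertion would run once per task class via `@pytest.mark.parametrize`, with the rendered text coming from the task's own templates.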

### 4.2 Accessor tests (`TestAccessor`)

Verify the properties and methods of the App class in `app.py`, using `defaults.json` as data:

```python
import copy

import pytest

# Weather (the App class) and DEFAULTS (the loaded defaults.json) come from the suite under test.

class TestWeatherAccessor:
    @pytest.fixture
    def w(self) -> Weather:
        # Deep-copy so one test cannot mutate the shared defaults
        return Weather(copy.deepcopy(DEFAULTS))

    def test_saved_cities(self, w: Weather):
        assert len(w.saved_cities) >= 1

    def test_current_temp(self, w: Weather):
        temp = w.current_temp("北京")
        assert isinstance(temp, (int, float))
```

**Rules**:
- Every public property/method has at least one test
- Methods requiring `init` (e.g., `new_orders()`) get their own `TestAccessorWithInit`
- Error-raising behavior on missing data must also be verified (`pytest.raises`)

### 4.3 Judge positive/negative matrix (`TestTaskJudgeMatrixOffline`)

**Core rule: every offline task must have at least one positive case and one negative case.**

Cases are built by factory functions that return `(task, JudgeInput)`:

```python
def _check_balance_positive_case():
    task = _tasks_module.CheckBalance()
    return task, _make_task_input(DEFAULTS, DEFAULTS, answer="500.00")

def _check_balance_negative_case():
    task = _tasks_module.CheckBalance()
    return task, _make_task_input(DEFAULTS, DEFAULTS, answer="999")
```

The factories are **collected into lists** and run in batch via `@pytest.mark.parametrize`:

```python
OFFLINE_JUDGE_POSITIVE_CASES = [
    ("CheckBalance", _check_balance_positive_case),
    ("SetTempUnit", _set_temp_unit_positive_case),
    # ... one row per offline task
]

OFFLINE_JUDGE_NEGATIVE_CASES = [
    ("CheckBalance", _check_balance_negative_case),
    ("SetTempUnit", _set_temp_unit_negative_case),
    # ...
]
```

**Completeness check** (prevents missing entries):

```python
def test_offline_judge_matrix_complete(self):
    positive = {name for name, _ in OFFLINE_JUDGE_POSITIVE_CASES}
    negative = {name for name, _ in OFFLINE_JUDGE_NEGATIVE_CASES}
    assert positive == OFFLINE_JUDGE_TASK_NAMES
    assert negative == OFFLINE_JUDGE_TASK_NAMES
```

This guarantees that **CI fails if a newly added task lacks a positive or negative case**.
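To make the shape of the matrix concrete, here is a self-contained sketch of what each parametrized test effectively asserts: positive factories must evaluate to success, negative factories to failure. `FakeCheckBalance`, `FakeJudgeInput`, `FakeResult`, and `failing_cases` are stand-ins invented for this example; the real tests call `task.evaluate()` on actual task classes with a real `JudgeInput`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResult:
    success: bool


@dataclass
class FakeJudgeInput:
    answer: Optional[str]


class FakeCheckBalance:
    """Stand-in task: passes when the answer mentions the expected balance."""

    expected = "500.00"

    def evaluate(self, inp: FakeJudgeInput) -> FakeResult:
        ok = bool(inp.answer) and self.expected in inp.answer
        return FakeResult(success=ok)


def _positive_case():
    return FakeCheckBalance(), FakeJudgeInput(answer="余额是500.00元")


def _negative_case():
    return FakeCheckBalance(), FakeJudgeInput(answer="余额是999元")


POSITIVE_CASES = [("CheckBalance", _positive_case)]
NEGATIVE_CASES = [("CheckBalance", _negative_case)]


def failing_cases(cases, expect_success: bool) -> list[str]:
    """Return the names of cases whose verdict differs from the expectation."""
    failures = []
    for name, factory in cases:
        task, inp = factory()  # each case is built fresh by its factory
        if task.evaluate(inp).success != expect_success:
            failures.append(name)
    return failures
```

Under pytest, the loop is replaced by `@pytest.mark.parametrize` over the same lists, so each case is reported as its own test item.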

#### 4.3.1 Positive/negative case construction

| | Positive | Negative |
|---|---|---|
| **operate task** | Build the correct state after the Agent's operation (added/modified data) | Keep the initial state, or build a wrong-operation outcome |
| **query task** | `answer` contains the correct answer | `answer` contains a wrong answer |
| **hybrid task** | State correct + answer correct (1 positive) | At least 2 negatives: state OK / answer wrong, and state wrong / answer OK (see below) |
| **CriteriaTask** | Modify the field in `curr_state` to the expected value | Keep the field at the initial value |

**Hybrid task negative matrix**:

Hybrid tasks check both state changes and the Agent's answer, so they have more failure modes than operate/query tasks. **At least 2 negatives** are needed to cover the independent failure paths:

| Combination | Expected | Why |
|---|---|---|
| State correct + answer correct | PASS | The single positive |
| State correct + answer wrong | FAIL | Agent did the right operation but answered wrong (verifies the answer check fires independently) |
| State wrong + answer correct | FAIL | Agent answered right but didn't operate (verifies the state check fires independently) |
| State wrong + answer wrong | FAIL | Optional third negative; covers the all-wrong case |

```python
# ✅ Hybrid negative examples (ColdestDayIn15: must navigate to forecast page + answer the coldest day)

# Negative 1: state correct (route on forecast page) but answer wrong
("ColdestDayIn15_wrong_answer", lambda: (
    _tasks_module.ColdestDayIn15(city="成都"),
    _make_input(BASE_STATE, BASE_STATE, route=FORECAST_ROUTE, answer="错误答案"),
)),

# Negative 2: answer correct but state wrong (route not on forecast page)
("ColdestDayIn15_wrong_route", lambda: (
    task := _tasks_module.ColdestDayIn15(city="成都"),
    _make_input(BASE_STATE, BASE_STATE, route=DEFAULT_ROUTE,
                answer=_realistic_answer(task, task.get_answer(...))),
)),
```

**Forbidden**:
- Using random data unrelated to `defaults.json` in positive cases; the state must be plausible
- Negative cases that only change the spelling of `answer`; test **semantic** errors instead (wrong person, wrong value)
- Sharing a single builder for both positive and negative; each case must be constructed independently for clarity

#### 4.3.2 AnswerTask positive `answer` must be natural language

**Don't** use the bare ground truth as a positive `answer`. The Agent will never reply with just `"多云"` or `"32"`; it says `"上海今天天气多云"` or `"现在32度"`. A bare ground truth bypasses `match_value`'s substring / numeric-extraction logic, which amounts to not testing it.

```python
# ❌ answer IS the ground truth; match_value substring trivially passes, so nothing is tested
return task, _make_input(state, state, answer="多云")

# ✅ answer mimics a real Agent reply; verifies match_value extracts correctly
return task, _make_input(state, state, answer="上海今天天气多云")
```

**Principles for natural-language answers**:
1. **Include the key ground-truth content** so that `match_value` matches (numbers appear in full; keywords appear as substrings)
2. **Add reasonable context**: city name, time descriptor, units, tone words an Agent would naturally add
3. **Don't overcomplicate**: the goal is to verify the matching logic, not to simulate every possible Agent style

Prefer a helper like `_realistic_answer(task, expected)` that generates these answers uniformly over hand-writing each case.
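The guide names `_realistic_answer(task, expected)` without showing a body, so the following is only a sketch of what such a helper might look like. It assumes tasks may expose an optional `city` attribute and that a short Chinese sentence around the value is enough context; the phrasing and the attribute lookup are illustrative choices, not the project's actual helper.

```python
from types import SimpleNamespace


def _realistic_answer(task, expected) -> str:
    """Wrap a ground-truth value in a short, Agent-like sentence (hypothetical)."""
    # Use the task's city as context when it has one (assumed attribute name).
    city = getattr(task, "city", None)
    prefix = f"{city}的" if city else ""
    if isinstance(expected, (int, float)) and not isinstance(expected, bool):
        # Keep the number intact so numeric extraction can still find it.
        return f"{prefix}数值是{expected}"
    # Strings: keep the keyword as a substring of a longer reply.
    return f"{prefix}结果是{expected}"


# Stand-in tasks for demonstration only.
weather_task = SimpleNamespace(city="上海")
plain_task = SimpleNamespace()

text_answer = _realistic_answer(weather_task, "多云")
number_answer = _realistic_answer(plain_task, 32)
```

The point of the wrapper is that the positive `answer` is never equal to the bare ground truth, while the ground truth still appears in it verbatim.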

**`match_value` behavior by type** (must know when writing cases):

| Expected type | Matching | Positive answer example | Counter-example |
|---|---|---|---|
| `int/float` | Extracts standalone numbers, compares one by one | `"现在32度"` → extracts `32` ✓ | `"北京现在33度"` → extracts `33` ✗ (note: `"三十二度"` still passes ✓ via Chinese-numeral normalization) |
| `str` | `expected in normalize_text(actual)` | `"天气多云转晴"` contains `"多云"` ✓ | `"阴天"` lacks `"多云"` ✗ |
| `re.Pattern` | `expected.search(normalize_text(actual))` | `"温度差不多"` matches `r"一样\|相同\|差不多"` ✓ | `"温度接近"` ✗ |

#### 4.3.3 Negative-case pattern catalog

**A negative case should simulate a realistic Agent mistake**, not a clearly impossible input. The Agent is a VLM: it sees the screen and decides, and its errors follow patterns. The tables below enumerate common error modes per task type. **Every negative case must use one of these patterns**; don't just write `answer="错误答案"` for everything.

##### query task negative patterns

| Error pattern | Description | Construction |
|---|---|---|
| **Wrong target** | Agent picked the wrong row/card/city | Use the correct value of a different entity (e.g., asked for Beijing's temperature, fill in Shanghai's) |
| **Close but wrong** | Agent saw the right spot but misread | Ground truth ±1 or similar (e.g., correct is 32, answer is `"北京现在33度"`) |
| **Synonym but different meaning** | Agent used a near-synonym whose meaning differs | Replace with a near-synonym that doesn't match (e.g., gt=`"多云"`, answer=`"今天阴天"`) |
| **Verbose answer with distractor numbers** | Agent reads every number on the page | Multiple numbers, with the ground truth **missing** (e.g., correct 40%, answer `"气温32度,风力3级,紫外线指数7"`) |
| **Chinese numerals** | Agent uses Chinese numerals; whether the case is positive or negative depends on correctness | Positive variant: `answer="北京现在二十度"`; negative: wrong Chinese numeral |
| **Boolean flipped** | Agent says the opposite (note that `"通过"` is a substring of `"未通过"`) | If gt is affirmative, fill in a negation (`"没有通过核验"`) |
| **Empty answer** | Agent declared COMPLETE without answering | `answer=None` or `answer=""` |

```python
# ✅ Wrong target: asked Beijing 20°C, Agent answered Shanghai 28°C
("CheckCurrentTemp_wrong_city", lambda: (
    _tasks_module.CheckCurrentTemp(city="北京"),
    _make_input(BASE_STATE, BASE_STATE, answer="上海现在28度"),
)),

# ✅ Close but wrong: correct 20°C, Agent says 21°C
("CheckCurrentTemp_off_by_one", lambda: (
    _tasks_module.CheckCurrentTemp(city="北京"),
    _make_input(BASE_STATE, BASE_STATE, answer="北京现在21度"),
)),

# ✅ Distractor numbers: correct is humidity 40, Agent rattles off other numbers but never 40
("CheckDetailCard_noise", lambda: (
    _tasks_module.CheckDetailCard(city="北京", metric="humidity"),
    _make_input(BASE_STATE, BASE_STATE, answer="北京气温20度,风力3级,紫外线指数7"),
)),
```

##### operate task negative patterns

| Error pattern | Description | Construction |
|---|---|---|
| **Did nothing** | Agent didn't act | `curr_state` equals `init_state` |
| **Reversed operation** | Agent interpreted "off" as "on" or vice versa | Set the target field to the opposite value |
| **Wrong target** | Acted on the wrong object | Modify a different field of the same kind (e.g., changed wind unit instead of temperature unit) |
| **Partial completion** | Sequential/deep-dive task where only the first step was done | Modify only the first criteria field |

```python
# ✅ Reversed: should enable night DND, Agent disabled it instead
("EnableNightDnd_inverted", lambda: (
    _tasks_module.EnableNightDnd(),
    _make_input(BASE_STATE, _with_settings(nightDnd=False)),
)),

# ✅ Wrong target: should switch temp unit, Agent switched wind unit
("SwitchTempUnit_wrong_field", lambda: (
    _tasks_module.SwitchTempUnit(unit="fahrenheit"),
    _make_input(BASE_STATE, _with_settings(windUnit="ms")),  # wrong field
)),

# ✅ Partial: SwitchUnitAndReport changes a unit and answers; only changed the unit
("SwitchUnitAndReport_partial", lambda: (
    _tasks_module.SwitchUnitAndReport(city="上海"),
    _make_input(BASE_STATE, _with_settings(tempUnit="celsius")),  # only changed temp unit
)),
```

##### crossapp task negative patterns

| Error pattern | Description | Construction |
|---|---|---|
| **Source app done, target app untouched** | Agent acted only in the source app and forgot to switch | Source app state correct; target app at initial |
| **Wrong info passed** | Agent read the source app correctly but typed something wrong into the target | Target app has new data, but its content doesn't match the source |
| **Neither app acted** | Agent got lost in navigation | All app states at initial |

```python
# ✅ Source done but target untouched: weather share to WeChat; only checked weather, didn't send message
("WeatherShareForecast_no_send", lambda: (
    _tasks_module.WeatherShareForecast(),
    _make_input(
        {"weather": init_weather, "wechat": init_wechat},
        {"weather": init_weather, "wechat": init_wechat},  # WeChat unchanged
    ),
)),
```

**Rule**: each task's negative case **must use one of the patterns matching the task type**. When the judging logic is complex (multi-field, cross-app), cover **multiple** patterns. Don't fall back to `answer="错误答案"` or `curr_state=init_state` for every case.

#### 4.3.4 `match_value` edge-case coverage

`match_value` is the core function that matches Agent replies. Each suite must cover at least one of the following edge cases (via an extra positive or negative case):

| Edge case | Risk | Test requirement |
|---|---|---|
| **Distractor numbers** | Agent says "今天32度,明天28度"; when gt=28, 32 is also in the text | Positive: answer has multiple numbers including gt, verify the match passes; negative: answer has multiple numbers **without** gt |
| **Chinese numerals** | Agent uses "二十三" instead of "23" | At least 1 positive uses a Chinese-numeral answer (e.g., `answer="北京现在二十度"`) |
| **Empty answer** | Agent gave no answer | At least 1 AnswerTask negative uses `answer=None`, confirming a FAIL verdict rather than an exception |
| **Substring trap** | str match: `"通过" in "未通过"` is True | For yes/no queries, negatives must test the case where the negation contains the affirmation |
| **Trailing zero formatting** | gt=278.2, Agent says "278.20元" | For AnswerTasks with decimal amounts, the positive should include a trailing-zero variant (e.g., `"总共278.20元"`) |

```python
# ✅ Chinese-numeral positive
("CheckCurrentTemp_chinese_num", lambda: (
    _tasks_module.CheckCurrentTemp(city="北京"),
    _make_input(BASE_STATE, BASE_STATE, answer="北京现在二十度"),
)),

# ✅ Empty-answer negative
("CheckBalance_empty_answer", lambda: (
    _tasks_module.CheckBalance(),
    _make_input(BASE_STATE, BASE_STATE, answer=None),
)),

# ✅ Distractor positive (gt=40, answer has 20 and 40)
("CheckDetailCard_multi_number", lambda: (
    _tasks_module.CheckDetailCard(city="北京", metric="humidity"),
    _make_input(BASE_STATE, BASE_STATE, answer="北京气温20度,湿度40%"),
)),
```

**Rule**: these edge cases can be added as extra positives/negatives in `OFFLINE_JUDGE_POSITIVE_CASES` / `OFFLINE_JUDGE_NEGATIVE_CASES` (named `"TaskName_suffix"` to distinguish them from the main case). They don't need to apply to every task; covering them on a representative task in the suite is sufficient. The completeness check (`test_offline_judge_matrix_complete`) still requires only one main positive and one main negative per task.
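Note that a completeness check written with strict set equality, as in the 4.3 snippet, would reject suffixed extras such as `"CheckCurrentTemp_chinese_num"`. One possible way to reconcile the two, sketched below, is to compare only main-case names and separately flag names that match no task. This assumes task class names contain no underscore, so everything before the first `_` is the task name; `main_case_names` and `unknown_case_names` are hypothetical helpers, not existing bench_env functions.

```python
def main_case_names(cases, task_names):
    """Names in `cases` that are main cases, i.e. exactly equal to a task name."""
    return {name for name, _ in cases if name in task_names}


def unknown_case_names(cases, task_names):
    """Case names that are neither a task name nor '<TaskName>_<suffix>'."""
    unknown = set()
    for name, _ in cases:
        base = name.split("_", 1)[0]  # assumption: task names have no underscore
        if name not in task_names and base not in task_names:
            unknown.add(name)
    return unknown


# Illustrative data; factories are omitted (None) because only names matter here.
TASK_NAMES = {"CheckBalance", "CheckCurrentTemp"}
CASES = [
    ("CheckBalance", None),
    ("CheckCurrentTemp", None),
    ("CheckCurrentTemp_chinese_num", None),  # extra edge-case row
    ("Typo_case", None),                     # matches no task
]

main = main_case_names(CASES, TASK_NAMES)
unknown = unknown_case_names(CASES, TASK_NAMES)
```

A completeness test built this way would assert `main == TASK_NAMES` and `unknown == set()` for both the positive and the negative list.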

#### 4.3.5 Multi-format tests for structured values (time, duration)

The Agent is a pure-vision model: after reading the screen, it phrases the answer in **natural language**. A single structured value (time, duration) may be expressed in multiple **semantically equivalent but differently formatted** ways. `match_value`'s substring containment can't match these variants; you must use the framework's semantic matchers and cover multiple formats in tests.

**Common equivalent expressions from the Agent**:

| Internal format | Agent variants | `match_value` matches? |
|---|---|---|
| `"09:54"` | "9点54分", "上午9点54分", "上午9:54" | ✗ (all fail) |
| `"13:10"` | "下午1点10分", "1点10分", "13:10" | only exact ✓ |
| `"0小时59分"` | "59分钟", "59分", "不到1小时" | ✗ (all fail) |
| `"1小时10分"` | "70分钟", "1小时10分钟", "1:10" | only exact ✓ |

**Semantic matchers provided by the framework**:

| Matcher | Use | Principle |
|---|---|---|
| `match_duration(expected, actual)` | Duration | Normalize both sides to total minutes |
| `match_time(expected, actual)` | Time of day | Normalize to (h, m); supports 12/24-hour and 上午/下午 prefixes |

**Test requirement**: when a task uses `match_duration` / `match_time` (or a similar semantic matcher), **add multi-format positive tests** to verify that the matcher actually covers the Agent's variants.

**Mental model for constructing multi-format answers** (think like the Agent):

1. **What did the Agent see?** Was the screen showing "09:54", "0小时59分", or some other format?
2. **How would the Agent transcribe it?** A human seeing "09:54" naturally says "上午9点54分" or "9:54", not literally "09:54"
3. **List the equivalent expressions.** How many natural Chinese / numeric forms exist for the same value? Write at least one positive per form
4. **The negative must be a semantic error**: a truly wrong value (e.g., "10:30" ≠ "09:54"), **not** another format of the same value

**Recommended pattern**: in the Live/Offline test class, use a standalone `@pytest.mark.parametrize` to test multi-format positives:

```python
@pytest.mark.parametrize(
    "answer",
    [
        "G7010,1小时10分,上海虹桥,13:10",  # exact
        "最快的车是G7010, 70分钟, 始发站上海虹桥, 下午1点10分到达",  # natural Chinese
        "G7010,70分钟,上海虹桥,下午1:10",  # mixed
        "G7010,1小时10分钟,上海虹桥,13:10到",  # suffix variant
    ],
    ids=["exact", "chinese_natural", "mixed_format", "suffix_variant"],
)
async def test_fastest_train_flexible_answer_formats(self, env, answer):
    """Agent answers in any natural format should pass."""
    task = _tasks_module.QueryFastestTrainDetails(
        from_station="上海", to_station="南京", date="2026-03-20",
    )
    inp = await self._setup_query_task(env, task)
    result = task.evaluate(
        JudgeInput(init_obs=inp.init_obs, last_obs=inp.last_obs, answer=answer)
    )
    assert result.success, f"Flexible format failed: {result.issues}"
```

**Rules**:
- AnswerTasks with time/duration answers **must** include at least 2 format variants in positives
- Multi-format positives live outside the main positive/negative matrix (they don't affect `test_*_judge_matrix_complete`)
- When new structured value types appear (distance with units, temperature with units), add the matching semantic matcher in `common_tasks.py` and the multi-format tests at the same time

### 4.4 Live tests (`TestLiveQueryTasks`)

Only for tasks whose judge depends on runtime simulator state:

```python
@pytest.mark.live
@pytest.mark.asyncio(loop_scope="session")
class TestLiveQueryTasks:
    async def _setup_query_task(self, env, task: BaseTask) -> JudgeInput:
        task._suite = ""
        init_obs = await task.setup(env)
        await self._inject_data(env)  # inject test data
        last_obs = await env.get_observation()
        return JudgeInput(init_obs=init_obs, last_obs=last_obs)

    @pytest.mark.parametrize("task_name,task_factory,answer", LIVE_POSITIVE_CASES)
    async def test_positive_case(self, env, task_name, task_factory, answer):
        task = task_factory()
        inp = await self._setup_query_task(env, task)
        result = task.evaluate(JudgeInput(
            init_obs=inp.init_obs, last_obs=inp.last_obs, answer=answer,
        ))
        assert result.success
```

**Live tests also need the completeness check** to ensure `LIVE_JUDGE_TASK_NAMES` covers everything.

## 5. State-builder helper conventions

Each suite's test file typically needs local helpers to build test state:

```python
# Module-level constants
DEFAULT_ROUTE = {"app": "", "path": "/"}
TEST_OS_STATE = {"time": {"timestamp": 1742025600000}}

# Wrap make_judge_input to avoid repeating the apps/os wrapping
def _make_task_input(init_state, curr_state, *, route=None, answer=None) -> JudgeInput:
    return make_judge_input(
        {"apps": {"": init_state}, "os": TEST_OS_STATE},
        {"apps": {"": curr_state}, "os": TEST_OS_STATE},
        route=route or DEFAULT_ROUTE,
        answer=answer,
    )
```

**Rules**:
- Helpers are prefixed with `_` to mark them private
- State-building helpers (e.g., `_booking_order()`) are for complex operate tasks, to avoid repeating large dict literals in cases
- **No judging logic inside helpers**: helpers build data; judging stays in `task.evaluate()`

## 6. Run commands

```bash
# Offline only (no simulator needed)
pytest bench_env/tests/ -m "not live" -v

# Offline for a single suite
pytest bench_env/tests/test_weather.py -m "not live" -v

# Full suite (simulator must run at localhost:3000)
pytest bench_env/tests/ -v

# Custom simulator URL
pytest bench_env/tests/ --sim-url http://localhost:3001

# Live only
pytest bench_env/tests/ -m live -v
```

## 7. New-suite test setup

1. Create `bench_env/tests/test_.py`
2. Copy the task-discovery scaffolding (`TaskRegistry()._load_suite_tasks("")` + `ALL_TASK_CLASSES`)
3. Load `defaults.json`
4. Implement `TestTaskDefinitions` (reuse the template; change imports and app name)
5. Implement `TestAccessor` (cover every public property/method of `app.py`)
6. Write `_xxx_positive_case()` / `_xxx_negative_case()` for every offline task
7. Collect them into `OFFLINE_JUDGE_POSITIVE_CASES` / `OFFLINE_JUDGE_NEGATIVE_CASES`
8. Implement `TestTaskJudgeMatrixOffline` (with completeness check)
9. If you have Live tasks, implement `TestLiveQueryTasks` (with completeness check)
10. Run `pytest bench_env/tests/test_.py -m "not live" -v` to verify

## 8. Configuration

`bench_env/tests/pytest.ini`:

```ini
[pytest]
asyncio_mode = auto
addopts = -n 3
required_plugins = pytest-xdist
```

Dependencies (`pip install`):
- `pytest`
- `pytest-asyncio`
- `pytest-xdist` (parallel runs via `-n 3`)
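
The run commands rely on a `--sim-url` option and a `live` marker, but this guide does not show how `conftest.py` declares them. The sketch below shows one conventional way to do so with pytest's standard `pytest_addoption` and `pytest_configure` hooks; the help text, marker description, and default URL wiring are assumptions for illustration, not the project's actual `conftest.py`.

```python
# Hypothetical conftest.py fragment; pytest discovers these hooks by name.

DEFAULT_SIM_URL = "http://localhost:3000"  # default simulator address from this guide


def pytest_addoption(parser):
    # Registers `--sim-url` so that `pytest --sim-url http://localhost:3001` is accepted.
    parser.addoption(
        "--sim-url",
        action="store",
        default=DEFAULT_SIM_URL,
        help="Base URL of the simulator used by live tests",
    )


def pytest_configure(config):
    # Declares the `live` marker so `-m live` / `-m "not live"` select cleanly
    # and pytest does not warn about an unknown marker.
    config.addinivalue_line(
        "markers", "live: test needs a running simulator"
    )
```

A session fixture could then read the value with `request.config.getoption("--sim-url")` when it constructs the environment.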