# AgentBench Harness

This folder contains the AgentBench-style harness used in the paper. It is kept as a separate subproject because it uses a Docker-based task environment and a Python 3.9 dependency stack.

The paper experiments in this folder use the following AgentBench tasks:

- ALFWorld
- DBBench
- OS
- WebShop

## Installation

```bash
cd AgentBench
conda create -n agent-bench python=3.9
conda activate agent-bench
pip install -r requirements.txt
```

Docker is required for the task workers:

```bash
docker ps
```

## Configure the Agent

Configure the OpenAI-compatible agent endpoint in:

```text
configs/agents/api_agents.yaml
```

The public repository uses local OpenAI-compatible defaults such as `Bearer EMPTY` for self-hosted model servers. If you use a commercial provider, replace the endpoint and authorization value locally before running experiments. Do not commit private API keys or private service URLs.

For the Qwen agent profile used by our harness runs, edit:

```yaml
qwen3-4b-instruct:
  parameters:
    url: "http://localhost:30001/v1/chat/completions"
    headers:
      Authorization: "Bearer EMPTY"
    body:
      model: "openai/Qwen/Qwen3-4B-Instruct"
      max_tokens: 4096
      temperature: 0.0
```

You can test whether the agent configuration is valid with:

```bash
python -m src.client.agent_test --config configs/agents/api_agents.yaml --agent qwen3-4b-instruct
```

## Build Docker Images

The OS and DBBench environments require base images. Build them before starting the Docker Compose services:

```bash
docker pull mysql:8
docker pull ubuntu
docker build -f data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles --tag local-os/default
docker build -f data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles --tag local-os/packages
docker build -f data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles --tag local-os/ubuntu
```

## Start Task Services

Task workers are launched through Docker Compose.
Start Redis, the controller, and the task worker needed for the benchmark you are running:

```bash
# ALFWorld
docker compose -f extra/docker-compose.yml up -d --force-recreate redis controller alfworld-std

# DBBench
docker compose -f extra/docker-compose.yml up -d --force-recreate redis controller dbbench-std

# OS
docker compose -f extra/docker-compose.yml up -d --force-recreate redis controller os-std

# WebShop
docker compose -f extra/docker-compose.yml up -d --force-recreate redis controller webshop-std
```

If you modify task configuration files, restart the corresponding Docker Compose services so workers reload the new configuration.

WebShop can take several minutes to become ready after Docker reports that the container has started. If the first run cannot find the WebShop worker, wait and retry.

## Run Evaluations

The assignment files specify the task split, concurrency, output directory, and number of trials.

```bash
# ALFWorld
python -m src.assigner --config configs/assignments/alfworld.yaml

# DBBench
python -m src.assigner --config configs/assignments/dbbench.yaml

# OS
python -m src.assigner --config configs/assignments/os.yaml

# WebShop
python -m src.assigner --config configs/assignments/webshop.yaml
```

Useful configuration locations:

- `configs/agents/api_agents.yaml`: model endpoint, model name, sampling, and max tokens.
- `configs/tasks/*.yaml`: task split, max steps/rounds, and harness switches.
- `configs/assignments/*.yaml`: selected task profile, concurrency, trials, and output directory.

## Harness Switches

The harness switches correspond to the four layers described in the paper:

- `h2`: **Action Realization Layer**. Repairs malformed or recoverable actions, validates tool calls before execution, and blocks invalid actions when needed.
- `h3`: **Environment Contract Layer**. Embeds environment-specific tool-use constraints into tool descriptions so the agent sees the correct contract when deciding how to call tools.
- `h4`: **Trajectory Regulation Layer**. Monitors post-execution state, detects repeated failures or stagnation, and manages the remaining step/round budget.
- `h5`: **Procedural Skill Layer**. Retrieves and injects task-relevant procedural skills or hints, controlled by settings such as `h5_top_k`.

`enabled` is the master switch. If `enabled=false`, the H2/H3/H4/H5 settings do not take effect even if their individual flags are set to `true`. In AgentBench task configs this master switch may appear either as top-level `enabled` or as nested `harness.enabled`, depending on the task.

For harness ablations, edit the task config and enable or disable one harness level at a time (`h2`, `h3`, `h4`, `h5`) while keeping the remaining levels fixed.

## Evolving the Harness

Harness evolution is an iterative code-editing loop. After each evaluation run, give a CLI coding agent, such as Codex CLI, the current harness implementation, the previous iteration's trajectories, and the harness design guide. The agent should inspect recurring deterministic interface failures and directly modify the harness code; it should not stop at producing an analysis report.

Run the CLI agent from this suite directory and point it to the local guide file instead of pasting the full design rules into the prompt.

```bash
HARNESS_DIR=src/server/harness
TRAJECTORY_DIR=
DESIGN_GUIDE=Harenss.md

codex "
You are a coding agent responsible for improving a runtime harness for a deterministic LLM-agent environment.

Your goal is to improve task performance by adapting the runtime interface between the frozen model and the environment, without changing model weights, benchmark tasks, or environment evaluation logic.

Inputs:
- current harness implementation: ${HARNESS_DIR}
- trajectory directory from the previous iteration, including summary metrics: ${TRAJECTORY_DIR}
- harness design guide: ${DESIGN_GUIDE}

Inspect the previous iteration's trajectories and identify recurring failure patterns. For each pattern, determine the earliest lifecycle point where it can be reliably detected or prevented: before interaction, during task conditioning, before environment execution, or after execution.

Focus on mechanically identifiable deterministic failures such as invalid action formats, wrong tool conventions, missing required fields, repeated no-op actions, loops, premature submissions, budget exhaustion, or recurring procedural mistakes.

Directly implement targeted, minimal updates in the appropriate harness layer. Do not only return an analysis report. Do not use hidden oracle information, test labels, task modifications, environment transition changes, or evaluation-criteria changes.

After editing, run or recommend the narrowest regression checks available. Inspect cases where the harness may over-trigger, block a valid action, inject misleading guidance, or reduce performance on previously successful trajectories.

When finished, summarize:
1. dominant failure patterns found;
2. harness layer responsible for each update;
3. implemented code changes;
4. why each update is safe under the deterministic environment contract;
5. remaining failure modes to monitor next.
"
```

Then rerun the corresponding evaluation command and use the new trajectory directory as `TRAJECTORY_DIR` for the next iteration. Keep each update local to the harness layer that can detect the failure earliest.

## Default Evaluation Settings

| Benchmark | Agent sampling | Agent max tokens | Max steps / rounds |
| --- | --- | ---: | ---: |
| ALFWorld | temperature = 0.0 | 4096 | 50 |
| DBBench | temperature = 0.0 | 4096 | 15 |
| OS | temperature = 0.0 | 4096 | 8 |
| WebShop | temperature = 0.0 | 4096 | 20 |
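
When switching between benchmarks it can help to keep the worker names, assignment files, and default settings in one place. The following is a minimal sketch, not part of the harness: the `BENCHMARKS` layout and the helper names `service_command`, `evaluation_command`, and `check_agent_body` are hypothetical, while the commands, paths, and values are copied from this README. It only builds and prints command lines; it does not start Docker or run an evaluation.

```python
import shlex

# Values copied from the commands and the default settings table in this README.
# The dictionary layout and helper names are hypothetical, for illustration only.
BENCHMARKS = {
    "ALFWorld": {"worker": "alfworld-std", "assignment": "configs/assignments/alfworld.yaml", "max_steps": 50},
    "DBBench": {"worker": "dbbench-std", "assignment": "configs/assignments/dbbench.yaml", "max_steps": 15},
    "OS": {"worker": "os-std", "assignment": "configs/assignments/os.yaml", "max_steps": 8},
    "WebShop": {"worker": "webshop-std", "assignment": "configs/assignments/webshop.yaml", "max_steps": 20},
}

# Default agent sampling settings shared by all four benchmarks.
AGENT_DEFAULTS = {"temperature": 0.0, "max_tokens": 4096}


def service_command(benchmark):
    """Return the Docker Compose argv that starts Redis, the controller, and one task worker."""
    worker = BENCHMARKS[benchmark]["worker"]
    return [
        "docker", "compose", "-f", "extra/docker-compose.yml",
        "up", "-d", "--force-recreate", "redis", "controller", worker,
    ]


def evaluation_command(benchmark):
    """Return the assigner argv for one benchmark."""
    return ["python", "-m", "src.assigner", "--config", BENCHMARKS[benchmark]["assignment"]]


def check_agent_body(body):
    """List mismatches between an agent `body` mapping and the default sampling settings."""
    return [
        f"{key}: expected {expected!r}, got {body.get(key)!r}"
        for key, expected in AGENT_DEFAULTS.items()
        if body.get(key) != expected
    ]


if __name__ == "__main__":
    for name, spec in BENCHMARKS.items():
        print(f"# {name} (max steps/rounds: {spec['max_steps']})")
        print(shlex.join(service_command(name)))
        print(shlex.join(evaluation_command(name)))
```

Passing the `body` mapping of an agent profile to `check_agent_body` returns an empty list when the temperature and max tokens match the defaults above, which gives a quick sanity check before launching a long run.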