# Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

[![arXiv](https://img.shields.io/badge/arXiv-2605.30621-b31b1b.svg)](https://arxiv.org/abs/2605.30621)

Official implementation for **"Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents."**

If you find this work helpful, please consider citing our paper:

```bibtex
@article{lin2026harness,
  title={Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents},
  author={Lin, Minhua and Wu, Juncheng and Wang, Zijun and Shi, Zhan and Sang, Yisi and He, Bing and Liu, Zewen and Wei, Tianxin and Wu, Zongyu and Zhang, Zhiwei and Wang, Dakuo and Zhang, Xiang and Dumoulin, Benoit and Xie, Cihang and Zhou, Yuyin and Wang, Suhang and Lu, Hanqing},
  journal={arXiv preprint arXiv:2605.30621},
  year={2026}
}
```

## Table of Contents

- [Overview](#overview)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Dataset Preparation](#dataset-preparation)
- [Artifact Entry Points](#artifact-entry-points)
- [Reproducing the Paper Experiments](#reproducing-the-paper-experiments)
  - [Exp0: Evolver-side Analysis (harness-updating)](#exp0-evolver-side-analysis-harness-updating)
  - [Exp1: Agent-side Analysis (harness-benefit)](#exp1-agent-side-analysis-harness-benefit)
- [Models](#models)
- [Citation](#citation)

## Overview

LLM agents are increasingly deployed as systems built around an editable external **harness**: prompts, skills, memories, and tools that shape task execution without changing model parameters. *Harness self-evolution* adapts such an agent by updating this harness from execution evidence.
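
As a rough illustration of the solve-then-update loop described above, here is a minimal, self-contained sketch. It is not the repository's `UnifiedEngine`: the `Harness` dataclass, `solve`, `evolve`, and `run_loop` are hypothetical names invented for this example, and the "agent" and "evolver" are toy stand-ins with no LLM calls.

```python
from dataclasses import dataclass, field


@dataclass
class Harness:
    """Hypothetical container for the editable, non-parametric parts of an agent."""
    prompts: dict[str, str] = field(default_factory=dict)
    skills: dict[str, str] = field(default_factory=dict)
    memories: list[str] = field(default_factory=list)
    tools: dict[str, str] = field(default_factory=dict)


def solve(harness: Harness, task: str) -> dict:
    """Toy agent: succeeds only when the harness holds a skill for the task."""
    return {"task": task, "success": task in harness.skills}


def evolve(harness: Harness, evidence: list[dict]) -> Harness:
    """Toy evolver: persists one new skill per failed task; no model weights change."""
    skills = dict(harness.skills)
    for record in evidence:
        if not record["success"]:
            skills[record["task"]] = f"Procedure learned from a failure on {record['task']}"
    return Harness(dict(harness.prompts), skills, list(harness.memories), dict(harness.tools))


def run_loop(harness: Harness, batches: list[list[str]]) -> tuple[Harness, list[float]]:
    """Solve each batch with the current harness, then update it from the evidence."""
    success_rates = []
    for batch in batches:
        evidence = [solve(harness, task) for task in batch]
        success_rates.append(sum(r["success"] for r in evidence) / len(batch))
        harness = evolve(harness, evidence)
    return harness, success_rates


final, rates = run_loop(Harness(), [["fix-import", "parse-date"], ["fix-import", "parse-date"]])
print(rates)
```

In this toy setup the update is always useful and always followed, so the second batch succeeds. Real evolvers and agents can fall short on either side of the loop.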

*Figure: Overview of harness-updating and harness-benefit findings.*

We separate two capabilities in this loop:

1. **Harness-updating**, exercised by the evolver, is the capability to produce useful persistent harness updates from evidence.
2. **Harness-benefit**, exercised by the agent, is the capability to use updated harnesses during task solving.

We pair seven LLMs as agents and evolvers across three agentic benchmarks (SWE-bench Verified, MCP-Atlas, SkillsBench). The analysis reveals two findings:

- **Harness-updating is flat in base capability.** Models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus 4.6.
- **Harness-benefit is non-monotonic in base capability.** Weak-tier models benefit little from updated harnesses, mid-tier models benefit most, and strong-tier models benefit less than mid-tier models. We trace the low gains at the weak tier to two failure modes: weak-tier models may fail to activate relevant harness artifacts, or may activate them but fail to follow them faithfully (measured by the Harness-Following Rate).

The results suggest that scaling the task-solving agent matters more than scaling the evolver, and that future agent training should directly target harness invocation and long-horizon instruction following.

## Project Structure

```
.
├── agent_evolve/                                   # Core library
│   ├── algorithms/unified/                         # UnifiedEngine: readers, operators, verifiers (the evolution engine)
│   ├── agents/{swe,mcp,skillbench}/                # Task-solving agents for the three benchmarks
│   ├── benchmarks/{swe_verified_mini,mcp_atlas,skillbench}/  # Benchmark adapters
│   ├── engine/                                     # Evolution loop, history, observer, trial runner
│   ├── llm/                                        # Bedrock / OpenAI-compatible LLM backends
│   └── contract/                                   # Harness (workspace) contract: manifest, schema
│
├── examples/
│   ├── swe_examples/                               # SWE-bench Verified runners (*_unified.py) + baseline (solve_all.py)
│   ├── mcp_examples/                               # MCP-Atlas runner (run_adaptive_evolve_all_unified.py) + baseline
│   ├── skillbench_examples/                        # SkillsBench runners (skillbench_evolve_*_unified.py)
│   ├── configs/                                    # Per-benchmark evolution configs (swe / mcp / skillbench)
│   └── harness-evolution/                          # Paper experiment orchestration
│       ├── run_exp0_unified_insitu.py              # Exp0: evolver-side (harness-updating)
│       ├── run_exp1_unified_insitu.py              # Exp1: agent-side (harness-benefit)
│       ├── run_exp1.py                             # Exp1 train/test split variant
│       ├── _region_picker.py + model_region_availability.json  # model nickname -> provider id / region
│       ├── scripts/                                # Single-seed sweep launchers + status dashboard
│       └── hfr_analysis/                           # Agent-side HFR diagnostic pipeline
│
├── seed_workspaces/{swe,mcp,skillbench-upstream-parity}/  # Initial harnesses (H_0)
├── docs/                                           # Unified engine + benchmark setup docs
├── tests/                                          # Unit tests (unified engine, scaffolding, isolation)
├── pyproject.toml
├── Makefile
└── .env.example
```

## Installation

Requires Python 3.11+.

```bash
git clone -b release/harness-evolution https://github.com/A-EVO-Lab/a-evolve.git
cd a-evolve

# Create environment
conda create -n aevolve python=3.11 -y
conda activate aevolve

# Editable install with the three paper benchmarks + dev tools
pip install -e ".[swe,mcp,skillbench,dev]"
# or, all extras: pip install -e ".[all,dev]"
```

Optional-dependency extras: `swe` (SWE-bench Verified), `mcp` (MCP-Atlas), `skillbench` (SkillsBench), `all`, `dev` (pytest/ruff). `make install` runs `pip install -e ".[all,dev]"`.

### Runtime Configuration

Copy the example file and fill in the credentials required by the model providers and benchmark backends you use:

```bash
cp .env.example .env
```

The paper scripts use short model nicknames such as `opus46`, `sonnet46`, and `qwen235b`. These are resolved by `examples/harness-evolution/model_region_availability.json` and `_region_picker.py`. Benchmark-specific requirements are documented in `docs/mcp-atlas-demo.md` and `docs/skillbench-setup.md`.

## Dataset Preparation

- **SWE-bench Verified** (`swe`): the HuggingFace dataset `MariusHobbhahn/swe-bench-verified-mini` (or `princeton-nlp/SWE-bench_Verified`) downloads on first use. Each task runs in a SWE-bench Docker image pulled on demand, so a running Docker daemon is required. Seed harness: `seed_workspaces/swe/`.
- **MCP-Atlas** (`mcp`): the HuggingFace dataset `ScaleAI/MCP-Atlas` downloads on first use. Tasks run against MCP servers in a container (`--docker-image`, or `--external-container-url` for a pre-running container) and require per-server API keys in a `.env`. Evaluation uses an LLM judge (Gemini 2.5 Pro via LiteLLM by default). Seed harness: `seed_workspaces/mcp/`. See `docs/mcp-atlas-demo.md`.
- **SkillsBench** (`skillbench`): tasks are auto-cloned from `https://github.com/benchflow-ai/skillsbench` (pinned commit `828bb921...`) into `~/.cache/agent-evolve/skillbench/` on first use; set `SKILLBENCH_REPO_DIR` to use a local clone.
  Seed harness: `seed_workspaces/skillbench-upstream-parity/`. See `docs/skillbench-setup.md`.

## Artifact Entry Points

Each benchmark runner copies an initial harness, solves a task stream, updates the harness between batches, and writes metrics plus the evolved `workspace/` to the output directory. Pass `--help` to any runner for the full argument list.

**SWE-bench Verified:**

```bash
python examples/swe_examples/evolve_sequential_unified.py \
  --cycles 3 --limit 50 --batch-size 5 --parallel 5 --feedback minimal \
  --model-id us.anthropic.claude-opus-4-6-v1 --region us-west-2 \
  --dataset MariusHobbhahn/swe-bench-verified-mini \
  --seed-workspace seed_workspaces/swe --output-dir logs/swe_evolve
```

No-evolve baseline:

```bash
python examples/swe_examples/solve_all.py \
  --dataset MariusHobbhahn/swe-bench-verified-mini \
  --model-id us.anthropic.claude-opus-4-6-v1 \
  --workers 5 --limit 50 --output-dir logs/swe_baseline
```

**MCP-Atlas:**

```bash
# Evolution run:
python examples/mcp_examples/run_adaptive_evolve_all_unified.py \
  --cycles 3 --batch-size 30 --limit 500 \
  --solver-model us.anthropic.claude-opus-4-6-v1 \
  --judge-model us.anthropic.claude-sonnet-4-6 --region us-west-2 \
  --docker-image --env-file .env \
  --dataset ScaleAI/MCP-Atlas --seed-workspace seed_workspaces/mcp \
  --output-dir logs/mcp_evolve

# Baseline run:
python examples/mcp_examples/adaptive_evolve_baseline.py \
  --limit 500 --batch-size 30 --workers 5 \
  --solver-model us.anthropic.claude-opus-4-6-v1 \
  --judge-model us.anthropic.claude-sonnet-4-6 \
  --docker-image --env-file .env \
  --seed-workspace seed_workspaces/mcp --output-dir logs/mcp_baseline
```

**SkillsBench:**

```bash
python examples/skillbench_examples/skillbench_evolve_split_unified.py \
  --evolve-limit 20 --batch-size 1 --max-cycles 1 \
  --train-parallel 1 --test-parallel 5 \
  --model-id us.anthropic.claude-opus-4-6-v1 --region us-west-2 \
  --feedback-level tests \
  --seed-workspace seed_workspaces/skillbench-upstream-parity \
  --run-dir logs/skillbench_split
```

## Reproducing the Paper Experiments

The main experiment drivers live in `examples/harness-evolution/`. They expose the two comparisons studied in the paper: Exp0 varies the evolver to measure harness-updating, while Exp1 varies the task-solving agent to measure harness-benefit.

Run the commands below from the repository root. Model nicknames such as `opus46`, `sonnet46`, and `qwen235b` are listed in [Models](#models); `--evolver none` runs the no-evolution baseline.

### Exp0: Evolver-side Analysis (harness-updating)

Fix the agent over the anchor set and vary the evolver, isolating each evolver's harness-updating capability.

```bash
# One cell:
python examples/harness-evolution/run_exp0_unified_insitu.py \
  --solver opus46 --evolver qwen35_9b --benchmark mcp --seed 42 \
  --region-strategy hash --output-root results/exp0_unified_insitu

# Full single-seed sweep:
bash examples/harness-evolution/scripts/phase0_unified_insitu_single_seed.sh
```

### Exp1: Agent-side Analysis (harness-benefit)

Fix the evolver over the anchor set and vary the agent, isolating each agent's harness-benefit.

```bash
# One cell (in-situ route, comparable cell-by-cell with Exp0):
python examples/harness-evolution/run_exp1_unified_insitu.py \
  --solver qwen32b --evolver opus46 --benchmark sb --seed 42 \
  --region-strategy hash --output-root results/exp1_unified_insitu

# Full single-seed sweep + progress dashboard:
bash examples/harness-evolution/scripts/phase1_unified_insitu_single_seed.sh
bash examples/harness-evolution/scripts/check_status.sh
```

`run_exp1.py` provides a train/test split route (evolve on a subset, evaluate on a held-out slice) for unbiased harness-benefit estimation.

#### Harness-Following Rate (HFR) Diagnostic

As an agent-side diagnostic, `hfr_analysis/pipeline.py` measures how faithfully an agent follows the procedural instructions in an evolved SkillsBench skill.
The pipeline extracts per-skill rubrics, judges trajectory adherence, and computes mechanical proxies such as retrieval-to-use gap, early termination, and answer-before-validation.

```bash
python examples/harness-evolution/hfr_analysis/pipeline.py --max-workers 4 --stages 1,2,4
```

## Models

We pair the following six LLMs as both task-solving agents and evolvers: Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Qwen3-235B-A22B, Qwen3-32B, and GPT-OSS-120B. We additionally use Qwen3.5-9B as an evolver to test whether a substantially smaller open model can still produce useful harness updates.

| Nickname | Model |
|----------|-------|
| `opus46` | Claude Opus 4.6 |
| `sonnet46` | Claude Sonnet 4.6 |
| `haiku45` | Claude Haiku 4.5 |
| `qwen235b` | Qwen3-235B-A22B |
| `qwen32b` | Qwen3-32B |
| `qwen35_9b` | Qwen3.5-9B |
| `gptoss120b` | GPT-OSS-120B |
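
To make the pairing concrete, the sketch below enumerates Exp0-style cells (solver fixed, evolver varied) and assembles command lines from the flags shown in the single-cell example earlier in this README. It rests on several assumptions: the helper `exp0_cells` is hypothetical, the anchor solver set is taken to be just `opus46`, the benchmark ids are limited to the two that appear in this README (`mcp`, `sb`), and `sb` is assumed to be accepted by the Exp0 driver as well. The sweep scripts under `examples/harness-evolution/scripts/` remain the authoritative way to launch the full grid.

```python
import shlex

# Nicknames copied from the table above.
MODELS = {
    "opus46": "Claude Opus 4.6",
    "sonnet46": "Claude Sonnet 4.6",
    "haiku45": "Claude Haiku 4.5",
    "qwen235b": "Qwen3-235B-A22B",
    "qwen32b": "Qwen3-32B",
    "qwen35_9b": "Qwen3.5-9B",
    "gptoss120b": "GPT-OSS-120B",
}


def exp0_cells(anchor_solvers, evolvers, benchmarks, seed=42):
    """Yield one argv list per (benchmark, solver, evolver) cell."""
    for benchmark in benchmarks:
        for solver in anchor_solvers:
            for evolver in evolvers:
                yield [
                    "python", "examples/harness-evolution/run_exp0_unified_insitu.py",
                    "--solver", solver, "--evolver", evolver,
                    "--benchmark", benchmark, "--seed", str(seed),
                    "--region-strategy", "hash",
                    "--output-root", "results/exp0_unified_insitu",
                ]


# Assumption: one anchor solver; "none" is the no-evolution baseline evolver.
cells = list(exp0_cells(["opus46"], ["none", *MODELS], ["mcp", "sb"]))

# Print the first two commands for inspection instead of executing them.
for argv in cells[:2]:
    print(shlex.join(argv))
```

An Exp1-style grid would swap the roles: fix the evolver over the anchor set, vary `--solver`, and point at `run_exp1_unified_insitu.py`.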