# SR-Agents
[Dataset](https://huggingface.co/datasets/WeihangSu/SRA-Bench) · [License](./LICENSE) · [Python](https://www.python.org/)
A benchmark and research toolkit for **skill-retrieval-augmented LLM
agents**.
Modern LLM agents increasingly rely on reusable external *skills*
(modular capability packages that pair natural-language guidance with
executable resources) to solve tasks that exceed their native
parametric ability. As skill libraries grow into the tens or hundreds
of thousands, the prevailing practice of enumerating every candidate
skill in the prompt stops scaling: context budgets fill up, and
selection accuracy degrades as the skill list grows.
**Skill Retrieval Augmentation (SRA)** is an alternative paradigm in
which the agent dynamically retrieves, incorporates, and applies
relevant skills from a large external skill library on demand. This
repository provides:
* **SRA-Bench**: 5,400 capability-intensive test instances across six
task families, each paired with manually curated gold skill(s),
embedded in a realistic skill library of 26,262 skills (636 gold +
25,626 web-collected distractors).
* **SR-Agents**: a baseline family of skill-retrieval-augmented
agents, spanning five skill-use methods (LLM Direct, Oracle Skill,
Full-Skill Injection, LLM Selection, Progressive Disclosure) and six
retrievers (BM25, TF-IDF, BGE, Contriever, Hybrid, BM25 + LLM Rerank).

*The Skill Retrieval Augmentation paradigm: the agent retrieves candidate
skills from a large external skill library, selectively incorporates
useful ones into context, and applies them for downstream reasoning
and acting.*
## SRA-Bench at a glance
**6 source datasets · 5,400 test instances · 636 gold skills** embedded
in a skill library of **26,262 skills** (2.4% gold, 25,626 web-collected
distractors).
Gold-skill annotations associate each instance with either a single
gold skill (**Single**) or multiple gold skills (**Multi**).
| Dataset | Capability Type | #Inst. | #Skills | Skill Mapping | Evaluation |
|---|---|---:|---:|---|---|
| TheoremQA | Theorem Application | 747 | 320 | Single | Rule-Based |
| LogicBench | Logical Reasoning Patterns | 760 | 19 | Single | Rule-Based |
| ToolQA | Tool-Use Workflows | 1,430 | 14 | Single | Rule-Based |
| MedCalc-Bench | Medical Calculators | 1,100 | 55 | Single | Rule-Based |
| CHAMP | Mathematical Concepts | 223 | 89 | Multi | Rule-Based |
| BigCodeBench | Software Libraries | 1,140 | 139 | Multi | Execution |
## Install
Requires Python 3.10–3.12.
```bash
pip install -e . # or: uv sync
```
Download SRA-Bench (skill corpus + test instances) from the
[HuggingFace dataset page](https://huggingface.co/datasets/WeihangSu/SRA-Bench):
```bash
huggingface-cli download WeihangSu/SRA-Bench --repo-type dataset \
  --local-dir data/bench
```
**ToolQA external corpus** (only needed to run ToolQA): download from the
[ToolQA Google Drive link](https://drive.google.com/file/d/1zRbHzPW2x4dDcfmphBWlan8cxUCRNmqk/view?usp=drive_link),
unzip, and place the result under `data/external/toolqa/`.
Inference requires an OpenAI-compatible chat endpoint. Point
`--api-base` at any compatible server (OpenAI, vLLM, SGLang, Ollama,
and so on). `--model` is the served model identifier: a model ID like
`gpt-4o-mini` for hosted APIs, or whatever string the local server is
serving (often a path like `/models/Qwen3-32B`) for vLLM/SGLang. For
endpoints that require auth, set `OPENAI_API_KEY`; local unauthenticated
servers accept any value.
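All of these servers accept the same chat-completions request shape; a minimal sketch of what such a request looks like on the wire (illustrative only, not the repository's client code):

```python
import os

def build_chat_request(model: str, system: str, user: str) -> tuple[dict, dict]:
    """Headers and JSON body for a POST to {api_base}/chat/completions."""
    headers = {
        "Content-Type": "application/json",
        # Local unauthenticated servers accept any token value.
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', 'EMPTY')}",
    }
    body = {
        "model": model,  # "gpt-4o-mini", "/models/Qwen3-32B", ...
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }
    return headers, body
```

Any HTTP client (or the `openai` Python package with `base_url` set) can send this payload, which is why the toolkit is endpoint-agnostic.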
## Quickstart
An end-to-end run of the three stages on TheoremQA using Full-Skill
Injection with the BM25 top-1 skill:
```bash
# Pick one (examples; any OpenAI-compatible endpoint works).
# Hosted: MODEL=gpt-4o-mini API_BASE=https://api.openai.com/v1
# Local : MODEL=/models/Qwen3-32B API_BASE=http://localhost:8000/v1
MODEL=
API_BASE=
# 1. Retrieve: BM25 top-50 per query.
sragents retrieve --retriever bm25 \
  --corpus data/bench/corpus/corpus.json \
  --instances data/bench/instances/theoremqa.json \
  --output results/retrieval/theoremqa-bm25.json

# 2. Infer: prepend the top-1 BM25 skill and generate an answer.
sragents infer \
  --instances data/bench/instances/theoremqa.json \
  --output results/inference/theoremqa-bm25_top1.jsonl \
  --model $MODEL --api-base $API_BASE \
  --provider topk \
  --provider-arg source=results/retrieval/theoremqa-bm25.json \
  --provider-arg k=1 \
  --engine direct --label bm25_top1

# 3. Evaluate: extract + score against ground truth.
sragents evaluate \
  --input results/inference/theoremqa-bm25_top1.jsonl \
  --instances data/bench/instances/theoremqa.json \
  --output results/eval/theoremqa-bm25_top1.json
```
The evaluator prints overall accuracy and saves a JSON of the form
```json
{
  "dataset": "theoremqa", "method": "bm25_top1", "model": "...",
  "metrics": {"accuracy": 0.XX, "correct": N, "total": 747},
  "details": [{"instance_id": "...", "extracted_answer": "...",
               "correct": true, "ground_truth": "...", ...}]
}
```
## Pipeline
The toolkit is organised as three CLI stages that pass JSON files
between them:
```
sragents retrieve  →  retrieval/*.json    ranked skills + Recall@K, nDCG@K
sragents infer     →  inference/*.jsonl   raw model output per instance
sragents evaluate  →  eval/*.json         end-task accuracy
```
Each stage's output is consumed by the next via an explicit path
argument, so any stage can be swapped or rerun in isolation, and
every stage supports per-instance resume.
`infer` is parameterised by two orthogonal axes:
* **SkillProvider** decides which candidate skills the instance receives:
none, oracle (gold), top-K retrieval, LLM-selected, or oracle plus
hard-negative distractors. The pool may already be narrowed to a
single skill, or left wide for the engine to narrow further.
* **InferenceEngine** decides how those candidates are turned into an
answer. `direct` statically prepends everything it receives;
agentic engines (`progressive_disclosure`, `react`,
`react_progressive_disclosure`) interleave further skill selection
with solving inside their reasoning loop.
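The split can be pictured as two small interfaces plus a trivial engine (illustrative signatures and field names; the real protocols live in `src/sragents/infer/base.py`):

```python
from typing import Callable, Protocol

class SkillProvider(Protocol):
    def skills_for(self, instance: dict) -> list[dict]:
        """Candidate skills to expose for this instance (possibly empty)."""
        ...

class InferenceEngine(Protocol):
    def solve(self, instance: dict, skills: list[dict]) -> str:
        """Turn the instance plus its candidate skills into a raw answer."""
        ...

class DirectEngine:
    """'direct'-style engine: statically prepend every skill it receives."""

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm

    def solve(self, instance: dict, skills: list[dict]) -> str:
        context = "\n\n".join(s["content"] for s in skills)
        return self.llm(f"{context}\n\n{instance['question']}")
```

Because the two sides only meet through a list of skill dicts, any provider can be paired with any engine, which is exactly what the method table below exploits.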
The five skill-use methods studied in our experiments are specific
(Provider, Engine) combinations:
| Method | Provider | Engine | Description |
|---|---|---|---|
| **LLM Direct** | `none` | `direct` | No external skill; parametric-only baseline |
| **Oracle Skill** | `oracle` | `direct` | Annotated gold skill prepended (upper bound) |
| **Full-Skill Injection** | `topk(k=1)` | `direct` | Full content of the BM25 rank-1 skill prepended to the prompt |
| **LLM Selection** | `llm_select(pool=50)` | `direct` | Model picks one skill from BM25 top-50, then answers |
| **Progressive Disclosure** | `topk(k=50)` | `progressive_disclosure` | Model sees a compact skill catalog and loads skills on demand |
For ToolQA the `direct` engine is replaced by `react` and
`progressive_disclosure` by `react_progressive_disclosure`; the
experiment runner selects the right engine per dataset.
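The intuition behind the progressive-disclosure engines is a catalog-then-load loop; a minimal sketch with the model stubbed as a callable (the `LOAD` action name and dict fields are illustrative, not the repository's actual protocol):

```python
from typing import Callable

def progressive_disclosure(
    model: Callable[[str], str],
    question: str,
    skills: list[dict],
    max_turns: int = 8,
) -> str:
    """Show one-line skill summaries; let the model reply 'LOAD <id>' to
    pull a skill's full content into context before answering."""
    catalog = "\n".join(f"{s['id']}: {s['summary']}" for s in skills)
    by_id = {s["id"]: s for s in skills}
    loaded: list[str] = []
    reply = ""
    for _ in range(max_turns):
        prompt = (
            f"Skill catalog:\n{catalog}\n\n"
            "Loaded skills:\n" + "\n\n".join(loaded) +
            f"\n\nQuestion: {question}"
        )
        reply = model(prompt)
        if reply.startswith("LOAD "):
            sid = reply.removeprefix("LOAD ").strip()
            if sid in by_id:
                loaded.append(by_id[sid]["content"])
        else:
            break
    return reply
```

The point of the loop is budget: only summaries of all 50 candidates enter the prompt, and full skill bodies are paid for only when the model asks for them.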
## CLI
A single `sragents` command is installed (also invokable as
`python -m sragents.cli.main`). Every subcommand has `--help`.
```bash
sragents list retrievers # bm25 tfidf bge contriever
sragents list providers # none oracle topk llm_select oracle_distractor
sragents list engines # direct progressive_disclosure react react_progressive_disclosure
sragents list datasets # theoremqa logicbench toolqa champ medcalcbench bigcodebench
sragents list experiments # main, retrieval_comparison, topk_sweep, distractor, ...
```
### 1. Retrieve
```bash
sragents retrieve \
  --retriever bm25 \
  --corpus data/bench/corpus/corpus.json \
  --instances data/bench/instances/theoremqa.json \
  --output results/retrieval/theoremqa-bm25.json \
  --top-k 50
```
Dense retrievers default to the same checkpoints used in our
experiments: `BAAI/bge-base-en-v1.5` for `bge` and
`facebook/contriever-msmarco` for `contriever`. Override with
`--retriever-arg model_path=` to swap in a different model.
Retrievers can also be cascaded. `sragents rerank` is a second-stage
retriever: it reads an existing retrieval file, uses an LLM to reorder
the top-K candidates per query, and writes a new retrieval file in the
same format. A typical cascade is BM25 → LLM rerank:
```bash
sragents rerank \
  --input results/retrieval/theoremqa-bm25.json \
  --output results/retrieval/theoremqa-rerank_bm25.json \
  --instances data/bench/instances/theoremqa.json \
  --model $MODEL --api-base $API_BASE \
  --top-k 50
```
`sragents hybrid` is the round-robin fuser used for our Hybrid
(BM25 + BGE) retriever; see the reproduction recipe below.
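The fusion scheme itself is plain round-robin with de-duplication; a sketch over lists of skill ids (formulation assumed). Note that rank 1 of the fused list is always rank 1 of the first input list, which is why rank-1 Hybrid results coincide with BM25's:

```python
def round_robin_fuse(*rank_lists: list[str], top_k: int = 50) -> list[str]:
    """Alternately draw rank 1, 2, ... from each list, skipping duplicates."""
    fused: list[str] = []
    seen: set[str] = set()
    for rank in range(max(len(r) for r in rank_lists)):
        for ranking in rank_lists:
            if rank < len(ranking) and ranking[rank] not in seen:
                seen.add(ranking[rank])
                fused.append(ranking[rank])
                if len(fused) == top_k:
                    return fused
    return fused
```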
### 2. Infer
```bash
sragents infer \
  --instances data/bench/instances/theoremqa.json \
  --output results/inference/theoremqa-Qwen3-32B-bm25_top1.jsonl \
  --model $MODEL --api-base $API_BASE \
  --provider topk \
  --provider-arg source=results/retrieval/theoremqa-bm25.json \
  --provider-arg k=1 \
  --engine direct \
  --label bm25_top1
```
Common recipes:
| Method | CLI fragment |
|---|---|
| LLM Direct | `--provider none --engine direct` |
| Oracle Skill | `--provider oracle --engine direct` |
| Full-Skill Injection | `--provider topk --provider-arg source=… --provider-arg k=1 --engine direct` |
| LLM Selection | `--provider llm_select --provider-arg source=… --provider-arg pool=50 --engine direct` |
| Progressive Disclosure | `--provider topk --provider-arg source=… --provider-arg k=50 --engine progressive_disclosure` |
| ToolQA | For any of the above, use `--engine react` in place of `direct`, or `--engine react_progressive_disclosure` in place of `progressive_disclosure` |
### 3. Evaluate
```bash
sragents evaluate \
  --input results/inference/theoremqa-Qwen3-32B-bm25_top1.jsonl \
  --instances data/bench/instances/theoremqa.json \
  --output results/eval/theoremqa-Qwen3-32B-bm25_top1.json
```
## Reproducing our experiments
The retrievers used across all experiments are BM25, TF-IDF, BGE,
Contriever, a round-robin hybrid of BM25 and BGE, and an LLM rerank
of BM25's top-50. Pre-compute the first-stage retrievers, then fuse
BM25 + BGE into the hybrid file:
```bash
DATASETS="theoremqa logicbench toolqa champ medcalcbench bigcodebench"

# First-stage retrievers.
for ds in $DATASETS; do
  for r in bm25 tfidf bge contriever; do
    sragents retrieve --retriever $r \
      --corpus data/bench/corpus/corpus.json \
      --instances data/bench/instances/$ds.json \
      --output results/retrieval/$ds-$r.json
  done
done

# Round-robin fusion of BM25 + BGE.
for ds in $DATASETS; do
  sragents hybrid \
    --input results/retrieval/$ds-bm25.json \
            results/retrieval/$ds-bge.json \
    --output results/retrieval/$ds-hybrid_bm25_bge.json
done
```
The LLM rerank output is model-dependent (file name
`*-rerank_bm25-<model>.json`), so it has to be produced once per
evaluated model:
```bash
MODEL=     # same identifier you pass to `sragents infer --model`
API_BASE=
for ds in $DATASETS; do
  sragents rerank \
    --input results/retrieval/$ds-bm25.json \
    --output results/retrieval/$ds-rerank_bm25-$(basename $MODEL).json \
    --instances data/bench/instances/$ds.json \
    --model $MODEL --api-base $API_BASE --top-k 50
done
```
`sragents experiment --exp retrieval_comparison` triggers
`sragents rerank` on demand if the rerank file is missing.
Skill-retrieval metrics (Recall@K, nDCG@K) are computed by
`sragents retrieve` and `sragents rerank` and written directly into
their output JSONs, so no separate end-to-end run is needed to obtain
them. The end-to-end experiment runners produce inference traces
and end-task accuracy:
```bash
# Main results: 5 skill-use methods × 6 datasets.
sragents experiment --exp main \
  --model $MODEL --api-base $API_BASE

# Retriever comparison: rank-1 end-to-end under BM25, TF-IDF, BGE,
# Contriever, and BM25 + Rerank. The rank-1 Hybrid is omitted: its
# round-robin fusion always returns BM25's top-1 at rank 1, so its
# end-to-end numbers are identical to BM25's.
sragents experiment --exp retrieval_comparison \
  --model $MODEL --api-base $API_BASE

# Noise robustness: gold skill plus N ∈ {0, 2, 4, 8} hard-negative
# distractors, drawn by alternately picking non-gold candidates from
# the BM25 and BGE rank lists (both retrieval files must already exist).
# Run under Full-Skill Injection and Progressive Disclosure exposure.
sragents experiment --exp distractor \
  --model $MODEL --api-base $API_BASE
```
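For reference, the two skill-retrieval metrics reduce to a few lines under binary relevance (a sketch of the standard gold-set definitions, not a copy of `retrieve/metrics.py`):

```python
import math

def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold skills that appear in the top-k of the ranking."""
    return len(set(ranked[:k]) & gold) / len(gold)

def ndcg_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking over the ideal DCG."""
    dcg = sum(1 / math.log2(i + 2) for i, s in enumerate(ranked[:k]) if s in gold)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal
```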
The skill-loading-rate analyses in our paper are derived from the
inference traces left behind by `--exp main` rather than from a
separate run.
The catalog also includes `--exp topk_sweep`, an additional
ablation that varies BM25 top-K (K ∈ {1, 2, 4, 8}) under both
exposure modes. Use `--exp topk_sweep_injection` or
`--exp topk_sweep_progressive_disclosure` to restrict to a single
mode.
Narrow the scope of any experiment with `--dataset theoremqa
logicbench` or `--methods bm25_top1 progressive_disclosure` (use
the technical method labels, not the human-readable display names).
The runner invokes `sragents infer` and `sragents evaluate` for
each (dataset, method) cell and skips cells whose output files
already exist.
## Project layout
```
src/sragents/
├── config.py            # path constants, ALL_DATASETS
├── corpus.py            # skill corpus loading
├── llm.py               # OpenAI-compatible client
├── prompts.py           # per-dataset prompt builders
│
├── retrieve/            # Stage 1
│   ├── base.py          # Retriever protocol + registry
│   ├── bm25.py · tfidf.py · dense.py · hybrid.py · llm_rerank.py
│   └── metrics.py · schema.py
│
├── infer/               # Stage 2
│   ├── base.py          # SkillProvider + InferenceEngine protocols
│   ├── runner.py        # per-instance parallel execution + resume
│   ├── providers/       # none · oracle · topk · llm_select · oracle_distractor
│   └── engines/         # direct · progressive_disclosure · react · react_progressive_disclosure
│
├── evaluate/            # Stage 3
│   ├── base.py          # Evaluator protocol + registry
│   ├── datasets/        # 6 per-dataset evaluators
│   └── metrics.py
│
├── toolqa/              # ToolQA tools (used by ReAct engines)
├── experiments/         # experiment catalog + runner
└── cli/                 # sragents entry points

data/bench/
├── corpus/corpus.json   # 26,262 skills (636 gold + 25,626 web)
└── instances/*.json     # 6 datasets
```
## License
See [LICENSE](LICENSE).