# Agentic Abstention

This repository is a research artifact release for the agentic abstention benchmark. It
contains the code and lightweight artifacts needed to reproduce the benchmark
protocol across three agent environments:

- `web/`: WebShop-based web navigation tasks.
- `qa/`: Q&A tasks with multi-turn Wikimedia search.
- `terminal/`: TerminalBench immediate and delayed abstention tasks.

Raw datasets, search indexes, model outputs, debug traces, and cluster job
artifacts are intentionally not tracked. Each environment includes download or
materialization instructions for external assets.

## Repository Layout

```text
web/        WebShop instruction rewriting, missing-target construction, and evaluation
qa/         AbstentionBench-style Q&A datasets with Wikimedia SEARCH episodes
terminal/   TerminalBench task construction, Harbor configs, and analysis tools
docs/       Benchmark protocol, metric definitions, and data-source notes
```

## Quick Start

Clone the repository, then set up the environment you want to reproduce. Each
setup block below assumes you start from the repository root:

```bash
git clone https://github.com/lhannnn/agentic-abstention.git
cd agentic-abstention
```

For WebShop:

```bash
cd web
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
```

For Q&A:

```bash
cd qa
conda env create -f environment.yml
conda activate abstention-bench
pip install -e .
```

For TerminalBench:

```bash
cd terminal
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
```

See the environment-specific README files for data downloads and run commands.

## Benchmark Protocol

Agentic abstention evaluates whether an agent knows when to stop and abstain
instead of continuing with unsupported actions. The concrete action interface is
environment-specific: Q&A uses Wikimedia search episodes, Web uses browser
navigation, and Terminal uses command-line interaction.

The main metrics are Timely Recall, Overall Recall, SPL, and pass@k. See
`docs/metrics.md` for the metric definitions and `docs/benchmark_protocol.md`
for the environment-level protocol summary.
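
Of these metrics, pass@k has a widely used estimator outside this benchmark.
The sketch below assumes the benchmark follows that standard unbiased form,
`1 - C(n - c, k) / C(n, k)` for `n` sampled attempts with `c` successes;
`docs/metrics.md` remains the authoritative definition. The function name
`pass_at_k` and its parameters are illustrative and are not taken from this
repository's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which succeeded.

    Assumption: this is the common unbiased estimator, not necessarily
    the exact definition used in docs/metrics.md.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures means every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one success.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, `pass_at_k(n=10, c=3, k=1)` reduces to the plain success rate
`c / n`, while larger `k` rewards any success among the sampled attempts.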

## Data Policy

This repository tracks only small code, prompt, config, manifest, and test
files. It does not include:

- raw WebShop product/search assets,
- raw or materialized TerminalBench task directories,
- HuggingFace dataset caches,
- Wikimedia dumps or retrieval indexes,
- API keys or `.env` files,
- model outputs, logs, plots, and debug traces.

See `docs/data_sources.md` and each environment's `download/README.md` for how
to download or materialize these assets.
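
The exclusion list above can be spot-checked mechanically. The following is a
minimal sketch, not a tool shipped with this repository: the helper
`policy_violations` and the patterns in `ASSUMED_PATTERNS` are hypothetical
and cover only the two items whose file names can be guessed from the list
(`.env` files and logs). The other excluded assets live in paths this README
does not name, so they are left out.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical file-name patterns; adjust to the repository's real ignore rules.
ASSUMED_PATTERNS = ("*.env", "*.log")


def policy_violations(tracked_paths, patterns=ASSUMED_PATTERNS):
    """Return the tracked paths whose file name matches an excluded pattern."""
    flagged = []
    for path in tracked_paths:
        name = PurePosixPath(path).name
        # fnmatchcase does not special-case leading dots, so "*.env" also
        # matches a file named exactly ".env".
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            flagged.append(path)
    return flagged
```

One way to use it is to pass the lines printed by `git ls-files`; an empty
result means none of the assumed patterns matched a tracked file.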