# SpatialBench

**Can AI agents extract biological insight from real-world spatial data?**

SpatialBench is a benchmark of 159 verifiable problems derived from practical spatial transcriptomics workflows. Each problem snapshots an analysis state immediately before a target step and pairs it with a deterministic grader that evaluates recovery of a key biological result.

This revised version of the benchmark includes 159 problems across 5 platforms and 7 task categories. We report results for the full benchmark and publicly release a representative sample covering all platforms and task categories, along with the associated agent trajectories. We withhold the full benchmark set to avoid contamination.


## Key Findings

| Model                         | Harness        |   Accuracy (%) |   Cost ($) |
|:------------------------------|:---------------|---------------:|-----------:|
| gpt-5.5                       | mini-swe-agent |          57.65 |     1.1207 |
| gpt-5.4                       | mini-swe-agent |          57.44 |     0.577  |
| gpt-5.5                       | openai-codex   |          53.67 |     3.1616 |
| claude-opus-4-6               | mini-swe-agent |          52.83 |     0.8456 |
| claude-opus-4-7               | mini-swe-agent |          52.41 |     0.9817 |
| gemini-3.1-pro-preview        | mini-swe-agent |          51.57 |     0.9362 |
| claude-opus-4-7               | claude-code    |          51.36 |     0.8023 |
| gpt-5.2                       | mini-swe-agent |          50.1  |     0.6024 |
| grok-4.20-beta-0309-reasoning | mini-swe-agent |          45.91 |     0.1679 |
| claude-sonnet-4-6             | mini-swe-agent |          44.23 |     0.273  |
| claude-opus-4-5               | mini-swe-agent |          42.77 |     0.4624 |
| claude-sonnet-4-5             | mini-swe-agent |          41.51 |     0.2247 |
| gpt-5.1                       | mini-swe-agent |          39.83 |     0.1574 |
| grok-4-1-fast-reasoning       | mini-swe-agent |          33.96 |     0.0164 |
| grok-4                        | mini-swe-agent |          31.87 |     0.4529 |
| gemini-2.5-pro                | mini-swe-agent |          28.93 |     0.1086 |

Full results with 95% confidence intervals are in [`results/`](results/). Implementation details are in [Methods](METHODS.md).
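As a rough illustration of what a 95% interval on an accuracy figure means, the sketch below computes a Wilson score interval for a single pass rate. This is an assumption chosen for illustration, not necessarily the procedure used to produce the files in `results/`; the function name and the example counts are hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    # Center of the interval is pulled slightly toward 0.5 relative to p.
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 50 problems passed out of 100 attempted.
low, high = wilson_interval(50, 100)
print(f"accuracy 50.0%, 95% CI {low:.1%} to {high:.1%}")
```

With these made-up counts the interval spans roughly 40.4% to 59.6%, which shows why small accuracy gaps between models should be read alongside their intervals.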

Supplemental human verification notes for reviewed evaluations are available in [`supplemental/verified_human_solutions/`](supplemental/verified_human_solutions/). These notes include sanitized task definitions, independent human answers, and stripped analysis notebooks for evaluations that had all three artifacts available.


## Benchmark Structure

**159 evaluations** across:
- **5 platforms**: Curio, Vizgen, Xenium, AtlasXOmics, Visium
- **7 task categories**: Dimensionality Reduction, Cell Typing, Normalization, Differential Expression, Clustering, QC, Spatial Analysis

Tasks require empirical interaction with the data: agents that rely on prior knowledge without performing the requisite analysis fail to complete many tasks correctly.

## Quick Start

```bash
pip install -e .

# Validate evaluation format
spatialbench validate example_evals/spatial_analysis/xenium_kidney_spatial_cn7_composition_day14.json

# Run with mini-swe-agent
export ANTHROPIC_API_KEY=your_key
spatialbench run example_evals/spatial_analysis/xenium_kidney_spatial_cn7_composition_day14.json --agent minisweagent --model anthropic/claude-opus-4-5

export OPENAI_API_KEY=your_key
spatialbench run example_evals/spatial_analysis/xenium_kidney_spatial_cn7_composition_day14.json --agent minisweagent --model openai/gpt-5.5
```

## Graders

Five grader families handle different answer types:

| Grader | Use Case |
|--------|----------|
| NumericTolerance | QC metrics, counts, expression values |
| MultipleChoice | Discrete interpretation questions |
| MarkerGenePrecisionRecall | Gene lists (P@K, R@K) |
| LabelSetJaccard | Cell type sets |
| DistributionComparison | Cell type proportions |

See [latch-eval-tools](https://github.com/latchbio/latch-eval-tools) for implementations and harness setups. 
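To make the grader families more concrete, here is a minimal sketch of the kind of deterministic check three of them could perform. These functions are illustrative assumptions, not the latch-eval-tools implementations: the names, the default tolerance, and the edge-case handling are all hypothetical, and the gene and cell type names are made-up example inputs.

```python
from typing import Iterable, Sequence


def numeric_within_tolerance(predicted: float, expected: float, rel_tol: float = 0.05) -> bool:
    """NumericTolerance-style check: pass if within a relative tolerance."""
    if expected == 0:
        # Fall back to an absolute comparison when the target is zero.
        return abs(predicted) <= rel_tol
    return abs(predicted - expected) <= rel_tol * abs(expected)


def jaccard(predicted: Iterable[str], expected: Iterable[str]) -> float:
    """LabelSetJaccard-style score: |intersection| / |union| of two label sets."""
    a, b = set(predicted), set(expected)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def precision_recall_at_k(
    ranked_genes: Sequence[str], reference: Iterable[str], k: int
) -> tuple[float, float]:
    """MarkerGenePrecisionRecall-style score: P@K and R@K for a ranked gene list."""
    ref = set(reference)
    # Drop duplicate genes while preserving rank order, then keep the top k.
    top_k = list(dict.fromkeys(ranked_genes))[:k]
    hits = sum(1 for gene in top_k if gene in ref)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


# Example inputs (hypothetical answers, not taken from the benchmark).
print(numeric_within_tolerance(104.0, 100.0))
print(jaccard(["T cell", "B cell"], ["B cell", "NK cell"]))
print(precision_recall_at_k(["CD3E", "CD4", "ACTB", "MS4A1"], ["CD3E", "CD4", "CD8A", "IL7R"], k=3))
```

Each function is a pure function of the agent's answer and the reference, so repeated grading of the same answer always yields the same score. A real grader would additionally apply a pass threshold to the Jaccard and P@K/R@K scores.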

## Citation

```bibtex
@article{spatialbench2025,
  title = {SpatialBench: Can Agents Analyze Real-World Spatial Biology Data?},
  author = {Workman, Kenny and Yang, Zhen and Muralidharan, Harihara and Le, Hannah},
  year = {2025},
  url = {https://github.com/latchbio/spatialbench}
}
```

## License

Apache 2.0