# scBench

**Evaluating AI Agents on Single-Cell RNA-seq Analysis**

scBench is a benchmark of 195 verifiable problems derived from practical single-cell RNA-seq workflows. Each problem pairs a data snapshot (AnnData `.h5ad`) with a natural-language task prompt and a deterministic grader that maps the agent's structured output to pass/fail.

## Key Findings

| model_name | harness | Accuracy (%) | Cost ($) |
|---|---|---:|---:|
| gpt-5.5 | mini-swe-agent | 57.95 | 1.1136 |
| gpt-5.5 | openai-codex | 57.78 | 2.4685 |
| gpt-5.4 | mini-swe-agent | 57.44 | 0.8240 |
| claude-opus-4-7 | mini-swe-agent | 55.21 | 1.5378 |
| claude-opus-4-7 | claude-code | 54.02 | 1.1465 |
| gemini-3.1-pro-preview | mini-swe-agent | 53.85 | 0.8948 |
| claude-opus-4-6 | mini-swe-agent | 52.65 | 1.1917 |
| gpt-5.2 | mini-swe-agent | 52.31 | 0.8874 |
| claude-sonnet-4-6 | mini-swe-agent | 50.26 | 0.9872 |
| claude-opus-4-5 | mini-swe-agent | 47.18 | 0.6553 |
| grok-4.20-beta-0309-reasoning | mini-swe-agent | 44.44 | 0.2957 |
| grok-4.3 | mini-swe-agent | 44.27 | 0.2147 |
| gpt-5.1 | mini-swe-agent | 38.80 | 0.2177 |
| claude-sonnet-4-5 | mini-swe-agent | 33.16 | 0.2682 |
| grok-4-1-fast-reasoning | mini-swe-agent | 30.26 | 0.0282 |
| gemini-2.5-pro | mini-swe-agent | 23.59 | 0.1368 |

Full results with per-task and per-platform breakdowns are in [`results/`](results/).

## Benchmark Structure

195 evaluations across:
- **6 platforms**: BD Rhapsody, Chromium, CSGenetics, Illumina, MissionBio, ParseBio
- **6 task categories**: QC, Normalization, Dimensionality Reduction, Clustering, Cell Typing, Differential Expression

Tasks require empirical interaction with the data—agents that rely on prior knowledge without performing the requisite analysis fail.

## Canonical Examples

Six canonical examples are in [`evals/`](evals/). The sample covers all current platforms and task categories. The full 195-evaluation benchmark is withheld to prevent training contamination.

| Task | Platform | Eval |
|---|---|---|
| QC | BD Rhapsody | `bd_rhapsody_tnbc_panel_aware_qc` |
| Dimensionality Reduction | Chromium | `dr_05_pca_preprocessing_sentinels` |
| Normalization | CS Genetics | `NRM01_sparse_normalization` |
| Cell Typing | Illumina snRNA | `T04a_endothelin_niche_sources` |
| Clustering | Mission Bio Tapestri | `tapestri_ccus_clustering_12_largest_mutant_clone` |
| Differential Expression | Parse Bio | `DE01_pseudobulk_de` |

## Quick Start

```bash
pip install -e .

# Validate an evaluation
scbench validate evals/qc/bd_rhapsody_tnbc_panel_aware_qc.json

# Run with mini-swe-agent
export ANTHROPIC_API_KEY=your_key
scbench run evals/qc/bd_rhapsody_tnbc_panel_aware_qc.json --agent minisweagent --model anthropic/claude-opus-4-5
```

### Custom Agent

```python
from scbench import EvalRunner

def my_agent(task_prompt, work_dir):
    import json
    answer = {
        "n_pbmcs_retained": 14346,
        "median_genes_per_pbmc": 68,
        "n_monocytes_pbmc": 2592,
    }
    (work_dir / "eval_answer.json").write_text(json.dumps(answer))
    return answer

runner = EvalRunner("evals/qc/bd_rhapsody_tnbc_panel_aware_qc.json")
result = runner.run(agent_function=my_agent)
print(f"Passed: {result['passed']}")
```

## Graders

Five grader families handle different answer types:

| Grader | Use Case |
|--------|----------|
| NumericTolerance | QC metrics, counts, expression values |
| MultipleChoice | Discrete interpretation questions |
| MarkerGenePrecisionRecall | Gene lists (P@K, R@K) |
| LabelSetJaccard | Cell type sets |
| DistributionComparison | Cell type proportions |

See [latch-eval-tools](https://github.com/latchbio/latch-eval-tools) for implementations.

## Citation

```bibtex
@article{scbench2026,
  title={scBench: Evaluating AI Agents on Single-Cell RNA-seq Analysis},
  author={Workman, Kenny and Yang, Zhen and Muralidharan, Harihara and Abdulali, Aidan and Le, Hannah},
  year={2026},
  note={LatchBio}
}
```

## License

Apache 2.0