# Automated reproducibility assessments in the social and behavioral sciences using large language models

This repository contains the code and data accompanying our study on statistical reproducibility with LLM agents. 

## Protocol

The agent receives a verbal claim, the original dataset, and — depending on the experimental condition — some or all of the paper. It conducts an independent analysis and submits a structured result (test statistic + degrees of freedom + sample size + conclusion), which the scorer converts to Cohen's d.
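As a rough illustration of the final conversion step, the sketch below turns two common kinds of reported result into Cohen's d using standard textbook approximations. The function names `t_to_d` and `r_to_d` and the choice of formulas are assumptions made for this example; the repository's own conversion code (see `pipeline/evaluation/effect_size.py`) may handle more test types or unequal group sizes differently.

```python
import math


def t_to_d(t: float, df: float) -> float:
    """Approximate Cohen's d from an independent-samples t statistic.

    Assumes two groups of roughly equal size, so that d ~= 2t / sqrt(df).
    """
    if df <= 0:
        raise ValueError("degrees of freedom must be positive")
    return 2.0 * t / math.sqrt(df)


def r_to_d(r: float) -> float:
    """Convert a correlation coefficient to Cohen's d: d = 2r / sqrt(1 - r^2)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 2.0 * r / math.sqrt(1.0 - r * r)


# Example: t = 2.0 with 16 degrees of freedom
d_from_t = t_to_d(2.0, 16)  # 2 * 2.0 / sqrt(16) = 1.0
```

Converting both the original and the reanalysed result to the same effect-size scale is what makes results from different test families comparable.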

Three experimental conditions vary the analyst's inputs:

- **Reproducibility baseline:** full paper, neutral analyst stance.
- **Information context:** vary the paper material — Variant A `full` (full parsed paper text), Variant B `no_methods` (full text with the methods / research-design section redacted), Variant C `abstract` (title + abstract only).
- **Analyst perspective:** vary the analytical stance — `neutral` / `confirmatory` / `critical`.

The same prompt and the same scorer are used across all conditions; only the attached paper material and the analyst instruction differ.
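The design can be read as a small grid of (paper variant, analyst stance) cells. The sketch below is only an illustration: `CONDITIONS` and `condition_cells` are hypothetical names that do not exist in the repository, and the split into cells follows the runner scripts listed under Experiments, where the information-context scripts run `abstract` and `no_methods` with a neutral stance because `full` with a neutral stance is already the baseline.

```python
from itertools import product

# Paper variants and analyst stances per condition, as named in this README.
CONDITIONS = {
    "baseline": (["full"], ["neutral"]),
    "information": (["abstract", "no_methods"], ["neutral"]),
    "perspective": (["full"], ["confirmatory", "critical"]),
}


def condition_cells(name: str) -> list[tuple[str, str]]:
    """Return every (paper_variant, stance) pair run for one condition."""
    papers, stances = CONDITIONS[name]
    return list(product(papers, stances))


# All distinct cells across the three conditions.
all_cells = [cell for name in CONDITIONS for cell in condition_cells(name)]
```

Under these assumptions the three conditions cover five distinct cells per model and corpus.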

## Setup

```bash
uv venv .venv
source .venv/bin/activate       # macOS/Linux
uv pip install -r requirements.txt
```

Docker is required for the agentic analysis pipeline (the Inspect AI sandbox).

All model calls are routed through OpenRouter. Set your API key:

```bash
export LLM100_OPEN_ROUTER_API_KEY=sk-or-...        # macOS/Linux
```
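Before launching a long run, it can help to fail fast if the key is missing. The helper below is a minimal sketch; `require_openrouter_key` is a hypothetical name and is not part of the repository. Only the environment variable name is taken from this README.

```python
import os


def require_openrouter_key(env=None) -> str:
    """Return the OpenRouter key, or raise if it is unset or blank."""
    env = os.environ if env is None else env
    key = env.get("LLM100_OPEN_ROUTER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "LLM100_OPEN_ROUTER_API_KEY is not set; export it before running the pipeline"
        )
    return key
```

Calling `require_openrouter_key()` at the top of a script surfaces a missing key immediately rather than after the sandbox has started.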

## Data

The analysis inputs are not bundled with this repository; two sources reconstruct them.

**Per-paper datasets** are fetched from OSF in two steps, which together download the data for the papers listed in `data/labeling/`:

```bash
# Step 01: download the small metadata tables that drive step 02 (e.g. the Multi100 human-benchmark CSVs from github.com/marton-balazs-kovacs/multi100)
python pipeline/data_collection/01_download_metadata.py

# Step 02: download per-paper data for the labeling corpus (Multi100 + SCORE)
python pipeline/data_collection/02_download_data.py
```

**Paper text** was retrieved manually rather than programmatically and parsed into one folder per paper for the agentic pipeline, split into three sibling corpora — one per information-context input variant:

- `data/paper_ocr_abstract_only/` — abstract only (`abstract.md`). Abstracts are redistributable, so **this is the one parsed-text corpus committed to the repo**, and the `abstract` variant runs out of the box.
- `data/paper_ocr_full/` — full article text (`article.md`) + tables/figures, used by the `full` variant.
- `data/paper_ocr_no_methods/` — full text with the methods/research-design section stripped inline, used by the `no_methods` variant.

The full and method-removed corpora hold license-restricted paper text and are not redistributed.
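Because two of the three corpora must be supplied locally, a quick check of which variants are runnable can save a failed launch. The sketch below assumes the folder layout described above (one sub-folder per paper containing `abstract.md` or `article.md`); the names `VARIANT_DIRS`, `VARIANT_FILES`, and `papers_available` are hypothetical and not part of the repository.

```python
from pathlib import Path

# Corpus folder and expected text file for each information-context variant.
VARIANT_DIRS = {
    "abstract": "data/paper_ocr_abstract_only",
    "full": "data/paper_ocr_full",
    "no_methods": "data/paper_ocr_no_methods",
}
VARIANT_FILES = {
    "abstract": "abstract.md",
    "full": "article.md",
    "no_methods": "article.md",
}


def papers_available(repo_root, variant: str) -> list[str]:
    """List paper folders that contain the text file the variant expects."""
    corpus = Path(repo_root) / VARIANT_DIRS[variant]
    if not corpus.is_dir():
        return []
    expected = VARIANT_FILES[variant]
    return sorted(
        p.name for p in corpus.iterdir() if p.is_dir() and (p / expected).is_file()
    )
```

For example, `papers_available(".", "full")` returns an empty list on a fresh clone until the license-restricted text has been added.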

The agentic corpora are driven by the labeling sheets in `data/labeling/`, which hold the evaluation corpus (84 Multi100 papers, 96 SCORE papers):

- `multi100_labeling.csv`
- `score_labeling.csv`
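One way to sanity-check a labeling sheet against the paper counts above is to count its distinct paper identifiers. The sketch below assumes the sheets have a `paper_id` column; that column name is an assumption based on the folder naming described under Repository structure, and `count_unique` is a hypothetical helper.

```python
import csv
import io


def count_unique(csv_file, column: str = "paper_id") -> int:
    """Count distinct non-empty values of `column` in an open CSV file."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in CSV header")
    return len({row[column] for row in reader if row[column]})


# Toy data standing in for a labeling sheet (not real study data).
toy_sheet = io.StringIO("paper_id,claim\nA,x\nA,y\nB,z\n")
n_papers = count_unique(toy_sheet)  # two distinct IDs: A and B
```

With a real sheet, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to `count_unique`.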

## Experiments

Ready-to-run wrappers live in `pipeline/agentic_analysis/{multi100,score}/`: one folder per corpus and one script per preregistered condition, with all flags pinned (5 epochs, `reasoning_effort=medium`, `--max-sandboxes 10`).

```bash
# baseline: full × neutral (primary model)
bash pipeline/agentic_analysis/multi100/baseline_claude47.sh
bash pipeline/agentic_analysis/score/baseline_claude47.sh
# information context: abstract + no_methods × neutral
bash pipeline/agentic_analysis/multi100/information.sh
bash pipeline/agentic_analysis/score/information.sh
# analyst perspective: full × confirmatory + critical
bash pipeline/agentic_analysis/multi100/perspective.sh
bash pipeline/agentic_analysis/score/perspective.sh
# robustness: GPT-5.5 (secondary)
bash pipeline/agentic_analysis/multi100/baseline_gpt55.sh
bash pipeline/agentic_analysis/score/baseline_gpt55.sh
# robustness: GLM-5.1 (secondary, open-weight)
bash pipeline/agentic_analysis/multi100/baseline_glm51.sh
bash pipeline/agentic_analysis/score/baseline_glm51.sh
```

`cost_limit` defaults to `$5.00` per sample for all models; override it with `-T cost_limit=<dollars>`. `temperature` defaults to `1.0`, so sampling is stochastic and results vary across runs.

## Evaluation

The `eval_export.csv` underlying the paper is provided in `results/`. Figures and statistics can be produced from it with:

```bash
python pipeline/evaluation/statistics_and_values.py        # summary statistics
python pipeline/evaluation/figure_plot.py                  # figures → results/figures/
python pipeline/evaluation/cochran_ttest.py                # RQ3 Cochran's Q / Friedman + paired t-test
```

`figure_plot.py` defaults to Claude (strict tolerance); pass `--model gpt|glm` and/or `--tolerance broad` for the other panels. The summary scripts read the Multi100 human-benchmark CSVs in `data/multi100/human_results/`, so run `01_download_metadata.py` first if that folder is empty.
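For readers unfamiliar with the RQ3 test, Cochran's Q compares a binary outcome across several matched conditions. The sketch below computes the statistic from its textbook formula. It is not the code in `cochran_ttest.py`, and the example matrix is made-up toy data, assuming rows are papers and columns are conditions with 1 marking a successful reproduction.

```python
def cochran_q(outcomes):
    """Return (Q, df) for a matrix of 0/1 outcomes.

    outcomes: one row per matched unit (e.g. paper), one column per condition.
    Q is referred to a chi-square distribution with df = k - 1.
    """
    if not outcomes:
        raise ValueError("need at least one row")
    k = len(outcomes[0])
    if k < 2 or any(len(row) != k for row in outcomes):
        raise ValueError("every row needs the same number (>= 2) of conditions")
    col_totals = [sum(row[j] for row in outcomes) for j in range(k)]
    row_totals = [sum(row) for row in outcomes]
    n = sum(row_totals)  # grand total of successes
    denom = k * n - sum(r * r for r in row_totals)
    if denom == 0:
        raise ValueError("Q is undefined when every row is constant")
    q = (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom
    return q, k - 1


# Toy data: 6 papers x 3 conditions (illustrative only).
toy = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
]
q_stat, dof = cochran_q(toy)
```

Rows in which all conditions agree contribute nothing to Q, which is why the statistic is undefined when every paper has the same outcome under all conditions.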

To generate an `eval_export.csv` after running new experiments:

```bash
python pipeline/evaluation/export_eval_csv.py         # flatten the Inspect AI eval logs into a flat per-run CSV
```

## Repository structure

```
├── data/
│   ├── labeling/            — Labelling exports driving the agentic corpora
│   │   ├── score_labeling.csv    — SCORE labeler sheet, final run set (96 papers)
│   │   └── multi100_labeling.csv — Multi100 reference sheet, final run set (84 papers)
│   ├── paper_ocr_abstract_only/ — Variant C (abstract): one folder per paper with abstract.md
│   ├── paper_ocr_full/      — Variant A (full): article.md + tbl-N.md / img-N.jpeg (only placeholder)
│   ├── paper_ocr_no_methods/ — Variant B (no_methods): article.md with methods stripped inline (only placeholder)
│   ├── multi100/
│   │   ├── human_results/   — Multi100 benchmark CSVs (downloaded by 01_download_metadata.py from github.com/marton-balazs-kovacs/multi100)
│   │   └── datasets/        — Per-paper OSF downloads, one folder per labeling paper_id
│   └── score/
│       ├── datasets/        — SCORE per-paper data, one <paper_id>/<rr_id>/ folder per labeling pair
│       └── companion_repos/
│           └── score_master/ — Cached SCORE rr-project OSF-ID index (OSF dtzx4)
├── pipeline/
│   ├── data_loading.py              — Shared paper-metadata + study-file loaders
│   ├── data_collection/             — One-time data preparation
│   │   ├── osf_common.py            — Shared OSF helpers + paths for the steps below
│   │   ├── 01_download_metadata.py  — Download metadata for step 02 (Multi100 human-benchmark CSVs + SCORE rr-project index)
│   │   └── 02_download_data.py      — OSF scraper: per-paper data for the labeling corpus (Multi100 + SCORE)
│   ├── agentic_analysis/            — Agentic LLM pipeline (Inspect AI + tool use)
│   │   ├── task.py                  — Single `reanalysis` task (all conditions via parameters)
│   │   ├── dataset.py               — Dataset loader facade (re-exports the corpus loaders)
│   │   ├── dataset_common.py        — Shared prompt building + sandbox file mounting
│   │   ├── dataset_multi100.py      — Multi100 corpus loader
│   │   ├── dataset_score.py         — SCORE corpus loader
│   │   ├── scorer.py                — Unified scorer: Cohen's d tolerance + conclusion match
│   │   ├── multi100/, score/        — Per-corpus prereg runner scripts (one per condition)
│   │   ├── Dockerfile               — Sandbox image (Python + analysis packages)
│   │   └── compose.yaml             — Docker Compose for sandbox
│   ├── evaluation/                  — Post-hoc evaluation
│   │   ├── export_eval_csv.py       — Flatten Inspect AI .eval logs into one CSV
│   │   ├── effect_size.py           — Multi100 test statistic → Cohen's d conversion
│   │   ├── processing_functions.py  — Helpers for the evaluation scripts
│   │   ├── statistics_and_values.py — Summary statistics reported in the paper
│   │   ├── figure_plot.py           — Generate the paper's figures (→ results/figures/)
│   │   └── cochran_ttest.py         — RQ3 Cochran's Q / Friedman tests + paired t-test (LLM vs original d)
│   └── misc/                        — Auxiliary scripts
│       └── contamination_probe.py   — Training data contamination check
├── results/                 — LLM analysis outputs
│   ├── eval_export.csv      — Flat per-run CSV of the eval logs; the published study results and the input to the evaluation scripts
│   ├── agentic_logs/        — Inspect AI eval logs (generated by running the experiments)
│   └── figures/             — Generated figures
└── supplementary_materials/ — GUIDE-LLM checklist
```

## License

The code in this repository is released under the [MIT License](LICENSE).

## GUIDE-LLM reporting checklist

This repository includes the completed [GUIDE-LLM reporting checklist](https://sfeuerriegel.github.io/llm-checklist/) for the study: [supplementary_materials/GUIDE-LLM_checklist.pdf](supplementary_materials/GUIDE-LLM_checklist.pdf).