# Automated reproducibility assessments in the social and behavioral sciences using large language models

This repository contains the code and data accompanying our study on statistical reproducibility with LLM agents.

## Protocol

The agent receives a verbal claim, the original dataset, and, depending on the experimental condition, some or all of the paper. It conducts an independent analysis and submits a structured result (test statistic + degrees of freedom + sample size + conclusion), which the scorer converts to Cohen's d.

Three experimental conditions vary the analyst's inputs:

- **Reproducibility baseline:** full paper, neutral analyst stance.
- **Information context:** vary the paper material: Variant A `full` (full parsed paper text), Variant B `no_methods` (full text with the methods / research-design section redacted), Variant C `abstract` (title + abstract only).
- **Analyst perspective:** vary the analytical stance: `neutral` / `confirmatory` / `critical`.

The same prompt and the same scorer are used across all conditions; only the attached paper material and the analyst instruction differ.

## Setup

```bash
uv venv .venv
source .venv/bin/activate  # macOS/Linux
uv pip install -r requirements.txt
```

Docker is required for the agentic analysis pipeline (the Inspect AI sandbox).

All model calls are routed through OpenRouter. Set your API key:

```bash
export LLM100_OPEN_ROUTER_API_KEY=sk-or-...  # macOS/Linux
```

## Data

The analysis inputs are not bundled with this repository; two sources reconstruct them.

**Per-paper datasets** are fetched from OSF in two steps. The scripts download data for the papers listed in `data/labeling/`:

```bash
# Step 01: download the small metadata tables that drive step 02 (e.g.
# information from multi100 github.com/marton-balazs-kovacs/multi100)
python pipeline/data_collection/01_download_metadata.py

# Step 02: download per-paper data for the labeling corpus (Multi100 + SCORE)
python pipeline/data_collection/02_download_data.py
```

**Paper text** was retrieved manually rather than programmatically and parsed into one folder per paper for the agentic pipeline. It is split into three sibling corpora, one per information-context input variant:

- `data/paper_ocr_abstract_only/`: abstract only (`abstract.md`). Abstracts are redistributable, so **this is the one parsed-text corpus committed to the repo**, and the `abstract` variant runs out of the box.
- `data/paper_ocr_full/`: full article text (`article.md`) + tables/figures, used by the `full` variant.
- `data/paper_ocr_no_methods/`: full text with the methods/research-design section stripped inline, used by the `no_methods` variant.

The full and methods-removed corpora hold license-restricted paper text and are not redistributed.

The agentic corpora are driven by the labeling sheets in `data/labeling/`, which hold the evaluation corpus (84 Multi100 papers, 96 SCORE papers):

- `multi100_labeling.csv`
- `score_labeling.csv`

## Experiments

Ready-to-run wrappers live in `pipeline/agentic_analysis/{multi100,score}/`: one folder per corpus, one script per preregistered condition, with all flags pinned (5 epochs, `reasoning_effort=medium`, `--max-sandboxes 10`).
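As a reading aid, the design cells covered by those wrappers can be enumerated in a few lines of Python. This is only a sketch of the condition grid described in the Protocol section: the `Condition` type, the `prereg_conditions` function, and the `group` labels are hypothetical names chosen for illustration and do not correspond to anything in `task.py`.

```python
from typing import NamedTuple


class Condition(NamedTuple):
    """One cell of the design: which paper material and which analyst stance."""
    group: str          # hypothetical label, mirroring the wrapper script names
    paper_variant: str  # "full", "no_methods", or "abstract"
    stance: str         # "neutral", "confirmatory", or "critical"


def prereg_conditions() -> list[Condition]:
    """Enumerate the five cells listed in the Protocol section."""
    # Baseline: full paper, neutral stance.
    cells = [Condition("baseline", "full", "neutral")]
    # Information context: reduced paper material, stance held at neutral.
    cells += [Condition("information", v, "neutral") for v in ("abstract", "no_methods")]
    # Analyst perspective: full paper, non-neutral stances.
    cells += [Condition("perspective", "full", s) for s in ("confirmatory", "critical")]
    return cells


if __name__ == "__main__":
    for cell in prereg_conditions():
        print(f"{cell.group:<12} {cell.paper_variant:<11} {cell.stance}")
```

Only one factor moves away from the `full` × `neutral` baseline in each cell, which is why the grid has five cells rather than the nine a full crossing would give. The wrapper scripts themselves are invoked as follows: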
```bash
# baseline: full × neutral (primary model)
bash pipeline/agentic_analysis/multi100/baseline_claude47.sh
bash pipeline/agentic_analysis/score/baseline_claude47.sh

# information context: abstract + no_methods × neutral
bash pipeline/agentic_analysis/multi100/information.sh
bash pipeline/agentic_analysis/score/information.sh

# analyst perspective: full × confirmatory + critical
bash pipeline/agentic_analysis/multi100/perspective.sh
bash pipeline/agentic_analysis/score/perspective.sh

# robustness: GPT-5.5 (secondary)
bash pipeline/agentic_analysis/multi100/baseline_gpt55.sh
bash pipeline/agentic_analysis/score/baseline_gpt55.sh

# robustness: GLM-5.1 (secondary, open-weight)
bash pipeline/agentic_analysis/multi100/baseline_glm51.sh
bash pipeline/agentic_analysis/score/baseline_glm51.sh
```

`cost_limit` defaults to `$5.00` per sample across all models; override with `-T cost_limit=`. `temperature` defaults to `1.0`, so sampling is stochastic and results vary across runs.

## Evaluation

The `eval_export.csv` used for the paper is provided in `results/`. Figures and statistics can be produced with:

```bash
python pipeline/evaluation/statistics_and_values.py  # summary statistics
python pipeline/evaluation/figure_plot.py            # figures → results/figures/
python pipeline/evaluation/cochran_ttest.py          # RQ3 Cochran's Q / Friedman + paired t-test
```

`figure_plot.py` defaults to Claude (strict tolerance); pass `--model gpt|glm` and/or `--tolerance broad` for the other panels. The summary scripts read the Multi100 human-benchmark CSVs in `data/multi100/human_results/`, so run `01_download_metadata.py` first if that folder is empty.
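The scorer and the evaluation scripts compare results on the Cohen's d scale. To illustrate what such a conversion involves, the sketch below applies standard textbook formulas to a t or F statistic and then checks agreement within a tolerance. The function names, the equal-group-size assumption, and the `tol` argument are illustrative assumptions and are not taken from the repository; the actual conversion lives in `pipeline/evaluation/effect_size.py` and may differ.

```python
import math


def cohens_d_from_t(t: float, df: float) -> float:
    """Approximate d for a two-group t-test, ASSUMING equal group sizes: d = 2t / sqrt(df)."""
    if df <= 0:
        raise ValueError("df must be positive")
    return 2.0 * t / math.sqrt(df)


def cohens_d_from_f(f: float, df_error: float, sign: int = 1) -> float:
    """For an F statistic with one numerator df, |t| = sqrt(F).

    F carries no direction, so the caller supplies the sign of the effect.
    """
    if f < 0:
        raise ValueError("F must be non-negative")
    return cohens_d_from_t(sign * math.sqrt(f), df_error)


def within_tolerance(d_llm: float, d_original: float, tol: float) -> bool:
    """Hypothetical agreement check: absolute difference in d no larger than tol.

    The tolerance values used in the study are not assumed here; tol is required.
    """
    return abs(d_llm - d_original) <= tol


if __name__ == "__main__":
    d = cohens_d_from_t(t=2.0, df=16)
    print(d, within_tolerance(d, d_original=1.05, tol=0.1))
```

Other designs (paired or one-sample tests, correlations, chi-square tests) need their own formulas, so a real converter would dispatch on the type of test statistic submitted by the agent.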
To generate an `eval_export.csv` after running new experiments:

```bash
python pipeline/evaluation/export_eval_csv.py  # export statistics to a flat per-run CSV
```

## Repository structure

```
├── data/
│   ├── labeling/                    # Labeling exports driving the agentic corpora
│   │   ├── score_labeling.csv       # SCORE labeler sheet, final run set (96 papers)
│   │   └── multi100_labeling.csv    # Multi100 reference sheet, final run set (84 papers)
│   ├── paper_ocr_abstract_only/     # Variant C (abstract): one folder per paper with abstract.md
│   ├── paper_ocr_full/              # Variant A (full): article.md + tbl-N.md / img-N.jpeg (placeholder only)
│   ├── paper_ocr_no_methods/        # Variant B (no_methods): article.md with methods stripped inline (placeholder only)
│   ├── multi100/
│   │   ├── human_results/           # Multi100 benchmark CSVs (downloaded by 01_download_metadata.py from github.com/marton-balazs-kovacs/multi100)
│   │   └── datasets/                # Per-paper OSF downloads, one folder per labeling paper_id
│   └── score/
│       ├── datasets/                # SCORE per-paper data, one // folder per labeling pair
│       └── companion_repos/
│           └── score_master/        # Cached SCORE rr-project OSF-ID index (OSF dtzx4)
├── pipeline/
│   ├── data_loading.py              # Shared paper-metadata + study-file loaders
│   ├── data_collection/             # One-time data preparation
│   │   ├── osf_common.py            # Shared OSF helpers + paths for the steps below
│   │   ├── 01_download_metadata.py  # Download metadata for step 02 (Multi100 human-benchmark CSVs + SCORE rr-project index)
│   │   └── 02_download_data.py      # OSF scraper: per-paper data for the labeling corpus (Multi100 + SCORE)
│   ├── agentic_analysis/            # Agentic LLM pipeline (Inspect AI + tool use)
│   │   ├── task.py                  # Single `reanalysis` task (all conditions via parameters)
│   │   ├── dataset.py               # Dataset loader facade (re-exports the corpus loaders)
│   │   ├── dataset_common.py        # Shared prompt building + sandbox file mounting
│   │   ├── dataset_multi100.py      # Multi100 corpus loader
│   │   ├── dataset_score.py         # SCORE corpus loader
│   │   ├── scorer.py                # Unified scorer: Cohen's d tolerance + conclusion match
│   │   ├── multi100/, score/        # Per-corpus prereg runner scripts (one per condition)
│   │   ├── Dockerfile               # Sandbox image (Python + analysis packages)
│   │   └── compose.yaml             # Docker Compose for sandbox
│   ├── evaluation/                  # Post-hoc evaluation
│   │   ├── export_eval_csv.py       # Flatten Inspect AI .eval logs into one CSV
│   │   ├── effect_size.py           # Multi100 test statistic → Cohen's d conversion
│   │   ├── processing_functions.py  # Helpers for the evaluation scripts
│   │   ├── statistics_and_values.py # Summary statistics reported in the paper
│   │   ├── figure_plot.py           # Generate the paper's figures (→ results/figures/)
│   │   └── cochran_ttest.py         # RQ3 Cochran's Q / Friedman tests + paired t-test (LLM vs original d)
│   └── misc/                        # Auxiliary scripts
│       └── contamination_probe.py   # Training data contamination check
├── results/                         # LLM analysis outputs
│   ├── eval_export.csv              # Flat per-run CSV of the eval logs; the published study results and the input to the evaluation scripts
│   ├── agentic_logs/                # Inspect AI eval logs (generated by running the experiments)
│   └── figures/                     # Generated figures
└── supplementary_materials/         # GUIDE-LLM checklist
```

## License

The code in this repository is released under the [MIT License](LICENSE).

## GUIDE-LLM reporting checklist

This repository includes the completed [GUIDE-LLM reporting checklist](https://sfeuerriegel.github.io/llm-checklist/) for the study: [supplementary_materials/GUIDE-LLM_checklist.pdf](supplementary_materials/GUIDE-LLM_checklist.pdf).