---
title: "Autodata: An agentic data scientist to create high quality synthetic data"
source_url: "https://arxiv.org/abs/2606.25996"
type: article
created: 2026-06-26
updated: 2026-06-26
sha256: 926477076b31e43b4d6e6a636cf8e87bbf842c3ccfe2c348372384d2534916b9
---

# Autodata: An agentic data scientist to create high quality synthetic data


Published Time: Fri, 26 Jun 2026 00:49:13 GMT

# Computer Science > Artificial Intelligence

**arXiv:2606.25996** (cs) 

 [Submitted on 24 Jun 2026 ([v1](https://arxiv.org/abs/2606.25996v1)), last revised 25 Jun 2026 (this version, v2)]

Authors: [Ilia Kulikov](https://arxiv.org/search/cs?searchtype=author&query=Kulikov,+I), [Chenxi Whitehouse](https://arxiv.org/search/cs?searchtype=author&query=Whitehouse,+C), [Tianhao Wu](https://arxiv.org/search/cs?searchtype=author&query=Wu,+T), [Yixin Nie](https://arxiv.org/search/cs?searchtype=author&query=Nie,+Y), [Swarnadeep Saha](https://arxiv.org/search/cs?searchtype=author&query=Saha,+S), [Eryk Helenowski](https://arxiv.org/search/cs?searchtype=author&query=Helenowski,+E), [Weizhe Yuan](https://arxiv.org/search/cs?searchtype=author&query=Yuan,+W), [Olga Golovneva](https://arxiv.org/search/cs?searchtype=author&query=Golovneva,+O), [Jack Lanchantin](https://arxiv.org/search/cs?searchtype=author&query=Lanchantin,+J), [Yoram Bachrach](https://arxiv.org/search/cs?searchtype=author&query=Bachrach,+Y), [Jakob Foerster](https://arxiv.org/search/cs?searchtype=author&query=Foerster,+J), [Xian Li](https://arxiv.org/search/cs?searchtype=author&query=Li,+X), [Han Fang](https://arxiv.org/search/cs?searchtype=author&query=Fang,+H), [Sainbayar Sukhbaatar](https://arxiv.org/search/cs?searchtype=author&query=Sukhbaatar,+S), [Jason Weston](https://arxiv.org/search/cs?searchtype=author&query=Weston,+J)

[View PDF](http://arxiv.org/pdf/2606.25996) | [HTML (experimental)](https://arxiv.org/html/2606.25996v2)

> Abstract: We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data. We describe the overall formulation, and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks and reasoning with mathematical objects, where we obtain improved results compared to classical synthetic dataset creation methods. Further, meta-optimizing the data scientist agent itself delivers an even larger performance uplift. Agentic data creation provides a way to convert increased inference compute into higher quality model training. Overall, we believe this direction has the potential to change the way we build AI data.

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as: [arXiv:2606.25996](https://arxiv.org/abs/2606.25996) [cs.AI] (or [arXiv:2606.25996v2](https://arxiv.org/abs/2606.25996v2) [cs.AI] for this version)

DOI: [https://doi.org/10.48550/arXiv.2606.25996](https://doi.org/10.48550/arXiv.2606.25996) (arXiv-issued DOI via DataCite)

## Submission history

From: Jason Weston

**[[v1]](http://arxiv.org/abs/2606.25996v1)** Wed, 24 Jun 2026 16:08:31 UTC (19,889 KB)

**[v2]** Thu, 25 Jun 2026 13:26:50 UTC (19,879 KB)

## Access Paper:

*   [View PDF](http://arxiv.org/pdf/2606.25996)
*   [HTML (experimental)](https://arxiv.org/html/2606.25996v2)
*   [TeX Source](http://arxiv.org/src/2606.25996)

[view license](http://arxiv.org/licenses/nonexclusive-distrib/1.0/ "Rights to this article")
