---
title: "Autodata: An agentic data scientist to create high quality synthetic data"
source_url: "https://arxiv.org/abs/2606.25996"
type: article
created: 2026-06-26
updated: 2026-06-26
sha256: 926477076b31e43b4d6e6a636cf8e87bbf842c3ccfe2c348372384d2534916b9
---

# Autodata: An agentic data scientist to create high quality synthetic data

Published Time: Fri, 26 Jun 2026 00:49:13 GMT

Computer Science > Artificial Intelligence

**arXiv:2606.25996** (cs)

[Submitted on 24 Jun 2026 ([v1](https://arxiv.org/abs/2606.25996v1)), last revised 25 Jun 2026 (this version, v2)]

Authors: [Ilia Kulikov](https://arxiv.org/search/cs?searchtype=author&query=Kulikov,+I), [Chenxi Whitehouse](https://arxiv.org/search/cs?searchtype=author&query=Whitehouse,+C), [Tianhao Wu](https://arxiv.org/search/cs?searchtype=author&query=Wu,+T), [Yixin Nie](https://arxiv.org/search/cs?searchtype=author&query=Nie,+Y), [Swarnadeep Saha](https://arxiv.org/search/cs?searchtype=author&query=Saha,+S), [Eryk Helenowski](https://arxiv.org/search/cs?searchtype=author&query=Helenowski,+E), [Weizhe Yuan](https://arxiv.org/search/cs?searchtype=author&query=Yuan,+W), [Olga Golovneva](https://arxiv.org/search/cs?searchtype=author&query=Golovneva,+O), [Jack Lanchantin](https://arxiv.org/search/cs?searchtype=author&query=Lanchantin,+J), [Yoram Bachrach](https://arxiv.org/search/cs?searchtype=author&query=Bachrach,+Y), [Jakob Foerster](https://arxiv.org/search/cs?searchtype=author&query=Foerster,+J), [Xian Li](https://arxiv.org/search/cs?searchtype=author&query=Li,+X), [Han Fang](https://arxiv.org/search/cs?searchtype=author&query=Fang,+H), [Sainbayar Sukhbaatar](https://arxiv.org/search/cs?searchtype=author&query=Sukhbaatar,+S), [Jason Weston](https://arxiv.org/search/cs?searchtype=author&query=Weston,+J)

[View PDF](http://arxiv.org/pdf/2606.25996) | [HTML (experimental)](https://arxiv.org/html/2606.25996v2) | [TeX Source](http://arxiv.org/src/2606.25996) | [view license](http://arxiv.org/licenses/nonexclusive-distrib/1.0/ "Rights to this article")

> Abstract: We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data. We describe the overall formulation, and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks and reasoning with mathematical objects, where we obtain improved results compared to classical synthetic dataset creation methods. Further, meta-optimizing the data scientist agent itself delivers an even larger performance uplift. Agentic data creation provides a way to convert increased inference compute into higher quality model training. Overall, we believe this direction has the potential to change the way we build AI data.

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as: [arXiv:2606.25996](https://arxiv.org/abs/2606.25996) [cs.AI] (or [arXiv:2606.25996v2](https://arxiv.org/abs/2606.25996v2) [cs.AI] for this version)

DOI: [https://doi.org/10.48550/arXiv.2606.25996](https://doi.org/10.48550/arXiv.2606.25996) (arXiv-issued DOI via DataCite)

## Submission history

From: Jason Weston

* **[[v1]](http://arxiv.org/abs/2606.25996v1)** Wed, 24 Jun 2026 16:08:31 UTC (19,889 KB)
* **[v2]** Thu, 25 Jun 2026 13:26:50 UTC (19,879 KB)