---
source_url: "https://datajuicer.github.io/data-juicer/en/main/"
ingested: 2026-06-26
sha256: 707695496fd9b44d
---
sha256: 314c3194a2a54db9
---
title: "The Data Operating System for the Foundation Model Era — Data Juicer"
source_url: "https://datajuicer.github.io/data-juicer/en/main/"
ingested: 2026-06-22
type: article
---

# The Data Operating System for the Foundation Model Era — Data Juicer


Published Time: Mon, 22 Jun 2026 10:20:26 GMT

[![Image 1: PyPI](https://img.shields.io/pypi/v/py-data-juicer?logo=pypi&color=026cad)](https://pypi.org/project/py-data-juicer)[![Image 2: Downloads](https://static.pepy.tech/personalized-badge/py-data-juicer?period=total&units=INTERNATIONAL_SYSTEM&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/projects/py-data-juicer)[![Image 3: Docker](https://img.shields.io/docker/v/datajuicer/data-juicer?logo=docker&label=Docker&color=498bdf)](https://hub.docker.com/r/datajuicer/data-juicer)

[![Image 4: Docs](https://img.shields.io/badge/%F0%9F%93%96_Docs-Website-026cad)](https://datajuicer.github.io/data-juicer/)[![Image 5: Operators](https://img.shields.io/badge/%F0%9F%A7%A9_Operators-200+-blue)](https://datajuicer.github.io/data-juicer/en/main/docs/Operators.html)[![Image 6: Recipes](https://img.shields.io/badge/%F0%9F%8D%B3_Recipes-50+-brightgreen)](https://github.com/datajuicer/data-juicer-hub)

[![Image 7: Chinese](https://img.shields.io/badge/%F0%9F%87%A8%F0%9F%87%B3_%E6%96%87%E6%A1%A3-%E4%B8%BB%E9%A1%B5-red)](https://datajuicer.github.io/data-juicer/zh_CN/main/index_ZH.html)[![Image 8: Paper](https://img.shields.io/badge/NeurIPS'25_Spotlight-2.0-B31B1B?logo=arxiv)](https://arxiv.org/abs/2501.14755)[![Image 9: Coverage](https://img.shields.io/endpoint?style=flat&url=https%3A%2F%2Fgist.githubusercontent.com%2FHYLcool%2Ff856b14416f08f73d05d32fd992a9c29%2Fraw%2Ftotal_cov.json&label=coverage&logo=codecov&color=4c1)](https://github.com/datajuicer/data-juicer)

**Multimodal | Cloud-Native | AI-Ready | Large-Scale**

Data-Juicer (DJ) transforms raw data chaos into AI-ready intelligence. It treats data processing as _composable infrastructure_—providing modular building blocks to clean, synthesize, and analyze data across the entire AI lifecycle, unlocking latent value in every byte.

Whether you’re deduplicating web-scale pre-training corpora, curating agent interaction traces, or preparing domain-specific RAG indices, DJ scales seamlessly from your laptop to thousand-node clusters—no glue code required.

* * *

## 🚀 Quick Start

**Zero-install exploration**:

*   [JupyterLab Playground with Tutorials](http://8.138.149.181/)

*   [Ask DJ Copilot](https://datajuicer.github.io/data-juicer/en/main/docs_index.html)

**Install & run**:

```bash
uv pip install py-data-juicer
dj-process --config demos/process_simple/process.yaml
```

**Or compose in Python**:

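```python
from data_juicer.core.data import NestedDataset
from data_juicer.ops.filter import TextLengthFilter
from data_juicer.ops.mapper import WhitespaceNormalizationMapper

# Build a small in-memory dataset with a single "text" column.
ds = NestedDataset.from_dict({
    "text": ["Short", "This passes the filter.", "Text with spaces"]
})

# Chain operators: drop texts shorter than 10 characters,
# then normalize whitespace in the remaining samples.
res_ds = ds.process([
    TextLengthFilter(min_len=10),
    WhitespaceNormalizationMapper()
])

# Print each surviving sample.
for s in res_ds:
    print(s)
```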

* * *

## ✨ Why Data-Juicer?

### 1. Modular & Extensible Architecture

*   **200+ operators** spanning text, image, audio, video, and multimodal data

*   **Recipe-first**: Reproducible YAML pipelines you can version, share, and fork like code

*   **Composable**: Drop in a single operator, chain complex workflows, or orchestrate full pipelines

*   **Hot-reload**: Iterate on operators without pipeline restarts
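
To make the recipe-first and composable ideas above concrete, here is a minimal pure-Python sketch. It is not Data-Juicer's implementation: the `REGISTRY` dictionary, the `run_recipe` helper, and the toy operators are hypothetical names introduced only for illustration. The `recipe` dict merely mirrors the general shape of a YAML pipeline, assumed here to be an ordered `process` list of operator names with their parameters.

```python
# Toy illustration of a recipe-driven pipeline. All names are hypothetical;
# this is NOT Data-Juicer's actual implementation.
from typing import Callable, Dict, List

Sample = Dict[str, str]
Operator = Callable[[List[Sample]], List[Sample]]


def text_length_filter(min_len: int) -> Operator:
    """Build an operator that keeps samples with at least min_len characters."""
    def op(samples: List[Sample]) -> List[Sample]:
        return [s for s in samples if len(s["text"]) >= min_len]
    return op


def whitespace_normalization_mapper() -> Operator:
    """Build an operator that collapses whitespace runs into single spaces."""
    def op(samples: List[Sample]) -> List[Sample]:
        return [{**s, "text": " ".join(s["text"].split())} for s in samples]
    return op


# Registry: operator name as written in a recipe -> operator factory.
REGISTRY = {
    "text_length_filter": text_length_filter,
    "whitespace_normalization_mapper": whitespace_normalization_mapper,
}

# A recipe is plain data, so it can be versioned, shared, and forked.
recipe = {
    "process": [
        {"text_length_filter": {"min_len": 10}},
        {"whitespace_normalization_mapper": {}},
    ]
}


def run_recipe(recipe: dict, samples: List[Sample]) -> List[Sample]:
    """Apply each operator listed in the recipe, in order."""
    for step in recipe["process"]:
        for name, params in step.items():
            samples = REGISTRY[name](**params)(samples)
    return samples


data = [{"text": "Short"}, {"text": "Text   with   spaces"}]
result = run_recipe(recipe, data)
print(result)
```

Because the pipeline is described as data rather than code, swapping, reordering, or removing a step is a one-line change to the recipe.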

### 2. Full-Spectrum Data Intelligence

*   **Foundation Models**: Pre-training, fine-tuning, RL, and evaluation-grade curation

*   **Agent Systems**: Clean tool traces, structure context, de-identification, and quality gating

*   **RAG & Analytics**: Extraction, normalization, semantic chunking, deduplication, and data profiling

### 3. Production-Ready Performance

*   **Scale**: Process 70B samples in 2h on 50 Ray nodes (6400 cores)

*   **Efficiency**: Deduplicate 5TB in 2.8h using 1280 cores

*   **Optimization**: Automatic OP fusion (2-10x speedup), adaptive parallelism, CUDA acceleration, robustness

*   **Observability**: Built-in tracing for debugging, auditing, and iterative improvement
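
As a rough intuition for what operator fusion can buy, the sketch below fuses several filters that need the same intermediate result (a word list), so that it is computed once per sample instead of once per filter. This is a toy under stated assumptions, not Data-Juicer's fusion logic: `fuse_filters`, `word_count_filter`, `avg_word_length_filter`, and the `tokenize` helper are hypothetical names.

```python
# Toy sketch of filter fusion. Hypothetical names; not Data-Juicer's code.
from typing import Callable, List

# Each fusable filter is a predicate over a pre-computed word list.
WordPredicate = Callable[[List[str]], bool]

tokenize_calls = 0  # counts how often the shared step actually runs


def tokenize(text: str) -> List[str]:
    """Shared intermediate step; imagine this being expensive."""
    global tokenize_calls
    tokenize_calls += 1
    return text.split()


def word_count_filter(min_words: int) -> WordPredicate:
    return lambda words: len(words) >= min_words


def avg_word_length_filter(max_avg: float) -> WordPredicate:
    return lambda words: bool(words) and sum(map(len, words)) / len(words) <= max_avg


def fuse_filters(predicates: List[WordPredicate]) -> Callable[[str], bool]:
    """Combine predicates so tokenize() runs once per sample."""
    def fused(text: str) -> bool:
        words = tokenize(text)
        return all(p(words) for p in predicates)
    return fused


texts = [
    "tiny",
    "a reasonably long sentence here",
    "supercalifragilistic words only",
]
keep = fuse_filters([word_count_filter(3), avg_word_length_filter(8.0)])
kept = [t for t in texts if keep(t)]
print(kept, tokenize_calls)
```

Run unfused, the two filters would tokenize each text twice; fused, `tokenize` runs once per sample.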

> _⭐ If Data-Juicer saved you time or improved your data work, please consider starring the repo._ It helps more people discover the project and keeps you notified of new releases and features.

* * *

## 📰 News

[2026-05-29] Release v1.5.2: **Semantic LLM OPs, Cross-doc Line Dedup & Leaner Dependencies**
*   🧹 _New Deduplicator_ — Added `DocumentLineDeduplicator` for cross-document line-level dedup, removing boilerplate lines (templates, copyright notices, navigation bars) by global document frequency.

*   🤖 _Agent Data Quality Toolkit_ — Shipped interaction-quality OPs & recipe, a bad-case HTML report, and more robust JSONL / HuggingFace meta loading.

*   📦 _Leaner & Faster Install_ — Slimmed the default dependency set (Ray, audio, spaCy, av, etc. moved to on-demand extras) to speed up installation.

*   🐳 _Stability & Robustness Fixes_ — Library-safe error handling (raise over `exit(1)`), Ray init/temp-dir fixes, valid API params (drop invalid `max_new_tokens`), PyArrow 20+ batch JSON reading, local-path aesthetics model support, and more performance/bug fixes.

*   🧠 _Semantic LLM Operators_ — Introduced `llm_extract_mapper`, `llm_condition_filter`, and `llm_structured_ops` with unified `llm_*` naming and configurable inference strategies (join/agg/top-k planned).
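
The cross-document line-level dedup idea from the v1.5.2 notes above can be sketched in a few lines: count how many documents each line appears in, then drop lines whose document frequency exceeds a threshold. This is a stdlib-only illustration under assumptions, not the `DocumentLineDeduplicator` implementation; the function name `drop_boilerplate_lines` and the `max_doc_freq` parameter are hypothetical.

```python
# Toy sketch of line dedup by global document frequency.
# Hypothetical names; not the DocumentLineDeduplicator implementation.
from collections import Counter
from typing import List


def drop_boilerplate_lines(docs: List[str], max_doc_freq: float = 0.5) -> List[str]:
    """Remove lines that appear in more than max_doc_freq of all documents.

    Blank lines are dropped as well, to keep the sketch short.
    """
    # Count each distinct (stripped) line at most once per document.
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update({ln.strip() for ln in doc.splitlines() if ln.strip()})

    limit = max_doc_freq * len(docs)
    cleaned = []
    for doc in docs:
        kept = [
            ln for ln in doc.splitlines()
            if ln.strip() and doc_freq[ln.strip()] <= limit
        ]
        cleaned.append("\n".join(kept))
    return cleaned


docs = [
    "Home | About\nCats are mammals.\n(c) Example Corp",
    "Home | About\nRust has ownership.\n(c) Example Corp",
    "Home | About\nTea is a drink.\n(c) Example Corp",
]
cleaned = drop_boilerplate_lines(docs)
print(cleaned)
```

Here the navigation bar and copyright line occur in all three documents and are removed, while each document's unique sentence survives.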

[2026-03-17] Release v1.5.1: **LaTeX OPs; Compressed Format Support; Operator Robustness Fixes**
*   📄 Two new LaTeX-focused mapper OPs shipped, extending Data-Juicer’s document processing capabilities to handle `.tex` archives and figure contexts.

*   🗜️ Compressed dataset format support: `json[l].gz` files can now be loaded directly, and Ray datasets gain proper support for reading compressed JSON files.

*   📚 New documentation added covering cache, export, and tracing workflows to help users better understand and debug data processing pipelines.

*   🤖 Major refactor and upgrade of data-juicer-agents completed: the project architecture and CLI/session capabilities were comprehensively redesigned for better maintainability and extensibility. See [data-juicer-agents](https://github.com/datajuicer/data-juicer-agents) for more details.

[2026-02-12] Release v1.5.0: **Partitioned Ray Executor, OP-level Env Management, and More Embodied-AI OPs**
*   🚀 _Enhanced Distributed Execution Framework_ – Introduced partitioned Ray executor and OP-level isolated environments to improve fault tolerance, scalability, and dependency conflict resolution.

*   🤖 _Expanded Embodied AI Video Processing_ – Added specialized operators for camera calibration, video undistortion, hand reconstruction, and pose estimation to strengthen multi-view video handling.

*   💪🏻 _System Performance & Developer Experience Optimizations_ – Enabled batch inference, memory/log reduction, core logic refactoring, and updated documentation/templates.

*   🐳 _Critical Bug Fixes & Stability Improvements_ – Resolved duplicate tracking, parameter conflicts, homepage rendering issues, and outdated docs for higher reliability.

[2026-02-02] Release v1.4.6: **Copilot, Video Bytes I/O & Ray Tracing**
*   🤖 _Q&A Copilot_ — Now live on our [Doc Site](https://datajuicer.github.io/data-juicer/en/main/index.html) | [DingTalk](https://qr.dingtalk.com/action/joingroup?code=v1,k1,N78tgW54U447gJP5aMC95B6qgQhlkVQS4+dp7qQq6MpuRVJIwrSsXmL8oFqU5ajJ&_dt_no_comment=1&origin=11?) | [Discord](https://discord.gg/ngQbB9hEVK). Feel free to ask anything related to the Data-Juicer ecosystem!

    *   Check 🤖 [Data-Juicer Agents](https://github.com/datajuicer/data-juicer-agents/blob/main) | 📃 [Deploy-ready code](https://github.com/datajuicer/data-juicer-agents/blob/main/qa-copilot) | 🎬 [More demos](https://github.com/datajuicer/data-juicer-agents/blob/main/qa-copilot/DEMO.md) for more details.

*   🎬 _Video Bytes I/O_ — Direct bytes processing for video pipelines

*   🫆 _Ray Mode Tracer_ — Track changed samples in distributed processing

*   🐳 _Enhancements & fixes_ — refreshed Docker image, small perf boosts, GitHub Insights traffic workflow, Ray compatibility updates, and bug/doc fixes.

[2026-01-15] Release v1.4.5: **20+ New OPs, Ray vLLM Pipelines & Sphinx Docs Upgrade**
*   _Embodied-AI OPs_: added/enhanced mappers for video captioning (VLM), video object segmentation (YOLOE+SAM2), video depth estimation (viz + point cloud), human pose (MMPose), image tagging (VLM), single-image 3D body mesh recovery (SAM 3D Body), plus _S3 upload/download_.

*   _New Pipeline OP_: compose multiple OPs into one pipeline; introduced _Ray + vLLM_ pipelines for LLM/VLM inference.

*   _Docs upgrade_: moved to a unified _Sphinx-based_ documentation build/deploy workflow with isolated theme/architecture repo.

*   _Enhancements & fixes_: dependency updates, improved Ray deduplication and S3 loading, OpenAI Responses API support, tracer consistency, Docker base updated to CUDA 12.6.3 + Ubuntu 24.04 + Py3.11, and multiple bug fixes.

[2025-12-01] Release v1.4.4: **NeurIPS’25 Spotlight, 6 New Video/MM OPs & S3 I/O**
*   NeurIPS’25 **Spotlight** for Data-Juicer 2.0

*   _Repo split_: sandbox/recipes/agents moved to standalone repos

*   _S3 I/O_ added to loader/exporter

*   _6 new video & multimodal OPs_ (character detection, VGGT, whole-body pose, hand reconstruction) + docs/Ray/video I/O improvements and bug fixes

View [All Releases](https://github.com/datajuicer/data-juicer/releases) and [News Archive](http://datajuicer.github.io/data-juicer/en/main/docs/news.html)

* * *

## 🔌 Users & Ecosystems

> The list below focuses on _developer-facing integrations and usage_, in _alphabetical order_.
> 
>  Missing your project / name? Feel free to [open a PR](https://github.com/datajuicer/data-juicer/pulls) or [reach out](http://datajuicer.github.io/data-juicer/en/main/#contributing-community).

Data-Juicer plugs into your existing stack and evolves with community contributions:

### Extensions

*   **[data-juicer-agents](https://github.com/datajuicer/data-juicer-agents)** — DJ Copilot and agentic workflows

*   **[data-juicer-hub](https://github.com/datajuicer/data-juicer-hub)** — Community recipes and best practices

*   **[data-juicer-sandbox](https://github.com/datajuicer/data-juicer-sandbox)** — Data-model co-development with feedback loops

### Frameworks & Platforms

[AgentScope](https://github.com/agentscope-ai/agentscope) · [Apache Arrow](https://github.com/apache/arrow) · [Apache HDFS](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html) · [Apache Hudi](https://hudi.apache.org/) · [Apache Iceberg](https://iceberg.apache.org/) · [Apache Paimon](https://paimon.apache.org/) · [Alibaba PAI](https://www.alibabacloud.com/en/product/machine-learning?_p_lc=1) · [Delta Lake](https://delta.io/) · [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio) · [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) · [Eval-Scope](https://github.com/modelscope/evalscope) · [Huawei Ascend](https://www.huawei.com/en/products/cloud-computing-dc/atlas/ascend) · [Hugging Face](https://huggingface.co/) · [LanceDB](https://lancedb.github.io/lance/) · [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) · [ModelSco