---
description: "Generate and augment training data using LLMs with NeMo Curator's synthetic data generation pipeline"
categories: ["workflows"]
tags: ["synthetic-data", "llm", "generation", "augmentation", "multilingual"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "workflow"
modality: "text-only"
---
# Synthetic Data Generation
NeMo Curator provides synthetic data generation (SDG) capabilities for creating and augmenting training data using Large Language Models (LLMs). These pipelines integrate with OpenAI-compatible APIs, enabling you to use NVIDIA NIM endpoints, NeMo Curator's built-in [Inference Server](/curate-text/synthetic/inference-server) (Ray Serve + vLLM), or other inference providers.
## Use Cases
- **Data Augmentation**: Expand limited datasets by generating diverse variations
- **Multilingual Generation**: Create Q&A pairs and text in multiple languages
- **Knowledge Extraction**: Convert raw text into structured knowledge formats
- **Quality Improvement**: Paraphrase low-quality text into higher-quality Wikipedia-style prose
- **Training Data Creation**: Generate instruction-following data for model fine-tuning
## Core Concepts
Synthetic data generation in NeMo Curator operates in two primary modes:
### Generation Mode
Create new data from scratch without requiring input documents. The `QAMultilingualSyntheticStage` demonstrates this pattern—it generates Q&A pairs based on a prompt template without needing seed documents.
### Transformation Mode
Improve or restructure existing data using LLM capabilities. The Nemotron-CC stages exemplify this approach, taking input documents and producing:
- Paraphrased text in Wikipedia style
- Diverse Q&A pairs derived from document content
- Condensed knowledge distillations
- Extracted factual content
### Declarative Mode (NeMo Data Designer)
Define data generation pipelines declaratively using [NeMo Data Designer](/curate-text/synthetic/nemo-data-designer) (NDD). Instead of writing imperative LLM call logic, you configure structured column generation (samplers, expressions, LLM text columns) through a builder API or YAML file. NDD handles execution, batching, and token metric collection. This mode supports both standalone generation and NDD-backed versions of Nemotron-CC stages.
## Architecture
The following diagram shows how SDG pipelines process data through preprocessing, LLM generation, and postprocessing stages:
```mermaid
flowchart LR
A["Input Documents
(Parquet/JSONL)"] --> B["Preprocessing
(Tokenization,
Segmentation)"]
B --> C["LLM Generation
(OpenAI-compatible)"]
C --> D["Postprocessing
(Cleanup, Filtering)"]
D --> E["Output Dataset
(Parquet/JSONL)"]
F["LLM Client
(NVIDIA API,
InferenceServer,
vLLM, TGI)"] -.->|"API Calls"| C
classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000
class A,B,C,D stage
class E output
class F infra
```
## Prerequisites
Before using synthetic data generation, ensure you have:
1. **NVIDIA API Key** (for cloud endpoints)
- Obtain from [NVIDIA Build](https://build.nvidia.com/settings/api-keys)
- Set as environment variable: `export NVIDIA_API_KEY="your-key"`
2. **NeMo Curator with text extras**
```bash
uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12]
```
3. **Local inference** (optional) — to serve models alongside your pipeline:
```bash
uv pip install nemo-curator[inference_server]
```
Refer to the [Inference Server](/curate-text/synthetic/inference-server) guide for setup details.
Nemotron-CC pipelines use the `transformers` library for tokenization, which is included in NeMo Curator core dependencies.
## Available SDG Stages
| Stage | Purpose | Input Type |
| --- | --- | --- |
| `QAMultilingualSyntheticStage` | Generate multilingual Q&A pairs | Empty (generates from scratch) |
| `WikipediaParaphrasingStage` | Rewrite text as Wikipedia-style prose | Document text |
| `DiverseQAStage` | Generate diverse Q&A pairs from documents | Document text |
| `DistillStage` | Create condensed, information-dense paraphrases | Document text |
| `ExtractKnowledgeStage` | Extract knowledge as textbook-style passages | Document text |
| `KnowledgeListStage` | Extract structured fact lists | Document text |
| `DataDesignerStage` | Declarative generation via NeMo Data Designer | Seed data (any schema) |
---
## Topics
Configure OpenAI-compatible clients for NVIDIA APIs and custom endpoints
configuration
performance
Serve LLMs locally via Ray Serve and vLLM alongside curation pipelines
ray-serve
local-inference
Generate synthetic Q&A pairs across multiple languages
quickstart
tutorial
Declarative data generation with structured columns and NDD-backed Nemotron-CC stages
ndd
declarative
Advanced text transformation and knowledge extraction workflows
advanced
paraphrasing