---
description: "Scale synthetic data generation using NeMo Data Designer (NDD) with declarative configuration and structured column generation"
categories: ["how-to-guides"]
tags: ["nemo-data-designer", "ndd", "synthetic-data", "data-designer", "declarative-config", "structured-generation"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "text-only"
---
# NeMo Data Designer Integration
[NeMo Data Designer (NDD)](https://nvidia-nemo.github.io/DataDesigner/latest/) is a declarative data generation framework that integrates with NeMo Curator to scale synthetic data pipelines. Instead of writing imperative LLM call logic, you define a configuration that describes what columns to generate, how to sample structured fields, and which LLM to use. NDD handles execution, batching, and token metric collection automatically.
## How It Works
NeMo Curator wraps NDD through the `DataDesignerStage`, which accepts a `DataDesignerConfigBuilder` or a YAML config file. The stage:
1. Takes input records from a `DocumentBatch`
2. Passes them to NDD as a seed dataset
3. Calls `DataDesigner.preview()` to generate new columns (samplers, expressions, LLM text)
4. Returns the enriched dataset as a new `DocumentBatch` with token usage metrics
```mermaid
flowchart LR
A["Seed Data
(JSONL/Parquet)"] --> B["DataDesignerStage"]
B --> C["NDD Engine
(Samplers, Expressions,
LLM Generation)"]
C --> D["Enriched Output
(JSONL/Parquet)"]
E["LLM Endpoint
(Local InferenceServer
or NVIDIA NIM)"] -.->|"API Calls"| C
classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000
class A,B stage
class D output
class C,E infra
```
## Prerequisites
Install the NDD dependency:
```bash
uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12]
```
The `data-designer` package is included in the text extras. For local model serving, also install:
```bash
uv pip install nemo-curator[inference_server]
```
## DataDesignerStage
The `DataDesignerStage` is the core integration point between NeMo Curator and NDD.
### Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `config_builder` | `DataDesignerConfigBuilder` | None | NDD configuration builder. Mutually exclusive with `data_designer_config_file`. |
| `data_designer_config_file` | str | None | Path to a YAML config file. Mutually exclusive with `config_builder`. |
| `model_providers` | list | None | Custom `ModelProvider` instances for local or test endpoints. If None, NDD uses its default providers. |
| `verbose` | bool | False | When True, show full NDD log output. |
### Metrics
`DataDesignerStage` automatically collects and reports:
- `ndd_running_time`: Wall-clock time for the NDD `preview()` call
- `num_input_records` / `num_output_records`: Record counts before and after generation
- `input_tokens_median_per_record` / `output_tokens_median_per_record`: Median token counts across all LLM columns
## Building a Configuration
NDD configurations use a builder pattern. You add columns of three types:
For full documentation for building NDD configuration, see the [NDD config builder reference](https://nvidia-nemo.github.io/DataDesigner/latest/code_reference/config_builder/).
### Sampler Columns
Generate structured data using built-in samplers (Faker names, UUIDs, dates):
```python
import data_designer.config as dd
config_builder = dd.DataDesignerConfigBuilder(model_configs=[model_config])
config_builder.add_column(
dd.SamplerColumnConfig(
name="patient_name",
sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
params=dd.PersonFromFakerSamplerParams(),
)
)
config_builder.add_column(
dd.SamplerColumnConfig(
name="patient_id",
sampler_type=dd.SamplerType.UUID,
params=dd.UUIDSamplerParams(prefix="PT-", short_form=True, uppercase=True),
)
)
```
### Expression Columns
Derive values from other columns using Jinja templates:
```python
config_builder.add_column(
dd.ExpressionColumnConfig(
name="first_name",
expr="{{ patient_name.first_name }}",
)
)
```
### LLM Text Columns
Generate text using an LLM with prompts that reference other columns:
```python
config_builder.add_column(
dd.LLMTextColumnConfig(
name="physician_notes",
prompt="""\
You are a primary-care physician who just had an appointment with {{ first_name }}.
{{ patient_summary }}
Write careful notes about your visit. Respond with only the notes.
""",
model_alias="local-llm",
)
)
```
## End-to-End Example
This example generates synthetic medical notes from seed symptom data using a local `InferenceServer`:
```python
import data_designer.config as dd
from nemo_curator.backends.ray_data import RayDataExecutor
from nemo_curator.core.client import RayClient
from nemo_curator.core.serve import InferenceModelConfig, InferenceServer
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage
from nemo_curator.stages.text.io.reader.jsonl import JsonlReader
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter
# Start Ray cluster
client = RayClient(num_cpus=16, num_gpus=4)
client.start()
# Start local inference server
server_config = InferenceModelConfig(
model_identifier="google/gemma-3-27b-it",
deployment_config={"autoscaling_config": {"min_replicas": 1, "max_replicas": 1}},
engine_kwargs={"tensor_parallel_size": 4},
)
inference_server = InferenceServer(models=[server_config])
inference_server.start()
# Configure NDD model
model_config = dd.ModelConfig(
alias="local-llm",
model="google/gemma-3-27b-it",
provider="local",
skip_health_check=True,
inference_parameters=dd.ChatCompletionInferenceParams(
temperature=1.0, top_p=1.0, max_tokens=2048,
),
)
model_provider = dd.ModelProvider(
name="local",
endpoint=inference_server.endpoint,
api_key="unused",
)
# Build config with sampler and LLM columns
config_builder = dd.DataDesignerConfigBuilder(model_configs=[model_config])
config_builder.add_column(
dd.SamplerColumnConfig(
name="patient_name",
sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
params=dd.PersonFromFakerSamplerParams(),
)
)
config_builder.add_column(
dd.LLMTextColumnConfig(
name="physician_notes",
prompt="You are a physician. Write notes for {{ patient_name.first_name }} "
"who has {{ diagnosis }}. {{ patient_summary }}",
model_alias="local-llm",
)
)
# Build and run pipeline
pipeline = Pipeline(name="ndd_medical_notes")
pipeline.add_stage(JsonlReader(file_paths="seed_data/*.jsonl", fields=["diagnosis", "patient_summary"]))
pipeline.add_stage(DataDesignerStage(config_builder=config_builder, model_providers=[model_provider]))
pipeline.add_stage(JsonlWriter(path="./synthetic_output"))
pipeline.run(executor=RayDataExecutor())
inference_server.stop()
client.stop()
```
## Using a Remote Provider
To use NVIDIA NIM or another hosted endpoint instead of a local server, configure the `ModelProvider` with the remote URL and API key:
```python
import os
import data_designer.config as dd
from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage
model_config = dd.ModelConfig(
alias="nim-llm",
model="meta/llama-3.3-70b-instruct",
provider="nvidia",
inference_parameters=dd.ChatCompletionInferenceParams(
temperature=0.5, top_p=0.9, max_tokens=1600,
),
)
model_provider = dd.ModelProvider(
name="nvidia",
endpoint="https://integrate.api.nvidia.com/v1",
provider_type="openai",
api_key=os.environ["NVIDIA_API_KEY"],
)
config_builder = dd.DataDesignerConfigBuilder(model_configs=[model_config])
# Add columns as needed...
stage = DataDesignerStage(
config_builder=config_builder,
model_providers=[model_provider],
)
```
## NDD-Backed Nemotron-CC Stages
The Nemotron-CC synthetic data stages have NDD-backed equivalents that replace the `AsyncOpenAIClient` with NDD execution. These stages accept the same `input_field`, `output_field`, and prompt parameters, but route generation through `DataDesignerStage` internally.
| Stage | Import Path | Output Field |
| --- | --- | --- |
| `WikipediaParaphrasingStage` | `nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.nemotron_cc` | `rephrased` |
| `DiverseQAStage` | `nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.nemotron_cc` | `diverse_qa` |
| `DistillStage` | `nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.nemotron_cc` | `distill` |
| `ExtractKnowledgeStage` | `nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.nemotron_cc` | `extract_knowledge` |
| `KnowledgeListStage` | `nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.nemotron_cc` | `knowledge_list` |
These stages inherit from `NDDBaseSyntheticStage`, which auto-builds an NDD config from the prompt fields. You configure the LLM through `model_configs` and `model_providers` instead of an `AsyncOpenAIClient`:
```python
import os
import data_designer.config as dd
from nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.nemotron_cc import DiverseQAStage
model_config = dd.ModelConfig(
alias="meta/llama-3.3-70b-instruct",
model="meta/llama-3.3-70b-instruct",
provider="nvidia",
inference_parameters=dd.ChatCompletionInferenceParams(
temperature=0.5, top_p=0.9, max_tokens=1600,
),
)
model_provider = dd.ModelProvider(
name="nvidia",
endpoint="https://integrate.api.nvidia.com/v1",
provider_type="openai",
api_key=os.environ["NVIDIA_API_KEY"],
)
stage = DiverseQAStage(
input_field="text",
output_field="diverse_qa",
model_alias="meta/llama-3.3-70b-instruct",
model_configs=[model_config],
model_providers=[model_provider],
)
```
## YAML Configuration
Instead of building configs in Python, you can define the entire NDD configuration in a YAML file and pass it to `DataDesignerStage`:
```python
stage = DataDesignerStage(data_designer_config_file="config.yaml")
```
This is useful for reproducible pipelines where the generation config is versioned alongside data artifacts.
---
## Next Steps
- [Inference Server](/curate-text/synthetic/inference-server): Co-locate model serving with your pipeline
- [Nemotron-CC Pipelines](/curate-text/synthetic/nemotron-cc): Advanced text transformation tasks
- [Synthetic Data Generation](/curate-text/synthetic): Overview of all SDG capabilities