---
description: "Add unique identifiers to documents in your text dataset for tracking and deduplication workflows"
categories: ["text-curation"]
tags: ["preprocessing", "identifiers", "document-tracking", "pipeline"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "how-to"
modality: "text-only"
---

# Adding Document IDs

Add unique identifiers to each document in your text dataset.

## How It Works

Document IDs are useful for:
- **Pipeline tracking** - Monitor documents through processing stages
- **Dataset versioning** - Identify documents across different versions

---

## Usage

### Basic Usage

```python
from nemo_curator.stages.text.modules import AddId

# Initialize the pipeline, add a reader stage, etc.

# Add to your pipeline
pipeline.add_stage(AddId(id_field="doc_id"))
```

### Configuration Options

```python
# Customize ID generation
pipeline.add_stage(AddId(
    id_field="document_id",        # Field name for IDs
    id_prefix="corpus_v2",         # Optional prefix
    overwrite=True                 # Overwrite existing IDs
))
```

#### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `id_field` | `str` | Required | Field name where IDs will be stored |
| `id_prefix` | `str` | `None` | Optional prefix for IDs |
| `overwrite` | `bool` | `False` | Whether to overwrite an existing ID field |

#### ID Format

Generated IDs follow this pattern:
- Without prefix: `{task_uuid}_{index}`
- With prefix: `{prefix}_{task_uuid}_{index}`
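
The pattern above can be sketched in plain Python. This is only an illustration of the string format, not the `AddId` implementation: the helper `make_ids` is hypothetical, the task UUID is generated locally as a stand-in for the one the pipeline assigns, and starting the index at 0 is an assumption.

```python
import uuid
from typing import List, Optional


def make_ids(task_uuid: str, num_docs: int, prefix: Optional[str] = None) -> List[str]:
    """Build one ID per document following the documented pattern (hypothetical helper)."""
    base = f"{prefix}_{task_uuid}" if prefix else task_uuid
    # Assumption: the per-task index starts at 0.
    return [f"{base}_{index}" for index in range(num_docs)]


# Stand-in task identifier; in a real run the pipeline supplies it.
task_uuid = str(uuid.uuid4())

ids_without_prefix = make_ids(task_uuid, 3)
ids_with_prefix = make_ids(task_uuid, 3, prefix="corpus_v2")

print(ids_without_prefix[0])  # <task_uuid>_0
print(ids_with_prefix[0])     # corpus_v2_<task_uuid>_0
```

Because the task UUID is part of every ID, two tasks that both contain a document at index 0 still produce distinct IDs.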

### Complete Example

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.modules import AddId
from nemo_curator.stages.text.io.writer import JsonlWriter

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

# Create pipeline
pipeline = Pipeline(name="add_ids")

# Add stages
pipeline.add_stage(JsonlReader(file_paths="input/"))
pipeline.add_stage(AddId(id_field="doc_id", id_prefix="v1"))
pipeline.add_stage(JsonlWriter("output/"))

# Run pipeline
result = pipeline.run()

# Stop Ray client
ray_client.stop()
```

### Alternative: Reader-Based ID Generation

For deduplication workflows, the reader can generate unique IDs while it loads the data:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor
from nemo_curator.stages.text.io.reader import JsonlReader

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

pipeline = Pipeline(name="id_generator_example")

# Create ID generator
create_id_generator_actor()

# Reader generates IDs automatically
reader = JsonlReader(
    file_paths="data/",
    _generate_ids=True  # Adds '_curator_dedup_id' field
)
pipeline.add_stage(reader)

# Run pipeline
results = pipeline.run()

# Stop Ray client
ray_client.stop()

# Examine the first 5 rows of the first DocumentBatch
print(results[0].data.head())
```

This approach:
- Generates monotonically increasing integer IDs
- Is required for some deduplication workflows
- Persists ID state across pipeline runs
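
To make the properties listed above concrete, here is a minimal sketch of a counter that hands out consecutive integer ranges per batch and saves its state between runs. The class `CounterIdGenerator`, its methods, and the JSON state file are all hypothetical; they illustrate the idea and are not the NeMo Curator ID generator.

```python
import json
import os
import tempfile


class CounterIdGenerator:
    """Hypothetical stand-in for an ID generator; not the NeMo Curator class."""

    def __init__(self, next_id: int = 0) -> None:
        self.next_id = next_id

    def reserve(self, count: int) -> range:
        """Reserve `count` consecutive IDs and advance the counter."""
        start = self.next_id
        self.next_id += count
        return range(start, start + count)

    def save(self, path: str) -> None:
        """Write the counter to disk so a later run can resume from it."""
        with open(path, "w") as f:
            json.dump({"next_id": self.next_id}, f)

    @classmethod
    def load(cls, path: str) -> "CounterIdGenerator":
        """Restore a generator from a previously saved state file."""
        with open(path) as f:
            return cls(next_id=json.load(f)["next_id"])


state_path = os.path.join(tempfile.mkdtemp(), "id_state.json")

# First run: two batches of 3 and 2 documents.
gen = CounterIdGenerator()
first_run_ids = list(gen.reserve(3)) + list(gen.reserve(2))
gen.save(state_path)

# Later run: restore the state so new IDs continue where the first run stopped.
gen = CounterIdGenerator.load(state_path)
second_run_ids = list(gen.reserve(2))

print(first_run_ids)   # [0, 1, 2, 3, 4]
print(second_run_ids)  # [5, 6]
```

Reserving a whole range per batch keeps IDs unique without coordinating on every document, and restoring the saved counter prevents a later run from reusing IDs that an earlier run already assigned.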

---

## Error Handling

**Existing ID field:**
```python
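from nemo_curator.stages.text.modules import AddId

# With overwrite=False, the stage raises ValueError when the data it
# processes already contains a 'doc_id' field
AddId(id_field="doc_id", overwrite=False)

# With overwrite=True, the stage replaces the existing field and logs a warning
AddId(id_field="doc_id", overwrite=True)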
```
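
The overwrite behavior can be mimicked on a plain list of dictionaries. The sketch below is illustrative only: `add_id_field` is a hypothetical helper, the `doc_<index>` ID format is made up for the example, and the exact error message and warning mechanism used by `AddId` may differ.

```python
import warnings


def add_id_field(records, id_field, overwrite=False):
    """Illustrative only: apply the overwrite check to a list of dicts."""
    if any(id_field in record for record in records):
        if not overwrite:
            raise ValueError(
                f"Field '{id_field}' already exists; pass overwrite=True to replace it"
            )
        warnings.warn(f"Overwriting existing field '{id_field}'")
    # Hypothetical ID format, used only for this example.
    return [{**record, id_field: f"doc_{i}"} for i, record in enumerate(records)]


records = [{"text": "a", "doc_id": "old"}, {"text": "b", "doc_id": "old"}]

# Default behavior: refuse to touch an existing field.
try:
    add_id_field(records, "doc_id")
except ValueError as err:
    print(f"Rejected: {err}")

# Explicit opt-in: the old values are replaced (the warning is silenced here).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    replaced = add_id_field(records, "doc_id", overwrite=True)

print(replaced[0]["doc_id"])  # doc_0
```

Failing by default protects IDs that downstream systems may already reference; overwriting is an explicit choice.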

---

## Best Practices

- **Place early in the pipeline** - Add IDs after loading and before filtering, so every input document can be traced
- **Use descriptive field names** - For example, `doc_id`, `document_id`, or `unique_id`
- **Choose the appropriate method**:
  - Use `AddId` for general document tracking
  - Use the reader-based ID generator for deduplication workflows
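
The benefit of assigning IDs before filtering can be seen in a small plain-Python sketch. The `v1_<index>` ID format and the length-based filter are stand-ins chosen for illustration; they are not NeMo Curator stages.

```python
documents = [
    {"text": "short"},
    {"text": "a much longer document body"},
    {"text": "tiny"},
]

# Assign IDs right after loading (hypothetical "v1_<index>" format).
with_ids = [{**doc, "doc_id": f"v1_{i}"} for i, doc in enumerate(documents)]

# Stand-in filter: keep documents with at least 10 characters.
kept = [doc for doc in with_ids if len(doc["text"]) >= 10]

# Because IDs were assigned before filtering, dropped documents are identifiable.
kept_ids = {doc["doc_id"] for doc in kept}
dropped_ids = [doc["doc_id"] for doc in with_ids if doc["doc_id"] not in kept_ids]

print(sorted(kept_ids))  # ['v1_1']
print(dropped_ids)       # ['v1_0', 'v1_2']
```

If IDs were added after the filter instead, the dropped documents would never receive one, and there would be no stable key for auditing what each stage removed.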

---

For deduplication workflows, see [Deduplication](/curate-text/process-data/deduplication).