---
description: "Add unique identifiers to documents in your text dataset for tracking and deduplication workflows"
categories: ["text-curation"]
tags: ["preprocessing", "identifiers", "document-tracking", "pipeline"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "how-to"
modality: "text-only"
---

# Adding Document IDs

Add unique identifiers to each document in your text dataset.

## How It Works

Document IDs are useful for:

- **Pipeline tracking** - Monitor documents through processing stages
- **Dataset versioning** - Identify documents across different versions

---

## Usage

### Basic Usage

```python
from nemo_curator.stages.text.modules import AddId

# Initialize pipeline, read stage, etc.

# Add to your pipeline
pipeline.add_stage(AddId(id_field="doc_id"))
```

### Configuration Options

```python
# Customize ID generation
pipeline.add_stage(AddId(
    id_field="document_id",  # Field name for IDs
    id_prefix="corpus_v2",   # Optional prefix
    overwrite=True           # Overwrite existing IDs
))
```

#### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `id_field` | `str` | Required | Field name where IDs will be stored |
| `id_prefix` | `str` | `None` | Optional prefix for IDs |
| `overwrite` | `bool` | `False` | Whether to overwrite existing ID fields |

#### ID Format

Generated IDs follow this pattern:

- Without prefix: `{task_uuid}_{index}`
- With prefix: `{prefix}_{task_uuid}_{index}`

### Complete Example

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.modules import AddId
from nemo_curator.stages.text.io.writer import JsonlWriter

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

# Create pipeline
pipeline = Pipeline(name="add_ids")

# Add stages
pipeline.add_stage(JsonlReader(file_paths="input/"))
pipeline.add_stage(AddId(id_field="doc_id", id_prefix="v1"))
pipeline.add_stage(JsonlWriter("output/"))

# Run pipeline
result = pipeline.run()

# Stop Ray client
ray_client.stop()
```

### Alternative: Reader-Based ID Generation

For deduplication workflows, unique IDs are generated during data loading:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor
from nemo_curator.stages.text.io.reader import JsonlReader

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

pipeline = Pipeline(name="id_generator_example")

# Create ID generator
create_id_generator_actor()

# Reader generates IDs automatically
reader = JsonlReader(
    file_paths="data/",
    _generate_ids=True  # Adds '_curator_dedup_id' field
)
pipeline.add_stage(reader)

# Run pipeline
results = pipeline.run()

# Stop Ray client
ray_client.stop()

# Examine the first 5 rows of the first DocumentBatch
print(results[0].data.head())
```

This approach:

- Generates monotonically increasing integer IDs
- Required for some deduplication workflows
- Persists ID state across pipeline runs

---

## Error Handling

**Existing ID field:**

```python
# This raises ValueError if 'doc_id' already exists
AddId(id_field="doc_id", overwrite=False)

# This overwrites existing field with warning
AddId(id_field="doc_id", overwrite=True)
```

---

## Best Practices

- **Place early in pipeline** - Add IDs after loading, before filtering
- **Use descriptive field names** - `doc_id`, `document_id`, `unique_id`
- **Choose appropriate method**:
  - Use `AddId` for general document tracking
  - Use ID generator for deduplication workflows

---

For deduplication workflows, see [Deduplication](/curate-text/process-data/deduplication).
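
To make the ID format and the `overwrite` behavior described in this guide more concrete, here is a minimal plain-Python sketch that mimics them on a list of dictionaries. It does not use NeMo Curator and is not the library's implementation: `make_ids` and `add_id_field` are hypothetical helpers, and details such as the index starting at 0 and the check applying to any record that already has the field are assumptions made for illustration.

```python
import uuid
from typing import Dict, List, Optional


def make_ids(task_uuid: str, count: int, prefix: Optional[str] = None) -> List[str]:
    """Build IDs in the documented pattern.

    Without prefix: {task_uuid}_{index}
    With prefix:    {prefix}_{task_uuid}_{index}
    Assumption: the index starts at 0 and increases by 1 per document.
    """
    base = f"{prefix}_{task_uuid}" if prefix else task_uuid
    return [f"{base}_{index}" for index in range(count)]


def add_id_field(
    records: List[Dict],
    id_field: str,
    task_uuid: str,
    prefix: Optional[str] = None,
    overwrite: bool = False,
) -> List[Dict]:
    """Hypothetical stand-in for an ID-adding stage, operating on plain dicts."""
    # Mirror the documented behavior: refuse to clobber an existing field
    # unless overwrite=True was requested.
    if not overwrite and any(id_field in record for record in records):
        raise ValueError(f"Field '{id_field}' already exists; pass overwrite=True")
    ids = make_ids(task_uuid, len(records), prefix)
    # Return new dicts so the input records are left untouched.
    return [{**record, id_field: doc_id} for record, doc_id in zip(records, ids)]


docs = [{"text": "first document"}, {"text": "second document"}]
task_uuid = str(uuid.uuid4())  # stands in for the per-task UUID

tagged = add_id_field(docs, "doc_id", task_uuid, prefix="v1")
for row in tagged:
    print(row["doc_id"])
```

Running the helper a second time on `tagged` with the same `id_field` raises `ValueError` unless `overwrite=True` is passed, which matches the behavior listed under Error Handling.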