---
description: "Understanding the AudioTask data structure for audio file management and validation in NeMo Curator"
categories: ["concepts-architecture"]
tags: ["data-structures", "audiotask", "audio-validation", "task-processing", "file-management"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "concept"
modality: "audio-only"
---

# AudioTask Data Structure

This guide covers the `AudioTask` data structure, which serves as the core container for audio data throughout NeMo Curator's audio processing pipeline.

## Overview

`AudioTask` is a specialized data structure that extends NeMo Curator's base `Task` class to handle audio-specific processing requirements. Each `AudioTask` holds a single manifest entry, matching the convention used by `VideoTask` and `FileGroupTask`:

- **Single-Entry Model**: One manifest entry per task (`Task[dict]`), enabling straightforward per-sample processing
- **File Path Management**: Automatically validates audio file existence and accessibility
- **Metadata Handling**: Preserves audio characteristics and processing results throughout pipeline stages

## Structure and Components

### Basic Structure

```python
from nemo_curator.tasks import AudioTask

# Create AudioTask with a single audio file
audio_task = AudioTask(
    data={
        "audio_filepath": "/path/to/audio.wav",
        "text": "ground truth transcription",
        "duration": 3.2,
        "language": "en"
    },
    filepath_key="audio_filepath",
    task_id="audio_task_001",
    dataset_name="my_speech_dataset"
)
```

### Key Attributes

| Attribute | Type | Description |
|-----------|------|-------------|
| `data` | `dict` | Audio manifest entry (single dict, exposed as `_AttrDict` for attribute-style access) |
| `filepath_key` | `str \| None` | Key name for audio file paths in data (optional) |
| `task_id` | `str` | Unique identifier for the task |
| `dataset_name` | `str` | Name of the source dataset |
| `num_items` | `int` | Always returns `1` (read-only property) |

### Attribute-Style Access

`AudioTask.data` is an `_AttrDict` subclass, so you can access fields as attributes:

```python
audio_task = AudioTask(data={"audio_filepath": "/path/to/audio.wav", "duration": 3.2})

# Both access styles work
audio_task.data["audio_filepath"]  # dict-style
audio_task.data.audio_filepath     # attribute-style
```

## Data Validation

### Automatic Validation

`AudioTask` provides built-in validation for audio data integrity. The `_AttrDict` data type enables `hasattr`-based validation, matching the pattern used by all other modalities.

## Metadata Management

### Standard Metadata Fields

Common fields stored in AudioTask data:

```python
audio_sample = {
    # Core fields (user-provided)
    "audio_filepath": "/path/to/audio.wav",
    "text": "transcription text",

    # Fields added by processing stages
    "pred_text": "asr prediction",  # Added by ASR inference stages
    "wer": 12.5,                    # Added by GetPairwiseWerStage
    "duration": 3.2,                # Added by GetAudioDurationStage

    # Optional user-provided metadata
    "language": "en_us",
    "speaker_id": "speaker_001",

    # Custom fields (examples)
    "domain": "conversational",
    "noise_level": "low"
}
```

Character error rate (CER) is available as a utility function and typically requires a custom stage to compute and store it.

## Error Handling

### Graceful Failure Modes

AudioTask handles various error conditions:

```python
# Missing files
audio_task = AudioTask(data={
    "audio_filepath": "/missing/file.wav",
    "text": "sample"
})
# Validation fails, but processing continues with warnings

# Corrupted audio files
corrupted_sample = {
    "audio_filepath": "/corrupted/audio.wav",
    "text": "sample text"
}
# Duration calculation returns -1.0 for corrupted files

# Invalid metadata
invalid_sample = {
    "audio_filepath": "/valid/audio.wav",
    # Missing "text" field - needed for WER calculation but not enforced by AudioTask
}
# AudioTask does not enforce metadata field requirements. Add a validation stage if required.
```

## Performance Characteristics

### Memory Usage

AudioTask memory footprint is minimal since each task holds a single manifest entry. Memory scales with the number of metadata fields per entry and the total number of tasks processed in the pipeline.

### Processing Patterns

Audio stages follow two processing patterns:

| Pattern | Stages | Method |
|---------|--------|--------|
| **Per-task** | CPU stages (`GetAudioDurationStage`, `GetPairwiseWerStage`) | `process(task) → AudioTask`; mutates `task.data` in-place |
| **Batched** | GPU stages (`InferenceAsrNemoStage`), IO stages (`AudioToDocumentStage`), filtering (`PreserveByValueStage`) | `process_batch(tasks) → list[AudioTask]` |

## Integration with Processing Stages

### Stage Input/Output

AudioTask serves as input and output for audio processing stages. All audio stages subclass `ProcessingStage[AudioTask, AudioTask]` directly:

```python
# CPU stage: mutates task in-place and returns it
def process(self, task: AudioTask) -> AudioTask:
    duration = get_duration(task.data["audio_filepath"])
    task.data["duration"] = duration
    return task
```

### Chaining Stages

AudioTask flows through multiple processing stages, with each stage adding new metadata fields:

```mermaid
flowchart TD
    A["AudioTask (raw)<br/>• audio_filepath<br/>• text"] --> B[ASR Inference Stage]
    B --> C["AudioTask (with predictions)<br/>• audio_filepath<br/>• text<br/>• pred_text"]
    C --> D[Quality Assessment Stage]
    D --> E["AudioTask (with metrics)<br/>• audio_filepath<br/>• text<br/>• pred_text<br/>• wer<br/>• duration"]
    E --> F[Filter Stage]
    F --> G["AudioTask (filtered)<br/>• audio_filepath<br/>• text<br/>• pred_text<br/>• wer<br/>• duration"]
    G --> H[Export Stage]
    H --> I[Output Files]

    style A fill:#e1f5fe
    style C fill:#f3e5f5
    style E fill:#e8f5e8
    style G fill:#fff3e0
    style I fill:#fce4ec
```
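
The flow in the diagram can be sketched with plain dictionaries standing in for `AudioTask.data`. This is a minimal illustration of stages adding fields one after another, not the NeMo Curator API: the stage functions, `CANNED_PREDICTIONS`, `word_error_rate`, `run_pipeline`, and the file paths are all hypothetical names introduced here, and the ASR step is replaced by a lookup table so the example runs without models or audio files. The `duration` field is left out because computing it requires reading audio.

```python
# Hypothetical stand-ins for real stages; plain dicts play the role of AudioTask.data.

# Canned "ASR output" keyed by file path (assumption: replaces real model inference).
CANNED_PREDICTIONS = {
    "/path/to/a.wav": "the cat sat on the mat",
    "/path/to/b.wav": "the cat sat on a mat",
}


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance over reference length, as a percentage."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 100.0 * prev[-1] / max(len(ref), 1)


def asr_stage(entry: dict) -> dict:
    """Adds pred_text (mutates the entry in place and returns it)."""
    entry["pred_text"] = CANNED_PREDICTIONS[entry["audio_filepath"]]
    return entry


def quality_stage(entry: dict) -> dict:
    """Adds wer, computed from text and pred_text."""
    entry["wer"] = word_error_rate(entry["text"], entry["pred_text"])
    return entry


def filter_stage(entries: list[dict], max_wer: float) -> list[dict]:
    """Keeps only entries whose wer is at or below the threshold."""
    return [entry for entry in entries if entry["wer"] <= max_wer]


def run_pipeline(manifest: list[dict], max_wer: float = 10.0) -> list[dict]:
    """Chains the stages; each entry is copied so the input manifest is untouched."""
    processed = [quality_stage(asr_stage(dict(entry))) for entry in manifest]
    return filter_stage(processed, max_wer)


manifest = [
    {"audio_filepath": "/path/to/a.wav", "text": "the cat sat on the mat"},
    {"audio_filepath": "/path/to/b.wav", "text": "the cat sat on the mat"},
]
kept = run_pipeline(manifest)
print([entry["audio_filepath"] for entry in kept])  # ['/path/to/a.wav']
```

The second entry is dropped because one substituted word out of six gives a WER of about 16.7, which exceeds the threshold of 10.0. Each surviving entry carries every field added along the way, which mirrors how metadata accumulates across stages in the diagram.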