---
description: "Concepts for integrating audio processing with text curation workflows in multimodal applications"
categories: ["concepts-architecture"]
tags: ["text-integration", "multimodal", "workflow-integration", "format-conversion", "cross-modal"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "concept"
modality: "audio-text"
---
# Audio-Text Integration Concepts
This guide covers how audio processing integrates with text curation workflows in NeMo Curator, supporting multimodal data preparation and cross-modal quality assessment.
## Integration Architecture
Audio-text integration in NeMo Curator operates on several levels:
### Data Structure Integration
**Format Conversion**: `AudioTask` to `DocumentBatch`
- Fields are carried over without remapping (`text` and `pred_text` keep their names and values)
- Audio metadata is carried along as extra fields for downstream processing
**Metadata Preservation**: Audio characteristics remain intact during conversion
- File paths support traceability and debugging
- Quality metrics (WER, duration) support filtering operations
- Other audio-specific metadata stays accessible to downstream processing stages
### Pipeline Integration
**Sequential Processing**: Audio to Text to Multi-Modal
```mermaid
flowchart LR
A[Audio Files] --> B[InferenceAsrNemoStage]
B --> C[AudioToDocumentStage]
C --> D[ScoreFilter<br/>Text Processing]
D --> E[Integrated Output]
style A fill:#e1f5fe
style C fill:#ffcc02
style E fill:#fff3e0
```
The `AudioToDocumentStage` provides the conversion bridge between audio and text processing workflows.
**Parallel Processing**: Simultaneous audio and text analysis
```mermaid
flowchart LR
A[Audio Files] --> B[InferenceAsrNemoStage]
C[Text Data] --> D[ScoreFilter<br/>Text Processing]
B --> E[Cross-Modal<br/>Quality Assessment]
D --> E
E --> F[Filtered Output]
style A fill:#e1f5fe
style C fill:#e8f5e8
style F fill:#fff3e0
```
## Cross-Modal Quality Assessment
### Audio-Informed Text Quality
Use audio characteristics to enhance text quality assessment:
**Speech Rate Analysis**: Detect unnaturally fast or slow speech patterns using the `get_wordrate()` function
**Duration-Text Consistency**: Ensure transcription length matches audio duration
- Short audio with long text: Potential transcription errors
- Long audio with short text: Potential missing content
- Rule of thumb: ~3-5 characters per second of audio; actual rates vary with language and speaking style, so calibrate thresholds on your own data
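The speech rate and duration-text checks above can be sketched in plain Python. This is a minimal illustration, not NeMo Curator's implementation: the helper names (`words_per_second`, `chars_per_second`, `check_duration_text_consistency`) and the threshold values are hypothetical and chosen only for the example.

```python
def words_per_second(text: str, duration: float) -> float:
    """Whitespace-delimited words divided by audio duration in seconds."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return len(text.split()) / duration


def chars_per_second(text: str, duration: float) -> float:
    """Characters (including spaces) divided by audio duration in seconds."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return len(text) / duration


def check_duration_text_consistency(
    text: str,
    duration: float,
    min_chars_per_sec: float,
    max_chars_per_sec: float,
) -> str:
    """Classify a sample as 'ok', 'too_much_text', or 'too_little_text'."""
    rate = chars_per_second(text, duration)
    if rate > max_chars_per_sec:
        return "too_much_text"  # short audio with long text
    if rate < min_chars_per_sec:
        return "too_little_text"  # long audio with short text
    return "ok"


# Illustrative thresholds only (assumed values, not library defaults)
sample = {"text": "hello world this is a test", "duration": 2.0}
status = check_duration_text_consistency(
    sample["text"], sample["duration"], min_chars_per_sec=3.0, max_chars_per_sec=20.0
)
print(status, words_per_second(sample["text"], sample["duration"]))
```

Keeping the thresholds as explicit parameters makes it easier to tune them per language or domain instead of hard-coding a single ratio.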
### Text-Informed Audio Quality
Use text characteristics to assess audio quality:
**Transcription Completeness**: Detect incomplete or truncated speech
- Sentence fragments without proper endings
- Unusual punctuation patterns
- Incomplete words or phrases
**Content Coherence**: Assess semantic consistency
- Logical flow and coherence in transcriptions
- Domain-appropriate vocabulary usage
- Language consistency throughout sample
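The completeness signals listed above can be approximated with simple text heuristics. The sketch below is an assumption-laden example rather than a NeMo Curator feature: `completeness_flags`, the flag names, and the punctuation patterns are all hypothetical.

```python
import re

# Characters that plausibly end a complete sentence (assumed set)
TERMINAL_PUNCTUATION = (".", "!", "?", '"', "'")


def completeness_flags(text: str) -> list[str]:
    """Return heuristic flags suggesting a truncated or malformed transcription."""
    stripped = text.strip()
    if not stripped:
        return ["empty"]
    flags = []
    # Sentence fragment: no sentence-ending punctuation at the end
    if not stripped.endswith(TERMINAL_PUNCTUATION):
        flags.append("no_terminal_punctuation")
    # Unusual punctuation: doubled separators or runs of four or more periods
    if re.search(r"[!?,;:]{2,}|\.{4,}", stripped):
        flags.append("unusual_punctuation")
    # Incomplete word: text cut off at a hyphen
    if stripped.endswith("-"):
        flags.append("incomplete_word")
    return flags


print(completeness_flags("and then we went to the"))
```

Many ASR models emit unpunctuated text, in which case the terminal-punctuation check is only meaningful for ground-truth transcripts or after punctuation restoration.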
## Workflow Patterns
### Audio-First Workflows
Start with audio processing, then apply text curation:
```mermaid
flowchart TD
A[Audio Files] --> B[InferenceAsrNemoStage<br/>ASR Transcription]
B --> C[GetPairwiseWerStage<br/>Calculate WER Metrics]
C --> D[ScoreFilter<br/>WER-based Filtering]
D --> E[AudioToDocumentStage<br/>Convert to DocumentBatch]
E --> F[ScoreFilter<br/>Text Quality Assessment]
F --> G[Filter<br/>Metadata-based Filtering]
G --> H[Text Enhancement Stages]
H --> I[Processed Dataset]
style A fill:#e1f5fe
style E fill:#fff3e0
style I fill:#e8f5e8
classDef audioStage fill:#bbdefb
classDef conversionStage fill:#ffcc02
classDef textStage fill:#c8e6c9
class B,C,D audioStage
class E conversionStage
class F,G,H textStage
```
**Use Cases**:
- Speech dataset curation for ASR training
- Podcast transcription and processing
- Lecture and educational content preparation
### Text-First Workflows
Start with text processing, then verify with audio:
```mermaid
flowchart TD
A[Text Corpus] --> B[ScoreFilter<br/>Text Quality Assessment]
B --> C[Filter<br/>Initial Text Filtering]
C --> D[Audio File Matching<br/>Custom Stage]
D --> E[InferenceAsrNemoStage<br/>ASR Validation]
E --> F[GetPairwiseWerStage<br/>Cross-Modal Metrics]
F --> G[ScoreFilter<br/>Consistency Filtering]
G --> H[Validated Dataset]
style A fill:#e8f5e8
style D fill:#fff3e0
style H fill:#e1f5fe
classDef textStage fill:#c8e6c9
classDef matchingStage fill:#ffcc02
classDef audioStage fill:#bbdefb
class B,C textStage
class D matchingStage
class E,F,G audioStage
```
**Use Cases**:
- Validating existing transcriptions with audio
- Creating audio-text pairs from separate sources
- Quality control for crowd-sourced transcriptions
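The "Audio File Matching" step in the text-first workflow is a custom stage, so its logic depends on how your data is organized. As one possible sketch, assuming each text record has an `id` field that equals the audio file name without its extension (a hypothetical convention), the matching could look like this:

```python
from pathlib import Path


def match_text_to_audio(text_records, audio_paths):
    """Pair each text record with the audio file whose stem equals the record ID."""
    audio_by_id = {Path(p).stem: p for p in audio_paths}
    matched, unmatched = [], []
    for record in text_records:
        audio_path = audio_by_id.get(record["id"])
        if audio_path is None:
            unmatched.append(record)  # no audio available for this text
        else:
            matched.append({**record, "audio_filepath": audio_path})
    return matched, unmatched


texts = [
    {"id": "utt_001", "text": "hello world"},
    {"id": "utt_002", "text": "good morning"},
]
audio = ["/data/audio/utt_001.wav", "/data/audio/utt_003.wav"]
matched, unmatched = match_text_to_audio(texts, audio)
print(len(matched), len(unmatched))
```

Returning the unmatched records separately makes it possible to log or review them instead of dropping them silently.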
## Data Flow Concepts
### Conversion Mechanisms
**AudioTask to DocumentBatch**:
NeMo Curator provides the `AudioToDocumentStage` for converting audio processing results to text processing format:
```python
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage

# Create the conversion stage
converter = AudioToDocumentStage()

# Example of the fields an input AudioTask carries
audio_data = {
    "audio_filepath": "/audio.wav",
    "text": "ground truth",
    "pred_text": "asr prediction",
    "wer": 15.2,
    "duration": 3.4,
}

# `audio_task` is assumed to be an AudioTask holding `audio_data`,
# typically produced by upstream audio stages (construction not shown).
# The stage returns a list containing one DocumentBatch
# with the same fields preserved as a pandas DataFrame
document_batches = converter.process(audio_task)
document_batch = document_batches[0]  # Extract the single DocumentBatch

# All fields are preserved in the DocumentBatch:
# - audio_filepath, text, pred_text, wer, duration
```
**Note**: A built-in `DocumentBatch` to `AudioTask` conversion stage is not provided. Create a custom stage if you need reverse conversion.
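The core of such a reverse conversion can be sketched without any NeMo Curator classes. The example below assumes the `DocumentBatch` contents have already been exported as a list of row dictionaries; `rows_to_audio_entries` and `REQUIRED_AUDIO_FIELDS` are hypothetical names, and wrapping the result in a real stage and task type is left out.

```python
# Fields a row must carry to be usable as an audio entry (assumed requirement)
REQUIRED_AUDIO_FIELDS = ("audio_filepath",)


def rows_to_audio_entries(rows):
    """Turn tabular rows back into per-sample audio entries.

    Rows missing a required audio field are skipped, because they
    cannot be traced back to an audio file.
    """
    entries = []
    for row in rows:
        if all(row.get(field) for field in REQUIRED_AUDIO_FIELDS):
            entries.append(dict(row))  # copy so the source rows stay untouched
    return entries


rows = [
    {"audio_filepath": "/audio.wav", "text": "ground truth", "pred_text": "asr prediction", "wer": 15.2},
    {"text": "text-only row without audio"},
]
entries = rows_to_audio_entries(rows)
print(len(entries))
```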
For practical usage examples and step-by-step implementation, refer to [Text integration](/curate-audio/process-data/text-integration).
### Metadata Flow
**Additive Processing**: Processing stages typically add metadata without removing existing fields
```python
# Stage 1: Initial loading
stage1_output = {"audio_filepath": "/audio.wav", "text": "transcription"}
# Stage 2: ASR inference
stage2_output = {**stage1_output, "pred_text": "asr result"}
# Stage 3: Quality assessment
stage3_output = {**stage2_output, "wer": 15.2, "duration": 3.4}
# Stage 4: Text processing (after conversion)
stage4_output = {**stage3_output, "word_count": 6, "language": "en"}
```
## Quality Assessment Integration
### Available Quality Metrics
NeMo Curator provides these audio quality assessment capabilities:
**Word Error Rate (WER) Analysis**:
- Lower WER indicates higher transcription accuracy
- Available through `GetPairwiseWerStage`
- Measures word-level substitutions, deletions, and insertions between ground truth and ASR predictions, as a percentage of ground-truth words
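To make the metric concrete, here is a self-contained sketch of a word-level edit-distance WER. It illustrates the standard definition only; `word_error_rate` is a hypothetical helper, and `GetPairwiseWerStage` may normalize text or compute the distance differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: (substitutions + deletions + insertions) / reference words * 100."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return 100.0 * previous[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

The example pair differs by one substitution over six reference words. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.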
**Duration and Speech Rate Analysis**:
- Duration validation using `GetAudioDurationStage`
- Speech rate calculation using `get_wordrate()` function
- Character rate calculation using `get_charrate()` function
**Individual Quality Dimensions**:
- **Technical Quality**: File integrity, format compliance, duration validation
- **Content Quality**: Transcription accuracy via WER/CER metrics
- **Speech Rate Quality**: Words/characters per second analysis
## Performance and Scaling
### Memory Considerations
**AudioTask Memory Usage**:
- Metadata storage scales linearly with batch size
- Audio files loaded on-demand, not cached in memory
- Large batches increase processing efficiency but consume more RAM
**Conversion Overhead**:
- AudioTask → DocumentBatch conversion is lightweight
- Metadata copying has minimal performance impact
- Batch size affects conversion performance
### Processing Efficiency
**Sequential vs. Parallel Integration**:
**Sequential Processing**: Audio to Text (lower memory, slower)
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.filters import ScoreFilter
from nemo_curator.stages.text.filters.heuristic import WordCountFilter  # Example filter

# Define a text quality filter
text_quality_filter = WordCountFilter(min_words=10)

# `executor` is assumed to be an already-configured pipeline executor (setup not shown)

# Process audio completely first
audio_pipeline = Pipeline(
    name="audio_processing",
    stages=[
        InferenceAsrNemoStage(model_name="stt_en_fastconformer_transducer_large"),
        AudioToDocumentStage(),
    ],
)
audio_results = audio_pipeline.run(executor)

# Then process text, feeding in the converted audio results
text_pipeline = Pipeline(
    name="text_processing",
    stages=[
        ScoreFilter(filter_obj=text_quality_filter),
    ],
)
final_results = text_pipeline.run(executor, initial_tasks=audio_results)
```
**Parallel Processing**: Audio and Text (higher memory, faster)
```python
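# Run a single pipeline that includes both audio and text stages
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage  # import path assumed
from nemo_curator.stages.text.filters import ScoreFilter
from nemo_curator.stages.text.filters.heuristic import WordCountFilter

text_quality_filter = WordCountFilter(min_words=10)
pipeline = Pipeline(name="audio_text", stages=[
    InferenceAsrNemoStage(model_name="stt_en_fastconformer_transducer_large"),
    GetPairwiseWerStage(),
    AudioToDocumentStage(),
    ScoreFilter(filter_obj=text_quality_filter),
])
# No executor is passed here; supply one as in the sequential example to choose the backend
results = pipeline.run()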
```
### Scaling Strategies
**Horizontal Scaling**: Distribute across several workers
- Partition audio files across workers
- Independent processing with final aggregation
- Load balancing based on audio duration
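Duration-based load balancing can be sketched with a greedy heuristic: sort files from longest to shortest and always hand the next file to the worker with the least audio assigned so far. This is an illustrative approach, not how NeMo Curator's executors schedule work; `partition_by_duration` is a hypothetical helper.

```python
import heapq


def partition_by_duration(files, num_workers):
    """Assign (path, duration) pairs to workers, longest file first,
    always choosing the worker with the smallest total duration."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Heap entries: (total_seconds_assigned, worker_index)
    heap = [(0.0, i) for i in range(num_workers)]
    heapq.heapify(heap)
    assignments = [[] for _ in range(num_workers)]
    for path, duration in sorted(files, key=lambda f: f[1], reverse=True):
        load, worker = heapq.heappop(heap)
        assignments[worker].append(path)
        heapq.heappush(heap, (load + duration, worker))
    return assignments


files = [("a.wav", 30.0), ("b.wav", 10.0), ("c.wav", 20.0), ("d.wav", 5.0), ("e.wav", 25.0)]
parts = partition_by_duration(files, num_workers=2)
print(parts)
```

In this example both workers end up with 45 seconds of audio, whereas splitting by file count alone could leave one worker with most of the long files.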
**Vertical Scaling**: Optimize single-machine performance
- GPU acceleration for ASR inference
- Batch size optimization for hardware
- Memory management for large datasets
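One way to approach batch size optimization is to cap the total audio seconds per batch rather than the number of files, since memory use during inference tends to follow audio length. The sketch below assumes that rationale; `batch_by_duration` and the 15-second budget are hypothetical and only serve the example.

```python
def batch_by_duration(files, max_batch_seconds):
    """Group (path, duration) pairs into batches whose total duration stays within a budget.

    A single file longer than the budget gets a batch of its own.
    """
    batches, current, current_seconds = [], [], 0.0
    # Sorting by duration keeps clips of similar length together
    for path, duration in sorted(files, key=lambda f: f[1]):
        if current and current_seconds + duration > max_batch_seconds:
            batches.append(current)
            current, current_seconds = [], 0.0
        current.append(path)
        current_seconds += duration
    if current:
        batches.append(current)
    return batches


clips = [("a.wav", 4.0), ("b.wav", 12.0), ("c.wav", 3.0), ("d.wav", 6.0), ("e.wav", 40.0)]
batches = batch_by_duration(clips, max_batch_seconds=15.0)
print(batches)
```

The budget would normally be tuned against available GPU memory and the ASR model in use.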
## Design Principles
### Modularity
**Separation of Concerns**: Audio and text processing remain independent
- Audio stages focus on speech-specific operations
- Text stages handle language processing
- Integration stages manage cross-modal operations
**Modular Architecture**: Mix and match audio and text processing stages
- Flexible pipeline construction
- Reusable stage components
- Configurable integration points
### Extensibility
**Custom Integration Patterns**: Support for domain-specific workflows
- Custom conversion logic
- Specialized quality metrics
- Domain-specific filtering rules
**Plugin Architecture**: Easy addition of new integration methods
- Custom stage implementations
- External tool integration
- Specialized format support