---
description: "Process audio data using ASR inference, quality assessment, audio analysis, and text integration for high-quality speech datasets"
categories: ["workflows"]
tags: ["audio-processing", "asr-inference", "quality-assessment", "audio-analysis", "text-integration", "gpu-accelerated"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "workflow"
modality: "audio-only"
---
# Process Data for Audio Curation
Process audio data that you have loaded into `AudioTask` objects with NeMo Curator's audio processing stages.
NeMo Curator provides a specialized suite of tools for processing speech and audio data as part of the AI training pipeline. These tools help you transcribe, analyze, filter, and integrate audio datasets to ensure high-quality input for ASR model training and multimodal applications.
## How it Works
NeMo Curator's audio processing capabilities are organized into six main categories:
1. **ASR Inference**: Transcribe audio using NeMo Framework's pretrained ASR models
2. **Quality Assessment**: Calculate transcription accuracy metrics and filter on them
3. **Quality Filtering**: Segment, filter, and diarize raw audio into clean single-speaker training segments
4. **Audio Analysis**: Extract audio characteristics such as duration and validate file formats
5. **ALM Data Curation**: Extract fixed-duration training windows from diarized audio segments for audio language models
6. **Text Integration**: Hand processed audio data off to text processing workflows
Each category provides GPU-accelerated implementations optimized for different speech curation needs. The result is a cleaned and filtered audio dataset with high-quality transcriptions ready for model training.
---
## ASR Inference
Transcribe audio files using NeMo Framework's state-of-the-art ASR models with GPU acceleration.
- Use pretrained NeMo ASR models for accurate speech recognition. Tags: `pretrained`, `multilingual`, `gpu-accelerated`
- Efficiently process large audio datasets with configurable batch sizes. Tags: `batch-inference`, `memory-optimization`, `scalable`
## Quality Assessment
Evaluate and filter audio quality using transcription accuracy and audio characteristics.
- Filter audio samples based on Word Error Rate (WER) thresholds. Tags: `accuracy`, `quality-metrics`, `filtering`
- Filter audio samples by duration ranges and speech rate metrics. Tags: `duration`, `speech-rate`, `range-filtering`
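
As a rough illustration of the two filters above, the sketch below computes WER as the word-level edit distance divided by the reference length, then keeps only samples under a threshold. It is plain Python, not NeMo Curator's implementation; the field names (`text`, `pred_text`, `duration`), the helper names, and the 50% threshold are assumptions chosen for the example.

```python
from typing import Dict, List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def speech_rate(text: str, duration_s: float) -> float:
    """Return words per second for a transcript of known duration."""
    return len(text.split()) / duration_s


# Hypothetical samples; the keys are assumptions for this sketch.
samples: List[Dict] = [
    {"text": "the cat sat on the mat", "pred_text": "the cat sat on the mat", "duration": 2.0},
    {"text": "the cat sat on the mat", "pred_text": "a cat sat in a hat", "duration": 2.5},
]

MAX_WER = 50.0  # illustrative threshold, in percent
kept = [s for s in samples if word_error_rate(s["text"], s["pred_text"]) <= MAX_WER]
```

The second sample has four substitutions against six reference words, so it exceeds the threshold and is dropped. A speech rate filter would work the same way, comparing `speech_rate(...)` against a lower and upper bound.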
## Quality Filtering
Compose VAD, band, UTMOS, SIGMOS, and speaker-separation stages to extract clean single-speaker training segments from raw audio.
- End-to-end pipeline of preprocessing, segmentation, and filtering stages. Tags: `vad`, `mos-scoring`, `diarization`
- Single composite stage that decomposes into the full filtering pipeline from a YAML config. Tags: `composite`, `yaml-config`, `end-to-end`
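
The idea of a composite stage that expands into individual filters can be sketched in plain Python. This is only a conceptual illustration, not NeMo Curator's API: the config is shown as a dict standing in for a parsed YAML file, and every name here (`min_score_stage`, `decompose`, `run`, the score keys, and the thresholds) is hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]
Stage = Callable[[List[Record]], List[Record]]


def min_score_stage(key: str, minimum: float) -> Stage:
    """Build a stage that keeps records whose `key` score is at least `minimum`."""
    def stage(records: List[Record]) -> List[Record]:
        return [r for r in records if r[key] >= minimum]
    return stage


def decompose(config: Dict[str, dict]) -> List[Stage]:
    """Expand a config mapping into an ordered list of enabled filter stages."""
    return [
        min_score_stage(key, params["min"])
        for key, params in config.items()
        if params.get("enabled", True)
    ]


def run(stages: List[Stage], records: List[Record]) -> List[Record]:
    """Apply each stage in order, feeding its output to the next one."""
    for stage in stages:
        records = stage(records)
    return records


# Stand-in for a parsed YAML config; thresholds are illustrative only.
CONFIG = {
    "utmos": {"min": 3.5},
    "sigmos": {"min": 3.0},
    "speech_ratio": {"min": 0.8, "enabled": False},
}

segments = [
    {"utmos": 4.1, "sigmos": 3.4, "speech_ratio": 0.9},
    {"utmos": 3.0, "sigmos": 3.8, "speech_ratio": 0.95},
    {"utmos": 3.9, "sigmos": 2.5, "speech_ratio": 0.7},
    {"utmos": 3.6, "sigmos": 3.1, "speech_ratio": 0.5},
]

clean = run(decompose(CONFIG), segments)
```

Keeping each filter as its own stage means a single config entry can enable, disable, or retune one step without touching the others.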
## Audio Analysis
Extract and analyze audio file characteristics for quality control and metadata generation.
- Calculate precise audio duration using the `soundfile` library. Tags: `soundfile`, `precision`, `metadata`
- Validate audio file formats and detect corrupted files. Tags: `validation`, `error-handling`, `format-support`
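
Duration extraction and corruption detection usually go together: the duration is the frame count divided by the sample rate, and a file that cannot be opened is flagged instead of crashing the pipeline. The sketch below uses Python's standard-library `wave` module as a stand-in so that it runs without extra dependencies; with the `soundfile` library, `soundfile.info(path).duration` reports the same figure. The helper names and the choice to return `None` for unreadable files are assumptions of this example.

```python
import tempfile
import wave
from pathlib import Path
from typing import Optional


def wav_duration_seconds(path: Path) -> Optional[float]:
    """Return the duration in seconds, or None if the file is not readable WAV."""
    try:
        with wave.open(str(path), "rb") as wav:
            return wav.getnframes() / wav.getframerate()
    except (wave.Error, EOFError, OSError):
        # Corrupted, truncated, or missing file: flag it rather than raise.
        return None


def write_silence(path: Path, seconds: float, sample_rate: int = 16000) -> None:
    """Write a mono 16-bit WAV file containing silence (test fixture)."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.wav"
    bad = Path(tmp) / "bad.wav"
    write_silence(good, seconds=1.5)
    bad.write_bytes(b"this is not audio data")

    good_duration = wav_duration_seconds(good)  # 24000 frames / 16000 Hz
    bad_duration = wav_duration_seconds(bad)    # unreadable, so None
```

A downstream filter can then drop every record whose duration is `None` before any GPU work is scheduled for it.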
## ALM Data Curation
Curate training data for audio language models by extracting fixed-duration windows from diarized audio segments.
- Construct candidate training windows from consecutive segments with quality filtering. Tags: `windowing`, `speaker-count`, `bandwidth`
- Remove redundant overlapping windows based on configurable thresholds. Tags: `deduplication`, `overlap-ratio`, `target-duration`
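
The two steps above can be sketched as follows: grow a window over consecutive diarized segments until it reaches a target duration, then greedily drop windows that overlap an already kept window too much. This is a minimal sketch under stated assumptions, not NeMo Curator's algorithm. The `Segment` and `Window` classes, the parameter names, the overlap definition (intersection divided by the shorter window), and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Segment:
    start: float
    end: float
    speaker: str


@dataclass(frozen=True)
class Window:
    start: float
    end: float
    speakers: FrozenSet[str]

    @property
    def duration(self) -> float:
        return self.end - self.start


def build_windows(segments: List[Segment], target: float, tolerance: float,
                  max_speakers: int) -> List[Window]:
    """For each start segment, emit the first span within target +/- tolerance."""
    windows = []
    for i, first in enumerate(segments):
        speakers = set()
        for seg in segments[i:]:
            speakers.add(seg.speaker)
            span = seg.end - first.start
            if span > target + tolerance or len(speakers) > max_speakers:
                break  # too long or too many speakers: no window from this start
            if span >= target - tolerance:
                windows.append(Window(first.start, seg.end, frozenset(speakers)))
                break
    return windows


def overlap_ratio(a: Window, b: Window) -> float:
    """Intersection length divided by the shorter window's duration."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    return intersection / min(a.duration, b.duration)


def drop_overlapping(windows: List[Window], max_overlap: float) -> List[Window]:
    """Greedily keep windows, earliest first, that do not overlap kept ones too much."""
    kept: List[Window] = []
    for window in sorted(windows, key=lambda w: w.start):
        if all(overlap_ratio(window, k) <= max_overlap for k in kept):
            kept.append(window)
    return kept


segments = [
    Segment(0.0, 4.0, "A"), Segment(4.0, 9.0, "B"), Segment(9.0, 13.0, "A"),
    Segment(13.0, 20.0, "B"), Segment(20.0, 24.0, "A"), Segment(24.0, 30.0, "B"),
]
windows = build_windows(segments, target=10.0, tolerance=2.0, max_speakers=2)
unique = drop_overlapping(windows, max_overlap=0.5)
```

With these inputs, five candidate windows are built and three survive deduplication, tiling the 30 seconds of audio without redundant overlap.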
## Text Integration
Hand processed audio data off to text processing workflows for multimodal applications.
- Convert `AudioTask` objects to `DocumentBatch` for text processing. Tags: `format-conversion`, `pipeline-integration`, `multimodal`