---
description: "Comprehensive overview of the automatic speech recognition pipeline architecture and workflow in NeMo Curator"
categories: ["concepts-architecture"]
tags: ["asr-pipeline", "speech-recognition", "architecture", "workflow", "nemo-toolkit"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "concept"
modality: "audio-only"
---

# ASR Pipeline Architecture

This guide describes NeMo Curator's Automatic Speech Recognition (ASR) pipeline architecture, from audio input processing through transcription generation and quality assessment.

## Pipeline Overview

The ASR pipeline in NeMo Curator moves audio through three stages: input, processing, and assessment.

```mermaid
graph TD
    A[Audio Files] --> B[AudioTask Creation]
    B --> C[ASR Model Loading]
    C --> D[Batch Inference]
    D --> E[Transcription Output]
    E --> F[Quality Assessment]
    F --> G[Filtering & Export]

    subgraph "Input Stage"
        A
        B
    end

    subgraph "Processing Stage"
        C
        D
        E
    end

    subgraph "Assessment Stage"
        F
        G
    end
```

## Core Components

### 1. Audio Input Management

**AudioTask Structure**: The foundation for audio processing

- Contains audio file paths and associated metadata
- Validates file existence and accessibility automatically
- Supports efficient batch processing for scalability

**Input Validation**: Ensures data integrity before processing

- File path existence checks using `AudioTask.validate()` and `validate_item()`
- Optional metadata validation added by downstream stages (such as duration and format checks)
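
The optional metadata checks mentioned above can be pictured with a small, self-contained sketch. The helper name `check_metadata`, the allowed-extension set, and the duration bounds are illustrative assumptions, not part of NeMo Curator's API.

```python
from pathlib import Path

# Hypothetical defaults, chosen for illustration only
ALLOWED_EXTENSIONS = {".wav", ".flac"}


def check_metadata(entry, min_duration=0.5, max_duration=30.0):
    """Return a list of problems found in one manifest entry (empty if none)."""
    problems = []

    # Format check: compare the file extension, case-insensitively
    suffix = Path(entry.get("audio_filepath", "")).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {suffix or 'none'}")

    # Duration check: only applies when a duration has been recorded
    duration = entry.get("duration")
    if duration is not None and not (min_duration <= duration <= max_duration):
        problems.append(f"duration out of range: {duration}")

    return problems
```

Returning a list of problems rather than a single boolean keeps the reason for each rejection available for logging.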

### 2. ASR Model Integration

**NeMo Framework Integration**: Runs pretrained ASR models from the NeMo framework

- Automatic model downloading and caching for convenience
- GPU-accelerated inference when hardware is available
- Support for multilingual and domain-specific model variants

**Model Management**: Efficient resource usage

- Lazy loading of models to conserve system memory
- Automatic GPU or CPU device selection based on available resources
- Model-level batching handled within NeMo framework
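
Lazy loading, as described above, means the stage object is cheap to construct and the model is only built when it is first needed. The following is a generic sketch of that pattern; `LazyModelHolder` and `fake_loader` are hypothetical names, and this is not a description of how `InferenceAsrNemoStage` is implemented internally.

```python
class LazyModelHolder:
    """Defer an expensive load until the model is first needed."""

    def __init__(self, loader):
        self._loader = loader  # callable that builds the model
        self._model = None     # nothing is loaded at construction time

    def setup(self):
        # Load once; later calls reuse the cached instance
        if self._model is None:
            self._model = self._loader()
        return self._model


load_calls = []


def fake_loader():
    # Stand-in for a real model download and initialization
    load_calls.append("loaded")
    return object()


holder = LazyModelHolder(fake_loader)
first = holder.setup()
second = holder.setup()  # does not trigger a second load
```

Because construction does no work, a pipeline can be defined on a machine without a GPU and the load happens only where the stage actually runs.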

### 3. Inference Processing

**Batch Processing**: Transcribes groups of audio files together

- The audio files in a batch are passed to the NeMo ASR model in a single call
- `.with_(batch_size=..., resources=Resources(...))` sets how many tasks are grouped per call and which resources the stage requests
- The NeMo framework handles internal batching and optimization

**Output Generation**: Structured transcription results

- Clean predicted text extraction from NeMo model outputs
- Complete metadata preservation throughout the processing pipeline
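
The two ideas above, grouping inputs into batches and attaching predictions without losing metadata, can be sketched in plain Python. The functions `chunked`, `fake_transcribe`, and `transcribe_entries` are hypothetical stand-ins; a real stage would call the NeMo model where `fake_transcribe` is used.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_transcribe(paths):
    # Stand-in for a model call: one prediction per input path
    return [f"transcript of {p}" for p in paths]


def transcribe_entries(entries, batch_size, transcribe=fake_transcribe):
    results = []
    for batch in chunked(entries, batch_size):
        texts = transcribe([e["audio_filepath"] for e in batch])
        for entry, text in zip(batch, texts):
            # Copy the entry so all original metadata is preserved
            results.append({**entry, "pred_text": text})
    return results


entries = [
    {"audio_filepath": "a.wav", "text": "hello"},
    {"audio_filepath": "b.wav", "text": "world"},
    {"audio_filepath": "c.wav", "text": "again"},
]
out = transcribe_entries(entries, batch_size=2)
```

Each output entry is a copy of its input with one added `pred_text` field, so the input list is left unmodified.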

## Processing Stages

### Stage 1: Data Loading

```python
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.text.io.reader import JsonlReader

# Data loading from datasets (e.g., FLEURS)
fleurs_stage = CreateInitialManifestFleursStage(
    lang="en_us",              # Language code
    split="dev",               # Data split
    raw_data_dir="/path/to/data"
)

# Or load from custom manifest files
manifest_reader = JsonlReader(
    input_file_path="/path/to/manifest.jsonl"
)

# Stages automatically create AudioTask objects from loaded data
```

### Stage 2: ASR Model Setup

```python
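# Import path assumed from NeMo Curator's module layout; verify for your version
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage

# Model initialization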
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
)

# GPU/CPU device selection (based on configured resources)
device = asr_stage.check_cuda()

# Model loading
asr_stage.setup()  # Downloads and loads model
```

### Stage 3: Transcription Generation

```python
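from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.pipeline import Pipeline

# Do not call process() directly; the Pipeline and executor handle dispatch.
# The two import paths above are assumed; verify them for your version.
pipeline = Pipeline(name="asr_inference")
pipeline.add_stage(asr_stage)
executor = XennaExecutor()
results = pipeline.run(executor)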

# Output: AudioTask objects with added "pred_text" field
# Each task now contains both original data and predictions
```

### Stage 4: Quality Assessment Integration

```python
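# Import paths assumed from NeMo Curator's module layout; verify for your version
from nemo_curator.stages.audio.common import GetAudioDurationStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage

# WER calculation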
wer_stage = GetPairwiseWerStage(
    text_key="text",
    pred_text_key="pred_text",
    wer_key="wer"
)

# Duration analysis
duration_stage = GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration"
)
```
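
To make the WER metric concrete, here is a minimal standalone sketch of word error rate as word-level edit distance divided by the reference length. The function name `word_error_rate` is hypothetical, and `GetPairwiseWerStage` may normalize text or scale the result (for example, to a percentage) differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat down", "the bat sat")` counts one substitution and one deletion against four reference words and returns `0.5`. The value can exceed 1.0 when the hypothesis contains many inserted words.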

## Data Flow Architecture

### Input Data Flow

1. **Audio Files** → Raw audio stored on the file system
2. **Manifest Files** → JSONL format with metadata
3. **AudioTask Objects** → Validated, structured data containers
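
As an illustration of the manifest step, the sketch below parses JSONL text into one dictionary per line using only the standard library. The helper `parse_manifest` and the sample paths are hypothetical; the `audio_filepath` and `text` field names follow the keys used elsewhere in this guide.

```python
import json


def parse_manifest(text):
    """Parse JSONL text: one JSON object per non-empty line."""
    entries = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            entries.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number}: invalid JSON") from err
    return entries


sample = (
    '{"audio_filepath": "/data/a.wav", "text": "hello world"}\n'
    '\n'
    '{"audio_filepath": "/data/b.wav", "text": "good morning"}\n'
)
manifest = parse_manifest(sample)
```

Reporting the line number on failure makes it easier to locate a malformed entry in a large manifest.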

### Processing Data Flow

1. **Model Loading** → NeMo ASR model initialization
2. **Batch Creation** → Group audio files for efficient processing
3. **GPU Processing** → Transcription generation
4. **Result Aggregation** → Combine transcriptions with metadata

### Output Data Flow

1. **Transcription Results** → Predicted text for each audio file
2. **Quality Metrics** → WER, CER, duration, and custom scores
3. **Filtered Datasets** → High-quality audio-text pairs
4. **Export Formats** → JSONL manifests for training workflows
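
The filtering and export steps can be sketched as a threshold on a per-entry score followed by a JSONL write. The helper `export_filtered`, the threshold value, and the sample entries are assumptions for illustration; WER is expressed here as a fraction.

```python
import io
import json


def export_filtered(entries, max_wer, out_file):
    """Write entries with wer <= max_wer as JSONL; return how many were kept."""
    kept = 0
    for entry in entries:
        # Entries without a WER score are treated as failing the filter
        if entry.get("wer", float("inf")) <= max_wer:
            out_file.write(json.dumps(entry, ensure_ascii=False) + "\n")
            kept += 1
    return kept


scored = [
    {"audio_filepath": "a.wav", "pred_text": "hello world", "wer": 0.0},
    {"audio_filepath": "b.wav", "pred_text": "good mourning", "wer": 0.5},
    {"audio_filepath": "c.wav", "pred_text": "no score"},
]
buffer = io.StringIO()  # stands in for an open output file
kept = export_filtered(scored, max_wer=0.25, out_file=buffer)
```

Writing to any file-like object keeps the helper easy to exercise in memory before pointing it at a real manifest path.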

## Performance Characteristics

### Scalability Factors

**Model Selection Impact**:

- Larger models provide better accuracy but require more processing time
- Some NeMo models support streaming inference, but this stage transcribes complete files offline
- Language-specific models improve accuracy for target languages

**Hardware Usage**:

- GPU acceleration typically outperforms CPU processing for larger workloads
- Memory requirements grow with model size and with the length of the audio inputs
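
A rough back-of-the-envelope sketch shows why input length matters. It assumes 16 kHz mono audio stored as 32-bit floats and a batch padded to its longest clip; these are illustrative assumptions, and the figure covers only the raw input buffer, not model weights or intermediate activations.

```python
def padded_batch_bytes(durations_s, sample_rate=16_000, bytes_per_sample=4):
    """Bytes needed for a zero-padded batch of mono waveforms.

    Every clip is padded to the longest one, so the cost is driven by the
    maximum duration, not the average.
    """
    if not durations_s:
        return 0
    max_samples = int(max(durations_s) * sample_rate)
    return len(durations_s) * max_samples * bytes_per_sample


short_clips = padded_batch_bytes([5.0] * 16)             # sixteen 5-second clips
one_long_clip = padded_batch_bytes([5.0] * 15 + [30.0])  # one 30-second outlier
```

Under these assumptions a single 30-second clip makes the batch buffer six times larger than a batch of only 5-second clips, which is one reason to group audio of similar length.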

### Optimization Strategies

**Memory Management**:

```python
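# Import path assumed from NeMo Curator's module layout; verify for your version
from nemo_curator.stages.resources import Resources

# Optimize for memory-constrained environments
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_small"  # Smaller model
).with_(
    resources=Resources(gpus=0.5)  # Request half a GPU; sharing depends on the executor/backend
)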
```

**Resource Configuration**:

```python
# Configure resources for processing
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
).with_(
    resources=Resources(gpus=1.0)  # Dedicated GPU
)
```

## Error Handling and Recovery

### Audio Processing Errors

```python
# audio_data holds one manifest entry; filepath_key names its audio path field
audio_task = AudioTask(data=audio_data, filepath_key="audio_filepath")

# Check that the audio file exists on disk
is_valid = audio_task.validate()
```
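
Outside of `AudioTask`, the same idea can be sketched with the standard library: split entries into those whose audio file exists and those that should be skipped, and record why. The helper `partition_by_existence` is hypothetical and is not part of NeMo Curator.

```python
import os
import tempfile


def partition_by_existence(entries, filepath_key="audio_filepath"):
    """Split entries into (valid, rejected) based on the file path."""
    valid, rejected = [], []
    for entry in entries:
        path = entry.get(filepath_key)
        if path and os.path.isfile(path):
            valid.append(entry)
        else:
            reason = "missing key" if not path else "file not found"
            rejected.append({"entry": entry, "reason": reason})
    return valid, rejected


with tempfile.TemporaryDirectory() as tmp:
    real_path = os.path.join(tmp, "ok.wav")
    open(real_path, "wb").close()  # empty placeholder file
    valid, rejected = partition_by_existence([
        {"audio_filepath": real_path},
        {"audio_filepath": os.path.join(tmp, "gone.wav")},
        {"text": "no path here"},
    ])
```

Keeping the rejected entries with a reason, rather than dropping them silently, makes it possible to audit how much data was lost before inference.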

### Pipeline Recovery

For guidance on resumable processing and recovery at the executor and backend level, refer to [Resumable Processing](/reference/infra/resumable-processing).

## Integration Points

### Text Processing Integration

The ASR pipeline connects to text processing workflows by converting transcribed `AudioTask` objects into `DocumentBatch` objects:

```python
# Audio → Text pipeline
audio_to_text = [
    # Audio → Transcriptions
    InferenceAsrNemoStage(model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"),
    # AudioTask → DocumentBatch
    AudioToDocumentStage(),
    # Continue with text processing stages...
]
```