---
name: speech-to-text
description: Transcribe audio to text using Sarvam AI's Saarika model. Use when the user needs to convert speech to text, transcribe audio files, build voice interfaces, or process Indian language audio. Supports 11 Indian languages plus English with automatic language detection, code-mixing, speaker diarization, and word-level timestamps.
license: Apache-2.0
metadata:
  author: sarvam-ai
  version: "1.0"
  model: saarika:v2.5
---

# Speech-to-Text with Saarika

Saarika is Sarvam AI's speech recognition model optimized for Indian languages with support for code-mixing (Hindi-English etc.) and multi-speaker scenarios.

## Installation

```bash
pip install sarvamai
```

## Quick Start

```python
from sarvamai import SarvamAI

client = SarvamAI()

response = client.speech_to_text.transcribe(
    file=open("audio.wav",
"rb"),
    model="saarika:v2.5",
    language_code="hi-IN"
)

print(response.transcript)
```

## Supported Languages

| Code | Language | Code | Language |
|------|----------|------|----------|
| `hi-IN` | Hindi | `ta-IN` | Tamil |
| `bn-IN` | Bengali | `te-IN` | Telugu |
| `kn-IN` | Kannada | `ml-IN` | Malayalam |
| `mr-IN` | Marathi | `gu-IN` | Gujarati |
| `pa-IN` | Punjabi | `or-IN` | Odia |
| `en-IN` | English (Indian) | `auto` | Auto-detect |

## API Options

### REST API (≤30 seconds)

For short audio clips:

```python
response = client.speech_to_text.transcribe(
    file=open("short_clip.wav",
"rb"),
    model="saarika:v2.5",
    language_code="auto",           # Auto-detect language
    with_timestamps=True,           # Word-level timestamps
    with_diarisation=True           # Speaker identification
)

print(response.transcript)
print(response.language_code)       # Detected language
print(response.words)               # Timestamped words
print(response.speaker_segments)    # Speaker turns
```

### Batch API (≤1 hour)

For long recordings:

```python
response = client.speech_to_text.transcribe_batch(
    file=open("long_recording.mp3",
"rb"),
    model="saarika:v2.5",
    language_code="hi-IN"
)
```

### WebSocket Streaming (Real-time)

For live transcription. Audio must be sent as **base64-encoded strings**.

```python
import asyncio
import base64
from sarvamai import AsyncSarvamAI

async def stream_audio():
    client = AsyncSarvamAI()

    async with client.speech_to_text_streaming.connect(
        language_code="hi-IN",
        model="saarika:v2.5",
        high_vad_sensitivity=True
    ) as ws:
        # Read and encode audio to base64
        with open("audio.wav",
"rb") as f:
            audio_base64 = base64.b64encode(f.read()).decode("utf-8")

        # Send base64 encoded audio
        await ws.transcribe(
            audio=audio_base64,
            encoding="audio/wav",
            sample_rate=16000
        )

        # Receive transcription
        response = await ws.recv()
        print(response)

asyncio.run(stream_audio())
```

**WebSocket supported formats:** `wav`, `pcm_s16le`, `pcm_l16`, `pcm_raw` only. MP3/AAC/OGG not supported for streaming.

## JavaScript

```javascript
import { SarvamAI
} from "sarvamai";
import fs from "fs";

const client = new SarvamAI();

const response = await client.speechToText.transcribe({
  file: fs.createReadStream("audio.wav"),
  model: "saarika:v2.5",
  languageCode: "hi-IN",
  withTimestamps: true
});

console.log(response.transcript);
```

## cURL

```bash
curl -X POST "https://api.sarvam.ai/speech-to-text" \
  -H "api-subscription-key: $SARVAM_API_KEY" \
  -F "file=@audio.wav" \
  -F "model=saarika:v2.5" \
  -F "language_code=hi-IN"
```

## Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `file` | File | Yes | Audio file (wav, mp3, flac, ogg, webm) |
| `model` | string | Yes | `saarika:v2.5` or `saarika:v2` |
| `language_code` | string | Yes | BCP-47 code or `auto` |
| `with_timestamps` | bool | No | Return word timestamps |
| `with_diarisation` | bool | No | Enable speaker identification |

## Response

```json
{
    "request_id": "abc123",
    "transcript": "नमस्ते, आप कैसे हैं?",
    "language_code": "hi-IN",
    "words": [
        {
            "word": "नमस्ते",
            "start": 0.0,
            "end": 0.5
        },
        {
            "word": "आप",
            "start": 0.6,
            "end": 0.8
        }
    ],
    "speaker_segments": [
        {
            "speaker": "SPEAKER_00",
            "start": 0.0,
            "end": 2.5
        }
    ]
}
```

See [references/streaming.md
](references/streaming.md) for detailed WebSocket documentation.