---
name: qwen3-tts-mlx
description: Local Qwen3-TTS speech synthesis on Apple Silicon via MLX. Use for offline narration, audiobooks, video voiceovers, and multilingual TTS.
metadata:
  author: agiseek
  version: "1.2.0"
---

# Qwen3-TTS MLX

Run Qwen3-TTS locally on Apple Silicon (M1/M2/M3/M4) using MLX. Supports 10 languages (plus automatic language detection), 9 built-in voices, voice cloning, and voice design from text descriptions.

## When to Use

- Generate speech fully offline on a Mac
- Produce narration, audiobooks, podcasts, or video voiceovers
- Create multilingual TTS with controllable style and emotion
- Clone any voice from a short audio sample
- Design custom voices from text descriptions

## Quick Start

### Install

```bash
pip install mlx-audio
brew install ffmpeg
```

### Basic Usage

```bash
python scripts/run_tts.py custom-voice \
  --text "Hello, welcome to local text to speech." \
  --voice Ryan \
  --output output.wav
```

### With Style Control

```bash
python scripts/run_tts.py custom-voice \
  --text "Breaking news: local AI model achieves human-level speech." \
  --voice Uncle_Fu \
  --instruct "news anchor tone, calm and authoritative" \
  --output news.wav
```

## Model Variants

| Variant | Model | Size | Memory | Use Case |
|---------|-------|------|--------|----------|
| CustomVoice | `mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit` | ~1GB | ~4GB | Built-in voices + style control (recommended) |
| VoiceDesign | `mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit` | ~2GB | ~5GB | Create voices from text descriptions |
| Base | `mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit` | ~1GB | ~4GB | Voice cloning from reference audio |

## Supported Languages

| Language | Code | Notes |
|----------|------|-------|
| Auto-detect | `auto` | Default, detects from text |
| Chinese | `Chinese` | Mandarin |
| English | `English` | |
| Japanese | `Japanese` | |
| Korean | `Korean` | |
| French | `French` | |
| German | `German` | |
| Spanish | `Spanish` | |
| Portuguese | `Portuguese` | |
| Italian | `Italian` | |
| Russian | `Russian` | |
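
The codes in the table are capitalized, while it is easy to type them in lowercase. The sketch below normalizes user input to the table's spelling before it is passed on. `normalize_lang_code` is a hypothetical helper, not part of `mlx-audio` or `scripts/run_tts.py`, and whether the tools themselves are case-sensitive is not established here.

```python
# Codes copied from the Supported Languages table above.
SUPPORTED_LANG_CODES = (
    "auto", "Chinese", "English", "Japanese", "Korean", "French",
    "German", "Spanish", "Portuguese", "Italian", "Russian",
)

# Lowercase lookup so "english", "English", and " ENGLISH " all resolve.
_BY_LOWER = {code.lower(): code for code in SUPPORTED_LANG_CODES}


def normalize_lang_code(value=None):
    """Return the table spelling of a language code (hypothetical helper)."""
    if value is None or not value.strip():
        return "auto"  # documented default: detect from text
    key = value.strip().lower()
    if key not in _BY_LOWER:
        raise ValueError(f"unsupported language code: {value!r}")
    return _BY_LOWER[key]
```

Failing early with a `ValueError` avoids loading a model only to discover that the language code was mistyped.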

## Built-in Voices

| Voice | Language | Character |
|-------|----------|-----------|
| Vivian | Chinese | Female, bright, young |
| Serena | Chinese | Female, gentle, soft |
| Uncle_Fu | Chinese | Male, authoritative, news anchor |
| Dylan | Chinese | Male, Beijing dialect |
| Eric | Chinese | Male, Sichuan dialect |
| Ryan | English | Male, energetic |
| Aiden | English | Male, clear, neutral |
| Ono_Anna | Japanese | Female |
| Sohee | Korean | Female |

**Voice Selection Guide:**

| Scenario | Recommended Voice |
|----------|-------------------|
| Chinese news/narration | Uncle_Fu |
| Chinese casual/lively | Eric |
| Chinese female, professional | Vivian |
| Chinese female, storytelling | Serena |
| English energetic content | Ryan |
| English neutral/educational | Aiden |
| Japanese content | Ono_Anna |
| Korean content | Sohee |
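
For scripted pipelines, the two tables above can be kept as plain lookups. This is a minimal sketch: the data is copied from the tables, while `recommend_voice` and `voices_for_language` are illustrative names and not part of any shipped script.

```python
# Rows copied from the Built-in Voices table: voice -> language.
VOICE_LANGUAGE = {
    "Vivian": "Chinese", "Serena": "Chinese", "Uncle_Fu": "Chinese",
    "Dylan": "Chinese", "Eric": "Chinese",
    "Ryan": "English", "Aiden": "English",
    "Ono_Anna": "Japanese", "Sohee": "Korean",
}

# Rows copied from the Voice Selection Guide: scenario -> voice.
VOICE_GUIDE = {
    "chinese news/narration": "Uncle_Fu",
    "chinese casual/lively": "Eric",
    "chinese female, professional": "Vivian",
    "chinese female, storytelling": "Serena",
    "english energetic content": "Ryan",
    "english neutral/educational": "Aiden",
    "japanese content": "Ono_Anna",
    "korean content": "Sohee",
}


def voices_for_language(language):
    """List built-in voices whose native language matches (case-insensitive)."""
    wanted = language.strip().lower()
    return [voice for voice, lang in VOICE_LANGUAGE.items() if lang.lower() == wanted]


def recommend_voice(scenario, default="Vivian"):
    """Look up a scenario from the guide; fall back to the CLI default voice."""
    return VOICE_GUIDE.get(scenario.strip().lower(), default)
```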

## Modes

### 1) CustomVoice

Use built-in voices with optional emotion/style control via `--instruct`.

```bash
python scripts/run_tts.py custom-voice \
  --text "This is amazing news!" \
  --voice Vivian \
  --instruct "excited and happy" \
  --output excited.wav
```

**Style instruction examples:**
- `"calm and warm"` - Soft, friendly delivery
- `"news anchor, authoritative"` - Professional broadcast style
- `"excited and energetic"` - High energy, enthusiastic
- `"sad and melancholic"` - Emotional, somber tone
- `"whispering, intimate"` - Quiet, close-mic feel

### 2) VoiceDesign

Create a completely new voice by describing it in natural language.

```bash
python scripts/run_tts.py voice-design \
  --text "Welcome to our podcast." \
  --instruct "warm, mature male narrator with low pitch and gentle tone" \
  --output podcast_intro.wav
```

**Voice description examples:**
- `"young cheerful female with high pitch"`
- `"elderly wise male with deep resonant voice"`
- `"professional female news anchor, clear articulation"`
- `"friendly young male, casual and relaxed"`

### 3) VoiceClone

Clone any voice from a reference audio sample (5-10 seconds recommended).

```bash
python scripts/run_tts.py voice-clone \
  --text "This is my cloned voice speaking new content." \
  --ref_audio reference.wav \
  --ref_text "The exact transcript of the reference audio" \
  --output cloned.wav
```

**Tips for voice cloning:**
- Use clean audio without background noise
- 5-10 seconds of speech works best
- Provide an accurate transcript of the reference audio
- Reference and output language should match
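
The length recommendation above can be checked before cloning. The sketch below uses only the standard-library `wave` module, so it assumes the reference is an uncompressed PCM WAV file. `check_reference_length` is a hypothetical helper, and the 5-10 second bounds are the recommendation from the tips, not a limit enforced by the model.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file: frame count divided by frame rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_length(path, min_seconds=5.0, max_seconds=10.0):
    """Return (within_range, duration) for a reference clip (hypothetical helper)."""
    duration = wav_duration_seconds(path)
    return min_seconds <= duration <= max_seconds, duration
```

A clip outside the range can still be passed to `--ref_audio`; the check only flags it so that it can be trimmed or re-recorded first.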

## CLI Parameters

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| `--text` | Yes | - | Text to synthesize |
| `--voice` | No | Vivian | Built-in voice (CustomVoice only) |
| `--lang_code` | No | auto | Language code |
| `--instruct` | No | - | Style control (CustomVoice) or voice description (VoiceDesign) |
| `--speed` | No | 1.0 | Speech speed multiplier |
| `--temperature` | No | 0.7 | Sampling temperature (higher = more variation) |
| `--model` | No | (per mode) | Override default model |
| `--output` | No | - | Output file path |
| `--out-dir` | No | ./outputs | Output directory when `--output` is not set |
| `--ref_audio` | VoiceClone | - | Reference audio file |
| `--ref_text` | VoiceClone | - | Reference audio transcript |
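
When calling the CLI from another Python program, the table above can be turned into an argument list. The following is a sketch under stated assumptions: `build_tts_command` is a hypothetical wrapper, the flag spellings and mode names are taken from this document, and the script may validate its arguments differently.

```python
import sys

# Python keyword -> CLI flag, spelled exactly as in the parameters table.
FLAGS = {
    "voice": "--voice", "lang_code": "--lang_code", "instruct": "--instruct",
    "speed": "--speed", "temperature": "--temperature", "model": "--model",
    "output": "--output", "out_dir": "--out-dir",
    "ref_audio": "--ref_audio", "ref_text": "--ref_text",
}
MODES = ("custom-voice", "voice-design", "voice-clone")


def build_tts_command(mode, text, **options):
    """Build an argv list for scripts/run_tts.py (hypothetical wrapper)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    unknown = set(options) - set(FLAGS)
    if unknown:
        raise TypeError(f"unknown options: {sorted(unknown)}")
    if mode == "voice-clone":
        # The table marks both reference parameters as required for VoiceClone.
        missing = [name for name in ("ref_audio", "ref_text") if not options.get(name)]
        if missing:
            raise ValueError(f"voice-clone requires: {missing}")
    command = [sys.executable, "scripts/run_tts.py", mode, "--text", text]
    for name, flag in FLAGS.items():  # fixed order keeps the output predictable
        if options.get(name) is not None:
            command += [flag, str(options[name])]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`. Building a list instead of a shell string avoids quoting problems when the text contains quotes or punctuation.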

## Python API

### Using generate_audio (recommended)

```python
from mlx_audio.tts.generate import generate_audio

# CustomVoice with style control
generate_audio(
    text="Hello from Qwen3-TTS!",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
    voice="Ryan",
    lang_code="english",
    instruct="friendly and warm",
    output_path=".",
    file_prefix="hello",
    audio_format="wav",
    join_audio=True,
    verbose=True,
)
```

### Using Model directly

```python
from mlx_audio.tts.utils import load
import soundfile as sf
import numpy as np

# Load model
model = load("mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit")

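# Generate audio; the method yields result chunks as they are produced
audio_chunks = []
for chunk in model.generate_custom_voice(
    text="Hello from Qwen3-TTS.",
    speaker="Ryan",
    language="english",
    instruct="clear, steady delivery",
):
    # Skip chunks that carry no audio
    if getattr(chunk, "audio", None) is not None:
        # Convert to NumPy in case the chunk holds an MLX array
        audio_chunks.append(np.asarray(chunk.audio))

if not audio_chunks:
    raise RuntimeError("Model produced no audio")

# Combine and save at the 24 kHz output sample rate
audio = np.concatenate(audio_chunks)
sf.write("output.wav", audio, 24000)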
```

### VoiceDesign

```python
from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="Welcome to the show.",
    model="mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit",
    instruct="warm, friendly female narrator with medium pitch",
    lang_code="english",
    output_path=".",
    file_prefix="voice_design",
    join_audio=True,
)
```

### VoiceClone

```python
from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="New content in the cloned voice.",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio",
    output_path=".",
    file_prefix="cloned",
    join_audio=True,
)
```

## Batch Processing

Use `scripts/batch_dubbing.py` for processing multiple lines:

```bash
python scripts/batch_dubbing.py \
  --input dubbing.json \
  --out-dir outputs
```

See `references/dubbing_format.md` for the JSON format.

## Performance

| Metric | Value |
|--------|-------|
| Sample rate | 24,000 Hz |
| Real-time factor | ~0.7 (about 0.7 s of compute per 1 s of audio; faster than real-time) |
| Peak memory | ~4-6 GB |
| First run | Downloads model (~1-2GB) |
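
These figures allow rough planning for long jobs. The sketch below assumes the real-time factor means generation time divided by audio duration (consistent with "faster than real-time"), and that the output is 16-bit mono PCM WAV with a 44-byte header. The bit depth is an assumption; this document does not state it.

```python
SAMPLE_RATE_HZ = 24_000   # from the Performance table
REAL_TIME_FACTOR = 0.7    # from the Performance table (approximate)


def estimated_generation_seconds(audio_seconds, rtf=REAL_TIME_FACTOR):
    """Generation time, assuming RTF = generation time / audio duration."""
    return audio_seconds * rtf


def pcm16_mono_wav_bytes(audio_seconds, sample_rate=SAMPLE_RATE_HZ):
    """File size, assuming 2 bytes per sample, one channel, 44-byte header."""
    return int(audio_seconds * sample_rate) * 2 + 44
```

Under these assumptions, one minute of speech takes roughly 42 seconds to generate and occupies about 2.9 MB on disk. Actual speed varies with the chip, the model variant, and the text.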

## Troubleshooting

| Issue | Solution |
|-------|----------|
| Slow generation | Use 4-bit CustomVoice model |
| Unnatural pauses | Add punctuation, keep sentences short |
| Wrong language detected | Specify `--lang_code` explicitly |
| Poor voice cloning quality | Use cleaner reference audio and an accurate transcript |
| Tokenizer warnings | Harmless, can be ignored |
| Out of memory | Close other apps, use 4-bit model |
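
For the "keep sentences short" advice, long text can be split into sentence-aligned chunks before synthesis. This is a minimal sketch: `split_sentences` and `chunk_text` are hypothetical helpers, the 200-character limit is an arbitrary starting point, and the simple punctuation rule will also split decimals such as "3.14".

```python
import re

# Split after sentence-ending punctuation (Latin or CJK), but not in the
# middle of a run such as "...". The punctuation stays with its sentence.
_SENTENCE_END = re.compile(r"(?<=[.!?。！？])(?![.!?。！？])\s*")


def split_sentences(text):
    """Return non-empty sentences with surrounding whitespace removed."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


def chunk_text(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars where possible."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the current chunk
            current = sentence       # start a new one with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the resulting audio joined. A single sentence longer than `max_chars` is kept whole, not cut mid-sentence.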