---
name: voice-stt-tts
description: "Full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge TTS"
homepage: https://docs.openclaw.ai/nodes/audio
---

# Voice Messages (STT + TTS) for OpenClaw 🎙️

Complete voice message setup using **faster-whisper** for transcription and **Edge TTS** for voice replies.

## What we configure

- ✅ **STT** (Speech-to-Text) — transcribe voice messages via faster-whisper
- ✅ **TTS** (Text-to-Speech) — voice replies via Edge TTS
- 🎯 **Result:** voice → text → reply with voice

---

## Installation

### 1. Create virtual environment (venv)

On Ubuntu, create an isolated virtual environment (install the `python3-venv` package first if the `venv` module is missing):

```bash
python3 -m venv ~/.openclaw/workspace/voice-messages
```

### 2. Install faster-whisper

Install the package into the venv:

```bash
~/.openclaw/workspace/voice-messages/bin/pip install faster-whisper
```

**What gets installed:**
- `faster-whisper` — Python library for transcription
- Dependencies: `ctranslate2`, `onnxruntime`, `huggingface-hub`, `av`, `numpy`, and others.
- Size: ~250 MB

---

## Transcription Script

### Path and content

**File:** `~/.openclaw/workspace/voice-messages/transcribe.py`

```python
#!/usr/bin/env python3
import argparse
from faster_whisper import WhisperModel


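def transcribe(audio_path: str, model_name: str = "small", lang: str = "en", device: str = "cpu") -> str:
    """Transcribe an audio file and return the text as a single line."""
    # int8 keeps CPU inference small and fast; float16 is used on GPU.
    model = WhisperModel(
        model_name,
        device=device,
        compute_type="int8" if device == "cpu" else "float16",
    )
    # vad_filter drops silence before decoding; segments is a lazy generator.
    segments, _ = model.transcribe(audio_path, language=lang, vad_filter=True)
    # Join the non-empty segment texts with single spaces.
    text = " ".join(seg.text.strip() for seg in segments if seg.text and seg.text.strip()).strip()
    return text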


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--audio", required=True)
    p.add_argument("--model", default="small")
    p.add_argument("--lang", default="en")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    args = p.parse_args()

    text = transcribe(args.audio, args.model, args.lang, args.device)
    print(text if text else "")


if __name__ == "__main__":
    main()
```

**What the script does:**
1. Accepts the audio file path (`--audio`, required)
2. Loads the Whisper model (`--model`, default `small`) on the chosen device (`--device`, default `cpu`)
3. Sets the language (`--lang`, default `en` for English)
4. Transcribes with the VAD (Voice Activity Detection) filter enabled
5. Prints the transcript to stdout as a single line

### Make the file executable

```bash
chmod +x ~/.openclaw/workspace/voice-messages/transcribe.py
```

---

## OpenClaw Configuration

### 1. Configure STT (`tools.media.audio`)

Add to `~/.openclaw/openclaw.json`:

```json5
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}
```

**Parameters:**

| Parameter | Value | Description |
|-----------|----------|-----------|
| `enabled` | `true` | Enable audio transcription |
| `maxBytes` | `20971520` | Max file size (20 MB) |
| `type` | `"cli"` | Model type: CLI command |
| `command` | Python path | Path to python in venv |
| `args` | argument array | Arguments for script |
| `{{MediaPath}}` | placeholder | Replaced with audio file path |
| `timeoutSeconds` | `120` | Transcription timeout (2 minutes) |
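
To try the same command by hand, the substitution described in the table can be sketched in Python. This is only an illustration: `build_command` is a hypothetical helper, not part of OpenClaw, and whether the gateway itself expands `~` is an assumption this sketch does not rely on (it expands the paths explicitly).

```python
import os


def build_command(command: str, args: list[str], media_path: str) -> list[str]:
    """Return an argv list with {{MediaPath}} replaced and ~ expanded.

    Hypothetical helper for manual testing; it mirrors the config fields
    above but is not OpenClaw's own implementation.
    """
    argv = [os.path.expanduser(command)]
    for arg in args:
        arg = arg.replace("{{MediaPath}}", media_path)
        argv.append(os.path.expanduser(arg))
    return argv


model = {
    "command": "~/.openclaw/workspace/voice-messages/bin/python",
    "args": [
        "~/.openclaw/workspace/voice-messages/transcribe.py",
        "--audio", "{{MediaPath}}",
        "--lang", "en",
        "--model", "small",
    ],
}

# /tmp/voice.ogg is a placeholder path for illustration.
argv = build_command(model["command"], model["args"], "/tmp/voice.ogg")
print(argv)
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True, text=True, timeout=120)` to reproduce the 120-second limit from `timeoutSeconds`.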

### 2. Configure TTS (`messages.tts`)

Add to `~/.openclaw/openclaw.json`:

```json5
{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    }
  }
}
```

**Parameters:**

| Parameter | Value | Description |
|-----------|----------|-----------|
| `auto` | `"inbound"` | **Key mode!** — reply with voice only on incoming voice messages |
| `provider` | `"edge"` | TTS provider (free, no API key) |
| `voice` | `"en-US-JennyNeural"` | Voice (see available below) |
| `lang` | `"en-US"` | Locale (`en-US` for US English) |
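
The effect of `auto: "inbound"` can be pictured as a small decision function. The sketch below only restates the behavior described in the table; `reply_mode` is a hypothetical name, and other `auto` values are deliberately left out because this guide does not describe them.

```python
def reply_mode(auto: str, inbound_is_voice: bool) -> str:
    """Return "voice" or "text" for the reply, following the table above.

    Illustrative only: this is not OpenClaw's code, and it models the
    "inbound" setting alone.
    """
    if auto != "inbound":
        raise ValueError(f"this sketch only models auto='inbound', got {auto!r}")
    return "voice" if inbound_is_voice else "text"


print(reply_mode("inbound", inbound_is_voice=True))   # voice message in, voice reply out
print(reply_mode("inbound", inbound_is_voice=False))  # text message in, text reply out
```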

### 3. Full configuration example

```json5
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    },
    "ackReactionScope": "group-mentions"
  }
}
```
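
If `openclaw.json` already contains other settings, the new sections have to be merged in rather than pasted over the file. A minimal sketch of such a merge follows; `deep_merge` is a hypothetical helper, and the `existing` dictionary stands in for whatever your file currently holds. Note that Python's standard `json` module reads strict JSON only, so this approach assumes the file has no comments or trailing commas.

```python
import json


def deep_merge(base: dict, patch: dict) -> dict:
    """Return a new dict where nested keys from patch override or extend base."""
    merged = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Stand-in for the current file contents (assumption for illustration).
existing = {"messages": {"ackReactionScope": "group-mentions"}}

tts_patch = {
    "messages": {
        "tts": {
            "auto": "inbound",
            "provider": "edge",
            "edge": {"voice": "en-US-JennyNeural", "lang": "en-US"},
        }
    }
}

merged = deep_merge(existing, tts_patch)
print(json.dumps(merged, indent=2))
```

Keys that exist only in the original file, such as `ackReactionScope` here, survive the merge unchanged.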

---

## Apply Changes

### Restart Gateway

```bash
# Method 1: via openclaw CLI
openclaw gateway restart

# Method 2: via systemd
systemctl --user restart openclaw-gateway

# Check status
systemctl --user status openclaw-gateway
# Should show: active (running)
```

---

## Testing

### Test STT (transcription)

**Action:** Send a voice message to your Telegram bot

**Expected result:**
```
[Audio] User text: [Telegram ...] <media:audio> Transcript: <transcribed text>
```

**Example response:**
```
[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] <media:audio> Transcript: Hello. How are you?
```
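
When checking many messages, the transcript can be pulled out of such a line with a regular expression. This sketch assumes the line format shown in the example above stays the same; `extract_transcript` is a hypothetical helper.

```python
import re
from typing import Optional

# Matches the "<media:audio> Transcript: ..." tail of the example line above.
_TRANSCRIPT = re.compile(r"<media:audio>\s*Transcript:\s*(.*)$")


def extract_transcript(line: str) -> Optional[str]:
    """Return the transcript text from a line, or None if the marker is absent."""
    match = _TRANSCRIPT.search(line)
    return match.group(1).strip() if match else None


line = (
    "[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] "
    "<media:audio> Transcript: Hello. How are you?"
)
print(extract_transcript(line))
```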

### Test TTS (voice replies)

**Action:** After a successful transcription, the bot should send a voice reply

**Expected result:**
- A voice file arrives in Telegram
- It is displayed as a voice note (round bubble)

**Expected behavior:**
- Incoming voice → bot replies with voice
- Text messages → bot replies with text (this is normal!)

---

## Available Edge TTS Voices

### Female voices

| Voice | ID | Notes |
|--------|-----|------------------|
| Jenny | `en-US-JennyNeural` | ← current |
| Ana | `en-US-AnaNeural` | Softer |

### Male voices

| Voice | ID | Notes |
|--------|-----|------------------|
| Roger | `en-US-RogerNeural` | More bass |

**How to change voice:**
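```bash
# Replace the config and restart only if jq succeeded,
# so a failed edit never overwrites openclaw.json with an empty file.
jq '.messages.tts.edge.voice = "en-US-MichelleNeural"' ~/.openclaw/openclaw.json \
  > ~/.openclaw/openclaw.json.tmp \
  && mv ~/.openclaw/openclaw.json.tmp ~/.openclaw/openclaw.json \
  && systemctl --user restart openclaw-gateway
```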
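
When switching voices, `lang` should stay consistent with `voice`. The sketch below derives the locale from the voice ID, assuming the ID has the form `<locale>-<Name>Neural`, as every ID in this guide does. Both function names are hypothetical.

```python
def locale_from_voice(voice_id: str) -> str:
    """Return the locale prefix of an Edge TTS voice ID.

    Assumes the "<locale>-<Name>Neural" pattern; raises if it does not match.
    """
    locale, sep, name = voice_id.rpartition("-")
    if not sep or not locale or not name.endswith("Neural"):
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    return locale


def edge_settings(voice_id: str) -> dict:
    """Build the `edge` config section with a matching `lang` value."""
    return {"voice": voice_id, "lang": locale_from_voice(voice_id)}


print(edge_settings("en-US-MichelleNeural"))
```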

---

## Additional Edge TTS Parameters

### Adjusting speed, pitch, volume

```json5
{
  "messages": {
    "tts": {
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US",
        "rate": "+10%",      // Speed: -50% to +100%
        "pitch": "-5%",     // Pitch: -50% to +50%
        "volume": "+5%"     // Volume: -100% to +100%
      }
    }
  }
}
```
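
A typo in these values is easy to make, so it can help to check them before restarting the gateway. The sketch below uses the ranges from the comments above and assumes values always carry an explicit sign and a percent symbol, as in the example. `check_percent` and `LIMITS` are hypothetical names.

```python
import re

# Allowed ranges in percent, taken from the comments in the config above.
LIMITS = {"rate": (-50, 100), "pitch": (-50, 50), "volume": (-100, 100)}

# Assumed format: explicit sign, digits, percent sign (e.g. "+10%").
_PERCENT = re.compile(r"^([+-])(\d+)%$")


def check_percent(name: str, value: str) -> int:
    """Return the signed percentage, or raise ValueError if invalid or out of range."""
    match = _PERCENT.match(value)
    if match is None:
        raise ValueError(f"{name}: expected a value like '+10%', got {value!r}")
    amount = int(match.group(2))
    if match.group(1) == "-":
        amount = -amount
    low, high = LIMITS[name]
    if not low <= amount <= high:
        raise ValueError(f"{name}: {amount}% is outside {low}% to {high}%")
    return amount


for key, val in {"rate": "+10%", "pitch": "-5%", "volume": "+5%"}.items():
    print(key, check_percent(key, val))
```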

---

## Troubleshooting

### Problem: Voice not transcribed

**Logs show:**
```
[ERROR] Transcription failed
```

**Possible causes:**
1. **File too large** (over 20 MB). Increase `maxBytes` in the config:
   ```json5
   "maxBytes": 52428800  // 50 MB
   ```

2. **Timeout** (transcription took longer than 2 minutes). Increase `timeoutSeconds`:
   ```json5
   "timeoutSeconds": 180  // 3 minutes
   ```

3. **Model not downloaded** (first run). Wait 1-2 minutes while it downloads; models are cached in `~/.cache/huggingface/`.
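
Two of these causes can be handled ahead of time. The sketch below converts megabytes to the byte value `maxBytes` expects, and loads the model once so that the download happens before the first voice message rather than inside the timeout window. `mb_to_bytes` and `warm_up` are hypothetical helpers; `warm_up` relies on `WhisperModel` fetching the model when it is not cached yet, and must be run with the venv's Python.

```python
def mb_to_bytes(megabytes: int) -> int:
    """Convert MB to bytes using 1 MB = 1024 * 1024 bytes, as maxBytes does above."""
    return megabytes * 1024 * 1024


def warm_up(model_name: str = "small") -> None:
    """Load the model once so it is downloaded and cached before real use."""
    # Imported here so the rest of the file works without faster-whisper installed.
    from faster_whisper import WhisperModel

    WhisperModel(model_name, device="cpu", compute_type="int8")


print(mb_to_bytes(20))  # current limit
print(mb_to_bytes(50))  # raised limit
# To pre-download the model, call warm_up() from the venv's interpreter.
```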

### Problem: No voice reply

**Possible causes:**
1. **Reply too short** (< 10 characters)
   - TTS skips very short replies
   - Solution: this is expected behavior

2. **`auto: "inbound"` and the incoming message was text**
   - In `inbound` mode, TTS replies with voice only to **voice messages**
   - Text messages get text replies; this is correct

3. **Edge TTS unavailable**
   ```bash
   # Check
   curl -s "https://speech.platform.bing.com/consumer/api/v1/tts" | head -c 100
   # If error — temporarily unavailable
   ```

---

## Performance

### Transcription time (Raspberry Pi 4/ARM)

| Whisper Model | Est. time | Quality |
|---------------|--------------|---------|
| `tiny` | ~5-10 sec | Low |
| `base` | ~10-20 sec | Medium |
| `small` | ~20-40 sec | High ← current |
| `medium` | ~40-80 sec | Very high |
| `large` | ~80-160 sec | Maximum |

**Recommendation:** On a Raspberry Pi, use `small` or `base`; `medium` and `large` will be very slow.
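
The table can also be read against `timeoutSeconds`: a model is only usable if its slow-end estimate fits inside the timeout. The sketch below picks the largest such model. The numbers are copied from the table and are rough estimates (the table does not state a clip length), and `largest_model_within` is a hypothetical helper.

```python
# (fast, slow) estimates in seconds, copied from the table above, smallest model first.
ESTIMATED_SECONDS = {
    "tiny": (5, 10),
    "base": (10, 20),
    "small": (20, 40),
    "medium": (40, 80),
    "large": (80, 160),
}


def largest_model_within(budget_seconds: int):
    """Return the largest model whose slow-end estimate fits the budget, or None."""
    best = None
    for name, (_, slow_end) in ESTIMATED_SECONDS.items():
        if slow_end <= budget_seconds:
            best = name
    return best


print(largest_model_within(120))  # budget equal to the default timeoutSeconds
print(largest_model_within(60))
```

Fitting inside the timeout is a lower bar than being pleasant to use, so the recommendation of `small` or `base` still applies when reply latency matters.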

### Where Whisper models are stored

```bash
~/.cache/huggingface/
```

Models download automatically on first run.

## Done! 🎉

After completing these steps:

1. ✅ faster-whisper installed in venv
2. ✅ `transcribe.py` script created
3. ✅ OpenClaw configured (STT + TTS)
4. ✅ Gateway restarted
5. ✅ Voice messages working

Now your Telegram bot:
- 🎙️ **Accepts voice** → transcribes via faster-whisper
- 🎤 **Replies with voice** → generates via Edge TTS
- 💬 **Accepts text** → replies with text (as usual)

---

**Useful links:**
- OpenClaw docs: https://docs.openclaw.ai
- TTS docs: https://docs.openclaw.ai/tts
- Audio docs: https://docs.openclaw.ai/nodes/audio
- Install skills: `npx clawhub search voice`

---

*Created: 2026-03-01 for OpenClaw 2026.2.26*