# Speech-to-Text (Whisper)

The CLI can turn speech into text in two places:

- **Chat mode** - the [`/voice`](shortcuts-guide.md#voice-shortcut) shortcut records your microphone,
  transcribes it, and drops the text into the input field.
- **Channels mode** - inbound **Telegram voice messages** are transcribed to text before the agent
  sees them, so you can talk to the agent from your phone.

Both use a local [whisper.cpp](https://github.com/ggml-org/whisper.cpp) binary and run fully offline
once the model is downloaded. The feature is **disabled by default** because the model download can
take time and the runtime depends on external tools.

## Prerequisites

Speech-to-text shells out to two external programs (no CGO is added to the `infer` binary):

| Tool | Used for | Install |
| --- | --- | --- |
| `whisper-cli` (or `whisper-cpp`) | Transcription | macOS: `brew install whisper-cpp` · Nix: `nix profile install nixpkgs#openai-whisper-cpp` · or build from [whisper.cpp](https://github.com/ggml-org/whisper.cpp) |
| `ffmpeg` | Microphone capture & decoding voice messages (OGG/Opus → WAV) | macOS: `brew install ffmpeg` · Debian/Ubuntu: `apt install ffmpeg` |

On Linux, `arecord` (ALSA) or `sox` can substitute for `ffmpeg` for microphone capture.

If a required tool is missing, the CLI reports an actionable error naming what to install - it never
fails silently.

## Enabling

Add a `speech_to_text` section to `.infer/config.yaml` (or `~/.infer/config.yaml`):

```yaml
speech_to_text:
  enabled: true          # feature flag (default: false)
  engine: whisper.cpp    # transcription engine
  model: tiny            # tiny | base | small | medium | large-v3-turbo | *.en (default: tiny)
  language: ""           # ISO code (e.g. "en"); empty = auto-detect
  auto_download: true    # download the model on first use if missing
  max_recording_seconds: 30  # /voice hard recording cap
  silence_timeout: 2     # stop /voice this many seconds after you go quiet (0 = record full cap)
  retain_recordings: 0   # keep the last N inbound Telegram voice/audio files (0 = keep none)
  # Optional overrides:
  binary_path: ""        # explicit whisper-cli/whisper-cpp path; empty = resolve on PATH
  ffmpeg_path: ""        # explicit ffmpeg path; empty = resolve on PATH
  models_dir: ""         # where models are cached; empty = ~/.infer/models/whisper
  recordings_dir: ""     # where retained recordings live; empty = ~/.infer/voice
  input_device: ""       # microphone device; empty = platform default
  timeout: 120           # transcription timeout (seconds)
```

Every field can also be set via environment variables, e.g.
`INFER_SPEECH_TO_TEXT_ENABLED=true`, `INFER_SPEECH_TO_TEXT_MODEL=base`.

## Models

The GGML model is downloaded on first use from
`https://huggingface.co/ggerganov/whisper.cpp` and cached under `~/.infer/models/whisper/`
(e.g. `ggml-tiny.bin`). Pick a model with the `model` setting; larger models are more accurate but
slower and heavier:

| Model | Size | Notes |
| --- | --- | --- |
| `tiny` | ~75 MB | Fastest, lowest accuracy (default) |
| `base` | ~142 MB | Good balance |
| `small` | ~466 MB | More accurate |
| `medium` | ~1.5 GB | High accuracy |
| `large-v3-turbo` | ~1.5 GB | Best accuracy, optimized speed |

Append `.en` (e.g. `base.en`) for English-only variants. You can also pass a full filename
(`ggml-small.bin`) or place a model in `models_dir` manually and set `auto_download: false`.

## Using `/voice` in chat

1. Type `/voice` and press Enter - recording starts immediately.
2. Speak. Recording stops automatically about `silence_timeout` seconds after you
   go quiet (or at the `max_recording_seconds` cap, or `/voice 8` per-call), and the
   transcription lands in the input field. Set `silence_timeout: 0` to always record
   the full cap instead.
3. Review/edit the text and press Enter to send.

`/voice` only appears when `speech_to_text.enabled` is `true`.

## Telegram voice messages

When `speech_to_text.enabled` is set and you run `infer channels-manager`, voice notes sent to your
Telegram bot are downloaded, decoded with `ffmpeg`, transcribed, and forwarded to the agent as text.
When speech-to-text is disabled, voice messages are ignored (as before). See
[Channels](channels.md) for channel setup.

By default the downloaded audio is transcribed and then deleted. To keep the original files, set
`retain_recordings` to the number of recent recordings to keep (e.g. `10`). Retained files are written
to `recordings_dir` (default `~/.infer/voice/`) with their original extension, and the oldest are
pruned automatically once the cap is exceeded. `retain_recordings: 0` (the default) keeps nothing.

## Troubleshooting

- **"whisper binary not found"** - install whisper.cpp or set `speech_to_text.binary_path`.
- **"ffmpeg not found"** - install ffmpeg or set `speech_to_text.ffmpeg_path`.
- **No audio captured on macOS** - grant microphone permission to your terminal, and list devices
  with `ffmpeg -f avfoundation -list_devices true -i ""`, then set `input_device` to the index.
- **Wrong language** - set `language` to the ISO code instead of relying on auto-detect.
- **First `/voice` is slow** - the model downloads once; subsequent runs use the cache.