---
name: impressions
description: Add and use custom voices for VoiceMode TTS via local mlx-audio. Use when the user wants to clone a voice, do an impression, add a reference clip, or use voice="" in converse.
---

# Impressions

Make VoiceMode speak in any voice. The model takes a short reference clip and synthesises fresh speech in that voice via local Qwen3-TTS on top of mlx-audio.

> **Status:** Preview / experimental. Apple Silicon only. Opt-in.

## When to use this skill

- User asks for "voice cloning", "do an impression", "speak as X", "add my voice"
- A `voice=` argument in `voicemode:converse` doesn't match a known Kokoro voice
- User wants to install or troubleshoot the `mlx-audio` service
- User asks how to configure a remote mlx-audio server

## Quick start

```bash
# 1. Install the local TTS service (one-time, Apple Silicon only)
voicemode service install mlx-audio

# 2. Add a voice from a reference clip
voicemode clone add fleabag ~/Downloads/fleabag-clip.wav

# 3. Use it
voicemode converse --voice fleabag
```

In the MCP `converse` tool, pass `voice="fleabag"` -- VoiceMode auto-routes any voice that matches a profile in `VOICEMODE_VOICES_DIR` to mlx-audio instead of Kokoro / OpenAI.

## Reference clip requirements

`voicemode clone add` validates the input before doing any expensive work:

- **Duration: 3-9 seconds** (5-9s sweet spot). Clips outside this window are rejected with an actionable error.
- **Mono speech, no music or cross-talk.** The model copies what it hears -- including hum, music beds, laugh tracks, and overlapping speakers.
- **Any input format accepted.** WAV, MP3, M4A, etc. -- ffmpeg normalises whatever you hand it.
- **Output is always mono 24 kHz 16-bit PCM with loudnorm I=-16 TP=-1.5 LRA=11.** This is the canonical voice-lab format; the original input is replaced by this normalised render at `default.wav`.
- **ALWAYS pair the clip with its transcript.** The model conditions on the reference text; without one it ASRs the clip itself, and any mis-hearing (noisy or vintage audio especially) corrupts the conditioning -- the symptom is **stammering / stuttered synthesis**. `voicemode clone add` auto-transcribes into `voice.md` (verify it -- correct mis-hearings by hand); voice-lab's `sayas` reads `.txt` next to each wav; the MCP `converse` tool takes `ref_text` alongside a clip-path `voice`. (Root-caused on VL-50, 2026-06-11: 1977 Doctor Who clips stammered until transcripts were supplied -- then "much better!!!".)

### Trimming a too-long clip

If your source is longer than 9 seconds, trim with the same one-liner the runtime error suggests:

```bash
ffmpeg -i in.wav -ss 0 -t 8 out.wav
```

## On-disk layout

Voices live as **directories** under `~/.voicemode/voices//`:

```
~/.voicemode/voices/fleabag/
├── default.wav   # required: 3-9s of clean reference audio, mono 24kHz 16-bit PCM
└── voice.md      # auto-generated by `voicemode clone add` -- name, source, duration, format, transcript
```

`voice.md` carries YAML front matter with `name`, `source` (original input path), `duration_seconds`, `format` (literal `mono 24kHz 16-bit PCM, loudnorm I=-16 TP=-1.5 LRA=11`), and `transcript`. It documents what the clip is and where it came from.

`voices.json` at the voices root is retained as a **legacy index** -- `voicemode clone add` writes an entry pointing at `/default.wav` so older consumers keep working. Prefer the directory layout above for new work.

Multiple WAVs are allowed alongside `default.wav`; symlink whichever one is "active" to `default.wav`. A directory with multiple WAVs and no `default.wav` is treated as a sample bin and skipped.

## Picking a clip

5-9 seconds of clean conversational speech beats 30 seconds of noisy podcast audio. The model copies what it hears -- including hum, music beds, and laugh tracks.
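
To screen candidate clips by length before running `voicemode clone add`, something like the following sketch may help. It assumes `ffprobe` (bundled with ffmpeg) is on the PATH. The helper names `clip_duration` and `duration_verdict` are hypothetical and not part of VoiceMode. Treating the 3 and 9 second boundaries as inclusive is also an assumption; the real validator may round differently.

```bash
# Sketch only: pre-check a clip's length against the 3-9 s window.
# Helper names are hypothetical; VoiceMode's own validation may differ.

# Print the clip duration in seconds (requires ffprobe, bundled with ffmpeg).
clip_duration() {
  ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Classify a duration in seconds as too-short, ok, or too-long.
# Assumption: both ends of the 3-9 s window are inclusive.
duration_verdict() {
  awk -v d="$1" 'BEGIN {
    if (d < 3)      print "too-short"
    else if (d > 9) print "too-long"
    else            print "ok"
  }'
}
```

A typical call would be `duration_verdict "$(clip_duration ~/Downloads/fleabag-clip.wav)"`. A `too-long` result is the cue to trim the clip with `ffmpeg -ss ... -t ...` first.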
See [docs/finding-samples.md](docs/finding-samples.md) for ranking heuristics, an mlx-whisper word-timestamp ranker concept, and `ffmpeg loudnorm` recipes.

## Configuration

| Variable                       | Default                                       | Purpose                                          |
| ------------------------------ | --------------------------------------------- | ------------------------------------------------ |
| `VOICEMODE_VOICES_DIR`         | `~/.voicemode/voices`                         | Where voice profiles live                        |
| `VOICEMODE_REMOTE_VOICES_DIR`  | *(unset)*                                     | Path on remote mlx-audio host (path translation) |
| `VOICEMODE_MLX_AUDIO_BASE_URL` | `http://127.0.0.1:8890/v1`                    | OpenAI-compatible mlx-audio endpoint             |
| `VOICEMODE_IMPRESSIONS_MODEL`  | `mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16` | Hugging Face model ID                            |

### Deprecated aliases (one release only)

The unreleased 8.7.0 candidate used `VOICEMODE_CLONE_*` names. They're honoured in 8.7.x with a one-shot deprecation warning and **removed in 8.8.0**:

| Deprecated                 | Use instead                    |
| -------------------------- | ------------------------------ |
| `VOICEMODE_CLONE_BASE_URL` | `VOICEMODE_MLX_AUDIO_BASE_URL` |
| `VOICEMODE_CLONE_MODEL`    | `VOICEMODE_IMPRESSIONS_MODEL`  |
| `VOICEMODE_CLONE_PORT`     | `VOICEMODE_MLX_AUDIO_PORT`     |

If you see those in a user's `voicemode.env`, suggest updating them.

## Footguns

- **Missing reference transcript = stammering.** A clip without its transcript forces the model to ASR the reference itself; on anything but clean modern audio that mis-hears, and the synthesis stutters. Fix: `.txt` beside the wav (sayas), `ref_text` in converse, corrected `transcript:` in `voice.md`. See "Reference clip requirements".
- **Kokoro name collisions** -- naming a voice `af_sky` (or any other Kokoro voice name) shadows the Kokoro voice. Pick distinctive names like `fleabag`, `mike-2026`, `bryan_morning`.
- **Apple Silicon only** -- no fallback for Intel Macs / Linux / Windows. Don't suggest installing mlx-audio on those platforms.
- **First synthesis is slow** -- ~3.4 GB model download on first call.
  Warn the user.

## Deep dives

- [docs/setup.md](docs/setup.md) -- install path, model quants table, remote mlx-audio config, troubleshooting.
- [docs/finding-samples.md](docs/finding-samples.md) -- clip ranking heuristic, ffmpeg loudnorm recipe, link to voice-lab.

## Related

- [Impressions guide](../../../docs/guides/impressions.md) -- user-facing prose version of this skill.
- [VoiceMode skill](../voicemode/SKILL.md) -- primary voice interaction skill.
- [voice-lab](https://github.com/mbailey/voice-lab) -- companion repo for curating reference clips and personas.