---
name: local-tts
version: 1.0.0
description: Local text-to-speech using MLX and Kokoro model
triggers:
  - local-tts
  - local tts
  - generate audio locally
  - kokoro tts
dependencies:
  - mlx-audio (via uv --with)
  - pydub (via uv --with)
---

# Local TTS Skill

Generate high-quality speech audio locally using Apple Silicon MLX acceleration and the Kokoro-82M model. No API keys or recurring costs.

## Quick Start

```bash
# Generate MP3 from text
uv run --with mlx-audio --with pydub skills/local-tts/scripts/generate_audio.py \
  --text "Hello, this is a test." \
  --output ~/Desktop/test.mp3

# Generate from file
uv run --with mlx-audio --with pydub skills/local-tts/scripts/generate_audio.py \
  --file /tmp/script.txt \
  --voice af_heart \
  --output ~/Desktop/podcast.mp3

# List available voices
uv run --with mlx-audio skills/local-tts/scripts/list_voices.py
```

## Parameters

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| `--text` | One of text/file | - | Text to convert |
| `--file` | One of text/file | - | Path to text file |
| `--voice` | No | `af_heart` | Voice preset |
| `--output` | Yes | - | Output file path (.mp3, .wav) |
| `--model` | No | `Kokoro-82M-bf16` | Model to use |
| `--list-voices` | No | - | Show available voices |

## Voice Presets

### American English Female (prefix: af_)

- `af_heart` - Warm, friendly **(default)**
- `af_bella` - Soft, calm
- `af_nova` - Clear, professional
- `af_river` - Clear, confident
- `af_sarah` - Soft, expressive

### American English Male (prefix: am_)

- `am_adam` - Clear, professional
- `am_echo` - Deep, smooth
- `am_liam` - Articulate, conversational
- `am_michael` - Soft, measured

### British English (prefix: bf_, bm_)

- `bf_emma` - Clear, refined female
- `bm_daniel` - Clear, professional male
- `bm_george` - Distinguished male

See `references/voices.md` for full list.

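The preset names listed under Voice Presets follow a two-letter prefix convention: the first letter marks the accent (`a` for American English, `b` for British English) and the second the gender (`f` female, `m` male). Here is a minimal sketch of decoding that prefix, based only on the presets listed in this document. `describe_voice` is a hypothetical helper, not part of the skill's scripts, and any other prefixes in `references/voices.md` are not covered.

```python
# Hypothetical helper: decode the two-letter prefix of a voice preset name.
# The mappings below cover only the prefixes listed in this document.
ACCENTS = {"a": "American English", "b": "British English"}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice: str) -> str:
    """Return a description such as 'American English female'."""
    prefix, sep, name = voice.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice name format: {voice!r}")
    accent_code, gender_code = prefix
    if accent_code not in ACCENTS or gender_code not in GENDERS:
        raise ValueError(f"unknown voice prefix: {prefix!r}")
    return f"{ACCENTS[accent_code]} {GENDERS[gender_code]}"


if __name__ == "__main__":
    for voice in ("af_heart", "am_adam", "bf_emma", "bm_george"):
        print(f"{voice}: {describe_voice(voice)}")
```

A check like this can reject a mistyped `--voice` value before the model loads, although the authoritative list remains whatever `list_voices.py` reports.
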
## Output Format

```json
{
  "success": true,
  "file": "/Users/hagelk/Desktop/podcast.mp3",
  "voice": "af_heart",
  "model": "Kokoro-82M-bf16",
  "characters": 9824,
  "chunks": 20,
  "duration_seconds": 612.5,
  "generation_time": 45.2
}
```

## Performance

| Hardware | Speed | Notes |
|----------|-------|-------|
| M3 Pro 36GB | ~3-4x realtime | First run slower (model loading) |
| M1/M2 Mac Mini 8GB | ~1.5x realtime | Works well for briefings |
| M1/M2 Mac Mini 16GB | ~2x realtime | Comfortable headroom |

## Technical Details

- **Model**: Kokoro-82M-bf16 (~200MB download on first run)
- **Sample rate**: 24kHz mono
- **Chunking**: Text split at ~400 chars per chunk for quality
- **Concatenation**: Chunks joined seamlessly via pydub
- **Formats**: MP3, WAV, M4A, OGG

## Important Notes

1. **MUST use `--with` flags** - Do not use PEP 723 inline deps. mlx-audio requires uv's cached environment.
2. **First run is slower** - Model downloads ~200MB and espeak dependencies initialize.
3. **Model cached at**: `~/.cache/huggingface/hub/models--mlx-community--Kokoro-82M-bf16/`

## Integration with Morning Briefing

The morning-briefing skill uses this for podcast generation:

```bash
uv run --with mlx-audio --with pydub skills/local-tts/scripts/generate_audio.py \
  --file /tmp/morning_briefing_podcast.txt \
  --voice af_heart \
  --output ~/Desktop/morning_briefing.mp3
```
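
Technical Details mentions that text is split into chunks of roughly 400 characters before synthesis. The actual splitting logic lives in `generate_audio.py` and is not shown in this document. What follows is only a minimal sketch of one way to do it, assuming chunks break at sentence boundaries. `chunk_text` and its `max_chars` parameter are illustrative names, not the script's API.

```python
import re


def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Split text into chunks of roughly max_chars, breaking at sentence ends.

    Assumption: a single sentence longer than max_chars is kept whole rather
    than cut mid-sentence, so a chunk can exceed the limit in that case.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "This is a sentence. " * 50
    parts = chunk_text(sample)
    print(len(parts), [len(p) for p in parts])
```

Breaking at sentence ends keeps each chunk a complete utterance, which is one plausible reading of "for quality" in the note above. The `chunks` field in the JSON output would then correspond to `len(chunk_text(...))` under this sketch.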