# Whisper Speech-to-Text Setup Whisper is a local speech recognition service that converts audio to text for VoiceMode using OpenAI's Whisper model. It provides offline STT capabilities with various model sizes to balance speed and accuracy. ## Quick Start ```bash # Install whisper service with default base model (includes Core ML on Apple Silicon!) voicemode whisper service install # Install with a different model voicemode whisper service install --model large-v3 # List available models and their status voicemode whisper model --all # Switch to a different model (auto-installs if needed) voicemode whisper model large-v2 # Start the service voicemode whisper service start ``` **Apple Silicon Bonus:** On M1/M2/M3/M4 Macs, VoiceMode automatically downloads pre-built Core ML models for 2-3x faster performance. No Xcode or Python dependencies required! Default endpoint: `http://127.0.0.1:2022/v1` ## Installation Methods ### Automatic Installation (Recommended) VoiceMode includes an installation tool that sets up Whisper.cpp automatically: ```bash # Install with default base model (142MB) - good balance of speed and accuracy voicemode whisper service install # Install with a specific model voicemode whisper service install --model small ``` This will: - Clone and build Whisper.cpp with GPU support (if available) - Download the specified model (default: base) - **On Apple Silicon:** Automatically download pre-built Core ML models for 2-3x faster performance - Create a start script with environment variable support - Set up automatic startup (launchd on macOS, systemd on Linux) ### Manual Installation #### macOS ```bash # Install via Homebrew brew install whisper.cpp # Download model mkdir -p ~/.voicemode/models/whisper cd ~/.voicemode/models/whisper curl -LO https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v2.bin ``` #### Linux ```bash # Clone and build whisper.cpp git clone https://github.com/ggerganov/whisper.cpp cd whisper.cpp make # Download model mkdir -p ~/.voicemode/models/whisper cd ~/.voicemode/models/whisper wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v2.bin ``` ### Prerequisites **macOS**: - Xcode Command Line Tools (`xcode-select --install`) - Only for building whisper.cpp - Homebrew (https://brew.sh) - cmake (`brew install cmake`) **Note for Apple Silicon users:** Core ML models are pre-built and downloaded automatically. No Xcode, PyTorch, or coremltools required! **Linux**: - Build essentials (`sudo apt install build-essential` on Ubuntu/Debian) ## Core ML Acceleration (Apple Silicon) On Apple Silicon Macs (M1/M2/M3/M4), VoiceMode automatically downloads pre-built Core ML models from Hugging Face for 2-3x faster transcription: - **Automatic:** Core ML models download alongside regular models - **No Dependencies:** No PyTorch, Xcode, or coremltools needed - **Pre-built:** Models are pre-compiled and ready to use - **Performance:** 2-3x faster than Metal acceleration alone Core ML models are included automatically when available. The installation process handles this transparently. ## Model Management ### Available Models | Model | Size | RAM Usage | Accuracy | Speed | Language Support | |-------|------|-----------|----------|-------|-----------------| | **tiny** | 39 MB | ~390 MB | Low | Fastest | Multilingual | | **tiny.en** | 39 MB | ~390 MB | Low | Fastest | English only | | **base** | 142 MB | ~500 MB | Fair | Fast | Multilingual | | **base.en** | 142 MB | ~500 MB | Fair | Fast | English only | | **small** | 466 MB | ~1 GB | Good | Moderate | Multilingual | | **small.en** | 466 MB | ~1 GB | Good | Moderate | English only | | **medium** | 1.5 GB | ~2.6 GB | Very Good | Slow | Multilingual | | **medium.en** | 1.5 GB | ~2.6 GB | Very Good | Slow | English only | | **large-v1** | 2.9 GB | ~3.9 GB | Excellent | Slower | Multilingual | | **large-v2** | 2.9 GB | ~3.9 GB | Excellent | Slower | Multilingual (recommended) | | **large-v3** | 3.1 GB | ~3.9 GB | Best | Slowest | Multilingual | | **large-v3-turbo** | 1.6 GB | ~2.5 GB | Very Good | Moderate | Multilingual | ### Model Commands ```bash # List all models with installation status voicemode whisper model --all # Show current active model voicemode whisper model # Switch to a model (auto-installs if not present) voicemode whisper model small.en # Switch model without auto-installing (fails if model not installed) voicemode whisper model medium --no-install # Switch model without restarting service voicemode whisper model large-v2 --no-restart ``` Note: After changing the active model with `--no-restart`, restart the whisper service manually for changes to take effect. ## Service Configuration ### Environment Variables Configure in `~/.voicemode/voicemode.env`: ```bash VOICEMODE_WHISPER_MODEL=large-v2 VOICEMODE_WHISPER_PORT=2022 VOICEMODE_WHISPER_THREADS= # Auto-detected if not set VOICEMODE_WHISPER_LANGUAGE=auto VOICEMODE_WHISPER_MODEL_PATH=~/.voicemode/models/whisper ``` **Thread Configuration**: By default, VoiceMode auto-detects the number of CPU cores and configures threads accordingly. You can override this by setting `VOICEMODE_WHISPER_THREADS` to a specific number. ### Running the Server #### OpenAI-Compatible Server Mode ```bash whisper-server \ --model models/ggml-large-v2.bin \ --host 127.0.0.1 \ --port 2022 \ --inference-path "/v1/audio/transcriptions" \ --threads 4 \ --processors 1 \ --convert \ --print-progress ``` Key options: - `--model`: Path to model file - `--host`: Server host (default: 127.0.0.1) - `--port`: Server port (VoiceMode expects 2022) - `--inference-path`: OpenAI-compatible endpoint path - `--threads`: Number of threads for processing (auto-detected by VoiceMode) - `--processors`: Number of parallel processors - `--convert`: Convert audio to required format automatically (required for VoiceMode) - `--print-progress`: Show transcription progress **Note**: When using VoiceMode's managed service, threads are auto-detected based on your CPU cores. The `--convert` flag is required for VoiceMode to work correctly with various audio formats. ### Service Management #### macOS (LaunchAgent) ```bash # Start/stop service launchctl load ~/Library/LaunchAgents/com.voicemode.whisper.plist launchctl unload ~/Library/LaunchAgents/com.voicemode.whisper.plist # Enable/disable at startup launchctl load -w ~/Library/LaunchAgents/com.voicemode.whisper.plist launchctl unload -w ~/Library/LaunchAgents/com.voicemode.whisper.plist # Check status launchctl list | grep whisper ``` #### Linux (Systemd) ```bash # Start/stop service systemctl --user start whisper systemctl --user stop whisper # Enable/disable at startup systemctl --user enable whisper systemctl --user disable whisper # Check status and logs systemctl --user status whisper journalctl --user -u whisper -f ``` ## Hardware Acceleration ### Apple Silicon (CoreML) CoreML provides 2-3x faster transcription on Apple Silicon Macs: ``` # Performance comparison # CPU Only: ~1x baseline # Metal: ~3-4x faster # CoreML + Metal: ~8-12x faster ``` Core ML models are downloaded automatically when installing Whisper on Apple Silicon. No additional configuration needed. ### GPU Acceleration The installation tool automatically detects and enables: - **Mac (Apple Silicon)**: Metal acceleration - **NVIDIA GPU**: CUDA acceleration - **CPU**: Optimized CPU builds ## Integration with VoiceMode VoiceMode automatically detects Whisper when available: 1. **First**: Checks for Whisper.cpp on `http://127.0.0.1:2022/v1` 2. **Fallback**: Uses OpenAI API (requires `OPENAI_API_KEY`) ### Custom Configuration To use a different endpoint or force Whisper use: ```bash export STT_BASE_URL=http://127.0.0.1:2022/v1 ``` Or in MCP configuration: ```json "voicemode": { ... "env": { "STT_BASE_URL": "http://127.0.0.1:2022/v1" } } ``` ## Fully Local Setup For completely offline voice processing, combine Whisper with Kokoro: ```bash export STT_BASE_URL=http://127.0.0.1:2022/v1 # Whisper for STT export TTS_BASE_URL=http://127.0.0.1:8880/v1 # Kokoro for TTS export TTS_VOICE=af_sky # Kokoro voice ``` ## Troubleshooting ### Service Won't Start - Check if port 2022 is already in use: `lsof -i :2022` - Verify model file exists at configured path - Check service logs for error messages ### Poor Transcription Quality - Try a larger model (base → small → medium → large) - Ensure audio input quality is good - Set specific language instead of 'auto' if known ### High CPU Usage - Use a smaller model for better performance - Consider English-only models (.en) for English content - Enable GPU acceleration if available ### Model Installation Issues - Verify adequate disk space (models range from 39MB to 3GB) - Check network connectivity to Hugging Face - Delete corrupted model files from `~/.voicemode/models/whisper/` and re-run the model command ## Performance Monitoring ```bash # Check service status voicemode whisper service status # Monitor real-time processing tail -f ~/.voicemode/services/whisper/logs/performance.log # List available models voicemode whisper model --all ``` ## File Locations - **Models**: `~/.voicemode/models/whisper/` or `~/.voicemode/services/whisper/models/` - **Service Config**: `~/.voicemode/services/whisper/config.json` - **Model Preferences**: `~/.voicemode/whisper-models.txt` - **Logs**: `~/.voicemode/services/whisper/logs/` - **LaunchAgent** (macOS): `~/Library/LaunchAgents/com.voicemode.whisper.plist` - **Systemd Service** (Linux): `~/.config/systemd/user/whisper.service`