# VibeVoice vLLM ASR Deployment

Deploy VibeVoice ASR model as a high-performance API service using [vLLM](https://github.com/vllm-project/vllm). This plugin provides OpenAI-compatible API endpoints for speech-to-text transcription with streaming support. ## 🔥 Key Features - **🚀 High-Performance Serving**: Optimized for high-throughput ASR inference with vLLM's continuous batching - **📡 OpenAI-Compatible API**: Standard `/v1/chat/completions` endpoint with streaming support - **🎵 Long Audio Support**: Process up to 60+ minutes of audio in a single request - **🔌 Plugin Architecture**: No vLLM source code modification required - just install and run - **⚡ Data Parallel (DP)**: Run independent model replicas across multiple GPUs with automatic load balancing behind a single port ## 🛠️ Installation Using Official vLLM Docker Image (Recommended) 1. Clone the repository ```bash git clone https://github.com/microsoft/VibeVoice.git cd VibeVoice ``` 2. Launch the server (background mode) ```bash docker run -d --gpus all --name vibevoice-vllm \ --ipc=host \ -p 8000:8000 \ -e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \ -e PYTORCH_ALLOC_CONF=expandable_segments:True \ -v $(pwd):/app \ -w /app \ --entrypoint bash \ vllm/vllm-openai:v0.14.1 \ -c "python3 /app/vllm_plugin/scripts/start_server.py" ``` ## ⚡ Multi-GPU Deployment The launcher supports two types of GPU parallelism via `--tp` and `--dp` flags: | Flag | Name | What it does | |------|------|-------------| | `--tp N` | Tensor Parallel | Splits **one model** across N GPUs (for models too large for a single GPU) | | `--dp N` | Data Parallel | Runs **N independent replicas**, one per GPU, with automatic load balancing behind a single port | ### Data Parallel (Recommended for scaling throughput) Run N independent replicas on N GPUs with automatic load balancing behind a single port. When `--dp N` is specified (N > 1), the launcher automatically starts N independent vLLM processes behind an nginx reverse proxy (2×N workers) for optimal throughput: ```bash docker run -d --gpus '"device=0,1,2,3"' --name vibevoice-vllm \ --ipc=host \ -p 8000:8000 \ -e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \ -e PYTORCH_ALLOC_CONF=expandable_segments:True \ -v $(pwd):/app \ -w /app \ --entrypoint bash \ vllm/vllm-openai:v0.14.1 \ -c "python3 /app/vllm_plugin/scripts/start_server.py --dp 4" ``` Run on all 8 GPUs: ```bash docker run -d --gpus all --name vibevoice-vllm \ --ipc=host \ -p 8000:8000 \ -e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \ -e PYTORCH_ALLOC_CONF=expandable_segments:True \ -v $(pwd):/app \ -w /app \ --entrypoint bash \ vllm/vllm-openai:v0.14.1 \ -c "python3 /app/vllm_plugin/scripts/start_server.py --dp 8" ``` ### Tensor Parallel Split a single model across 2 GPUs (useful if GPU memory is limited): ```bash docker run -d --gpus '"device=0,1"' --name vibevoice-vllm \ --ipc=host \ -p 8000:8000 \ -e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \ -e PYTORCH_ALLOC_CONF=expandable_segments:True \ -v $(pwd):/app \ -w /app \ --entrypoint bash \ vllm/vllm-openai:v0.14.1 \ -c "python3 /app/vllm_plugin/scripts/start_server.py --tp 2" ``` ### Hybrid (DP × TP) Combine both — e.g., 2 replicas, each split across 2 GPUs (4 GPUs total): ```bash docker run -d --gpus '"device=0,1,2,3"' --name vibevoice-vllm \ --ipc=host \ -p 8000:8000 \ -v $(pwd):/app \ -w /app \ --entrypoint bash \ vllm/vllm-openai:v0.14.1 \ -c "python3 /app/vllm_plugin/scripts/start_server.py --dp 2 --tp 2" ``` > **Note**: Total GPUs required = `dp × tp`. Make sure to expose enough GPU devices in the Docker `--gpus` flag. 3. View logs ```bash docker logs -f vibevoice-vllm ``` > **Note**: > - The `-d` flag runs the container in background (detached mode) > - Use `docker stop vibevoice-vllm` to stop the service > - The model will be downloaded to HuggingFace cache (`~/.cache/huggingface`) inside the container ## 🚀 Usages ### Test the API Once the vLLM server is running, test it with the provided script: ```bash # Basic transcription docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api.py /app/audio.wav # With hotwords for better recognition of specific terms docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api.py /app/audio.wav --hotwords "Microsoft,VibeVoice" ``` ```bash # With auto-recovery from repetition loops (for long audio) docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api_auto_recover.py /app/audio.wav # Auto-recover with hotwords docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api_auto_recover.py /app/audio.wav --hotwords "Microsoft,VibeVoice" ``` > **Note**: > - The audio/video file must be inside the mounted directory (`/app` in the container). Copy your files to the VibeVoice folder before testing. > - Hotwords help improve recognition of domain-specific terms like proper nouns, technical terms, and speaker names. ### Environment Variables | Variable | Description | Default | |----------|-------------|---------| | `VIBEVOICE_FFMPEG_MAX_CONCURRENCY` | Maximum FFmpeg processes for audio decoding | `64` | | `PYTORCH_ALLOC_CONF` | PyTorch memory allocator config | `expandable_segments:True` | ## 📊 Performance Tips 1. **GPU Memory**: Use `--gpu-memory-utilization 0.9` for maximum throughput if you have dedicated GPU 2. **Batch Size**: Increase `--max-num-seqs` for higher concurrency (requires more GPU memory) 3. **FFmpeg Concurrency**: Tune `VIBEVOICE_FFMPEG_MAX_CONCURRENCY` based on CPU cores ## 🚨 Troubleshooting ### Common Issues 1. **"CUDA out of memory"** - Reduce `--gpu-memory-utilization` - Reduce `--max-num-seqs` - Use smaller `--max-model-len` 2. **"Audio decoding failed"** - Ensure FFmpeg is installed: `ffmpeg -version` - Check audio file format is supported 3. **"Model not found"** - Ensure model path contains `config.json` and model weights - Generate tokenizer files if missing 4. **"Plugin not loaded"** - Verify installation: `pip show vibevoice` - Check entry point: `pip show -f vibevoice | grep entry`