# MOSS-TTS-Delay: llama.cpp Inference Backend [English](README.md) | [简体中文](README_zh.md) This package provides a **torch-free** (or torch-optional) end-to-end TTS inference pipeline for MOSS-TTS-Delay using: - **llama.cpp** for the Qwen3 backbone (GGUF format, GPU/CPU) - **NumPy** for embeddings, LM heads, delay state machine, and sampling - **ONNX Runtime** or **TensorRT** for the audio tokenizer When PyTorch is available, LM heads can optionally be GPU-accelerated (~30x faster). ## Prerequisites 1. **llama.cpp** — compiled from source with shared library support 2. **Python >= 3.10** ## Installation ### Minimal (torch-free, ONNX audio) ```bash pip install -e ".[llama-cpp-onnx]" ``` ### With TensorRT audio (max performance) ```bash pip install -e ".[llama-cpp-trt]" ``` ### With PyTorch LM heads acceleration ```bash pip install -e ".[llama-cpp-trt,llama-cpp-torch]" ``` ## Weight Preparation > To convert weights from the original MOSS-TTS model yourself (instead of downloading pre-quantized ones), see the [conversion guide](conversion/README.md). ### Step 1: Download pre-quantized TTS backbone & weights We provide pre-quantized GGUF backbone, embedding tables, and LM head matrices on HuggingFace: ```bash # Download pre-built GGUF + embeddings + lm_heads huggingface-cli download OpenMOSS-Team/MOSS-TTS-GGUF --local-dir weights/MOSS-TTS-GGUF ``` This gives you: - `weights/MOSS-TTS-GGUF/MOSS_TTS_Q4_K_M.gguf` — Q4_K_M quantized backbone - `weights/MOSS-TTS-GGUF/embeddings/` — 33 embedding `.npy` files - `weights/MOSS-TTS-GGUF/lm_heads/` — 33 LM head `.npy` files - `weights/MOSS-TTS-GGUF/tokenizer/` — BPE tokenizer files ### Step 2: Download ONNX audio tokenizer We provide ONNX models for the audio tokenizer. **TensorRT engines are not provided** because they are tied to specific GPU architectures and TensorRT versions. ```bash # Download ONNX encoder & decoder huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX --local-dir weights/MOSS-Audio-Tokenizer-ONNX ``` ### Step 3: Build the C bridge ```bash # Clone and build llama.cpp (if not already done) git clone https://github.com/ggerganov/llama.cpp cd llama.cpp && cmake -B build && cmake --build build --config Release -j cd .. # Build the C bridge shared library cd moss_tts_delay/llama_cpp bash build_bridge.sh /path/to/llama.cpp ``` ### Step 4 (Optional): Build TensorRT engines > **Note:** Only needed if you want to use `audio_backend: trt` for maximum audio tokenizer performance. Most users should use the ONNX backend. ```bash bash moss_audio_tokenizer/trt/build_engine.sh \ weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx \ weights/MOSS-Audio-Tokenizer-ONNX/decoder.onnx \ weights/MOSS-Audio-Tokenizer-TRT ``` > **⚠️ maxShapes determines the maximum audio length your engine can handle.** > The default builds support up to **40 seconds** of audio. If you need longer audio, > edit `MAX_AUDIO_SECONDS` in `build_engine.sh` before building. > See the detailed shape ↔ duration table in the script's comments. ## Usage ### CLI ```bash # Basic generation python -m moss_tts_delay.llama_cpp \ --config configs/llama_cpp/default.yaml \ --text "Hello, world!" \ --output output.wav # With reference audio (voice cloning) python -m moss_tts_delay.llama_cpp \ --config configs/llama_cpp/default.yaml \ --text "Hello!" \ --reference ref.wav \ --output output.wav # Force numpy LM heads (torch-free) python -m moss_tts_delay.llama_cpp \ --config configs/llama_cpp/default.yaml \ --text "Hello!" \ --heads-backend numpy # With profiling python -m moss_tts_delay.llama_cpp \ --config configs/llama_cpp/default.yaml \ --text "Hello!" \ --profile ``` ### Python API ```python from moss_tts_delay.llama_cpp import LlamaCppPipeline, PipelineConfig config = PipelineConfig.from_yaml("configs/llama_cpp/default.yaml") with LlamaCppPipeline(config) as pipeline: waveform = pipeline.generate( text="Hello, world!", reference_audio="ref.wav", # optional language="en", ) import soundfile as sf sf.write("output.wav", waveform, 24000) ``` ### Batch Evaluation ```bash python scripts/batch_eval_llama_cpp.py \ --config configs/llama_cpp/default.yaml \ --benchmark-dir /path/to/eval/tts \ --result-dir results/llama_cpp_run \ --suite seed-tts ``` ## Benchmark Quantization quality evaluated on [Seed-TTS-eval](https://github.com/BytedanceSpeech/seed-tts-eval) zero-shot benchmark. Baseline is the original HuggingFace model; GGUF variants use the llama.cpp backend with TensorRT audio tokenizer. | Quantization | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ | |---|---:|---:|---:|---:| | Baseline (HuggingFace) | 1.79 | 71.46 | 1.32 | 77.05 | | Q8_0 | 3.21 | 68.61 | 1.56 | 76.03 | | Q6_K | 3.11 | 68.77 | 1.44 | 76.06 | | Q5_K_M | 2.95 | 68.55 | 1.50 | 75.96 | | Q4_K_M | 2.83 | 68.15 | 1.58 | 75.71 | ## Configuration ### Config Files | Config | Audio Backend | Use Case | |--------|--------------|----------| | `configs/llama_cpp/default.yaml` | ONNX | Recommended starting point | | `configs/llama_cpp/trt.yaml` | TensorRT | Maximum throughput | | `configs/llama_cpp/cpu-only.yaml` | ONNX (CPU) | No GPU required | ### Key Options | Option | Values | Description | |--------|--------|-------------| | `heads_backend` | `auto` / `numpy` / `torch` | LM heads computation backend. `auto` uses torch if available | | `audio_backend` | `onnx` / `trt` / `torch` | Audio tokenizer backend | | `n_gpu_layers` | `-1` / `0` / `N` | GPU offload layers. -1 = all, 0 = CPU only | | `n_ctx` | int | Context window size (prompt + generation) | | `max_new_tokens` | int | Maximum generation steps | ## Architecture ``` Input text │ ▼ Tokenizer (Rust BPE, tokenizers library) │ ▼ build_generation_prompt() → input_ids (S, 33) │ ▼ EmbeddingLookup (NumPy .npy) → embeddings (S, H) │ ▼ LlamaCppBackbone (GGUF, C bridge) → hidden_state (H,) │ ├─ [heads_backend=torch] TorchLMHeads (nn.Linear, GPU) │ └─ audio_logits (32, 1025) │ └─ [heads_backend=numpy] NumpyLMHeads (CPU matmul) └─ audio_logits (32, 1025) │ ▼ delay_step() + sampling (NumPy) → next_ids (33,) │ ▼ (loop until EOS) │ Audio codes → AudioTokenizer (ONNX/TRT/Torch) → waveform ``` ## File Structure ``` moss_tts_delay/llama_cpp/ ├── __init__.py # Package entry, exports LlamaCppPipeline ├── __main__.py # python -m moss_tts_delay.llama_cpp ├── _constants.py # Token IDs (from config.json, torch-free) ├── pipeline.py # LlamaCppPipeline (main entry) ├── backbone.py # LlamaCppBackbone (C bridge wrapper) ├── backbone_bridge.c # C bridge source ├── build_bridge.sh # Build script ├── embedding.py # EmbeddingLookup (NumPy) ├── lm_heads.py # NumpyLMHeads + TorchLMHeads ├── delay_state.py # Delay state machine (NumPy) ├── sampling.py # top-k/p sampling (NumPy) ├── processor.py # Tokenizer + prompt builder ├── README.md # This file ├── README_zh.md # Chinese documentation └── conversion/ ├── extract_weights.py # Weight extraction script ├── README.md # Conversion guide (English) └── README_zh.md # Conversion guide (Chinese) ```