# Converting MOSS-TTS Weights to GGUF

[English](README.md) | [简体中文](README_zh.md)

This guide walks through converting the original MOSS-TTS (HuggingFace) weights into the GGUF format used by the llama.cpp inference backend. If you just want to **use** the pre-converted weights, skip this guide and download them directly:

```bash
huggingface-cli download OpenMOSS-Team/MOSS-TTS-GGUF --local-dir weights/MOSS-TTS-GGUF
```

## Overview

The conversion pipeline has three steps:

1. **Extract weights**: split the MOSS-TTS model into a standalone Qwen3 backbone (safetensors), embedding tables (`.npy`), and LM head matrices (`.npy`).
2. **Convert to GGUF**: convert the Qwen3 backbone safetensors to an unquantized half-precision (f16) GGUF file using llama.cpp's `convert_hf_to_gguf.py`.
3. **Quantize**: quantize the f16 GGUF to a smaller format (e.g. Q4_K_M) using `llama-quantize`.

```
OpenMOSS-Team/MOSS-TTS (HuggingFace)
  │
  ▼  Step 1: extract_weights.py
  ├── qwen3_backbone/     (safetensors + config.json)
  ├── embeddings/          (33 × .npy)
  └── lm_heads/            (33 × .npy)
        │
        ▼  Step 2: convert_hf_to_gguf.py
        backbone_f16.gguf
        │
        ▼  Step 3: llama-quantize
        backbone_q4km.gguf
```

## Prerequisites

- Python >= 3.10
- `safetensors`, `numpy`, `torch`, `huggingface_hub` (`pip install safetensors numpy torch huggingface_hub`)
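- A compiled [llama.cpp](https://github.com/ggerganov/llama.cpp) tree (for `convert_hf_to_gguf.py` and `llama-quantize`); building it requires CMake and a C/C++ compiler
- The Python packages that `convert_hf_to_gguf.py` imports; llama.cpp lists them in the `requirements.txt` at the root of its repository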

### Building llama.cpp

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
cd ..
```

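After cloning and building, the two tools used in this guide are at:
- `llama.cpp/convert_hf_to_gguf.py`: the HF-to-GGUF conversion script (part of the checkout, no build needed)
- `llama.cpp/build/bin/llama-quantize`: the quantization tool (produced by the build)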
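
Before moving on, it can help to confirm that both files are where the later steps expect them. The following is a minimal sketch, not part of the repository; the helper name `missing_tools` is hypothetical, and the paths are the ones listed above.

```python
from pathlib import Path

# Paths relative to the llama.cpp checkout, as listed above.
REQUIRED_TOOLS = ("convert_hf_to_gguf.py", "build/bin/llama-quantize")


def missing_tools(llama_cpp_dir):
    """Return the required paths that are not regular files under llama_cpp_dir."""
    root = Path(llama_cpp_dir)
    return [rel for rel in REQUIRED_TOOLS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_tools("llama.cpp")
    print("missing:", ", ".join(missing) if missing else "nothing")
```

With multi-config CMake generators (for example Visual Studio), the binary may instead land in a per-configuration subdirectory and carry an `.exe` suffix, so adjust `REQUIRED_TOOLS` if the check reports a tool you know was built.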

## Step 1: Extract Weights

This step splits the full MOSS-TTS model into three component groups: the Qwen3 backbone, the embedding tables, and the LM heads. If `--model` is a HuggingFace repo ID rather than a local path, the script downloads the model automatically.

```bash
python moss_tts_delay/llama_cpp/conversion/extract_weights.py \
    --model OpenMOSS-Team/MOSS-TTS \
    --output weights/extracted
```

To use a **local** model directory instead of downloading:

```bash
python moss_tts_delay/llama_cpp/conversion/extract_weights.py \
    --model /path/to/MOSS-TTS \
    --output weights/extracted
```

### Output structure

```
weights/extracted/
├── qwen3_backbone/
│   ├── config.json                          # Qwen3ForCausalLM config
│   ├── model-00001-of-00004.safetensors     # backbone shards
│   ├── model-00002-of-00004.safetensors
│   ├── model-00003-of-00004.safetensors
│   ├── model-00004-of-00004.safetensors
│   ├── model.safetensors.index.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── ...
├── embeddings/
│   ├── embed_tokens.npy      # shared text embedding table
│   ├── emb_ext_00.npy        # audio embedding codebook 0
│   ├── emb_ext_01.npy
│   └── ...                   # (32 audio codebooks total)
├── lm_heads/
│   ├── lm_head_text.npy      # text LM head
│   ├── lm_head_audio_00.npy  # audio LM head 0
│   ├── lm_head_audio_01.npy
│   └── ...                   # (32 audio heads total)
└── extraction_meta.json       # metadata (vocab sizes, paths, etc.)
```
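
To confirm that extraction completed, you can compare the output directory against the listing above. This is a sketch under one assumption: that the audio files are numbered contiguously from `00` to `31` with two digits, which the listing suggests but does not state outright. The helper names are hypothetical.

```python
from pathlib import Path

N_AUDIO = 32  # audio codebooks / audio heads, per the listing above


def expected_files():
    """File names per subdirectory, assuming two-digit indices 00..31."""
    emb = ["embed_tokens.npy"] + [f"emb_ext_{i:02d}.npy" for i in range(N_AUDIO)]
    heads = ["lm_head_text.npy"] + [f"lm_head_audio_{i:02d}.npy" for i in range(N_AUDIO)]
    return {"embeddings": emb, "lm_heads": heads}


def missing_outputs(extracted_dir):
    """Return relative paths from the listing that are absent on disk."""
    root = Path(extracted_dir)
    missing = []
    for sub, names in expected_files().items():
        missing += [f"{sub}/{name}" for name in names if not (root / sub / name).is_file()]
    for rel in ("qwen3_backbone/config.json", "extraction_meta.json"):
        if not (root / rel).is_file():
            missing.append(rel)
    return missing


# Usage: an empty list means every expected file was found.
# print(missing_outputs("weights/extracted"))
```

The check only looks at file names; it does not open the `.npy` files or validate the backbone shards.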

## Step 2: Convert Backbone to GGUF

Use llama.cpp's conversion script to turn the extracted Qwen3 backbone into a GGUF file:

```bash
python llama.cpp/convert_hf_to_gguf.py \
    weights/extracted/qwen3_backbone \
    --outfile weights/backbone_f16.gguf \
    --outtype f16
```

This produces a ~16 GB f16 GGUF file.
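
A cheap sanity check on the result is to read the fixed-size header at the start of the file. The sketch below assumes a little-endian GGUF file of version 2 or later, where the header is the 4-byte magic `GGUF`, a `uint32` version, a `uint64` tensor count, and a `uint64` metadata key/value count. The function name `read_gguf_header` is hypothetical.

```python
import struct


def read_gguf_header(path):
    """Parse the first 24 bytes of a GGUF file (assumes little-endian, version >= 2)."""
    with open(path, "rb") as f:
        head = f.read(24)
    if len(head) < 24 or head[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with a GGUF header")
    # "<IQQ": uint32 version, uint64 tensor count, uint64 metadata entry count.
    version, n_tensors, n_kv = struct.unpack("<IQQ", head[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_entries": n_kv}


# Usage:
# print(read_gguf_header("weights/backbone_f16.gguf"))
```

A zero tensor count or a `ValueError` here suggests the conversion was interrupted or wrote to a different path than expected.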

## Step 3: Quantize

Quantize the f16 GGUF to a smaller format. `Q4_K_M` is a good balance of quality and size:

```bash
llama.cpp/build/bin/llama-quantize \
    weights/backbone_f16.gguf \
    weights/backbone_q4km.gguf \
    Q4_K_M
```

This reduces the file from ~16 GB to ~4.8 GB.

### Other quantization options

| Type | Approx. size | Bits per weight (BPW) | Notes |
|------|--------------|-----------------------|-------|
| `Q4_K_M` | ~4.8 GB | 4.91 | Recommended default |
| `Q5_K_M` | ~5.7 GB | 5.69 | Slightly better quality |
| `Q6_K` | ~6.6 GB | 6.56 | Near-lossless for most uses |
| `Q8_0` | ~8.7 GB | 8.50 | Highest quality quantization |
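
The sizes in the table can be roughly reproduced from the BPW column: f16 stores 16 bits per weight, so a quantized file should be about `f16_size × BPW / 16`. The sketch below applies that to the ~16 GB figure from Step 2. It treats every tensor as scaling uniformly, which is a simplification, so expect the results to differ from the table by a few percent.

```python
F16_SIZE_GB = 16.0  # approximate f16 file size from Step 2
F16_BPW = 16.0      # f16 stores 16 bits per weight

# Bits per weight, copied from the table above.
BPW = {"Q4_K_M": 4.91, "Q5_K_M": 5.69, "Q6_K": 6.56, "Q8_0": 8.50}


def estimate_size_gb(quant_type, f16_size_gb=F16_SIZE_GB):
    """Scale the f16 file size by the ratio of bits per weight."""
    return f16_size_gb * BPW[quant_type] / F16_BPW


for name in BPW:
    print(f"{name}: ~{estimate_size_gb(name):.1f} GB")
```

If a quantized file is far from its estimate, the most likely cause is the one covered under Troubleshooting: the input was not the f16 GGUF.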

## All-in-One Example

```bash
# 0. Build llama.cpp (one-time)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j && cd ..

# 1. Extract weights
python moss_tts_delay/llama_cpp/conversion/extract_weights.py \
    --model OpenMOSS-Team/MOSS-TTS \
    --output weights/extracted

# 2. Convert to f16 GGUF
python llama.cpp/convert_hf_to_gguf.py \
    weights/extracted/qwen3_backbone \
    --outfile weights/backbone_f16.gguf \
    --outtype f16

# 3. Quantize to Q4_K_M
llama.cpp/build/bin/llama-quantize \
    weights/backbone_f16.gguf \
    weights/backbone_q4km.gguf \
    Q4_K_M

# Done! Use the quantized backbone + embeddings + lm_heads for inference.
# See the llama.cpp backend README for usage instructions.
```
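
If you prefer to drive steps 1 to 3 from Python, for example to stop cleanly on the first failure, the same commands can be assembled as argument lists and passed to `subprocess.run`. This is a sketch, not a script shipped with the repository; `build_commands` and `run_pipeline` are hypothetical names, and the default paths mirror the shell example above.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(model="OpenMOSS-Team/MOSS-TTS", weights_dir="weights",
                   llama_cpp_dir="llama.cpp", quant_type="Q4_K_M",
                   quant_name="backbone_q4km.gguf"):
    """Return the three pipeline commands as argument lists, in order."""
    weights = Path(weights_dir)
    llama = Path(llama_cpp_dir)
    extracted = weights / "extracted"
    f16 = weights / "backbone_f16.gguf"
    return [
        # Step 1: extract backbone, embeddings, and LM heads.
        [sys.executable, "moss_tts_delay/llama_cpp/conversion/extract_weights.py",
         "--model", model, "--output", str(extracted)],
        # Step 2: convert the backbone to an f16 GGUF.
        [sys.executable, str(llama / "convert_hf_to_gguf.py"),
         str(extracted / "qwen3_backbone"),
         "--outfile", str(f16), "--outtype", "f16"],
        # Step 3: quantize the f16 GGUF.
        [str(llama / "build" / "bin" / "llama-quantize"),
         str(f16), str(weights / quant_name), quant_type],
    ]


def run_pipeline(commands):
    """Run each command in order; check=True raises on the first failure."""
    for cmd in commands:
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)


# Usage (runs the real conversion, so it is left commented out):
# run_pipeline(build_commands())
```

Using `sys.executable` keeps both Python steps in the interpreter that has the prerequisites installed.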

## Using the Converted Weights

After conversion, gather the outputs of Step 1 and Step 3 into one directory for the llama.cpp backend:

```
weights/
├── backbone_q4km.gguf          # from Step 3
├── embeddings/                  # from Step 1 (weights/extracted/embeddings/)
│   ├── embed_tokens.npy
│   └── emb_ext_*.npy
├── lm_heads/                    # from Step 1 (weights/extracted/lm_heads/)
│   ├── lm_head_text.npy
│   └── lm_head_audio_*.npy
└── tokenizer/                   # from Step 1 (weights/extracted/qwen3_backbone/)
    ├── tokenizer.json
    └── tokenizer_config.json
```
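
The layout above can be assembled by copying files out of `weights/extracted/`. The following sketch does that with the standard library; `arrange_weights` is a hypothetical helper, and it copies rather than moves so the extraction output stays intact.

```python
import shutil
from pathlib import Path


def arrange_weights(extracted_dir, gguf_path, dest_dir):
    """Copy the quantized backbone, embeddings, LM heads, and tokenizer into dest_dir."""
    src, dest = Path(extracted_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # embeddings/ and lm_heads/ are copied as whole directories.
    for sub in ("embeddings", "lm_heads"):
        shutil.copytree(src / sub, dest / sub, dirs_exist_ok=True)

    # The tokenizer files live next to the backbone shards after Step 1.
    tokenizer_dir = dest / "tokenizer"
    tokenizer_dir.mkdir(exist_ok=True)
    for name in ("tokenizer.json", "tokenizer_config.json"):
        shutil.copy2(src / "qwen3_backbone" / name, tokenizer_dir / name)

    # Skip the copy when the GGUF already sits in the destination directory.
    gguf = Path(gguf_path)
    if gguf.resolve().parent != dest.resolve():
        shutil.copy2(gguf, dest / gguf.name)
    return dest


# Usage, with the paths from this guide:
# arrange_weights("weights/extracted", "weights/backbone_q4km.gguf", "weights")
```

Copying duplicates the `.npy` files on disk; once inference works from the new layout, `weights/extracted/` can be deleted if space matters.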

Then update your config YAML (e.g. `configs/llama_cpp/default.yaml`) to point to these paths and run inference:

```bash
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello, world!" \
    --output output.wav
```

## Troubleshooting

- **`convert_hf_to_gguf.py` fails with "unknown model architecture"**: Make sure you are converting the `qwen3_backbone/` directory, not the original MOSS-TTS directory. Its `config.json` must declare `"architectures": ["Qwen3ForCausalLM"]`. If it does and the error persists, your llama.cpp checkout may predate Qwen3 support; update it and retry.
- **Out of memory during extraction**: The extraction script uses lazy loading, so peak memory should be roughly one safetensors shard (~5 GB). If memory is still tight, close other applications.
- **Quantization produces unexpected size**: Verify you are quantizing the f16 GGUF (not an already-quantized file). Double-check the quantization type argument.
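
For the first item, the architecture declared in `config.json` can be checked before starting a conversion. This is a minimal sketch; the helper names are hypothetical, and it only inspects the `architectures` field mentioned above.

```python
import json
from pathlib import Path


def declared_architectures(model_dir):
    """Return the "architectures" list from model_dir/config.json (empty if absent)."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("architectures", [])


def is_qwen3_backbone(model_dir):
    """True if the directory declares exactly the architecture Step 2 expects."""
    return declared_architectures(model_dir) == ["Qwen3ForCausalLM"]


# Usage:
# print(is_qwen3_backbone("weights/extracted/qwen3_backbone"))
```

A `False` result usually means the path points at the original MOSS-TTS directory instead of the extracted backbone.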