# MOSS-TTS-Nano Finetuning Guide

This directory provides a complete finetuning workflow for `MOSS-TTS-Nano`:

- `prepare_data.py`: precomputes `audio_codes` for target audio and, when needed, `ref_audio_codes`
- `dataset.py`: packs fields such as `text / instruction / ambient_sound / ref_audio` into teacher-forcing samples
- `sft.py`: supports single-GPU, data-parallel, and multi-node training
- `verify.py`: provides basic non-streaming inference checks
- `run_train.sh`: one-click wrapper that chains preprocessing and training

Default model weight locations:

- TTS model: `./models/MOSS-TTS-Nano`
- Audio codec: `./models/MOSS-Audio-Tokenizer-Nano`

## 1. Install Dependencies

From the repository root:

```bash
cd /path/to/MOSS-TTS-Nano
pip install -r requirements.txt
```

`requirements.txt` already includes:

- `accelerate>=1.0.0`
- `tqdm>=4.66.0`

## 2. Raw JSONL Format

The Nano finetuning pipeline mainly supports the following two formats.

### 2.1 Plain `text, speech` Pairs

```jsonl
{"audio":"./data/utt0001.wav","text":"I realized that I am actually very good at noticing other people's emotions.","language":"en"}
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","language":"en"}
```

### 2.2 Voice Cloning / Reference-Conditioned Training

Only one reference field is supported:

- `ref_audio`: a single reference audio clip

Example:

```jsonl
{"audio":"./data/utt0001.wav","text":"I realized that I am actually very good at noticing other people's emotions.","ref_audio":"./data/ref.wav","language":"en"}
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","ref_audio":"./data/ref.wav","language":"en"}
```

### 2.3 Optional Fields

If needed, you can also provide the following fields. They will be appended to the user prompt:

- `instruction`
- `tokens`
- `quality`
- `sound_event`
- `ambient_sound`
- `language`

### 2.4 Path Rules

- Relative paths in the raw JSONL are resolved relative to the JSONL file's location.
- Training expects preprocessed JSONL input, meaning each record must already contain `audio_codes`.
- If reference-conditioned training is used, the training JSONL must also already contain `ref_audio_codes`.
- Nano finetuning currently supports only a single reference audio per sample.

## 3. Data Preprocessing

`prepare_data.py` does two things:

1. Encodes `audio` into `audio_codes`
2. Encodes `ref_audio` into `ref_audio_codes` (by default)

### 3.1 Single Process

```bash
python finetuning/prepare_data.py \
    --codec-path ./models/MOSS-Audio-Tokenizer-Nano \
    --input-jsonl train_raw.jsonl \
    --output-jsonl train_with_codes.jsonl \
    --batch-size 8
```

If you only want to encode the target audio and skip reference audio:

```bash
python finetuning/prepare_data.py \
    --codec-path ./models/MOSS-Audio-Tokenizer-Nano \
    --input-jsonl train_raw.jsonl \
    --output-jsonl train_with_codes.jsonl \
    --skip-reference-audio-codes
```

### 3.2 Multi-Node / Multi-GPU Parallel Encoding

`prepare_data.py` follows the standard `accelerate launch` multi-process semantics. For example, with 2 nodes and 16 GPUs in total, the input is split into 16 shards and each rank writes its own output shard:

```bash
accelerate launch --num_processes 16 finetuning/prepare_data.py \
    --codec-path ./models/MOSS-Audio-Tokenizer-Nano \
    --input-jsonl train_raw.jsonl \
    --output-jsonl prepared/train_with_codes.jsonl
```

The outputs look like:

- `prepared/train_with_codes.rank00000-of-00016.jsonl`
- `prepared/train_with_codes.rank00001-of-00016.jsonl`
- ...
- `prepared/train_with_codes.rank00015-of-00016.jsonl`
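Before launching training, it can be worth sanity-checking that every encoded record carries the fields `sft.py` expects. The snippet below is an illustrative check, not a script shipped in this repository; it assumes only the documented field names (`audio_codes`, and `ref_audio_codes` for reference-conditioned data):

```python
import glob
import json

# Illustrative sanity check (not part of the repository): verify that each
# preprocessed record contains audio_codes, and ref_audio_codes whenever a
# ref_audio field is present.
for path in sorted(glob.glob("prepared/train_with_codes.rank*.jsonl")):
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            assert "audio_codes" in record, f"{path}:{line_no}: missing audio_codes"
            if "ref_audio" in record:
                assert "ref_audio_codes" in record, f"{path}:{line_no}: missing ref_audio_codes"
print("all shards look consistent")
```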
During training, `sft.py` can directly read:

- a single JSONL file
- a directory
- a glob such as `prepared/train_with_codes.rank*.jsonl`
- a comma-separated list of files

If your platform already injects multi-node communication environment variables, `accelerate launch` can usually reuse them directly.

## 4. Training

### 4.1 Single-GPU Baseline

```bash
accelerate launch finetuning/sft.py \
    --model-path ./models/MOSS-TTS-Nano \
    --codec-path ./models/MOSS-Audio-Tokenizer-Nano \
    --train-jsonl train_with_codes.jsonl \
    --output-dir output/moss_tts_nano_sft \
    --per-device-batch-size 1 \
    --gradient-accumulation-steps 8 \
    --learning-rate 1e-5 \
    --warmup-ratio 0.03 \
    --num-epochs 3 \
    --mixed-precision bf16 \
    --max-length 1024 \
    --channelwise-loss-weight 1,32
```

### 4.2 Single-Machine 8-GPU DDP

```bash
accelerate launch \
    --config_file finetuning/configs/accelerate_ddp_8gpu.yaml \
    finetuning/sft.py \
    --model-path ./models/MOSS-TTS-Nano \
    --codec-path ./models/MOSS-Audio-Tokenizer-Nano \
    --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \
    --output-dir output/moss_tts_nano_sft_ddp \
    --per-device-batch-size 1 \
    --gradient-accumulation-steps 4 \
    --learning-rate 1e-5 \
    --num-epochs 3 \
    --mixed-precision bf16 \
    --max-length 1024 \
    --channelwise-loss-weight 1,32
```

### 4.3 Multi-Node Training

Update the following fields in your config file to match your cluster:

- `num_machines`
- `num_processes`
- `machine_rank`
- `main_process_ip`
- `main_process_port`

Keep the rest of the training command unchanged.

### 4.4 Important Arguments

- `--max-length`: fixed full sequence length. Samples are truncated to this length, then padded.
- `--channelwise-loss-weight`: supports two formats (see the sketch after this list):
  - `text_head,vq0,...,vqN`
  - `text_weight,total_audio_weight`
- `--save-every-epochs`: save one checkpoint every N epochs.
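As a reading aid for the two formats, here is a minimal sketch of how such a spec could expand into per-head weights. This is illustrative only: the helper name is hypothetical, the even split of `total_audio_weight` across the audio heads is an assumption, and `sft.py` remains the authoritative implementation.

```python
def parse_channelwise_loss_weight(spec: str, num_audio_heads: int) -> list[float]:
    """Hypothetical helper: expand a --channelwise-loss-weight spec into
    [text_weight, audio_head_0, ..., audio_head_N] per-head weights.

    Sketch only; the even split of total_audio_weight is an assumption,
    and sft.py is the authoritative implementation.
    """
    values = [float(v) for v in spec.split(",")]
    if len(values) == 1 + num_audio_heads:
        # Format: text_head,vq0,...,vqN -- one explicit weight per head.
        return values
    if len(values) == 2:
        # Format: text_weight,total_audio_weight -- spread the audio budget.
        text_weight, total_audio_weight = values
        return [text_weight] + [total_audio_weight / num_audio_heads] * num_audio_heads
    raise ValueError(f"unsupported --channelwise-loss-weight spec: {spec!r}")
```

Under this reading, the `1,32` value used in the commands above gives the text head weight 1 and lets the audio heads share a total weight of 32.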
Single-GPU memory reference:

- With `accelerate launch --num_processes 1` and `--per-device-batch-size 1 --gradient-accumulation-steps 1 --max-length 1024 --mixed-precision bf16`, the measured peak memory usage of the training process is about `3.23 GiB`.

### 4.5 Checkpoint Contents

Each checkpoint directory can be loaded directly by the inference code in this repository. It contains:

- model weights
- `config.json`
- tokenizer files
- the Nano model Python source files needed for loading
- `finetune_config.json`

## 5. One-Click Script

If you want a simple wrapper that chains preprocessing and training:

```bash
bash finetuning/run_train.sh
```

Common environment variables:

- `RAW_JSONL`: raw training JSONL
- `PREPARED_JSONL`: preprocessed JSONL
- `TRAIN_JSONL`: training input; if unset, it is inferred from `PREPARED_JSONL`
- `OUTPUT_DIR`: output directory
- `SKIP_PREPARE=1`: skip preprocessing and train directly
- `PREP_ACCELERATE_ARGS_STR`: extra `accelerate` args for `prepare_data.py`
- `TRAIN_ACCELERATE_ARGS_STR`: extra `accelerate launch` args for training, mainly for overriding `num_machines / num_processes / machine_rank`
- `PREP_EXTRA_ARGS_STR`: extra args passed to `prepare_data.py`
- `TRAIN_EXTRA_ARGS_STR`: extra args passed to `sft.py`
- `ACCELERATE_CONFIG_FILE`: training-time accelerate config file; if `TRAIN_ACCELERATE_ARGS_STR` is also provided, command-line values override the config defaults

Example:

```bash
RAW_JSONL=train_raw.jsonl \
PREPARED_JSONL=prepared/train_with_codes.jsonl \
OUTPUT_DIR=output/moss_tts_nano_sft \
PREP_ACCELERATE_ARGS_STR='--num_processes 8' \
ACCELERATE_CONFIG_FILE=finetuning/configs/accelerate_ddp_8gpu.yaml \
TRAIN_EXTRA_ARGS_STR='--per-device-batch-size 1 --gradient-accumulation-steps 4 --learning-rate 1e-5 --num-epochs 3 --mixed-precision bf16 --max-length 1024 --channelwise-loss-weight 1,32' \
bash finetuning/run_train.sh
```

For multi-node runs, the same idea applies: prepare the shared encoded data first, then adjust `ACCELERATE_CONFIG_FILE` or `TRAIN_ACCELERATE_ARGS_STR` for your cluster.

## 6. Quick Verification

`verify.py` keeps the inference path intentionally simple. It supports:

- `voice_clone`: reference audio + target text
- `continuation`: continuation mode, with two input patterns:
  - `prompt_text + prompt_audio_path + text`
  - only `text`, which falls back to plain TTS

### 6.1 Voice Clone Verification

```bash
python finetuning/verify.py \
    --checkpoint output/moss_tts_nano_sft/checkpoint-last \
    --mode voice_clone \
    --text "This is a quick validation example for a finetuned model." \
    --prompt-audio-path ./assets/audio/zh_1.wav \
    --output-audio-path output/verify_voice_clone.wav
```

### 6.2 Continuation Verification

If `continuation` is used with `--prompt-audio-path`, you must also provide the corresponding `--prompt-text`:

```bash
python finetuning/verify.py \
    --checkpoint output/moss_tts_nano_sft/checkpoint-last \
    --mode continuation \
    --prompt-text "This sentence has already been spoken in the prompt audio." \
    --prompt-audio-path ./assets/audio/zh_1.wav \
    --text "This next sentence continues from that prompt for a quick continuation check." \
    --output-audio-path output/verify_continuation.wav
```

### 6.3 Plain TTS Verification

If you only want plain text-to-speech without reference audio, still use `continuation`, but do not pass `--prompt-text` or `--prompt-audio-path`:

```bash
python finetuning/verify.py \
    --checkpoint output/moss_tts_nano_sft/checkpoint-last \
    --mode continuation \
    --text "This is a quick non-streaming validation example." \
    --output-audio-path output/verify_tts.wav
```

You can also continue using the repository-level `infer.py`. Checkpoints saved by finetuning are already packaged in a format that `infer.py` can load directly.
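To smoke-test a checkpoint across modes in one go, the verification commands above can be chained in a small script. This is a sketch using only the documented flags; the checkpoint and output paths are the example paths used throughout this guide:

```bash
#!/usr/bin/env bash
set -euo pipefail

CKPT=output/moss_tts_nano_sft/checkpoint-last

# Plain TTS: continuation mode without a prompt.
python finetuning/verify.py \
    --checkpoint "$CKPT" \
    --mode continuation \
    --text "Plain TTS smoke test." \
    --output-audio-path output/smoke_tts.wav

# Voice clone: reference audio plus target text.
python finetuning/verify.py \
    --checkpoint "$CKPT" \
    --mode voice_clone \
    --text "Voice clone smoke test." \
    --prompt-audio-path ./assets/audio/zh_1.wav \
    --output-audio-path output/smoke_voice_clone.wav
```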