# Fine-Tuning MossTTSDelay

This directory provides a complete finetuning workflow built on the `MossTTSDelay` architecture:

- `prepare_data.py`: pre-extract target audio `audio_codes`, with rank-sharded output support
- `dataset.py`: pack `text / instruction / ambient_sound / reference` and related fields into teacher-forcing samples
- `sft.py`: supports single-GPU, data parallel training, and 8B-scale FSDP / DeepSpeed ZeRO-3 sharded training
- `convert_seed_tts_eval_to_jsonl.py`: convert `seed-tts-eval` folders into training JSONL
- `run_train.sh`: one-click launcher

## 1. Install

Install training dependencies first:

```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,finetune]"
```

If your environment supports FlashAttention 2, you can also follow the installation notes in the root README.

If you plan to use **DeepSpeed ZeRO-3**, install the extra dependency group as well:

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,finetune-deepspeed]"
```

## 2. Input JSONL format

All tasks share the same basic idea:

- `audio`: target training audio path; `prepare_data.py` will encode it into `audio_codes`
- all other fields are mapped directly into `processor.build_user_message(...)`

### 2.1 MOSS-TTS-v1.5

#### Plain `text, speech` pairs

This format does not require reference audio and is supported directly:

```jsonl
{"audio":"./data/utt0001.wav","text":"Actually, I noticed that I am very sensitive to other people's emotions.","language":"English"}
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","language":"English"}
```

#### Voice cloning / reference-conditioned training

For MOSS-TTS-v1.5, set `language` whenever the language is known. Use language names such as `Chinese`, `English`, or `French`.

```jsonl
{"audio":"./data/utt0001.wav","text":"Actually, I noticed that I am very sensitive to other people's emotions.","ref_audio":"./data/ref.wav","language":"English"}
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","ref_audio":"./data/ref.wav","language":"English"}
```

### 2.2 MOSS-TTSD

MOSS-TTSD shares the same `prepare_data.py / sft.py` pipeline as MOSS-TTS-v1.5, and the format can stay the same. The only difference is that `reference` may be a multi-speaker list, and list elements may be `null`, meaning that speaker has no cloning reference:

```jsonl
{"audio":"./data/dialog_target.wav","text":"[S1] This is the prefix from speaker one. [S2] This is the prefix from speaker two. [S1] Now continue the next turn.","reference":["./data/s1_ref.wav",null],"language":"English"}
```

Notes:

- `prepare_data.py` always encodes `audio`
- by default it also encodes any reference audio found in `reference` / `ref_audio` / `reference_audio`
- `null` entries inside `reference` are preserved as `None` during training and will not be encoded incorrectly
- no extra `prompt_audio` field is required; autoregressive continuation is already learned through standard teacher-forcing training

If you use `OpenMOSS-Team/MOSS-TTSD-v1.0` as the base model, the safest and simplest workflow is: **replace the support `.py` files under this repo's `moss_tts_delay` folder with the corresponding versions from the `OpenMOSS-Team/MOSS-TTSD-v1.0` Hugging Face repository before preprocessing, training, and inference.** Specifically, replace:

- `processing_moss_tts.py`
- `modeling_moss_tts.py`
- `configuration_moss_tts.py`
- `inference_utils.py`

The reason is that the prompt template and some implementation details in `OpenMOSS-Team/MOSS-TTSD-v1.0` are not exactly the same as the default `moss_tts_delay` version in this repo. If you mix them, a common failure mode is that the training loss looks normal but inference collapses into gibberish.

Also note that `OpenMOSS-Team/MOSS-TTSD-v1.0` uses `n_vq = 16`, so we recommend explicitly passing `--n-vq 16` in both preprocessing and training to keep data preparation, training, and inference consistent.

### 2.3 MOSS-SoundEffect

MOSS-SoundEffect uses the same pipeline, with `ambient_sound` as the user-side field:

```jsonl
{"audio":"./data/rain.wav","ambient_sound":"Rolling thunder with steady rainfall."}
{"audio":"./data/footsteps.wav","ambient_sound":"Clear footsteps echoing on concrete at a steady rhythm.","tokens":160}
```

### 2.4 MOSS-VoiceGenerator

MOSS-VoiceGenerator also shares the same training flow, using `text + instruction`:

```jsonl
{"audio":"./data/old_man.wav","text":"My old back is really giving me trouble these days.","instruction":"A tired, hoarse elderly voice complaining slowly with a faint groan."}
{"audio":"./data/tavern.wav","text":"Hey there, stranger!","instruction":"Hearty, jovial tavern owner's voice, loud and welcoming with a slightly gruff tone."}
```

## 3. Prepare data

### 3.1 Single process

```bash
python moss_tts_delay/finetuning/prepare_data.py \
    --model-path OpenMOSS-Team/MOSS-TTS-v1.5 \
    --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
    --device auto \
    --input-jsonl train_raw.jsonl \
    --output-jsonl train_with_codes.jsonl
```

By default, `prepare_data.py` pre-encodes reference audio as well. If you only want target audio codes, disable it explicitly:

```bash
python moss_tts_delay/finetuning/prepare_data.py \
    --model-path OpenMOSS-Team/MOSS-TTS-v1.5 \
    --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
    --device auto \
    --input-jsonl train_raw.jsonl \
    --output-jsonl train_with_codes.jsonl \
    --skip-reference-audio-codes
```

### 3.2 Multi-node / multi-GPU parallel preprocessing

`prepare_data.py` now follows the `accelerate launch` multi-process model directly.
For example, with 2 nodes and 16 GPUs in total, the dataset is split into 16 shards and each rank writes one shard:

```bash
accelerate launch --num_processes 16 moss_tts_delay/finetuning/prepare_data.py \
    --model-path OpenMOSS-Team/MOSS-TTS-v1.5 \
    --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
    --device auto \
    --input-jsonl train_raw.jsonl \
    --output-jsonl prepared/train_with_codes.jsonl
```

The output will look like:

- `prepared/train_with_codes.rank00000-of-00016.jsonl`
- `prepared/train_with_codes.rank00001-of-00016.jsonl`
- ...
- `prepared/train_with_codes.rank00015-of-00016.jsonl`

During training, `sft.py` can read:

- a single JSONL
- a directory
- a glob such as `prepared/train_with_codes.rank*.jsonl`
- or a comma-separated list of files

If your platform already injects distributed communication environment variables, `accelerate launch` will reuse them directly, so you usually do not need to write `torchrun`-style communication arguments yourself.

## 4. Train

### 4.1 Single-GPU baseline

```bash
accelerate launch moss_tts_delay/finetuning/sft.py \
    --model-path OpenMOSS-Team/MOSS-TTS-v1.5 \
    --train-jsonl train_with_codes.jsonl \
    --output-dir output/moss_tts_sft \
    --per-device-batch-size 1 \
    --gradient-accumulation-steps 8 \
    --learning-rate 1e-5 \
    --warmup-ratio 0.03 \
    --num-epochs 3 \
    --mixed-precision bf16 \
    --channelwise-loss-weight 1,32 \
    --gradient-checkpointing
```

### 4.2 Data parallel

For single-node 8-GPU data parallel training, you can use:

```bash
accelerate launch \
    --config_file moss_tts_delay/finetuning/configs/accelerate_ddp_8gpu.yaml \
    moss_tts_delay/finetuning/sft.py \
    --model-path OpenMOSS-Team/MOSS-TTS-v1.5 \
    --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \
    --output-dir output/moss_tts_sft_ddp \
    --per-device-batch-size 1 \
    --gradient-accumulation-steps 4 \
    --mixed-precision bf16 \
    --channelwise-loss-weight 1,32 \
    --gradient-checkpointing
```

### 4.3 "Model parallel" / parameter-sharded training for the 8B model

For the 8B `MOSS-TTS-v1.5` model, the following approaches are recommended over naive single-card training:

- **FSDP**: shard parameters, gradients, and optimizer states across ranks
- **DeepSpeed ZeRO-3**: fully shard parameters, gradients, and optimizer states; better suited for larger models and multi-node setups

#### FSDP

```bash
accelerate launch \
    --config_file moss_tts_delay/finetuning/configs/accelerate_fsdp_8b.yaml \
    moss_tts_delay/finetuning/sft.py \
    --model-path OpenMOSS-Team/MOSS-TTS-v1.5 \
    --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \
    --output-dir output/moss_tts_sft_fsdp \
    --per-device-batch-size 1 \
    --gradient-accumulation-steps 4 \
    --mixed-precision bf16 \
    --channelwise-loss-weight 1,32 \
    --gradient-checkpointing
```

#### DeepSpeed ZeRO-3

```bash
accelerate launch \
    --config_file moss_tts_delay/finetuning/configs/accelerate_zero3_8b.yaml \
    moss_tts_delay/finetuning/sft.py \
    --model-path OpenMOSS-Team/MOSS-TTS-v1.5 \
    --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \
    --output-dir output/moss_tts_sft_zero3 \
    --per-device-batch-size 1 \
    --gradient-accumulation-steps 4 \
    --mixed-precision bf16 \
    --channelwise-loss-weight 1,32 \
    --gradient-checkpointing
```

ZeRO-3 requires the `deepspeed` package. If you only use DDP or FSDP, you do not need it.

### 4.4 Common tunable hyperparameters

`sft.py` now exposes the common training hyperparameters directly:

- optimizer: `--learning-rate`, `--weight-decay`, `--adam-beta1`, `--adam-beta2`, `--adam-eps`
- LR schedule: `--lr-scheduler-type`, `--warmup-steps`, `--warmup-ratio`
- stability: `--max-grad-norm`, `--gradient-checkpointing`, `--mixed-precision`
- RVQ multi-head loss weighting: `--channelwise-loss-weight`

`--channelwise-loss-weight` supports two forms:

- `n_vq + 1` values: `text_head,vq0,...,vqN`
- two values: `text_weight,total_audio_weight`

The default is `1,32`: the text head gets weight 1 and the audio heads share a total weight of 32, so for a model with 32 audio heads the text head and each individual audio head have equal weight.
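
The two forms of `--channelwise-loss-weight` can be illustrated with a small sketch. This is not the code in `sft.py`: the helper name `expand_channelwise_weights` is hypothetical, and it assumes the two-value form splits `total_audio_weight` evenly across the `n_vq` audio heads, which is what the description of the default suggests.

```python
def expand_channelwise_weights(spec: str, n_vq: int) -> list[float]:
    """Expand a weight spec into one weight per head: [text_head, vq0, ..., vq(n_vq-1)].

    Hypothetical helper for illustration; assumes an even split of the
    total audio weight in the two-value form.
    """
    values = [float(part) for part in spec.split(",")]
    if len(values) == n_vq + 1:
        # Explicit form: one weight for the text head plus one per audio head.
        return values
    if len(values) == 2:
        # Compact form: text weight plus a total weight shared by all audio heads.
        text_weight, total_audio_weight = values
        return [text_weight] + [total_audio_weight / n_vq] * n_vq
    raise ValueError(f"expected 2 or {n_vq + 1} values, got {len(values)}")


# With 32 audio heads, "1,32" gives every head a weight of 1.
print(expand_channelwise_weights("1,32", n_vq=32)[:3])
# With 16 audio heads (as in MOSS-TTSD-v1.0), the same spec gives each audio head a weight of 2.
print(expand_channelwise_weights("1,32", n_vq=16)[:3])
```

Under this assumption, keeping `1,32` while switching to `--n-vq 16` doubles the per-head audio weight relative to the text head, so it may be worth checking whether that is the balance you want.
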

Training logs now print:

- timestamped log prefixes
- `global_batch_size` and its formula
- `step_time`
- `steps_per_sec`
- `samples_per_sec`
- `eta`

### 4.5 Multi-node training

Update the following fields in the config file for your cluster:

- `num_machines`
- `num_processes`
- `machine_rank`
- `main_process_ip`
- `main_process_port`

For example, for 2 nodes and 16 GPUs:

- node 0: `machine_rank: 0`
- node 1: `machine_rank: 1`
- `num_machines: 2`
- `num_processes: 16`

The training command itself can stay unchanged.

## 5. Quick inference test

Each checkpoint saved by `sft.py` now contains model config, runtime Python files, tokenizer files, and processor metadata, so you can call `from_pretrained` directly on that checkpoint directory:

```python
from pathlib import Path
import importlib.util

import torch
import torchaudio
from transformers import AutoProcessor

from moss_tts_delay.modeling_moss_tts import MossTTSDelayModel

# Disable the cuDNN SDPA backend; keep the flash, memory-efficient, and math backends enabled.
torch.backends.cuda.enable_cudnn_sdp(False)
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)


def resolve_attn_implementation(device: str, dtype: torch.dtype) -> str:
    # Use FlashAttention 2 only when the package is installed, the dtype is
    # half precision, and the GPU has compute capability 8.0 or newer.
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"
    # Otherwise fall back to SDPA on CUDA and eager attention on CPU.
    if device == "cuda":
        return "sdpa"
    return "eager"


model_path = "output/moss_tts_sft/checkpoint-epoch-2"
reference_audio = "./assets/audio/reference_en_0.mp3"
text = "This is a quick finetuning smoke test for MOSS-TTS-v1.5."

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
attn_implementation = resolve_attn_implementation(device, dtype)

# Load the processor and model straight from the checkpoint directory.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

model = MossTTSDelayModel.from_pretrained(
    model_path,
    torch_dtype=dtype,
    attn_implementation=attn_implementation,
).to(device)
model.eval()

# Build a single-turn conversation with one reference audio for voice cloning.
conversation = [[
    processor.build_user_message(
        text=text,
        reference=[reference_audio],
    )
]]
batch = processor(conversation, mode="generation")

outputs = model.generate(
    input_ids=batch["input_ids"].to(device),
    attention_mask=batch["attention_mask"].to(device),
    max_new_tokens=4096,
)

# Decode the generation and write the first returned audio to disk.
message = processor.decode(outputs)[0]
audio = message.audio_codes_list[0]
Path("demo_outputs").mkdir(parents=True, exist_ok=True)
torchaudio.save("demo_outputs/finetuned_sample.wav", audio.unsqueeze(0), processor.model_config.sampling_rate)
```

## 6. One-click launcher

Run directly:

```bash
bash moss_tts_delay/finetuning/run_train.sh
```

Common environment variables:

- `RAW_JSONL`: raw training JSONL
- `PREPARED_JSONL`: output file from `prepare_data.py`
- `TRAIN_JSONL`: optional; training input, which can be a single file, directory, or glob. If unset, it is inferred automatically from `PREPARED_JSONL`
- `OUTPUT_DIR`: training output directory
- `ACCELERATE_CONFIG_FILE`: optional; DDP / FSDP / ZeRO-3 config file
- `SKIP_PREPARE`: set to `1` to skip preprocessing and train directly from existing `TRAIN_JSONL` / `PREPARED_JSONL`
- `PREP_EXTRA_ARGS_STR`: extra arguments passed to `prepare_data.py`
- `PREP_ACCELERATE_ARGS_STR`: if you want preprocessing to also launch through `accelerate`, set this, for example `--num_processes 16` or `--config_file moss_tts_delay/finetuning/configs/accelerate_ddp_8gpu.yaml`
- `TRAIN_EXTRA_ARGS_STR`: extra arguments passed to `sft.py`

For example, to launch with ZeRO-3:

```bash
RAW_JSONL=train_raw.jsonl \
PREPARED_JSONL=prepared/train_with_codes.jsonl \
OUTPUT_DIR=output/moss_tts_sft_zero3 \
ACCELERATE_CONFIG_FILE=moss_tts_delay/finetuning/configs/accelerate_zero3_8b.yaml \
PREP_ACCELERATE_ARGS_STR='--config_file moss_tts_delay/finetuning/configs/accelerate_ddp_8gpu.yaml' \
PREP_EXTRA_ARGS_STR='' \
TRAIN_EXTRA_ARGS_STR='--per-device-batch-size 1 --gradient-accumulation-steps 4 --num-epochs 3 --warmup-ratio 0.03 --mixed-precision bf16 --channelwise-loss-weight 1,32 --gradient-checkpointing' \
bash moss_tts_delay/finetuning/run_train.sh
```

## 7. Additional task format notes

The remaining tasks do not require a separate trainer. You only need to switch the JSONL fields:

- **MOSS-TTS-v1.5**: use `text`, optionally `ref_audio`
- **MOSS-TTSD**: use `text + reference`, where `reference` supports `null`
- **MOSS-SoundEffect**: use `ambient_sound`
- **MOSS-VoiceGenerator**: use `text + instruction`

Shared fields:

- `audio`: required target audio
- `language`, `tokens`, `quality`, `sound_event`, `ambient_sound`, `instruction`: fill them as needed by the task. For MOSS-TTS-v1.5, prefer language names such as `Chinese`, `English`, or `French`.
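
A quick pre-flight check of the raw JSONL can catch missing fields before running `prepare_data.py`. The sketch below is not part of this repo: the function name `check_record` and the per-task required-field table are assumptions based on one reading of the field summary above, and the actual scripts may accept more or require less (for example, `reference` may be optional for MOSS-TTSD).

```python
import json

# Assumed required fields per task, derived from the summary above.
REQUIRED_FIELDS = {
    "MOSS-TTS-v1.5": {"audio", "text"},
    "MOSS-TTSD": {"audio", "text", "reference"},
    "MOSS-SoundEffect": {"audio", "ambient_sound"},
    "MOSS-VoiceGenerator": {"audio", "text", "instruction"},
}


def check_record(line: str, task: str) -> list[str]:
    """Return the problems found in one JSONL line (an empty list if none)."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS[task] - set(record)
    return [f"missing field: {name}" for name in sorted(missing)]


# A voice-generator record without `text` and `instruction` is flagged.
print(check_record('{"audio":"./data/old_man.wav"}', "MOSS-VoiceGenerator"))
```

Running this over each line of the raw file and printing the line number alongside any problems is usually enough to spot records prepared for the wrong task.
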

Shared scripts:

- use `prepare_data.py` for data preparation
- use `sft.py` for training
- `--train-jsonl` supports a single file, a directory, a glob, or a comma-separated list of files
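
To preview which shard files a given `--train-jsonl` value would cover, a small resolver can help. This is a sketch, not the logic in `sft.py`: the function name `resolve_train_jsonl` is hypothetical, and the rules (directories expand to their `*.jsonl` files, glob patterns are expanded and sorted, plain paths are kept as given) are assumptions about reasonable behavior.

```python
from glob import glob
from pathlib import Path


def resolve_train_jsonl(spec: str) -> list[str]:
    """Expand a file, directory, glob, or comma-separated list into file paths.

    Hypothetical helper for illustration; sft.py may resolve inputs differently.
    """
    files: list[str] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if Path(part).is_dir():
            # Directory: take every JSONL file directly inside it.
            files.extend(sorted(str(p) for p in Path(part).glob("*.jsonl")))
        elif any(ch in part for ch in "*?["):
            # Glob pattern: expand and sort for a stable order.
            files.extend(sorted(glob(part)))
        else:
            # Plain path: keep as given.
            files.append(part)
    return files


for path in resolve_train_jsonl("prepared/train_with_codes.rank*.jsonl"):
    print(path)
```

If the printed list is empty or shorter than the number of ranks used for preprocessing, the glob or directory is probably pointing at the wrong location.
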