# Fine-Tuning MOSS-TTS-Realtime

This directory provides a complete finetuning workflow built on the `MOSS-TTS-Realtime` architecture:

- `prepare_data.py`: pre-extract target audio `audio_codes`, with rank-sharded output support
- `dataset.py`: assemble preprocessed data into a trainable format
- `sft.py`: supports single-GPU, data parallel training, and optional FSDP / DeepSpeed ZeRO-3 sharded training
- `run_train.sh`: one-click launcher

## 1. Install

Install training dependencies first:

```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,finetune]"
```

If your environment supports FlashAttention 2, you can also follow the installation notes in the root README.

If you plan to use **DeepSpeed ZeRO-3**, install the extra dependency group as well:

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,finetune-deepspeed]"
```

## 2. Input JSONL format

All data uses a unified `conversations` multi-turn dialogue format. Each record contains a `conversations` list, where each element represents one dialogue turn with `role` (`user` or `assistant`), `text` (text content), and `wav` (audio file path). An optional `ref_wav` field specifies the reference audio for voice cloning.

### 2.1 Single-turn data

Single-turn data contains only one assistant turn, functioning the same as standard TTS and voice cloning. If no reference audio is available, the `ref_wav` field can be omitted.
```jsonl
{"id": "000001", "ref_wav": "./data/ref0.wav", "conversations": [{"role": "assistant", "text": "Actually, I noticed that I am very sensitive to other people's emotions.", "wav": "./data/utt0001.wav"}]}
{"id": "000002", "ref_wav": "./data/ref1.wav", "conversations": [{"role": "assistant", "text": "She said she would be here by noon.", "wav": "./data/utt0002.wav"}]}
```

### 2.2 Multi-turn data

In multi-turn data, `user` turns represent the user's voice interaction with the VoiceAgent, and `assistant` turns represent the speech synthesized by MOSS-TTS-Realtime, which should share the same speaker as `ref_wav`. All assistant turns must be from the same speaker, while user turns can be from different speakers. Single-turn and multi-turn data can be mixed together for training to maintain both single-turn and multi-turn capabilities.

```jsonl
{"id": "000003", "ref_wav": "./data/ref0.wav", "conversations": [{"role": "user", "text": "Hey, I just landed in Paris. I have about six hours before my next flight. Any ideas?", "wav": "./data/user_utt0001.wav"}, {"role": "assistant", "text": "Nice, welcome to Paris! Six hours is actually perfect for a short city walk. Are you traveling light, or do you have luggage with you?", "wav": "./data/assistant_utt0001.wav"}, {"role": "user", "text": "Just a backpack. I don't want anything too rushed.", "wav": "./data/user_utt0002.wav"}, {"role": "assistant", "text": "Got it. In that case, I'd suggest starting near the Seine. You could walk from Notre-Dame to the Louvre, grab a coffee.", "wav": "./data/assistant_utt0002.wav"}]}
```

## 3. Prepare data

### 3.1 Single process

```bash
python moss_tts_realtime/finetuning/prepare_data.py \
    --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
    --device auto \
    --input-jsonl train_raw.jsonl \
    --output-jsonl train_with_codes.jsonl
```

By default, `prepare_data.py` pre-encodes reference audio as well.
If you only want target audio codes, disable it explicitly:

```bash
python moss_tts_realtime/finetuning/prepare_data.py \
    --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
    --device auto \
    --input-jsonl train_raw.jsonl \
    --output-jsonl train_with_codes.jsonl \
    --skip-reference-audio-codes
```

### 3.2 Multi-node / multi-GPU parallel preprocessing

`prepare_data.py` follows the `accelerate launch` multi-process model directly. For example, with 2 nodes and 16 GPUs in total, the dataset is split into 16 shards and each rank writes one shard:

```bash
accelerate launch --num_processes 16 moss_tts_realtime/finetuning/prepare_data.py \
    --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
    --device auto \
    --input-jsonl train_raw.jsonl \
    --output-jsonl prepared/train_with_codes.jsonl
```

The output will look like:

- `prepared/train_with_codes.rank00000-of-00016.jsonl`
- `prepared/train_with_codes.rank00001-of-00016.jsonl`
- ...
- `prepared/train_with_codes.rank00015-of-00016.jsonl`

During training, `sft.py` can read:

- a single JSONL
- a directory
- a glob such as `prepared/train_with_codes.rank*.jsonl`
- or a comma-separated list of files

If your platform already injects distributed communication environment variables, `accelerate launch` will reuse them directly, so you usually do not need to write `torchrun`-style communication arguments yourself.

## 4. Train

### 4.1 Single-GPU baseline

```bash
accelerate launch moss_tts_realtime/finetuning/sft.py \
    --model-path OpenMOSS-Team/MOSS-TTS-Realtime \
    --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
    --train-jsonl train_with_codes.jsonl \
    --output-dir output/moss_tts_realtime_sft \
    --per-device-batch-size 1 \
    --gradient-accumulation-steps 8 \
    --learning-rate 1e-5 \
    --warmup-ratio 0.03 \
    --num-epochs 3 \
    --mixed-precision bf16
```

### 4.2 Data parallel

For single-node 8-GPU data parallel training, you can use:

```bash
accelerate launch \
    --config_file moss_tts_realtime/finetuning/configs/accelerate_ddp_8gpu.yaml \
    moss_tts_realtime/finetuning/sft.py \
    --model-path OpenMOSS-Team/MOSS-TTS-Realtime \
    --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
    --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \
    --output-dir output/moss_tts_realtime_sft_ddp \
    --per-device-batch-size 1 \
    --gradient-accumulation-steps 4 \
    --mixed-precision bf16
```

### 4.3 Optional parameter-sharded training

For the 1.7B `OpenMOSS-Team/MOSS-TTS-Realtime` model, single-node DDP is usually enough.
If you still want parameter sharding, the following approaches are supported:

- **FSDP**: shard parameters, gradients, and optimizer states across ranks
- **DeepSpeed ZeRO-3**: fully shard parameters, gradients, and optimizer states; better suited for larger models and multi-node setups

#### FSDP

```bash
accelerate launch \
    --config_file moss_tts_realtime/finetuning/configs/accelerate_fsdp_1.7b.yaml \
    moss_tts_realtime/finetuning/sft.py \
    --model-path OpenMOSS-Team/MOSS-TTS-Realtime \
    --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
    --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \
    --output-dir output/moss_tts_realtime_sft_fsdp \
    --per-device-batch-size 1 \
    --gradient-accumulation-steps 4 \
    --mixed-precision bf16
```

#### DeepSpeed ZeRO-3

```bash
accelerate launch \
    --config_file moss_tts_realtime/finetuning/configs/accelerate_zero3_1.7b.yaml \
    moss_tts_realtime/finetuning/sft.py \
    --model-path OpenMOSS-Team/MOSS-TTS-Realtime \
    --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
    --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \
    --output-dir output/moss_tts_realtime_sft_zero3 \
    --per-device-batch-size 1 \
    --gradient-accumulation-steps 4 \
    --mixed-precision bf16
```

ZeRO-3 requires the `deepspeed` package. If you only use DDP or FSDP, you do not need it.
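
The multi-GPU commands in 4.2 and 4.3 all read the sharded output of `prepare_data.py` through a glob, so a shard that failed to write would be skipped silently by the glob itself. The following is a minimal sketch of a completeness check, assuming only the `rankXXXXX-of-YYYYY` file naming shown in section 3.2. The helper name `missing_shards` is hypothetical and is not part of the repository.

```python
import re
from pathlib import Path

# Matches the shard suffix shown in section 3.2,
# e.g. ".rank00003-of-00016.jsonl".
SHARD_RE = re.compile(r"\.rank(\d+)-of-(\d+)\.jsonl$")


def missing_shards(paths):
    """Return the sorted rank indices that are absent from `paths`.

    Raises ValueError if no path looks like a shard, or if the files
    disagree about the total shard count.
    """
    seen = set()
    totals = set()
    for path in paths:
        match = SHARD_RE.search(Path(path).name)
        if match is None:
            continue  # not a shard file; ignore it
        seen.add(int(match.group(1)))
        totals.add(int(match.group(2)))
    if not totals:
        raise ValueError("no rank-sharded files found")
    if len(totals) > 1:
        raise ValueError(f"inconsistent shard counts: {sorted(totals)}")
    total = totals.pop()
    return sorted(set(range(total)) - seen)
```

Passing the result of `glob.glob("prepared/train_with_codes.rank*.jsonl")` to `missing_shards` should give an empty list when every rank wrote its file. The check looks at file names only and does not confirm that each shard has valid contents.
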
### 4.4 Common tunable hyperparameters

`sft.py` now exposes the common training hyperparameters directly:

- Optimizer: `--learning-rate`, `--weight-decay`, `--adam-beta1`, `--adam-beta2`, `--adam-eps`
- LR schedule: `--lr-scheduler-type`, `--warmup-steps`, `--warmup-ratio`
- Stability: `--max-grad-norm`, `--mixed-precision`

Training logs now print:

- timestamped log prefixes
- `global_batch_size` and its formula
- `step_time`
- `steps_per_sec`
- `samples_per_sec`
- `eta`

### 4.5 Multi-node training

Update the following fields in the config file for your cluster:

- `num_machines`
- `num_processes`
- `machine_rank`
- `main_process_ip`
- `main_process_port`

For example, for 2 nodes and 16 GPUs:

- node 0: `machine_rank: 0`
- node 1: `machine_rank: 1`
- `num_machines: 2`
- `num_processes: 16`

The training command itself can stay unchanged.

## 5. One-click launcher

Run directly:

```bash
bash moss_tts_realtime/finetuning/run_train.sh
```

Common environment variables:

- `RAW_JSONL`: raw training JSONL
- `PREPARED_JSONL`: output file from `prepare_data.py`
- `TRAIN_JSONL`: optional; training input, which can be a single file, directory, or glob. If unset, it is inferred automatically from `PREPARED_JSONL`
- `OUTPUT_DIR`: training output directory
- `ACCELERATE_CONFIG_FILE`: optional; DDP / FSDP / ZeRO-3 config file
- `SKIP_PREPARE`: set to `1` to skip preprocessing and train directly from existing `TRAIN_JSONL` / `PREPARED_JSONL`
- `PREP_EXTRA_ARGS_STR`: extra arguments passed to `prepare_data.py`
- `PREP_ACCELERATE_ARGS_STR`: if you want preprocessing to also launch through `accelerate`, set this, for example `--num_processes 16` or `--config_file moss_tts_realtime/finetuning/configs/accelerate_ddp_8gpu.yaml`
- `TRAIN_EXTRA_ARGS_STR`: extra arguments passed to `sft.py`

For example, to launch with ZeRO-3:

```bash
RAW_JSONL=train_raw.jsonl \
PREPARED_JSONL=prepared/train_with_codes.jsonl \
OUTPUT_DIR=output/moss_tts_realtime_sft_zero3 \
ACCELERATE_CONFIG_FILE=moss_tts_realtime/finetuning/configs/accelerate_zero3_1.7b.yaml \
PREP_ACCELERATE_ARGS_STR='--config_file moss_tts_realtime/finetuning/configs/accelerate_ddp_8gpu.yaml' \
PREP_EXTRA_ARGS_STR='' \
TRAIN_EXTRA_ARGS_STR='--per-device-batch-size 1 --gradient-accumulation-steps 4 --num-epochs 3 --warmup-ratio 0.03 --mixed-precision bf16' \
bash moss_tts_realtime/finetuning/run_train.sh
```
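
Because the launcher starts from `RAW_JSONL`, it can be worth checking that file against the format in section 2 before spending time on preprocessing. The sketch below is one possible pre-flight check, not something shipped with the repository: the function names `validate_record` and `validate_jsonl` are hypothetical, and the rule that every record needs at least one assistant turn is an assumption drawn from the description of single-turn and multi-turn data. It does not check that the audio files exist.

```python
import json

VALID_ROLES = {"user", "assistant"}


def validate_record(record):
    """Return a list of problems found in one parsed JSONL record."""
    if not isinstance(record, dict):
        return ["record must be a JSON object"]
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["`conversations` must be a non-empty list"]
    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        if turn.get("role") not in VALID_ROLES:
            problems.append(f"turn {i}: role must be 'user' or 'assistant'")
        for key in ("text", "wav"):
            value = turn.get(key)
            if not isinstance(value, str) or not value:
                problems.append(f"turn {i}: missing or empty `{key}`")
    # Assumption: a record without an assistant turn has nothing to train on.
    if not any(isinstance(t, dict) and t.get("role") == "assistant" for t in turns):
        problems.append("no assistant turn found")
    # `ref_wav` is optional, but should be a path string when present.
    if "ref_wav" in record and not isinstance(record["ref_wav"], str):
        problems.append("`ref_wav` must be a string path")
    return problems


def validate_jsonl(path):
    """Map 1-based line numbers to problem lists for each bad line in `path`."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                report[lineno] = [f"invalid JSON: {exc.msg}"]
                continue
            problems = validate_record(record)
            if problems:
                report[lineno] = problems
    return report
```

An empty dictionary from `validate_jsonl("train_raw.jsonl")` means every line passed these structural checks. The speaker-consistency requirement for assistant turns cannot be verified from the JSONL alone and still needs to be ensured when the data is collected.
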