# MOSS-SoundEffect v2

**MOSS-SoundEffect v2.0** is a text-to-audio model with a Diffusion Transformer (DiT) backbone trained with the Flow Matching objective, paired with a DAC VAE and a Qwen3 text encoder.

## News

* **MOSS-SoundEffect v2.0** *(this directory)*: A new attempt at text-to-audio generation using a **DiT** backbone trained with the **Flow Matching** objective, replacing the discrete-token autoregressive backbone used in v1. Targets higher audio fidelity and more natural long-form environmental sound. Released on HuggingFace at [`OpenMOSS-Team/MOSS-SoundEffect-v2.0`](https://huggingface.co/OpenMOSS-Team/MOSS-SoundEffect-v2.0).
* **MOSS-SoundEffect v1.0**: The first release, built on the `MossTTSDelay` discrete-token autoregressive architecture shared with the rest of the MOSS-TTS family. See the [v1 model card](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_sound_effect_model_card.md) for architecture and usage details.

> **Note.** This subdirectory uses its **own** Python environment (Python
> 3.12, pinned `numpy==1.26`, `transformers==4.57`, `torch==2.9`) and is
> **not compatible** with the top-level MOSS-TTS environment. Install it in
> a clean, isolated environment as shown in
> [Environment Setup](#environment-setup) below.

## Environment Setup

We recommend a clean, isolated Python 3.12 environment to avoid dependency conflicts with the top-level MOSS-TTS environment.

### Using Conda

```bash
conda create -n moss-soundeffect-v2 python=3.12 -y
conda activate moss-soundeffect-v2
```

Clone the repository and install all required dependencies:

```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS/moss_soundeffect_v2
pip install --extra-index-url https://download.pytorch.org/whl/cu128 \
    -e ".[torch-cu128,finetune]"
```

For a minimal **inference-only** install (still ships the Gradio demo; skips the fine-tuning extras `accelerate` / `peft` / `pandas` / `torchcodec`):

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu128 \
    -e ".[torch-cu128]"
```

## Inference

```python
import torch
from moss_soundeffect_v2 import MossSoundEffectPipeline

pipe = MossSoundEffectPipeline.from_pretrained(
    "OpenMOSS-Team/MOSS-SoundEffect-v2.0",  # HF hub repo id, or a local dir
    torch_dtype=torch.bfloat16,
    device="cuda",
)
audio = pipe(
    prompt="The crisp, rhythmic click-clack of fast typing on a mechanical keyboard.",
    seconds=10,
    num_inference_steps=100,
    cfg_scale=4.0,
)  # (B, C, T) waveform tensor
pipe.save_audio(audio, "out.wav")
```

Command-line: `bash infer_from_pipeline.sh`

> The bundled shell scripts accept either a HF hub repo id or a
> local directory; weights are auto-downloaded into the HuggingFace cache
> on first use.

> The underlying DiT is wrapped with `torch.compile` + Triton CUDA Graph for
> acceleration. The first call may take a few minutes to compile. If you hit
> `TorchDynamo` / Triton compile errors, set `TORCHDYNAMO_DISABLE=1` before
> launching Python; the bundled shell scripts already do this.

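When you launch from your own Python script rather than the bundled shell scripts, the same `TORCHDYNAMO_DISABLE=1` fallback can be applied in-process. The following is a minimal sketch, assuming the variable must be set before `torch` is first imported (PyTorch reads it when its Dynamo configuration is loaded). The helper name `disable_torch_compile` is hypothetical and not part of this package.

```python
import os
import sys


def disable_torch_compile() -> bool:
    """Set TORCHDYNAMO_DISABLE=1 for the current process.

    Hypothetical helper, not part of moss_soundeffect_v2.
    Returns True if torch had not been imported yet (so the setting is
    expected to take effect), and False if torch was already loaded, in
    which case the setting may be ignored.
    """
    torch_already_loaded = "torch" in sys.modules
    os.environ["TORCHDYNAMO_DISABLE"] = "1"
    return not torch_already_loaded


# Call this at the very top of your script, before any torch import:
#
#     disable_torch_compile()
#     import torch
#     from moss_soundeffect_v2 import MossSoundEffectPipeline
```

Disabling compilation trades the acceleration described above for a more predictable startup, so it is probably best kept as a fallback rather than a default.
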
## Gradio demo

```bash
SOUNDEFFECT_MODEL_DIR=OpenMOSS-Team/MOSS-SoundEffect-v2.0 \
    python ../clis/moss_sound_effect_app.py
```

## Fine-tuning

Full-parameter DiT fine-tune from an existing HF directory:

```bash
HF_MODEL_DIR=OpenMOSS-Team/MOSS-SoundEffect-v2.0 \
METADATA_PATH=/path/to/captions.jsonl \
OUTPUT_PATH=./output/my_finetune \
    bash finetuning/finetuning.sh
```

### Metadata format

`METADATA_PATH` is a JSON Lines file with two required fields per line: `audio` (path to the audio file, relative to `--dataset_base_path`) and `prompt` (caption text in English or Chinese).

```jsonl
{"audio": "wavs/birdsong.wav", "prompt": "清晨小鸟叽叽喳喳地叫着,叫声清脆悦耳。"}
{"audio": "wavs/brushing_teeth.wav", "prompt": "刷牙的声音,牙刷毛摩擦牙齿的那种沙沙声。"}
{"audio": "wavs/pooring_water.wav", "prompt": "Pouring water into a glass, clear liquid flowing sound, pitch rising as the glass fills up, refreshing."}
```

## Export a fine-tuned checkpoint

Training auto-exports the latest checkpoint to `/hf_format/`. To convert any other fine-tuned DiT `.safetensors` checkpoint (e.g. an intermediate `epoch-0.safetensors`) into a HF directory without re-running training:

```bash
CKPT_PATH=/path/to/output/finetune/epoch-0.safetensors \
SOURCE_HF_DIR=OpenMOSS-Team/MOSS-SoundEffect-v2.0 \
OUTPUT_DIR=./output/finetune/hf_format_epoch0 \
    bash finetuning/export_to_hf.sh
```

`SOURCE_HF_DIR` is the HF directory (or hub repo id) you fine-tuned from. Its frozen sub-modules (VAE / text encoder / tokenizer / scheduler) are copied unchanged into the output, so you do **not** need to re-download Qwen3 or the DAC VAE. The resulting directory can be loaded by `MossSoundEffectPipeline.from_pretrained(OUTPUT_DIR)`.

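Before starting a fine-tuning run, it can help to sanity-check the metadata file against the two-field format described under "Metadata format". The sketch below is not part of the repository: `validate_metadata` is a hypothetical helper, and it assumes that each `audio` path resolves relative to the directory you pass as `--dataset_base_path`.

```python
import json
from pathlib import Path


def validate_metadata(metadata_path, dataset_base_path):
    """Return a list of human-readable problems found in a JSONL metadata file.

    Hypothetical helper: checks that every non-blank line is a JSON object
    with non-empty string fields 'audio' and 'prompt', and that the audio
    file exists relative to dataset_base_path.
    """
    base = Path(dataset_base_path)
    problems = []
    with open(metadata_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {lineno}: expected a JSON object")
                continue
            for field in ("audio", "prompt"):
                value = record.get(field)
                if not isinstance(value, str) or not value.strip():
                    problems.append(f"line {lineno}: missing or empty '{field}'")
            audio = record.get("audio")
            if isinstance(audio, str) and audio.strip():
                if not (base / audio).is_file():
                    problems.append(f"line {lineno}: audio file not found: {audio}")
    return problems


# Example usage (paths are placeholders):
#
#     for problem in validate_metadata("/path/to/captions.jsonl", "/path/to/dataset"):
#         print(problem)
```

An empty list means every record has both required fields and a resolvable audio path; it says nothing about whether the audio itself is decodable.
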