# VibeVoice ASR LoRA Fine-tuning

This directory contains scripts for LoRA (Low-Rank Adaptation) fine-tuning of the VibeVoice ASR model.

## Requirements

```bash
# Install vibevoice first
pip install -e .

pip install peft
```

## Toy Dataset

> **Note**: The `toy_dataset/` included in this directory contains **synthetic audio generated by VibeVoice TTS** for demonstration purposes only. It is NOT a full fine-tuning dataset.
>
> When using your own data, you should:
> - Prepare real audio recordings with accurate transcriptions
> - Adjust hyperparameters (learning rate, epochs, LoRA rank) based on your dataset size and domain
> - Consider the audio quality and speaker diversity in your data

## Data Format

Training data should be organized as pairs of audio files and JSON labels in the same directory:

```
toy_dataset/
├── 0.mp3
├── 0.json
├── 1.mp3
├── 1.json
└── ...
```

### JSON Label Format

Each JSON file should have the following structure:

```json
{
  "audio_duration": 351.73,
  "audio_path": "0.mp3",
  "segments": [
    {
      "speaker": 0,
      "text": "Hey everyone, welcome back...",
      "start": 0.0,
      "end": 38.68
    },
    {
      "speaker": 1,
      "text": "Thanks for having me...",
      "start": 38.75,
      "end": 77.88
    }
  ],
  "customized_context": ["Tea Brew", "Aiden Host", "The property is near Meter Street."]
}
```

The `customized_context` field is optional. It holds domain-specific terms or context sentences.

## Training

### Basic

```bash
# 1 GPU
torchrun --nproc_per_node=1 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --bf16 \
    --report_to none

# Specific GPUs (e.g., GPU 0,1,2,3)
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --bf16 \
    --report_to none
```

### Full Options

The script uses Hugging Face's
`TrainingArguments`, so all standard options are available:

```bash
torchrun --nproc_per_node=4 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --lora_r 16 \
    --lora_alpha 32 \
    --lora_dropout 0.05 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --weight_decay 0.01 \
    --max_grad_norm 1.0 \
    --logging_steps 10 \
    --save_steps 100 \
    --gradient_checkpointing \
    --bf16 \
    --report_to none
```

### Key Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--lora_r` | 16 | LoRA rank (lower = fewer params, higher = more expressive) |
| `--lora_alpha` | 32 | LoRA scaling factor (typically 2x rank) |
| `--lora_dropout` | 0.05 | Dropout for LoRA layers |
| `--per_device_train_batch_size` | 8 | Batch size per device |
| `--gradient_accumulation_steps` | 1 | Effective batch size = batch_size × grad_accum × number of GPUs |
| `--learning_rate` | 5e-5 | Learning rate (1e-4 to 2e-4 typical for LoRA) |
| `--gradient_checkpointing` | False | Enable to reduce memory usage |
| `--use_customized_context` | True | Include `customized_context` from JSON as additional context |
| `--max_audio_length` | None | Skip audio longer than this (seconds) |

## Inference with Fine-tuned Model

```bash
python inference_lora.py \
    --base_model microsoft/VibeVoice-ASR \
    --lora_path ./output \
    --audio_file ./toy_dataset/0.mp3 \
    --context_info "Tea Brew, Aiden Host"
```

## Merging LoRA Weights (Optional)

To merge LoRA weights into the base model for faster inference:

```python
from peft import PeftModel

# VibeVoiceASRForConditionalGeneration must also be imported from the installed vibevoice package.

# Load base model + LoRA
model = VibeVoiceASRForConditionalGeneration.from_pretrained("microsoft/VibeVoice-ASR", ...)
model = PeftModel.from_pretrained(model, "./output")

# Merge and save
model = model.merge_and_unload()
model.save_pretrained("./merged_model")
```
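
## Checking Label Files (Optional)

Before training on your own data, it can help to check each label file against the structure described under JSON Label Format. The following is a minimal sketch, not part of the provided scripts: the helper names `validate_label` and `validate_dataset` are hypothetical, and the specific checks (required keys, `start < end`, segment ends within `audio_duration`, audio file present next to the JSON) are assumptions about what a well-formed label looks like, not rules enforced by `lora_finetune.py`.

```python
import json
from pathlib import Path

# Keys each segment carries in the example label (assumed to be required).
REQUIRED_SEGMENT_KEYS = {"speaker", "text", "start", "end"}


def validate_label(label: dict) -> list[str]:
    """Return a list of problems found in one label dict (empty if none)."""
    problems = []
    for key in ("audio_path", "segments"):
        if key not in label:
            problems.append(f"missing key: {key}")

    duration = label.get("audio_duration")
    for i, seg in enumerate(label.get("segments", [])):
        if not isinstance(seg, dict):
            problems.append(f"segment {i}: not an object")
            continue
        missing = REQUIRED_SEGMENT_KEYS - seg.keys()
        if missing:
            problems.append(f"segment {i}: missing {sorted(missing)}")
            continue
        if seg["start"] >= seg["end"]:
            problems.append(f"segment {i}: start >= end")
        if duration is not None and seg["end"] > duration:
            problems.append(f"segment {i}: end exceeds audio_duration")

    # customized_context is optional; when present it should be a list of strings.
    context = label.get("customized_context", [])
    if not (isinstance(context, list) and all(isinstance(c, str) for c in context)):
        problems.append("customized_context must be a list of strings")
    return problems


def validate_dataset(data_dir: str) -> dict[str, list[str]]:
    """Check every *.json label in data_dir; map file name -> list of problems."""
    report = {}
    for json_path in sorted(Path(data_dir).glob("*.json")):
        label = json.loads(json_path.read_text(encoding="utf-8"))
        problems = validate_label(label)
        # audio_path is assumed to be relative to the label's own directory.
        audio = label.get("audio_path")
        if audio and not (json_path.parent / audio).is_file():
            problems.append(f"audio file not found: {audio}")
        if problems:
            report[json_path.name] = problems
    return report
```

Calling `validate_dataset("./toy_dataset")` returns a dictionary containing only the files with problems, so an empty result means every label passed these checks.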