VibeVoice 1.5B Single-Speaker Fine-tuning Guide
This folder contains all the files needed to fine-tune VibeVoice 1.5B for a single speaker (Elise voice).
Key Improvements
- Fixed EOS Token Issue: The modified data_vibevoice.py adds a proper <|endoftext|> token after speech generation to prevent repetition/looping
- Single-Speaker Training: Uses voice_prompt_drop_rate=1.0 to train without voice prompts
- Audio Quality Filter: Removes training samples with abrupt cutoffs
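To illustrate why a drop rate of 1.0 means the model never sees a voice prompt during training, here is a minimal sketch of probabilistic prompt dropping. The function name and signature are hypothetical and are not taken from the training code; only the idea of a per-sample drop probability comes from the setting above.

```python
import random


def maybe_drop_voice_prompt(voice_prompt, drop_rate, rng=random):
    """Return None (prompt dropped) with probability drop_rate, else the prompt.

    Hypothetical helper for illustration; not the actual VibeVoice code.
    """
    # rng.random() returns a float in [0.0, 1.0), so drop_rate=1.0 always
    # drops the prompt and drop_rate=0.0 never drops it.
    if rng.random() < drop_rate:
        return None
    return voice_prompt
```

With drop_rate=1.0 every sample is trained without a reference voice, which is what makes this a single-speaker setup.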
Files Included
- data_vibevoice.py - CRITICAL: Modified data collator that adds EOS token (replaces src/data_vibevoice.py)
- prepare_jinsaryko_elise_dataset.py - Downloads and prepares the Elise dataset
- detect_audio_cutoffs.py - Detects audio files with abrupt endings
- finetune_elise_single_speaker.sh - Training script for single-speaker model
- test_fixed_eos_dummy_voice.py - Test script for inference
Quick Start
Prepare the dataset:
python prepare_jinsaryko_elise_dataset.py
Detect and remove bad audio (optional but recommended). This creates an elise_cleaned/ folder with good samples only:
python detect_audio_cutoffs.py
IMPORTANT: Replace the data collator:
cp data_vibevoice.py ../src/data_vibevoice.py
Train the model:
./finetune_elise_single_speaker.sh
Test the model:
python test_fixed_eos_dummy_voice.py
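For readers curious about what the optional cutoff-detection step might look for, the sketch below flags a clip whose final milliseconds are still loud. This is an assumed heuristic, not the logic of detect_audio_cutoffs.py; the function name, the 50 ms window, and the 0.02 RMS threshold are all illustrative choices.

```python
import math


def ends_abruptly(samples, sample_rate, tail_ms=50, rms_threshold=0.02):
    """Heuristic: return True if the last tail_ms of the clip is still loud.

    samples: list of mono PCM floats in [-1.0, 1.0].
    tail_ms and rms_threshold are illustrative defaults, not tuned values.
    """
    tail_len = max(1, int(sample_rate * tail_ms / 1000))
    tail = samples[-tail_len:]
    if not tail:
        return False  # empty clip: nothing to judge
    # Root-mean-square energy of the tail window.
    rms = math.sqrt(sum(s * s for s in tail) / len(tail))
    return rms > rms_threshold


# A tone that stops at full amplitude versus the same tone followed by silence.
sr = 16000
loud = [0.5 * math.sin(2 * math.pi * 220 * n / sr) for n in range(sr)]
padded = loud + [0.0] * int(0.1 * sr)
print(ends_abruptly(loud, sr), ends_abruptly(padded, sr))
```

A clip that trails off into silence passes, while one that is chopped mid-word is flagged.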
Training Configuration
Key settings in finetune_elise_single_speaker.sh:
- voice_prompt_drop_rate 1.0 - Always drops voice prompts (single-speaker mode)
- learning_rate 2.5e-5 - Conservative learning rate
- ddpm_batch_mul 2 - Diffusion batch multiplier
- diffusion_loss_weight 1.4 - Diffusion loss weight
- ce_loss_weight 0.04 - Cross-entropy loss weight
How It Works
- The model learns to associate "Speaker 0:" with Elise's voice
- No voice samples needed during inference
- Proper EOS token ensures clean endings without repetition
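The EOS fix can be pictured as making sure every training target ends with the end-of-text token, so the model learns when to stop. The sketch below is illustrative only and is not the code in data_vibevoice.py; the function name is hypothetical, and the token ids in the example are placeholders (a real id would come from the tokenizer).

```python
def append_eos(token_ids, eos_token_id):
    """Return a copy of token_ids that ends with the EOS id.

    Hypothetical helper; the input list is left unmodified.
    """
    ids = list(token_ids)
    # Only append when the sequence does not already end with EOS.
    if not ids or ids[-1] != eos_token_id:
        ids.append(eos_token_id)
    return ids


# Placeholder ids for illustration only.
labels = append_eos([5, 6, 7], eos_token_id=0)
```

Without a terminating token in the targets, the model has no training signal for stopping, which is consistent with the repetition/looping described above.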
Dataset Format
The training data should be JSONL with this format:
{"text": "Speaker 0: Hello, this is a test.", "audio": "/path/to/audio.wav"}
Note: The "Speaker 0:" prefix is REQUIRED for all text entries.
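Before training, it can help to check that every line of the JSONL file matches this format. The following is a minimal validation sketch; the function name and error messages are made up for illustration, and it only checks the two fields and the prefix described above (it does not check that the audio file exists).

```python
import json

REQUIRED_PREFIX = "Speaker 0:"


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; empty means all OK."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((lineno, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        text = record.get("text")
        if not isinstance(text, str) or not text.startswith(REQUIRED_PREFIX):
            problems.append((lineno, "text must start with 'Speaker 0:'"))
        audio = record.get("audio")
        if not isinstance(audio, str) or not audio:
            problems.append((lineno, "audio must be a non-empty path string"))
    return problems
```

To check a file, pass an open file handle, for example `validate_jsonl_lines(open(path, encoding="utf-8"))`, where `path` points at your own dataset file.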