
VibeVoice 1.5B Single-Speaker Fine-tuning Guide

This folder contains all the files needed to fine-tune VibeVoice 1.5B for a single speaker (Elise voice).

Key Improvements

  1. Fixed EOS Token Issue: The modified data_vibevoice.py appends a proper <|endoftext|> token after speech generation so the model stops cleanly instead of repeating/looping (see the sketch after this list)
  2. Single-Speaker Training: Uses voice_prompt_drop_rate=1.0 to train without voice prompts
  3. Audio Quality Filter: Removes training samples with abrupt cutoffs
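
A minimal sketch of the EOS fix, assuming a GPT-style tokenizer that exposes <|endoftext|>; the real change lives in the modified data_vibevoice.py, and the field names here are illustrative:

    def append_eos(input_ids, labels, tokenizer):
        """Append <|endoftext|> so the model learns to stop after speech."""
        eos_id = tokenizer.convert_tokens_to_ids("<|endoftext|>")
        # Without a terminal EOS in the training targets, the model never
        # learns to stop and keeps sampling speech tokens (the looping bug).
        return input_ids + [eos_id], labels + [eos_id]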

Files Included

  • data_vibevoice.py - CRITICAL: Modified data collator that adds EOS token (replaces src/data_vibevoice.py)
  • prepare_jinsaryko_elise_dataset.py - Downloads and prepares the Elise dataset
  • detect_audio_cutoffs.py - Detects audio files with abrupt endings
  • finetune_elise_single_speaker.sh - Training script for single-speaker model
  • test_fixed_eos_dummy_voice.py - Test script for inference

Quick Start

  1. Prepare the dataset:

    python prepare_jinsaryko_elise_dataset.py
    
  2. Detect and remove bad audio (optional but recommended; a sketch of the cutoff heuristic follows this list):

    python detect_audio_cutoffs.py
    # Creates an elise_cleaned/ folder containing only the good samples
    
  3. IMPORTANT: Replace the data collator:

    cp data_vibevoice.py ../src/data_vibevoice.py
    
  4. Train the model:

    ./finetune_elise_single_speaker.sh
    
  5. Test the model:

    python test_fixed_eos_dummy_voice.py
    
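
The cutoff detector in step 2 boils down to checking whether a clip still carries high energy at its very end. A minimal sketch of that heuristic, assuming soundfile and numpy are available (the thresholds and exact method in detect_audio_cutoffs.py may differ):

    import numpy as np
    import soundfile as sf

    def ends_abruptly(path, tail_ms=50, rel_threshold=0.5):
        """Flag clips whose final samples are still loud (likely cut off)."""
        audio, sr = sf.read(path)
        if audio.ndim > 1:                  # mix stereo down to mono
            audio = audio.mean(axis=1)
        tail = audio[-int(sr * tail_ms / 1000):]
        tail_rms = np.sqrt(np.mean(tail ** 2))
        full_rms = np.sqrt(np.mean(audio ** 2))
        # A natural ending decays toward silence; a hard cutoff stays loud.
        return tail_rms > rel_threshold * full_rms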

Training Configuration

Key settings in finetune_elise_single_speaker.sh:

  • voice_prompt_drop_rate 1.0 - Always drops voice prompts (single-speaker mode)
  • learning_rate 2.5e-5 - Conservative learning rate
  • ddpm_batch_mul 2 - Batch multiplier for the diffusion objective
  • diffusion_loss_weight 1.4 - Weight on the diffusion (acoustic) loss
  • ce_loss_weight 0.04 - Weight on the cross-entropy (text token) loss
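
The two loss weights combine the text and acoustic objectives into a single training loss. A hedged sketch of how such a weighted sum is typically formed (variable names are illustrative, not VibeVoice internals):

    ce_loss_weight = 0.04          # matches the setting above
    diffusion_loss_weight = 1.4

    def combined_loss(ce_loss, diffusion_loss):
        """Weighted sum of the token (CE) and acoustic (diffusion) losses.

        The small CE weight keeps the text objective from dominating,
        while the diffusion term drives most of the acoustic learning.
        """
        return ce_loss_weight * ce_loss + diffusion_loss_weight * diffusion_loss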

How It Works

  1. The model learns to associate the "Speaker 0:" prefix with Elise's voice (illustrated after this list)
  2. No voice sample is needed at inference time
  3. The proper EOS token ensures clean endings without repetition
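
A minimal illustration of the prompt convention at inference time (this helper is hypothetical, not part of test_fixed_eos_dummy_voice.py):

    def build_prompt(text: str) -> str:
        """Prefix inference text the same way the training data was formatted."""
        # The fine-tuned model maps "Speaker 0:" to Elise's voice,
        # so no reference audio clip is needed.
        return f"Speaker 0: {text.strip()}"

    build_prompt("Hello there.")   # -> "Speaker 0: Hello there."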

Dataset Format

The training data should be a JSONL file with one entry per line in this format:

{"text": "Speaker 0: Hello, this is a test.", "audio": "/path/to/audio.wav"}

Note: The "Speaker 0:" prefix is REQUIRED for all text entries.
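
A minimal sketch for writing (and validating) entries in this format, assuming a list of (text, wav_path) pairs; the helper below is hypothetical and not part of the included scripts:

    import json

    def write_manifest(pairs, out_path="train.jsonl"):
        """Write (text, wav_path) pairs as one JSON object per line."""
        with open(out_path, "w", encoding="utf-8") as f:
            for text, wav_path in pairs:
                if not text.startswith("Speaker 0:"):
                    # The prefix is required; without it the model has no
                    # speaker tag to associate with Elise's voice.
                    text = f"Speaker 0: {text}"
                f.write(json.dumps({"text": text, "audio": wav_path}) + "\n")

    write_manifest([("Hello, this is a test.", "/path/to/audio.wav")])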
