license: cc-by-4.0

BUD-E Whisper: Emotional Speech Captioning Model

BUD-E Whisper is a suite of Whisper models fine-tuned for direct emotional speech captioning. The core models are built upon OpenAI's Whisper architecture, with the current primary variant being a fine-tune of OpenAI Whisper Small. These models are designed to generate text captions that not only transcribe speech but also inherently reflect its emotional content.

The embeddings generated by BUD-E Whisper can also serve as input for Empathic Insight - Voice, a downstream ensemble of Multi-Layer Perceptrons (MLPs) designed to predict dimensional emotion scores.
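
To make the embedding-to-score idea concrete, here is a minimal pure-Python sketch of how pooled encoder embeddings could feed a set of small MLP heads. It is an illustration only: the mean pooling, the one-head-per-dimension layout, the layer sizes, and the names `mean_pool`, `TinyMLP`, and `predict_scores` are all assumptions, and the random weights stand in for trained parameters. It does not reproduce the actual Empathic Insight - Voice architecture.

```python
import random


def mean_pool(frames):
    """Average a sequence of per-frame embedding vectors into one vector."""
    count = len(frames)
    return [sum(column) / count for column in zip(*frames)]


class TinyMLP:
    """One-hidden-layer MLP with ReLU; weights are random placeholders."""

    def __init__(self, in_dim, hidden_dim, seed):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        self.b2 = 0.0

    def __call__(self, x):
        # Hidden layer: affine transform followed by ReLU.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        # Output layer: a single scalar score.
        return sum(w * h for w, h in zip(self.w2, hidden)) + self.b2


def predict_scores(frames, heads):
    """Pool the frame embeddings, then run one MLP head per dimension."""
    pooled = mean_pool(frames)
    return {name: head(pooled) for name, head in heads.items()}


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real encoder frames are much wider.
    frames = [[1.0, 2.0], [3.0, 4.0]]
    heads = {"valence": TinyMLP(2, 4, seed=0), "arousal": TinyMLP(2, 4, seed=1)}
    print(predict_scores(frames, heads))
```

Keeping one head per dimension means each score can be trained and swapped independently, which is one plausible reading of "ensemble" here.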

License

This model is released under the CC BY 4.0 license. Please attribute Maurice Kraus and Christoph Schuhmann, who created this model.

Training Data

BUD-E Whisper was trained on a combination of:

Training Procedure & Caption Generation

A key aspect of BUD-E Whisper's development was a multi-step caption refinement process to create rich training targets:

  1. Initial Score Generation: An iterative process using Gemini Flash 2.0 generated initial scores for every audio snippet: 40 emotion dimensions on a 0-4 scale, plus 15 additional dimensions such as age, arousal, valence, dominance, harshness, and vocal bursts.
  2. Templated Captions: These scores were converted into templated string captions.
  3. Paraphrasing for Richness: Gemini Flash 2.0 was then used to paraphrase these templated captions, creating diverse and semantically rich training targets.
  4. Fine-tuning: Whisper models of various sizes (including OpenAI Whisper Small, the current primary variant) were fine-tuned on these refined, emotionally aware captions.
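
Step 2 above (scores to templated captions) can be sketched roughly as follows. This is a hypothetical template, not the one used for training: the emotion names, the intensity wording, the `top_k` cutoff, and the function name `templated_caption` are illustrative assumptions. In the described pipeline, a string like this would then be paraphrased by the language model in step 3.

```python
# Hypothetical mapping from a 0-4 score to an intensity word (assumption).
INTENSITY_WORDS = {1: "slight", 2: "moderate", 3: "strong", 4: "extreme"}


def templated_caption(emotion_scores, attributes=None, top_k=3):
    """Turn 0-4 emotion scores plus optional attributes into one caption."""
    present = []
    for name, score in emotion_scores.items():
        # Round to an integer level and clamp it to the 0-4 scale.
        level = max(0, min(4, int(round(score))))
        if level > 0:
            present.append((name, level))
    # Strongest emotions first; ties are broken alphabetically.
    present.sort(key=lambda item: (-item[1], item[0]))
    present = present[:top_k]

    if present:
        emotion_part = ", ".join(
            f"{INTENSITY_WORDS[level]} {name}" for name, level in present
        )
    else:
        emotion_part = "no clear emotion"
    caption = f"The speaker expresses {emotion_part}."

    if attributes:
        details = "; ".join(
            f"{key}: {value}" for key, value in sorted(attributes.items())
        )
        caption += f" Additional attributes: {details}."
    return caption


if __name__ == "__main__":
    scores = {"amusement": 3, "sadness": 0, "anger": 1, "awe": 3}
    extras = {"valence": "positive", "arousal": "high"}
    print(templated_caption(scores, extras))
```

Captions produced this way are uniform by construction, which suggests why a paraphrasing step is needed to diversify the training targets.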

This multi-step caption refinement was crucial for performance. Both direct score regression and training on plain templated captions were found to perform worse for emotional speech captioning with Whisper models.

Intended Use

  • Generating emotionally nuanced captions for audio content.
  • Providing rich embeddings for downstream emotion recognition tasks (e.g., with Empathic Insight - Voice).
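
Both uses can be sketched with the Hugging Face `transformers` Whisper classes. The sketch assumes the checkpoint loads with `WhisperProcessor` and `WhisperForConditionalGeneration`, as standard Whisper fine-tunes do. The repository ID is not stated here, so `model_id` is left as a required argument. Mean pooling of the encoder output is an assumption about how an utterance-level embedding might be formed, not a documented step, and `caption_and_embed` is a hypothetical helper name.

```python
def caption_and_embed(model_id, waveform, sampling_rate=16000):
    """Return (caption, pooled_embedding) for one mono audio clip.

    model_id: Hub ID or local path of the fine-tuned checkpoint (caller-supplied).
    waveform: 1-D float array of audio samples at `sampling_rate` (16 kHz for Whisper).
    """
    # Imported lazily so the module can be loaded without the heavy dependencies.
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    # Convert raw audio into the log-mel input features Whisper expects.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")

    with torch.no_grad():
        # Decode the emotional caption.
        token_ids = model.generate(inputs.input_features, max_new_tokens=128)
        # Run the encoder alone to get per-frame hidden states.
        encoder_out = model.get_encoder()(inputs.input_features)

    caption = processor.batch_decode(token_ids, skip_special_tokens=True)[0]
    # Assumption: average over time to get one vector per clip.
    pooled = encoder_out.last_hidden_state.mean(dim=1).squeeze(0)
    return caption, pooled
```

A call would look like `caption, embedding = caption_and_embed(model_id, audio_array)`, where `audio_array` is a 16 kHz mono waveform loaded with any audio library.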