---
license: cc-by-4.0
---

# BUD-E Whisper: Emotional Speech Captioning Model

**BUD-E Whisper** is a suite of Whisper models fine-tuned for **direct emotional speech captioning**. The models are built on OpenAI's Whisper architecture; the current primary variant is a fine-tune of **OpenAI Whisper Small**. They generate text captions that not only transcribe speech but also reflect its emotional content.

The embeddings produced by BUD-E Whisper can also serve as input for **Empathic Insight - Voice**, a downstream ensemble of Multi-Layer Perceptrons (MLPs) that predicts dimensional emotion scores.

## License

This model is released under the CC-BY-4.0 license. Please give attribution to Maurice Kraus & Christoph Schuhmann, the creators of this model.

## Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1VoAtmNhY1hI5Yzv1_dppHTcYky82OCDK?usp=sharing)

## Training Data

BUD-E Whisper was trained on a combination of:

* The **[Laion's Got Talent (Enhanced Flash Annotations and Long Captions) dataset](https://huggingface.co/datasets/laion/laions_got_talent_enhanced_flash_annotations_and_long_captions)**.
* An **internal dataset** comprising approximately **5,000 hours of public vlogs** and similar audio content.

## Training Procedure & Caption Generation

A key aspect of BUD-E Whisper's development was a multi-step caption refinement process that produced rich training targets:

1. **Initial Score Generation:** An iterative process using Gemini Flash 2.0 generated initial 40-dimensional emotion scores (0-4 scale), plus 15 additional dimensions such as age, arousal, valence, dominance, harshness, and vocal bursts, for all audio snippets.
2. **Templated Captions:** These scores were converted into templated string captions.
3. **Paraphrasing for Richness:** Gemini Flash 2.0 was then used to paraphrase the templated captions, creating diverse and semantically rich training targets.
4. **Fine-tuning:** Various Whisper model sizes (including the aforementioned fine-tune of OpenAI Whisper Small) were fine-tuned on these refined, emotionally aware captions.

This multi-step caption refinement was crucial for performance: direct score regression and simple templated captions both led to suboptimal results for emotional speech captioning with Whisper models.

## Intended Use

* Generating emotionally nuanced captions for audio content.
* Providing rich embeddings for downstream emotion recognition tasks (e.g., with Empathic Insight - Voice).
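
## Example: Generating an Emotional Caption

The snippet below is a minimal sketch of how the fine-tuned checkpoint could be used for emotional captioning, assuming it is published in the standard Hugging Face Whisper format (`WhisperProcessor` / `WhisperForConditionalGeneration`). The repository id `laion/BUD-E-Whisper` and the file `example.wav` are placeholders; substitute the actual model id (see the Colab notebook above for a complete walkthrough).

```python
# Minimal sketch: emotional captioning with a BUD-E Whisper checkpoint.
# Assumes the fine-tune is published in the standard Hugging Face Whisper format;
# the repository id below is a placeholder, not the confirmed model id.
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "laion/BUD-E-Whisper"  # placeholder repository id

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.eval()

# Whisper expects 16 kHz mono audio.
audio, _ = librosa.load("example.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(inputs.input_features, max_new_tokens=128)

caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)  # transcription enriched with a description of the emotional content
```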
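
## Example: Extracting Embeddings for Downstream Models

For downstream use such as Empathic Insight - Voice, the Whisper encoder's hidden states can serve as audio embeddings. The sketch below shows one way to obtain them; the exact features and pooling expected by the downstream MLP ensemble are not specified here, so mean pooling over time is an illustrative assumption, and the repository id is again a placeholder.

```python
# Minimal sketch: extracting encoder embeddings from a BUD-E Whisper checkpoint.
# The repository id is a placeholder; mean pooling over time is an illustrative
# assumption, not necessarily what Empathic Insight - Voice consumes.
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "laion/BUD-E-Whisper"  # placeholder repository id

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.eval()

audio, _ = librosa.load("example.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    encoder_outputs = model.get_encoder()(inputs.input_features)

# (batch, frames, hidden_dim) -> (batch, hidden_dim) via mean pooling over time.
embedding = encoder_outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)
```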