nielsr (HF Staff) committed · verified
Commit 0e806f5 · Parent(s): 1d7c4d3

Improve model card: set pipeline tag, add library name, link project page


This PR improves the model card by:
- Setting `pipeline_tag` to `automatic-speech-recognition` to reflect the model's audio processing nature.
- Adding `library_name: transformers` to specify the relevant library.
- Linking the project page in the model description.

Files changed (1): README.md (+60, -68)
README.md CHANGED
@@ -1,35 +1,46 @@
  ---
- tags:
- - model_hub_mixin
- - pytorch_model_hub_mixin
- license: apache-2.0
+ base_model:
+ - openai/whisper-large-v3
  language:
  - en
+ license: bsd-2-clause
  metrics:
  - accuracy
- base_model:
- - openai/whisper-large-v3
- pipeline_tag: audio-classification
+ pipeline_tag: automatic-speech-recognition
+ tags:
+ - model_hub_mixin
+ - pytorch_model_hub_mixin
+ - speech_emotion_recognition
+ library_name: transformers
  ---
- # Whisper Large v3 for Speech Flow (Fluency) Classification
+
+ # Whisper-Large V3 for Categorical Emotion Classification
 
  # Model Description
- This model includes the implementation of speech fluency classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits (https://arxiv.org/pdf/2505.14648).
+ This model includes the implementation of categorical emotion classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits (https://arxiv.org/pdf/2505.14648).
 
- The model first classifies the speech, using a 3-second window size and a 1-second step size, into one of:
- ```
- ["fluent", "disfluent"]
- ```
- If disfluent speech is detected, we predict the disfluency types in:
- ```
+ The training pipeline is also the top-performing solution (SAILER) in the INTERSPEECH 2025 Speech Emotion Challenge (https://lab-msp.com/MSP-Podcast_Competition/IS2025/).
+ Note that, compared to our official challenge submission, we did not use all the augmentations or the transcripts; instead we built a speech-only system to keep the model simple but still effective.
+
+ We use the MSP-Podcast data to train this model, noting that the model might be sensitive to content information when making emotion predictions. However, this could be a good feature for classifying emotions from online content.
+
+ The included emotions are:
+ <pre>
  [
- "Block",
- "Prolongation",
- "Sound Repetition",
- "Word Repetition",
- "Interjection"
+ 'Anger',
+ 'Contempt',
+ 'Disgust',
+ 'Fear',
+ 'Happiness',
+ 'Neutral',
+ 'Sadness',
+ 'Surprise',
+ 'Other'
  ]
- ```
+ </pre>
+
+ - Library: https://github.com/tiantiaf0627/vox-profile-release
 
  # How to use this model
 
@@ -49,60 +60,41 @@ pip install -e .
  # Load libraries
  import torch
  import torch.nn.functional as F
- from src.model.fluency.whisper_fluency import WhisperWrapper
-
+ from src.model.emotion.whisper_emotion import WhisperWrapper
  # Find device
  device = torch.device("cuda") if torch.cuda.is_available() else "cpu"
-
  # Load model from Huggingface
- model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-speech-flow").to(device)
+ model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-msp-podcast-emotion").to(device)
  model.eval()
  ```
 
  ## Prediction
  ```python
- audio_data = torch.zeros([1, 16000*10]).float().to(device)
- audio_segment = (audio_data.shape[1] - 3*16000) // 16000 + 1
- if audio_segment < 1: audio_segment = 1
- input_audio = list()
- input_audio_length = list()
- for idx in range(audio_segment):
-     input_audio.append(audio_data[0, 16000*idx:16000*idx+3*16000])
-     input_audio_length.append(torch.tensor(len(audio_data[0, 16000*idx:16000*idx+3*16000])))
- input_audio = torch.stack(input_audio, dim=0)
- input_audio_length = torch.stack(input_audio_length, dim=0)
- ```
- ## Prediction
- ```python
- fluency_outputs, disfluency_type_outputs = model(input_audio, length=input_audio_length)
- fluency_prob = F.softmax(fluency_outputs, dim=1).detach().cpu().numpy().astype(float).tolist()
-
- disfluency_type_prob = nn.Sigmoid()(disfluency_type_outputs)
- # we can set a higher threshold in practice
- disfluency_type_predictions = (disfluency_type_prob > 0.7).int().detach().cpu().numpy().tolist()
- disfluency_type_prob = disfluency_type_prob.cpu().numpy().astype(float).tolist()
- ```
-
- ## Now let's gather the predictions for the utterance
- ```python
- utterance_fluency_list = list()
- utterance_disfluency_list = list()
- for audio_idx in range(audio_segment):
-     disfluency_type = list()
-     if fluency_prob[audio_idx][0] > 0.5:
-         utterance_fluency_list.append("fluent")
-     else:
-         # If the prediction is disfluent, then which disfluency type
-         utterance_fluency_list.append("disfluent")
-         predictions = disfluency_type_predictions[audio_idx]
-         for label_idx in range(len(predictions)):
-             if predictions[label_idx] == 1:
-                 disfluency_type.append(disfluency_type_labels[label_idx])
-     utterance_disfluency_list.append(disfluency_type)
-
- # Now print how fluent the utterance is
- print(utterance_fluency_list)
- print(utterance_disfluency_list)
+ # Label list
+ emotion_label_list = [
+     'Anger',
+     'Contempt',
+     'Disgust',
+     'Fear',
+     'Happiness',
+     'Neutral',
+     'Sadness',
+     'Surprise',
+     'Other'
+ ]
+
+ # Load data, here just zeros as an example
+ # Our training data filters out audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computation limitation)
+ # So you need to prepare your audio to a maximum of 15 seconds, 16 kHz, and mono channel
+ max_audio_length = 15 * 16000
+ data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
+ logits, embedding, _, _, _, _ = model(
+     data, return_feature=True
+ )
+
+ # Probability and output
+ emotion_prob = F.softmax(logits, dim=1)
+ print(emotion_label_list[torch.argmax(emotion_prob).detach().cpu().item()])
  ```
 
  ## If you have any questions, please contact: Tiantian Feng ([email protected])
@@ -115,4 +107,4 @@ print(utterance_disfluency_list)
  journal={arXiv preprint arXiv:2505.14648},
  year={2025}
  }
- ```
+ ```
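A note on the removed fluency example above: its prediction snippet uses `nn.Sigmoid()` and `disfluency_type_labels` without importing `torch.nn` or defining the label list. A minimal sketch of those missing pieces, assuming the labels follow the order given in that card (illustrative, not part of this commit):

```python
# Missing pieces assumed by the removed fluency example (illustrative only).
import torch.nn as nn  # required for nn.Sigmoid() in that snippet

# Per-window fluency classes (3-second window, 1-second step, per the card)
fluency_labels = ["fluent", "disfluent"]

# Disfluency types, assumed to follow the order listed in the removed card
disfluency_type_labels = [
    "Block",
    "Prolongation",
    "Sound Repetition",
    "Word Repetition",
    "Interjection",
]
```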
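The new card's example feeds zeros into the model; in practice the audio should be mono, 16 kHz, and at most 15 seconds, as the card notes. Below is a minimal preparation sketch using torchaudio, assuming `device`, `model`, and `emotion_label_list` are set up exactly as in the diff above; the file name is a placeholder and the preprocessing choices are illustrative, not part of the commit.

```python
# Illustrative audio preparation for the emotion example above (not part of the commit).
import torch.nn.functional as F
import torchaudio

# Placeholder path; replace with your own recording
waveform, sample_rate = torchaudio.load("example.wav")

# Convert to mono by averaging channels, then resample to 16 kHz if needed
waveform = waveform.mean(dim=0, keepdim=True)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# Keep at most 15 seconds, as recommended by the card
max_audio_length = 15 * 16000
data = waveform[:, :max_audio_length].float().to(device)

# Same call pattern as the card's example
logits, embedding, _, _, _, _ = model(data, return_feature=True)
emotion_prob = F.softmax(logits, dim=1)
print(emotion_label_list[torch.argmax(emotion_prob).detach().cpu().item()])
```

The `embedding` returned when `return_feature=True` is left unused here; consult the vox-profile-release repository for the intended end-to-end pipeline.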