Improve model card: set pipeline tag, add library name, link project page

This PR improves the model card by:
- Setting `pipeline_tag` to `automatic-speech-recognition` to reflect the model's audio processing task.
- Adding `library_name: transformers` to specify the relevant library.
- Linking the project page in the model description.
README.md
CHANGED
@@ -1,35 +1,46 @@
 ---
-tags:
-- model_hub_mixin
-- pytorch_model_hub_mixin
-license: apache-2.0
+base_model:
+- openai/whisper-large-v3
 language:
 - en
+license: bsd-2-clause
 metrics:
 - accuracy
+pipeline_tag: automatic-speech-recognition
+tags:
+- model_hub_mixin
+- pytorch_model_hub_mixin
+- speech_emotion_recognition
+library_name: transformers
 ---
+
+# Whisper-Large V3 for Categorical Emotion Classification

 # Model Description
-This model includes the implementation of
+This model includes the implementation of categorical emotion classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits (https://arxiv.org/pdf/2505.14648).

-The
+The training pipeline is also the top-performing solution (SAILER) in the INTERSPEECH 2025 Speech Emotion Challenge (https://lab-msp.com/MSP-Podcast_Competition/IS2025/).
+Note that, compared to our official challenge submission, we did not use all of the augmentations and did not use transcripts; we built a speech-only system to keep the model simple but still effective.
+
+We use the MSP-Podcast data to train this model, so it might be sensitive to content information when making emotion predictions. However, this can be a useful feature for classifying emotions in online content.
+
+The included emotions are:
+<pre>
 [
+    'Anger',
+    'Contempt',
+    'Disgust',
+    'Fear',
+    'Happiness',
+    'Neutral',
+    'Sadness',
+    'Surprise',
+    'Other'
 ]
+</pre>
+
+- Library: https://github.com/tiantiaf0627/vox-profile-release

 # How to use this model

@@ -49,60 +60,41 @@ pip install -e .
 # Load libraries
 import torch
 import torch.nn.functional as F
-from src.model.
+from src.model.emotion.whisper_emotion import WhisperWrapper

 # Find device
 device = torch.device("cuda") if torch.cuda.is_available() else "cpu"

 # Load model from Huggingface
-model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-
+model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-msp-podcast-emotion").to(device)
 model.eval()
 ```

 ## Prediction
 ```python
-utterance_disfluency_list = list()
-for audio_idx in range(audio_segment):
-    disfluency_type = list()
-    if fluency_prob[audio_idx][0] > 0.5:
-        utterance_fluency_list.append("fluent")
-    else:
-        # If the prediction is disfluent, then which disfluency type
-        utterance_fluency_list.append("disfluent")
-        predictions = disfluency_type_predictions[audio_idx]
-        for label_idx in range(len(predictions)):
-            if predictions[label_idx] == 1:
-                disfluency_type.append(disfluency_type_labels[label_idx])
-    utterance_disfluency_list.append(disfluency_type)
-
-# Now print how fluent is the utterance
-print(utterance_fluency_list)
-print(utterance_disfluency_list)
+# Label list
+emotion_label_list = [
+    'Anger',
+    'Contempt',
+    'Disgust',
+    'Fear',
+    'Happiness',
+    'Neutral',
+    'Sadness',
+    'Surprise',
+    'Other'
+]
+
+# Load data, here just zeros as an example
+# Our training data filters out audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computation limitations),
+# so prepare your audio as mono-channel 16 kHz clips of at most 15 seconds
+max_audio_length = 15 * 16000
+data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
+logits, embedding, _, _, _, _ = model(
+    data, return_feature=True
+)
+
+# Probability and output
+emotion_prob = F.softmax(logits, dim=1)
+print(emotion_label_list[torch.argmax(emotion_prob).detach().cpu().item()])
 ```

 ## If you have any questions, please contact: Tiantian Feng ([email protected])

@@ -115,4 +107,4 @@ print(utterance_disfluency_list)
 journal={arXiv preprint arXiv:2505.14648},
 year={2025}
 }
-```
+```
---
base_model:
- openai/whisper-large-v3
language:
- en
license: bsd-2-clause
metrics:
- accuracy
pipeline_tag: automatic-speech-recognition
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- speech_emotion_recognition
library_name: transformers
---

# Whisper-Large V3 for Categorical Emotion Classification

# Model Description
This model includes the implementation of categorical emotion classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits (https://arxiv.org/pdf/2505.14648).

The training pipeline is also the top-performing solution (SAILER) in the INTERSPEECH 2025 Speech Emotion Challenge (https://lab-msp.com/MSP-Podcast_Competition/IS2025/).
Note that, compared to our official challenge submission, we did not use all of the augmentations and did not use transcripts; we built a speech-only system to keep the model simple but still effective.

We use the MSP-Podcast data to train this model, so it might be sensitive to content information when making emotion predictions. However, this can be a useful feature for classifying emotions in online content.

The included emotions are:
<pre>
[
    'Anger',
    'Contempt',
    'Disgust',
    'Fear',
    'Happiness',
    'Neutral',
    'Sadness',
    'Surprise',
    'Other'
]
</pre>

- Library: https://github.com/tiantiaf0627/vox-profile-release

# How to use this model

[... unchanged installation steps, ending with `pip install -e .` ...]

```python
# Load libraries
import torch
import torch.nn.functional as F
from src.model.emotion.whisper_emotion import WhisperWrapper

# Find device
device = torch.device("cuda") if torch.cuda.is_available() else "cpu"

# Load model from Huggingface
model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-msp-podcast-emotion").to(device)
model.eval()
```
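
The `from src.model...` import assumes you are running from the root of a vox-profile-release checkout. A minimal sketch if your script lives elsewhere (the clone path below is a placeholder; adjust it to where you cloned the repository):

```python
# Hypothetical example: make the repo's `src` package importable
# when running from outside the checkout.
import sys
sys.path.append("/path/to/vox-profile-release")  # placeholder path

from src.model.emotion.whisper_emotion import WhisperWrapper
```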

## Prediction
```python
# Label list
emotion_label_list = [
    'Anger',
    'Contempt',
    'Disgust',
    'Fear',
    'Happiness',
    'Neutral',
    'Sadness',
    'Surprise',
    'Other'
]

# Load data, here just zeros as an example
# Our training data filters out audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computation limitations),
# so prepare your audio as mono-channel 16 kHz clips of at most 15 seconds
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
logits, embedding, _, _, _, _ = model(
    data, return_feature=True
)

# Probability and output
emotion_prob = F.softmax(logits, dim=1)
print(emotion_label_list[torch.argmax(emotion_prob).detach().cpu().item()])
```
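
For real audio instead of the zero tensor above, you need to load, down-mix, and resample the waveform yourself. A minimal sketch, assuming `torchaudio` is installed and reusing `model`, `device`, and `emotion_label_list` from the blocks above (`example.wav` is a placeholder path); it also prints the full probability distribution rather than only the top label:

```python
import torch
import torch.nn.functional as F
import torchaudio

# Load a clip and convert it to the mono 16 kHz format the model expects.
waveform, sample_rate = torchaudio.load("example.wav")  # placeholder path
waveform = waveform.mean(dim=0, keepdim=True)           # down-mix to mono
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# Truncate to the 15-second maximum used during training.
max_audio_length = 15 * 16000
data = waveform[:, :max_audio_length].float().to(device)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    logits, embedding, _, _, _, _ = model(data, return_feature=True)

# Show the probability of every label, not just the arg-max.
emotion_prob = F.softmax(logits, dim=1).squeeze(0)
for label, prob in zip(emotion_label_list, emotion_prob.cpu().tolist()):
    print(f"{label}: {prob:.3f}")
```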

## If you have any questions, please contact: Tiantian Feng ([email protected])

[... unchanged lines, including the opening of the BibTeX citation entry ...]
  journal={arXiv preprint arXiv:2505.14648},
  year={2025}
}
```