nielsr (HF Staff) committed · verified
Commit 0e806f5 · Parent(s): 1d7c4d3

Improve model card: set pipeline tag, add library name, link project page


This PR improves the model card by:
- Setting `pipeline_tag` to `automatic-speech-recognition` to reflect the model's audio processing nature.
- Adding `library_name: transformers` to specify the relevant library.
- Linking the project page in the model description.

Files changed (1): README.md (+60, -68)
README.md CHANGED
@@ -1,35 +1,46 @@
  ---
- tags:
- - model_hub_mixin
- - pytorch_model_hub_mixin
- license: apache-2.0
+ base_model:
+ - openai/whisper-large-v3
  language:
  - en
+ license: bsd-2-clause
  metrics:
  - accuracy
- base_model:
- - openai/whisper-large-v3
- pipeline_tag: audio-classification
+ pipeline_tag: automatic-speech-recognition
+ tags:
+ - model_hub_mixin
+ - pytorch_model_hub_mixin
+ - speech_emotion_recognition
+ library_name: transformers
  ---
- # Whisper Large v3 for Speech Flow (Fluency) Classification
+
+ # Whisper-Large V3 for Categorical Emotion Classification
 
  # Model Description
- This model includes the implementation of speech fluency classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits (https://arxiv.org/pdf/2505.14648).
+ This model includes the implementation of categorical emotion classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits (https://arxiv.org/pdf/2505.14648).
 
- The model first classifies the speech, using a 3-second window size and a 1-second step size, into one of:
- ```
- ["fluent", "disfluent"]
- ```
- If disfluent speech is detected, we predict the disfluency types in:
- ```
+ The training pipeline is also the top-performing solution (SAILER) in the INTERSPEECH 2025 Speech Emotion Challenge (https://lab-msp.com/MSP-Podcast_Competition/IS2025/).
+ Note that, compared to our official challenge submission, we did not use all the augmentations or the transcripts; instead we built a speech-only system to keep the model simple but still effective.
+
+ We use the MSP-Podcast data to train this model, noting that the model might be sensitive to content information when making emotion predictions. However, this could be a good feature for classifying emotions from online content.
+
+ The included emotions are:
+ <pre>
  [
- "Block",
- "Prolongation",
- "Sound Repetition",
- "Word Repetition",
- "Interjection"
+ 'Anger',
+ 'Contempt',
+ 'Disgust',
+ 'Fear',
+ 'Happiness',
+ 'Neutral',
+ 'Sadness',
+ 'Surprise',
+ 'Other'
  ]
- ```
+ </pre>
+
+ - Library: https://github.com/tiantiaf0627/vox-profile-release
 
  # How to use this model
 
@@ -49,60 +60,41 @@ pip install -e .
  # Load libraries
  import torch
  import torch.nn.functional as F
- from src.model.fluency.whisper_fluency import WhisperWrapper
-
+ from src.model.emotion.whisper_emotion import WhisperWrapper
  # Find device
  device = torch.device("cuda") if torch.cuda.is_available() else "cpu"
-
  # Load model from Huggingface
- model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-speech-flow").to(device)
+ model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-msp-podcast-emotion").to(device)
  model.eval()
  ```
 
  ## Prediction
  ```python
- audio_data = torch.zeros([1, 16000*10]).float().to(device)
- audio_segment = (audio_data.shape[1] - 3*16000) // 16000 + 1
- if audio_segment < 1: audio_segment = 1
- input_audio = list()
- input_audio_length = list()
- for idx in range(audio_segment):
-     input_audio.append(audio_data[0, 16000*idx:16000*idx+3*16000])
-     input_audio_length.append(torch.tensor(len(audio_data[0, 16000*idx:16000*idx+3*16000])))
- input_audio = torch.stack(input_audio, dim=0)
- input_audio_length = torch.stack(input_audio_length, dim=0)
- ```
- ## Prediction
- ```python
- fluency_outputs, disfluency_type_outputs = model(input_audio, length=input_audio_length)
- fluency_prob = F.softmax(fluency_outputs, dim=1).detach().cpu().numpy().astype(float).tolist()
-
- disfluency_type_prob = nn.Sigmoid()(disfluency_type_outputs)
- # we can set a higher threshold in practice
- disfluency_type_predictions = (disfluency_type_prob > 0.7).int().detach().cpu().numpy().tolist()
- disfluency_type_prob = disfluency_type_prob.cpu().numpy().astype(float).tolist()
- ```
-
- ## Now let's gather the predictions for the utterance
- ```python
- utterance_fluency_list = list()
- utterance_disfluency_list = list()
- for audio_idx in range(audio_segment):
-     disfluency_type = list()
-     if fluency_prob[audio_idx][0] > 0.5:
-         utterance_fluency_list.append("fluent")
-     else:
-         # If the prediction is disfluent, then which disfluency type
-         utterance_fluency_list.append("disfluent")
-         predictions = disfluency_type_predictions[audio_idx]
-         for label_idx in range(len(predictions)):
-             if predictions[label_idx] == 1:
-                 disfluency_type.append(disfluency_type_labels[label_idx])
-     utterance_disfluency_list.append(disfluency_type)
-
- # Now print how fluent the utterance is
- print(utterance_fluency_list)
- print(utterance_disfluency_list)
+ # Label list
+ emotion_label_list = [
+     'Anger',
+     'Contempt',
+     'Disgust',
+     'Fear',
+     'Happiness',
+     'Neutral',
+     'Sadness',
+     'Surprise',
+     'Other'
+ ]
+
+ # Load data, here just zeros as an example
+ # Our training data filters out audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computation limitation)
+ # So you need to prepare your audio to a maximum of 15 seconds, 16 kHz, and mono channel
+ max_audio_length = 15 * 16000
+ data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
+ logits, embedding, _, _, _, _ = model(
+     data, return_feature=True
+ )
+
+ # Probability and output
+ emotion_prob = F.softmax(logits, dim=1)
+ print(emotion_label_list[torch.argmax(emotion_prob).detach().cpu().item()])
  ```
 
  ## If you have any questions, please contact: Tiantian Feng ([email protected])
@@ -115,4 +107,4 @@ print(utterance_disfluency_list)
  journal={arXiv preprint arXiv:2505.14648},
  year={2025}
  }
- ```
+ ```
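A note on the removed fluency example above: its prediction snippet uses `nn.Sigmoid()` and `disfluency_type_labels` without importing `torch.nn` or defining the label list. A minimal sketch of those missing pieces, assuming the labels follow the order given in that card (illustrative, not part of this commit):

```python
# Missing pieces assumed by the removed fluency example (illustrative only).
import torch.nn as nn  # required for nn.Sigmoid() in that snippet

# Per-window fluency classes (3-second window, 1-second step, per the card)
fluency_labels = ["fluent", "disfluent"]

# Disfluency types, assumed to follow the order listed in the removed card
disfluency_type_labels = [
    "Block",
    "Prolongation",
    "Sound Repetition",
    "Word Repetition",
    "Interjection",
]
```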
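The new card's example feeds zeros into the model; in practice the audio should be mono, 16 kHz, and at most 15 seconds, as the card notes. Below is a minimal preparation sketch using torchaudio, assuming `device`, `model`, and `emotion_label_list` are set up exactly as in the diff above; the file name is a placeholder and the preprocessing choices are illustrative, not part of the commit.

```python
# Illustrative audio preparation for the emotion example above (not part of the commit).
import torch.nn.functional as F
import torchaudio

# Placeholder path; replace with your own recording
waveform, sample_rate = torchaudio.load("example.wav")

# Convert to mono by averaging channels, then resample to 16 kHz if needed
waveform = waveform.mean(dim=0, keepdim=True)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# Keep at most 15 seconds, as recommended by the card
max_audio_length = 15 * 16000
data = waveform[:, :max_audio_length].float().to(device)

# Same call pattern as the card's example
logits, embedding, _, _, _, _ = model(data, return_feature=True)
emotion_prob = F.softmax(logits, dim=1)
print(emotion_label_list[torch.argmax(emotion_prob).detach().cpu().item()])
```

The `embedding` returned when `return_feature=True` is left unused here; consult the vox-profile-release repository for the intended end-to-end pipeline.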