docs: Update model card

#1
Files changed (1)
  1. README.md +234 -193
README.md CHANGED
@@ -30,10 +30,13 @@ model-index:
30
  name: WER
31
  ---
32
 
33
- # Pre-release of Roest-wav2vec2-315m-v2
34
- This is a pre-release of a Danish state-of-the-art speech recognition model, trained as part of the CoRal project by [Alvenir](https://www.alvenir.ai/).
 
 
 
 
35
 
36
- This repository contains a Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main). The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).
37
 
38
  ## Quick Start
39
 
@@ -59,233 +62,270 @@ Next you can use the model using the `transformers` Python package as follows:
59
 
60
  Explore the following audio samples along with their transcriptions and accuracy metrics. Each example showcases the model's performance on a different Danish dialect.
61
 
62
- ### Example 1 - Vestjysk Dialect
63
-
64
- **Audio Sample:**
65
- <audio controls>
66
- <source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example1.wav" type="audio/wav">
67
- Your browser does not support the audio tag.
68
- </audio>
69
-
70
- **Model Transcription:**
71
- *det blev til yderlig ti mål i den første sæson på trods af en position som back*
72
-
73
- **Target Transcription:**
74
- *det blev til yderligere ti mål i den første sæson på trods af en position som back*
75
-
76
- - **Character Error Rate (CER):** 3.7%
77
- - **Word Error Rate (WER):** 5.9%
78
-
79
- ---
80
-
81
- ### Example 2 - Sønderjysk Dialect
82
-
83
- **Audio Sample:**
84
- <audio controls>
85
- <source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example2.wav" type="audio/wav">
86
- Your browser does not support the audio tag.
87
- </audio>
88
-
89
- **Model Transcription:**
90
- *en arkitektoniske udformning af pladser forslagene iver benzen*
91
-
92
- **Target Transcription:**
93
- *den arkitektoniske udformning af pladsen er forestået af ivar bentsen*
94
-
95
- - **Character Error Rate (CER):** 20.3%
96
- - **Word Error Rate (WER):** 60.0%
97
-
98
- ---
99
-
100
- ### Example 3 - Nordsjællandsk Dialect
101
-
102
- **Audio Sample:**
103
- <audio controls>
104
- <source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example3.wav" type="audio/wav">
105
- Your browser does not support the audio tag.
106
- </audio>
107
-
108
- **Model Transcription:**
109
- *østrig og ungarn samarbejder om søen gennem den østrigske og ungarske vandkommission*
110
-
111
- **Target Transcription:**
112
- *østrig og ungarn samarbejder om søen gennem den østrigske og ungarske vandkommission*
113
-
114
- - **Character Error Rate (CER):** 0.0%
115
- - **Word Error Rate (WER):** 0.0%
116
-
117
- ---
118
-
119
- ### Example 4 - Lollandsk Dialect
120
-
121
- **Audio Sample:**
122
- <audio controls>
123
- <source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example4.wav" type="audio/wav">
124
- Your browser does not support the audio tag.
125
- </audio>
126
-
127
- **Model Transcription:**
128
- *det er produceret af thomas helme og indspillede i easy sound recording studio i københavn*
129
-
130
- **Target Transcription:**
131
- *det er produceret af thomas helmig og indspillet i easy sound recording studio i københavn*
132
-
133
- - **Character Error Rate (CER):** 4.4%
134
- - **Word Error Rate (WER):** 13.3%
 
 
 
 
 
 
 
 
 
 
135
 
136
  ---
137
 
138
  ## Model Details
139
 
140
  Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [Wav2Vec2-XLS-R-300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m) has been fine-tuned for automatic speech recognition on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) to improve its recognition of Danish speech across different dialects. The model was trained for 30K steps using the training setup in the [CoRal repository](https://github.com/alexandrainst/coral/tree) by running:
141
- ```
142
- python src/scripts/finetune_asr_model.py model=wav2vec2-small max_steps=30000 datasets.coral_conversation_internal.id=CoRal-project/coral-v2 datasets.coral_readaloud_internal.id=CoRal-project/coral-v2
143
- ```
144
- The model is evaluated using a Language Model (LM) as post-processing. The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2).
145
 
146
- ---
 
 
 
 
 
 
147
 
148
- ## Dataset
 
149
 
150
- ### [CoRal-v2](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main)
151
- - **Subsets**:
152
- - Conversation
153
- - Read-aloud
154
- - **Language**: Danish.
155
- - **Variation**: Includes various dialects, age groups, and gender distinctions.
156
- ### License
157
- Note that the dataset used is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (speech synthesis and biometric identification). See [license](https://huggingface.co/Alvenir/coral-1-whisper-large/blob/main/LICENSE).
158
 
159
  ---
160
 
161
  ## Evaluation
162
 
163
  The model was evaluated using the following metrics:
164
- - **Word Error Rate (WER)**: The percentage of words incorrectly transcribed.
165
  - **Character Error Rate (CER)**: The percentage of characters incorrectly transcribed.
 
166
 
167
- **OBS!** It should be noted that the [CoRal test dataset](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) does not contain any conversation data, whereas the model is trained for read-aloud and conversation, but is only tested on read-aloud in the [CoRal test dataset](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test).
168
-
169
-
170
- | Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) WER |
171
- | :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
172
- | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | 6.5% ± 0.2% | 16.4% ± 0.4% |
173
- | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
174
- | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
175
- | [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315M-v1) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
176
- | [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
177
- | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |
178
 
179
- **OBS!** Benchmark for hviske-v2 has been reevaluted and the confidence interval is larger than reported in the model card.
180
 
181
- The model was also evaluated on a tentative pre-release of the coral-v2 conversation dataset. The results are tentative as the test set only includes 5 unique speakers, of which 4 are women. The test set includes 2 speakers with 'Fynsk' dialect, 1 with 'Sønderjysk', 1 with 'Non-native' and 1 'Nordjysk'. The whisper model is performing very poorly on the test set. An explanation could be hallucinations during silence and short sentences, a known whisper issue. Furthermore, both version 1 models have not been trained on any conversation data giving the models an obvious disadvantage.
 
 
 
182
 
183
  | Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
184
  | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
185
- | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | 23.9% | 36.7% |
186
  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 24.2% | 37.7% |
187
- | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | 138% | 121% |
188
- | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | 123% | 80.5% |
189
 
190
 
191
- ### Detailed evaluation across demographics on the CoRal test data
192
- <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/wer.png">
 
 
 
 
 
 
 
 
 
 
193
 
194
  <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/cer.png">
195
 
196
- ### Table WER scores in % of evaluation across demographics on the CoRal test data
197
- | Category | roest-whisper-large-v1 | roest-wav2vec2-315m-v1 | roest-wav2vec2-315m-v2 | roest-wav2vec2-1B-v2 |
198
- |:---:|:---:|:---:|:---:|:---:|
199
- | female | 11.5 | 18.5 | 17.7 | 17.8 |
200
- | male | 9.4 | 15.5 | 14.9 | 15.0 |
201
- | 0-25 | 9.0 | 14.7 | 14.0 | 13.7 |
202
- | 25-50 | 10.1 | 16.6 | 15.8 | 15.3 |
203
- | 50+ | 11.3 | 18.2 | 17.7 | 18.5 |
204
- | Bornholmsk | 9.8 | 17.7 | 15.7 | 16.4 |
205
- | Fynsk | 12.1 | 18.3 | 17.7 | 16.7 |
206
- | Københavnsk | 5.9 | 10.2 | 10.0 | 9.5 |
207
- | Non-native | 12.2 | 20.9 | 19.4 | 19.4 |
208
- | Nordjysk | 4.5 | 7.7 | 7.5 | 7.3 |
209
- | Sjællandsk | 7.6 | 12.6 | 12.7 | 11.0 |
210
- | Sydømål | 10.0 | 14.9 | 15.3 | 14.4 |
211
- | Sønderjysk | 17.5 | 26.0 | 25.4 | 27.8 |
212
- | Vestjysk | 15.0 | 26.3 | 25.2 | 26.7 |
213
- | Østjysk | 7.5 | 11.7 | 11.3 | 10.8 |
214
- | Overall | 10.4 | 17.0 | 16.3 | 16.4 |
215
-
216
-
217
- ### Table CER scores in % of evaluation across demographics on the CoRal test data
218
- | Category | roest-whisper-large-v1 | roest-wav2vec2-315m-v1 | roest-wav2vec2-315m-v2 | roest-wav2vec2-1B-v2 |
219
- |:---:|:---:|:---:|:---:|:---:|
220
- | female | 5.1 | 7.4 | 7.2 | 7.3 |
221
- | male | 3.6 | 5.8 | 5.7 | 5.8 |
222
- | 0-25 | 3.4 | 5.4 | 5.3 | 5.1 |
223
- | 25-50 | 4.0 | 6.2 | 6.0 | 5.7 |
224
- | 50+ | 5.0 | 7.5 | 7.4 | 7.8 |
225
- | Bornholmsk | 3.8 | 6.8 | 6.1 | 6.2 |
226
- | Fynsk | 5.1 | 7.4 | 7.2 | 6.9 |
227
- | Københavnsk | 1.9 | 3.3 | 3.2 | 3.0 |
228
- | Non-native | 4.8 | 7.8 | 7.5 | 7.3 |
229
- | Nordjysk | 1.6 | 2.6 | 2.8 | 2.6 |
230
- | Sjællandsk | 3.0 | 4.4 | 4.5 | 3.9 |
231
- | Sydømål | 4.1 | 6.4 | 6.4 | 6.5 |
232
- | Sønderjysk | 8.8 | 11.9 | 11.6 | 12.6 |
233
- | Vestjysk | 6.4 | 10.1 | 9.8 | 10.5 |
234
- | Østjysk | 2.6 | 4.0 | 4.1 | 3.8 |
235
- | Overall | 4.3 | 6.6 | 6.5 | 6.5 |
236
-
237
-
238
-
239
- ### Roest-wav2vec2-315M with and without language model
240
- The inclusion of a post-processing language model can affect the performance significantly. The Roest-v1 and Roest-v2 models are using the same Language Model (LM). The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
241
-
242
- | Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.com/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
243
- | :-------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
244
- | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.4% ± 0.4%** |
245
- | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | No | 8.1% ± 0.2% | 23.9% ± 0.4% |
246
- | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
247
- | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
248
- | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
249
- | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
250
-
251
- ### Detailed Roest-wav2vec2-315M with and without language model on different dialects
252
- Here are the results of the model on different danish dialects in the test set:
253
-
254
- | | Roest-v1 | | Roest-v1 | | Roest-v2 | | Roest-v2 | |
255
- |-------------|---------|---------|---------|---------|---------|---------|---------|---------|
256
- | LM | No | | Yes | | No | | Yes | |
257
- |-------------|---------|---------|---------|---------|---------|---------|---------|---------|
258
- | Dialect | CER (%) | WER (%) | CER (%) | WER (%) | CER (%) | WER (%) | CER (%) | WER (%) |
259
- | Vestjysk | 12.7 | 37.1 | 10.1 | 26.3 | 12.2 | 36.3 | 9.82 | 25.2 |
260
- | Sønderjysk | 14.7 | 37.8 | 11.9 | 26.0 | 14.2 | 36.2 | 11.6 | 25.4 |
261
- | Bornholmsk | 9.32 | 29.9 | 6.79 | 17.7 | 8.08 | 26.7 | 6.12 | 15.7 |
262
- | Østjysk | 5.51 | 18.7 | 3.97 | 11.7 | 5.39 | 18.0 | 4.06 | 11.3 |
263
- | Nordjysk | 3.86 | 13.6 | 2.57 | 7.72 | 3.80 | 13.5 | 2.75 | 7.51 |
264
- | Københavnsk | 5.27 | 18.8 | 3.31 | 10.2 | 5.02 | 17.7 | 3.20 | 9.98 |
265
- | Fynsk | 9.41 | 28.6 | 7.43 | 18.3 | 8.86 | 27.0 | 7.20 | 17.7 |
266
- | Non-native | 10.6 | 33.2 | 7.84 | 20.9 | 10.0 | 31.6 | 7.46 | 19.4 |
267
- | Sjællandsk | 5.82 | 19.5 | 4.44 | 12.6 | 5.70 | 18.6 | 4.48 | 12.7 |
268
- | Sydømål | 7.09 | 20.7 | 6.38 | 14.9 | 6.96 | 20.4 | 6.44 | 15.3 |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
269
 
270
  ### Performance on Other Datasets
271
 
272
  The model was also tested against other datasets to evaluate generalizability:
273
 
274
- | | **Roest-whisper-large-v1** | | **Roest-wav2vec2-315M-v1** | | **Roest-wav2vec2-315M-v2** | | **Roest-wav2vec2-1B-v2** | |
275
- | ------------------------------------------------------------------------------------- | -------------------------- | ------- | -------------------------- | ----- | -------------------------- | ------- | ------------------------ | ------- |
276
- | Evaluation Dataset | WER % | CER % | WER % | CER % | WER % | CER % | WER % | CER % |
277
- | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | **10.4** | **4.3** | 17.0 | 6.6 | **16.3** | **6.5** | 16.4 | **6.5** |
278
- | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 26.1 | 11.9 | **12.4** | **4.9** |
279
- | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | **14.4** | **5.4** | 26.3 | 10.9 |
280
- | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | **12.6** | **5.1** | 16.6 | 6.3 | 15.6 | 6.1 | **13.7** | **5.5** |
281
 
282
  **OBS!** The vocab used for training includes numerals (0,1,2,..,9), which are translated to text in a post-processing step. If the model misses a space between two numbers, they are interpreted as a single number, which especially affects the NST score as this dataset contains many numerals.
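
To make the effect concrete, here is a toy sketch of that failure mode. The lookup table and the `verbalise_numerals` helper are hypothetical stand-ins for the real post-processing step, which is not shown in this model card; only the merging behaviour is the point.

```python
# Hypothetical, deliberately tiny lookup; not the project's real post-processing.
NUMBER_WORDS = {"2": "to", "3": "tre", "23": "treogtyve"}


def verbalise_numerals(transcript: str) -> str:
    """Replace each whitespace-separated numeral token with its Danish word form."""
    return " ".join(NUMBER_WORDS.get(token, token) for token in transcript.split())


# The speaker says "two three"; the model either keeps or drops the space.
with_space = verbalise_numerals("nummer 2 3")
without_space = verbalise_numerals("nummer 23")

print(with_space)     # nummer to tre
print(without_space)  # nummer treogtyve
```

A single missing space therefore changes every word of the spoken number, so one small character-level slip can cost several word errors.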
283
 
284
  ---
285
- ### Note on comparing whisper and wav2vec2 models
286
- The Whisper models detailed in this model card exhibit significantly lower Character Error Rates (CER) and Word Error Rates (WER) compared to the Wav2Vec2 models. Whisper utilizes a transformer-based architecture with additional layers that enhance contextual understanding. In contrast, Wav2Vec2 models employ shorter context windows that focus on sound prediction. The Roest-Wav2Vec2 models incorporate a straightforward language model during post-processing, which addresses errors based on statistical language patterns. Introducing a more complex, contextual post-processing language model might enable a better comparison between these model types, which the CoRal project plans to explore in future releases.
287
 
288
- The Roest-Whisper model excels in read-aloud data, leveraging its embedded contextual framework to achieve more robust recognition within this context. However, Wav2Vec2 models appear to generalize more effectively across various speech recognition tasks, whereas Whisper models incur higher error rates in conversational data. It’s important to note that the CoRal-v2 conversation dataset, being tentative and featuring limited speaker diversity, might influence these results.
 
 
 
 
 
 
 
 
 
289
 
290
  ---
291
 
@@ -307,12 +347,13 @@ The CoRal project is funded by the [Danish Innovation Fund](https://innovationsf
307
 
308
  We would specifically like to thank Dan Saattrup Nielsen (Alexandra Institute) for, among other things, the repository work, and Simon Leminen Madsen (Alexandra Institute) for the modelling work.
309
 
 
310
  ## Citation
311
 
312
  ```bibtex
313
  @misc{roest-wav2vec2-315m-v2,
314
  author = {Marie Juhl Jørgensen and Søren Vejlgaard Holm and Martin Carsten Nielsen and Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Torben Blach},
315
- title = {Roest-wav2vec-315m-v2: A Danish state-of-the-art speech recognition model trained on varied demographics and dialects},
316
  year = {2025},
317
  url = {https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2},
318
  }
 
30
  name: WER
31
  ---
32
 
33
+ # Røst-wav2vec2-315m-v2
34
+ This is a state-of-the-art Danish speech recognition model, trained as part of the CoRal project by [Alvenir](https://www.alvenir.ai/).
35
+
36
+ This repository contains a Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main).
37
+ The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects.
38
+ The model is designed for automatic speech recognition (ASR).
39
 
 
40
 
41
  ## Quick Start
42
 
 
62
 
63
  Explore the following audio samples along with their transcriptions and accuracy metrics. Each example showcases the model's performance on a different Danish dialect.
64
 
65
+ <details>
66
+ <summary>
67
+ <b>Example 1 - Vestjysk Dialect</b>
68
+ </summary>
69
+
70
+ **Audio Sample:**
71
+ <audio controls>
72
+ <source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example1.wav" type="audio/wav">
73
+ Your browser does not support the audio tag.
74
+ </audio>
75
+
76
+ **Model Transcription:**
77
+ *det blev til yderlig ti mål i den første sæson på trods af en position som back*
78
+
79
+ **Target Transcription:**
80
+ *det blev til yderligere ti mål i den første sæson på trods af en position som back*
81
+
82
+ - **Character Error Rate (CER):** 3.7%
83
+ - **Word Error Rate (WER):** 5.9%
84
+ </details>
85
+
86
+ <details>
87
+ <summary>
88
+ <b>Example 2 - Sønderjysk Dialect</b>
89
+ </summary>
90
+
91
+ **Audio Sample:**
92
+ <audio controls>
93
+ <source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example2.wav" type="audio/wav">
94
+ Your browser does not support the audio tag.
95
+ </audio>
96
+
97
+ **Model Transcription:**
98
+ *en arkitektoniske udformning af pladser forslagene iver benzen*
99
+
100
+ **Target Transcription:**
101
+ *den arkitektoniske udformning af pladsen er forestået af ivar bentsen*
102
+
103
+ - **Character Error Rate (CER):** 20.3%
104
+ - **Word Error Rate (WER):** 60.0%
105
+ </details>
106
+
107
+ <details>
108
+ <summary>
109
+ <b>Example 3 - Nordsjællandsk Dialect</b>
110
+ </summary>
111
+
112
+ **Audio Sample:**
113
+ <audio controls>
114
+ <source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example3.wav" type="audio/wav">
115
+ Your browser does not support the audio tag.
116
+ </audio>
117
+
118
+ **Model Transcription:**
119
+ *østrig og ungarn samarbejder om søen gennem den østrigske og ungarske vandkommission*
120
+
121
+ **Target Transcription:**
122
+ *østrig og ungarn samarbejder om søen gennem den østrigske og ungarske vandkommission*
123
+
124
+ - **Character Error Rate (CER):** 0.0%
125
+ - **Word Error Rate (WER):** 0.0%
126
+ </details>
127
+
128
+ <details>
129
+ <summary>
130
+ <b>Example 4 - Lollandsk Dialect</b>
131
+ </summary>
132
+
133
+ **Audio Sample:**
134
+ <audio controls>
135
+ <source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example4.wav" type="audio/wav">
136
+ Your browser does not support the audio tag.
137
+ </audio>
138
+
139
+ **Model Transcription:**
140
+ *det er produceret af thomas helme og indspillede i easy sound recording studio i københavn*
141
+
142
+ **Target Transcription:**
143
+ *det er produceret af thomas helmig og indspillet i easy sound recording studio i københavn*
144
+
145
+ - **Character Error Rate (CER):** 4.4%
146
+ - **Word Error Rate (WER):** 13.3%
147
+ </details>
148
 
149
  ---
150
 
151
  ## Model Details
152
 
153
  Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [Wav2Vec2-XLS-R-300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m) has been fine-tuned for automatic speech recognition on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) to improve its recognition of Danish speech across different dialects. The model was trained for 30K steps using the training setup in the [CoRal repository](https://github.com/alexandrainst/coral/tree) by running:
 
 
 
 
154
 
155
+ ```bash
156
+ python src/scripts/finetune_asr_model.py \
157
+ model=wav2vec2-small \
158
+ max_steps=30000 \
159
+ datasets.coral_conversation_internal.id=CoRal-project/coral-v2 \
160
+ datasets.coral_readaloud_internal.id=CoRal-project/coral-v2
161
+ ```
162
 
163
+ The model is evaluated using a Language Model (LM) as post-processing.
164
+ The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
165
 
166
+ The model was trained on the [CoRal-v2](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) dataset, including both the conversational and read-aloud subsets.
167
+ This dataset consists of Danish speech across a variety of dialects, age groups and gender distinctions.
168
+ Note that the dataset, and thus also this model, is licensed under a custom license adapted from OpenRAIL-M, which allows commercial use with a few restrictions (speech synthesis and biometric identification). See the [license](https://huggingface.co/Alvenir/coral-1-whisper-large/blob/main/LICENSE).
 
 
 
 
 
169
 
170
  ---
171
 
172
  ## Evaluation
173
 
174
  The model was evaluated using the following metrics:
 
175
  - **Character Error Rate (CER)**: The percentage of characters incorrectly transcribed.
176
+ - **Word Error Rate (WER)**: The percentage of words incorrectly transcribed.
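
Both metrics are commonly computed as a Levenshtein edit distance (substitutions, deletions, and insertions) divided by the length of the target, counted in words for WER and in characters for CER. The following is a minimal sketch of that definition, not the evaluation code used for this model card. The function names are illustrative, and it assumes that spaces count as characters and that no further text normalisation is applied. Under those assumptions it reproduces the scores listed for Example 1 above.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def wer(target: str, prediction: str) -> float:
    """Word errors divided by the number of words in the target."""
    target_words = target.split()
    return edit_distance(target_words, prediction.split()) / len(target_words)


def cer(target: str, prediction: str) -> float:
    """Character errors divided by the number of characters in the target."""
    return edit_distance(target, prediction) / len(target)


# Example 1 (Vestjysk) from this model card.
target = "det blev til yderligere ti mål i den første sæson på trods af en position som back"
prediction = "det blev til yderlig ti mål i den første sæson på trods af en position som back"

print(f"WER: {wer(target, prediction):.1%}")  # WER: 5.9%
print(f"CER: {cer(target, prediction):.1%}")  # CER: 3.7%
```

Here one of the 17 target words is wrong, and three of the 82 target characters are missing.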
177
 
178
+ ### Conversational CoRal Performance
 
 
 
 
 
 
 
 
 
 
179
 
180
+ The model was first evaluated on a tentative version of the coral-v2 conversation dataset.
181
 
182
+ The results are tentative as the test set only includes 5 unique speakers, of which 4 are women.
183
+ The test set includes 2 speakers with 'Fynsk' dialect, 1 with 'Sønderjysk', 1 with 'Non-native' and 1 with 'Nordjysk'.
184
+ The Whisper model performs very poorly on the test set. One explanation could be hallucinations during silence and short sentences, a known Whisper issue.
185
+ Furthermore, neither of the v1 models has been trained on any conversation data, which puts them at an obvious disadvantage.
186
 
187
  | Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
188
  | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
189
+ | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | **23.9%** | **36.7%** |
190
  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 24.2% | 37.7% |
191
+ | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | 138% | 121% |
192
+ | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | 123% | 80.5% |
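
Error rates above 100% are possible because the number of errors is divided by the length of the target, while the number of inserted words or characters is unbounded. The sketch below illustrates this with a made-up target and a made-up prediction that adds extra words, in the spirit of the hallucinations mentioned above. The sentences and the `word_error_rate` helper are illustrative assumptions, not data or code from the evaluation.

```python
def word_error_rate(target: str, prediction: str) -> float:
    """Word-level Levenshtein distance divided by the number of target words."""
    ref, hyp = target.split(), prediction.split()
    # dist[j] is the edit distance between the reference words seen so far and hyp[:j].
    dist = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        prev_diag, dist[0] = dist[0], i
        for j, hyp_word in enumerate(hyp, start=1):
            prev_diag, dist[j] = dist[j], min(
                dist[j] + 1,                         # deletion
                dist[j - 1] + 1,                     # insertion
                prev_diag + (ref_word != hyp_word),  # substitution or match
            )
    return dist[-1] / len(ref)


# Hypothetical short utterance followed by three hallucinated words.
target = "ja tak"
prediction = "ja tak det var godt"

print(f"{word_error_rate(target, prediction):.0%}")  # 150%
```

Short utterances make this effect stronger, since a handful of inserted words is divided by a very small target length.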
193
 
194
 
195
+ ### Read-aloud CoRal Performance
196
+
197
+ | Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) WER |
198
+ | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
199
+ | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | 6.5% ± 0.2% | 16.4% ± 0.4% |
200
+ | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
201
+ | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
202
+ | [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315M-v1) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
203
+ | [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
204
+ | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |
205
+
206
+ **OBS!** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than the one reported in its model card.
207
 
208
  <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/cer.png">
209
 
210
+ <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/wer.png">
211
+
212
+
213
+ <details>
214
+ <summary>
215
+ <b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
216
+ </summary>
217
+
218
+ | Category | Røst-whisper-large-v1 | Røst-wav2vec2-315m-v1 | Røst-wav2vec2-315m-v2 | Røst-wav2vec2-1B-v2 |
219
+ |:---:|:---:|:---:|:---:|:---:|
220
+ | female | 5.1 | 7.4 | 7.2 | 7.3 |
221
+ | male | 3.6 | 5.8 | 5.7 | 5.8 |
222
+ | 0-25 | 3.4 | 5.4 | 5.3 | 5.1 |
223
+ | 25-50 | 4.0 | 6.2 | 6.0 | 5.7 |
224
+ | 50+ | 5.0 | 7.5 | 7.4 | 7.8 |
225
+ | Bornholmsk | 3.8 | 6.8 | 6.1 | 6.2 |
226
+ | Fynsk | 5.1 | 7.4 | 7.2 | 6.9 |
227
+ | Københavnsk | 1.9 | 3.3 | 3.2 | 3.0 |
228
+ | Non-native | 4.8 | 7.8 | 7.5 | 7.3 |
229
+ | Nordjysk | 1.6 | 2.6 | 2.8 | 2.6 |
230
+ | Sjællandsk | 3.0 | 4.4 | 4.5 | 3.9 |
231
+ | Sydømål | 4.1 | 6.4 | 6.4 | 6.5 |
232
+ | Sønderjysk | 8.8 | 11.9 | 11.6 | 12.6 |
233
+ | Vestjysk | 6.4 | 10.1 | 9.8 | 10.5 |
234
+ | Østjysk | 2.6 | 4.0 | 4.1 | 3.8 |
235
+ | Overall | 4.3 | 6.6 | 6.5 | 6.5 |
236
+
237
+ </details>
238
+
239
+ <details>
240
+ <summary>
241
+ <b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
242
+ </summary>
243
+
244
+ | Category | Røst-whisper-large-v1 | Røst-wav2vec2-315m-v1 | Røst-wav2vec2-315m-v2 | Røst-wav2vec2-1B-v2 |
245
+ |:---:|:---:|:---:|:---:|:---:|
246
+ | female | 11.5 | 18.5 | 17.7 | 17.8 |
247
+ | male | 9.4 | 15.5 | 14.9 | 15.0 |
248
+ | 0-25 | 9.0 | 14.7 | 14.0 | 13.7 |
249
+ | 25-50 | 10.1 | 16.6 | 15.8 | 15.3 |
250
+ | 50+ | 11.3 | 18.2 | 17.7 | 18.5 |
251
+ | Bornholmsk | 9.8 | 17.7 | 15.7 | 16.4 |
252
+ | Fynsk | 12.1 | 18.3 | 17.7 | 16.7 |
253
+ | Københavnsk | 5.9 | 10.2 | 10.0 | 9.5 |
254
+ | Non-native | 12.2 | 20.9 | 19.4 | 19.4 |
255
+ | Nordjysk | 4.5 | 7.7 | 7.5 | 7.3 |
256
+ | Sjællandsk | 7.6 | 12.6 | 12.7 | 11.0 |
257
+ | Sydømål | 10.0 | 14.9 | 15.3 | 14.4 |
258
+ | Sønderjysk | 17.5 | 26.0 | 25.4 | 27.8 |
259
+ | Vestjysk | 15.0 | 26.3 | 25.2 | 26.7 |
260
+ | Østjysk | 7.5 | 11.7 | 11.3 | 10.8 |
261
+ | Overall | 10.4 | 17.0 | 16.3 | 16.4 |
262
+
263
+ </details>
264
+
265
+ <details>
266
+ <summary>
267
+ <b>Experiments with Røst-wav2vec2-315M with and without language model</b>
268
+ </summary>
269
+
270
+ The inclusion of a post-processing language model can affect the performance significantly.
271
+ The Røst-v1 and Røst-v2 models use the same language model (LM).
272
+ The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
273
+
274
+ | Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
275
+ | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------: |
276
+ | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.4% ± 0.4%** |
277
+ | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | No | 8.1% ± 0.2% | 23.9% ± 0.4% |
278
+ | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
279
+ | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
280
+ | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
281
+ | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
282
+
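The effect of the LM in the table above can be summarized as a relative WER reduction, that is, the share of the no-LM word error rate that the LM removes. The snippet below is a small arithmetic sketch that uses only the WER values from the table; the dictionary and function names are illustrative and not part of any CoRal tooling.

```python
# WER (%) without and with the language model, copied from the table above.
wer_by_model = {
    "roest-wav2vec2-1B-v2": (23.9, 16.4),
    "roest-wav2vec2-315M-v2": (25.1, 16.3),
    "roest-wav2vec2-315m-v1": (26.3, 17.0),
}


def relative_reduction(without_lm: float, with_lm: float) -> float:
    """Return the relative error reduction in percent."""
    return 100.0 * (without_lm - with_lm) / without_lm


for model, (without_lm, with_lm) in wer_by_model.items():
    reduction = relative_reduction(without_lm, with_lm)
    print(f"{model}: {reduction:.1f}% relative WER reduction")
```

By this measure the LM removes roughly a third of the word errors for all three models.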
283
+ The table below breaks these results down by dialect in the test set, with CER and WER for Røst-v1 and Røst-v2, each without and with the LM:
284
+
285
+ | | Røst-v1 | | Røst-v1 | | Røst-v2 | | Røst-v2 | |
286
+ |-------------|---------|---------|---------|---------|---------|---------|---------|---------|
287
+ | **LM** | **No** | | **Yes** | | **No** | | **Yes** | |
289
+ | Dialect | CER (%) | WER (%) | CER (%) | WER (%) | CER (%) | WER (%) | CER (%) | WER (%) |
290
+ | Vestjysk | 12.7 | 37.1 | 10.1 | 26.3 | 12.2 | 36.3 | 9.82 | 25.2 |
291
+ | Sønderjysk | 14.7 | 37.8 | 11.9 | 26.0 | 14.2 | 36.2 | 11.6 | 25.4 |
292
+ | Bornholmsk | 9.32 | 29.9 | 6.79 | 17.7 | 8.08 | 26.7 | 6.12 | 15.7 |
293
+ | Østjysk | 5.51 | 18.7 | 3.97 | 11.7 | 5.39 | 18.0 | 4.06 | 11.3 |
294
+ | Nordjysk | 3.86 | 13.6 | 2.57 | 7.72 | 3.80 | 13.5 | 2.75 | 7.51 |
295
+ | Københavnsk | 5.27 | 18.8 | 3.31 | 10.2 | 5.02 | 17.7 | 3.20 | 9.98 |
296
+ | Fynsk | 9.41 | 28.6 | 7.43 | 18.3 | 8.86 | 27.0 | 7.20 | 17.7 |
297
+ | Non-native | 10.6 | 33.2 | 7.84 | 20.9 | 10.0 | 31.6 | 7.46 | 19.4 |
298
+ | Sjællandsk | 5.82 | 19.5 | 4.44 | 12.6 | 5.70 | 18.6 | 4.48 | 12.7 |
299
+ | Sydømål | 7.09 | 20.7 | 6.38 | 14.9 | 6.96 | 20.4 | 6.44 | 15.3 |
300
+
301
+ </details>
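For readers unfamiliar with the metrics in the tables above, the following is a minimal sketch of how CER and WER are commonly computed, assuming the standard Levenshtein-based definitions (edit distance divided by the reference length, with spaces counted as characters for CER). The evaluation code actually used by the CoRal project is not shown in this model card, so the function names here are illustrative. The sentence pair is one of the example transcriptions from this model card.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


target = "det blev til yderligere ti mål i den første sæson på trods af en position som back"
predicted = "det blev til yderlig ti mål i den første sæson på trods af en position som back"

print(f"WER: {100 * wer(target, predicted):.1f}%")
print(f"CER: {100 * cer(target, predicted):.1f}%")
```

One wrong word out of 17 gives a WER of 5.9%, and three missing characters out of 82 give a CER of 3.7%, which matches the values reported for that example.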
302
 
303
  ### Performance on Other Datasets
304
 
305
  The model was also tested against other datasets to evaluate generalizability:
306
 
307
+ | | **Røst-whisper-large-v1** | | **Røst-wav2vec2-315M-v1** | | **Røst-wav2vec2-315M-v2** | | **Røst-wav2vec2-1B-v2** | |
308
+ | ------------------------------------------------------------------------------------- | -------------------------- | --------- | -------------------------- | --------- | -------------------------- | ----------- | ------------------------ | --------- |
309
+ | **Evaluation Dataset** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** |
310
+ | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | **10.4** | **4.3** | 17.0 | 6.6 | **16.3** | **6.5** | 16.4 | **6.5** |
311
+ | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 26.1 | 11.9 | **12.4** | **4.9** |
312
+ | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | **14.4** | **5.4** | 26.3 | 10.9 |
313
+ | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | **12.6** | **5.1** | 16.6 | 6.3 | 15.6 | 6.1 | **13.7** | **5.5** |
314
 
315
  **Note:** The vocabulary used for training includes numerals (0, 1, 2, ..., 9), which are converted to text in a post-processing step. If the model misses a space between numerals, they are interpreted as a single number, which especially affects the NST score because this dataset contains many numerals.
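The effect of a missed space can be illustrated with a short sketch. It assumes a simple post-processing step that replaces each run of digits with its Danish spelling; the lookup table and the function name are hypothetical and cover only the numbers used in this example, since the project's actual numeral conversion is not shown here.

```python
import re

# Hypothetical lookup table for illustration only; it is not the
# project's actual numeral post-processing.
DANISH_NUMBERS = {"2": "to", "3": "tre", "23": "treogtyve"}


def spell_out_numerals(text: str) -> str:
    """Replace each maximal run of digits with its Danish spelling, if known."""
    return re.sub(
        r"\d+",
        lambda match: DANISH_NUMBERS.get(match.group(), match.group()),
        text,
    )


with_space = spell_out_numerals("linje 2 3")   # space between the digits predicted
without_space = spell_out_numerals("linje 23")  # space between the digits missed

print(with_space)
print(without_space)
```

A single missing space turns two correct words ("to tre") into one different word ("treogtyve"), so one character-level mistake costs two word errors.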
316
 
317
  ---
 
 
318
 
319
+ ### Note on comparing Whisper and Wav2Vec2 models
320
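+ On the CoRal read-aloud test set, the Whisper models detailed in this model card exhibit significantly lower Character Error Rates (CER) and Word Error Rates (WER) than the Wav2Vec2 models (for example, Røst-whisper-large-v1 reaches 10.4% WER and 4.3% CER, against 16.3% WER and 6.5% CER for Røst-wav2vec2-315M-v2).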
321
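+ Whisper is an encoder-decoder transformer whose decoder generates text token by token, so each output is conditioned on the preceding text and the decoder acts as a built-in language model.
321
+ In contrast, Wav2Vec2 models are encoder-only and predict characters directly from the audio, without an explicit model of the preceding text.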
323
+ The Røst-Wav2Vec2 models incorporate a straightforward language model during post-processing, which addresses errors based on statistical language patterns.
324
+ Introducing a more complex, contextual post-processing language model might enable a better comparison between these model types, which the CoRal project plans to explore in future releases.
325
+
326
+ The Røst-Whisper model excels in read-aloud data, leveraging its embedded contextual framework to achieve more robust recognition within this context.
327
+ However, Wav2Vec2 models appear to generalize more effectively across various speech recognition tasks, whereas Whisper models incur higher error rates in conversational data.
328
+ Note that the CoRal-v2 conversation dataset is still preliminary and has limited speaker diversity, which may influence these results.
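To make the "sound prediction" side of this comparison concrete, the sketch below shows greedy decoding for a CTC output head, which is the usual setup for fine-tuned Wav2Vec2 ASR models and is assumed here. The model emits one label per audio frame; decoding merges repeated labels and then drops the blank symbol. The blank symbol `_` and the function name are illustrative choices, not the project's decoding code, and no language model is involved at this stage.

```python
from itertools import groupby

BLANK = "_"  # illustrative blank symbol; real vocabularies use a dedicated token


def ctc_greedy_collapse(frame_labels) -> str:
    """Merge consecutive repeated labels, then remove blanks."""
    merged = [label for label, _group in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


# One predicted label per audio frame (hypothetical values).
frames = list("tt_i_  måå_l")
print(ctc_greedy_collapse(frames))

# A blank between identical labels keeps a genuine double letter.
print(ctc_greedy_collapse(list("al_le")))
print(ctc_greedy_collapse(list("alle")))
```

Because every frame is labelled independently of the text produced so far, spelling and word-boundary errors are left to a separate post-processing LM, whereas Whisper's decoder handles that context internally.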
329
 
330
  ---
331
 
 
347
 
348
  We would specifically like to thank Dan Saattrup Nielsen (Alexandra Institute) for, among other things, the repository work, and Simon Leminen Madsen (Alexandra Institute) for the modelling work.
349
 
350
+
351
  ## Citation
352
 
353
  ```bibtex
354
  @misc{roest-wav2vec2-315m-v2,
355
  author = {Marie Juhl Jørgensen and Søren Vejlgaard Holm and Martin Carsten Nielsen and Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Torben Blach},
356
+ title = {Røst-wav2vec2-315m-v2: A Danish state-of-the-art speech recognition model trained on varied demographics and dialects},
357
  year = {2025},
358
  url = {https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2},
359
  }