MarieAlvenir committed on
Commit
33b409c
·
1 Parent(s): 462721f

1B model added to tables

  1. README.md +74 -55
README.md CHANGED
@@ -167,72 +167,86 @@ The model was evaluated using the following metrics:
167
  **OBS!** The [CoRal test dataset](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) contains no conversation data. The model is trained on both read-aloud and conversation data, but it is tested only on the read-aloud portion of the [CoRal test dataset](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test).
168
 
169
 
170
- | Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
171
  | :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
 
172
  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
173
- | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
174
- | [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
175
  | [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
176
  | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |
177
 
178
  **OBS!** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than the one reported in its model card.
179
 
 
 
 
 
 
 
 
 
 
 
180
  ### Detailed evaluation across demographics on the CoRal test data
181
  <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/wer.png">
182
 
183
  <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/cer.png">
184
 
185
  ### Table WER scores in % of evaluation across demographics on the CoRal test data
186
- | Category | roest-whisper-large-v1 | roest-wav2vec2-315m-v1 | roest-wav2vec2-315m-v2 |
187
- |:---:|:---:|:---:|:---:|
188
- | female | 11.5 | 18.5 | 17.7 |
189
- | male | 9.4 | 15.5 | 14.9 |
190
- | 0-25 | 9.0 | 14.7 | 14.0 |
191
- | 25-50 | 10.1 | 16.6 | 15.8 |
192
- | 50+ | 11.3 | 18.2 | 17.7 |
193
- | Bornholmsk | 9.8 | 17.7 | 15.7 |
194
- | Fynsk | 12.1 | 18.3 | 17.7 |
195
- | Københavnsk | 5.9 | 10.2 | 10.0 |
196
- | Non-native | 12.2 | 20.9 | 19.4 |
197
- | Nordjysk | 4.5 | 7.7 | 7.5 |
198
- | Sjællandsk | 7.6 | 12.6 | 12.7 |
199
- | Sydømål | 10.0 | 14.9 | 15.3 |
200
- | Sønderjysk | 17.5 | 26.0 | 25.4 |
201
- | Vestjysk | 15.0 | 26.3 | 25.2 |
202
- | Østjysk | 7.5 | 11.7 | 11.3 |
203
- | Overall | 10.4 | 17.0 | 16.3 |
204
 
205
 
206
  ### Table CER scores in % of evaluation across demographics on the CoRal test data
207
- | Category | roest-whisper-large-v1 | roest-wav2vec2-315m-v1 | roest-wav2vec2-315m-v2 |
208
- |:---:|:---:|:---:|:---:|
209
- | female | 5.1 | 7.4 | 7.2 |
210
- | male | 3.6 | 5.8 | 5.7 |
211
- | 0-25 | 3.4 | 5.4 | 5.3 |
212
- | 25-50 | 4.0 | 6.2 | 6.0 |
213
- | 50+ | 5.0 | 7.5 | 7.4 |
214
- | Bornholmsk | 3.8 | 6.8 | 6.1 |
215
- | Fynsk | 5.1 | 7.4 | 7.2 |
216
- | Københavnsk | 1.9 | 3.3 | 3.2 |
217
- | Non-native | 4.8 | 7.8 | 7.5 |
218
- | Nordjysk | 1.6 | 2.6 | 2.8 |
219
- | Sjællandsk | 3.0 | 4.4 | 4.5 |
220
- | Sydømål | 4.1 | 6.4 | 6.4 |
221
- | Sønderjysk | 8.8 | 11.9 | 11.6 |
222
- | Vestjysk | 6.4 | 10.1 | 9.8 |
223
- | Østjysk | 2.6 | 4.0 | 4.1 |
224
- | Overall | 4.3 | 6.6 | 6.5 |
 
225
 
226
 
227
  ### Roest-wav2vec2-315M with and without language model
228
- The inclusion of a post-processing language model can affect the performance significantly. The Roest-v1 and Roest-v2 models are using the same Language Model (LM). The utilized LM is the one trained and used by [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/alexandrainst/roest-315m).
229
 
230
- | Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
231
  | :-------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
 
 
232
  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
233
  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
234
- | [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
235
- | [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
236
 
237
  ### Detailed Roest-wav2vec2-315M with and without language model on different dialects
238
  Here are the results of the model on different Danish dialects in the test set:
@@ -257,17 +271,21 @@ Here are the results of the model on different danish dialects in the test set:
257
 
258
  The model was also tested against other datasets to evaluate generalizability:
259
 
260
- | | **Roest-wav2vec2-315M-v1** | | **Roest-wav2vec2-315M-v2** | |
261
- | ------------------------------------------------------------------------------------- | ----------- | ----- | ----------- | -------- |
262
- | Evaluation Dataset | WER % | CER % | WER % | CER % |
263
- | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) | 17.0 | 6.6 | **16.3** | **6.5** |
264
- | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.7 | 13.9 | **26.1** | **11.9** |
265
- | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 16.7 | 6.6 | **14.4** | **5.4** |
266
- | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 27.3 | 7.9 | **26.4** | **7.7** |
267
- | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) Normed | 16.6 | 6.3 | **15.6** | **6.1** |
268
 
 
269
 
270
- **OBS!** The vocab used for training incudes numerals (0,1,2,..,9), which are translated to text in a post-processing step. If the model misses spaces the numbers are interpreted as one, which expecially affects the NST score as this dataset contains many numerals.
 
 
 
 
271
 
272
  ---
273
 
@@ -291,11 +309,12 @@ We would like specifically to thank Dan Saattrup Nielsen, Alexandra Institute fo
291
 
292
  ## Citation
293
 
294
- We will submit a research paper soon, but until then, if you use this model in your research or development, please cite it as follows:
295
-
296
  @misc{roest-wav2vec2-315m-v2,
297
- author = {Marie Juhl Jørgensen, Søren Vejlgaard Holm, Martin Carsten Nielsen, Dan Saattrup Nielsen, Sif Bernstorff Lehmann, Simon Leminen Madsen, Anders Jess Pedersen, Anna Katrine van Zee, Anders Søgaard and Torben Blach},
298
  title = {Roest-wav2vec-315m-v2: A Danish state-of-the-art speech recognition model trained on varied demographics and dialects},
299
  year = {2025},
300
  url = {https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2},
301
  }
 
 
 
167
  **OBS!** The [CoRal test dataset](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) contains no conversation data. The model is trained on both read-aloud and conversation data, but it is tested only on the read-aloud portion of the [CoRal test dataset](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test).
168
 
169
 
170
+ | Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
171
  | :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
172
+ | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | 6.5% ± 0.2% | 16.4% ± 0.4% |
173
  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
174
+ | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
175
+ | [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315M-v1) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
176
  | [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
177
  | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |
178
 
179
  **OBS!** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than the one reported in its model card.
180
 
181
+ The model was also evaluated on a pre-release of the coral-v2 conversation dataset. The results are tentative because the test set includes only 5 unique speakers, 4 of whom are women: 2 speakers with the 'Fynsk' dialect, 1 with 'Sønderjysk', 1 'Non-native' and 1 with 'Nordjysk'. The Whisper model performs very poorly on this test set. One possible explanation is hallucination during silence and short sentences, a known Whisper issue. Furthermore, neither version 1 model has been trained on any conversation data, which puts them at an obvious disadvantage.
182
+
183
+ | Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
184
+ | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
185
+ | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | 23.9% | 36.7% |
186
+ | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 24.2% | 37.7% |
187
+ | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | 138% | 121% |
188
+ | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | 123% | 80.5% |
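
Error rates above 100% are possible because WER and CER count insertions as well as substitutions and deletions, and the total is divided by the length of the reference. The following is a minimal sketch, assuming the standard edit-distance definition of WER. The `edit_distance` and `wer` functions and the example sentences are illustrative and are not taken from the CoRal evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Hypothetical example: a short reference and a hypothesis that
# contains six hallucinated extra words.
reference = "ja tak"
hypothesis = "ja tak for at se med i dag"
print(f"WER: {wer(reference, hypothesis):.0%}")
```

Here six inserted words against a two-word reference give a WER of 300%, which shows how short utterances combined with hallucinated text can push the score well past 100%.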
189
+
190
+
191
  ### Detailed evaluation across demographics on the CoRal test data
192
  <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/wer.png">
193
 
194
  <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/cer.png">
195
 
196
  ### Table WER scores in % of evaluation across demographics on the CoRal test data
197
+ | Category | roest-whisper-large-v1 | roest-wav2vec2-315m-v1 | roest-wav2vec2-315m-v2 | roest-wav2vec2-1B-v2 |
198
+ |:---:|:---:|:---:|:---:|:---:|
199
+ | female | 11.5 | 18.5 | 17.7 | 17.8 |
200
+ | male | 9.4 | 15.5 | 14.9 | 15.0 |
201
+ | 0-25 | 9.0 | 14.7 | 14.0 | 13.7 |
202
+ | 25-50 | 10.1 | 16.6 | 15.8 | 15.3 |
203
+ | 50+ | 11.3 | 18.2 | 17.7 | 18.5 |
204
+ | Bornholmsk | 9.8 | 17.7 | 15.7 | 16.4 |
205
+ | Fynsk | 12.1 | 18.3 | 17.7 | 16.7 |
206
+ | Københavnsk | 5.9 | 10.2 | 10.0 | 9.5 |
207
+ | Non-native | 12.2 | 20.9 | 19.4 | 19.4 |
208
+ | Nordjysk | 4.5 | 7.7 | 7.5 | 7.3 |
209
+ | Sjællandsk | 7.6 | 12.6 | 12.7 | 11.0 |
210
+ | Sydømål | 10.0 | 14.9 | 15.3 | 14.4 |
211
+ | Sønderjysk | 17.5 | 26.0 | 25.4 | 27.8 |
212
+ | Vestjysk | 15.0 | 26.3 | 25.2 | 26.7 |
213
+ | Østjysk | 7.5 | 11.7 | 11.3 | 10.8 |
214
+ | Overall | 10.4 | 17.0 | 16.3 | 16.4 |
215
 
216
 
217
  ### Table CER scores in % of evaluation across demographics on the CoRal test data
218
+ | Category | roest-whisper-large-v1 | roest-wav2vec2-315m-v1 | roest-wav2vec2-315m-v2 | roest-wav2vec2-1B-v2 |
219
+ |:---:|:---:|:---:|:---:|:---:|
220
+ | female | 5.1 | 7.4 | 7.2 | 7.3 |
221
+ | male | 3.6 | 5.8 | 5.7 | 5.8 |
222
+ | 0-25 | 3.4 | 5.4 | 5.3 | 5.1 |
223
+ | 25-50 | 4.0 | 6.2 | 6.0 | 5.7 |
224
+ | 50+ | 5.0 | 7.5 | 7.4 | 7.8 |
225
+ | Bornholmsk | 3.8 | 6.8 | 6.1 | 6.2 |
226
+ | Fynsk | 5.1 | 7.4 | 7.2 | 6.9 |
227
+ | Københavnsk | 1.9 | 3.3 | 3.2 | 3.0 |
228
+ | Non-native | 4.8 | 7.8 | 7.5 | 7.3 |
229
+ | Nordjysk | 1.6 | 2.6 | 2.8 | 2.6 |
230
+ | Sjællandsk | 3.0 | 4.4 | 4.5 | 3.9 |
231
+ | Sydømål | 4.1 | 6.4 | 6.4 | 6.5 |
232
+ | Sønderjysk | 8.8 | 11.9 | 11.6 | 12.6 |
233
+ | Vestjysk | 6.4 | 10.1 | 9.8 | 10.5 |
234
+ | Østjysk | 2.6 | 4.0 | 4.1 | 3.8 |
235
+ | Overall | 4.3 | 6.6 | 6.5 | 6.5 |
236
+
237
 
238
 
239
  ### Roest-wav2vec2-315M with and without language model
240
+ Including a post-processing language model (LM) can affect performance significantly. The Roest-v1 and Roest-v2 models use the same LM, namely the one trained for and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
241
 
242
+ | Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
243
  | :-------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
244
+ | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.4% ± 0.4%** |
245
+ | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | No | 8.1% ± 0.2% | 23.9% ± 0.4% |
246
  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
247
  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
248
+ | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
249
+ | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
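
To put the effect of the language model in perspective, the relative WER reduction can be computed from the table above. This is a small arithmetic sketch. The numbers are copied from the table, while the `relative_reduction` helper and the dictionary layout are introduced here only for illustration.

```python
# WER (%) without and with the language model, copied from the table above.
wer_scores = {
    "roest-wav2vec2-1B-v2": (23.9, 16.4),
    "roest-wav2vec2-315M-v2": (25.1, 16.3),
    "roest-wav2vec2-315m-v1": (26.3, 17.0),
}


def relative_reduction(without_lm, with_lm):
    """Fraction of the original error that the language model removes."""
    return (without_lm - with_lm) / without_lm


for model, (without_lm, with_lm) in wer_scores.items():
    reduction = relative_reduction(without_lm, with_lm)
    print(f"{model}: {reduction:.1%} relative WER reduction")
```

For all three models the language model removes roughly a third of the word errors (about 31% to 35% relative).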
250
 
251
  ### Detailed Roest-wav2vec2-315M with and without language model on different dialects
252
  Here are the results of the model on different Danish dialects in the test set:
 
271
 
272
  The model was also tested against other datasets to evaluate generalizability:
273
 
274
+ | | **Roest-whisper-large-v1** | | **Roest-wav2vec2-315M-v1** | | **Roest-wav2vec2-315M-v2** | | **Roest-wav2vec2-1B-v2** | |
275
+ | ------------------------------------------------------------------------------------- | -------------------------- | ------- | -------------------------- | ----- | -------------------------- | ------- | ------------------------ | ------- |
276
+ | Evaluation Dataset | WER % | CER % | WER % | CER % | WER % | CER % | WER % | CER % |
277
+ | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | **10.4** | **4.3** | 17.0 | 6.6 | **16.3** | **6.5** | 16.4 | **6.5** |
278
+ | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 26.1 | 11.9 | **12.4** | **4.9** |
279
+ | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | **14.4** | **5.4** | 26.3 | 10.9 |
280
+ | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | **12.6** | **5.1** | 16.6 | 6.3 | 15.6 | 6.1 | **13.7** | **5.5** |
 
281
 
282
+ **OBS!** The vocab used for training includes numerals (0,1,2,..,9), which are translated to text in a post-processing step. If the model misses spaces, adjacent numerals are interpreted as a single number, which especially affects the NST score as this dataset contains many numerals.
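
The effect of a missed space can be illustrated with a toy version of such a post-processing step. This is only a sketch: the `NUMBER_WORDS` lookup and the `numerals_to_text` function are hypothetical and do not reproduce the actual CoRal post-processing, which would need a full number-to-words converter.

```python
import re

# Hypothetical lookup for illustration only; it covers just the numbers
# used in the examples below.
NUMBER_WORDS = {"1": "en", "2": "to", "12": "tolv"}


def numerals_to_text(transcript):
    """Replace each run of digits with its Danish word, if known."""
    return re.sub(
        r"\d+",
        lambda match: NUMBER_WORDS.get(match.group(), match.group()),
        transcript,
    )


print(numerals_to_text("nummer 1 2"))  # digits separated by a space
print(numerals_to_text("nummer 12"))   # the space between the digits was missed
```

With the space present the output is "nummer en to"; without it the two digits are read as one number and the output becomes "nummer tolv", so a single missing space changes every affected word.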
283
 
284
+ ---
285
+ ### Note on comparing whisper and wav2vec2 models
286
+ The fine-tuned Whisper models in this model card (roest-whisper-large-v1 and hviske-v2) achieve significantly lower Character Error Rates (CER) and Word Error Rates (WER) on the CoRal read-aloud test set than the Wav2Vec2 models. Whisper uses a transformer-based architecture with additional layers that enhance contextual understanding. In contrast, Wav2Vec2 models employ shorter context windows that focus on sound prediction. The Roest-Wav2Vec2 models apply a simple language model during post-processing, which corrects errors based on statistical language patterns. A more complex, contextual post-processing language model might enable a fairer comparison between these model types, which the CoRal project plans to explore in future releases.
287
+
288
+ The Roest-Whisper model excels on read-aloud data, where its built-in contextual modelling gives more robust recognition. However, the Wav2Vec2 models appear to generalize more effectively across speech recognition tasks, whereas the Whisper model incurs much higher error rates on conversational data. Note that the CoRal-v2 conversation dataset is a tentative pre-release with limited speaker diversity, which may influence these results.
289
 
290
  ---
291
 
 
309
 
310
  ## Citation
311
 
312
+ ```bibtex
 
313
  @misc{roest-wav2vec2-315m-v2,
314
+ author = {Marie Juhl Jørgensen and Søren Vejlgaard Holm and Martin Carsten Nielsen and Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Torben Blach},
315
  title = {Roest-wav2vec-315m-v2: A Danish state-of-the-art speech recognition model trained on varied demographics and dialects},
316
  year = {2025},
317
  url = {https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2},
318
  }
319
+ ```
320
+