MarieAlvenir committed on
Commit
9559a1b
·
1 Parent(s): f4408de

Stylistic changes

  1. README.md +110 -80
README.md CHANGED
@@ -30,7 +30,7 @@ model-index:
30
  name: WER
31
  ---
32
 
33
- # Pre-release of Roest-wav2vec2-1B-v2
34
  This is a pre-release of a Danish state-of-the-art speech recognition model, trained as part of the CoRal project by [Alvenir](https://www.alvenir.ai/).
35
 
36
  This repository contains a Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main). The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).
@@ -59,22 +59,19 @@ Next you can use the model using the `transformers` Python package as follows:
59
  ## Model Details
60
 
61
  Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [wav2vec2-xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) has been fine-tuned for automatic speech recognition on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) to improve its recognition of Danish speech across different dialects. The model was trained for 30K steps using the training setup in the [CoRal repository](https://github.com/alexandrainst/coral/tree) by running:
62
- ```
63
- python src/scripts/finetune_asr_model.py model=wav2vec2-medium max_steps=30000 datasets.coral_conversation_internal.id=CoRal-project/coral-v2 datasets.coral_readaloud_internal.id=CoRal-project/coral-v2
64
- ```
65
- The model is evaluated using a Language Model (LM) as post-processing. The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
66
 
67
- ---
 
 
 
 
 
 
68
 
69
- ## Dataset
70
 
71
- ### [CoRal-v2](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main)
72
- - **Subsets**:
73
- - Conversation
74
- - Read-aloud
75
- - **Language**: Danish.
76
- - **Variation**: Includes various dialects, ages, and gender distinctions.
77
- ### License
78
  Note that the dataset used is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (speech synthesis and biometric identification). See [license](https://huggingface.co/Alvenir/coral-1-whisper-large/blob/main/LICENSE).
79
 
80
  ---
@@ -82,14 +79,38 @@ Note that the dataset used is licensed under a custom license, adapted from Open
82
  ## Evaluation
83
 
84
  The model was evaluated using the following metrics:
85
- - **Word Error Rate (WER)**: The percentage of words incorrectly transcribed.
86
  - **Character Error Rate (CER)**: The percentage of characters incorrectly transcribed.
 
87
 
88
- **OBS!** It should be noted that the [CoRal test dataset](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) does not contain any conversation data, whereas the model is trained for read-aloud and conversation, but is only tested on read-aloud in the [CoRal test dataset](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test).
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
89
 
90
  | Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) WER |
91
  | :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
92
- | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | 6.5% ± 0.2% | 16.4% ± 0.4% |
93
  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
94
  | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
95
  | [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315M-v1) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
@@ -98,79 +119,88 @@ The model was evaluated using the following metrics:
98
 
99
  **Note:** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than reported in the model card.
100
 
101
- The model was also evaluated on a tentative pre-release of the coral-v2 conversation dataset. The results are tentative as the test set only includes 5 unique speakers, of which 4 are women. The test set includes 2 speakers with 'Fynsk' dialect, 1 with 'Sønderjysk', 1 with 'Non-native' and 1 'Nordjysk'. The whisper model is performing very poorly on the test set. An explanation could be hallucinations during silence and short sentences, a known whisper issue. Furthermore, both version 1 models have not been trained on any conversation data giving the models an obvious disadvantage.
102
-
103
- | Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
104
- | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
105
- | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | 23.9% | 36.7% |
106
- | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 24.2% | 37.7% |
107
- | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | 138% | 121% |
108
- | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | 123% | 80.5% |
109
-
110
- ### Detailed evaluation across demographics on the CoRal test data
111
- <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/wer.png">
112
-
113
- <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/cer.png">
114
-
115
- ### Table WER scores in % of evaluation across demographics on the CoRal test data
116
- | Category | roest-whisper-large-v1 | roest-wav2vec2-315m-v1 | roest-wav2vec2-315m-v2 | roest-wav2vec2-1B-v2 |
117
- |:---:|:---:|:---:|:---:|:---:|
118
- | female | 11.5 | 18.5 | 17.7 | 17.8 |
119
- | male | 9.4 | 15.5 | 14.9 | 15.0 |
120
- | 0-25 | 9.0 | 14.7 | 14.0 | 13.7 |
121
- | 25-50 | 10.1 | 16.6 | 15.8 | 15.3 |
122
- | 50+ | 11.3 | 18.2 | 17.7 | 18.5 |
123
- | Bornholmsk | 9.8 | 17.7 | 15.7 | 16.4 |
124
- | Fynsk | 12.1 | 18.3 | 17.7 | 16.7 |
125
- | Københavnsk | 5.9 | 10.2 | 10.0 | 9.5 |
126
- | Non-native | 12.2 | 20.9 | 19.4 | 19.4 |
127
- | Nordjysk | 4.5 | 7.7 | 7.5 | 7.3 |
128
- | Sjællandsk | 7.6 | 12.6 | 12.7 | 11.0 |
129
- | Sydømål | 10.0 | 14.9 | 15.3 | 14.4 |
130
- | Sønderjysk | 17.5 | 26.0 | 25.4 | 27.8 |
131
- | Vestjysk | 15.0 | 26.3 | 25.2 | 26.7 |
132
- | Østjysk | 7.5 | 11.7 | 11.3 | 10.8 |
133
- | Overall | 10.4 | 17.0 | 16.3 | 16.4 |
134
-
135
- ### Table CER scores in % of evaluation across demographics on the CoRal test data
136
- | Category | roest-whisper-large-v1 | roest-wav2vec2-315m-v1 | roest-wav2vec2-315m-v2 | roest-wav2vec2-1B-v2 |
137
- |:---:|:---:|:---:|:---:|:---:|
138
- | female | 5.1 | 7.4 | 7.2 | 7.3 |
139
- | male | 3.6 | 5.8 | 5.7 | 5.8 |
140
- | 0-25 | 3.4 | 5.4 | 5.3 | 5.1 |
141
- | 25-50 | 4.0 | 6.2 | 6.0 | 5.7 |
142
- | 50+ | 5.0 | 7.5 | 7.4 | 7.8 |
143
- | Bornholmsk | 3.8 | 6.8 | 6.1 | 6.2 |
144
- | Fynsk | 5.1 | 7.4 | 7.2 | 6.9 |
145
- | Københavnsk | 1.9 | 3.3 | 3.2 | 3.0 |
146
- | Non-native | 4.8 | 7.8 | 7.5 | 7.3 |
147
- | Nordjysk | 1.6 | 2.6 | 2.8 | 2.6 |
148
- | Sjællandsk | 3.0 | 4.4 | 4.5 | 3.9 |
149
- | Sydømål | 4.1 | 6.4 | 6.4 | 6.5 |
150
- | Sønderjysk | 8.8 | 11.9 | 11.6 | 12.6 |
151
- | Vestjysk | 6.4 | 10.1 | 9.8 | 10.5 |
152
- | Østjysk | 2.6 | 4.0 | 4.1 | 3.8 |
153
- | Overall | 4.3 | 6.6 | 6.5 | 6.5 |
154
-
155
- ### Roest-wav2vec2-1B-v2 with and without language model
156
- The inclusion of a post-processing language model can affect the performance significantly. The Roest-v1 and Roest-v2 models are using the same Language Model (LM). The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
 
 
 
 
 
 
 
157
 
158
  | Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
159
  | :-------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
160
- | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.4% ± 0.4%** |
161
- | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | No | 8.1% ± 0.2% | 23.9% ± 0.4% |
162
  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
163
  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
164
  | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
165
  | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
166
 
 
 
167
 
168
  ### Performance on Other Datasets
169
 
170
  The model was also tested against other datasets to evaluate generalizability:
171
- | | **Roest-whisper-large-v1** | | **Roest-wav2vec2-315M-v1** | | **Roest-wav2vec2-315M-v2** | | **Roest-wav2vec2-1B-v2** | |
172
  | ------------------------------------------------------------------------------------- | -------------------------- | ------- | -------------------------- | ----- | -------------------------- | ------- | ------------------------ | ------- |
173
- | Evaluation Dataset | WER % | CER % | WER % | CER % | WER % | CER % | WER % | CER % |
174
  | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | **10.4** | **4.3** | 17.0 | 6.6 | **16.3** | **6.5** | 16.4 | **6.5** |
175
  | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 26.1 | 11.9 | **12.4** | **4.9** |
176
  | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | **14.4** | **5.4** | 26.3 | 10.9 |
@@ -181,9 +211,9 @@ The model was also tested against other datasets to evaluate generalizability:
181
  ---
182
 
183
  ### Note on comparing whisper and wav2vec2 models
184
- The Whisper models detailed in this model card exhibit significantly lower Character Error Rates (CER) and Word Error Rates (WER) compared to the Wav2Vec2 models. Whisper utilizes a transformer-based architecture with additional layers that enhance contextual understanding. In contrast, Wav2Vec2 models employ shorter context windows that focus on sound prediction. The Roest-Wav2Vec2 models incorporate a straightforward language model during post-processing, which addresses errors based on statistical language patterns. Introducing a more complex, contextual post-processing language model might enable a better comparison between these model types, which the CoRal project plans to explore in future releases.
185
 
186
- The Roest-Whisper model excels in read-aloud data, leveraging its embedded contextual framework to achieve more robust recognition within this context. However, Wav2Vec2 models appear to generalize more effectively across various speech recognition tasks, whereas Whisper models incur higher error rates in conversational data. It’s important to note that the CoRal-v2 conversation dataset, being tentative and featuring limited speaker diversity, might influence these results.
187
 
188
  ---
189
 
@@ -209,7 +239,7 @@ We would like specifically to thank Dan Saattrup Nielsen, Alexandra Institute fo
209
  ```bibtex
210
  @misc{roest-wav2vec2-1B-v2,
211
  author = {Marie Juhl Jørgensen and Søren Vejlgaard Holm and Martin Carsten Nielsen and Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Torben Blach},
212
- title = {Roest-wav2vec-1B-v2: A Danish state-of-the-art speech recognition model trained on varied demographics and dialects},
213
  year = {2025},
214
  url = {https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2},
215
  }
 
30
  name: WER
31
  ---
32
 
33
+ # Pre-release of røst-wav2vec2-1B-v2
34
  This is a pre-release of a Danish state-of-the-art speech recognition model, trained as part of the CoRal project by [Alvenir](https://www.alvenir.ai/).
35
 
36
  This repository contains a Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main). The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).
 
59
  ## Model Details
60
 
61
  Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [wav2vec2-xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) has been fine-tuned for automatic speech recognition on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) to improve its recognition of Danish speech across different dialects. The model was trained for 30K steps using the training setup in the [CoRal repository](https://github.com/alexandrainst/coral/tree) by running:
 
 
 
 
62
 
63
+ ```bash
64
+ python src/scripts/finetune_asr_model.py \
65
+ model=wav2vec2-medium \
66
+ max_steps=30000 \
67
+ datasets.coral_conversation_internal.id=CoRal-project/coral-v2 \
68
+ datasets.coral_readaloud_internal.id=CoRal-project/coral-v2
69
+ ```
70
 
71
+ The model is evaluated using a Language Model (LM) as post-processing. The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
72
 
73
+ The model was trained on the [CoRal-v2](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) dataset, including both the conversational and read-aloud subsets.
74
+ This dataset consists of Danish speech across a variety of dialects, age groups and genders.
 
 
 
 
 
75
  Note that the dataset used is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (speech synthesis and biometric identification). See [license](https://huggingface.co/Alvenir/coral-1-whisper-large/blob/main/LICENSE).
76
 
77
  ---
 
79
  ## Evaluation
80
 
81
  The model was evaluated using the following metrics:
 
82
  - **Character Error Rate (CER)**: The percentage of characters incorrectly transcribed.
83
+ - **Word Error Rate (WER)**: The percentage of words incorrectly transcribed.
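
Both metrics are edit distances normalised by the length of the reference. This also explains why some values in the tables exceed 100%: inserted words or characters count as errors. The following is a minimal sketch of the standard definitions, not the evaluation code used by the CoRal project. The function names are hypothetical, and it assumes non-empty references and that any text normalisation (casing, punctuation) has already been applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, deletions and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction of the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a fraction of the reference character count."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(wer("det er godt", "det var godt"))
    # Two inserted words against a one-word reference: the rate exceeds 1.0.
    print(wer("ja", "ja ja ja"))
```

A model that hallucinates extra words during silence can therefore score above 100% even if every reference word is transcribed correctly.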
84
 
85
+
86
+ ### Conversational CoRal Performance
87
+
88
+ The model was first evaluated on a tentative pre-release of the CoRal-v2 conversation dataset.
89
+
90
+ The results are tentative because the test set includes only 5 unique speakers, 4 of whom are women. The test set includes 2 speakers with the 'Fynsk' dialect, 1 with 'Sønderjysk', 1 'Non-native' speaker and 1 with 'Nordjysk'.
91
+
92
+ Note that the high generalization error on conversation data for models trained on read-aloud data is still being analyzed.
93
+
94
+ | Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
95
+ | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
96
+ | CoRal-project/roest-wav2vec2-1B-v2 (This model) | 1B | Read-aloud and conversation | **23.9%**| **36.7%** |
97
+ | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 24.2% | 37.7% |
98
+ | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | 138% | 121% |
99
+ | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | 123% | 80.5% |
100
+ | [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 78.2% | 72.6% |
101
+ | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 46.4% | 57.4% |
102
+
103
+ <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/comparison-conversation-cer.png">
104
+
105
+ <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/comparison-conversation-wer.png">
106
+
107
+
108
+
109
+ ### Read-aloud CoRal Performance
110
 
111
  | Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) WER |
112
  | :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
113
+ | CoRal-project/roest-wav2vec2-1B-v2 (This model) | 1B | Read-aloud and conversation | 6.5% ± 0.2% | 16.4% ± 0.4% |
114
  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
115
  | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
116
  | [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315M-v1) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
 
119
 
120
  **Note:** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than reported in the model card.
121
 
122
+ <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/comparison-read_aloud-cer.png">
123
+
124
+ <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/comparison-read_aloud-wer.png">
125
+
126
+
127
+ <details>
128
+ <summary>
129
+ <b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
130
+ </summary>
131
+
132
+ | Category | Røst-whisper-large-v1 | Røst-wav2vec2-315m-v1 | Røst-wav2vec2-315m-v2 | Røst-wav2vec2-1B-v2 |
133
+ |:---:|:---:|:---:|:---:|:---:|
134
+ | female | 5.1 | 7.4 | 7.2 | 7.3 |
135
+ | male | 3.6 | 5.8 | 5.7 | 5.8 |
136
+ | 0-25 | 3.4 | 5.4 | 5.3 | 5.1 |
137
+ | 25-50 | 4.0 | 6.2 | 6.0 | 5.7 |
138
+ | 50+ | 5.0 | 7.5 | 7.4 | 7.8 |
139
+ | Bornholmsk | 3.8 | 6.8 | 6.1 | 6.2 |
140
+ | Fynsk | 5.1 | 7.4 | 7.2 | 6.9 |
141
+ | Københavnsk | 1.9 | 3.3 | 3.2 | 3.0 |
142
+ | Non-native | 4.8 | 7.8 | 7.5 | 7.3 |
143
+ | Nordjysk | 1.6 | 2.6 | 2.8 | 2.6 |
144
+ | Sjællandsk | 3.0 | 4.4 | 4.5 | 3.9 |
145
+ | Sydømål | 4.1 | 6.4 | 6.4 | 6.5 |
146
+ | Sønderjysk | 8.8 | 11.9 | 11.6 | 12.6 |
147
+ | Vestjysk | 6.4 | 10.1 | 9.8 | 10.5 |
148
+ | Østjysk | 2.6 | 4.0 | 4.1 | 3.8 |
149
+ | Overall | 4.3 | 6.6 | 6.5 | 6.5 |
150
+
151
+ </details>
152
+
153
+ <details>
154
+ <summary>
155
+ <b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
156
+ </summary>
157
+
158
+ | Category | Røst-whisper-large-v1 | Røst-wav2vec2-315m-v1 | Røst-wav2vec2-315m-v2 | Røst-wav2vec2-1B-v2 |
159
+ |:---:|:---:|:---:|:---:|:---:|
160
+ | female | 11.5 | 18.5 | 17.7 | 17.8 |
161
+ | male | 9.4 | 15.5 | 14.9 | 15.0 |
162
+ | 0-25 | 9.0 | 14.7 | 14.0 | 13.7 |
163
+ | 25-50 | 10.1 | 16.6 | 15.8 | 15.3 |
164
+ | 50+ | 11.3 | 18.2 | 17.7 | 18.5 |
165
+ | Bornholmsk | 9.8 | 17.7 | 15.7 | 16.4 |
166
+ | Fynsk | 12.1 | 18.3 | 17.7 | 16.7 |
167
+ | Københavnsk | 5.9 | 10.2 | 10.0 | 9.5 |
168
+ | Non-native | 12.2 | 20.9 | 19.4 | 19.4 |
169
+ | Nordjysk | 4.5 | 7.7 | 7.5 | 7.3 |
170
+ | Sjællandsk | 7.6 | 12.6 | 12.7 | 11.0 |
171
+ | Sydømål | 10.0 | 14.9 | 15.3 | 14.4 |
172
+ | Sønderjysk | 17.5 | 26.0 | 25.4 | 27.8 |
173
+ | Vestjysk | 15.0 | 26.3 | 25.2 | 26.7 |
174
+ | Østjysk | 7.5 | 11.7 | 11.3 | 10.8 |
175
+ | Overall | 10.4 | 17.0 | 16.3 | 16.4 |
176
+
177
+ </details>
178
+
179
+ <details>
180
+ <summary>
181
+ <b>Experiments with Røst-wav2vec2 with and without language model</b>
182
+ </summary>
183
+
184
+ Adding a post-processing language model (LM) affects performance significantly: for this model, it lowers the WER on the CoRal read-aloud test set from 23.9% to 16.4% and the CER from 8.1% to 6.5%. The Røst-v1 and Røst-v2 models use the same LM, which is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
185
 
186
  | Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
187
  | :-------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
188
+ | CoRal-project/roest-wav2vec2-1B-v2 (This model) | 1B | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.4% ± 0.4%** |
189
+ | CoRal-project/roest-wav2vec2-1B-v2 (This model) | 1B | Read-aloud and conversation | No | 8.1% ± 0.2% | 23.9% ± 0.4% |
190
  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
191
  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
192
  | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
193
  | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
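
To put the effect of the language model in relative terms, the WER values from the table above can be compared directly. This is only an illustrative calculation on the reported point estimates. It ignores the confidence intervals, and the dictionary layout and function name are assumptions made for this sketch.

```python
# WER in % on the CoRal read-aloud test set, copied from the table above,
# stored as (without LM, with LM).
wer_scores = {
    "roest-wav2vec2-1B-v2": (23.9, 16.4),
    "roest-wav2vec2-315m-v2": (25.1, 16.3),
    "roest-wav2vec2-315m-v1": (26.3, 17.0),
}


def relative_reduction(without_lm, with_lm):
    """Relative error reduction in percent when the LM is added."""
    return 100.0 * (without_lm - with_lm) / without_lm


reductions = {name: relative_reduction(*pair) for name, pair in wer_scores.items()}

if __name__ == "__main__":
    for name, value in reductions.items():
        print(f"{name}: {value:.1f}% relative WER reduction")
```

For all three wav2vec2 models, the LM removes roughly a third of the word errors on read-aloud data.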
194
 
195
+ </details>
196
+
197
 
198
  ### Performance on Other Datasets
199
 
200
  The model was also tested against other datasets to evaluate generalizability:
201
+ | | **Røst-whisper-large-v1** | | **Røst-wav2vec2-315M-v1** | | **Røst-wav2vec2-315M-v2** | | **Røst-wav2vec2-1B-v2** | |
202
  | ------------------------------------------------------------------------------------- | -------------------------- | ------- | -------------------------- | ----- | -------------------------- | ------- | ------------------------ | ------- |
203
+ | **Evaluation Dataset** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** |
204
  | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | **10.4** | **4.3** | 17.0 | 6.6 | **16.3** | **6.5** | 16.4 | **6.5** |
205
  | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 26.1 | 11.9 | **12.4** | **4.9** |
206
  | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | **14.4** | **5.4** | 26.3 | 10.9 |
 
211
  ---
212
 
213
  ### Note on comparing whisper and wav2vec2 models
214
+ The Whisper models detailed in this model card exhibit significantly lower Character Error Rates (CER) and Word Error Rates (WER) compared to the Wav2Vec2 models. Whisper utilizes a transformer-based architecture with additional layers that enhance contextual understanding. In contrast, Wav2Vec2 models employ shorter context windows that focus on sound prediction. The Røst-Wav2Vec2 models incorporate a straightforward language model during post-processing, which addresses errors based on statistical language patterns. Introducing a more complex, contextual post-processing language model might enable a better comparison between these model types, which the CoRal project plans to explore in future releases.
215
 
216
+ The Røst-Whisper model excels in read-aloud data, leveraging its embedded contextual framework to achieve more robust recognition within this context. However, Wav2Vec2 models appear to generalize more effectively across various speech recognition tasks, whereas Whisper models incur higher error rates in conversational data. It’s important to note that the CoRal-v2 conversation dataset, being tentative and featuring limited speaker diversity, might influence these results.
217
 
218
  ---
219
 
 
239
  ```bibtex
240
  @misc{roest-wav2vec2-1B-v2,
241
  author = {Marie Juhl Jørgensen and Søren Vejlgaard Holm and Martin Carsten Nielsen and Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Torben Blach},
242
+ title = {Røst-wav2vec-1B-v2: A Danish state-of-the-art speech recognition model trained on varied demographics and dialects},
243
  year = {2025},
244
  url = {https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2},
245
  }