docs: Update model card
#1, opened by saattrupdan
README.md CHANGED
@@ -30,10 +30,13 @@ model-index:
|
|
30 |
name: WER
|
31 |
---
|
32 |
|
33 |
-
#
|
34 |
-
This is a
|
|
|
|
|
|
|
|
|
35 |
|
36 |
-
This repository contains a Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main). The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).
|
37 |
|
38 |
## Quick Start
|
39 |
|
@@ -59,233 +62,270 @@ Next you can use the model using the `transformers` Python package as follows:
|
|
59 |
|
60 |
Explore the following audio samples along with their transcriptions and accuracy metrics. Each example showcases the model's performance with different Danish dialects.
|
61 |
135 |
|
136 |
---
|
137 |
|
138 |
## Model Details
|
139 |
|
140 |
Wav2Vec2 is a state-of-the-art model architecture for speech recognition that leverages self-supervised learning from raw audio data. The pre-trained [Wav2Vec2-XLS-R-300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m) has been fine-tuned for automatic speech recognition on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) to improve its recognition of Danish speech across different dialects. The model was trained for 30K steps using the training setup in the [CoRal repository](https://github.com/alexandrainst/coral/tree) by running:
|
141 |
-
```
|
142 |
-
python src/scripts/finetune_asr_model.py model=wav2vec2-small max_steps=30000 datasets.coral_conversation_internal.id=CoRal-project/coral-v2 datasets.coral_readaloud_internal.id=CoRal-project/coral-v2
|
143 |
-
```
|
144 |
-
The model is evaluated using a Language Model (LM) as post-processing. The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2).
|
145 |
|
146 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
147 |
|
148 |
-
|
|
|
149 |
|
150 |
-
|
151 |
-
|
152 |
-
|
153 |
-
- Read-aloud
|
154 |
-
- **Language**: Danish.
|
155 |
-
- **Variation**: Includes various dialects, age groups, and gender distinctions.
|
156 |
-
### License
|
157 |
-
Note that the dataset used is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (speech synthesis and biometric identification). See [license](https://huggingface.co/Alvenir/coral-1-whisper-large/blob/main/LICENSE).
|
158 |
|
159 |
---
|
160 |
|
161 |
## Evaluation
|
162 |
|
163 |
The model was evaluated using the following metrics:
|
164 |
-
- **Word Error Rate (WER)**: The percentage of words incorrectly transcribed.
|
165 |
- **Character Error Rate (CER)**: The percentage of characters incorrectly transcribed.
|
|
|
166 |
|
167 |
-
|
168 |
-
|
169 |
-
|
170 |
-
| Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) WER |
|
171 |
-
| :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
|
172 |
-
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | 6.5% ± 0.2% | 16.4% ± 0.4% |
|
173 |
-
| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
|
174 |
-
| [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
|
175 |
-
| [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315M-v1) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
|
176 |
-
| [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
|
177 |
-
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |
|
178 |
|
179 |
-
|
180 |
|
181 |
-
The
|
|
|
|
|
|
|
182 |
|
183 |
| Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
|
184 |
| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
|
185 |
-
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation |
|
186 |
| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 24.2% | 37.7% |
|
187 |
-
| [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1)
|
188 |
-
| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1)
|
189 |
|
190 |
|
191 |
-
###
|
192 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
193 |
|
194 |
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/cer.png">
|
195 |
269 |
|
270 |
### Performance on Other Datasets
|
271 |
|
272 |
The model was also tested against other datasets to evaluate generalizability:
|
273 |
|
274 |
-
| | **
|
275 |
-
| ------------------------------------------------------------------------------------- | -------------------------- |
|
276 |
-
| Evaluation Dataset
|
277 |
-
| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | **10.4** | **4.3**
|
278 |
-
| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5
|
279 |
-
| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2
|
280 |
-
| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | **12.6** | **5.1**
|
281 |
|
282 |
**Note:** The vocabulary used for training includes numerals (0, 1, 2, ..., 9), which are converted to text in a post-processing step. If the model misses spaces, adjacent numerals are interpreted as a single number, which especially affects the NST score, as this dataset contains many numerals.
|
283 |
|
284 |
---
|
285 |
-
### Note on comparing whisper and wav2vec2 models
|
286 |
-
The Whisper models detailed in this model card exhibit significantly lower Character Error Rates (CER) and Word Error Rates (WER) compared to the Wav2Vec2 models. Whisper utilizes a transformer-based architecture with additional layers that enhance contextual understanding. In contrast, Wav2Vec2 models employ shorter context windows that focus on sound prediction. The Roest-Wav2Vec2 models incorporate a straightforward language model during post-processing, which addresses errors based on statistical language patterns. Introducing a more complex, contextual post-processing language model might enable a better comparison between these model types, which the CoRal project plans to explore in future releases.
|
287 |
|
288 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
289 |
|
290 |
---
|
291 |
|
@@ -307,12 +347,13 @@ The CoRal project is funded by the [Danish Innovation Fund](https://innovationsf
|
|
307 |
|
308 |
We would specifically like to thank Dan Saattrup Nielsen (Alexandra Institute) for, among other things, the repository work, and Simon Leminen Madsen (Alexandra Institute) for the modelling work.
|
309 |
|
|
|
310 |
## Citation
|
311 |
|
312 |
```bibtex
|
313 |
@misc{roest-wav2vec2-315m-v2,
|
314 |
author = {Marie Juhl Jørgensen, Søren Vejlgaard Holm, Martin Carsten Nielsen, Dan Saattrup Nielsen, Sif Bernstorff Lehmann, Simon Leminen Madsen and Torben Blach},
|
315 |
-
title = {
|
316 |
year = {2025},
|
317 |
url = {https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2},
|
318 |
}
|
|
|
30 |
name: WER
|
31 |
---
|
32 |
|
33 |
+
# Røst-wav2vec2-315m-v2
|
34 |
+
This is a Danish state-of-the-art speech recognition model, trained as part of the CoRal project by [Alvenir](https://www.alvenir.ai/).
|
35 |
+
|
36 |
+
This repository contains a Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main).
|
37 |
+
The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects.
|
38 |
+
The model is designed for automatic speech recognition (ASR).
|
39 |
|
|
|
40 |
|
41 |
## Quick Start
|
42 |
|
|
|
62 |
|
63 |
Explore the following audio samples along with their transcriptions and accuracy metrics. Each example showcases the model's performance with different Danish dialects.
|
64 |
|
65 |
+
<details>
|
66 |
+
<summary>
|
67 |
+
<b>Example 1 - Vestjysk Dialect</b>
|
68 |
+
</summary>
|
69 |
+
|
70 |
+
**Audio Sample:**
|
71 |
+
<audio controls>
|
72 |
+
<source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example1.wav" type="audio/wav">
|
73 |
+
Your browser does not support the audio tag.
|
74 |
+
</audio>
|
75 |
+
|
76 |
+
**Model Transcription:**
|
77 |
+
*det blev til yderlig ti mål i den første sæson på trods af en position som back*
|
78 |
+
|
79 |
+
**Target Transcription:**
|
80 |
+
*det blev til yderligere ti mål i den første sæson på trods af en position som back*
|
81 |
+
|
82 |
+
- **Character Error Rate (CER):** 3.7%
|
83 |
+
- **Word Error Rate (WER):** 5.9%
|
84 |
+
</details>
|
85 |
+
|
86 |
+
<details>
|
87 |
+
<summary>
|
88 |
+
<b>Example 2 - Sønderjysk Dialect</b>
|
89 |
+
</summary>
|
90 |
+
|
91 |
+
**Audio Sample:**
|
92 |
+
<audio controls>
|
93 |
+
<source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example2.wav" type="audio/wav">
|
94 |
+
Your browser does not support the audio tag.
|
95 |
+
</audio>
|
96 |
+
|
97 |
+
**Model Transcription:**
|
98 |
+
*en arkitektoniske udformning af pladser forslagene iver benzen*
|
99 |
+
|
100 |
+
**Target Transcription:**
|
101 |
+
*den arkitektoniske udformning af pladsen er forestået af ivar bentsen*
|
102 |
+
|
103 |
+
- **Character Error Rate (CER):** 20.3%
|
104 |
+
- **Word Error Rate (WER):** 60.0%
|
105 |
+
</details>
|
106 |
+
|
107 |
+
<details>
|
108 |
+
<summary>
|
109 |
+
<b>Example 3 - Nordsjællandsk Dialect</b>
|
110 |
+
</summary>
|
111 |
+
|
112 |
+
**Audio Sample:**
|
113 |
+
<audio controls>
|
114 |
+
<source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example3.wav" type="audio/wav">
|
115 |
+
Your browser does not support the audio tag.
|
116 |
+
</audio>
|
117 |
+
|
118 |
+
**Model Transcription:**
|
119 |
+
*østrig og ungarn samarbejder om søen gennem den østrigske og ungarske vandkommission*
|
120 |
+
|
121 |
+
**Target Transcription:**
|
122 |
+
*østrig og ungarn samarbejder om søen gennem den østrigske og ungarske vandkommission*
|
123 |
+
|
124 |
+
- **Character Error Rate (CER):** 0.0%
|
125 |
+
- **Word Error Rate (WER):** 0.0%
|
126 |
+
</details>
|
127 |
+
|
128 |
+
<details>
|
129 |
+
<summary>
|
130 |
+
<b>Example 4 - Lollandsk Dialect</b>
|
131 |
+
</summary>
|
132 |
+
|
133 |
+
**Audio Sample:**
|
134 |
+
<audio controls>
|
135 |
+
<source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example4.wav" type="audio/wav">
|
136 |
+
Your browser does not support the audio tag.
|
137 |
+
</audio>
|
138 |
+
|
139 |
+
**Model Transcription:**
|
140 |
+
*det er produceret af thomas helme og indspillede i easy sound recording studio i københavn*
|
141 |
+
|
142 |
+
**Target Transcription:**
|
143 |
+
*det er produceret af thomas helmig og indspillet i easy sound recording studio i københavn*
|
144 |
+
|
145 |
+
- **Character Error Rate (CER):** 4.4%
|
146 |
+
- **Word Error Rate (WER):** 13.3%
|
147 |
+
</details>
|
148 |
|
149 |
---
|
150 |
|
151 |
## Model Details
|
152 |
|
153 |
Wav2Vec2 is a state-of-the-art model architecture for speech recognition that leverages self-supervised learning from raw audio data. The pre-trained [Wav2Vec2-XLS-R-300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m) has been fine-tuned for automatic speech recognition on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) to improve its recognition of Danish speech across different dialects. The model was trained for 30K steps using the training setup in the [CoRal repository](https://github.com/alexandrainst/coral/tree) by running:
|
|
|
|
|
|
|
|
|
154 |
|
155 |
+
```bash
|
156 |
+
python src/scripts/finetune_asr_model.py \
|
157 |
+
model=wav2vec2-small \
|
158 |
+
max_steps=30000 \
|
159 |
+
datasets.coral_conversation_internal.id=CoRal-project/coral-v2 \
|
160 |
+
datasets.coral_readaloud_internal.id=CoRal-project/coral-v2
|
161 |
+
```
|
162 |
|
163 |
+
The model is evaluated using a Language Model (LM) as post-processing.
|
164 |
+
The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
|
165 |
|
166 |
+
The model was trained on the [CoRal-v2](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) dataset, including both the conversational and read-aloud subsets.
|
167 |
+
This dataset consists of Danish speech across a variety of dialects, age groups and gender distinctions.
|
168 |
+
Note that the dataset, and thus also this model, is licensed under a custom license adapted from OpenRAIL-M, which allows commercial use with a few restrictions (on speech synthesis and biometric identification); see the [license](https://huggingface.co/Alvenir/coral-1-whisper-large/blob/main/LICENSE).
|
|
|
|
|
|
|
|
|
|
|
169 |
|
170 |
---
|
171 |
|
172 |
## Evaluation
|
173 |
|
174 |
The model was evaluated using the following metrics:
|
|
|
175 |
- **Character Error Rate (CER)**: The percentage of characters incorrectly transcribed.
|
176 |
+
- **Word Error Rate (WER)**: The percentage of words incorrectly transcribed.
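
As an illustration of the two metrics above, here is a minimal sketch that computes both as an edit distance (substitutions, insertions and deletions) divided by the length of the reference. The helper names `edit_distance`, `wer` and `cer` are hypothetical, and the sketch assumes plain whitespace tokenisation with no further text normalisation; the evaluation code used for this model card may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Target and model transcription from Example 1 (Vestjysk dialect).
reference = "det blev til yderligere ti mål i den første sæson på trods af en position som back"
hypothesis = "det blev til yderlig ti mål i den første sæson på trods af en position som back"

print(f"WER: {wer(reference, hypothesis):.1%}")
print(f"CER: {cer(reference, hypothesis):.1%}")
```

Under these assumptions the sketch gives roughly the 5.9% WER and 3.7% CER listed for Example 1. Because the error count is divided by the reference length, a transcription with many inserted words or characters can score above 100%.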
|
177 |
|
178 |
+
### Conversational CoRal Performance
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
179 |
|
180 |
+
The model was first evaluated on a tentative version of the CoRal-v2 conversation dataset.
|
181 |
|
182 |
+
The results are tentative as the test set only includes 5 unique speakers, of which 4 are women.
|
183 |
+
The test set includes 2 speakers with the 'Fynsk' dialect, 1 with 'Sønderjysk', 1 'Non-native' speaker and 1 with 'Nordjysk'.
|
184 |
+
The Whisper model performs very poorly on this test set. One possible explanation is hallucination during silence and short sentences, a known Whisper issue.
|
185 |
+
Furthermore, neither of the v1 models was trained on any conversational data, which puts them at an obvious disadvantage.
|
186 |
|
187 |
| Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
|
188 |
| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
|
189 |
+
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | **23.9%** | **36.7%** |
|
190 |
| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 24.2% | 37.7% |
|
191 |
+
| [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | 138% | 121% |
|
192 |
+
| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | 123% | 80.5% |
|
193 |
|
194 |
|
195 |
+
### Read-aloud CoRal Performance
|
196 |
+
|
197 |
+
| Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) WER |
|
198 |
+
| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
|
199 |
+
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | 6.5% ± 0.2% | 16.4% ± 0.4% |
|
200 |
+
| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
|
201 |
+
| [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
|
202 |
+
| [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315M-v1) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
|
203 |
+
| [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
|
204 |
+
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |
|
205 |
+
|
206 |
+
**Note:** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than reported in the model card.
|
207 |
|
208 |
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/cer.png">
|
209 |
|
210 |
+
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/wer.png">
|
211 |
+
|
212 |
+
|
213 |
+
<details>
|
214 |
+
<summary>
|
215 |
+
<b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
|
216 |
+
</summary>
|
217 |
+
|
218 |
+
| Category | Røst-whisper-large-v1 | Røst-wav2vec2-315m-v1 | Røst-wav2vec2-315m-v2 | Røst-wav2vec2-1B-v2 |
|
219 |
+
|:---:|:---:|:---:|:---:|:---:|
|
220 |
+
| female | 5.1 | 7.4 | 7.2 | 7.3 |
|
221 |
+
| male | 3.6 | 5.8 | 5.7 | 5.8 |
|
222 |
+
| 0-25 | 3.4 | 5.4 | 5.3 | 5.1 |
|
223 |
+
| 25-50 | 4.0 | 6.2 | 6.0 | 5.7 |
|
224 |
+
| 50+ | 5.0 | 7.5 | 7.4 | 7.8 |
|
225 |
+
| Bornholmsk | 3.8 | 6.8 | 6.1 | 6.2 |
|
226 |
+
| Fynsk | 5.1 | 7.4 | 7.2 | 6.9 |
|
227 |
+
| Københavnsk | 1.9 | 3.3 | 3.2 | 3.0 |
|
228 |
+
| Non-native | 4.8 | 7.8 | 7.5 | 7.3 |
|
229 |
+
| Nordjysk | 1.6 | 2.6 | 2.8 | 2.6 |
|
230 |
+
| Sjællandsk | 3.0 | 4.4 | 4.5 | 3.9 |
|
231 |
+
| Sydømål | 4.1 | 6.4 | 6.4 | 6.5 |
|
232 |
+
| Sønderjysk | 8.8 | 11.9 | 11.6 | 12.6 |
|
233 |
+
| Vestjysk | 6.4 | 10.1 | 9.8 | 10.5 |
|
234 |
+
| Østjysk | 2.6 | 4.0 | 4.1 | 3.8 |
|
235 |
+
| Overall | 4.3 | 6.6 | 6.5 | 6.5 |
|
236 |
+
|
237 |
+
</details>
|
238 |
+
|
239 |
+
<details>
|
240 |
+
<summary>
|
241 |
+
<b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
|
242 |
+
</summary>
|
243 |
+
|
244 |
+
| Category | Røst-whisper-large-v1 | Røst-wav2vec2-315m-v1 | Røst-wav2vec2-315m-v2 | Røst-wav2vec2-1B-v2 |
|
245 |
+
|:---:|:---:|:---:|:---:|:---:|
|
246 |
+
| female | 11.5 | 18.5 | 17.7 | 17.8 |
|
247 |
+
| male | 9.4 | 15.5 | 14.9 | 15.0 |
|
248 |
+
| 0-25 | 9.0 | 14.7 | 14.0 | 13.7 |
|
249 |
+
| 25-50 | 10.1 | 16.6 | 15.8 | 15.3 |
|
250 |
+
| 50+ | 11.3 | 18.2 | 17.7 | 18.5 |
|
251 |
+
| Bornholmsk | 9.8 | 17.7 | 15.7 | 16.4 |
|
252 |
+
| Fynsk | 12.1 | 18.3 | 17.7 | 16.7 |
|
253 |
+
| Københavnsk | 5.9 | 10.2 | 10.0 | 9.5 |
|
254 |
+
| Non-native | 12.2 | 20.9 | 19.4 | 19.4 |
|
255 |
+
| Nordjysk | 4.5 | 7.7 | 7.5 | 7.3 |
|
256 |
+
| Sjællandsk | 7.6 | 12.6 | 12.7 | 11.0 |
|
257 |
+
| Sydømål | 10.0 | 14.9 | 15.3 | 14.4 |
|
258 |
+
| Sønderjysk | 17.5 | 26.0 | 25.4 | 27.8 |
|
259 |
+
| Vestjysk | 15.0 | 26.3 | 25.2 | 26.7 |
|
260 |
+
| Østjysk | 7.5 | 11.7 | 11.3 | 10.8 |
|
261 |
+
| Overall | 10.4 | 17.0 | 16.3 | 16.4 |
|
262 |
+
|
263 |
+
</details>
|
264 |
+
|
265 |
+
<details>
|
266 |
+
<summary>
|
267 |
+
<b>Experiments with Røst-wav2vec2-315M with and without language model</b>
|
268 |
+
</summary>
|
269 |
+
|
270 |
+
The inclusion of a post-processing language model can affect the performance significantly.
|
271 |
+
The Røst-v1 and Røst-v2 models use the same language model (LM).
|
272 |
+
The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
|
273 |
+
|
274 |
+
| Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.com/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
|
275 |
+
| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------: |
|
276 |
+
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.4% ± 0.4%** |
|
277 |
+
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | No | 8.1% ± 0.2% | 23.9% ± 0.4% |
|
278 |
+
| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
|
279 |
+
| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
|
280 |
+
| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
|
281 |
+
| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
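
To put the effect of the language model in the table above into perspective, the following sketch computes the relative WER reduction from the "No" row to the "Yes" row for each model. The helper `relative_reduction` is a hypothetical name introduced for illustration, and the calculation uses only the point estimates from the table, ignoring the confidence intervals.

```python
def relative_reduction(without_lm, with_lm):
    """Fraction of the error rate removed by adding the language model."""
    return (without_lm - with_lm) / without_lm


# WER point estimates (%) from the table: (without LM, with LM).
wer_scores = {
    "roest-wav2vec2-1B-v2": (23.9, 16.4),
    "roest-wav2vec2-315M-v2": (25.1, 16.3),
    "roest-wav2vec2-315m-v1": (26.3, 17.0),
}

for name, (without_lm, with_lm) in wer_scores.items():
    reduction = relative_reduction(without_lm, with_lm)
    print(f"{name}: {reduction:.1%} relative WER reduction")
```

For all three models the language model removes roughly a third of the word errors, which is why comparisons between models should state whether an LM was used.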
|
282 |
+
|
283 |
+
The table below breaks these results down by Danish dialect in the test set, for Røst-v1 and Røst-v2 with and without the LM:
|
284 |
+
|
285 |
+
| | Røst-v1 | | Røst-v1 | | Røst-v2 | | Røst-v2 | |
|
286 |
+
|-------------|---------|---------|---------|---------|---------|---------|---------|---------|
|
287 |
+
| **LM** | **No** | | **Yes** | | **No** | | **Yes** | |
|
288 |
+
|-------------|---------|---------|---------|---------|---------|---------|---------|---------|
|
289 |
+
| Dialect | CER (%) | WER (%) | CER (%) | WER (%) | CER (%) | WER (%) | CER (%) | WER (%) |
|
290 |
+
| Vestjysk | 12.7 | 37.1 | 10.1 | 26.3 | 12.2 | 36.3 | 9.82 | 25.2 |
|
291 |
+
| Sønderjysk | 14.7 | 37.8 | 11.9 | 26.0 | 14.2 | 36.2 | 11.6 | 25.4 |
|
292 |
+
| Bornholmsk | 9.32 | 29.9 | 6.79 | 17.7 | 8.08 | 26.7 | 6.12 | 15.7 |
|
293 |
+
| Østjysk | 5.51 | 18.7 | 3.97 | 11.7 | 5.39 | 18.0 | 4.06 | 11.3 |
|
294 |
+
| Nordjysk | 3.86 | 13.6 | 2.57 | 7.72 | 3.80 | 13.5 | 2.75 | 7.51 |
|
295 |
+
| Københavnsk | 5.27 | 18.8 | 3.31 | 10.2 | 5.02 | 17.7 | 3.20 | 9.98 |
|
296 |
+
| Fynsk | 9.41 | 28.6 | 7.43 | 18.3 | 8.86 | 27.0 | 7.20 | 17.7 |
|
297 |
+
| Non-native | 10.6 | 33.2 | 7.84 | 20.9 | 10.0 | 31.6 | 7.46 | 19.4 |
|
298 |
+
| Sjællandsk | 5.82 | 19.5 | 4.44 | 12.6 | 5.70 | 18.6 | 4.48 | 12.7 |
|
299 |
+
| Sydømål | 7.09 | 20.7 | 6.38 | 14.9 | 6.96 | 20.4 | 6.44 | 15.3 |
|
300 |
+
|
301 |
+
</details>
|
302 |
|
303 |
### Performance on Other Datasets
|
304 |
|
305 |
The model was also tested against other datasets to evaluate generalizability:
|
306 |
|
307 |
+
| | **Røst-whisper-large-v1** | | **Røst-wav2vec2-315M-v1** | | **Røst-wav2vec2-315M-v2** | | **Røst-wav2vec2-1B-v2** | |
|
308 |
+
| ------------------------------------------------------------------------------------- | -------------------------- | --------- | -------------------------- | --------- | -------------------------- | ----------- | ------------------------ | --------- |
|
309 |
+
| **Evaluation Dataset** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** |
|
310 |
+
| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | **10.4** | **4.3** | 17.0 | 6.6 | **16.3** | **6.5** | 16.4 | **6.5** |
|
311 |
+
| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 26.1 | 11.9 | **12.4** | **4.9** |
|
312 |
+
| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | **14.4** | **5.4** | 26.3 | 10.9 |
|
313 |
+
| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | **12.6** | **5.1** | 16.6 | 6.3 | 15.6 | 6.1 | **13.7** | **5.5** |
|
314 |
|
315 |
**Note:** The vocabulary used for training includes numerals (0, 1, 2, ..., 9), which are converted to text in a post-processing step. If the model misses spaces, adjacent numerals are interpreted as a single number, which especially affects the NST score, as this dataset contains many numerals.
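
The toy sketch below illustrates why a missed space matters for numerals. It is not the project's actual post-processing: `number_to_danish` and `spell_out_numerals` are hypothetical helpers that only cover 0 to 99, written to show how two separate digits and one merged number end up as different words.

```python
import re

# Danish number words; this toy converter only covers 0-99.
UNITS = [
    "nul", "en", "to", "tre", "fire", "fem", "seks", "syv", "otte", "ni",
    "ti", "elleve", "tolv", "tretten", "fjorten", "femten", "seksten",
    "sytten", "atten", "nitten",
]
TENS = {
    2: "tyve", 3: "tredive", 4: "fyrre", 5: "halvtreds",
    6: "tres", 7: "halvfjerds", 8: "firs", 9: "halvfems",
}


def number_to_danish(n):
    """Spell out an integer between 0 and 99 in Danish."""
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return TENS[tens]
    # Danish puts the unit before the ten, e.g. 23 is "three-and-twenty".
    return f"{UNITS[unit]}og{TENS[tens]}"


def spell_out_numerals(text):
    """Replace each run of digits with its Danish spelling (0-99 only)."""
    def replace(match):
        value = int(match.group())
        # Numbers outside the toy range are left untouched.
        return number_to_danish(value) if value < 100 else match.group()

    return re.sub(r"\d+", replace, text)


print(spell_out_numerals("nummer 2 3"))  # digits kept apart by a space
print(spell_out_numerals("nummer 23"))   # the space was missed
```

With the space in place, the digits are spelled out one by one ("to tre"); without it, they are read as the single number "treogtyve". Every word of the original digit sequence is then counted as an error.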
|
316 |
|
317 |
---
|
|
|
|
|
318 |
|
319 |
+
### Note on comparing Whisper and Wav2Vec2 models
|
320 |
+
On read-aloud data, the Whisper models detailed in this model card exhibit significantly lower Character Error Rates (CER) and Word Error Rates (WER) than the Wav2Vec2 models.
|
321 |
+
Whisper utilizes a transformer-based architecture with additional layers that enhance contextual understanding.
|
322 |
+
In contrast, Wav2Vec2 models employ shorter context windows that focus on sound prediction.
|
323 |
+
The Røst-Wav2Vec2 models incorporate a straightforward language model during post-processing, which addresses errors based on statistical language patterns.
|
324 |
+
Introducing a more complex, contextual post-processing language model might enable a better comparison between these model types, which the CoRal project plans to explore in future releases.
|
325 |
+
|
326 |
+
The Røst-Whisper model excels on read-aloud data, leveraging its embedded contextual framework to achieve more robust recognition within this context.
|
327 |
+
However, Wav2Vec2 models appear to generalize more effectively across various speech recognition tasks, whereas Whisper models incur higher error rates in conversational data.
|
328 |
+
Note that the CoRal-v2 conversation dataset is tentative and has limited speaker diversity, which might influence these results.
|
329 |
|
330 |
---
|
331 |
|
|
|
347 |
|
348 |
We would specifically like to thank Dan Saattrup Nielsen (Alexandra Institute) for, among other things, the repository work, and Simon Leminen Madsen (Alexandra Institute) for the modelling work.
|
349 |
|
350 |
+
|
351 |
## Citation
|
352 |
|
353 |
```bibtex
|
354 |
@misc{roest-wav2vec2-315m-v2,
|
355 |
author = {Marie Juhl Jørgensen, Søren Vejlgaard Holm, Martin Carsten Nielsen, Dan Saattrup Nielsen, Sif Bernstorff Lehmann, Simon Leminen Madsen and Torben Blach},
|
356 |
+
title = {Røst-wav2vec2-315m-v2: A Danish state-of-the-art speech recognition model trained on varied demographics and dialects},
|
357 |
year = {2025},
|
358 |
url = {https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2},
|
359 |
}
|