Commit 2feb5ff · 1 Parent(s): f2d877b
Path updates to CoRal-project and citation info added
README.md
CHANGED
@@ -1,6 +1,6 @@
|
|
1 |
---
|
2 |
datasets:
|
3 |
-
- CoRal-
|
4 |
language:
|
5 |
- da
|
6 |
base_model:
|
@@ -33,7 +33,7 @@ model-index:
|
|
33 |
# Pre-release of Roest-wav2vec2-315m-v2
|
34 |
This is a pre-release of a Danish state-of-the-art speech recognition model, trained as part of the CoRal project by [Alvenir](https://www.alvenir.ai/).
|
35 |
|
36 |
-
This repository contains a Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-
|
37 |
|
38 |
## Quick Start
|
39 |
|
@@ -48,7 +48,7 @@ Next you can use the model using the `transformers` Python package as follows:
|
|
48 |
```python
|
49 |
>>> from transformers import pipeline
|
50 |
>>> audio = get_audio() # 16kHz raw audio array
|
51 |
-
>>> transcriber = pipeline(model="CoRal-
|
52 |
>>> transcriber(audio)
|
53 |
{'text': 'your transcription'}
|
54 |
```
|
@@ -64,7 +64,7 @@ Explore the following audio samples along with their transcriptions and accuracy
|
|
64 |
|
65 |
**Audio Sample:**
|
66 |
<audio controls>
|
67 |
-
<source src="https://huggingface.co/CoRal-
|
68 |
Your browser does not support the audio tag.
|
69 |
</audio>
|
70 |
|
@@ -83,7 +83,7 @@ Explore the following audio samples along with their transcriptions and accuracy
|
|
83 |
|
84 |
**Audio Sample:**
|
85 |
<audio controls>
|
86 |
-
<source src="https://huggingface.co/CoRal-
|
87 |
Your browser does not support the audio tag.
|
88 |
</audio>
|
89 |
|
@@ -102,7 +102,7 @@ Explore the following audio samples along with their transcriptions and accuracy
|
|
102 |
|
103 |
**Audio Sample:**
|
104 |
<audio controls>
|
105 |
-
<source src="https://huggingface.co/CoRal-
|
106 |
Your browser does not support the audio tag.
|
107 |
</audio>
|
108 |
|
@@ -121,7 +121,7 @@ Explore the following audio samples along with their transcriptions and accuracy
|
|
121 |
|
122 |
**Audio Sample:**
|
123 |
<audio controls>
|
124 |
-
<source src="https://huggingface.co/CoRal-
|
125 |
Your browser does not support the audio tag.
|
126 |
</audio>
|
127 |
|
@@ -138,9 +138,9 @@ Explore the following audio samples along with their transcriptions and accuracy
|
|
138 |
|
139 |
## Model Details
|
140 |
|
141 |
-
Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [Wav2Vec2-XLS-R-300M](facebook/wav2vec2-xls-r-300m) has been fine-tuned for automatic speech recognition with the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-
|
142 |
```
|
143 |
-
python src/scripts/finetune_asr_model.py model=wav2vec2-small max_steps=30000 datasets.coral_conversation_internal.id=CoRal-
|
144 |
```
|
145 |
The model is evaluated using a Language Model (LM) as post-processing. The utilized LM is the one trained and used by [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/alexandrainst/roest-315m).
|
146 |
|
@@ -148,7 +148,7 @@ The model is evaluated using a Language Model (LM) as post-processing. The utili
|
|
148 |
|
149 |
## Dataset
|
150 |
|
151 |
-
### [CoRal-v2](https://huggingface.co/datasets/CoRal-
|
152 |
- **Subsets**:
|
153 |
- Conversation
|
154 |
- Read-aloud
|
@@ -170,7 +170,7 @@ The model was evaluated using the following metrics:
|
|
170 |
|
171 |
| Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
|
172 |
| :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
|
173 |
-
| [CoRal-
|
174 |
| [Alvenir/roest-whisper-large-v1](https://huggingface.co/Alvenir/coral-1-whisper-large) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
|
175 |
| [alexandrainst/roest-wav2vec2-315M-v1](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
|
176 |
| [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
|
@@ -179,9 +179,9 @@ The model was evaluated using the following metrics:
|
|
179 |
**Note:** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than reported in the model card.
|
180 |
|
181 |
### Detailed evaluation across demographics on the CoRal test data
|
182 |
-
<img src="https://huggingface.co/CoRal-
|
183 |
|
184 |
-
<img src="https://huggingface.co/CoRal-
|
185 |
|
186 |
### Table WER scores in % of evaluation across demographics on the CoRal test data
|
187 |
| Category | roest-whisper-large-v1 | roest-wav2vec2-315m-v1 | roest-wav2vec2-315m-v2 |
|
@@ -230,8 +230,8 @@ The inclusion of a post-processing language model can affect the performance sig
|
|
230 |
|
231 |
| Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
|
232 |
| :-------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
|
233 |
-
| [CoRal-
|
234 |
-
| [CoRal-
|
235 |
| [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
|
236 |
| [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
|
237 |
|
@@ -273,7 +273,7 @@ The model was also tested against other datasets to evaluate generalizability:
|
|
273 |
---
|
274 |
|
275 |
## Training curves
|
276 |
-
<img src="https://huggingface.co/CoRal-
|
277 |
|
278 |
---
|
279 |
|
@@ -288,4 +288,15 @@ The CoRal project is funded by the [Danish Innovation Fund](https://innovationsf
|
|
288 |
- [Alvenir](https://www.alvenir.ai/)
|
289 |
- [Corti](https://www.corti.ai/)
|
290 |
|
291 |
-
We would like specifically thank Dan Saattrup Nielsen, Alexandra Institute for (among other things) the repository work and Simon Leminen Madsen, Alexandra Institute for modelling work.
|
1 |
---
|
2 |
datasets:
|
3 |
+
- CoRal-project/coral-v2
|
4 |
language:
|
5 |
- da
|
6 |
base_model:
|
|
|
33 |
# Pre-release of Roest-wav2vec2-315m-v2
|
34 |
This is a pre-release of a Danish state-of-the-art speech recognition model, trained as part of the CoRal project by [Alvenir](https://www.alvenir.ai/).
|
35 |
|
36 |
+
This repository contains a Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main). The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).
|
37 |
|
38 |
## Quick Start
|
39 |
|
|
|
48 |
```python
|
49 |
>>> from transformers import pipeline
|
50 |
>>> audio = get_audio() # 16kHz raw audio array
|
51 |
+
>>> transcriber = pipeline(model="CoRal-project/roest-wav2vec2-315m-v2")
|
52 |
>>> transcriber(audio)
|
53 |
{'text': 'your transcription'}
|
54 |
```
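The `get_audio()` call in the snippet above is a placeholder. The following is a minimal sketch of one way to obtain 16 kHz samples, assuming the input is a mono 16-bit PCM WAV file on disk. `load_wav_16k` is a hypothetical helper written for illustration; it is not part of `transformers` or the CoRal code base, and it rejects other formats rather than resampling them.

```python
import sys
import wave
from array import array


def load_wav_16k(path):
    """Read a mono 16-bit PCM WAV file and return floats in [-1.0, 1.0).

    Hypothetical helper: raises ValueError for any other sample rate,
    channel count, or sample width instead of converting silently.
    """
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            raise ValueError(f"expected 16000 Hz, got {wav.getframerate()} Hz")
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        frames = wav.readframes(wav.getnframes())

    pcm = array("h")  # signed 16-bit integers
    pcm.frombytes(frames)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    # Scale int16 values to floats in [-1.0, 1.0)
    return [sample / 32768.0 for sample in pcm]
```

The returned list would typically be wrapped in a float NumPy array (for example `numpy.asarray(samples, dtype="float32")`) before being passed to `transcriber`.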
|
|
|
64 |
|
65 |
**Audio Sample:**
|
66 |
<audio controls>
|
67 |
+
<source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example1.wav" type="audio/wav">
|
68 |
Your browser does not support the audio tag.
|
69 |
</audio>
|
70 |
|
|
|
83 |
|
84 |
**Audio Sample:**
|
85 |
<audio controls>
|
86 |
+
<source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example2.wav" type="audio/wav">
|
87 |
Your browser does not support the audio tag.
|
88 |
</audio>
|
89 |
|
|
|
102 |
|
103 |
**Audio Sample:**
|
104 |
<audio controls>
|
105 |
+
<source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example3.wav" type="audio/wav">
|
106 |
Your browser does not support the audio tag.
|
107 |
</audio>
|
108 |
|
|
|
121 |
|
122 |
**Audio Sample:**
|
123 |
<audio controls>
|
124 |
+
<source src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example4.wav" type="audio/wav">
|
125 |
Your browser does not support the audio tag.
|
126 |
</audio>
|
127 |
|
|
|
138 |
|
139 |
## Model Details
|
140 |
|
141 |
+
Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [Wav2Vec2-XLS-R-300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m) has been fine-tuned for automatic speech recognition on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) to improve its recognition of Danish speech across different dialects. The model was trained for 30K steps using the training setup in the [CoRal repository](https://github.com/alexandrainst/coral/tree) by running:
|
142 |
```
|
143 |
+
python src/scripts/finetune_asr_model.py model=wav2vec2-small max_steps=30000 datasets.coral_conversation_internal.id=CoRal-project/coral-v2 datasets.coral_readaloud_internal.id=CoRal-project/coral-v2
|
144 |
```
|
145 |
The model is evaluated using a Language Model (LM) as post-processing. The utilized LM is the one trained and used by [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/alexandrainst/roest-315m).
|
146 |
|
|
|
148 |
|
149 |
## Dataset
|
150 |
|
151 |
+
### [CoRal-v2](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main)
|
152 |
- **Subsets**:
|
153 |
- Conversation
|
154 |
- Read-aloud
|
|
|
170 |
|
171 |
| Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
|
172 |
| :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
|
173 |
+
| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
|
174 |
| [Alvenir/roest-whisper-large-v1](https://huggingface.co/Alvenir/coral-1-whisper-large) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
|
175 |
| [alexandrainst/roest-wav2vec2-315M-v1](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
|
176 |
| [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
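The CER and WER columns above are edit-distance-based error rates. The following is a minimal sketch of the standard definitions (Levenshtein distance divided by the reference length), not the CoRal evaluation code itself, which may normalize text differently before scoring. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Both rates can exceed 100% when the hypothesis contains many insertions, since the denominator is the reference length only.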
|
|
|
179 |
**Note:** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than reported in the model card.
|
180 |
|
181 |
### Detailed evaluation across demographics on the CoRal test data
|
182 |
+
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/wer.png">
|
183 |
|
184 |
+
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/cer.png">
|
185 |
|
186 |
### Table WER scores in % of evaluation across demographics on the CoRal test data
|
187 |
| Category | roest-whisper-large-v1 | roest-wav2vec2-315m-v1 | roest-wav2vec2-315m-v2 |
|
|
|
230 |
|
231 |
| Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
|
232 |
| :-------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
|
233 |
+
| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
|
234 |
+
| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
|
235 |
| [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
|
236 |
| [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
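As a worked example of how much the language model changes the scores, the sketch below computes the relative error reduction from the point estimates in the table above. It ignores the reported confidence intervals, and `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(without_lm, with_lm):
    """Fraction of the error removed by adding the language model."""
    return (without_lm - with_lm) / without_lm


# (without LM, with LM) point estimates in percent, copied from the table above
results = {
    "roest-wav2vec2-315m-v2": {"wer": (25.1, 16.3), "cer": (8.2, 6.5)},
    "roest-wav2vec2-315m-v1": {"wer": (26.3, 17.0), "cer": (8.6, 6.6)},
}

for model, metrics in results.items():
    for name, (no_lm, lm) in metrics.items():
        reduction = relative_reduction(no_lm, lm)
        print(f"{model} {name.upper()}: {reduction:.1%} relative reduction")
```

For both models the LM removes roughly 35% of the word errors and roughly 21% to 23% of the character errors.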
|
237 |
|
|
|
273 |
---
|
274 |
|
275 |
## Training curves
|
276 |
+
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/training_plots.png">
|
277 |
|
278 |
---
|
279 |
|
|
|
288 |
- [Alvenir](https://www.alvenir.ai/)
|
289 |
- [Corti](https://www.corti.ai/)
|
290 |
|
291 |
+
We would like specifically to thank Dan Saattrup Nielsen, Alexandra Institute for (among other things) the repository work and Simon Leminen Madsen, Alexandra Institute for modelling work.
|
292 |
+
|
293 |
+
## Citation
|
294 |
+
|
295 |
+
We will submit a research paper soon, but until then, if you use this model in your research or development, please cite it as follows:
|
296 |
+
|
297 |
+
@misc{roest-wav2vec2-315m-v2,
|
298 |
+
author = {Marie Juhl Jørgensen and Søren Vejlgaard Holm and Martin Carsten Nielsen and Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Anders Jess Pedersen and Anna Katrine van Zee and Anders Søgaard and Torben Blach},
|
299 |
+
title = {Roest-wav2vec2-315m-v2: A Danish state-of-the-art speech recognition model trained on varied demographics and dialects},
|
300 |
+
year = {2025},
|
301 |
+
url = {https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2},
|
302 |
+
}
|