Commit
·
8f879d0
1
Parent(s):
8798351
Correction of typos
README.md
CHANGED
@@ -60,9 +60,9 @@ Next you can use the model using the `transformers` Python package as follows:
|
|
60 |
|
61 |
Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [wav2vec2-xls-r-1b](facebook/wav2vec2-xls-r-1b) has been fine-tuned for automatic speech recognition with the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) to enhance its performance in recognizing Danish speech while taking different dialects into account. The model was trained for 30K steps using the training setup in the [CoRal repository](https://github.com/alexandrainst/coral/tree) by running:
|
62 |
```
|
63 |
-
python src/scripts/finetune_asr_model.py model=wav2vec2-
|
64 |
```
|
65 |
-
The model is evaluated using a Language Model (LM) as post-processing. The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-
|
66 |
|
67 |
---
|
68 |
|
@@ -73,7 +73,7 @@ The model is evaluated using a Language Model (LM) as post-processing. The utili
|
|
73 |
- Conversation
|
74 |
- Read-aloud
|
75 |
- **Language**: Danish.
|
76 |
-
- **Variation**: Includes various dialects,
|
77 |
### License
|
78 |
Note that the dataset used is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (speech synthesis and biometric identification). See [license](https://huggingface.co/Alvenir/coral-1-whisper-large/blob/main/LICENSE).
|
79 |
|
@@ -98,7 +98,7 @@ The model was evaluated using the following metrics:
|
|
98 |
|
99 |
**OBS!** Benchmark for hviske-v2 has been reevaluated and the confidence interval is larger than reported in the model card.
|
100 |
|
101 |
-
The model was also evaluated on a tentative pre-release of the coral-v2 conversation dataset. The results are tentative as the test set only includes 5 unique speakers, of which 4 are women. The test set includes 2 speakers with 'Fynsk' dialect, 1 with 'Sønderjysk', 1 with 'Non-native' and 1 with 'Nordjysk'. The Whisper model performs very poorly on the test set. A possible explanation is hallucination during silence and short sentences, a known Whisper issue. Furthermore, neither of the version 1 models has been trained on any conversation data, giving the models
|
102 |
|
103 |
| Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
|
104 |
| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
|
@@ -168,15 +168,15 @@ The inclusion of a post-processing language model can affect the performance sig
|
|
168 |
### Performance on Other Datasets
|
169 |
|
170 |
The model was also tested against other datasets to evaluate generalizability:
|
171 |
-
| | **Roest-whisper-large-v1
|
172 |
-
| ------------------------------------------------------------------------------------- |
|
173 |
-
| Evaluation Dataset | WER %
|
174 |
-
| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | **10.4**
|
175 |
-
| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8
|
176 |
-
| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6
|
177 |
-
| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | **12.6**
|
178 |
-
|
179 |
-
**OBS!** The vocab used for training includes numerals (0,1,2,..,9), which are translated to text in a post-processing step. If the model misses spaces, the numbers are interpreted as one, which
|
180 |
|
181 |
---
|
182 |
|
|
|
60 |
|
61 |
Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [wav2vec2-xls-r-1b](facebook/wav2vec2-xls-r-1b) has been fine-tuned for automatic speech recognition with the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) to enhance its performance in recognizing Danish speech while taking different dialects into account. The model was trained for 30K steps using the training setup in the [CoRal repository](https://github.com/alexandrainst/coral/tree) by running:
|
62 |
```
|
63 |
+
python src/scripts/finetune_asr_model.py model=wav2vec2-medium max_steps=30000 datasets.coral_conversation_internal.id=CoRal-project/coral-v2 datasets.coral_readaloud_internal.id=CoRal-project/coral-v2
|
64 |
```
|
65 |
+
The model is evaluated using a Language Model (LM) as post-processing. The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
|
66 |
|
67 |
---
|
68 |
|
|
|
73 |
- Conversation
|
74 |
- Read-aloud
|
75 |
- **Language**: Danish.
|
76 |
+
- **Variation**: Includes various dialects, ages, and gender distinctions.
|
77 |
### License
|
78 |
Note that the dataset used is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (speech synthesis and biometric identification). See [license](https://huggingface.co/Alvenir/coral-1-whisper-large/blob/main/LICENSE).
|
79 |
|
|
|
98 |
|
99 |
**OBS!** Benchmark for hviske-v2 has been reevaluated and the confidence interval is larger than reported in the model card.
|
100 |
|
101 |
+
The model was also evaluated on a tentative pre-release of the coral-v2 conversation dataset. The results are tentative as the test set only includes 5 unique speakers, of which 4 are women. The test set includes 2 speakers with 'Fynsk' dialect, 1 with 'Sønderjysk', 1 with 'Non-native' and 1 with 'Nordjysk'. The Whisper model performs very poorly on the test set. A possible explanation is hallucination during silence and short sentences, a known Whisper issue. Furthermore, neither of the version 1 models has been trained on any conversation data, which puts them at an obvious disadvantage.
|
102 |
|
103 |
| Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
|
104 |
| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
|
|
|
168 |
### Performance on Other Datasets
|
169 |
|
170 |
The model was also tested against other datasets to evaluate generalizability:
|
171 |
+
| | **Roest-whisper-large-v1** | | **Roest-wav2vec2-315M-v1** | | **Roest-wav2vec2-315M-v2** | | **Roest-wav2vec2-1B-v2** | |
|
172 |
+
| ------------------------------------------------------------------------------------- | -------------------------- | ------- | -------------------------- | ----- | -------------------------- | ------- | ------------------------ | ------- |
|
173 |
+
| Evaluation Dataset | WER % | CER % | WER % | CER % | WER % | CER % | WER % | CER % |
|
174 |
+
| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | **10.4** | **4.3** | 17.0 | 6.6 | **16.3** | **6.5** | 16.4 | **6.5** |
|
175 |
+
| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 26.1 | 11.9 | **12.4** | **4.9** |
|
176 |
+
| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | **14.4** | **5.4** | 26.3 | 10.9 |
|
177 |
+
| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | **12.6** | **5.1** | 16.6 | 6.3 | 15.6 | 6.1 | **13.7** | **5.5** |
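
For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of how WER and CER are commonly defined: the Levenshtein distance between reference and hypothesis, counted over words or characters and divided by the reference length. The function names are illustrative, and the text normalisation applied in the actual evaluation is not shown here, so this sketch should not be expected to reproduce the reported numbers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction; multiply by 100 for the table's percent scale."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a fraction, counting spaces as characters."""
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative sentence pair (not taken from any of the datasets above):
# the hypothesis drops the word "i".
print(wer("jeg bor i aarhus", "jeg bor aarhus"))
print(cer("jeg bor i aarhus", "jeg bor aarhus"))
```

Because one short dropped word counts as a full word error but only a couple of character errors, WER is typically much higher than CER on the same output, which matches the pattern in the table.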
|
178 |
+
|
179 |
+
**OBS!** The vocab used for training includes numerals (0,1,2,..,9), which are translated to text in a post-processing step. If the model misses spaces, the numbers are interpreted as one, which especially affects the NST score as this dataset contains many numerals.
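
The effect of a missed space can be illustrated with a small sketch. It assumes that the post-processing step treats each maximal run of digits as a single number before spelling it out; `numeral_tokens` is a hypothetical helper written for this illustration and is not taken from the CoRal code.

```python
import re


def numeral_tokens(transcript):
    """Return every maximal run of digits in the transcript as one integer."""
    return [int(run) for run in re.findall(r"\d+", transcript)]


# With the space, the transcript contains the two numbers 2 and 5.
with_space = numeral_tokens("tallene 2 5")

# Without the space, the same digits are read as the single number 25,
# so the spelled-out text no longer matches the reference.
without_space = numeral_tokens("tallene 25")

print(with_space, without_space)
```

Once the merged number is spelled out, a single missed space changes whole words rather than one character, which is consistent with the note that numeral-heavy datasets such as NST are affected most.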
|
180 |
|
181 |
---
|
182 |
|