CoRal-project/roest-wav2vec2-1B-v2

Automatic Speech Recognition

Safetensors

Danish

wav2vec2

Eval Results

MarieAlvenir committed on Apr 29

Commit 8f879d0 (1 parent: 8798351)

Correction of typos

Files changed (1)

README.md +13 -13

README.md CHANGED

@@ -60,9 +60,9 @@ Next you can use the model using the `transformers` Python package as follows:
 Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [wav2vec2-xls-r-1b](facebook/wav2vec2-xls-r-1b) has been fine-tuned for automatic speech recognition with the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) dataset to enhance its performance in recognizing Danish speech with consideration to different dialects. The model was trained for 30K steps using the training setup in the [CoRaL repository](https://github.com/alexandrainst/coral/tree) by running:
 ```
-python src/scripts/finetune_asr_model.py model=wav2vec2-small max_steps=30000 datasets.coral_conversation_internal.id=CoRal-project/coral-v2 datasets.coral_readaloud_internal.id=CoRal-project/coral-v2
 ```
-The model is evaluated using a Language Model (LM) as post-processing. The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2).
 ---
@@ -73,7 +73,7 @@ The model is evaluated using a Language Model (LM) as post-processing. The utili
 	- Conversation
 	- Read-aloud
 - **Language**: Danish.
-- **Variation**: Includes various dialects, age groups, and gender distinctions.
 ### License
 Note that the dataset used is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (speech synthesis and biometric identification). See [license](https://huggingface.co/Alvenir/coral-1-whisper-large/blob/main/LICENSE).
@@ -98,7 +98,7 @@ The model was evaluated using the following metrics:
 **OBS!** Benchmark for hviske-v2 has been reevaluted and the confidence interval is larger than  reported in the model card.
-The model was also evaluated on a tentative pre-release of the coral-v2 conversation dataset. The results are tentative as the test set only includes 5 unique speakers, of which 4 are women. The test set includes 2 speakers with 'Fynsk' dialect, 1 with 'Sønderjysk', 1 with 'Non-native' and 1 'Nordjysk'. The whisper model is performing very poorly on the test set. An explanation could be hallucinations during silence and short sentences, a known whisper issue. Furthermore, both version 1 models have not been trained on any conversation data giving the models and obvious disadvantage.
 | Model                                                                                               | Number of parameters |   Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
 | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
@@ -168,15 +168,15 @@ The inclusion of a post-processing language model can affect the performance sig
 ### Performance on Other Datasets
 The model was also tested against other datasets to evaluate generalizability:
-|                                                                                       | **Roest-whisper-large-v1**|         | **Roest-wav2vec2-315M-v1** |       | **Roest-wav2vec2-315M-v2** |         | **Roest-wav2vec2-1B-v2** |
-| ------------------------------------------------------------------------------------- | ---------------------- | ------- | -------------------------- | ----- | -------------------------- | ------- | ------------------------ |
-| Evaluation Dataset                                                                    | WER %                  | CER %   | WER %                      | CER % | WER %                      | CER %   | WER %                    |
-| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test)   | **10.4**               | **4.3** | 17.0                       | 6.6   | **16.3**                   | **6.5** | 16.4                     |
-| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da)                        | 29.8                   | 14.5    | 29.7                       | 13.9  | 26.1                       | 11.9    | **12.4**                 |
-| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6                   | 8.2     | 16.7                       | 6.6   | **14.4**                   | **5.4** | 26.3                     |
-| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs)                         | **12.6**               | **5.1** | 16.6                       | 6.3   | 15.6                       | 6.1     | **13.7**                 |
-**OBS!** The vocab used for training incudes numerals (0,1,2,..,9), which are translated to text in a post-processing step. If the model misses spaces the numbers are interpreted as one, which expecially affects the NST score as this dataset contains many numerals.
 ---

 Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [wav2vec2-xls-r-1b](facebook/wav2vec2-xls-r-1b) has been fine-tuned for automatic speech recognition with the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) dataset to enhance its performance in recognizing Danish speech with consideration to different dialects. The model was trained for 30K steps using the training setup in the [CoRaL repository](https://github.com/alexandrainst/coral/tree) by running:
 ```
+python src/scripts/finetune_asr_model.py model=wav2vec2-medium max_steps=30000 datasets.coral_conversation_internal.id=CoRal-project/coral-v2 datasets.coral_readaloud_internal.id=CoRal-project/coral-v2
 ```
+The model is evaluated using a Language Model (LM) as post-processing. The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
 ---
 	- Conversation
 	- Read-aloud
 - **Language**: Danish.
+- **Variation**: Includes various dialects, ages, and gender distinctions.
 ### License
 Note that the dataset used is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (speech synthesis and biometric identification). See [license](https://huggingface.co/Alvenir/coral-1-whisper-large/blob/main/LICENSE).
 **OBS!** Benchmark for hviske-v2 has been reevaluted and the confidence interval is larger than  reported in the model card.
+The model was also evaluated on a tentative pre-release of the coral-v2 conversation dataset. The results are tentative as the test set only includes 5 unique speakers, of which 4 are women. The test set includes 2 speakers with 'Fynsk' dialect, 1 with 'Sønderjysk', 1 with 'Non-native' and 1 'Nordjysk'. The whisper model is performing very poorly on the test set. An explanation could be hallucinations during silence and short sentences, a known whisper issue. Furthermore, both version 1 models have not been trained on any conversation data giving the models an obvious disadvantage.
 | Model                                                                                               | Number of parameters |   Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
 | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
 ### Performance on Other Datasets
 The model was also tested against other datasets to evaluate generalizability:
+|                                                                                       | **Roest-whisper-large-v1** |         | **Roest-wav2vec2-315M-v1** |       | **Roest-wav2vec2-315M-v2** |         | **Roest-wav2vec2-1B-v2** |         |
+| ------------------------------------------------------------------------------------- | -------------------------- | ------- | -------------------------- | ----- | -------------------------- | ------- | ------------------------ | ------- |
+| Evaluation Dataset                                                                    | WER %                      | CER %   | WER %                      | CER % | WER %                      | CER %   | WER %                    | CER %   |
+| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test)   | **10.4**                   | **4.3** | 17.0                       | 6.6   | **16.3**                   | **6.5** | 16.4                     | **6.5** |
+| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da)                        | 29.8                       | 14.5    | 29.7                       | 13.9  | 26.1                       | 11.9    | **12.4**                 | **4.9** |
+| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6                       | 8.2     | 16.7                       | 6.6   | **14.4**                   | **5.4** | 26.3                     | 10.9    |
+| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs)                         | **12.6**                   | **5.1** | 16.6                       | 6.3   | 15.6                       | 6.1     | **13.7**                 | **5.5** |
+**OBS!** The vocab used for training incudes numerals (0,1,2,..,9), which are translated to text in a post-processing step. If the model misses spaces the numbers are interpreted as one, which especially affects the NST score as this dataset contains many numerals.
 ---
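
The **OBS!** note on numerals can be illustrated with a small sketch. This is not the project's post-processing code: the lookup table `DANISH_NUMBERS` and the helpers `spell_out_numbers` and `word_error_rate` are hypothetical names introduced here, and the table covers only the three values the example needs. It shows why a missing space between two digits costs more than one word error once the digits are spelled out.

```python
import re

# Toy lookup (assumption): only the entries needed for this example,
# not the normaliser used by the CoRal project.
DANISH_NUMBERS = {"1": "en", "2": "to", "12": "tolv"}


def spell_out_numbers(text: str) -> str:
    """Replace each run of digits with its word form from the toy lookup."""
    return re.sub(r"\d+", lambda m: DANISH_NUMBERS.get(m.group(), m.group()), text)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "nummer en to"
with_space = spell_out_numbers("nummer 1 2")    # digits kept apart
without_space = spell_out_numbers("nummer 12")  # space missed, digits merge

print(with_space, word_error_rate(reference, with_space))
print(without_space, word_error_rate(reference, without_space))
```

In this toy case the merged digits become a single different word, so one missed space produces a substitution plus a deletion. That is consistent with the note that a dataset with many numerals, such as NST, is hit hardest.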