CoRal-project/roest-wav2vec2-315m-v2

Automatic Speech Recognition

Safetensors

Danish

wav2vec2

Eval Results

MarieAlvenir committed 12 days ago

Commit 30f680f (1 parent: 466b591)

Indications of (This model)

README.md +7 -7

@@ -187,7 +187,7 @@ Note that the high generelization error on conversation data for models trained
 | Model                                                                                               | Number of parameters |   Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
 | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
 | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2)     |                   1B | Read-aloud and conversation |                                                                                                     **23.9%** |                                                                                                     **36.7%** |
-| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                                                                                                         24.2% |                                                                                                         37.7% |
 | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) |                1540M |                  Read-aloud |                                                                                                          138% |                                                                                                          121% |
 | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) |                 315M |                  Read-aloud |                                                                                                          123% |                                                                                                         80.5% |
 | [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2)                                     |                1540M |                  Read-aloud |                                                                                                         78.2% |                                                                                                         72.6% |
@@ -203,7 +203,7 @@ Note that the high generelization error on conversation data for models trained
 | Model                                                                                               | Number of parameters |   Finetuned on data of type | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) WER |
 | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
 | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2)     |                   1B | Read-aloud and conversation |                                                                             6.5% ± 0.2% |                                                                            16.4% ± 0.4% |
-| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                                                                             6.5% ± 0.2% |                                                                            16.3% ± 0.4% |
 | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) |                1540M |                  Read-aloud |                                                                         **4.3% ± 0.2%** |                                                                        **10.4% ± 0.3%** |
 | [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315M-v1) |                 315M |                  Read-aloud |                                                                             6.6% ± 0.2% |                                                                            17.0% ± 0.4% |
 | [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2)                                     |                1540M |                  Read-aloud |                                                                             4.7% ± 0.2% |                                                                            11.8% ± 0.3% |
@@ -282,12 +282,12 @@ Note that the high generelization error on conversation data for models trained
   | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------: |
   | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2)     |                   1B | Read-aloud and conversation |                               Yes |                                                                         **6.5% ± 0.2%** |                                                                         **16.4% ± 0.4%** |
   | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2)     |                   1B | Read-aloud and conversation |                                No |                                                                             8.1% ± 0.2% |                                                                             23.9% ± 0.4% |
-  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                               Yes |                                                                         **6.5% ± 0.2%** |                                                                         **16.3% ± 0.4%** |
-  | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                                No |                                                                             8.2% ± 0.2% |                                                                             25.1% ± 0.4% |
   | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) |                 315M |                  Read-aloud |                               Yes |                                                                             6.6% ± 0.2% |                                                                             17.0% ± 0.4% |
   | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) |                 315M |                  Read-aloud |                                No |                                                                             8.6% ± 0.2% |                                                                             26.3% ± 0.5% |
-  Here are the results of the model on different Danish dialects in the test set:
   |             | Røst-v1 |         | Røst-v1 |         | Røst-v2 |         | Røst-v2 |         |
   |-------------|---------|---------|---------|---------|---------|---------|---------|---------|
@@ -314,10 +314,10 @@ The model was also tested against other datasets to evaluate generalizability:
 |                                                                                       | **Røst-whisper-large-v1**  |           | **Røst-wav2vec2-315M-v1**  |           | **Røst-wav2vec2-315M-v2**  |             | **Røst-wav2vec2-1B-v2**  |           |
 | ------------------------------------------------------------------------------------- | -------------------------- | --------- | -------------------------- | --------- | -------------------------- | ----------- | ------------------------ | --------- |
 | **Evaluation Dataset**                                                                | **WER %**                  | **CER %** | **WER %**                  | **CER %** | **WER %**                  | **CER %**   | **WER %**                | **CER %** |
-| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test)   | **10.4**                   | **4.3**   | 17.0                       | 6.6       | **16.3**                   | **6.5**     | 16.4                     | **6.5**   |
 | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da)                        | 29.8                       | 14.5      | 29.7                       | 13.9      | 26.1                       | 11.9        | **12.4**                 | **4.9**   |
 | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6                       | 8.2       | 16.7                       | 6.6       | **14.4**                   | **5.4**     | 26.3                     | 10.9      |
-| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs)                         | **12.6**                   | **5.1**   | 16.6                       | 6.3       | 15.6                       | 6.1         | **13.7**                 | **5.5**   |
 **OBS!** The vocab used for training incudes numerals (0,1,2,..,9), which are translated to text in a post-processing step. If the model misses spaces the numbers are interpreted as one, which especially affects the NST score as this dataset contains many numerals.

 | Model                                                                                               | Number of parameters |   Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
 | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
 | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2)     |                   1B | Read-aloud and conversation |                                                                                                     **23.9%** |                                                                                                     **36.7%** |
+| CoRal-project/roest-wav2vec2-315M-v2 (This model)|                 315M | Read-aloud and conversation |                                                                                                         24.2% |                                                                                                         37.7% |
 | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) |                1540M |                  Read-aloud |                                                                                                          138% |                                                                                                          121% |
 | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) |                 315M |                  Read-aloud |                                                                                                          123% |                                                                                                         80.5% |
 | [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2)                                     |                1540M |                  Read-aloud |                                                                                                         78.2% |                                                                                                         72.6% |
 | Model                                                                                               | Number of parameters |   Finetuned on data of type | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) WER |
 | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
 | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2)     |                   1B | Read-aloud and conversation |                                                                             6.5% ± 0.2% |                                                                            16.4% ± 0.4% |
+| CoRal-project/roest-wav2vec2-315M-v2 (This model) |                 315M | Read-aloud and conversation |                                                                             6.5% ± 0.2% |                                                                            16.3% ± 0.4% |
 | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) |                1540M |                  Read-aloud |                                                                         **4.3% ± 0.2%** |                                                                        **10.4% ± 0.3%** |
 | [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315M-v1) |                 315M |                  Read-aloud |                                                                             6.6% ± 0.2% |                                                                            17.0% ± 0.4% |
 | [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2)                                     |                1540M |                  Read-aloud |                                                                             4.7% ± 0.2% |                                                                            11.8% ± 0.3% |
   | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------: |
   | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2)     |                   1B | Read-aloud and conversation |                               Yes |                                                                         **6.5% ± 0.2%** |                                                                         **16.4% ± 0.4%** |
   | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2)     |                   1B | Read-aloud and conversation |                                No |                                                                             8.1% ± 0.2% |                                                                             23.9% ± 0.4% |
+  | CoRal-project/roest-wav2vec2-315M-v2 (This model) |                 315M | Read-aloud and conversation |                               Yes |                                                                         **6.5% ± 0.2%** |                                                                         **16.3% ± 0.4%** |
+  | CoRal-project/roest-wav2vec2-315M-v2 (This model) |                 315M | Read-aloud and conversation |                                No |                                                                             8.2% ± 0.2% |                                                                             25.1% ± 0.4% |
   | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) |                 315M |                  Read-aloud |                               Yes |                                                                             6.6% ± 0.2% |                                                                             17.0% ± 0.4% |
   | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) |                 315M |                  Read-aloud |                                No |                                                                             8.6% ± 0.2% |                                                                             26.3% ± 0.5% |
+  Here are the results of the Røst-Wav2Vec2-315m models on different Danish dialects in the test set:
   |             | Røst-v1 |         | Røst-v1 |         | Røst-v2 |         | Røst-v2 |         |
   |-------------|---------|---------|---------|---------|---------|---------|---------|---------|
 |                                                                                       | **Røst-whisper-large-v1**  |           | **Røst-wav2vec2-315M-v1**  |           | **Røst-wav2vec2-315M-v2**  |             | **Røst-wav2vec2-1B-v2**  |           |
 | ------------------------------------------------------------------------------------- | -------------------------- | --------- | -------------------------- | --------- | -------------------------- | ----------- | ------------------------ | --------- |
 | **Evaluation Dataset**                                                                | **WER %**                  | **CER %** | **WER %**                  | **CER %** | **WER %**                  | **CER %**   | **WER %**                | **CER %** |
+| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test)   | **10.4**                   | **4.3**   | 17.0                       | 6.6       | 16.3                   | 6.5     | 16.4                     | 6.5   |
 | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da)                        | 29.8                       | 14.5      | 29.7                       | 13.9      | 26.1                       | 11.9        | **12.4**                 | **4.9**   |
 | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6                       | 8.2       | 16.7                       | 6.6       | **14.4**                   | **5.4**     | 26.3                     | 10.9      |
+| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs)                         | **12.6**                   | **5.1**   | 16.6                       | 6.3       | 15.6                       | 6.1         | 13.7                | 5.5   |
 **OBS!** The vocab used for training incudes numerals (0,1,2,..,9), which are translated to text in a post-processing step. If the model misses spaces the numbers are interpreted as one, which especially affects the NST score as this dataset contains many numerals.
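
The numeral note above can be illustrated with a small sketch. It is not the project's post-processing code: `DIGIT_WORDS`, `spell_out_digits`, and `word_error_rate` are hypothetical names, the digit-to-word step only handles single-digit tokens, and the example sentence is invented. It shows how one missed space between digits turns several reference words into a single wrong token, which raises WER.

```python
# Hypothetical illustration only; the real Røst post-processing is not shown in this diff.
DIGIT_WORDS = {
    "0": "nul", "1": "en", "2": "to", "3": "tre", "4": "fire",
    "5": "fem", "6": "seks", "7": "syv", "8": "otte", "9": "ni",
}


def spell_out_digits(text: str) -> str:
    """Replace single-digit tokens with Danish words; leave all other tokens unchanged."""
    return " ".join(DIGIT_WORDS.get(token, token) for token in text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


reference = "ring til en to tre"
with_spaces = spell_out_digits("ring til 1 2 3")   # digits separated correctly
missed_spaces = spell_out_digits("ring til 123")   # spaces missed: one merged token

print(word_error_rate(reference, with_spaces))     # 0.0
print(word_error_rate(reference, missed_spaces))   # 0.6
```

In this toy case the merged token `123` costs one substitution and two deletions against five reference words, so a single spacing mistake accounts for a 60% WER on the sentence. A dataset with many numerals, such as NST, would be affected more often by this kind of error.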