MarieAlvenir committed on
Commit a5f7c0f · 1 Parent(s): e764359
Files changed (1)
  1. README.md +164 -0
README.md ADDED
@@ -0,0 +1,164 @@
+ This is a Danish state-of-the-art speech recognition model, trained as part of the CoRal project by [Alvenir](https://www.alvenir.ai/).
+ ## Overview
+
+ This repository contains the Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main). The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).
+
+ ## Quick Start
+
+ Start by installing the required libraries:
+
+ ```shell
+ $ pip install transformers kenlm pyctcdecode
+ ```
+
+ Next, use the model via the `transformers` Python package as follows:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> audio = get_audio()  # placeholder: returns a raw audio array sampled at 16 kHz
+ >>> transcriber = pipeline(model="CoRal-dataset/roest-wav2vec2-315m-v2")
+ >>> transcriber(audio)
+ {'text': 'your transcription'}
+ ```
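
The `get_audio()` call in the snippet above is not defined by `transformers`. As a minimal sketch, assuming the input is a 16-bit PCM WAV file that is already mono and sampled at 16 kHz, the samples could be read with the Python standard library alone. The function name `read_wav_16khz_mono` is hypothetical and only used for illustration.

```python
import array
import sys
import wave


def read_wav_16khz_mono(path):
    """Read a 16-bit PCM WAV file and return its samples as floats in [-1.0, 1.0).

    Hypothetical helper: raises ValueError unless the file is 16 kHz, mono
    and 16-bit, because the quick-start snippet expects 16 kHz audio.
    """
    with wave.open(path, "rb") as wav_file:
        if wav_file.getframerate() != 16000:
            raise ValueError(f"expected 16000 Hz, got {wav_file.getframerate()} Hz")
        if wav_file.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav_file.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        raw_bytes = wav_file.readframes(wav_file.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw_bytes)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is stored little-endian
    # Scale integers to floats so the range matches what ASR models expect.
    return [sample / 32768.0 for sample in samples]
```

The returned list can be converted with `numpy.asarray(samples, dtype=numpy.float32)` before it is passed to `transcriber`. Audio in other formats or at other sampling rates would need to be converted and resampled first.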
+
+ ## Model Details
+
+ Wav2Vec2 is a state-of-the-art model architecture for speech recognition that leverages self-supervised learning from raw audio data. The pre-trained [Wav2Vec2-XLS-R-300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m) model has been fine-tuned for automatic speech recognition on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main) to improve its recognition of Danish speech across different dialects. The model was trained for 30K steps using the training setup in the [CoRal repository](https://github.com/alexandrainst/coral/tree) by running:
+ ```shell
+ python src/scripts/finetune_asr_model.py model=wav2vec2-small max_steps=30000 datasets.coral_conversation_internal.id=CoRal-dataset/coral-v2 datasets.coral_readaloud_internal.id=CoRal-dataset/coral-v2
+ ```
+ The model is evaluated with a language model (LM) applied as post-processing. The LM is the one trained and used by [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m).
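
For context on what the LM post-processing replaces: assuming the model is decoded as a CTC model (which the `pyctcdecode` dependency suggests), decoding without an LM usually means greedy decoding, i.e. taking the most likely token per audio frame, collapsing consecutive repeats, and dropping blank tokens. The sketch below illustrates this with a toy vocabulary; `TOY_VOCAB` and the blank id `0` are assumptions for illustration, not the model's actual vocabulary.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real model has its own token ids.
TOY_VOCAB = {0: "<pad>", 1: "h", 2: "e", 3: "j"}


def greedy_ctc_decode(frame_token_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated ids, then drop blanks and join the rest."""
    collapsed = [token_id for token_id, _ in groupby(frame_token_ids)]
    return "".join(id_to_char[token_id] for token_id in collapsed if token_id != blank_id)


# One predicted token id per audio frame (already argmax-ed).
print(greedy_ctc_decode([1, 1, 0, 2, 2, 0, 0, 3], TOY_VOCAB))
```

An LM-based decoder instead searches over several candidate token sequences and scores them with the LM, which is why it can correct words that greedy decoding gets wrong.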
+ ## Dataset
+
+ ### [CoRal-v2](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main)
+ - **Subsets**:
+   - Conversation
+   - Read-aloud
+ - **Language**: Danish.
+ - **Variation**: Includes various dialects, age groups, and genders.
+ ### License
+ Note that the dataset used is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (speech synthesis and biometric identification). See [license](https://huggingface.co/Alvenir/coral-1-whisper-large/blob/main/LICENSE).
+
+ ## Evaluation
+
+ The model was evaluated using the following metrics:
+ - **Word Error Rate (WER)**: The number of word-level substitutions, deletions, and insertions needed to turn the transcription into the reference, divided by the number of words in the reference.
+ - **Character Error Rate (CER)**: The same measure computed over characters instead of words.
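
Both metrics are edit distances normalized by the reference length. The following is a minimal sketch of that definition in plain Python. It is not the evaluation code used for the numbers in this card, which may apply additional text normalization, and the function names are chosen for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous_row[j - 1] + (ref_item != hyp_item)
            insertion = current_row[j - 1] + 1
            deletion = previous_row[j] + 1
            current_row.append(min(substitution, insertion, deletion))
        previous_row = current_row
    return previous_row[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    reference_words = reference.split()
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)


# One of four reference words is missing from the hypothesis.
print(word_error_rate("jeg bor i aarhus", "jeg bor aarhus"))  # 0.25
```

Because insertions count as errors, both rates can exceed 100% when the transcription is much longer than the reference. The sketch assumes a non-empty reference.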
+
+ **Note:** The [CoRal test dataset](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) contains no conversation data. The model is trained on both read-aloud and conversation data but is evaluated only on read-aloud data.
+
+
+ | Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
+ | :--- | ---: | ---: | ---: | ---: |
+ | [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
+ | [CoRal-dataset/roest-whisper-large-v2](https://huggingface.co/CoRal-dataset/roest-whisper-large) | 1540M | Read-aloud and conversation | 5.3% ± 0.2% | 12.0% ± 0.4% |
+ | [Alvenir/coral-1-whisper-large](https://huggingface.co/Alvenir/coral-1-whisper-large) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
+ | [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
+ | [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
+ | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |
+
+ **Note:** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than the one reported in its model card.
+
+ ### Detailed evaluation across demographics on the CoRal test data
+ <img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/images/wer.png">
+
+ <img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/images/cer.png">
+
+ ### CER scores (%) across demographics on the CoRal test data
+ | Category | roest-wav2vec2-315m-v2 | roest-315m | roest-whisper-large-v2 | coral-1-whisper-large |
+ |:---:|:---:|:---:|:---:|:---:|
+ | female | 7.2 | 7.4 | 6.9 | 5.1 |
+ | male | 5.7 | 5.8 | 3.7 | 3.6 |
+ | 0-25 | 5.3 | 5.4 | 3.3 | 3.4 |
+ | 25-50 | 6.0 | 6.2 | 6.5 | 4.0 |
+ | 50+ | 7.4 | 7.5 | 5.1 | 5.0 |
+ | Bornholmsk | 6.1 | 6.8 | 3.4 | 3.8 |
+ | Fynsk | 7.2 | 7.4 | 13.8 | 5.1 |
+ | Københavnsk | 3.2 | 3.3 | 2.1 | 1.9 |
+ | Non-native | 7.5 | 7.8 | 4.9 | 4.8 |
+ | Nordjysk | 2.8 | 2.6 | 1.7 | 1.6 |
+ | Sjællandsk | 4.5 | 4.4 | 2.9 | 3.0 |
+ | Sydømål | 6.4 | 6.4 | 4.1 | 4.1 |
+ | Sønderjysk | 11.6 | 11.9 | 8.8 | 8.8 |
+ | Vestjysk | 9.8 | 10.1 | 6.9 | 6.4 |
+ | Østjysk | 4.1 | 4.0 | 2.8 | 2.6 |
+ | Overall | 6.5 | 6.6 | 5.3 | 4.3 |
+
+ ### WER scores (%) across demographics on the CoRal test data
+ | Category | roest-wav2vec2-315m-v2 | roest-315m | roest-whisper-large-v2 | coral-1-whisper-large |
+ |:---:|:---:|:---:|:---:|:---:|
+ | female | 17.7 | 18.5 | 14.2 | 11.5 |
+ | male | 14.9 | 15.5 | 9.9 | 9.4 |
+ | 0-25 | 14.0 | 14.7 | 9.0 | 9.0 |
+ | 25-50 | 15.8 | 16.6 | 14.1 | 10.1 |
+ | 50+ | 17.7 | 18.2 | 11.5 | 11.3 |
+ | Bornholmsk | 15.7 | 17.7 | 9.3 | 9.8 |
+ | Fynsk | 17.7 | 18.3 | 24.9 | 12.1 |
+ | Københavnsk | 10.0 | 10.2 | 6.7 | 5.9 |
+ | Non-native | 19.4 | 20.9 | 13.0 | 12.2 |
+ | Nordjysk | 7.5 | 7.7 | 4.9 | 4.5 |
+ | Sjællandsk | 12.7 | 12.6 | 7.5 | 7.6 |
+ | Sydømål | 15.3 | 14.9 | 10.3 | 10.0 |
+ | Sønderjysk | 25.4 | 26.0 | 17.4 | 17.5 |
+ | Vestjysk | 25.2 | 26.3 | 16.3 | 15.0 |
+ | Østjysk | 11.3 | 11.7 | 8.0 | 7.5 |
+ | Overall | 16.3 | 17.0 | 12.0 | 10.4 |
+
+
+ ### Roest-wav2vec2-315M with and without language model
+ Adding a post-processing language model can affect performance significantly. The Roest-v1 and Roest-v2 models use the same language model (LM): the one trained and used by [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m).
+
+ | Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
+ | :--- | ---: | ---: | ---: | ---: | ---: |
+ | [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
+ | [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
+ | [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
+ | [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
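
To put the size of the LM effect in perspective, the point estimates in the table above can be turned into relative error reductions. This is a small arithmetic sketch; it ignores the reported confidence intervals, and the helper name `relative_reduction` is chosen for illustration.

```python
def relative_reduction(without_lm, with_lm):
    """Relative error reduction in percent when going from `without_lm` to `with_lm`."""
    return (without_lm - with_lm) / without_lm * 100.0


# (without LM, with LM) error rates in %, copied from the table above.
scores = {
    "roest-wav2vec2-315M-v2": {"CER": (8.2, 6.5), "WER": (25.1, 16.3)},
    "roest-315m": {"CER": (8.6, 6.6), "WER": (26.3, 17.0)},
}

for model_name, metrics in scores.items():
    for metric_name, (without_lm, with_lm) in metrics.items():
        reduction = relative_reduction(without_lm, with_lm)
        print(f"{model_name} {metric_name}: {reduction:.1f}% relative reduction")
```

For both models the LM removes roughly a third of the word errors and between a fifth and a quarter of the character errors.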
+
+ ### Detailed results for Roest-wav2vec2-315M with and without language model on different dialects
+ Here are the results for Roest-1 (alexandrainst/roest-315m) and Roest-2 (this model) on the different Danish dialects in the test set:
+
+ | Dialect | Roest-1, no LM: CER (%) | Roest-1, no LM: WER (%) | Roest-1, with LM: CER (%) | Roest-1, with LM: WER (%) | Roest-2, no LM: CER (%) | Roest-2, no LM: WER (%) | Roest-2, with LM: CER (%) | Roest-2, with LM: WER (%) |
+ |-------------|---------|---------|---------|---------|---------|---------|---------|---------|
+ | Vestjysk | 12.7 | 37.1 | 10.1 | 26.3 | 12.2 | 36.3 | 9.82 | 25.2 |
+ | Sønderjysk | 14.7 | 37.8 | 11.9 | 26.0 | 14.2 | 36.2 | 11.6 | 25.4 |
+ | Bornholmsk | 9.32 | 29.9 | 6.79 | 17.7 | 8.08 | 26.7 | 6.12 | 15.7 |
+ | Østjysk | 5.51 | 18.7 | 3.97 | 11.7 | 5.39 | 18.0 | 4.06 | 11.3 |
+ | Nordjysk | 3.86 | 13.6 | 2.57 | 7.72 | 3.80 | 13.5 | 2.75 | 7.51 |
+ | Københavnsk | 5.27 | 18.8 | 3.31 | 10.2 | 5.02 | 17.7 | 3.20 | 9.98 |
+ | Fynsk | 9.41 | 28.6 | 7.43 | 18.3 | 8.86 | 27.0 | 7.20 | 17.7 |
+ | Non-native | 10.6 | 33.2 | 7.84 | 20.9 | 10.0 | 31.6 | 7.46 | 19.4 |
+ | Sjællandsk | 5.82 | 19.5 | 4.44 | 12.6 | 5.70 | 18.6 | 4.48 | 12.7 |
+ | Sydømål | 7.09 | 20.7 | 6.38 | 14.9 | 6.96 | 20.4 | 6.44 | 15.3 |
+
+ ### Performance on Other Datasets
+
+ The model was also tested against other datasets to evaluate generalizability:
+
+ | Evaluation dataset | Roest-wav2vec2-315M-v1 WER (%) | Roest-wav2vec2-315M-v1 CER (%) | Roest-wav2vec2-315M-v2 WER (%) | Roest-wav2vec2-315M-v2 CER (%) |
+ | :--- | ---: | ---: | ---: | ---: |
+ | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) | 17.0 | 6.6 | **16.3** | **6.5** |
+ | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.7 | 13.9 | **26.1** | **11.9** |
+ | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 16.7 | 6.6 | **14.4** | **5.4** |
+ | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 27.3 | 7.9 | **26.4** | **7.7** |
+ | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) Normed | 16.6 | 6.3 | **15.6** | **6.1** |
+
+ ## Training curves
+ <img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/images/training_plots.png">
+
+ ## Creators and Funders
+ This model was trained, and the model card written, by Marie Juhl Jørgensen and Søren Vejlgaard Holm at [Alvenir](https://www.alvenir.ai/).
+
+ The CoRal project is funded by [Innovation Fund Denmark](https://innovationsfonden.dk/) and consists of the following partners:
+
+ - [Alexandra Institute](https://alexandra.dk/)
+ - [University of Copenhagen](https://www.ku.dk/)
+ - [Agency for Digital Government](https://digst.dk/)
+ - [Alvenir](https://www.alvenir.ai/)
+ - [Corti](https://www.corti.ai/)
+
+ We would specifically like to thank Dan Saattrup Nielsen (Alexandra Institute) for, among other things, the repository work, and Simon Leminen Madsen (Alexandra Institute) for modelling work.