Røst-wav2vec2-315m-v2

This is a Danish state-of-the-art speech recognition model, trained as part of the CoRal project by Alvenir.

This repository contains a Wav2Vec2 model trained on the CoRal-v2 dataset. The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).

Try it out in our interactive demo!

Quick Start

Start by installing the required libraries:

$ pip install transformers kenlm pyctcdecode

Next, load the model with the transformers Python package:

>>> from transformers import pipeline
>>> audio = get_audio()  # 16kHz raw audio array
>>> transcriber = pipeline(model="CoRal-project/roest-wav2vec2-315m-v2")
>>> transcriber(audio)
{'text': 'your transcription'}
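
The `get_audio()` call above is a placeholder. As a minimal sketch, assuming the input is a 16-bit mono PCM WAV file that is already sampled at 16 kHz, the audio could be read with the standard library alone. The helper names `read_wav_mono16` and `transcribe` are hypothetical, and `transcribe` additionally assumes that `numpy` is installed:

```python
import array
import sys
import wave


def read_wav_mono16(path):
    """Read a 16-bit mono PCM WAV file; return (samples in [-1, 1), sample rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected a 16-bit mono PCM WAV file")
        rate = wav.getframerate()
        pcm = array.array("h")  # signed 16-bit integers
        pcm.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    return [sample / 32768.0 for sample in pcm], rate


def transcribe(path, model_id="CoRal-project/roest-wav2vec2-315m-v2"):
    """Transcribe a 16 kHz WAV file (downloads the model on first use)."""
    import numpy as np
    from transformers import pipeline

    samples, rate = read_wav_mono16(path)
    if rate != 16000:
        raise ValueError(f"expected 16 kHz audio, got {rate} Hz")
    transcriber = pipeline("automatic-speech-recognition", model=model_id)
    return transcriber(np.asarray(samples, dtype=np.float32))["text"]
```

Audio recorded at another sample rate would need to be resampled to 16 kHz before it is passed to the model.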

Transcription Examples

Explore the following audio samples along with their transcriptions and accuracy metrics. Each example showcases the model's performance with different Danish dialects.

Example 1 - Vestjysk Dialect

Model Transcription:
det blev til yderlig ti mål i den første sæson på trods af en position som back

Target Transcription:
det blev til yderligere ti mål i den første sæson på trods af en position som back

Character Error Rate (CER): 3.7%
Word Error Rate (WER): 5.9%

Example 2 - Sønderjysk Dialect

Model Transcription:
en arkitektoniske udformning af pladser forslagene iver benzen

Target Transcription:
den arkitektoniske udformning af pladsen er forestået af ivar bentsen

Character Error Rate (CER): 20.3%
Word Error Rate (WER): 60.0%

Example 3 - Nordsjællandsk Dialect

Model Transcription:
østrig og ungarn samarbejder om søen gennem den østrigske og ungarske vandkommission

Target Transcription:
østrig og ungarn samarbejder om søen gennem den østrigske og ungarske vandkommission

Character Error Rate (CER): 0.0%
Word Error Rate (WER): 0.0%

Example 4 - Lollandsk Dialect

Model Transcription:
det er produceret af thomas helme og indspillede i easy sound recording studio i københavn

Target Transcription:
det er produceret af thomas helmig og indspillet i easy sound recording studio i københavn

Character Error Rate (CER): 4.4%
Word Error Rate (WER): 13.3%

Model Details

Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained Wav2Vec2-XLS-R-300M has been fine-tuned for automatic speech recognition on the CoRal-v2 dataset to improve its recognition of Danish speech across different dialects. The model was trained for 30K steps using the training setup in the CoRal repository by running:

python src/scripts/finetune_asr_model.py \
  model=wav2vec2-small \
  max_steps=30000 \
  datasets.coral_conversation_internal.id=CoRal-project/coral-v2 \
  datasets.coral_readaloud_internal.id=CoRal-project/coral-v2

The model is evaluated with a language model (LM) applied as post-processing. The LM is the one that was trained for and is used by CoRal-project/roest-wav2vec2-315m-v1.
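
To illustrate why an LM can help, here is a toy sketch of LM rescoring: each candidate transcription's acoustic score is combined with a weighted LM score and a per-word bonus. The candidates, scores, unigram table, and the `alpha` and `beta` weights are all made up for illustration. The install step lists `kenlm` and `pyctcdecode`, which suggests an n-gram LM applied during beam search, but that decoder's internals are not reproduced here.

```python
# Toy unigram log-probabilities (hypothetical values, not from the real LM).
UNIGRAM_LOGPROB = {"jeg": -1.0, "hedder": -2.0, "peter": -2.5}
OOV_LOGPROB = -8.0  # penalty for words the toy LM has never seen


def lm_logprob(text):
    """Score a sentence as the sum of its unigram log-probabilities."""
    return sum(UNIGRAM_LOGPROB.get(word, OOV_LOGPROB) for word in text.split())


def rescore(candidates, alpha=0.5, beta=1.0):
    """Return the candidate text with the best combined score.

    candidates: list of (text, acoustic_logprob) pairs.
    alpha: weight of the LM score; beta: bonus added per word.
    """
    def combined(candidate):
        text, acoustic = candidate
        return acoustic + alpha * lm_logprob(text) + beta * len(text.split())

    return max(candidates, key=combined)[0]


candidates = [("jai hedder peter", -3.0), ("jeg hedder peter", -3.4)]
print(rescore(candidates, alpha=0.0, beta=0.0))  # acoustic score only
print(rescore(candidates))                       # acoustic score plus toy LM
```

With the acoustic score alone, the misspelled candidate wins. Once the toy LM penalizes the unknown word, the correctly spelled candidate is selected instead.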

The model was trained on the CoRal-v2 dataset, including both the conversational and read-aloud subsets. This dataset consists of Danish speech across a variety of dialects, age groups, and genders. Note that the dataset, and thus also this model, is licensed under a custom license adapted from OpenRAIL-M, which allows commercial use with a few restrictions (on speech synthesis and biometric identification); see the license.

Evaluation

The model was evaluated using the following metrics:

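Character Error Rate (CER): The number of character edits (substitutions, deletions, and insertions) needed to turn the model transcription into the target transcription, as a percentage of the number of characters in the target.
Word Error Rate (WER): The same measure computed over words instead of characters. Because insertions count as errors, both rates can exceed 100%.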
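
As a minimal sketch of how these metrics can be computed, the code below uses a plain Levenshtein edit distance and applies it to Example 1 from the transcription examples. The function names are illustrative and this is not the project's evaluation code. Spaces are counted as characters here, an assumption under which the sketch matches the figures reported for that example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_item != hyp_item),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate as a fraction of the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate as a fraction of the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


target = "det blev til yderligere ti mål i den første sæson på trods af en position som back"
output = "det blev til yderlig ti mål i den første sæson på trods af en position som back"
print(f"CER: {cer(target, output):.1%}")
print(f"WER: {wer(target, output):.1%}")
```

For this pair the sketch finds 3 character edits over 82 reference characters and 1 word edit over 17 reference words, which matches the 3.7% CER and 5.9% WER listed for Example 1.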

Conversational CoRal Performance

The model was first evaluated on a tentative version of the CoRal-v2 conversation dataset.

The results are tentative because the test set only includes 5 unique speakers, of which 4 are women. The test set includes 2 speakers with 'Fynsk' dialect, 1 with 'Sønderjysk', 1 with 'Non-native', and 1 with 'Nordjysk'.

Note that the high generalization error on conversation data for models trained on read-aloud data is still being analyzed.

Model	Number of parameters	Finetuned on data of type	CoRal-v2::conversation CER	CoRal-v2::conversation WER
CoRal-project/roest-wav2vec2-2B-v2	2B	Read-aloud and conversation	23.6%	34.3%
CoRal-project/roest-wav2vec2-1B-v2	1B	Read-aloud and conversation	23.9%	36.7%
CoRal-project/roest-wav2vec2-315M-v2 (This model)	315M	Read-aloud and conversation	24.2%	37.7%
CoRal-project/roest-whisper-large-v1	1540M	Read-aloud	138%	121%
CoRal-project/roest-wav2vec2-315m-v1	315M	Read-aloud	123%	80.5%
syvai/hviske-v2	1540M	Read-aloud	78.2%	72.6%
openai/whisper-large-v3	1540M	-	46.4%	57.4%

Read-aloud CoRal Performance

Model	Number of parameters	Finetuned on data of type	CoRal CER	CoRal WER
CoRal-project/roest-wav2vec2-2B-v2	2B	Read-aloud and conversation	6.2% ± 0.2%	16.0% ± 0.4%
CoRal-project/roest-wav2vec2-1B-v2	1B	Read-aloud and conversation	6.5% ± 0.2%	16.4% ± 0.4%
CoRal-project/roest-wav2vec2-315M-v2 (This model)	315M	Read-aloud and conversation	6.5% ± 0.2%	16.3% ± 0.4%
CoRal-project/roest-whisper-large-v1	1540M	Read-aloud	4.3% ± 0.2%	10.4% ± 0.3%
CoRal-project/roest-wav2vec2-315M-v1	315M	Read-aloud	6.6% ± 0.2%	17.0% ± 0.4%
mhenrichsen/hviske-v2	1540M	Read-aloud	4.7% ± 0.2%	11.8% ± 0.3%
openai/whisper-large-v3	1540M	-	11.4% ± 0.3%	28.3% ± 0.6%

Note: The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than reported in its model card.

Detailed CER scores in % of evaluation across demographics on the CoRal test data

Category	whisper-large-v3	hviske-v2	røst-whisper-large-v1	røst-wav2vec2-315m-v1	røst-wav2vec2-315m-v2	røst-wav2vec2-1B-v2	røst-wav2vec2-2B-v2
female	12.3	5.4	5.1	7.4	7.2	7.3	7.2
male	10.6	4.1	3.6	5.8	5.7	5.8	5.3
0-25	9.1	3.8	3.4	5.4	5.3	5.1	4.7
25-50	11.4	4.7	4.0	6.2	6.0	5.7	5.3
50+	12.4	5.2	5.0	7.5	7.4	7.8	7.7
Bornholmsk	12.1	3.8	3.8	6.8	6.1	6.2	5.7
Fynsk	12.0	5.9	5.1	7.4	7.2	6.9	6.1
Københavnsk	5.6	2.1	1.9	3.3	3.2	3.0	2.6
Non-native	17.4	5.9	4.8	7.8	7.5	7.3	6.6
Nordjysk	4.7	1.5	1.6	2.6	2.8	2.6	2.3
Sjællandsk	8.0	3.3	3.0	4.4	4.5	3.9	3.8
Sydømål	7.7	4.3	4.1	6.4	6.4	6.5	5.8
Sønderjysk	20.0	9.4	8.8	11.9	11.6	12.6	13.3
Vestjysk	17.6	7.2	6.4	10.1	9.8	10.5	10.8
Østjysk	5.9	2.9	2.6	4.0	4.1	3.8	3.5
Overall	11.4	4.7	4.3	6.6	6.5	6.5	6.2

Detailed WER scores in % of evaluation across demographics on the CoRal test data

Category	whisper-large-v3	hviske-v2	røst-whisper-large-v1	røst-wav2vec2-315m-v1	røst-wav2vec2-315m-v2	røst-wav2vec2-1B-v2	røst-wav2vec2-2B-v2
female	30.2	12.7	11.5	18.5	17.7	17.8	17.8
male	26.5	10.9	9.4	15.5	14.9	15.0	14.3
0-25	24.1	10.3	9.0	14.7	14.0	13.7	12.9
25-50	28.4	12.2	10.1	16.6	15.8	15.3	14.5
50+	30.0	12.1	11.3	18.2	17.7	18.5	18.7
Bornholmsk	31.6	10.4	9.8	17.7	15.7	16.4	15.3
Fynsk	29.3	14.3	12.1	18.3	17.7	16.7	15.2
Københavnsk	16.8	6.7	5.9	10.2	10.0	9.5	8.4
Non-native	40.9	15.4	12.2	20.9	19.4	19.4	18.1
Nordjysk	13.5	4.3	4.5	7.7	7.5	7.3	6.9
Sjællandsk	21.7	8.9	7.6	12.6	12.7	11.0	10.5
Sydømål	19.2	10.4	10.0	14.9	15.3	14.4	13.7
Sønderjysk	44.3	19.0	17.5	26.0	25.4	27.8	29.6
Vestjysk	42.0	17.7	15.0	26.3	25.2	26.7	28.3
Østjysk	16.9	8.2	7.5	11.7	11.3	10.8	10.1
Overall	28.3	11.8	10.4	17.0	16.3	16.4	16.0

Experiments with Røst-wav2vec2 with and without language model

The inclusion of a post-processing language model can affect performance significantly. The Røst-v1 and Røst-v2 models use the same language model (LM): the one that was trained for and is used by CoRal-project/roest-wav2vec2-315m-v1.

Model	Number of parameters	Finetuned on data of type	Postprocessed with Language Model	CoRal CER	CoRal WER
CoRal-project/roest-wav2vec2-2B-v2	2B	Read-aloud and conversation	Yes	6.2% ± 0.2%	16.0% ± 0.4%
CoRal-project/roest-wav2vec2-2B-v2	2B	Read-aloud and conversation	No	7.8% ± 0.2%	23.0% ± 0.4%
CoRal-project/roest-wav2vec2-1B-v2	1B	Read-aloud and conversation	Yes	6.5% ± 0.2%	16.4% ± 0.4%
CoRal-project/roest-wav2vec2-1B-v2	1B	Read-aloud and conversation	No	8.1% ± 0.2%	23.9% ± 0.4%
CoRal-project/roest-wav2vec2-315M-v2 (This model)	315M	Read-aloud and conversation	Yes	6.5% ± 0.2%	16.3% ± 0.4%
CoRal-project/roest-wav2vec2-315M-v2	315M	Read-aloud and conversation	No	8.2% ± 0.2%	25.1% ± 0.4%
CoRal-project/roest-wav2vec2-315m-v1	315M	Read-aloud	Yes	6.6% ± 0.2%	17.0% ± 0.4%
CoRal-project/roest-wav2vec2-315m-v1	315M	Read-aloud	No	8.6% ± 0.2%	26.3% ± 0.5%
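
To put the effect of the LM in relative terms, the short sketch below computes the relative WER reduction for each model from the values in the table above. The dictionary and function names are illustrative only.

```python
# (WER without LM, WER with LM) in percent, copied from the table above.
WER_BY_MODEL = {
    "roest-wav2vec2-2B-v2": (23.0, 16.0),
    "roest-wav2vec2-1B-v2": (23.9, 16.4),
    "roest-wav2vec2-315M-v2": (25.1, 16.3),
    "roest-wav2vec2-315m-v1": (26.3, 17.0),
}


def relative_reduction(without_lm, with_lm):
    """Relative error reduction in percent."""
    return 100.0 * (without_lm - with_lm) / without_lm


for model, (without_lm, with_lm) in WER_BY_MODEL.items():
    reduction = relative_reduction(without_lm, with_lm)
    print(f"{model}: {reduction:.1f}% lower WER with the LM")
```

For all four models the LM removes roughly 30 to 35 percent of the word errors, with the largest relative gains for the two 315M models.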

Here are the results of the Røst-Wav2Vec2-315m models on different Danish dialects in the test set:

	Røst-v1, without LM		Røst-v1, with LM		Røst-v2, without LM		Røst-v2, with LM
Dialect	CER (%)	WER (%)	CER (%)	WER (%)	CER (%)	WER (%)	CER (%)	WER (%)
Vestjysk	12.7	37.1	10.1	26.3	12.2	36.3	9.82	25.2
Sønderjysk	14.7	37.8	11.9	26.0	14.2	36.2	11.6	25.4
Bornholmsk	9.32	29.9	6.79	17.7	8.08	26.7	6.12	15.7
Østjysk	5.51	18.7	3.97	11.7	5.39	18.0	4.06	11.3
Nordjysk	3.86	13.6	2.57	7.72	3.80	13.5	2.75	7.51
Københavnsk	5.27	18.8	3.31	10.2	5.02	17.7	3.20	9.98
Fynsk	9.41	28.6	7.43	18.3	8.86	27.0	7.20	17.7
Non-native	10.6	33.2	7.84	20.9	10.0	31.6	7.46	19.4
Sjællandsk	5.82	19.5	4.44	12.6	5.70	18.6	4.48	12.7
Sydømål	7.09	20.7	6.38	14.9	6.96	20.4	6.44	15.3

"Fynsk" dialect specific models

Two dialect-specific Wav2Vec2 315M models have been fine-tuned on the "Fynsk" subset of the CoRal-v2 data.

wav2vec2-315m-v2-fynsk: Complete finetuning using only “Fynsk” dialect data for 30000 steps starting with Wav2Vec2-XLS-R-300M (same base model as this model)

wav2vec2-315m-v2-fynsk-light: Finetuning using only “Fynsk” dialect data for 5000 steps starting with CoRal-project/roest-wav2vec2-315m-v2 (this model)

Results on "Fynsk" subsets of Danish evaluation benchmarks:

	Røst-wav2vec2-315m-v2		Røst-wav2vec2-315m-v2-Fynsk		Røst-wav2vec2-315m-v2-Fynsk-Light
Evaluation data	WER %	CER %	WER %	CER %	WER %	CER %
CoRal2conv::Fynsk	32.6	21.7	44.5	30.4	34.5	22.4
CoRal::Fynsk	17.7	7.2	27.8	10.3	17.5	7.2
NST-da::Fyn	27.2	11.9	30.8	11.4	27.0	11.8

Comparison of results on different Danish benchmarks:

	Røst-wav2vec2-315m-v2		Røst-wav2vec2-315m-v2-Fynsk		Røst-wav2vec2-315m-v2-Fynsk-Light
Evaluation data	WER %	CER %	WER %	CER %	WER %	CER %
CoRal	16.3	6.5	32.4	12.4	16.8	6.6
CoRal-v2-read	17.0	6.7	33.5	12.6	17.5	6.8
CoRal-v2-conv	37.7	24.2	64.7	44.6	40.0	25.3
NST-da	28.4	12.4	34.9	12.6	28.5	12.3
CommonVoice17	14.4	5.4	24.1	9.1	15.1	6.0
AlvenirOss	11.3	4.4	20.8	8.2	11.4	4.5
AlvenirWiki	8.0	3.0	13.5	4.5	8.2	3.0
Fleurs-da_dk	15.6	6.1	23.6	8.8	16.5	6.4

Performance on Other Datasets

The model was also tested against other datasets to evaluate generalizability:

	Røst-wav2vec2-2B-v2		Røst-wav2vec2-1B-v2		Røst-wav2vec2-315M-v2		Røst-wav2vec2-315M-v1		Røst-whisper-large-v1
Evaluation Dataset	WER %	CER %	WER %	CER %	WER %	CER %	WER %	CER %	WER %	CER %
CoRal	16.0	6.2	16.4	6.5	16.3	6.5	17.0	6.6	10.4	4.3
NST-da	27.0	11.7	27.7	11.9	28.4	12.4	29.7	13.9	29.8	14.5
CommonVoice17	12.0	4.5	26.3	10.9	14.4	5.4	16.7	6.6	15.6	8.2
Fleurs-da_dk	12.5	5.1	13.7	5.5	15.6	6.1	16.6	6.3	12.6	5.1
AlvenirOss	8.1	3.1	9.1	3.6	11.3	4.4	14.8	6.0	9.2	3.9
AlvenirWiki	6.5	2.4	7.2	2.7	8.0	3.0	7.9	3.0	7.5	2.8

Note: The vocabulary used for training includes numerals (0, 1, 2, ..., 9), which are converted to text in a post-processing step. If the model misses a space between two numbers, they are interpreted as a single number, which especially affects the NST score because that dataset contains many numerals.
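
The effect of a missed space can be illustrated with a toy numeral converter. This is a sketch under stated assumptions, not the project's post-processing code: the lookup table is deliberately tiny and the `spell_out_numbers` helper is hypothetical.

```python
import re

# Hypothetical, deliberately tiny lookup; a real converter would cover all numbers.
DANISH_NUMBER_WORDS = {
    0: "nul", 1: "en", 2: "to", 3: "tre", 4: "fire", 5: "fem", 6: "seks",
    7: "syv", 8: "otte", 9: "ni", 10: "ti", 11: "elleve", 12: "tolv",
}


def spell_out_numbers(text):
    """Replace each run of digits with its Danish word, if it is in the lookup."""
    def replace(match):
        return DANISH_NUMBER_WORDS.get(int(match.group()), match.group())

    return re.sub(r"\d+", replace, text)


print(spell_out_numbers("nummer 1 2"))  # digits separated by a space
print(spell_out_numbers("nummer 12"))   # the space between the digits was missed
```

The first call yields "nummer en to" and the second "nummer tolv", so a single missing space changes every word of the spelled-out number and is penalized accordingly in both WER and CER.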

Note on comparing Whisper and Wav2Vec2 models

The Whisper models detailed in this model card exhibit significantly lower Character Error Rates (CER) and Word Error Rates (WER) compared to the Wav2Vec2 models. Whisper utilizes a transformer-based architecture with additional layers that enhance contextual understanding. In contrast, Wav2Vec2 models employ shorter context windows that focus on sound prediction. The Røst-Wav2Vec2 models incorporate a straightforward language model during post-processing, which addresses errors based on statistical language patterns. Introducing a more complex, contextual post-processing language model might enable a better comparison between these model types, which the CoRal project plans to explore in future releases.
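
To make the frame-by-frame nature of Wav2Vec2 output concrete: a Wav2Vec2 model fine-tuned for ASR predicts one token per short audio frame and is typically decoded with CTC, where consecutive repeated predictions are merged and a special blank token is dropped. The sketch below shows greedy CTC decoding on a made-up vocabulary and frame sequence. The token IDs and the `ctc_greedy_decode` helper are illustrative assumptions, not the model's actual vocabulary or decoder.

```python
from itertools import groupby

# Hypothetical vocabulary; ID 0 plays the role of the CTC blank token.
VOCAB = {0: "<blank>", 1: "h", 2: "e", 3: "j", 4: " "}


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Merge consecutive repeated IDs, then drop blanks and map IDs to characters."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in merged if i != blank_id)


# One predicted ID per audio frame (made-up values).
frames = [1, 1, 0, 2, 2, 0, 3, 3]
print(ctc_greedy_decode(frames, VOCAB))
```

Each frame is labeled largely on its own, so nothing in this step enforces that the resulting character string forms valid words. That is the gap the post-processing language model is meant to fill.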

The Røst-Whisper model excels on read-aloud data, leveraging its embedded contextual framework to achieve more robust recognition within this context. However, the Wav2Vec2 models appear to generalize more effectively across speech recognition tasks, whereas the Whisper models incur higher error rates on conversational data. Note that the CoRal-v2 conversation test set is tentative and has limited speaker diversity, which might influence these results.

Training curves

Creators and Funders

This model was trained, and the model card written, by Marie Juhl Jørgensen and Søren Vejlgaard Holm at Alvenir.

The CoRal project is funded by the Danish Innovation Fund and consists of the following partners:

We would especially like to thank Dan Saattrup Nielsen (Alexandra Institute) for, among other things, the repository work, and Simon Leminen Madsen (Alexandra Institute) for modelling work.

Citation

@misc{roest-wav2vec2-315m-v2,
  author    = {Marie Juhl Jørgensen and Søren Vejlgaard Holm and Martin Carsten Nielsen and Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Torben Blach},
  title     = {Røst-wav2vec2-315m-v2: A Danish state-of-the-art speech recognition model trained on varied demographics and dialects},
  year      = {2025},
  url       = {https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2},
}
