	---
	license: apache-2.0
	tags:
	- automatic-speech-recognition
	- fi
	- finnish
	library_name: transformers
	language: fi
	base_model:
	- GetmanY1/wav2vec2-large-fi-150k
	model-index:
	- name: wav2vec2-large-fi-150k-finetuned
	  results:
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Lahjoita puhetta (Donate Speech)
	      type: lahjoita-puhetta
	      args: fi
	    metrics:
	    - name: Dev WER
	      type: wer
	      value: 15.34
	    - name: Dev CER
	      type: cer
	      value: 4.14
	    - name: Test WER
	      type: wer
	      value: 16.86
	    - name: Test CER
	      type: cer
	      value: 5.07
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Finnish Parliament
	      type: FinParl
	      args: fi
	    metrics:
	    - name: Dev16 WER
	      type: wer
	      value: 11.3
	    - name: Dev16 CER
	      type: cer
	      value: 4.75
	    - name: Test16 WER
	      type: wer
	      value: 8.29
	    - name: Test16 CER
	      type: cer
	      value: 3.34
	    - name: Test20 WER
	      type: wer
	      value: 6.94
	    - name: Test20 CER
	      type: cer
	      value: 2.15
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Common Voice 16.1
	      type: mozilla-foundation/common_voice_16_1
	      args: fi
	    metrics:
	    - name: Dev WER
	      type: wer
	      value: 7.17
	    - name: Dev CER
	      type: cer
	      value: 1.11
	    - name: Test WER
	      type: wer
	      value: 5.86
	    - name: Test CER
	      type: cer
	      value: 0.91
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: FLEURS
	      type: google/fleurs
	      args: fi_fi
	    metrics:
	    - name: Dev WER
	      type: wer
	      value: 9.2
	    - name: Dev CER
	      type: cer
	      value: 5.23
	    - name: Test WER
	      type: wer
	      value: 10.69
	    - name: Test CER
	      type: cer
	      value: 5.79
	---

	# Finnish Wav2vec2-Large ASR

	This model is [GetmanY1/wav2vec2-large-fi-150k](https://huggingface.co/GetmanY1/wav2vec2-large-fi-150k) fine-tuned on 4600 hours of Finnish speech audio sampled at 16 kHz:
	* 1500 hours of [Lahjoita puhetta (Donate Speech)](https://link.springer.com/article/10.1007/s10579-022-09606-3) (colloquial Finnish)
	* 3100 hours of the [Finnish Parliament dataset](https://link.springer.com/article/10.1007/s10579-023-09650-7)

	When using the model, make sure that your speech input is also sampled at 16 kHz.
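
One quick way to verify the sample rate before transcription is to read it from the file header. The sketch below uses Python's standard `wave` module, so it assumes uncompressed WAV input; the helper names `wav_sample_rate` and `needs_resampling` are illustrative and not part of the model's code.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # rate of the audio the model was fine-tuned on


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (in Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target: int = TARGET_SAMPLE_RATE) -> bool:
    """Return True if the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target
```

Files for which `needs_resampling` returns `True` should be resampled to 16 kHz before they are passed to the processor.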

	## Model description

	The Finnish Wav2Vec2 Large model has the same architecture and uses the same training objective as the English and multilingual models described in the [wav2vec 2.0 paper](https://arxiv.org/abs/2006.11477).

	[GetmanY1/wav2vec2-large-fi-150k](https://huggingface.co/GetmanY1/wav2vec2-large-fi-150k) is a large-scale, 317-million-parameter monolingual model pre-trained on 158k hours of unlabeled Finnish speech, including [KAVI radio and television archive materials](https://kavi.fi/en/radio-ja-televisioarkistointia-vuodesta-2008/), Lahjoita puhetta (Donate Speech), Finnish Parliament, and Finnish VoxPopuli.

	You can read more about the pre-trained model in [this paper](https://www.isca-archive.org/interspeech_2025/getman25_interspeech.html). The training scripts are available on [GitHub](https://github.com/aalto-speech/large-scale-monolingual-speech-foundation-models).

	## Intended uses

	You can use this model for Finnish ASR (speech-to-text).

	### How to use

	To transcribe audio files, the model can be used as a standalone acoustic model as follows:

	```python
	import torch
	from datasets import Audio, load_dataset
	from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

	# load model and processor
	processor = Wav2Vec2Processor.from_pretrained("GetmanY1/wav2vec2-large-fi-150k-finetuned")
	model = Wav2Vec2ForCTC.from_pretrained("GetmanY1/wav2vec2-large-fi-150k-finetuned")
	model.eval()

	# load an example dataset and resample its audio to the 16 kHz the model expects
	ds = load_dataset("mozilla-foundation/common_voice_16_1", "fi", split="test")
	ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

	# extract input features from the raw waveform
	audio = ds[0]["audio"]
	input_values = processor(
	    audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt", padding="longest"
	).input_values  # batch size 1

	# retrieve logits without tracking gradients
	with torch.no_grad():
	    logits = model(input_values).logits

	# take argmax and decode
	predicted_ids = torch.argmax(logits, dim=-1)
	transcription = processor.batch_decode(predicted_ids)
	print(transcription)
	```
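
The last two steps of the snippet amount to greedy CTC decoding: the most likely token is picked for every frame, consecutive repeats are merged, and blank tokens are dropped. The pure-Python sketch below illustrates the idea on a toy example; the vocabulary, the blank id, and the function name `greedy_ctc_decode` are made up for illustration and do not correspond to the model's actual tokenizer.

```python
from itertools import groupby

# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank token.
BLANK_ID = 0
VOCAB = {1: "h", 2: "e", 3: "i", 4: " "}


def greedy_ctc_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse per-frame argmax ids into text (greedy CTC decoding)."""
    # 1. Merge runs of identical ids, e.g. [1, 1, 0, 2] -> [1, 0, 2].
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop blanks, whose job is to separate genuinely repeated characters.
    return "".join(vocab[token_id] for token_id in collapsed if token_id != blank_id)


# Per-frame ids of the kind an argmax over the logits yields for one utterance.
frame_ids = [1, 1, 0, 2, 2, 0, 3, 3, 3, 0]
print(greedy_ctc_decode(frame_ids))  # hei
```

Because blanks are removed only after repeats are merged, a sequence such as `[1, 0, 1]` decodes to two separate characters rather than one.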

	## Citation

	If you use our models or scripts, please cite our article as:

	```bibtex
	@inproceedings{getman25_interspeech,
	  title = {{Is your model big enough? Training and interpreting large-scale monolingual speech foundation models}},
	  author = {Yaroslav Getman and Tamás Grósz and Tommi Lehtonen and Mikko Kurimo},
	  year = {2025},
	  booktitle = {Interspeech 2025},
	  pages = {231--235},
	  doi = {10.21437/Interspeech.2025-46},
	  issn = {2958-1796},
	}
	```

	## Team Members

	- Yaroslav Getman, [Hugging Face profile](https://huggingface.co/GetmanY1), [LinkedIn profile](https://www.linkedin.com/in/yaroslav-getman/)
	- Tamás Grósz, [Hugging Face profile](https://huggingface.co/Grosy), [LinkedIn profile](https://www.linkedin.com/in/tam%C3%A1s-gr%C3%B3sz-950a049a/)

	Feel free to contact us for more details 🤗