Model Card for Whisper N-Gram Language Models
Model Description
These are KenLM n-gram language models trained to support automatic speech recognition (ASR). They are designed to work well with Whisper ASR models, but they are applicable to any ASR system that can consume KenLM models. By providing context-specific probabilities of word sequences, they can improve recognition accuracy.
Intended Use
These models are intended for use in language modeling tasks within ASR systems to improve prediction accuracy, especially in low-resource language scenarios. They can be integrated into any system that supports KenLM models.
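As a hedged illustration of one common integration, an n-gram LM score can be interpolated with the ASR hypothesis score during beam search (shallow fusion). The function and weights below are hypothetical tuning parameters for a sketch, not values shipped with this repository.

```python
def shallow_fusion_score(asr_logprob: float,
                         lm_logprob: float,
                         lm_weight: float = 0.5,
                         word_bonus: float = 0.0,
                         n_words: int = 0) -> float:
    """Combine an ASR hypothesis score with an n-gram LM score.

    Both inputs are log probabilities; lm_weight and word_bonus are
    hypothetical tuning parameters, normally chosen on a development set.
    """
    return asr_logprob + lm_weight * lm_logprob + word_bonus * n_words

# Rerank two hypothetical beam-search hypotheses: (asr_logprob, lm_logprob, n_words)
hyps = [(-4.2, -10.0, 5), (-4.5, -7.0, 5)]
best = max(hyps, key=lambda h: shallow_fusion_score(h[0], h[1], n_words=h[2]))
print(best)  # (-4.5, -7.0, 5): the LM prefers the second hypothesis
```

In this toy run the LM overrides the acoustically better hypothesis; in practice `lm_weight` controls that trade-off.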
Model Details
Each model is built using the KenLM toolkit and is based on n-gram statistics extracted from large, domain-specific corpora. The models available are:
- Basque (eu): 5gram-eu.bin (11G)
- Galician (gl): 5gram-gl.bin (8.4G)
- Catalan (ca): 5gram-ca.bin (20G)
- Spanish (es): 5gram-es.bin (13G)
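Since the filenames above follow a uniform pattern, a small helper can select the right artifact per language code. This is a minimal sketch: `NGRAM_FILES` simply restates the list above, and the returned filename is meant to be passed to `hf_hub_download` as shown in the usage example.

```python
# Mapping restated from the model list above. Pass the returned filename to
# huggingface_hub.hf_hub_download(repo_id="HiTZ/whisper-lm-ngrams", filename=...)
NGRAM_FILES = {
    "eu": "5gram-eu.bin",  # Basque
    "gl": "5gram-gl.bin",  # Galician
    "ca": "5gram-ca.bin",  # Catalan
    "es": "5gram-es.bin",  # Spanish
}

def ngram_filename(lang: str) -> str:
    """Return the repository filename for a supported language code."""
    try:
        return NGRAM_FILES[lang]
    except KeyError:
        raise ValueError(f"No n-gram model for language {lang!r}") from None

print(ngram_filename("gl"))  # 5gram-gl.bin
```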
How to Use
Here is an example of how to load and use the Basque model with KenLM in Python:
import kenlm
from huggingface_hub import hf_hub_download

# Download the Basque 5-gram binary from the Hugging Face Hub
filepath = hf_hub_download(repo_id="HiTZ/whisper-lm-ngrams", filename="5gram-eu.bin")

# Load the model and score a Basque sentence. `score` returns the base-10 log
# probability; bos/eos add begin- and end-of-sentence context.
model = kenlm.Model(filepath)
print(model.score("talka diskoetxearekin grabatzen ditut beti abestien maketak", bos=True, eos=True))
Training Data
The models were trained on corpora capped at 27 million sentences each to maintain comparability and manageability. Here's a breakdown of the sources for each language:
Basque: EusCrawl 1.0
Galician: SLI GalWeb Corpus
Catalan: Catalan Textual Corpus
Spanish: Spanish LibriSpeech MLS
Additional data from recent Wikipedia dumps and the Opus corpus were used as needed to reach the sentence cap.
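A minimal sketch of how such a sentence cap could be applied when streaming corpora. The 27-million figure comes from the text above; the source ordering (primary corpus first, supplementary data only as needed) follows the description, while the function itself is illustrative.

```python
from itertools import chain, islice
from typing import Iterable, Iterator

SENTENCE_CAP = 27_000_000  # cap stated in the text above

def capped_sentences(sources: Iterable[Iterable[str]],
                     cap: int = SENTENCE_CAP) -> Iterator[str]:
    """Yield sentences from the primary source, then supplementary sources
    (e.g. Wikipedia dumps, OPUS), stopping once the cap is reached."""
    return islice(chain.from_iterable(sources), cap)

# Tiny illustrative run with a cap of 5 sentences
main = ["s1", "s2", "s3"]
extra = ["s4", "s5", "s6"]
print(list(capped_sentences([main, extra], cap=5)))  # ['s1', 's2', 's3', 's4', 's5']
```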
Model Performance
The performance of these models varies with the language and the quality of the training data. It is typically evaluated via perplexity and via the improvement in ASR accuracy (e.g., word error rate) when the model is integrated into a decoder.
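Since KenLM's `score` returns a base-10 log probability of the whole sentence, perplexity can be recovered by normalizing per token. The conversion below is standard; the numbers in the example are hypothetical.

```python
def perplexity(log10_prob: float, n_tokens: int) -> float:
    """Perplexity from a base-10 sentence log probability (as KenLM reports).

    n_tokens should count the scored tokens, including </s> when eos=True.
    """
    return 10.0 ** (-log10_prob / n_tokens)

# Hypothetical: a 7-token sentence with total log10 probability -14.0
print(perplexity(-14.0, 7))  # 100.0
```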
Considerations
These models are designed for use in research and production for language-specific ASR tasks. They should be tested for bias and fairness to ensure appropriate use in diverse settings.
Citation
If you use these models in your research, please cite:
@misc{dezuazo2025whisperlmimprovingasrmodels,
title={Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages},
author={Xabier de Zuazo and Eva Navas and Ibon Saratxaga and Inma Hernáez Rioja},
year={2025},
eprint={2503.23542},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.23542},
}
See the paper preprint, arXiv:2503.23542, for more details.
Licensing
This model is available under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to use, modify, and distribute this model as long as you credit the original creators.
Acknowledgements
We would like to express our gratitude to Niels Rogge for his guidance and support in the creation of this model repository. You can find more about his work at his Hugging Face profile.