Model Card for Whisper N-Gram Language Models

Model Description

These are KenLM n-gram language models trained to support automatic speech recognition (ASR). They were designed to work well with Whisper ASR models, but they apply to any ASR system that can use an n-gram language model. By providing context-specific probabilities of word sequences, they can improve recognition accuracy.

Intended Use

These models are intended for use in language modeling tasks within ASR systems to improve prediction accuracy, especially in low-resource language scenarios. They can be integrated into any system that supports KenLM models.
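One common way to integrate an n-gram LM into an ASR system is n-best rescoring (shallow fusion): the ASR model proposes candidate transcripts, and each candidate's acoustic score is combined with the LM score before picking a winner. The sketch below illustrates this with made-up scores and weights; the `alpha` and `beta` values and all numbers are hypothetical, not taken from the paper, and the LM log10 scores stand in for what a KenLM `model.score(text)` call would return.

```python
import math

def rescore(hypotheses, alpha=0.5, beta=1.0):
    """Return n-best hypotheses sorted by a combined ASR + LM score.

    Each hypothesis is (text, asr_logprob, lm_log10prob). alpha weights
    the LM, beta is a word-insertion bonus; both are illustrative.
    """
    def combined(h):
        text, asr_lp, lm_lp = h
        n_words = len(text.split())
        # KenLM scores are log10; convert to natural log for a common scale.
        return asr_lp + alpha * lm_lp * math.log(10) + beta * n_words
    return sorted(hypotheses, key=combined, reverse=True)

# Two made-up candidates: the acoustic model slightly prefers the first,
# but the language model strongly prefers the second.
nbest = [
    ("grabatzen ditut beti", -4.0, -9.0),
    ("grabatzen ditugu beti", -4.2, -6.0),
]
print(rescore(nbest)[0][0])
```

With these illustrative weights, the LM-preferred hypothesis overtakes the acoustically preferred one, which is the behavior that improves word error rate in low-resource settings.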

Model Details

Each model is built using the KenLM toolkit and is based on n-gram statistics extracted from large, domain-specific corpora. The models available are:

  • Basque (eu): 5gram-eu.bin (11 GB)
  • Galician (gl): 5gram-gl.bin (8.4 GB)
  • Catalan (ca): 5gram-ca.bin (20 GB)
  • Spanish (es): 5gram-es.bin (13 GB)

How to Use

Here is an example of how to load and use the Basque model with KenLM in Python:

import kenlm
from huggingface_hub import hf_hub_download

# Download the Basque 5-gram model from the Hugging Face Hub
# (cached locally after the first call).
filepath = hf_hub_download(repo_id="HiTZ/whisper-lm-ngrams", filename="5gram-eu.bin")

# Load the binary KenLM model.
model = kenlm.Model(filepath)

# Score a sentence: returns the log10 probability, with
# begin-of-sentence and end-of-sentence tokens included.
print(model.score("talka diskoetxearekin grabatzen ditut beti abestien maketak", bos=True, eos=True))

Training Data

The models were trained on corpora capped at 27 million sentences per language to keep them comparable and manageable. Additional data from recent Wikipedia dumps and the OPUS corpus were used as needed to reach this cap.

Model Performance

Performance varies with the language and the quality of the training data. It is typically evaluated by perplexity and by the improvement in ASR accuracy when the model is integrated into a recognition pipeline.
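For reference, KenLM's `score()` returns a base-10 log probability, from which perplexity can be recovered directly. A minimal sketch of the relationship; the numbers below are illustrative, not measured results for these models:

```python
def perplexity_from_log10(total_log10_prob, num_tokens):
    """Perplexity from a KenLM-style total log10 probability.

    KenLM's model.score(sentence, bos=True, eos=True) returns
    log10 P(sentence); perplexity is 10 ** (-log10 P / N), where N is
    the number of words plus one for the </s> token.
    """
    return 10.0 ** (-total_log10_prob / num_tokens)

# Illustrative numbers: an 8-word sentence scored at -20.0 log10 prob,
# so N = 9 tokens including </s>.
print(round(perplexity_from_log10(-20.0, 9), 2))
```

This matches what the KenLM Python binding's `model.perplexity(sentence)` convenience method computes from the same score.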

Considerations

These models are designed for use in research and production for language-specific ASR tasks. They should be tested for bias and fairness to ensure appropriate use in diverse settings.

Citation

If you use these models in your research, please cite:

@misc{dezuazo2025whisperlmimprovingasrmodels,
      title={Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages}, 
      author={Xabier de Zuazo and Eva Navas and Ibon Saratxaga and Inma Hernáez Rioja},
      year={2025},
      eprint={2503.23542},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.23542}, 
}

See the preprint at arXiv:2503.23542 for more details.

Licensing

These models are available under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to use, modify, and distribute them as long as you credit the original creators.

Acknowledgements

We would like to express our gratitude to Niels Rogge for his guidance and support in the creation of this model repository. You can find more about his work at his Hugging Face profile.
