OpenEuroLLM Tokenizer v2 (oellm-262k-v2)

Model Description

This is the second version of the OpenEuroLLM project's custom tokenizer, oellm-262k-v2. It is a highly multilingual SentencePiece BPE (Byte-Pair Encoding) tokenizer designed for processing a wide range of European languages.

This tokenizer was trained with the specific goal of matching the design of modern, high-performance models such as Google's Gemma. It features:

  • A large vocabulary of 262,144 tokens.
  • Byte-fallback to ensure every character can be encoded, preventing unknown tokens.
  • A training process optimized for a diverse mix of web text and curated datasets.

This tokenizer is a foundational component intended for use in training large language models within the OpenEuroLLM project and for the broader research community.
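Byte fallback guarantees lossless encoding: any character absent from the learned vocabulary is decomposed into its UTF-8 bytes, each of which has a dedicated token. A minimal, self-contained sketch of the idea (the helper name and the toy vocabulary are illustrative only, not part of the released tokenizer):

```python
def byte_fallback_encode(text, vocab):
    """Toy illustration of SentencePiece-style byte fallback:
    characters present in the vocabulary are emitted as-is;
    anything else decomposes into per-byte tokens like <0xE2>."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

# '€' (U+20AC) is not in the toy vocabulary, so it falls back to
# its three UTF-8 bytes: E2 82 AC.
print(byte_fallback_encode("a€", {"a"}))
# → ['a', '<0xE2>', '<0x82>', '<0xAC>']
```

Because every possible byte has a token, no input can ever map to `<unk>`.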

Training Data

The tokenizer was trained on a ~200 GB subset (roughly 1 billion lines) of a larger 800 GB text corpus. This data was curated from several sources available on the LUMI supercomputer, primarily consisting of:

  • HPLT (High-Performance Language Technologies) dataset v2.0
  • FineWeb dataset

The corpus contains a wide variety of European languages, with the data distribution reflecting the contents of these source datasets.

Training Procedure

The tokenizer was trained on the LUMI supercomputer using Google's sentencepiece library. Training ran on a large-memory ("largemem") node with 950 GB of RAM and took approximately 46 hours to complete.

The key training parameters were:

  • Model Type: BPE (Byte-Pair Encoding)
  • Vocabulary Size: 262,144
  • Character Coverage: 0.9995
  • Byte Fallback: Enabled
  • Special Tokens: <s> (BOS), </s> (EOS), <pad>, <unk>
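The parameters above map directly onto sentencepiece's training flags. A hedged sketch of the kind of spm_train invocation that would reproduce this configuration — the input path and model prefix are placeholders, the special-token IDs follow sentencepiece defaults plus an assumed pad slot, and the exact command used on LUMI is not published here:

```shell
spm_train \
  --input=corpus.txt \
  --model_prefix=oellm_262k_v2 \
  --model_type=bpe \
  --vocab_size=262144 \
  --character_coverage=0.9995 \
  --byte_fallback=true \
  --unk_id=0 --bos_id=1 --eos_id=2 --pad_id=3 \
  --pad_piece='<pad>' \
  --train_extremely_large_corpus=true
```

`--train_extremely_large_corpus` relaxes internal limits that would otherwise reject a billion-line input; a corpus of this size is also why a large-memory node was needed.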

How to Use

You can use this tokenizer directly from the Hugging Face Hub with the transformers library.

from transformers import AutoTokenizer

# Load the tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("jonasaise/oellm_tokenizer_262k_v2")

# Example usage
text = "Hej, detta är ett test av den nya OpenEuroLLM-tokeniseraren."
encoded_ids = tokenizer.encode(text)

print(f"Original text: {text}")
print(f"Encoded token IDs: {encoded_ids}")

decoded_text = tokenizer.decode(encoded_ids)
print(f"Decoded text: {decoded_text}")

# The tokenizer automatically adds a BOS token
# >>> Decoded text: <s> Hej, detta är ett test av den nya OpenEuroLLM-tokeniseraren.

Intended Use and Limitations

This tokenizer is intended to be used for pre-training and fine-tuning large language models on multilingual European text. Its performance is expected to be strong on the languages well-represented in the HPLT and FineWeb datasets.
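One simple way to check that expectation for a given language is to measure tokenizer fertility (average tokens per word): lower fertility generally indicates better vocabulary coverage of that language. A small, tokenizer-agnostic helper (the name `fertility` is ours; pass any encode callable, e.g. `tokenizer.encode` from the usage snippet above):

```python
def fertility(encode, text):
    """Average number of tokens per whitespace-separated word.
    `encode` is any callable mapping a string to a token sequence."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(encode(w)) for w in words) / len(words)

# Toy example with a character-level "tokenizer": each word's
# token count is its length in characters.
print(fertility(list, "ett test"))  # → (3 + 4) / 2 = 3.5
```

Comparing fertility across languages on parallel text is a quick way to spot languages the vocabulary under-serves.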

The tokenizer's behavior and biases are a reflection of its training data. Users should be aware that the data is primarily sourced from the web and may contain noise, biases, or offensive content.

Licensing

This tokenizer is released under the Apache 2.0 license.

Citation

If you use this tokenizer in your research, please cite it as follows:

@misc{lind_2025_oellm_tokenizer_v2,
  author = {Jonas Lind and the OpenEuroLLM Contributors},
  title = {OpenEuroLLM Tokenizer v2 (oellm-262k-v2)},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/jonasaise/oellm_tokenizer_262k_v2}}
}