OpenEuroLLM Tokenizer v2 (oellm-262k-v2)
Model Description
This is the second version of the OpenEuroLLM project's custom tokenizer, oellm-262k-v2. It is a highly multilingual SentencePiece BPE (Byte-Pair Encoding) tokenizer designed for processing a wide range of European languages.
This tokenizer was trained with the specific goal of aligning with the strategy of modern, high-performance models like Google's Gemma. It features:
- A large vocabulary of 262,144 tokens.
- Byte-fallback to ensure every character can be encoded, preventing unknown tokens (see the sketch after this list).
- A training process optimized for a diverse mix of web text and curated datasets.
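As a quick illustration of byte-fallback, the minimal sketch below checks that a character unlikely to have its own vocabulary entry is split into byte-level pieces rather than mapped to the unknown token. It assumes the tokenizer is loaded through the transformers library, as shown in the usage section further down.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonasaise/oellm_tokenizer_262k_v2")

# A character that is unlikely to be a single vocabulary entry.
rare_char = "\U0001F9ED"  # 🧭

# With byte fallback, the character decomposes into byte pieces such as
# '<0xF0>', '<0x9F>', ... instead of producing the <unk> token.
print(tokenizer.tokenize(rare_char))

ids = tokenizer.encode(rare_char, add_special_tokens=False)
print(tokenizer.unk_token_id in ids)  # expected: False
```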
This tokenizer is a foundational component intended for use in training large language models within the OpenEuroLLM project and for the broader research community.
Training Data
The tokenizer was trained on a ~200 GB (approximately 1 billion lines) subset of a larger 800 GB text corpus. This data was curated from several sources available on the LUMI supercomputer, primarily consisting of:
- HPLT (High-Performance Language Technologies) dataset v2.0
- FineWeb dataset
The corpus contains a wide variety of European languages, with the data distribution reflecting the contents of these source datasets.
Training Procedure
The tokenizer was trained on the LUMI supercomputer using Google's sentencepiece library. The training was performed on a largemem node with 950 GB of RAM and took approximately 46 hours to complete.
The key training parameters were as follows (a sketch of an equivalent sentencepiece training call is given after the list):
- Model Type: BPE (Byte-Pair Encoding)
- Vocabulary Size: 262,144
- Character Coverage: 0.9995
- Byte Fallback: Enabled
- Special Tokens: `<s>` (BOS), `</s>` (EOS), `<pad>`, `<unk>`
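The exact training command is not reproduced here, but a minimal sketch of a sentencepiece invocation using the parameters above could look like the following. The input path, the pad token ID, and the large-corpus flag are assumptions for illustration, not published settings.

```python
import sentencepiece as spm

# Illustrative sketch only: file paths and id assignments are assumptions.
spm.SentencePieceTrainer.train(
    input="corpus_subset.txt",          # assumed path to the ~200 GB text subset
    model_prefix="oellm_262k_v2",
    model_type="bpe",
    vocab_size=262144,
    character_coverage=0.9995,
    byte_fallback=True,
    bos_piece="<s>",
    eos_piece="</s>",
    pad_piece="<pad>",
    unk_piece="<unk>",
    pad_id=3,                           # assumed id; pad is disabled by default
    train_extremely_large_corpus=True,  # assumed, given the corpus size
)
```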
How to Use
You can use this tokenizer directly from the Hugging Face Hub with the transformers library.
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("jonasaise/oellm_tokenizer_262k_v2")

# Example usage
text = "Hej, detta är ett test av den nya OpenEuroLLM-tokeniseraren."
encoded_ids = tokenizer.encode(text)
print(f"Original text: {text}")
print(f"Encoded token IDs: {encoded_ids}")

decoded_text = tokenizer.decode(encoded_ids)
print(f"Decoded text: {decoded_text}")

# The tokenizer automatically adds a BOS token:
# >>> Decoded text: <s> Hej, detta är ett test av den nya OpenEuroLLM-tokeniseraren.
```
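For preparing model inputs in bulk, the tokenizer can also be called on a batch of texts using the standard transformers calling convention. The padding, truncation, and tensor settings below are illustrative choices rather than project defaults.

```python
# Batched encoding (requires PyTorch for return_tensors="pt").
batch = tokenizer(
    [
        "Hej, detta är ett test.",
        "Dies ist ein weiterer Beispielsatz.",
    ],
    padding=True,
    truncation=True,
    max_length=32,
    return_tensors="pt",
)
print(batch["input_ids"].shape)
print(batch["attention_mask"].shape)
```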
Intended Use and Limitations
This tokenizer is intended to be used for pre-training and fine-tuning large language models on multilingual European text. Its performance is expected to be strong on the languages well-represented in the HPLT and FineWeb datasets.
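One practical way to check how well the tokenizer fits a particular language is to measure its fertility (tokens per whitespace-delimited word) on a sample of text. The snippet below is a rough illustration with made-up sample sentences, not a published benchmark.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonasaise/oellm_tokenizer_262k_v2")

# Tiny illustrative samples; real evaluations should use larger held-out corpora.
samples = {
    "sv": "Det här är ett kort exempel på svensk text.",
    "de": "Dies ist ein kurzes Beispiel für deutschen Text.",
    "fi": "Tämä on lyhyt esimerkki suomenkielisestä tekstistä.",
}

for lang, text in samples.items():
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    n_words = len(text.split())
    print(f"{lang}: fertility = {n_tokens / n_words:.2f} tokens per word")
```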
The tokenizer's behavior and biases are a reflection of its training data. Users should be aware that the data is primarily sourced from the web and may contain noise, biases, or offensive content.
Licensing
This tokenizer is released under the Apache 2.0 license.
Citation
If you use this tokenizer in your research, please cite it as follows:
```bibtex
@misc{lind_2025_oellm_tokenizer_v2,
  author       = {Jonas Lind and the OpenEuroLLM Contributors},
  title        = {OpenEuroLLM Tokenizer v2 (oellm-262k-v2)},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/jonasaise/oellm_tokenizer_262k_v2}}
}
```