Model Card for oellm-tokenizer-262k-v1

Model Details

This is a Byte-Pair Encoding (BPE) tokenizer.

  • Model Type: BPE Tokenizer
  • Vocabulary Size: 262,144
  • Special Tokens: <pad>, <eos>, <bos>
  • Compatibility: Designed for Gemma3-style models.
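
A minimal loading sketch with the Hugging Face tokenizers library is shown below. The artifact name tokenizer.json is an assumption about how this tokenizer is serialized, not a confirmed file in the repository.

```python
from tokenizers import Tokenizer

# "tokenizer.json" is an assumed artifact name; adjust to the actual file.
tok = Tokenizer.from_file("tokenizer.json")

enc = tok.encode("OpenEuroLLM covers many European languages.")
print(enc.tokens)         # subword pieces
print(enc.ids)            # vocabulary ids in [0, 262143]
print(tok.decode(enc.ids))
```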

Intended Uses & Limitations

This tokenizer is intended for researchers and developers working on pre-training or fine-tuning language models for European languages and code. It is not a model and cannot be used for inference on its own.
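
For pre-training or fine-tuning pipelines, the raw tokenizer can be wrapped as a transformers fast tokenizer. The sketch below assumes the tokenizer.json artifact name and the special-token roles listed above; it is illustrative, not the card's documented setup.

```python
from transformers import PreTrainedTokenizerFast

# Wrap the serialized tokenizer so it plugs into Trainer / data collators.
# The special-token assignments mirror the model card.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",  # assumed artifact name
    pad_token="<pad>",
    eos_token="<eos>",
    bos_token="<bos>",
)

batch = tokenizer(
    ["Ein Beispielsatz.", "def add(a, b): return a + b"],
    padding=True,
)
print(batch["input_ids"])
```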

Training Data

The tokenizer was trained on a randomly sampled ~800 GB subset of a 1.2 TB text corpus (a sampling sketch follows the source list below). The data mixture was designed to provide broad coverage of European languages and high-quality English text.

The primary data sources were:

  • Nemotron-CC: High-quality English data from Common Crawl.
  • HPLT v2.0: Multilingual data from the High Performance Language Technologies project, focusing on languages prioritized by the OpenEuroLLM initiative.
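
A sketch of how a ~800 GB random subset could be drawn from a sharded corpus is given below. The directory layout, shard granularity, and byte budget are assumptions based on the sizes quoted above, not the actual sampling script.

```python
import os
import random

BUDGET_BYTES = 800 * 10**9  # ~800 GB target, per the card

# "corpus/" is a placeholder for the full 1.2 TB collection of shards.
shards = [os.path.join("corpus", f) for f in os.listdir("corpus")]
random.seed(0)
random.shuffle(shards)

sampled, total = [], 0
for shard in shards:
    size = os.path.getsize(shard)
    if total + size > BUDGET_BYTES:
        break
    sampled.append(shard)
    total += size

print(f"Selected {len(sampled)} shards, {total / 1e9:.1f} GB")
```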

Training Procedure

The tokenizer was trained with the Hugging Face tokenizers library on LUMI-C, using a single node with 128 CPU cores and 1 TB of RAM.
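
A minimal training sketch with the tokenizers library is shown below. The vocabulary size and special tokens come from this card; the byte-level pre-tokenization and the shard paths are assumptions, as the card does not document those settings.

```python
import glob

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level pre-tokenization is an assumption; the card only states BPE.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=262_144,                          # from the model card
    special_tokens=["<pad>", "<eos>", "<bos>"],  # from the model card
)

# "shards/*.txt" stands in for the sampled ~800 GB training corpus.
tokenizer.train(glob.glob("shards/*.txt"), trainer)
tokenizer.save("tokenizer.json")
```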

Overall Average Fertility Across All Languages Tested

(Lower is better; a measurement sketch follows the list.)

  • oellm-262k-v1: 1494.30
  • gemma3-4b-it: 1656.77
  • gemma-2b: 1689.27
  • Teuken7B: 1897.91
  • Llama-3-8B: 2063.76
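
A sketch of how such a comparison could be reproduced is shown below, assuming fertility here denotes the average token count per evaluation document. The metric definition, the sample texts, and the hub repository ids are all assumptions; the card does not publish the benchmark set.

```python
from transformers import AutoTokenizer

# Hypothetical evaluation texts; the actual multilingual benchmark set
# behind the numbers above is not published in this card.
samples = ["Dette er en prøve.", "Ceci est un essai.", "This is a test."]

# Illustrative hub ids, not necessarily the exact checkpoints compared.
for name in ["google/gemma-2b", "meta-llama/Meta-Llama-3-8B"]:
    tok = AutoTokenizer.from_pretrained(name)
    fertility = sum(len(tok.encode(s)) for s in samples) / len(samples)
    print(f"{name}: {fertility:.2f}")
```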