Model Card for oellm-tokenizer-262k-v1

Model Details

This is a Byte-Pair Encoding (BPE) tokenizer.

  • Model Type: BPE Tokenizer
  • Vocabulary Size: 262,144
  • Special Tokens: <pad>, <eos>, <bos>
  • Compatibility: Designed for Gemma3-style models.
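
A minimal loading sketch with the Hugging Face tokenizers library is shown below. The artifact name tokenizer.json is an assumption about how this tokenizer is serialized, not a confirmed file in the repository.

```python
from tokenizers import Tokenizer

# "tokenizer.json" is an assumed artifact name; adjust to the actual file.
tok = Tokenizer.from_file("tokenizer.json")

enc = tok.encode("OpenEuroLLM covers many European languages.")
print(enc.tokens)         # subword pieces
print(enc.ids)            # vocabulary ids in [0, 262143]
print(tok.decode(enc.ids))
```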

Intended Uses & Limitations

This tokenizer is intended for researchers and developers working on pre-training or fine-tuning language models for European languages and code. It is not a model and cannot be used for inference on its own.
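
For pre-training or fine-tuning pipelines, the raw tokenizer can be wrapped as a transformers fast tokenizer. The sketch below assumes the tokenizer.json artifact name and the special-token roles listed above; it is illustrative, not the card's documented setup.

```python
from transformers import PreTrainedTokenizerFast

# Wrap the serialized tokenizer so it plugs into Trainer / data collators.
# The special-token assignments mirror the model card.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",  # assumed artifact name
    pad_token="<pad>",
    eos_token="<eos>",
    bos_token="<bos>",
)

batch = tokenizer(
    ["Ein Beispielsatz.", "def add(a, b): return a + b"],
    padding=True,
)
print(batch["input_ids"])
```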

Training Data

The tokenizer was trained on a randomly sampled ~800 GB subset of a 1.2 TB text corpus (a sampling sketch follows the source list below). The data mixture was designed to provide broad coverage of European languages and high-quality English text.

The primary data sources were:

  • Nemotron-CC: High-quality English data from Common Crawl.
  • HPLT v2.0: Multilingual data from the High Performance Language Technologies project, focusing on languages prioritized by the OpenEuroLLM initiative.
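
A sketch of how a ~800 GB random subset could be drawn from a sharded corpus is given below. The directory layout, shard granularity, and byte budget are assumptions based on the sizes quoted above, not the actual sampling script.

```python
import os
import random

BUDGET_BYTES = 800 * 10**9  # ~800 GB target, per the card

# "corpus/" is a placeholder for the full 1.2 TB collection of shards.
shards = [os.path.join("corpus", f) for f in os.listdir("corpus")]
random.seed(0)
random.shuffle(shards)

sampled, total = [], 0
for shard in shards:
    size = os.path.getsize(shard)
    if total + size > BUDGET_BYTES:
        break
    sampled.append(shard)
    total += size

print(f"Selected {len(sampled)} shards, {total / 1e9:.1f} GB")
```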

Training Procedure

The tokenizer was trained with the Hugging Face tokenizers library on LUMI-C, using a single node with 128 CPU cores and 1 TB of RAM.
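
A minimal training sketch with the tokenizers library is shown below. The vocabulary size and special tokens come from this card; the byte-level pre-tokenization and the shard paths are assumptions, as the card does not document those settings.

```python
import glob

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level pre-tokenization is an assumption; the card only states BPE.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=262_144,                          # from the model card
    special_tokens=["<pad>", "<eos>", "<bos>"],  # from the model card
)

# "shards/*.txt" stands in for the sampled ~800 GB training corpus.
tokenizer.train(glob.glob("shards/*.txt"), trainer)
tokenizer.save("tokenizer.json")
```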

Overall Average Fertility Across All Languages Tested

(Lower is better; a measurement sketch follows the list.)

  • oellm-262k-v1: 1494.30
  • gemma3-4b-it: 1656.77
  • gemma-2b: 1689.27
  • Teuken7B: 1897.91
  • Llama-3-8B: 2063.76
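
A sketch of how such a comparison could be reproduced is shown below, assuming fertility here denotes the average token count per evaluation document. The metric definition, the sample texts, and the hub repository ids are all assumptions; the card does not publish the benchmark set.

```python
from transformers import AutoTokenizer

# Hypothetical evaluation texts; the actual multilingual benchmark set
# behind the numbers above is not published in this card.
samples = ["Dette er en prøve.", "Ceci est un essai.", "This is a test."]

# Illustrative hub ids, not necessarily the exact checkpoints compared.
for name in ["google/gemma-2b", "meta-llama/Meta-Llama-3-8B"]:
    tok = AutoTokenizer.from_pretrained(name)
    fertility = sum(len(tok.encode(s)) for s in samples) / len(samples)
    print(f"{name}: {fertility:.2f}")
```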