# Model Card for oellm-tokenizer-262k-v1

## Model Details
This is a Byte-Pair Encoding (BPE) tokenizer.
- Model Type: BPE Tokenizer
- Vocabulary Size: 262,144
- Special Tokens: `<pad>`, `<eos>`, `<bos>`
- Compatibility: Designed for Gemma3-style models.
## Intended Uses & Limitations
This tokenizer is intended for researchers and developers working on pre-training or fine-tuning language models for European languages and code. It is not a model and cannot be used for inference on its own.
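As a quick illustration, the tokenizer can be loaded with the Hugging Face `transformers` library once it is published to the Hub. This is a minimal sketch; the repository id and sample text below are placeholder assumptions, not part of this card.

```python
from transformers import AutoTokenizer

# Placeholder repository id -- substitute the actual Hub path of this tokenizer.
tokenizer = AutoTokenizer.from_pretrained("openeurollm/oellm-tokenizer-262k-v1")

text = "OpenEuroLLM tokenizers cover many European languages."
ids = tokenizer(text)["input_ids"]

print(ids)                                   # token ids
print(tokenizer.convert_ids_to_tokens(ids))  # corresponding subword strings
```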
## Training Data
The tokenizer was trained on a ~800 GB randomly sampled subset of a 1.2 TB text corpus. The data mixture was designed to provide broad coverage of European languages and high-quality English text.
The primary data sources were:
- Nemotron-CC: High-quality English data from Common Crawl.
- HPLT v2.0: Multilingual data from the High Performance Language Technologies project, focusing on languages prioritized by the OpenEuroLLM initiative.
## Training Procedure
The tokenizer was trained on LUMI-C on a single node with 128 CPU cores and 1 TB of RAM, using the Hugging Face `tokenizers` library.
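A minimal sketch of BPE training with the `tokenizers` library is shown below. Only the vocabulary size and special tokens come from this card; the byte-level pre-tokenizer, corpus shard layout, and other settings are assumptions and may differ from the actual training configuration.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Assumed setup: byte-level BPE, which is a common choice for multilingual corpora.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=262_144,                           # from this card
    special_tokens=["<pad>", "<eos>", "<bos>"],   # from this card
    show_progress=True,
)

# Hypothetical list of plain-text shards sampled from the training corpus.
files = ["corpus/shard-00000.txt", "corpus/shard-00001.txt"]

tokenizer.train(files, trainer)
tokenizer.save("oellm-tokenizer-262k-v1.json")
```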
## Overall Average Fertility Across All Tested Languages

Lower is better.

| Tokenizer | Average fertility |
|---|---|
| oellm-262k-v1 | 1494.30 |
| gemma3-4b-it | 1656.77 |
| gemma-2b | 1689.27 |
| Teuken7B | 1897.91 |
| Llama-3-8B | 2063.76 |
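The card does not spell out how fertility was computed for the table above. A common definition is the average number of tokens produced per whitespace-delimited word; the sketch below illustrates that definition only, and the repository id and sample texts are placeholder assumptions.

```python
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average tokens per whitespace-separated word (one common fertility definition;
    whether the table above uses this exact definition is an assumption)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenizer(text, add_special_tokens=False)["input_ids"])
        total_words += len(text.split())
    return total_tokens / max(total_words, 1)

# Placeholder repository id and sample texts for illustration only.
tok = AutoTokenizer.from_pretrained("openeurollm/oellm-tokenizer-262k-v1")
sample = ["Ein Beispielsatz auf Deutsch.", "A short English example sentence."]
print(f"fertility: {fertility(tok, sample):.2f}")
```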