bvv241-2-3: Unicode & Wikipedia-based Tokenizer with Precomputed Frozen Embeddings

Tokenizer Description

This tokenizer is based on a hybrid vocabulary built on a strictly structured Unicode mapping scheme:

  • Plane 0 (0–65535): all single Unicode code points (monograms) are mapped 1:1 to token codes, directly matching the standard Unicode BMP.
  • Private-use and otherwise unused code ranges within Plane 0 (e.g., 0xE000–0xF8FF): all multi-character tokens, i.e., data-driven bigrams and trigrams mined from Wikipedia token co-occurrence, are placed exclusively in these ranges.
  • This design achieves total, lossless Unicode text coverage, with all multi-symbol tokens isolated from the core single-character range (see the illustrative sketch after this list).
  • Vocabulary size: 65,536 tokens.
  • Embedding dimension: 1024.
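
A minimal sketch of this ID layout, assuming exactly the mapping described in the list above; describe_token_id is a hypothetical helper used only for illustration and is not part of the tokenizer's API:

# Illustrative only: mirrors the layout described above, not the tokenizer's internals.
def describe_token_id(token_id: int) -> str:
    if 0xE000 <= token_id <= 0xF8FF:
        # Private-use slots host the data-driven bigrams/trigrams.
        return "multi-character token (Wikipedia bigram/trigram)"
    if 0 <= token_id <= 0xFFFF:
        # Every other Plane 0 code point is its own monogram token.
        return f"monogram for U+{token_id:04X}"
    raise ValueError("token id outside the 65,536-token vocabulary")

print(describe_token_id(ord("A")))  # monogram for U+0041
print(describe_token_id(0xE123))    # multi-character token (Wikipedia bigram/trigram)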

The associated normalized_embeddings_weights.pt file contains a [vocab_size x embed_dim] matrix of precomputed, L2-normalized, frozen embeddings.
No semantic information is encoded; embeddings remain fixed throughout LM pretraining.

This tokenizer and its embedding set are intended for exploring semantic emergence and for modular/fusion LM training over frozen, surface-level representations, enabling reproducible experiments.

How to Get Started with the Tokenizer


from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
import torch

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained('Bochkov/bvv241-2-3')

# Download the precomputed, L2-normalized, frozen embedding matrix (vocab_size x embed_dim).
emb_path = hf_hub_download(
    repo_id="Bochkov/bvv241-2-3",
    filename="normalized_embeddings_weights.pt"
)
embeddings = torch.load(emb_path)
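
A hedged sketch of how the two pieces can be combined. It assumes the .pt file loads as a single [65536 x 1024] float tensor as described above; the sanity checks and the frozen nn.Embedding wrapper are one plausible usage pattern, not an official training recipe.

import torch.nn as nn

# Sanity checks against the description above: shape and (approximately) unit L2 norms.
print(embeddings.shape)              # expected: torch.Size([65536, 1024])
print(embeddings.norm(dim=-1)[:5])   # expected: values close to 1.0

# Wrap the matrix as a frozen input layer; freeze=True keeps it fixed during LM pretraining,
# so any trainable transformer blocks would sit on top of these fixed vectors.
frozen_emb = nn.Embedding.from_pretrained(embeddings, freeze=True)

# Encode text and look up the frozen, surface-level representations.
ids = tokenizer("Hello, world!")["input_ids"]
vectors = frozen_emb(torch.tensor(ids))
print(vectors.shape)                 # [sequence_length, 1024]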

πŸ§‘β€πŸ”¬ Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

@misc{bochkov2025emergentsemanticstokenembeddings,
      title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.04886},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.04886}, 
}

This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs β€” a step toward modular, fusable, multilingual LMs.
