Tokenizers Collection
This collection features frozen, precomputed token embedding tensors designed for experimentation with semantic emergence in language models.
Depending on the repository, the tokenizer is either based on a hybrid vocabulary or uses a strictly structured Unicode mapping scheme.
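As a quick illustration, here is a minimal sketch of loading one of the tokenizers and round-tripping a sample string (it assumes the `Bochkov/bvv241-abs` repository used in the example further below; any repository from the collection works the same way):

```python
from transformers import AutoTokenizer

# Load one tokenizer from the collection (bvv241-abs is used as an example here).
tokenizer = AutoTokenizer.from_pretrained("Bochkov/bvv241-abs")

text = "Hello, world!"
ids = tokenizer.encode(text)   # token ids under the fixed vocabulary
print(len(tokenizer), ids)     # vocabulary size and the encoded ids
print(tokenizer.decode(ids))   # round-trip back to text
```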
The associated `normalized_embeddings_weights.pt` file contains a `[vocab_size x embed_dim]` matrix of precomputed, L2-normalized, frozen embeddings.
No semantic information is encoded; embeddings remain fixed throughout LM pretraining.
No training or adaptation is required; the embeddings are suitable for plug-and-play use in research on embedding-free semantic emergence and modular LMs.
```python
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
import torch

# Load the tokenizer from the Hub.
tokenizer = AutoTokenizer.from_pretrained('Bochkov/bvv241-abs')

# Download the precomputed, L2-normalized embedding matrix from the same repository.
emb_path = hf_hub_download(
    repo_id="Bochkov/bvv241-abs",
    filename="normalized_embeddings_weights.pt"
)
embeddings = torch.load(emb_path)  # [vocab_size x embed_dim] tensor
```
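Building on the snippet above, here is a minimal sketch of how the loaded tensor might be plugged into a model as a frozen embedding layer and sanity-checked for L2 normalization. The surrounding model is not shown and is up to you; `freeze=True` simply keeps the weights fixed, matching the no-adaptation setup described above.

```python
import torch
import torch.nn as nn

# Sanity check: rows should be (approximately) unit-length, since the matrix is L2-normalized.
norms = embeddings.norm(dim=-1)
assert torch.allclose(norms, torch.ones_like(norms), atol=1e-3)

# Plug the precomputed matrix into a frozen embedding layer; no gradients flow into it,
# so the embeddings stay fixed throughout pretraining.
embedding_layer = nn.Embedding.from_pretrained(embeddings, freeze=True)

input_ids = torch.tensor(tokenizer.encode("Hello, world!")).unsqueeze(0)
hidden = embedding_layer(input_ids)  # [1, seq_len, embed_dim] input to the transformer blocks
print(hidden.shape)
```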
If you use these tokenizers, the frozen embeddings, or the underlying concepts in your research, please cite our work:
```bibtex
@misc{bochkov2025emergentsemanticstokenembeddings,
  title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
  author={A. Bochkov},
  year={2025},
  eprint={2507.04886},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.04886},
}
```
This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs, a step toward modular, fusable, multilingual LMs.