Papers
arxiv:2505.09738

Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

Published on May 14 · Submitted by adarshxs on May 16

Abstract

Pretrained large language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents significant challenges; standard methods to overcome it often require prohibitive computational resources. Although tokenizer replacement with heuristic initialization aims to reduce this burden, existing methods often require exhaustive residual fine-tuning and still may not fully preserve semantic nuances or adequately address the underlying compression inefficiencies. Our framework introduces two innovations: first, TokenAdapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization learning for multi-word Supertokens to enhance compression and reduce fragmentation. TokenAdapt initializes each new unique token embedding via a hybrid heuristic that combines two estimates: a local estimate based on subword decomposition with the old tokenizer, and a global estimate drawn from the top-k semantically similar tokens in the original vocabulary. This methodology aims to preserve semantics while significantly minimizing retraining requirements. Empirical investigations validate both contributions: the transplantation heuristic successfully initializes unique tokens, markedly outperforming conventional baselines and sophisticated methods including TransTokenizer and ReTok, while our Supertokens achieve notable compression gains. Our zero-shot perplexity results demonstrate that the TokenAdapt hybrid initialization consistently yields lower perplexity ratios than both the ReTok and TransTokenizer baselines across different base models and newly trained target tokenizers. TokenAdapt typically reduced the overall perplexity ratio significantly compared to ReTok, yielding at least a 2-fold improvement in these aggregate scores.

Community

Paper author · Paper submitter

Pretrained large language models (LLMs) are tied to a fixed tokenizer. This "tokenizer lock‑in" hurts efficiency and accuracy, especially in multilingual or domain‑specific settings. Replacing the tokenizer is attractive, but existing methods need costly end‑to‑end fine‑tuning and often lose semantic information. We present a two‑part framework that keeps cost low and quality high.

TokenAdapt🛠️:

A model‑agnostic procedure that transplants a new tokenizer into a frozen LLM. Unique tokens are initialized with a hybrid heuristic that combines (a) a local approximation from subword decomposition in the old vocabulary and (b) a global approximation from the top‑k semantically closest tokens.
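
The exact weighting and similarity machinery used in the paper is not reproduced here, but a minimal sketch of this style of hybrid initialization, assuming an auxiliary text encoder for semantic similarity and a fixed blend weight (both illustrative choices, not the authors' implementation), could look like this:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def hybrid_init(new_token, old_tokenize, old_embeds, encode, k=5, blend=0.5):
    """Sketch: initialize the embedding of a token absent from the old vocabulary.

    old_tokenize -- splits a string into old-vocabulary token strings
    old_embeds   -- dict: old-vocabulary token string -> np.ndarray embedding
    encode       -- auxiliary semantic encoder: string -> np.ndarray
    """
    # Local estimate: decompose the new token with the old tokenizer and
    # average the old embeddings of the resulting subwords.
    subwords = old_tokenize(new_token)
    local = np.mean([old_embeds[s] for s in subwords], axis=0)

    # Global estimate: similarity-weighted average of the embeddings of the
    # top-k old-vocabulary tokens closest to the new token in the auxiliary space.
    query = encode(new_token)
    sims = {t: cosine(query, encode(t)) for t in old_embeds}  # O(|V|); cache in practice
    top_k = sorted(sims, key=sims.get, reverse=True)[:k]
    weights = np.array([sims[t] for t in top_k])
    weights = weights / weights.sum()
    global_est = sum(w * old_embeds[t] for w, t in zip(weights, top_k))

    # Blend the two estimates; the blend factor is a hyperparameter here.
    return blend * local + (1.0 - blend) * global_est
```

Tokens shared between the old and new vocabularies can simply keep their original embeddings; the heuristic is only needed for the tokens unique to the new tokenizer.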

Supertokens⚡:

A light pre‑tokenization stage that learns frequent multi‑word units, increasing compression and lowering sequence length.
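
The summary does not spell out how supertokens are learned, so the snippet below is only a toy illustration, assuming a greedy, BPE-style merge of the most frequent adjacent word pairs; the function names and thresholds are invented for the example.

```python
from collections import Counter

def learn_supertokens(corpus, num_merges=50, min_count=2):
    """Toy sketch: greedily merge the most frequent adjacent word pair into a
    single multi-word unit, repeating until merges become too rare."""
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for words in corpus:
            pair_counts.update(zip(words, words[1:]))
        if not pair_counts:
            break
        (a, b), count = pair_counts.most_common(1)[0]
        if count < min_count:
            break
        merges.append((a, b))
        # Re-segment the corpus with the new supertoken applied.
        new_corpus = []
        for words in corpus:
            out, i = [], 0
            while i < len(words):
                if i + 1 < len(words) and (words[i], words[i + 1]) == (a, b):
                    out.append(a + " " + b)
                    i += 2
                else:
                    out.append(words[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges

# Frequent bigrams such as "New York" collapse into a single supertoken,
# shortening the token sequence the model has to process.
docs = [s.split() for s in ["I love New York", "New York is busy", "New York at night"]]
print(learn_supertokens(docs))  # [('New', 'York')]
```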

In zero‑shot evaluation across multiple base models and target tokenizers, TokenAdapt cuts aggregate perplexity ratios by at least 2x compared with ReTok and outperforms TransTokenizer without any additional training. When combined with supertokens, sequence length drops, further reducing compute.
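
A perplexity ratio of this kind typically compares the transplanted model's zero-shot perplexity to the original model's on the same text. The sketch below, using Hugging Face transformers, shows how such a ratio could be computed; the model paths are placeholders, not the paper's checkpoints.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_path, text):
    """Zero-shot perplexity of a causal LM on a single passage (no fine-tuning)."""
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path).eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(input_ids=ids, labels=ids).loss  # mean token-level cross-entropy
    return math.exp(loss.item())

# text = "..."  # evaluation passage
# ratio = perplexity("path/to/transplanted-model", text) / perplexity("path/to/original-model", text)
# A ratio close to 1 means the new tokenizer costs little in modeling quality.
```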

Our results show that tokenizer transplantation and learned supertokens can unlock the benefits of custom tokenizers while avoiding the heavy cost of full model retraining.
