Tokens with Meaning: A Hybrid Tokenization Approach for NLP
Abstract
A hybrid tokenization framework combining rule-based morphological analysis with statistical subword segmentation improves tokenization for morphologically rich languages like Turkish.
Tokenization plays a pivotal role in natural language processing (NLP), shaping how text is segmented and interpreted by language models. While subword methods such as Byte Pair Encoding (BPE) and WordPiece have been effective, they often struggle with morphologically rich and agglutinative languages because they rely on frequency rather than linguistic structure. We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency. It assigns shared identifiers to phonologically variant affixes (e.g., -ler and -lar) and altered root forms (e.g., kitap vs. kitabı), reducing redundancy while maintaining semantic integrity. Special tokens are added for whitespace and case, including an UPPERCASE marker to avoid vocabulary inflation from capitalization. BPE is integrated for out-of-vocabulary coverage without harming morphological coherence. On the TR-MMLU benchmark, the tokenizer achieves the highest Turkish Token Percentage (90.29%) and Pure Token Percentage (85.8%). Comparisons with tokenizers from LLaMA, Gemma, and GPT show more linguistically meaningful and coherent tokens. Although demonstrated on Turkish, the approach is language-independent and adaptable to other languages, offering a practical path toward more interpretable and effective multilingual NLP systems.
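To make the shared-identifier idea concrete, here is a minimal Python sketch. The dictionaries and function names are hypothetical toy stand-ins, not the released implementation: phonologically variant affixes (-ler/-lar) and altered root forms (kitap/kitab) resolve to a single vocabulary ID, and a special UPPERCASE marker token replaces duplicated cased entries. The real tokenizer layers full root-affix dictionaries, phonological normalization, and a BPE fallback on top of this skeleton.

```python
# Toy sketch of shared-ID tokenization (hypothetical names and data;
# the released implementation on GitHub differs).

# Phonologically variant affixes share one ID (vowel-harmony pairs).
AFFIX_IDS = {"ler": 1, "lar": 1, "de": 2, "da": 2}

# Altered root forms map to the same ID as their citation form.
ROOT_IDS = {"kitap": 100, "kitab": 100, "ev": 101}

UPPERCASE = 900  # marker token instead of duplicating cased vocabulary

def tokenize(word: str) -> list[int]:
    """Greedy root + affix split; the full system falls back to BPE."""
    ids = []
    if word[0].isupper():
        ids.append(UPPERCASE)
        word = word.lower()
    for root, rid in ROOT_IDS.items():
        if word.startswith(root):
            ids.append(rid)
            rest = word[len(root):]
            while rest:
                # Try longest affixes first.
                for affix, aid in sorted(AFFIX_IDS.items(),
                                         key=lambda kv: -len(kv[0])):
                    if rest.startswith(affix):
                        ids.append(aid)
                        rest = rest[len(affix):]
                        break
                else:
                    break  # unmatched residue: BPE fallback in the full system
            return ids
    return ids  # unknown root: BPE fallback in the full system

print(tokenize("Kitaplar"))  # [900, 100, 1] -- same affix ID as "evler"
print(tokenize("evler"))     # [101, 1]
```

Note how "Kitaplar" and "evler" end on the same affix ID even though the surface forms differ, which is exactly the redundancy reduction the abstract describes.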
Community
Hey @alibayram and team,
it's a very interesting approach! Do you plan to release the code so others can run more experiments with it? I would like to test it out for German :)
Hello Stefan,
Sorry for the late response. I've already published the tokenizer on PyPI (https://pypi.org/project/turkish-tokenizer/) and released the code on GitHub, where you can find everything, including the development history, in the commits. A live demo is also running as a Space on my Hugging Face profile.