---
language:
- en
license: mit
---

# char128-shift Tokenizer

A fixed-size Hugging Face–compatible **character tokenizer** with a dedicated **SHIFT** token (`↨`) to represent uppercase letters. Instead of assigning separate tokens to uppercase `A–Z`, each uppercase letter is encoded as `↨` followed by its lowercase form (e.g., `H` → `↨h`).

This repository contains the ready-to-use tokenizer, loadable with `AutoTokenizer`, as well as the script that built it (in the `src/` folder); a simplified build sketch also appears near the end of this card.

---

## Features

* **Fixed 128-token vocabulary** (including specials).
* **Uppercase encoding via SHIFT token**: no duplicate uppercase letters in the vocab.
* **WordLevel model** with an explicit, closed character set.
* **Pre-tokenizer** splits by Unicode grapheme clusters (`\X`), so emoji and diacritics are preserved.
* **Normalizer** maps `A–Z` → `↨` + lowercase explicitly.
* **Decoder** concatenates tokens directly (no extra spaces).

---

## Installation

You only need `transformers` (for the Python interface) and optionally `tokenizers` (for advanced building).

```bash
pip install "transformers>=4.40" "tokenizers>=0.14"
```

No PyTorch/TensorFlow/Flax is required to use the tokenizer itself.

---

## Usage

### Load from a local folder

```python
from transformers import AutoTokenizer

# Load the tokenizer from a local folder
tok = AutoTokenizer.from_pretrained("char128_shift_tokenizer")
print(tok.vocab_size)  # 128

ids = tok.encode("Hello, There!\n")
print(ids)
print(tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))
# → "↨hello, ↨there!\n"
```

### Load from the Hugging Face Hub

```python
from transformers import AutoTokenizer

# Replace with your Hub repo
tok = AutoTokenizer.from_pretrained("Corianas/char128_shift_tokenizer")
```

---

## Restoring Uppercase

The decoded output contains SHIFT markers (e.g., `↨h`). For display, restore casing:

```python
def restore_uppercase(s: str, shift: str = "↨") -> str:
    out, i, n = [], 0, len(s)
    while i < n:
        # A SHIFT followed by a non-SHIFT character uppercases that character.
        if s[i] == shift and i + 1 < n and s[i + 1] != shift:
            out.append(s[i + 1].upper())
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

ids = tok.encode("Hello, There!\n")
decoded = tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(decoded)                      # "↨hello, ↨there!\n"
print(restore_uppercase(decoded))   # "Hello, There!\n"
```

---

## Vocabulary

The 128 tokens include:

* **Lowercase letters** `a–z`
* **Digits** `0–9`
* **Whitespace** (space, `\n`, `\t`)
* **Punctuation and symbols** (configurable)
* **Diacritics** such as `è` and `é`, if needed
* **Special tokens** (4 entries, including `<unk>`)
* **SHIFT token** `↨`

Uppercase `A–Z` are **not** in the vocab; they are represented via SHIFT.

---

## Integration

For dataset preparation:

```python
import numpy as np
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("char128_shift_tokenizer")

with open("input.txt", "r", encoding="utf-8") as f:
    data = f.read()

# 90/10 train/validation split
n = len(data)
train_txt, val_txt = data[:int(0.9 * n)], data[int(0.9 * n):]

train_ids = tok.encode(train_txt)
val_ids = tok.encode(val_txt)

# Token ids fit comfortably in uint16 (vocab size is 128)
np.array(train_ids, dtype=np.uint16).tofile("train.bin")
np.array(val_ids, dtype=np.uint16).tofile("val.bin")
```

Your model's `vocab_size` must match (128).

---

## Known Edge Cases

* **Non-ASCII uppercase** letters (such as `À`, `É`) are lowercased without a SHIFT marker unless you add explicit rules (see the build sketch below).
* **Spaces in decode** are suppressed by setting the decoder to plain concatenation; if you see them, make sure the tokenizer was saved with `tok.decoder = decoders.Sequence([])`.
* **Unknown characters** map to `<unk>`. Make sure the vocab covers every character you expect.
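---

## Build Sketch

For reference, the block below is a minimal sketch of how a tokenizer like this can be assembled with the `tokenizers` library; the authoritative build script lives in the `src/` folder. The punctuation set and the special-token names `<pad>`, `<bos>`, `<eos>` are illustrative assumptions (only an unknown token is documented above), and the character list is not padded out to exactly 128 entries.

```python
# Sketch only: vocab contents and special-token names are assumptions;
# the real build script (src/) defines the exact 128-entry vocabulary.
import string

from tokenizers import Regex, Tokenizer, decoders, models, normalizers, pre_tokenizers
from transformers import PreTrainedTokenizerFast

SHIFT = "↨"

specials = ["<pad>", "<unk>", "<bos>", "<eos>"]  # assumed names
chars = (
    [SHIFT]
    + list(string.ascii_lowercase)   # a-z
    + list(string.digits)            # 0-9
    + [" ", "\n", "\t"]              # whitespace
    + list(".,!?'\"-:;()")           # illustrative punctuation choice
)
vocab = {token: i for i, token in enumerate(specials + chars)}

tokenizer = Tokenizer(models.WordLevel(vocab=vocab, unk_token="<unk>"))

# Normalizer: rewrite every uppercase ASCII letter as SHIFT + lowercase,
# e.g. "H" -> "↨h". Add more Replace rules here to SHIFT-encode non-ASCII
# uppercase too, e.g. normalizers.Replace("É", SHIFT + "é").
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Replace(c, SHIFT + c.lower()) for c in string.ascii_uppercase]
)

# Pre-tokenizer: split the text into Unicode grapheme clusters (\X),
# so emoji and combining diacritics stay intact as single tokens.
tokenizer.pre_tokenizer = pre_tokenizers.Split(Regex(r"\X"), behavior="isolated")

# Decoder: an empty sequence concatenates tokens directly, with no spaces.
tokenizer.decoder = decoders.Sequence([])

fast = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="<unk>",
    pad_token="<pad>",
    bos_token="<bos>",
    eos_token="<eos>",
)
fast.save_pretrained("char128_shift_tokenizer")
```

The empty `decoders.Sequence([])` is what makes `decode` concatenate tokens directly; without it, tokens would be joined with spaces (see the edge-case note above).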
---

## License

MIT

---

## Example Test

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Corianas/char128_shift_tokenizer")

ids = tok.encode("Hello, There!\n")
print(ids)
print(tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))
# ↨hello, ↨there!\n
```
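As a follow-up sanity check, the full encode → decode → restore roundtrip should reproduce the original string exactly. This snippet reuses the `restore_uppercase` helper from the *Restoring Uppercase* section above:

```python
# Roundtrip check; assumes `tok` and `restore_uppercase` from the sections above.
text = "Hello, There!\n"
decoded = tok.decode(tok.encode(text), skip_special_tokens=True,
                     clean_up_tokenization_spaces=False)
assert restore_uppercase(decoded) == text  # "↨hello, ↨there!\n" -> "Hello, There!\n"
print("roundtrip OK")
```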