fix tokenizer by removing pretokenizer
#17
by stephantulkens - opened
This PR removes the redundant pretokenizer from the tokenizer.json file. The pretokenizer splits on whitespace, but because the normalization step that runs before it already replaces all whitespace with a Metaspace token, there is nothing left for it to split on. Leaving it in can lead some toolkits to believe whitespace splitting is happening when it actually is not.
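As a quick illustration of why the pretokenizer is a no-op (separate from the test below), one can inspect the normalizer directly. This is only a sketch, assuming the normalizer is configured as described above; it reuses the same model id as in the test:

from tokenizers import Tokenizer

# Sketch: show that normalization already replaces spaces with the Metaspace
# marker, leaving nothing for a whitespace pretokenizer to act on.
tok = Tokenizer.from_pretrained("google/embeddinggemma-300m")
print(tok.normalizer.normalize_str("hello world"))
# Expected: the spaces come back as "▁" (e.g. "hello▁world"), i.e. no plain
# whitespace remains by the time the pretokenizer would run.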
I've done the following test:
from typing import cast, Iterator

from datasets import load_dataset, Dataset
from tokenizers import Tokenizer
from tqdm import tqdm


def batch_iterator(dataset: Dataset, batch_size: int = 1000, total: int = 10_000) -> Iterator[str]:
    """Yield up to `total` lines of text from the dataset, reading in batches."""
    i = 0
    for batch in dataset.iter(batch_size):
        for line in batch["text"]:  # type: ignore[no-any-return]
            yield line
            i += 1
            if i >= total:
                # Return instead of break so the outer loop stops as well
                # once the line limit is reached.
                return


if __name__ == "__main__":
    # Tokenizer from this PR (pretokenizer removed) vs. the current one on the Hub.
    tok = Tokenizer.from_file("tokenizer.json")
    tok2 = Tokenizer.from_pretrained("google/embeddinggemma-300m")

    subsets = ("20231101.zh", "20231101.en", "20231101.ja", "20231101.dv")
    for subset in subsets:
        dataset = cast(Dataset, load_dataset(
            "wikimedia/wikipedia", subset, split="train", streaming=True
        ))
        print("Processing subset:", subset)
        for line in tqdm(batch_iterator(dataset)):
            x = tok.encode(line)
            y = tok2.encode(line)
            assert x.tokens == y.tokens
            assert x.ids == y.ids
These tests confirmed that, across a variety of languages and scripts, the old and new tokenizers produce identical tokens and ids, as expected.
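For reference, a minimal sketch of how this kind of edit can be applied programmatically (the actual change in this PR is simply an edit to the checked-in tokenizer.json; the file names below are placeholders):

import json

from tokenizers import Tokenizer

# Sketch: drop the redundant whitespace pretokenizer from a tokenizer.json file.
with open("tokenizer.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# The normalizer already inserts the Metaspace marker, so the whitespace
# pretokenizer does nothing and can safely be set to null.
config["pre_tokenizer"] = None

with open("tokenizer.fixed.json", "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)

# Sanity check: the edited file still loads.
Tokenizer.from_file("tokenizer.fixed.json")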
stephantulkens changed pull request status to open