fix tokenizer by removing pretokenizer

#17

This PR removes the redundant pretokenizer from the tokenizer.json file. That pretokenizer splits on whitespace, but the normalization step that runs before it already replaces all whitespace with the Metaspace marker, so there is nothing left for the pretokenizer to split on. Leaving it in can lead some toolkits to believe whitespace splitting is happening when it actually is not.
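
To see why the pretokenizer is a no-op, you can inspect what the normalizer does to a piece of text before pre-tokenization runs. A minimal sketch, assuming `tokenizer.json` is the file in this repository (the exact marker depends on the configured normalizer):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

# The normalizer runs before the pretokenizer. Since it already maps " " to the
# Metaspace marker "▁", a whitespace-splitting pretokenizer has nothing to split.
print(tok.normalizer.normalize_str("hello world"))
# Expected: something like "hello▁world", with no plain spaces left.
```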

I've run the following test:

```python
from typing import cast, Iterator

from datasets import load_dataset, Dataset
from tokenizers import Tokenizer
from tqdm import tqdm


def batch_iterator(dataset: Dataset, batch_size: int = 1000, total: int = 10_000) -> Iterator[str]:
    """Yield lines of text in batches, stopping once at least `total` lines have been produced."""
    i = 0
    for batch in dataset.iter(batch_size):
        for line in batch["text"]:  # type: ignore[no-any-return]
            yield line
            i += 1
        if i >= total:
            break


if __name__ == "__main__":
    # Compare the modified tokenizer.json against the tokenizer published on the Hub.
    tok = Tokenizer.from_file("tokenizer.json")
    tok2 = Tokenizer.from_pretrained("google/embeddinggemma-300m")

    # Wikipedia subsets covering scripts with and without whitespace:
    # Chinese, English, Japanese, Dhivehi.
    subsets = ("20231101.zh", "20231101.en", "20231101.ja", "20231101.dv")

    for subset in subsets:
        dataset = cast(Dataset, load_dataset(
            "wikimedia/wikipedia", subset, split="train", streaming=True
        ))

        print("Processing subset:", subset)
        for line in tqdm(batch_iterator(dataset)):
            x = tok.encode(line)
            y = tok2.encode(line)
            assert x.tokens == y.tokens
            assert x.ids == y.ids
```

This test confirmed that, across a variety of languages and scripts, the old and new tokenizers produce identical tokens and ids, as expected.
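
As a quick sanity check alongside the equivalence test, you can also confirm that the pretokenizer is actually gone from the updated file. A short sketch, assuming `tokenizer.json` is the modified file from this PR:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

# After this change, no pretokenizer should be configured.
print(tok.pre_tokenizer)                 # expected: None
print(tok.encode("hello world").tokens)  # output unchanged relative to the Hub tokenizer
```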
