fix tokenizer by removing pretokenizer

#17

This PR removes the redundant pretokenizer from the tokenizer.json file. That pretokenizer splits on whitespace, but the normalization step that runs before it already replaces all whitespace with the Metaspace marker, so there is nothing left for the pretokenizer to split on. Leaving it in can lead some toolkits to believe whitespace splitting is happening when it actually is not.
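
To see why the pretokenizer is a no-op, you can inspect what the normalizer does to a piece of text before pre-tokenization runs. A minimal sketch, assuming `tokenizer.json` is the file in this repository (the exact marker depends on the configured normalizer):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

# The normalizer runs before the pretokenizer. Since it already maps " " to the
# Metaspace marker "▁", a whitespace-splitting pretokenizer has nothing to split.
print(tok.normalizer.normalize_str("hello world"))
# Expected: something like "hello▁world", with no plain spaces left.
```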

I've run the following test:

```python
from typing import cast, Iterator

from datasets import load_dataset, Dataset
from tokenizers import Tokenizer
from tqdm import tqdm


def batch_iterator(dataset: Dataset, batch_size: int = 1000, total: int = 10_000) -> Iterator[str]:
    """Yield lines of text in batches, stopping once at least `total` lines have been produced."""
    i = 0
    for batch in dataset.iter(batch_size):
        for line in batch["text"]:  # type: ignore[no-any-return]
            yield line
            i += 1
        if i >= total:
            break


if __name__ == "__main__":
    # Compare the modified tokenizer.json against the tokenizer published on the Hub.
    tok = Tokenizer.from_file("tokenizer.json")
    tok2 = Tokenizer.from_pretrained("google/embeddinggemma-300m")

    # Wikipedia subsets covering scripts with and without whitespace:
    # Chinese, English, Japanese, Dhivehi.
    subsets = ("20231101.zh", "20231101.en", "20231101.ja", "20231101.dv")

    for subset in subsets:
        dataset = cast(Dataset, load_dataset(
            "wikimedia/wikipedia", subset, split="train", streaming=True
        ))

        print("Processing subset:", subset)
        for line in tqdm(batch_iterator(dataset)):
            x = tok.encode(line)
            y = tok2.encode(line)
            assert x.tokens == y.tokens
            assert x.ids == y.ids
```

This test confirmed that, across a variety of languages and scripts, the old and new tokenizers produce identical tokens and ids, as expected.
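
As a quick sanity check alongside the equivalence test, you can also confirm that the pretokenizer is actually gone from the updated file. A short sketch, assuming `tokenizer.json` is the modified file from this PR:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

# After this change, no pretokenizer should be configured.
print(tok.pre_tokenizer)                 # expected: None
print(tok.encode("hello world").tokens)  # output unchanged relative to the Hub tokenizer
```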
