Tokenizer config seems broken

#4
by Barahlush - opened

There is a problem when using the model on a Windows machine: the chat template is read in the wrong encoding, which breaks the special tokens, e.g. <|Assistant|> turns into <пЅњAssistantпЅњ>.
As a result, when the tokenizer is downloaded and used with transformers/unsloth, the chat template appends these broken sequences instead of the correct ones, and the output is not tokenized correctly (e.g. "<пЅњAssistantпЅњ>" is split into ~6 tokens instead of 1).
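For reference, the broken sequence is consistent with the UTF-8 bytes of the fullwidth vertical bar (U+FF5C) used in the special tokens being decoded with a legacy Windows code page (cp1251 in this case). Below is a minimal sketch that reproduces the mojibake and a possible workaround; the local path and the cp1251 assumption are illustrative, not confirmed:

```python
import json
from transformers import AutoTokenizer

# Reproduce the mojibake: U+FF5C is UTF-8 encoded as EF BD 9C; decoding those
# bytes with cp1251 yields "пЅњ", matching the broken template.
token = "<\uff5cAssistant\uff5c>"              # correct special token
print(token.encode("utf-8").decode("cp1251"))  # -> <пЅњAssistantпЅњ>

# Possible workaround on Windows: force UTF-8 before loading (run Python with
# `-X utf8` or set PYTHONUTF8=1), or re-read the template explicitly:
tok = AutoTokenizer.from_pretrained("path/to/local/snapshot")
with open("path/to/local/snapshot/tokenizer_config.json", encoding="utf-8") as f:
    tok.chat_template = json.load(f)["chat_template"]
```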

Barahlush changed discussion status to closed
Barahlush changed discussion status to open