Tokenizer config seems broken

#4
by Barahlush - opened

There is a problem when using the model on a Windows machine: the chat template is read in the wrong encoding, which breaks the special tokens, e.g. <|Assistant|> turns into <пЅњAssistantпЅњ>.
As a result, when the tokenizer is downloaded and used with transformers/unsloth, the chat template appends these broken sequences instead of the correct ones, and the output is not tokenized correctly (e.g. "<пЅњAssistantпЅњ>" is split into ~6 tokens instead of 1).
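For reference, the broken sequence is consistent with the UTF-8 bytes of the fullwidth vertical bar (U+FF5C) used in the special tokens being decoded with a legacy Windows code page (cp1251 in this case). Below is a minimal sketch that reproduces the mojibake and a possible workaround; the local path and the cp1251 assumption are illustrative, not confirmed:

```python
import json
from transformers import AutoTokenizer

# Reproduce the mojibake: U+FF5C is UTF-8 encoded as EF BD 9C; decoding those
# bytes with cp1251 yields "пЅњ", matching the broken template.
token = "<\uff5cAssistant\uff5c>"              # correct special token
print(token.encode("utf-8").decode("cp1251"))  # -> <пЅњAssistantпЅњ>

# Possible workaround on Windows: force UTF-8 before loading (run Python with
# `-X utf8` or set PYTHONUTF8=1), or re-read the template explicitly:
tok = AutoTokenizer.from_pretrained("path/to/local/snapshot")
with open("path/to/local/snapshot/tokenizer_config.json", encoding="utf-8") as f:
    tok.chat_template = json.load(f)["chat_template"]
```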

Barahlush changed discussion status to closed
Barahlush changed discussion status to open