Token issue? <start_of_turn> split into multiple tokens
As the title says: start_of_turn seems to be split into multiple tokens when running under llama.cpp, whereas every other Gemma 3 GGUF has it as a single token.
It only seems to be start_of_turn that has this problem. end_of_turn is still a single token.
The tokens that start_of_turn is split into are:
'<':655, 'start':3041, '_':236779, 'of':1340, '_':236779, 'turn':887, '>':236813
Other Gemma 3 GGUFs have this as a single token: 105.
Edit: I had to edit this post because it seems you can't write start_of_turn with the < > around it on here; it just disappears from the text.
I can confirm I'm seeing the same issue with the latest llama.cpp (b5050).
So I used HF's GGUF JavaScript package to inspect the GGUF:
import { gguf } from '@huggingface/gguf';
const { metadata, tensorInfos } = await gguf("https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf/resolve/main/gemma-3-12b-it-q4_0.gguf", {
  additionalFetchHeaders: {
    "Authorization": "Bearer " + process.env.HF_TOKEN,
  }
});
console.log(metadata['tokenizer.ggml.tokens']);
// Look up the token ID of <start_of_turn>, then print its token type.
const startId = metadata['tokenizer.ggml.tokens'].indexOf("<start_of_turn>");
console.log(startId);
console.log(metadata['tokenizer.ggml.token_type'][startId]);
This correctly prints the token ID 105 for <start_of_turn>. However, the token_type is set to 1 (normal token), while it should be 3 (control token).
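For anyone who wants to run the same check on more than one token, here is a minimal sketch. The numeric values follow llama.cpp's token type enum; the helper name findMistypedControlTokens and the toy arrays are hypothetical and only for illustration. In practice the two arrays would be metadata['tokenizer.ggml.tokens'] and metadata['tokenizer.ggml.token_type'] from the snippet above.

```javascript
// Token type values as used in the tokenizer.ggml.token_type metadata array.
const TOKEN_TYPE_NAMES = {
  0: "UNDEFINED",
  1: "NORMAL",
  2: "UNKNOWN",
  3: "CONTROL",
  4: "USER_DEFINED",
  5: "UNUSED",
  6: "BYTE",
};

// Hypothetical helper: report which of the expected control tokens are
// missing from the vocabulary or carry a type other than CONTROL (3).
function findMistypedControlTokens(tokens, tokenTypes, expectedControl) {
  const problems = [];
  for (const text of expectedControl) {
    const id = tokens.indexOf(text);
    if (id === -1) {
      problems.push({ text, id, type: "MISSING" });
      continue;
    }
    const raw = tokenTypes[id];
    const type = TOKEN_TYPE_NAMES[raw] ?? `INVALID(${raw})`;
    if (type !== "CONTROL") {
      problems.push({ text, id, type });
    }
  }
  return problems;
}

// Toy data, not taken from any real GGUF file.
const sampleTokens = ["<pad>", "<start_of_turn>", "<end_of_turn>", "hello"];
const sampleTypes = [3, 1, 3, 1];
console.log(
  findMistypedControlTokens(sampleTokens, sampleTypes, [
    "<start_of_turn>",
    "<end_of_turn>",
  ])
);
```

With the toy data, only <start_of_turn> is reported, because it is typed as NORMAL while <end_of_turn> is CONTROL.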
Checking again with the non-QAT GGUF from unsloth, metadata['tokenizer.ggml.token_type'][startId] prints 3:
const { metadata, tensorInfos } = await gguf("https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-BF16.gguf", {
(the rest of the code is the same)
@google Can you regenerate the GGUF while correcting the token type? I think it's better to copy-paste this metadata key from an existing GGUF
Apparently this is not the only token issue these qat versions have. From here: https://www.reddit.com/r/LocalLLaMA/comments/1jvi860/comment/mmdgpim/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
"I just checked, there is indeed a whole lot of tokens (6411 to be precise) that are configured differently between the qat models and the models quantized with llama.cpp "
@ngxson
Here are the tokens that differ between a GGUF generated from llama.cpp and the QAT GGUF (ignoring all the <unused[N]> tokens, which are NORMAL instead of CONTROL):
NORMAL->CONTROL
['<mask>', '<start_of_turn>', '<end_of_turn>', '<start_of_image>', '<end_of_image>']
--------------------------
NORMAL->USER_DEFINED
['[multimodal]', '\n', '\n\n', '\n\n\n', '\n\n\n\n', '\n\n\n\n\n', '\n\n\n\n\n\n', '\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '▁▁', '▁▁▁', '▁▁▁▁', '▁▁▁▁▁', '▁▁▁▁▁▁', '▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '<table>', '<caption>', '<thead>', '<tbody>', '<tfoot>', '<tr>', '<th>', '<td>', '</table>', '</caption>', '</thead>', '</tbody>', '</tfoot>', '</tr>', '</th>', '</td>', '<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>', '<blockquote>', '</h1>', '</h2>', 
'</h3>', '</h4>', '</h5>', '</h6>', '</blockquote>', '<strong>', '<em>', '<b>', '<i>', '<u>', '<s>', '<sub>', '<sup>', '<code>', '</strong>', '</em>', '</b>', '</i>', '</u>', '</s>', '</sub>', '</sup>', '</code>', '<a>', '<html>', '<body>', '<img>', '<span>', '<bbox>', '<ul>', '<li>', '<div>',
'<iframe>', '<footer>', '</a>', '</html>', '</body>', '</img>', '</span>', '</bbox>', '</ul>', '</li>', '</div>', '</iframe>', '</footer>', '\t', '\t\t', '\t\t\t', '\t\t\t\t', '\t\t\t\t\t', '\t\t\t\t\t\t', '\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t']
--------------------------
UNKNOWN->CONTROL
['<unk>']
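A comparison like the one above can be sketched roughly as follows. This is not the script that produced the lists; the function name diffTokenTypes, the regex used to skip unused tokens, and the toy arrays are all assumptions for illustration. The two type arrays would come from metadata['tokenizer.ggml.token_type'] of the QAT GGUF and of a llama.cpp-generated GGUF, assuming both share the same vocabulary order.

```javascript
// Index = numeric token type stored in tokenizer.ggml.token_type.
const TYPE_NAMES = [
  "UNDEFINED", "NORMAL", "UNKNOWN", "CONTROL", "USER_DEFINED", "UNUSED", "BYTE",
];

// Assumption: unused tokens are named like <unused0>, <unused1>, ...
const isUnusedToken = (text) => /^<unused\d+>$/.test(text);

// Hypothetical helper: group tokens whose type differs between two GGUFs
// under keys such as "NORMAL->CONTROL" (type in A -> type in B).
function diffTokenTypes(tokens, typesA, typesB, skip = isUnusedToken) {
  const groups = {};
  tokens.forEach((text, id) => {
    if (typesA[id] === typesB[id] || skip(text)) return;
    const key = `${TYPE_NAMES[typesA[id]]}->${TYPE_NAMES[typesB[id]]}`;
    if (!groups[key]) groups[key] = [];
    groups[key].push(text);
  });
  return groups;
}

// Toy data, not taken from any real GGUF file.
const toyTokens = ["<unk>", "<start_of_turn>", "\n", "hello", "<unused0>"];
const toyQatTypes = [2, 1, 1, 1, 1];
const toyRefTypes = [3, 3, 4, 1, 3];
console.log(diffTokenTypes(toyTokens, toyQatTypes, toyRefTypes));
```

Passing the QAT types first and the reference types second gives keys in the same direction as the lists above.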
Is it expected for the <unk> token to be CONTROL? UNKNOWN seems to make more sense in this case...