Token issue? <start_of_turn> split into multiple tokens

#3
by mw55a - opened

As the title says. start_of_turn seems to be split into multiple tokens when running under llama.cpp. Where as every other Gemma 3 gguf has it as the single token.

It only seems to be start_of_turn that has this problem. end_of_turn is still a single token.

Tokens that start_of_turn is split into are:

'<':655, 'start':3041, '_':236779, 'of':1340, '_':236779, 'turn':887, '>':236813

Other Gemma 3 ggufs have this as single token: 105.

Edit: Had to edit this post because it seems you can't write start_of_turn with the < .> around it on here. It just disappears from the text.

Can confirm seeing the same issue as well, with latest llama.cpp (b5050)

So I used HF's GGUF javascript package to inspect the GGUF:

import { gguf } from '@huggingface/gguf';

const { metadata, tensorInfos } = await gguf("https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf/resolve/main/gemma-3-12b-it-q4_0.gguf", {
  additionalFetchHeaders: {
    "Authorization": "Bearer " + process.env.HF_TOKEN,
  }
});

console.log(metadata['tokenizer.ggml.tokens']);

const startId = metadata['tokenizer.ggml.tokens'].indexOf("<start_of_turn>");
console.log(startId);
console.log(metadata['tokenizer.ggml.token_type'][startId]);

This prints out correctly the token ID 105 for . However, the token_type is set to 1 (normal token) while it should be 3 (control token)

Check again with non-QAT GGUF from unsloth, metadata['tokenizer.ggml.token_type'][startId] prints 3

const { metadata, tensorInfos } = await gguf("https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-BF16.gguf", {

(the rest of code is the same)

@google Can you regenerate the GGUF while correcting the token type? I think it's better to copy-paste this metadata key from an existing GGUF

Apparently this is not the only token issue these qat versions have. From here: https://www.reddit.com/r/LocalLLaMA/comments/1jvi860/comment/mmdgpim/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

"I just checked, there is indeed a whole lot of tokens (6411 to be precise) that are configured differently between the qat models and the models quantized with llama.cpp "

@ngxson
Here are the tokens that differ between a gguf generated from llama.cpp and the QAT. (ignoring all the <unused[N]> tokens wich are NORMAL instead of CONTROL)

NORMAL->CONTROL

['<mask>', '<start_of_turn>', '<end_of_turn>', '<start_of_image>', '<end_of_image>']

--------------------------
NORMAL->USER_DEFINED

['[multimodal]', '\n', '\n\n', '\n\n\n', '\n\n\n\n', '\n\n\n\n\n', '\n\n\n\n\n\n', '\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '▁▁', '▁▁▁', '▁▁▁▁', '▁▁▁▁▁', '▁▁▁▁▁▁', '▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '<table>', '<caption>', '<thead>', '<tbody>', '<tfoot>', '<tr>', '<th>', '<td>', '</table>', '</caption>', '</thead>', '</tbody>', '</tfoot>', '</tr>', '</th>', '</td>', '<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>', '<blockquote>', '</h1>', '</h2>', '</h3>', '</h4>', '</h5>', '</h6>', '</blockquote>', '<strong>', '<em>', '<b>', '<i>', '<u>', '<s>', '<sub>', '<sup>', '<code>', '</strong>', '</em>', '</b>', '</i>', '</u>', '</s>', '</sub>', '</sup>', '</code>', '<a>', '<html>', '<body>', '<img>', '<span>', '<bbox>', '<ul>', '<li>', '<div>', 
'<iframe>', '<footer>', '</a>', '</html>', '</body>', '</img>', '</span>', '</bbox>', '</ul>', '</li>', '</div>', '</iframe>', '</footer>', '\t', '\t\t', '\t\t\t', '\t\t\t\t', '\t\t\t\t\t', '\t\t\t\t\t\t', '\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t']

--------------------------
UNKNOWN->CONTROL

['<unk>']

Is it expected for the <unk> token to be CONTROL? UNKNOWN seems to make more sense in this case...

Your need to confirm your account before you can post a new comment.

Sign up or log in to comment