Duplicated bos_token when using apply_chat_template with the tokenizer

#20
by irvingjr - opened
    tokenizer: PreTrainedTokenizer = self.tokenizer_model
    # Render the conversation to a plain string (no tokenization yet).
    msg = tokenizer.apply_chat_template(list_of_msg, tokenize=False, tools=None)
    # add_special_tokens=True is the key here.
    # SFTTrainer from trl also sets add_special_tokens to True.
    outputs = tokenizer(
        msg, max_length=max_seq_len, truncation=True, add_special_tokens=True,
    )

Using the code above with the following messages:

message = [
    {"role": "system", "content": "You are an AI assistant."},
    {"role": "user", "content": "What is the meaning of life?."},
    {"role": "assistant", "content": "The meaning of life is 42."},
    {"role": "user", "content": "That's ridiculous."},
    {"role": "assistant", "content": "I agree."},
]

apply_chat_template outputs the following text:

<|begin▁of▁sentence|>You are an AI assistant.<|User|>What is the meaning of life?.<|Assistant|>The meaning of life is 42.<|end▁of▁sentence|><|User|>That's ridiculous.<|Assistant|>I agree.<|end▁of▁sentence|>

Note that <|begin▁of▁sentence|> is already rendered into the text, because the template adds it before the system content.


Since add_bos_token in tokenizer_config.json is true, calling the tokenizer with add_special_tokens=True prepends another BOS token, so the encoded sequence starts with two <|begin▁of▁sentence|> tokens.
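
A possible caller-side workaround, as a minimal sketch: since the template already renders the BOS token, the second call can pass add_special_tokens=False. The helper name encode_chat is hypothetical, and it assumes the tokenizer exposes apply_chat_template, bos_token_id, and the usual __call__ arguments. It only helps where you control the tokenizer call yourself, so it would not cover code that tokenizes internally with add_special_tokens=True.

```python
from typing import Any, Dict, List


def encode_chat(tokenizer: Any, messages: List[Dict[str, str]], max_length: int) -> List[int]:
    """Render the chat template, then tokenize without adding special tokens again.

    `encode_chat` is a hypothetical helper, not part of transformers.
    """
    # The template already emits the BOS token, so the rendered text is complete.
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    encoded = tokenizer(
        text,
        max_length=max_length,
        truncation=True,
        add_special_tokens=False,  # do not prepend a second BOS
    )
    input_ids = list(encoded["input_ids"])

    # Defensive check: fail loudly if the sequence still starts with two BOS ids.
    bos_id = tokenizer.bos_token_id
    if bos_id is not None and input_ids[:2] == [bos_id, bos_id]:
        raise ValueError("input_ids start with a duplicated BOS token")
    return input_ids
```

The explicit check is there so a template or config change that reintroduces the duplicate is caught at encoding time rather than silently affecting training.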

Maybe we can modify tokenizer_config.json and set add_bos_token to false by default?
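
Until the default changes, the same edit could be tried on a local copy of the model files. This is only a sketch: set_add_bos_token is a hypothetical helper, the path you pass is your own, and whether a given tokenizer class actually honors add_bos_token after the change is an assumption worth verifying by reloading the tokenizer and inspecting the first few input_ids.

```python
import json
from pathlib import Path


def set_add_bos_token(config_path: str, value: bool = False) -> dict:
    """Rewrite add_bos_token in a local tokenizer_config.json and return the new config.

    Hypothetical helper for illustration; it edits the file in place.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["add_bos_token"] = value
    # ensure_ascii=False keeps special tokens such as <|begin▁of▁sentence|> readable.
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return config
```

All other keys in the file are left as they were; only add_bos_token is overwritten.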
