Duplicated bos_token when using apply_chat_template with the tokenizer
tokenizer: PreTrainedTokenizer = self.tokenizer_model
msg = tokenizer.apply_chat_template(messages, tokenize=False, tools=None)
outputs = tokenizer(
    msg, max_length=max_seq_len, truncation=True, add_special_tokens=True,
)  # add_special_tokens is the key; SFTTrainer from trl also sets add_special_tokens=True.
Using the code above with the following messages:
messages = [
{"role": "system", "content": "You are an AI assistant."},
{"role": "user", "content": "What is the meaning of life?."},
{"role": "assistant", "content": "The meaning of life is 42."},
{"role": "user", "content": "That's ridiculous."},
{"role": "assistant", "content": "I agree."},
]
apply_chat_template produces the following text:
<|begin▁of▁sentence|>You are an AI assistant.<|User|>What is the meaning of life?.<|Assistant|>The meaning of life is 42.<|end▁of▁sentence|><|User|>That's ridiculous.<|Assistant|>I agree.<|end▁of▁sentence|>
Note that <|begin▁of▁sentence|> is already rendered into the string, since the chat template adds it before the system prompt,
and since add_bos_token is true in tokenizer_config.json, calling the tokenizer with add_special_tokens=True prepends a second BOS, so the encoded sequence starts with two <|begin▁of▁sentence|> tokens.
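For illustration, here is a minimal sketch of how the duplication shows up in the token ids. The checkpoint name is only an assumption; any tokenizer whose chat template already renders the BOS token behaves the same way.

from transformers import AutoTokenizer

# Assumption: a DeepSeek-style checkpoint whose chat template already renders
# <|begin▁of▁sentence|>; substitute the model you are actually using.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3")

messages = [{"role": "user", "content": "What is the meaning of life?."}]

text = tokenizer.apply_chat_template(messages, tokenize=False)
ids = tokenizer(text, add_special_tokens=True)["input_ids"]

# With add_bos_token=true in tokenizer_config.json, both leading ids are the BOS id.
print(ids[:2], tokenizer.bos_token_id)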
Maybe we could modify tokenizer_config.json and set add_bos_token to false by default?
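As a caller-side workaround (a sketch, assuming the chat template already renders the BOS token), the already-templated string can be encoded without adding special tokens again, or add_bos_token can be turned off on the tokenizer instance:

# Option 1: skip the tokenizer's own special tokens, since apply_chat_template
# already rendered <|begin▁of▁sentence|> into msg.
outputs = tokenizer(
    msg, max_length=max_seq_len, truncation=True, add_special_tokens=False,
)

# Option 2: keep add_special_tokens=True but disable the automatic BOS.
# Assumption: the tokenizer is a Llama/DeepSeek-style class that exposes add_bos_token.
tokenizer.add_bos_token = False
outputs = tokenizer(
    msg, max_length=max_seq_len, truncation=True, add_special_tokens=True,
)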