Duplicated bos_token when using apply_chat_template with the tokenizer
tokenizer: PreTrainedTokenizer = self.tokenizer_model
msg = tokenizer.apply_chat_template(messages, tokenize=False, tools=None)
outputs = tokenizer(
    msg, max_length=max_seq_len, truncation=True, add_special_tokens=True,
)  # add_special_tokens is the key; SFTTrainer from trl also sets add_special_tokens=True.
Using the code above with the following messages:
messages = [
{"role": "system", "content": "You are an AI assistant."},
{"role": "user", "content": "What is the meaning of life?."},
{"role": "assistant", "content": "The meaning of life is 42."},
{"role": "user", "content": "That's ridiculous."},
{"role": "assistant", "content": "I agree."},
]
apply_chat_template produces the following text:
<|begin▁of▁sentence|>You are an AI assistant.<|User|>What is the meaning of life?.<|Assistant|>The meaning of life is 42.<|end▁of▁sentence|><|User|>That's ridiculous.<|Assistant|>I agree.<|end▁of▁sentence|>
Note that <|begin▁of▁sentence|> is already rendered into the string, since the chat template adds it before the system prompt,
and since add_bos_token is true in tokenizer_config.json, calling the tokenizer with add_special_tokens=True prepends a second BOS, so the encoded sequence starts with two <|begin▁of▁sentence|> tokens.
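For illustration, here is a minimal sketch of how the duplication shows up in the token ids. The checkpoint name is only an assumption; any tokenizer whose chat template already renders the BOS token behaves the same way.

from transformers import AutoTokenizer

# Assumption: a DeepSeek-style checkpoint whose chat template already renders
# <|begin▁of▁sentence|>; substitute the model you are actually using.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3")

messages = [{"role": "user", "content": "What is the meaning of life?."}]

text = tokenizer.apply_chat_template(messages, tokenize=False)
ids = tokenizer(text, add_special_tokens=True)["input_ids"]

# With add_bos_token=true in tokenizer_config.json, both leading ids are the BOS id.
print(ids[:2], tokenizer.bos_token_id)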
Maybe we could modify tokenizer_config.json and set add_bos_token to false by default?
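As a caller-side workaround (a sketch, assuming the chat template already renders the BOS token), the already-templated string can be encoded without adding special tokens again, or add_bos_token can be turned off on the tokenizer instance:

# Option 1: skip the tokenizer's own special tokens, since apply_chat_template
# already rendered <|begin▁of▁sentence|> into msg.
outputs = tokenizer(
    msg, max_length=max_seq_len, truncation=True, add_special_tokens=False,
)

# Option 2: keep add_special_tokens=True but disable the automatic BOS.
# Assumption: the tokenizer is a Llama/DeepSeek-style class that exposes add_bos_token.
tokenizer.add_bos_token = False
outputs = tokenizer(
    msg, max_length=max_seq_len, truncation=True, add_special_tokens=True,
)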