Chat template of AutoTokenizer does not work on assistant mask tokens
#27 opened by doruktarhan6
Issue with Qwen2.5 Chat Template: Assistant Mask Always Zero
I'm encountering an issue when applying the chat template using AutoTokenizer for Qwen2.5 models: the assistant mask always comes back full of zeros. The chat template machinery seems to be looking for a `{% generation %}` keyword, but it doesn't exist in the template.
Code
from transformers import AutoTokenizer
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer2 = AutoTokenizer.from_pretrained(model_name)
msgs = [
    {"role": "system", "content": "This is system prompt"},
    {"role": "user", "content": "this is user prompt"},
    {"role": "assistant", "content": "This is assistant prompt"},
]
chat_without_tokenized = tokenizer2.apply_chat_template(msgs, tokenize=False)
max_length = 4000
rendered_text = tokenizer2.apply_chat_template(
    msgs,
    tokenize=True,
    max_length=max_length,
    add_generation_prompt=False,
    return_assistant_tokens_mask=True,
    return_dict=True,
)
#print(rendered_text)
input_ids = rendered_text["input_ids"]
attention_mask = rendered_text["attention_mask"]
assistant_mask = rendered_text["assistant_masks"]
print(f"Length InputIDs = {len(input_ids)} Input IDs: {input_ids}")
print(f"Length Attention Mask = {len(attention_mask)} Attention Mask: {attention_mask}")
print(f"Length Assistant Mask = {len(assistant_mask)} Assistant Mask: {assistant_mask}")
Output
return_assistant_tokens_mask==True but chat template does not contain `{% generation %}` keyword.
Length InputIDs = 27 Input IDs: [151644, 8948, 198, 1986, 374, 1849, 9934, 151645, 198, 151644, 872, 198, 574, 374, 1196, 9934, 151645, 198, 151644, 77091, 198, 1986, 374, 17847, 9934, 151645, 198]
Length Attention Mask = 27 Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Length Assistant Mask = 27 Assistant Mask: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
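For context on why the all-zero mask is a problem: the assistant mask is typically used to restrict the training loss to assistant tokens. A minimal sketch, using made-up token IDs (not the real output above), of how such a mask is commonly turned into labels:

```python
# Illustrative sketch: build training labels from an assistant token mask,
# setting every non-assistant token to -100 so the loss ignores it.
# The IDs and mask here are hypothetical, not taken from a real tokenizer run.
input_ids = [151644, 77091, 198, 1986, 374, 17847, 151645]
assistant_masks = [0, 0, 0, 1, 1, 1, 1]  # 1 = token belongs to an assistant turn

labels = [tok if keep else -100 for tok, keep in zip(input_ids, assistant_masks)]
print(labels)  # [-100, -100, -100, 1986, 374, 17847, 151645]
```

With an all-zero mask, every label becomes -100 and nothing contributes to the loss.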
I also tried decoding the template with add_generation_prompt set to both
True and False. Things get worse: the template seems to be wrong, as it appends the generation prompt again even though the conversation already ends with an assistant turn.
Decoded text (add_generation_prompt=True):
<|im_start|>system
This is system prompt<|im_end|>
<|im_start|>user
this is user prompt<|im_end|>
<|im_start|>assistant
This is assistant prompt<|im_end|>
<|im_start|>assistant
The original chat template of Qwen2.5 doesn't support assistant masks,
as there are no `{% generation %}` and `{% endgeneration %}` tags in it.
You can rewrite the template to support this feature:
qwen2_5_vl_template_add_generation = (
    "{% set image_count = namespace(value=0) %}"
    "{% set video_count = namespace(value=0) %}"
    "{% for message in messages %}"
    "{% if loop.first and message['role'] != 'system' %}"
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "{% endif %}"
    "<|im_start|>{{ message['role'] }}\n"
    "{% if message['content'] is string %}"
    "{% if message['role'] == 'assistant' %}"
    "{% generation %}"
    "{{ message['content'] }}"
    "{% endgeneration %}"
    "{% else %}"
    "{{ message['content'] }}"
    "{% endif %}"
    "<|im_end|>\n"
    "{% else %}"
    "{% for content in message['content'] %}"
    "{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}"
    "{% set image_count.value = image_count.value + 1 %}"
    "{% if add_vision_id %}"
    "Picture {{ image_count.value }}: "
    "{% endif %}"
    "<|vision_start|><|image_pad|><|vision_end|>"
    "{% elif content['type'] == 'video' or 'video' in content %}"
    "{% set video_count.value = video_count.value + 1 %}"
    "{% if add_vision_id %}"
    "Video {{ video_count.value }}: "
    "{% endif %}"
    "<|vision_start|><|video_pad|><|vision_end|>"
    "{% elif 'text' in content %}"
    "{% if message['role'] == 'assistant' %}"
    "{% generation %}"
    "{{ content['text'] }}"
    "{% endgeneration %}"
    "{% else %}"
    "{{ content['text'] }}"
    "{% endif %}"
    "{% endif %}"
    "{% endfor %}"
    "<|im_end|>\n"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}"
    "<|im_start|>assistant\n"
    "{% endif %}"
)
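After assigning a `{% generation %}`-aware template to tokenizer.chat_template, the mask comes back with ones over the assistant tokens. A minimal sketch, using a stripped-down text-only template for brevity (the `{% generation %}` markers are the relevant part; the full vision-aware template above works the same way, and this assumes Hub access to download the tokenizer):

```python
from transformers import AutoTokenizer

# Stripped-down, text-only template that wraps assistant content in
# {% generation %} ... {% endgeneration %} so the mask can be computed.
template = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n"
    "{% if message['role'] == 'assistant' %}"
    "{% generation %}{{ message['content'] }}{% endgeneration %}"
    "{% else %}"
    "{{ message['content'] }}"
    "{% endif %}"
    "<|im_end|>\n"
    "{% endfor %}"
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer.chat_template = template  # override the stock template

msgs = [
    {"role": "user", "content": "this is user prompt"},
    {"role": "assistant", "content": "This is assistant prompt"},
]
out = tokenizer.apply_chat_template(
    msgs,
    tokenize=True,
    return_assistant_tokens_mask=True,
    return_dict=True,
)
# The mask is now nonzero exactly over the assistant turn.
print(sum(out["assistant_masks"]))
```

Saving the tokenizer afterwards (tokenizer.save_pretrained) persists the overridden template alongside the tokenizer files.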