Chat Templates

A chat template is a Jinja2 snippet that formats messages into the string a model was trained on. For example:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
>>> tokenizer.chat_template
"{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
>>> tokenizer.apply_chat_template([{"role": "user", "content": "Hi!"}], tokenize=False)
'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi!<|im_end|>\n'

In most cases you don’t need to think about chat templates: models ship their template with the tokenizer, and TRL applies it for you. Some TRL recipes, however, rely on features that most shipped templates don’t include:

SFT with assistant_only_loss=True needs {% generation %} / {% endgeneration %} markers around assistant output, so the loss mask can target only assistant tokens.
GRPO with tool calls needs the template to be prefix-preserving: appending a tool message must not change how earlier messages are rendered.
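
The prefix-preservation requirement can be illustrated with a minimal sketch. The two render functions below are toy stand-ins for a chat template, not real TRL or transformers code, and the helper name is hypothetical; the check itself simply compares the rendering before and after a tool message is appended.

```python
from typing import Callable

Message = dict[str, str]
Renderer = Callable[[list[Message]], str]


def is_prefix_preserving(render: Renderer, messages: list[Message], extra: Message) -> bool:
    """Return True if appending `extra` leaves the earlier rendering untouched."""
    before = render(messages)
    after = render(messages + [extra])
    return after.startswith(before)


def render_stable(messages: list[Message]) -> str:
    # Toy template: every message is rendered the same way wherever it sits.
    return "".join(f"<|{m['role']}|>{m['content']}<|end|>\n" for m in messages)


def render_position_dependent(messages: list[Message]) -> str:
    # Toy template: the last message gets an extra marker, so its rendering
    # changes as soon as another message is appended.
    out = []
    for i, m in enumerate(messages):
        marker = "<|latest|>" if i == len(messages) - 1 else ""
        out.append(f"<|{m['role']}|>{marker}{m['content']}<|end|>\n")
    return "".join(out)


conversation = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "Let me call the calculator."},
]
tool_message = {"role": "tool", "content": "4"}

stable_ok = is_prefix_preserving(render_stable, conversation, tool_message)
dependent_ok = is_prefix_preserving(render_position_dependent, conversation, tool_message)
print(stable_ok, dependent_ok)
```

The second toy template fails the check because the assistant turn is rendered differently once it is no longer the last message, which is the kind of behavior the patched templates remove.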

TRL ships patched templates under trl/chat_templates/ for common families (Qwen, Llama, DeepSeek-V3, GPT-OSS, …) and swaps them in automatically for supported models. For any other model, you’ll need to patch its template yourself. The rest of this page catalogs what’s bundled.

Supported model families

TRL stores reference copies of the original templates so it can identify supported models at init and swap in a training template when needed. The following families are recognized: Cohere, Cohere2, DeepSeek-V3, Gemma, Gemma3, GLM-4-MoE, GPT-OSS, Idefics3, Llama 3 / 3.1 / 3.2, Llava-Next, Nemotron 3 (Nano, Super, Ultra), Phi-3, Phi-3.5, Qwen2-VL, Qwen2.5, Qwen2.5-VL, Qwen3 (including the Instruct-2507 variant), Qwen3-VL, Qwen3.5, Qwen3.6.

Training templates

These are patched templates that fix training-specific issues. TRL swaps them in at init when tools are enabled (GRPO) or when assistant_only_loss=True (SFT).

cohere_training.jinja

Patched Cohere template. Diff vs cohere.jinja:

Wrap assistant message output with {% generation %} / {% endgeneration %} so that return_assistant_tokens_mask=True produces correct masks for SFT assistant-only loss.

cohere2_training.jinja

Patched Cohere2 template. Diff vs cohere2.jinja:

Move the trailing <|END_OF_TURN_TOKEN|> from after the role-dispatch {% endif %} into each role branch, and wrap the assistant branch (<|START_RESPONSE|>...<|END_RESPONSE|><|END_OF_TURN_TOKEN|>) with {% generation %} / {% endgeneration %} so that return_assistant_tokens_mask=True produces correct masks for SFT assistant-only loss.

deepseekv3_training.jinja

Patched DeepSeek-V3 template. Diff vs deepseekv3.jinja:

Use | tojson on tool['function']['arguments'] so that arguments can be passed as a dict (the format documented by transformers). The original template uses raw string concatenation, which crashes on dict inputs.
Wrap assistant message output with {% generation %} / {% endgeneration %} markers for SFT assistant-only loss.
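
The dict-argument failure can be reproduced in plain Python, since Jinja's + operator follows Python's rules for a string and a dict. This is only a sketch: the tool call below is made up, and json.dumps stands in for the tojson filter, whose exact output formatting may differ.

```python
import json

# Hypothetical tool call whose arguments are passed as a dict.
tool_call = {"function": {"name": "get_weather", "arguments": {"city": "Paris"}}}
arguments = tool_call["function"]["arguments"]

# Raw concatenation, as in the original template, fails for a dict.
try:
    concatenated = "arguments=" + arguments
except TypeError:
    concatenated = None

# Serializing first (roughly what `| tojson` does) works for a dict.
serialized = "arguments=" + json.dumps(arguments)

print(concatenated)
print(serialized)
```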

gemma_training.jinja

Patched Gemma template (shared by Gemma and Gemma2, which ship identical chat templates). Diff vs gemma.jinja:

Split the unified assistant output so that the <start_of_turn>model\n header (a prompt cue, not generated by the model) sits outside the generation block, and wrap the assistant content with {% generation %} / {% endgeneration %} markers for SFT assistant-only loss.
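
As a rough illustration of why the header placement matters, the sketch below builds a character-level mask from segments tagged as inside or outside the generation block. It is an assumption-laden simplification: the real mask returned with return_assistant_tokens_mask=True is token-level, the helper name is hypothetical, and whether the closing <end_of_turn> sits inside the block is assumed here rather than taken from the template.

```python
# Each segment is (text, inside_generation_block).
Segment = tuple[str, bool]


def render_with_mask(segments: list[Segment]) -> tuple[str, list[int]]:
    """Concatenate the segments and mark which characters are assistant output."""
    text = "".join(piece for piece, _ in segments)
    mask: list[int] = []
    for piece, generated in segments:
        mask.extend([1 if generated else 0] * len(piece))
    return text, mask


segments = [
    ("<start_of_turn>user\nHi!<end_of_turn>\n", False),
    ("<start_of_turn>model\n", False),  # prompt cue, outside the block
    ("Hello!<end_of_turn>\n", True),    # assistant content (closing marker assumed inside)
]

text, mask = render_with_mask(segments)
masked_text = "".join(ch for ch, keep in zip(text, mask) if keep)
print(repr(masked_text))
```

With the header outside the block, the loss covers only what the model is expected to produce, not the cue that prompts it.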

gemma3_training.jinja

Patched Gemma 3 template. Same diff as gemma_training.jinja (split the unified output line into role-specific branches so the <start_of_turn>model\n prompt cue sits outside the generation block, and wrap the assistant content with {% generation %} / {% endgeneration %}), applied to the Gemma 3 base template that supports system messages and multimodal content blocks.

glm4moe_training.jinja

Patched GLM-4-MoE template. Diff vs glm4moe.jinja:

Require both <think> and </think> to be present before parsing, to avoid incorrect splitting when the model generates only one tag:

- {%- if '</think>' in content %}
+ {%- if '<think>' in content and '</think>' in content %}
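
The effect of the stricter condition can be sketched in Python. Only the condition comes from the diff above; the split expressions and the function name are assumptions about how a template might separate reasoning from the answer, not a copy of the actual template.

```python
def split_reasoning(content: str, require_both: bool) -> tuple[str, str]:
    """Return (reasoning, answer), parsing only when the tag check passes."""
    has_open = "<think>" in content
    has_close = "</think>" in content
    should_parse = (has_open and has_close) if require_both else has_close
    if not should_parse:
        return "", content
    # Assumed parsing logic: text before </think> is reasoning, the rest is the answer.
    reasoning = content.split("</think>")[0].rstrip("\n").split("<think>")[-1].lstrip("\n")
    answer = content.split("</think>")[-1].lstrip("\n")
    return reasoning, answer


# A generation that contains only the closing tag.
truncated = "I was about to answer.</think>\nThe answer is 4."
original_result = split_reasoning(truncated, require_both=False)
patched_result = split_reasoning(truncated, require_both=True)

# A well-formed generation is parsed the same way by both conditions.
complete = "<think>\n2 + 2 = 4\n</think>\n\nThe answer is 4."
complete_result = split_reasoning(complete, require_both=True)

print(original_result)
print(patched_result)
print(complete_result)
```

Under the original condition the truncated text is split as if it contained a full thinking block; with the patched condition it is left untouched.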

Wrap assistant message output (including the thinking block and tool calls) with {% generation %} / {% endgeneration %} markers for SFT assistant-only loss.

qwen3_training.jinja

Patched Qwen3 template. Diff vs qwen3.jinja:

Require both <think> and </think> to be present before parsing, to avoid incorrect splitting when the model generates only one tag:

- {%- if '</think>' in content %}
+ {%- if '<think>' in content and '</think>' in content %}

Always include the thinking block regardless of message position. The original conditionally omits it based on loop.last, which changes the assistant rendering when a tool message is appended, breaking prefix-preservation:

- {%- if loop.index0 > ns.last_query_index %}
-     {%- if loop.last or (not loop.last and reasoning_content) %}
-         {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
-     {%- else %}
-         {{- '<|im_start|>' + message.role + '\n' + content }}
-     {%- endif %}
- {%- else %}
-     {{- '<|im_start|>' + message.role + '\n' + content }}
- {%- endif %}
+ {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
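
To see why the original branch breaks prefix-preservation, the diff above can be transliterated into Python. This is a sketch, not the template itself: the function and its parameters are hypothetical, and it assumes that appending a tool message leaves ns.last_query_index unchanged.

```python
def render_assistant(
    content: str,
    reasoning: str,
    *,
    index0: int,
    last_query_index: int,
    is_last: bool,
    patched: bool,
) -> str:
    """Render the start of an assistant turn following the original or patched branch."""
    with_think = (
        "<|im_start|>assistant\n<think>\n"
        + reasoning.strip("\n")
        + "\n</think>\n\n"
        + content.lstrip("\n")
    )
    without_think = "<|im_start|>assistant\n" + content
    if patched:
        # Patched template: the thinking block is always emitted.
        return with_think
    # Original template: the thinking block depends on message position.
    if index0 > last_query_index:
        if is_last or (not is_last and reasoning):
            return with_think
        return without_think
    return without_think


# Assistant turn at index 1, after a user query at index 0, with no reasoning.
common = dict(content="Calling the tool.", reasoning="", index0=1, last_query_index=0)

# Before the tool message is appended, the assistant turn is the last message.
original_before = render_assistant(**common, is_last=True, patched=False)
patched_before = render_assistant(**common, is_last=True, patched=True)

# After the tool message is appended, it no longer is.
original_after = render_assistant(**common, is_last=False, patched=False)
patched_after = render_assistant(**common, is_last=False, patched=True)

print(original_before == original_after)
print(patched_before == patched_after)
```

In this scenario the original branch emits an empty thinking block only while the assistant turn is last, so the earlier rendering changes once the tool message arrives; the patched line renders the turn identically in both cases.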

Wrap assistant message output with {% generation %} / {% endgeneration %} so that return_assistant_tokens_mask=True produces correct masks for SFT assistant-only loss.

qwen3_vl_training.jinja

Patched Qwen3-VL template. Diff vs qwen3_vl.jinja:

Wrap assistant message output (both content and tool_calls) with {% generation %} / {% endgeneration %} so that return_assistant_tokens_mask=True produces correct masks for SFT assistant-only loss.

gptoss_training.jinja

Patched GPT-OSS template. Diff vs gptoss.jinja:

Wrap assistant message output with {% generation %} / {% endgeneration %} so that return_assistant_tokens_mask=True produces correct masks for SFT assistant-only loss.

idefics3_training.jinja

Patched Idefics3 template. Diff vs idefics3.jinja:

Split the assistant message into its own branch so the {% generation %} / {% endgeneration %} markers wrap the assistant content. This enables return_assistant_tokens_mask=True to produce correct masks for SFT assistant-only loss.

llama3_training.jinja

Patched Llama 3 template. Diff vs llama3.jinja:

Wrap assistant message output with {% generation %} / {% endgeneration %} so that return_assistant_tokens_mask=True produces correct masks for SFT assistant-only loss.

llava_next_training.jinja

Patched Llava-Next template. Diff vs llava_next.jinja:

Wrap assistant message output with {% generation %} / {% endgeneration %} so that return_assistant_tokens_mask=True produces correct masks for SFT assistant-only loss.

nemotron_3_nano_training.jinja

Patched Nemotron Nano template. Diff vs nemotron_3_nano.jinja: the original is already prefix-preserving, so the only change is wrapping assistant message output with {% generation %} / {% endgeneration %} so that return_assistant_tokens_mask=True produces correct masks for SFT assistant-only loss.

nemotron_3_super_training.jinja

Patched Nemotron Super template. Diff vs nemotron_3_super.jinja: same as nemotron_3_nano_training.jinja — the original is already prefix-preserving, so the only change is wrapping assistant message output with {% generation %} / {% endgeneration %} so that return_assistant_tokens_mask=True produces correct masks for SFT assistant-only loss.

nemotron_3_ultra_training.jinja

Patched Nemotron Ultra template. Diff vs nemotron_3_ultra.jinja: same as nemotron_3_nano_training.jinja — the original is already prefix-preserving, so the only change is wrapping assistant message output with {% generation %} / {% endgeneration %} so that return_assistant_tokens_mask=True produces correct masks for SFT assistant-only loss.

phi3_training.jinja

Patched Phi-3 template. Diff vs phi3.jinja:

Wrap assistant message output with {% generation %} / {% endgeneration %} so that return_assistant_tokens_mask=True produces correct masks for SFT assistant-only loss.

phi3_5_training.jinja

Patched Phi-3.5 template. Diff vs phi3_5.jinja:

Wrap assistant message output with {% generation %} / {% endgeneration %} so that return_assistant_tokens_mask=True produces correct masks for SFT assistant-only loss.

qwen2_5_training.jinja

Patched Qwen2.5 template. Diff vs qwen2_5.jinja:

Wrap assistant message output with {% generation %} / {% endgeneration %} so that return_assistant_tokens_mask=True produces correct masks for SFT assistant-only loss.

qwen2_5_vl_training.jinja

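Patched Qwen2.5-VL template (also used for Qwen2-VL, which ships a byte-identical template). Diff vs qwen2_5_vl.jinja:

Wrap assistant message output with {% generation %} / {% endgeneration %} so that return_assistant_tokens_mask=True produces correct masks for SFT assistant-only loss.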

qwen3_instruct_2507_training.jinja

Patched Qwen3-Instruct-2507 template (used by models like Qwen3-4B-Instruct-2507, which ship a simpler Qwen3 variant without reasoning_content / <think> parsing, multi_step_tool tracking, or the enable_thinking flag). Diff vs qwen3_instruct_2507.jinja:

Wrap assistant message output with {% generation %} / {% endgeneration %} so that return_assistant_tokens_mask=True produces correct masks for SFT assistant-only loss.

qwen3_5_think_training.jinja / qwen3_5_nothink_training.jinja

Patched Qwen3.5 templates. The two flavors share the same logic and differ only in the default value of the enable_thinking flag: qwen3_5_think_training.jinja defaults to thinking enabled (used by Qwen3.5-4B and larger), while qwen3_5_nothink_training.jinja defaults to thinking disabled (used by Qwen3.5-2B and smaller). Diff vs qwen3_5_think.jinja / qwen3_5_nothink.jinja: same set of changes as qwen3_training.jinja, namely require both <think> and </think> to be present before parsing, drop the loop.index0 > ns.last_query_index conditional so the thinking block is always emitted (prefix-preservation), and wrap assistant output with {% generation %} / {% endgeneration %} markers for SFT assistant-only loss.

qwen3_6_training.jinja

Patched Qwen3.6 template. Diff vs qwen3_6.jinja: same set of changes as qwen3_training.jinja — require both <think> and </think> to be present before parsing, drop the loop.index0 > ns.last_query_index conditional so the thinking block is always emitted (prefix-preservation), and wrap assistant output with {% generation %} / {% endgeneration %} markers for SFT assistant-only loss.

Related utilities

See Chat Template Utilities for the helper functions (clone_chat_template(), is_chat_template_prefix_preserving(), get_training_chat_template()) that operate on these templates.
