Corrected Jinja template with tool support; works with PR llama.cpp/pull/15186

#9
by xbruce22 - opened

I have added tool support to llama.cpp here, but parsing had issues with Zai's provided Jinja template.
I have added the corrected template here.

Works great now with Cline πŸ’ͺ.

Works great with MCP settings too πŸ”₯.

Unsloth AI org

Thanks, we're gonna investigate!!

@xbruce22 I'm seeing GLM 4.5 Air struggling to find the stop token and entering chat completion repetition loops when using the template you attached to this PR.

Notably, I am running llama.cpp commit b049315 compiled from master, not your llama.cpp branch.

Edit: I am running your branch now and seeing the same issue in open-webui tool calling.

I've only seen loops when both (1) the template is being applied at llama-server runtime and (2) the model is invoking a tool call. This could be a coincidence. It's happened when the model invokes tool calls through Kilo's VS Code extension and open-webui MCPO.


Here is my llama-server command:

#!/bin/bash
#
llama-server \
  -m  ./GGUF/GLM-4.5-Air-UD-Q5_K_XL.gguf \
  --alias "GLM-4.5-Air-UD-Q5_K_XL" \
  --host 0.0.0.0 \
  --port 8080 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -fa \
  -ngl 99 \
  --metrics \
  --jinja \
  --chat-template-file ./glm4.5_chat_template.jinja

Hey @ernestr, let me rebuild and check. Meanwhile, can you try it in your Jupyter notebook using the OpenAI API?
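
Something like this minimal sketch is what I have in mind, assuming the server from your command above is on localhost:8080 and you use the openai Python client (get_weather is just a placeholder tool):

# Minimal tool-call smoke test against llama-server's OpenAI-compatible API.
# Assumes llama-server is running on localhost:8080 with --jinja and the
# corrected chat template; get_weather is a placeholder tool for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="GLM-4.5-Air",  # model name is mostly informational for a single-model llama-server
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
)

msg = resp.choices[0].message
print("tool_calls:", msg.tool_calls)  # should contain a get_weather call if parsing works
print("content:", msg.content)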

Update 1:

After the rebuild, I was able to use Kilo Code successfully without any issue. I used the following command to run llama-server:
./llama.cpp/build/bin/llama-server -hf unsloth/GLM-4.5-Air-GGUF:IQ2_M --alias GLM-4.5-Air-GPUs -c 60000 --host 0.0.0.0 -np 1 -ngl 999 -ts 72,28 -b 1024 -ub 256 --jinja --chat-template-file template/chat_template.jinja

Try using different temperature values.

Testing open-webui next.

I found an error in the template: it incorrectly removes the assistant's block before its turn is over. This prevents it from using the reasoning tokens it generated after the tool response and before the next user message. This bug happens because tool responses are returned as a "user" message.

To fix this, modify lines 39-43 to:

{%- for m in messages %}
    {%- if m.role == 'user' %}
        {%- set user_content = visible_text(m.content) -%}
        {%- if not user_content.startswith("{\n  \"tool_response\"") %}
            {% set ns.last_user_index = loop.index0 -%}
        {%- endif -%}
    {%- endif %}
{%- endfor %}
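
For context, the startswith check above is guarding against the way the tool result comes back: it is injected as a "user" message whose content is a JSON blob beginning with {\n  "tool_response". A hypothetical messages list showing the shape (the fields inside the blob are just an illustration; only the leading prefix matters for the check):

# Illustration only: the tool result arrives as a "user" message. The
# startswith('{\n  "tool_response"') check keeps ns.last_user_index pointing
# at the real user turn instead of at this injected blob.
messages = [
    {"role": "user", "content": "List the files in the repo."},   # real user turn
    {"role": "assistant", "content": "<think>need to call a tool</think>",
     "tool_calls": [{"function": {"name": "list_files", "arguments": "{}"}}]},
    {"role": "user",                                               # tool response, not a real user turn
     "content": '{\n  "tool_response": {"name": "list_files", "result": ["README.md"]}\n}'},
    {"role": "assistant", "content": "The repo contains README.md."},
]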

I found an error in the template: it incorrectly removes the assistant's block before its turn is over. This prevents it from using the reasoning tokens it generated after the tool response and before the next user message. This bug happens because tool calls are returned as a "user" message.

To fix this, modify lines 39-43 to:

{%- for m in messages %}
    {%- if m.role == 'user' %}
        {%- set user_content = visible_text(m.content) -%}
        {%- if not user_content.startswith("{\n  \"tool_response\"") %}
            {% set ns.last_user_index = loop.index0 -%}
        {%- endif -%}
    {%- endif %}
{%- endfor %}

I really appreciate the help, brother. Thank you for pointing it out. How did you notice it? Please share your methodology or details so that I can check for this myself next time.

I was troubleshooting, so I enabled the /slots endpoint on llama.cpp, started a new chat conversation with tool support (using open-webui), and then hit the endpoint during both assistant phases:

  • User turn 1
  • Assistant turn 1
    • Assistant generates
    • Tool call response
    • Assistant generates again

And did the same for turn 2 as well.
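
Roughly how I poll it, assuming llama-server was started with --slots; the exact response fields may differ between llama.cpp versions:

# Dump the raw rendered context of each slot via llama-server's /slots endpoint.
# Requires the server to be started with --slots; the field names used here
# ("id", "prompt") are what I saw and may vary by llama.cpp version.
import requests

resp = requests.get("http://localhost:8080/slots", timeout=10)
resp.raise_for_status()

for slot in resp.json():
    print(f"--- slot {slot.get('id')} ---")
    print(str(slot.get("prompt", ""))[:2000])  # first part of the rendered prompt/context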

But after a tool call, the assistant generally generates again, right? How did you notice the issue?

I was originally using the chat template built into the GGUF, and llama.cpp server gave a 500 error after the tool call, when the assistant was supposed to generate again. The error message was related to Jinja processing. So I copied your chat template into a file and set llama.cpp server to use it. I wanted to verify that it worked, so I used the /slots endpoint to inspect the raw LLM context during the chat, before and after the tool call.

I ran into another issue. It doesn't handle the case where there are multiple thinking blocks in the same interaction (which happens when the assistant decides to use a tool, then thinks more afterwards).

Here's what I'm using in place of lines 56-77 to fix it, though it's quick-and-dirty. It concatenates everything inside <think> tags and sets it to reasoning_content, and concatenates everything outside <think> tags and sets it to content.

{%- elif m.role == 'assistant' -%}
<|assistant|>
{%- set reasoning_content = '' %}
{%- set content = visible_text(m.content) %}
{%- if m.reasoning_content is string %}
    {%- set reasoning_content = m.reasoning_content %}
{%- else %}
    {%- if '<think>' in content %}
        {%- set reasoning_content_ns = namespace(reasoning_content='') %}
        {%- set content_outside_ns = namespace(content_outside='') %}
        {%- set parts = content.split('<think>') %}
        {%- set counter_ns = namespace(i=-1) %}
        {%- for part in parts %}
            {%- set counter_ns.i = counter_ns.i + 1 %}
            {%- if counter_ns.i == 0 %}
                {%- set content_outside_ns.content_outside = content_outside_ns.content_outside + part %}
            {%- else %}
                {%- if '</think>' in part %}
                    {%- set think_split = part.split('</think>') %}
                    {%- set reasoning_content_ns.reasoning_content = reasoning_content_ns.reasoning_content + '\n' + think_split[0] %}
                    {%- if think_split|length > 1 %}
                        {%- set content_outside_ns.content_outside = content_outside_ns.content_outside + '\n' + think_split[1] %}
                    {%- endif %}
                {%- else %}
                    {%- set reasoning_content_ns.reasoning_content = reasoning_content_ns.reasoning_content + '\n' + part %}
                {%- endif %}
            {%- endif %}
        {%- endfor %}
        {%- set reasoning_content = reasoning_content_ns.reasoning_content.lstrip('\n') %}
        {%- set content = content_outside_ns.content_outside.lstrip('\n') %}
    {%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_user_index and reasoning_content -%}
{{ '\n<think>' + reasoning_content.strip() + '</think>'}}
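
To sanity-check what that block should produce, here is the same split logic as a plain-Python sketch (not part of the template; the sample text is made up):

# Mirrors the Jinja logic above: everything inside <think>...</think> is
# concatenated into reasoning_content, everything outside into content.
def split_thinking(text: str) -> tuple[str, str]:
    reasoning, outside = "", ""
    parts = text.split("<think>")
    outside += parts[0]
    for part in parts[1:]:
        if "</think>" in part:
            think, _, rest = part.partition("</think>")
            reasoning += "\n" + think
            outside += "\n" + rest
        else:
            reasoning += "\n" + part
    return reasoning.lstrip("\n"), outside.lstrip("\n")

sample = ("<think>plan the tool call</think>"
          "<tool_call>list_files</tool_call>"
          "<think>interpret the result</think>"
          "The repo contains README.md.")
reasoning_content, content = split_thinking(sample)
print(reasoning_content)  # "plan the tool call\ninterpret the result"
print(content)            # "<tool_call>list_files</tool_call>\nThe repo contains README.md."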

Hey @xbruce22, I really appreciate you attempting to replicate.

I saw the repetition (rather than a stop token) with Kilo's default temperature (0.0 per their docs) and when manually setting the temperature to 0.6. I'm using the default batch sizes and a larger max context as well as q8 cache, but otherwise I can't find a meaningful delta in our llama-server commands. Of course, I could be missing it!

llama-server -m GLM-4.5-Air-UD-Q5_K_XL.gguf --alias "GLM-4.5-Air-UD-Q5_K_XL" --host 0.0.0.0 --port 8080 --cache-type-k q8_0 --cache-type-v q8_0 -fa -ngl 99 --metrics --jinja --chat-template-file /home/x0xxin/ML_Scripts/Llama.cpp/glm4.5_chat_template.jinja --slots --verbose

Tool calling definitely works because it creates the todo list. Then it hits patches like these while trying to stop:

[screenshots of the repetitive output]

The only other significant difference I can think of is that I'm using a UD_Q5 quant, which one would assume is more capable. I double-checked that I'm using the original chat template from your PR and can confirm that I see "srv params_from_: Chat format: GLM 4.5" in the logs. I am still testing with e0ee297. Also, I think I mentioned this in the other conversation, but I am able to successfully execute your Python script that demos the tool call.
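
If it helps, one more way to double-check which template the server actually loaded, assuming your llama.cpp build's /props response includes a chat_template field (this may vary by version):

# Compare the chat template llama-server reports against the file on disk.
# Assumption: /props returns a "chat_template" field in this llama.cpp build.
import requests

props = requests.get("http://localhost:8080/props", timeout=10).json()
loaded = props.get("chat_template", "")

with open("glm4.5_chat_template.jinja") as f:
    on_disk = f.read()

print("template matches file:", loaded.strip() == on_disk.strip())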

Hi @ernestr, a template change cannot cause repetition. This problem is due to quantization. Try using GLM 4.5 Air IQ2_M.

Hey @bgreene010, really appreciate the feedback. Let me test.

Update 1:

For me, MCP tool calling worked, even with multiple tool calls in a single response and with proper thinking. I tried continuing the conversation for longer and it was still working with all the tool calls.

Can you please share the exact prompt and framework details so that I can replicate it on my system as well?

Ready to merge
This branch is ready to get merged automatically.
