Is it possible to disable thinking?
I tried to use
Tried to start with /no_think
It always thinks...
Is that a qwen3moe-only thing? I'm not sure it is in R1-0528, and I'm away from my desk right now, so maybe someone else can chime in. If you don't want thinking, I also have https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF you could try.
Something like adding to the template should work, but I don't think ik_llama supports jinja templates, so you may have to inject it in your completion request.
I managed using the text completions endpoint by pre-filling the response with something like:
<think>
Okay, the user wants me to respond immediately. Here's my response.
</think>
Sorry for the noob question
It's not a noob question, I spend ages messing around with chat templates :)
TLDR for ST at the bottom
this is before the front end per se?
Nope, in the front-end. With completions mode, you have full control over the prompt template.
I tend to test these out with mikupad first (you can just download the .html file and open it locally) since every reasoning model is different (eg. GLMZ, Cogito, etc)
This is my DeepseekNoThink template:
<|begin▁of▁sentence|><|User|>{{input}}<|Assistant|><think>
Okay, the user wants me to respond immediately, no need to think anymore.
</think>
For example
<|begin▁of▁sentence|><|User|>Is zero a negative or positive number?<|Assistant|><think>
Okay, the user wants me to respond immediately, no need to think anymore.
</think>
Generates the response immediately:
TLDR for ST
For SillyTavern, I tend to put it in the "Assistant Message Sequences" so it doesn't have "thought for 0 seconds" in the chat window.
<|Assistant|><think>
Okay, the user wants me to respond immediately, no need to think anymore.
</think>
(That will just look like a non-reasoning model when you use it)
But you can also put it in the "Start reply with" section in Reasoning Formatting (on the right of the UI)
I tend to do this if I want to actually use reasoning but steer the direction it takes, e.g.:
<think>
Okay, time to plan my reply, ensuring I don't use any em-dashes or asterisks
I went back and re-read the DeepSeek model card and notes, but I don't think it was trained with an official /no_think prompt to disable thinking. I believe the official approach would be to use DeepSeek-V3-0324 for no thinking and DeepSeek-R1-0528 for thinking.
But as I understand it, you have a couple of ways to attempt to abuse the model to short-circuit thinking:
1. Completions Endpoint Chat Template
This assumes you are working low-level, tokenizing the strings yourself and feeding them directly to the model or to the llama-server completions endpoint (not chat/completions). It is a hack: you send a partial assistant response without closing it with the expected <|end▁of▁sentence|>
system_message=""
user_message_1="Write a complex python app to find all perfect numbers. Be brief and don't think too much."
# normally you would do the following and send a completely tokenized response
<|begin▁of▁sentence|>{system_message}<|User|>{user_message_1}<|Assistant|>
# but now we abuse the format by sending a partially tokenized string without terminating it properly and hope the LLM continues as if thinking were done
assistant_message_1="<think>Okay, I'll just write the code immediately.</think> Certainly! Here is the code: "
<|begin▁of▁sentence|>{system_message}<|User|>{user_message_1}<|Assistant|>{assistant_message_1}
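The pre-fill trick above can be sketched in Python roughly as follows. This is only a minimal sketch under a few assumptions: the server is assumed to listen on http://localhost:8080 and to expose a llama.cpp-style `/completion` endpoint that takes `prompt` and `n_predict` and returns a `content` field (I have not verified this against ik_llama), and the helper names `build_prefilled_prompt`, `build_completion_payload` and `complete` are made up for illustration. If your server adds its own BOS token, you may want to drop it from the string.

```python
import json
import urllib.request

# Special tokens copied from the DeepSeek template shown above.
BOS = "<|begin▁of▁sentence|>"
USER = "<|User|>"
ASSISTANT = "<|Assistant|>"

DEFAULT_PREFILL = (
    "<think>Okay, I'll just write the code immediately.</think> "
    "Certainly! Here is the code: "
)


def build_prefilled_prompt(user_message, system_message="", prefill=DEFAULT_PREFILL):
    """Build a raw prompt that ends inside the assistant turn.

    The turn is deliberately NOT closed with <|end▁of▁sentence|>, so the
    model continues after the already "finished" think block.
    """
    return f"{BOS}{system_message}{USER}{user_message}{ASSISTANT}{prefill}"


def build_completion_payload(prompt, n_predict=512):
    """Request body for a llama.cpp-style /completion endpoint (assumed)."""
    return {"prompt": prompt, "n_predict": n_predict}


def complete(prompt, base_url="http://localhost:8080", n_predict=512):
    """POST the raw prompt to the text completions endpoint (not chat/completions)."""
    body = json.dumps(build_completion_payload(prompt, n_predict)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["content"]
```

Calling `complete(build_prefilled_prompt("Write a complex python app to find all perfect numbers."))` against a running server should then return only the continuation after the pre-filled text.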
2. Chat Thread Injection
You could also try to inject text to suppress thinking in the chat thread; however, it will be tokenized in such a way that it doesn't match the expected format. It is easier as you can just type it into any GUI like open-webui or ST etc:
chat_thread = [
{"role": "system", "content": "You are a helpful AI."},
{"role": "user", "content": "Write a complex python app to find all perfect numbers. Be brief and don't think too much."},
{"role": "assistant", "content": "<think>Okay, I'll just write the code immediately.</think> Certainly! Here is the code: "},
]
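If you want to send a thread like that from a script instead of a GUI, something along these lines might work. It is a rough sketch, not a tested recipe: it assumes an OpenAI-compatible `/v1/chat/completions` endpoint on http://localhost:8080, the function names are hypothetical, and whether the server's chat template actually continues a trailing assistant message (rather than closing it and starting a new turn) is not guaranteed.

```python
import json
import urllib.request


def build_chat_payload(user_message, prefill, system_message="You are a helpful AI."):
    """Chat thread whose last message is a partial assistant reply.

    The server applies its own chat template to this list, so the
    result may not match the raw-prompt format exactly.
    """
    return {
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": prefill},
        ]
    }


def send_chat(payload, base_url="http://localhost:8080"):
    """POST the thread to an OpenAI-compatible chat endpoint (assumed URL)."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["choices"][0]["message"]["content"]
```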
Anyway, you can probably try a few combinations like this to try to hack it to reduce thinking. There may be some system prompts that reduce or influence thinking too, not really sure how it was trained.
As a comparison, it is interesting that the new https://huggingface.co/MiniMaxAI/MiniMax-M1-40k/ has both a 40k "thinking budget" version and an 80k version as well. I assume they were trained with different length examples and that is how they "control" the thinking budget. I assume Qwen was trained with examples that had /no_think which didn't use thinking traces. So it's not really a digital thinking on/off knob but more a way to influence the output given specific training examples.
P.S. Make sure you only put that in one place (either the "Assistant Message Sequences" or the "Start reply with"), not both.
@ubergarm I mostly want to test it as something like a "DeepSeek V3 0528". I know they prob have it but it's not released haha. Many thanks for the info as well, I also tried /no_think but no dice, as that seems to be a Qwen3 only thing.
@gghfez Many thanks for all the help! Gonna try when I get home after work.
@Panchovix
Np. And yeah, /no_think is a qwen thing they specifically trained it on. Cogito has an equivalent prompt as well.
You can also get some of the newer models like command-a (and DeepSeek-V3 0425) to think a bit by enabling reasoning then adding <think> Okay, I need to respond as {{char}} to the "Start reply with". I guess these models are aware of these thinking tags from seeing QwQ and R1 output in their training data.
@ubergarm
I'll try out that chat_thread thing later. I've been using a crude fastapi proxy for OpenWebUI (it doesn't support text completions) to wrap chat completions -> text completions.
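For anyone wanting to build a similar wrapper, the core of it is just turning the chat messages into a raw prompt with the no-think block appended. Below is a minimal sketch of that conversion step only (no FastAPI, no HTTP). It assumes the DeepSeek turn format used earlier in this thread, with earlier assistant turns closed by <|end▁of▁sentence|>, and that the last message in the list is from the user; `chat_to_prompt` is a made-up name, not the actual proxy code.

```python
# Special tokens as they appear in the templates earlier in this thread.
BOS = "<|begin▁of▁sentence|>"
EOS = "<|end▁of▁sentence|>"
USER = "<|User|>"
ASSISTANT = "<|Assistant|>"

NO_THINK_PREFILL = (
    "<think>\n"
    "Okay, the user wants me to respond immediately, no need to think anymore.\n"
    "</think>\n"
)


def chat_to_prompt(messages, prefill=NO_THINK_PREFILL):
    """Render chat-completions style messages into one text-completions prompt.

    Assumes the final message is a user turn; the prompt ends with an open
    assistant turn that already contains a finished think block.
    """
    system = "".join(m["content"] for m in messages if m["role"] == "system")
    parts = [BOS, system]
    for message in messages:
        if message["role"] == "user":
            parts.append(USER + message["content"])
        elif message["role"] == "assistant":
            # Earlier assistant turns are complete, so they get closed.
            parts.append(ASSISTANT + message["content"] + EOS)
    # Open a new assistant turn and pre-fill it; no EOS here on purpose.
    parts.append(ASSISTANT + prefill)
    return "".join(parts)
```

The returned string is what would go into the `prompt` field of the text completions request; the proxy then wraps the generated text back into a chat completions response.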