In complex reasoning tasks, Qwen3 is far behind QwQ
Hi,
I am currently testing advanced reasoning in Qwen3, and I expected it to be better than QwQ because it is bigger and has better benchmark results. Unfortunately, I am completely disappointed - it seems that benchmarks are not reliable in some cases. I noticed this before with the difference between QwQ and DeepSeek-R1: R1 has slightly better benchmarks, but QwQ generates chains of thought that are many times longer and performs best on my tasks. QwQ-Preview had some problems with senseless reasoning loops, but this was fixed in QwQ, and the model was able to generate chains of even 30k tokens without hallucinations - something no other model can do.
On the other hand, Qwen3 generates very short chains (so the answers are much faster, but the quality is also much worse), even shorter than DeepSeek-R1's, and I think this is the reason for its worse results. After a 20k-token initial input message, each response was full of hallucinations - most of the subsequent interactions were focused on correcting the model's errors, and after it finally understood one point correctly, it started hallucinating about something else.
So I have to go back to QwQ, because it's really hard to work with Qwen3.
P.S. Please don't get me wrong - I still think you make some of the best open source models, and QwQ is my personal favorite. But with Qwen3, unfortunately, something went wrong in some advanced cases and the quality dropped dramatically.
For us the experience is different, because when we ask it to complete a task in a non-English language (e.g. Chinese, Vietnamese, Korean, ...), QwQ sometimes hallucinates while Qwen3 does not. We later found out that the latter supports 119+ languages, including all of the languages we mentioned.
In my case it's English only, but very complex reasoning tasks - designing new deep learning architectures/algorithms, advanced coding (a deep learning library), and a lot of theory. The main problem was that when we talked about 3 different architectures and 3 different algorithms, Qwen3 mixed and confused them in every answer - I asked about QA datasets for fine-tuning, and it prepared instructions for QA datasets for reinforcement learning, etc. Generally, on this topic it generates chains about 4-5x shorter than QwQ, and I think that is the reason for the described behavior.
On the other hand, I just tested a more generic prompt on Qwen3, QwQ and DeepSeek-R1 - calculating the size of two transformer models (an encoder and a decoder) from provided params. In this case I have a different expectation: if all models calculate it correctly, then the best one is the fastest:
- DeepSeek-R1 - ~5.5k tokens
- QwQ - ~6.5k tokens
- Qwen3 - in the first test it stopped after 10k tokens without an answer (it never returned from the chain); in the second test it also stopped, but after 13k tokens... That's enough - in that time I could calculate it manually many times
So it seems that Qwen3 is not even able to calculate very simple formulas, because of a very stupid loop - for most of the generated chain, it was trying to decide whether Positional Encoding is an Embedding or Non-Embedding parameter. Which is completely bizarre, because the models described in the provided details use RoPE, which has no trainable parameters. Every time Qwen3 was almost ready to escape the chain and provide the answer, it stopped with something like "Wait, but is POS an Embedding or Non-Embedding param?", no matter that the details said POS=0 :/ That was a problem in QwQ-Preview, mostly fixed in QwQ, and now it's back with Qwen3, so this model seems to be a big regression in complex reasoning tasks.
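Just to be clear what I mean by "very simple formulas", here is a minimal sketch of this kind of size calculation - assuming a standard pre-norm transformer with biases omitted (they add well under 1% of the total) and the output head tied to the embedding. The hyperparameters below are made up for illustration, not the ones from my actual prompt. Note the positional-encoding line: for RoPE it contributes exactly zero trainable parameters, which is the point Qwen3 kept looping on:

```python
def transformer_params(vocab_size, d_model, n_layers, d_ff,
                       learned_pos=False, max_len=0, cross_attention=False):
    embed = vocab_size * d_model                      # token embedding table (tied LM head)
    pos = max_len * d_model if learned_pos else 0     # RoPE/sinusoidal: POS = 0
    attn = 4 * d_model * d_model                      # Q, K, V, O projections
    ffn = 2 * d_model * d_ff                          # up + down projections
    norms = 2 * 2 * d_model                           # 2 LayerNorms (gain + bias)
    layer = attn + ffn + norms
    if cross_attention:                               # decoder layers add cross-attn
        layer += 4 * d_model * d_model + 2 * d_model  # + its own LayerNorm
    return embed + pos + n_layers * layer

# hypothetical hyperparameters, for illustration only
print(transformer_params(32_000, 768, 12, 3072))                         # ~110M (encoder)
print(transformer_params(32_000, 768, 12, 3072, cross_attention=True))   # ~138M (decoder)
```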
I'm still testing and will provide more details when I find more issues. I hope you will take care of this in the future.
Did you try QwQ or Qwen3 in HuggingChat or in the official Qwen Chat version?
HuggingChat has a character-count limit on a single message (50k chars; after that you'll get a "Message too long" error regardless of the model's context size - it's an API limitation), so I'm using it only for simple tasks and model tests. For advanced tasks I'm using a custom provider - Hyperbolic. But it shouldn't matter here - the only differences might be quantization and YaRN usage; apart from that it's the same model. I'm also using the recommended temperature (0.6) and top_p (0.95) settings.
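For reference, this is roughly how I call it - a minimal sketch assuming the provider exposes an OpenAI-compatible endpoint; the base URL, API key and prompt here are placeholders, not my exact setup:

```python
from openai import OpenAI

# placeholders - substitute your provider's actual endpoint and credentials
client = OpenAI(base_url="https://api.example-provider.com/v1",
                api_key="YOUR_API_KEY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B",
    messages=[{"role": "user", "content": "..."}],
    temperature=0.6,   # recommended sampling settings for Qwen3 thinking mode
    top_p=0.95,
)
print(resp.choices[0].message.content)
```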
It's not just me having problems with Qwen3 - there's another topic where others say that QwQ gets better results - https://huggingface.co/Qwen/Qwen3-235B-A22B/discussions/18
Then it's recommended to use both (either in HuggingChat or via another custom provider).
@devopsML That's not true - unless you're representing the Qwen team and that's their official statement. All the posts/docs about Qwen3 say that it outperforms QwQ in complex reasoning tasks, but it seems that's true only for benchmarks, and in practice it performs worse. We expect progress from the next generation of the same model family; Qwen3 is the successor of QwQ and Qwen2.5, so it should be recommended to use it instead of the previous-generation models.
I hope this feedback reaches the Qwen team and that they treat the current release as a "Preview" version and fix the problems in the next release, as was the case with QwQ-Preview and QwQ - because despite the problems, the model still has a lot of potential.
Alright, I feel bad for you... Qwen3 may need to improve a bit to match the raw performance of its main rival, Gemini 2.5 Pro. Hopefully DeepSeek-R2 or (possibly) Qwen4 will fix that.
Anyway, feel free to use QwQ if you need to!
Maybe Qwen3.5 will be much better. I think Qwen's point-five series are always better than the integer series - Qwen2 also had a lot of problems compared to Qwen1.5, but 2.5 fixed that and performs much better.
Hope Qwen3.5 will be out soon, alongside DeepSeek-R2...
Pretty hyped for Qwen 3.5. Hope it has native vision support.
Me too.
Yes, Qwen3 is not good enough, so I suspect the team that developed Qwen3 is not the same one that developed QwQ.
Also, another model named "QwQ-plus" is even better than QwQ-32B and DeepSeek-R1 in my agent.
link: https://bailian.console.aliyun.com/?tab=model#/model-market/detail/qwq-plus