SimpleQA jumped from 12.2 to 54.3?
Edit: The true English SimpleQA score of this model is ~15. I provided an example in another comment and you can test it yourself on information you're familiar with via the link provided by Shinku below, but don't waste your time. A score of 54 is obviously absurd. That's higher than Gemini 2.5 Pro's and GPT4's scores despite those being far larger, English-focused models, so of course the much smaller, Chinese-focused Qwen3 235b doesn't come close to their broad English knowledge, or even Llama 3.1 70b's (SimpleQA of ~20). Their training data is obviously contaminated with the SimpleQA test questions.
This simply isn't possible. The English SimpleQA is a non-multiple-choice (full recall) test of esoteric knowledge across a broad spectrum of domains. It's 100% about the precise storage and retrieval of broad KNOWLEDGE, and it's theoretically impossible to fine-tune a higher score out of the same base model like you can with SKILLS (e.g. coding and math).
The only way to boost the score sans cheating is to fully retrain the Qwen3 base model on a vastly larger and more diverse English corpus. Even then, a score of 54.3 is likely impossible with only 235b parameters using today's technology. It's even higher than much larger, English-centric leaders like GPT4 and Gemini 2.5 Pro.
I haven't tested this model yet, but I will. Even so, I'm more than 99% confident that Alibaba deliberately cheated and trained on the SimpleQA test data. I'll gladly eat my words if I'm wrong. But for now, shame on you, Alibaba.
Is it possible to reliably disable the "think" tags/behavior with just finetuning without it occasionally showing up? Or even with continued pre-training?
My guess is that they started from the checkpoint prior to the "think" RL stuff, and did some magic on continued pre-training and fine-tuning, possibly with RL but without extensive reasoning tokens. I mean the math scores alone would've required some kind of RL.
But yeah that SimpleQA score just seems insane for a model this size. That's almost GPT 4.5 level, which was an insanely big model. Luckily, it's a very easy benchmark to create a private version of by collecting a bunch of basic facts, and I hope the model doesn't disappoint because holy hell those benchmarks.
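To make that concrete, a private version really can just be a list of question/answer pairs plus a scoring loop. Rough sketch of what I mean (the sample questions, API key, and model slug are placeholders, and SimpleQA proper uses an LLM grader rather than naive substring matching):

```python
# Rough sketch of a private SimpleQA-style check against an OpenAI-compatible
# endpoint (e.g. OpenRouter). Naive substring grading stands in for a grader.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

# Collect a few hundred of these across many domains (science, sports, TV, ...).
QUESTIONS = [
    ("Who portrayed Samuel Harvey Graynamore in Joe Versus the Volcano?", "lloyd bridges"),
    ("In what year did the TV show Corner Gas first air?", "2004"),
]

def ask(model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0.3,
        max_tokens=100,
    )
    return resp.choices[0].message.content or ""

def score(model: str) -> float:
    hits = sum(expected in ask(model, q).lower() for q, expected in QUESTIONS)
    return 100 * hits / len(QUESTIONS)

print(score("qwen/qwen3-235b-a22b-2507"))  # slug is a guess; check the provider
```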
Haha, I told you SimpleQA scores don’t tell you much.
@noneUsername The broad coverage and non-multiple-choice nature of the SimpleQA test make it a very accurate measure of a model's broad English knowledge (far better than all other commonly used tests combined) as long as the test questions are removed from the training data.
And I didn't catch any models cheating on the test (dozens tested) until recently. First with Ernie 4.5, and now the revised Qwen3. So the only reason you're right (its scores don't tell you much) is because the test questions are starting to make it into the training data of some models.
I tested it, and there's no way this model scored more than 15 on SimpleQA without cheating; it doesn't know 10% of what Kimi-k2 knows, and Kimi-k2 scored 31. To be fair, this model is excellent at translation: it translated 1,000 lines in a single pass, line by line, with consistently high quality (from Japanese).
@Shinku I appreciate you testing this model. It's too large to run on my PC and I can't seem to find an online inference provider. The chat link Alibaba provided on the model card performs just as poorly as the last Qwen3 235b so I don't think they're hosting the new version yet.
I also tested Kimi K2 and its score of 31 is accurate, and like you said, vastly superior to Qwen3's knowledge. The test is very hard (nearly all esoteric questions) so a score of 31 is actually a very good score and is a little better than Llama 3.1 405b and DeepSeek v3's.
Do you think this model and Ernie 4.5 chose the "right" training data to achieve the SimpleQA score increase? I think so too, but I wouldn't call it "cheating".
As early as Qwen2.5 and even earlier, it was very common to add Q&A/instruction following/problem solving related corpora during the pre-training stage. Qwen2.5's base model can achieve a score of 90 on gsm8k, which is very... amazing.
My point is that model training naturally demands high-quality corpora, and the highest-quality corpora are, unsurprisingly, the ones best suited to testing a model's intelligence. Therefore, it is unrealistic to expect a model to "not deliberately train on corpora related to the benchmark".
This is especially unrealistic when you argue that a benchmark that can be easily recited, like SimpleQA, reflects the general task ability or "popular knowledge" of an LLM.
@phil111 I'm using this (free on OpenRouter): https://chutes.ai/app/chute/b2b7a64c-b203-5a5f-8982-a9c5cc12058c
@noneUsername The SimpleQA test is nothing like GSM8k because it tests knowledge vs ability.
When it comes to abilities like math, you can teach a human or a language model with a few quality examples, and they can then solve countless other similar problems, so training on a quality set of questions (e.g. the GSM8k test itself) can improve the overall quality of the model.
In contrast, when it comes to knowledge tests like SimpleQA, training on the included questions (out of a potential pool of millions across a spectrum of domains) doesn't improve the overall broad knowledge of the LLM.
In other words, you either know the factual information or you don't. Being given examples of random factual answers doesn't help you get other random factual answers correct.
In short, there's simply no way to legitimately get a higher SimpleQA score unless you train on a larger and more diverse corpus. Training on the insignificant subset of factual questions included in the SimpleQA test provides no overall benefit to an AI model other than an artificially high SimpleQA score (aka cheating).
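As an aside: this is checkable in principle if you ever get to see the training corpus. The standard contamination test is an n-gram overlap scan between benchmark questions and training documents. A minimal sketch, with placeholder file names and an arbitrary 8-word window:

```python
# Minimal n-gram contamination check: flag training documents that share a
# long word n-gram with any benchmark question. File names are placeholders,
# and the 8-word window is a common but arbitrary choice.
import json

def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

# One benchmark question per line.
with open("simpleqa_questions.txt") as f:
    bench = set().union(*(ngrams(line) for line in f if line.strip()))

flagged = 0
with open("corpus.jsonl") as f:  # one {"text": ...} JSON document per line
    for line in f:
        if ngrams(json.loads(line)["text"]) & bench:
            flagged += 1

print(f"{flagged} documents share an 8-word n-gram with a benchmark question")
```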
@Shinku predicted this updated Qwen3 has a SimpleQA score of ~15, and based on my limited testing using the link he provided (temp 0.3), that's almost exactly my estimate as well. It's doing a little better than the last version (13) on some questions, but otherwise has approximately the same total English knowledge. There's no way it has a SimpleQA score anywhere near 20 (Llama 3.1 70b), let alone 50.
Here's one sample question about the TV show Corner Gas, which was the number 1 most-watched show in Canada every year it ran. Some names are completely wrong (e.g. Paul Gross and Emily Hampshire), and many others are associated with the wrong characters.
Prompt: "What are the 6 main characters, and the actors who portrayed them, on the TV show Corner Gas? Don't add details, just list them. And what year did the show first air?"
Response:
Brent Leroy – Brent Butt
Hank Yarbo – Paul Gross
Wanda Dollard – Emily Hampshire
Davis "Davis" Trainor – Fred Ewanuick
Lacey Burrows – Tara Spencer-Nairn
Oscar Leroy – Eric Peterson
The show first aired in 2004.
Haha, isn't SimpleQA based on existing text anyway? A model team that wants to improve its SimpleQA performance won't be stupid enough to add SimpleQA's question-answer pairs to the training set. They just need to design a novel corpus-collection strategy so that the collected corpus just happens to have exactly 54.3% overlap with SimpleQA.
You can't prevent others from collecting corpora from the Internet, right?
Again, using stupid TV show casts to test model capabilities. I have nothing to say, just be happy.
@noneUsername Again, your criticism misses the point. The SimpleQA test questions are a random ~0.001% sampling of esoteric questions taken from a vast pool of potential questions across a broad spectrum of domains. There's simply no way to notably and legitimately increase SimpleQA scores beyond training democratically on a large diverse corpus. The categories are much too broad and include Science & Technology, Politics, Art, Other, Geography, Sports, Music, TV Shows, History, and Video Games.
You need to stop being a belittling prick and see this for what it is. I almost never watch TV and am unusually disconnected from pop culture. None of this is about me personally. This is about model makers faking gains by giving up broad knowledge and abilities in order to spike test scores on a handful of domains (e.g. coding, math, and STEM), and then fabricating scores on the test designed to detect said overfitting (the SimpleQA). Alibaba went from overfitting, but not cheating with Qwen2.5 and 3, to full blown cheating in this latest version of Qwen3. They didn't make a mistake. They know as well as I do that this model's English SimpleQA score (broad English knowledge) is nowhere near 50. They flat out cheated and lied.
They have gotten really blatant with this one.
So, you're saying a score over 50 is cheating. Where do you draw the line, then? Is scoring over 40 cheating? What about over 30? Or 20?
Why don't you use that brilliant 🧠 of yours and enlighten me? @phil111
This model actually knows the plot of LOST and sequence of events really well (testing on openrouter)!
@kugwzk Bro, even DeepSeek with ~670B parameters scored 27, Kimi K2 with 1T parameters scored 31, GPT-4o (maybe 1.7T parameters) managed 38.2%, and ChatGPT-4.5 (believed to have 5T parameters) achieved 62.5%. Do you really believe a 235B model stores almost twice as much knowledge as a 1T-parameter model? Did they even pre-train this model, or did they use the same base model as the previous Qwen3 and just post-train it?
70% of SimpleQA answers can be found via wikipedia. Rewriting wikipedia is all you need.
Rewriting “phil111’s stupid TV show cast list” is all you need.
But be careful, phil111 will call you “a belittling prick”.
@noneUsername & @izusa Can we please focus on what really matters? If the English SimpleQA test is pointless then Alibaba should have simply accepted the earned score of ~15, but instead they not only lied, they claimed a ridiculously high score of 54, which is above much larger English-focused models like Gemini 2.5 Pro and GPT4.
Google, Meta, Mistral, OpenAI... have never, and would never, do such a thing. Cheating so egregiously not only calls into question all other Qwen3 test scores, but creates chaos in the open source ecosystem. For example, it puts pressure on other AI makers to either call them out as cheaters or cheat themselves in order to appear competitive.
Your focus on what you perceive to be the pettiness of what's covered by the SimpleQA test is simply irrelevant by comparison. Alibaba has gone nuclear. They decided that cheating so egregiously that 100% of experts in the industry know they cheated isn't a deal breaker.
@phil111 I tested the 2507 version at https://chat.qwen.ai/ and, based on my results, you may be right. In actual use it isn't that knowledgeable; when I tested Qwen3 2507, it scored only 32% on my benchmark, nowhere near as high as 54%.
"Alibaba has gone nuclear." I like this. Isn't it a cool line? like "Now Alibaba become Death, the destroyer of worlds."
I have argued with you many times about whether SimpleQA scores reflect the level of popular knowledge of LLMs, whether low SimpleQA scores like Phi4's mean what you call "overfitting", and whether high SimpleQA scores mean what you call "cheating".
I am not your teacher. If I explain something once and you ignore it, I will not explain it a second time. Or, to put it more directly: I have paid attention to what you call the "really important things", but you have ignored that attention.
You can do the same thing to anyone: emphasize the amazing consistency of SimpleQA scores with your experience of using LLMs, emphasize the importance of SimpleQA scores to model popular knowledge and general conversation ability, emphasize that SimpleQA scores are difficult to fake and emphasize that abnormal SimpleQA scores are due to "cheating".
But all you do is restate your own point of view, which I have already seen and which is not very interesting.
Especially this time, Qwen3-235B-A22B-Instruct-2507 performed well in my ERP test, so I particularly disagree with your point of view.
70% of SimpleQA answers can be found via wikipedia. Rewriting wikipedia is all you need.
So, is this the reason why Qwen was trained multiple times on SimpleQA-related wiki data? The key point is that Qwen specifically increased the proportion of long-tail data related to SimpleQA, rather than increasing the proportion of all long-tail data.
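Mechanically, that kind of pipeline is trivial to build. A hypothetical sketch, just to illustrate the "rewrite Wikipedia into Q&A" idea (the page title, model slug, and API key are placeholders, and this is not anything Qwen is confirmed to have done):

```python
# Sketch: pull a Wikipedia page summary and ask a model to rewrite it as
# short factual question/answer pairs, as one might for synthetic pretraining
# data. Page title, model slug, and API key are placeholders.
import requests
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

def page_extract(title: str) -> str:
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    return requests.get(url, timeout=30).json()["extract"]

def to_qa_pairs(text: str) -> str:
    resp = client.chat.completions.create(
        model="qwen/qwen3-235b-a22b-2507",  # slug is a guess
        messages=[{
            "role": "user",
            "content": "Rewrite the following passage as short factual "
                       "question/answer pairs, one per line:\n\n" + text,
        }],
        temperature=0.3,
    )
    return resp.choices[0].message.content or ""

print(to_qa_pairs(page_extract("Corner_Gas")))
```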
Rewriting “phil111’s stupid TV show cast list” is all you need.
Phil has noticed a real phenomenon whereby these newer models have large world knowledge gaps, and hallucinate the answers.
He's also noted that, of the common benchmarks, the SimpleQA score usually has the closest correlation to this phenomenon.
The TV show cast list is just one easy to repeat way of testing this, similar to those "stupid one-shot coding prompts" like "make a flappy bird game" or "numbered balls in a rotating hexagon".
performed well in my ERP test
Cool, but that niche use case is very different from the niche use case Phil is testing and reporting for us!
Anyway, to be fair, I'm guessing they probably used SimpleQA answers for GRPO (they have to use something with a verified answer) in the hope that the model will generalize and respond better to similar questions (provided the knowledge is in the base model), roughly along the lines of the sketch below.
The model is certainly a lot better at reasoning about details of fictional stories and seems like a big improvement over the previous release.
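To illustrate what that guess would look like in practice, here is a hypothetical sketch of just the verifiable-reward and group-advantage part of a GRPO-style setup (not Qwen's actual pipeline; substring grading stands in for a real grader):

```python
# Hypothetical sketch of a GRPO-style verifiable reward for factual QA.
# Binary grading against a known answer, then group-relative advantages.
from statistics import mean, pstdev

def reward(sampled_answer: str, gold_answer: str) -> float:
    """1.0 if the gold answer string appears in the sample, else 0.0."""
    return float(gold_answer.lower() in sampled_answer.lower())

def group_advantages(samples: list, gold_answer: str) -> list:
    """GRPO normalizes each sample's reward against its own group's mean/std."""
    rewards = [reward(s, gold_answer) for s in samples]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

# Example: 4 sampled completions for one SimpleQA-style question.
samples = [
    "Stuart Bloom was played by Kevin Sussman.",
    "It was Bob Newhart.",
    "Kevin Sussman played the comic store owner.",
    "I believe it was Wil Wheaton.",
]
print(group_advantages(samples, "Kevin Sussman"))
```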
One interesting thing I noticed is that this updated version of Qwen3 makes just as many factual hallucinations as the previous version when asked for broad information, such as the main cast of a movie. But when you then directly ask about the specific facts it got wrong, it commonly gets them right.
For example, when asked "Who portrayed Samuel Harvey Graynamore in the movie Joe versus the Volcano?" it correctly said Lloyd Bridges, but when asked for the main cast of the movie it attributed the wrong actor to Samuel Harvey Graynamore.
So to get factual information out of Qwen3, ask simple and direct questions.
Another interesting thing is that even when being simple and direct, this updated Qwen3 repeatedly gave the same wrong answer as the last version before arriving at the correct one.
For example, when asked "Who played the comic store owner on the TV show The Big Bang Theory?"
It responded with the wrong actor (Bob Newhart), then the right one (Kevin Sussman): "The comic book store owner on The Big Bang Theory was played by actor Bob Newhart. He portrayed Professor Proton, the fictional character from the childhood science show that Sheldon and Leonard watched, and later appeared as the real-life actor who played Professor Proton, Arthur Jeffries. However, the owner of the comic book store, Stuart Bloom, was played by Kevin Sussman."
Anyways, this explains why Qwen3 does so poorly on my test. Most of my questions are dense (main cast, multiple parts...), allowing for much faster testing of more information. And the updated Qwen, if anything, does worse on those questions. It does a lot better on short, direct questions about one thing at a time, but even then it regularly starts by answering wrong. Regardless, it still gets far more simple and direct questions wrong than GPT4 and Gemini 2.5 Pro, so its legitimate SimpleQA score is still nowhere near 54.
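For anyone who wants to quantify that dense-vs-direct gap Phil describes, something like this would do it (same caveats as the earlier sketch: placeholder API key and model slug, naive substring grading, and only one example case filled in):

```python
# Sketch: measure the dense-vs-direct gap described above by asking one dense
# question per case plus direct per-fact questions, then grading by substring.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")
MODEL = "qwen/qwen3-235b-a22b-2507"  # slug is a guess; check the provider

# Each entry: a dense question covering several facts, plus direct per-fact questions.
CASES = [
    {
        "dense": "List the main cast of Joe Versus the Volcano and who they played.",
        "direct": [
            ("Who portrayed Samuel Harvey Graynamore in Joe Versus the Volcano?", "lloyd bridges"),
        ],
    },
]

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
        temperature=0.3,
        max_tokens=300,
    )
    return (resp.choices[0].message.content or "").lower()

dense_hits = direct_hits = total = 0
for case in CASES:
    dense_answer = ask(case["dense"])
    for question, expected in case["direct"]:
        total += 1
        dense_hits += expected in dense_answer    # fact recalled inside the dense answer
        direct_hits += expected in ask(question)  # fact recalled when asked directly
print(f"dense: {dense_hits}/{total}, direct: {direct_hits}/{total}")
```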