This just trades general performance for domain-specific gains.
If you strip the corpus of ~95% of the world's most popular knowledge in order to overtrain on the remaining ~5% (mainly coding, math, and science), then of course you can push the scores in said grossly overfit domains a little higher than other comparably sized models. That's just overfitting.
For example, you can't test top triathletes in the high jump against top high jumpers and then declare the high jumpers the superior athletes. It takes a lot of intense, well-rounded training to become a triathlete, just as it takes a lot of well-rounded training for an LLM to become a general purpose AI model. So giving up gobs of broad knowledge and abilities, such as information about popular movies and the ability to write poetry, in order to focus training on a handful of domains (mainly coding, math, and science), and then comparing test scores in said grossly overfit domains against general purpose AI models, is more than just misleading; it's meaningless.
Anybody reading this can test whatever broad knowledge and abilities you care to, so don't read too much into the following example. But it shows that even with thinking enabled, this 9b model hallucinates far worse on wildly popular things, such as TV shows seen by countless millions of people, than other comparably sized models, and even much smaller models like Llama 3.2 3b.
Prompt: "List the 6 main characters, and the actors who played them, from Two and a Half Men. Don't add details, just list them. And what year did the show first air?"
Response:
Charlie Harper - Charlie Sheen
Alan Harper - Jason Sudeikis
Penny - Amanda Seyfried
Walden - Jon Cryer
Harper's mom - Joan Cusack
Lisa - Jamie Lee Curtis
The show first aired in 2003.
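If you want to run the same kind of spot check yourself, here's a minimal sketch using the Hugging Face transformers chat pipeline (assuming a recent version with chat-template support); the model ID below is a placeholder, so swap in whichever checkpoint you're testing:

```python
# Minimal sketch for spot-checking broad-knowledge recall locally.
# Assumes a recent transformers release with chat-format support in the
# text-generation pipeline; the model ID is a placeholder, not a real repo.
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="your-org/your-9b-model",  # placeholder: substitute the model under test
)

messages = [
    {"role": "user", "content": (
        "List the 6 main characters, and the actors who played them, "
        "from Two and a Half Men. Don't add details, just list them. "
        "And what year did the show first air?"
    )},
]

# The pipeline returns the full chat, with the assistant reply appended last.
out = chat(messages, max_new_tokens=200)
print(out[0]["generated_text"][-1]["content"])
```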
It's not just that Nemotron 9b's response about the main cast of a wildly popular and relatively recent TV show is a flood of hallucinations; it's the absurd nature of the hallucinations: Penny from The Big Bang Theory paired with an actress from neither show (Amanda Seyfried), other randomly selected actress names like Joan Cusack and Jamie Lee Curtis, the main characters Walden and Alan mixed up while Walden's last name is omitted, Evelyn Harper reduced to simply "Harper's mom," and so on.
When will this madness end? We've known since day one that if you focus training on just a handful of domains, you can push test scores in said domains higher compared to general purpose AI models, just as you can jump higher if you focus training on the high jump rather than a broad spectrum of activities like swimming, running, and biking. Qwen3 8b is itself grossly overfit to the same domains (coding, math, and science), which is why, as you stated, it scored higher on said tests than other comparably sized models like Gemma 2 9b and Llama 3.1 8b, yet it has FAR less broad knowledge and far fewer abilities than these models (e.g. it can't write poems worth shit). Then you come along and give up even more broad knowledge and abilities than Qwen3 8b in order to focus even more training on primarily coding, math, and science, climbing a little higher than Qwen3 8b on tests covering said domains. Whoopee doo.
I couldn't agree more.
I suppose everyone knows these Nemotron models are just for benchmarks. Frankly speaking, I don't think there are many people who actually use these models. Nvidia is not that good at training models.
More or less playing the Devil's Advocate here, but do you really want an 8B-class model to know, or remotely care, about random Hollywood trivia from 20+ years ago? Under 50B parameters, and probably under a few hundred billion, embedded trivia will always have a limit.
We are now in a world where Jan-V1, a 4B local model plugged into a free search MCP, is going to nail that answer 100% of the time. And if this model can call tools decently, so will it. If I were considering how to train this model or Jan, I would basically remove data like that from the corpus. It no longer serves a purpose. The weights that could have memorized the answer are better used for other things.
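To sketch what I mean (everything named here is a stand-in, not a real MCP client; `web_search` and `ask_model` are fake placeholders for a search server and a local model call):

```python
# Rough sketch of a search-grounded answer loop. Everything here is a
# stand-in: `web_search` fakes what an MCP search server would return,
# and `ask_model` fakes the call to a local 4B/9B model.
def web_search(query: str) -> str:
    # Placeholder for a real MCP search tool call; returns raw snippets.
    return "Two and a Half Men premiered on CBS on September 22, 2003 ..."

def ask_model(prompt: str) -> str:
    # Placeholder for a local model call (llama.cpp, vLLM, etc.).
    return "<answer grounded in the snippets above>"

def answer_with_search(question: str) -> str:
    # The point: the model only has to *read* the facts, not memorize them.
    snippets = web_search(question)
    prompt = (
        "Using only the search results below, answer the question.\n"
        f"Search results:\n{snippets}\n\n"
        f"Question: {question}"
    )
    return ask_model(prompt)

print(answer_with_search(
    "Who were the main cast of Two and a Half Men, and when did it first air?"
))
```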
Just food for thought. This still could have been benchmaxxed to the point of having issues. I care much more about its general ability to follow instructions, call tools, and produce eloquent output. IMHO your poetry test would be much more telling; I'd go a step further: asking for poetry in iambic pentameter, making specific edits to a text while keeping the author's style, or prompting it to come up with dad jokes would all be far more relevant.
I would prefer it admit it doesn't know rather than confabulate, though.
@JDWarner It's not about simple question answering. When these models train primarily on coding, math, and science, no amount of RAG can bring back any degree of competence outside the overfit domains (e.g. writing poetry).
This is why the most powerful LLMs can't solve absurdly easy ARC-AGI questions, and RAG doesn't help. That is, they can't solve any problem, or perform any task, outside of their training because they have zero intelligence. It's all about knowledge (pattern matching).
For example, they can get gold in math and coding competitions, yet can't solve painfully obvious original puzzles, games, humor, etc., or any original math or coding problem no matter how simple, because all they're doing is recalling the answer and making superficial modifications to it (e.g. changing the values of variables).
In order to create a general purpose LLM capable of serving the needs of the general population, not just coding nerds, you simply MUST train democratically on mankind's most popular data. Focusing the training of small ~9b LLMs on just a handful of domains like coding, math, and science just makes crappy tools (coders and math solvers) that still don't have anywhere near the required accuracy and reliability on said tasks to be useful to anyone. Even coders only test said models and then go back to using Opus or something. And even then, no competent coder is becoming more productive overall with the help of Opus.
In conclusion, since current LLM tech is 100% about knowledge (pattern matching) and 0% about intelligence, these models require balanced training across all of humanity's most popular knowledge to function across all tasks, like writing stories and poems and chatting effectively with intelligent humans (e.g. responding appropriately to humor, metaphor, and nuance). RAG doesn't help. If anything, writing poems and chatting with a spectrum of people, including geniuses, is far harder than more constrained tasks like generating code and solving math, which amount to recalling the nearest match in the training data and making superficial modifications to better align with the prompt.
The only reason LLM makers in the open source community are obsessing over a handful of domains like coding is that nearly everyone here is an autistic coding nerd, and they declare new models great or horrible based on coding alone (e.g. >95% of YouTube LLM reviews are nothing but the testing of coding prompts). If this community were dominated by poets, the makers would have trained on far more poems than code blocks. It would still be 100% about knowledge and pattern matching (retrieving the nearest-match poem and making superficial modifications to maintain rhyme, etc.), but in the end the generated poems would be far better, and the code far worse.
@phil111 I will go ahead and keep playing Devil's Advocate for a moment.
There is a great deal of overlap between STEM and creative disciplines, from golden ratios throughout art to the common experience of high-achieving individuals in STEM also being musicians. Perhaps these are not as distinct as they appear at first glance?
I would push back against an actor list from a random TV show being essential training info. The Iliad, Romeo and Juliet, Great Expectations - these absolutely would be, but a scrape of IMDB is not going to be essential. In fact, there is irony in your scorn for memorization of STEM problems when your key example is nothing more than direct memorization of specific facts of dubious relevance. My point is that Hollywood factoids, as factoids, are not useful info for testing. Articles analyzing specific episodes or actor performances? Valid. But we're at the point where Einstein's quote (maybe paraphrased), "never memorize anything you can look up in a book," is getting to be true for these models. Where I do agree is that we want more raw pluripotent intelligence, but I do not agree this must come from endless memorization of facts.
A more cynical take would be that we have a growing array of good benchmarks in STEM, but relatively few that can clearly "grade" model output on eloquence and creative tasks, other than arena vibes. There is clearly an opportunity here! If you, like I, would like more models adept at generating Shakespearean iambic pentameter, a good place to start would be the metrics. But, continuing the theme, they should be along the lines of "is this actually iambic pentameter, rhyming, and somewhat coherent" rather than "can it spit out the exact phrasing from that one scene in the confrontation with Tybalt."
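Half-seriously: a crude version of such a metric is easy to prototype. Below is a rough sketch using the `pronouncing` package (a wrapper around the CMU pronouncing dictionary); it's a heuristic only, since the dictionary's stress marks on monosyllables are unreliable and out-of-dictionary words break it, but it's the kind of automatic check I mean.

```python
# Rough iambic pentameter heuristic using the CMU pronouncing dictionary
# (pip install pronouncing). A sketch, not a real prosody model: CMU
# stress marks on monosyllables are unreliable, so we only require ten
# syllables with stress on the even (strong) positions.
import re
from typing import Optional

import pronouncing

def line_stresses(line: str) -> Optional[str]:
    """Concatenate per-word stress digits ('0'/'1') for a line."""
    stresses = ""
    for word in re.findall(r"[A-Za-z']+", line):
        phones = pronouncing.phones_for_word(word.lower())
        if not phones:
            return None  # out-of-dictionary word; give up
        # Fold secondary stress (2) into primary (1) for this rough check.
        stresses += pronouncing.stresses(phones[0]).replace("2", "1")
    return stresses

def is_iambic_pentameter(line: str) -> bool:
    """Ten syllables, stressed on positions 2, 4, 6, 8, and 10."""
    s = line_stresses(line)
    return s is not None and len(s) == 10 and all(
        s[i] == "1" for i in range(1, 10, 2)
    )

print(is_iambic_pentameter(
    "But soft what light through yonder window breaks"
))  # True with the current CMU entries; a heuristic, so mileage may vary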
My understanding is that Claude is trained on a very large corpus of literature, and it likely shows.
@JDWarner You're certainly right about top performers in STEM also being adept at creative pursuits like music and poetry. For example, Nobel Prize winners in science and members of elite institutions like the British Royal Society stand out in these areas, such as being far more likely to play a musical instrument than their less successful peers.
However, this has more to do with them being geniuses and leaders in their respective fields, because the same strong association exists among the leaders of every academic pursuit, including members of high-IQ societies like Mensa and 999.
The truth is that a disproportionately high percentage of the STEM population has high-functioning autism, especially coders, and they reliably show ignorance of, even disdain for, very popular aspects of humanity that aren't practical like coding, math, and the advancement of science and technology.
Anyways, you're far too focused on the specific example I provided. It was only one among countless potential examples illustrating that this model is profoundly ignorant of most popular domains of knowledge, including movies, music, and games. I can't show models like this to any of my non-technical family members, or to ~95% of the general population, because their responses are always some form of "AI is utter crap."
You simply can't scoop out ~95% of humanity's most popular knowledge without lobotomizing the resultant AI model, leaving it absurdly and frustratingly dense across a wide spectrum of use cases. For example, in movies, shows, literature, and conversations, people commonly reference popular, and seemingly unrelated, shared knowledge when making comparisons, jokes, sarcastic quips, etc., so when you filter out or severely under-train on >95% of humanity's most popular knowledge, the resultant AI models become very unreliable and keep responding in absurdly stupid and inappropriate ways. And for what? To gain ~5 points on coding, math, and science tests?

If you want a specialized tool, such as a coder, then create a tool (e.g. Qwen3 Coder). You're going to have to trust me when I say that blindly and democratically training on humanity's most popular knowledge makes AI models vastly more useful in countless ways to the general population. Models like this one are absolute garbage to pretty much everyone who isn't in STEM. They're specialized tools, not general purpose AI models.