Magistral Small has far less popular knowledge than Mistral Small.

#4
by phil111 - opened

I run a very easy broad-knowledge test that asks basic questions about very popular movies, shows, music, and so on: basically all the popular knowledge that isn't covered by the MMLU. The goal is to identify models that selectively overtrain on the small subset of popular knowledge that boosts MMLU scores.

Anyways, Mistral Small 2409 performed well on this test, scoring 82.6/100. However, Mistral Small 2501's score dropped to 75.4/100, and so on. And now this version (2509) has lost so much broad knowledge that it reliably hallucinates on the same easy questions that far smaller models like Llama 3.2 3b reliably get completely right.

For example, Two and a Half Men was one of the most watched shows of all time and ran for 12 seasons, yet the model struggles to simply list the main cast ("List the 6 main characters, and the actors who played them, from Two and a Half Men. Don't add details, just list them. And what year did the show first air?"). Its answer:

Charlie Harper – Charlie Sheen (correct)
Alan Harper – Jon Cryer (correct)
Jake Harper – Angus T. Jones (correct)
Evelyn Harper – Marin Hinkle (wrong; Evelyn was Alan & Charlie's mother, and the actress Marin Hinkle played Alan's ex-wife Judith)
Judith Harper – Amanda Payton (later replaced by Judy Greer) (wrong: both Amanda and Judy are incorrect)
Berta – Holland Taylor (wrong: Berta was the housekeeper, and Holland portrayed Alan and Charlie's mother)

This model has a very shallow and sharp knowledge horizon. It reliably gets the casts of the top 100 movies and shows of all time correct, such as the TV show Friends and the movie Pulp Fiction, and it follows instructions well, which pretty much rules out a configuration or tokenizer issue. But then it suddenly starts hallucinating like crazy about slightly less popular, but still very popular, information known by countless millions of people. And as previously stated, small models like Llama 3.2 3b get nearly all of the same easy questions about very popular things right, while this model reliably hallucinates on them, even at the same 0.3 temperature, quantization, etc. (the results are even worse at the recommended settings, such as temp 0.7).
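For anyone who wants to reproduce this kind of spot check, here is a minimal hypothetical sketch, not the author's actual harness: it scores an answer to the cast question above against the known character/actor pairs. The `sample_answer` string is stubbed with the quoted output; in practice it would come from the model sampled at, e.g., temperature 0.3.

```python
# Hypothetical spot check (not the author's actual test harness): score a model's
# answer to the cast question above against the known character/actor pairs.
import re

# Ground truth: main character -> actor (lowercase for matching).
EXPECTED = {
    "charlie harper": "charlie sheen",
    "alan harper": "jon cryer",
    "jake harper": "angus t. jones",
    "evelyn harper": "holland taylor",
    "judith harper": "marin hinkle",
    "berta": "conchata ferrell",
}

def score_cast_answer(answer_text: str) -> float:
    """Fraction of expected character/actor pairs the answer lists correctly."""
    hits = 0
    for line in answer_text.lower().splitlines():
        # Accept "Character - Actor" with a hyphen or an en dash as separator.
        parts = re.split(r"\s[-\u2013]\s", line, maxsplit=1)
        if len(parts) != 2:
            continue
        character, actor = parts[0].strip(), parts[1].strip()
        if EXPECTED.get(character) == actor:
            hits += 1
    return hits / len(EXPECTED)

# The (partly wrong) answer quoted above gets 3 of 6 pairs right.
sample_answer = """charlie harper - charlie sheen
alan harper - jon cryer
jake harper - angus t. jones
evelyn harper - marin hinkle
judith harper - amanda payton
berta - holland taylor"""
print(score_cast_answer(sample_answer))  # -> 0.5
```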

I suppose having a model that is marginally better at math, coding, and STEM is useful, but this can no longer be considered a general purpose AI model that the general population can use as a daily driver.

Well, that's what fine-tunes are for.

@CATAMERCA Firstly, >99% of the general population can't fine-tune models and aren't going to, nor are they going to use community fine-tunes. They're just going to use the official instruct versions.

Secondly, this isn't a fine-tuning issue. The base models are losing broad knowledge for various reasons, such as undertraining low-priority knowledge tokens and overfitting on high-priority coding, math, and STEM tokens at the tail end of training. The only way to restore the lost broad knowledge is to basically re-train the model on a full broad-knowledge corpus for weeks on a supercomputer about as powerful as the one it was originally trained on. In short, fine-tuning improves tasks like coding, math, and story writing, but does not improve general knowledge.

Why do people want models filled with completely useless info?
These models are moving in a direction where they become really good at retrieving and evaluating information dynamically.
Training static info into these models makes no sense at all. I want a model which, when I ask it questions like yours, is able to search the web and evaluate and validate sources, because that's future-proof for any upcoming TV soap as well.
If it's at the cost of having less useless info pre-trained into the model, then your observation is kind of a good thing, although it shouldn't just "hallucinate" facts and should actually do the research, but you have to give it tools for that.
For any other use case, idk, maybe just use the Wikipedia API?
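As a concrete illustration of that suggestion, here is a minimal sketch using Wikipedia's public REST summary endpoint (the URL pattern and the "extract" field come from that API; error handling, disambiguation, and caching are omitted):

```python
# Minimal sketch of "just use the Wikipedia API": fetch a plain-text summary
# for a title instead of relying on memorized facts.
import requests

def wiki_summary(title: str) -> str:
    url = "https://en.wikipedia.org/api/rest_v1/page/summary/" + title.replace(" ", "_")
    resp = requests.get(url, headers={"User-Agent": "knowledge-lookup-demo/0.1"}, timeout=10)
    resp.raise_for_status()
    return resp.json().get("extract", "")

print(wiki_summary("Two and a Half Men")[:300])
```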

@ztsvvstz That's not how any of this works.

It's not just about giving answers to random questions. And even if it were, RAG has HUGE disadvantages when it comes to things like latency, complexity, and dependency on the internet being up, especially when traveling, and so on. Plus other major issues come to mind, such as privacy, especially if you're using the model to analyze things like personal finances, business records, and client logs.

But more importantly, a large number of complex tasks require the knowledge to be in the weights for quality organic outputs, such as when writing stories, poems, and jokes, or simply discussing a popular topic of interest with an organic flow to it. Plus RAG results in weird mistakes, shifts in prose, and so on, as things are pulled from various sources. It's like interacting with a slow and unreliable schizophrenic.

Lastly, if RAG were sufficient for knowledge, then why are select domains of knowledge that are no more useful for performing tasks like coding and math, such as the names of past scientists, not only included in the weights, but expanded in every generation to boost MMLU scores?

I can go on and on, but I assure you with 100% confidence, as someone who's been testing AI models across a broad spectrum of tasks for years, that RAG does not, in any meaningful way, compensate for a model's broad general ignorance of humanity's most popular knowledge. I find this commonly expressed elitist attitude of "let them eat RAG" to be as annoying as it is stupid. AI models aren't just for autistic male coding nerds.

Okay, one more thought, because it's really kinda absurd (no offense, though).
Let's imagine you train a model solely on "general knowledge". You'd have to retrain it all the time or else it'd be obsolete in 2-3 years due to the lack of recent knowledge. On top of that, it's too dumb to obtain recent news, because its whole knowledge network consists of associations of which actor played in which TV series, or static information which might have been contradicted in the meantime (in which case it will still defend its trained facts).
What you're describing really sounds more like a job for a vector database, or an LLM coupled with a vector db (see the sketch below).
But you'll probably see that a "general knowledge" LLM on its own will start to fail at more complex queries anyway, ones which weren't discussed in a public forum somewhere, because it's lacking the logic to connect the facts ;)
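For what that coupling looks like in the simplest possible form, here is a toy sketch. The facts, the bag-of-words "embeddings", and the in-memory index are all stand-ins; a real setup would use a learned embedding model and an actual vector store.

```python
# Toy illustration of "LLM coupled with a vector DB": store facts as vectors,
# retrieve the nearest ones for a query, and paste them into the prompt.
import re
import numpy as np

FACTS = [
    "Two and a Half Men first aired in 2003; Charlie Sheen played Charlie Harper.",
    "Marin Hinkle played Judith, Alan's ex-wife, in Two and a Half Men.",
    "Conchata Ferrell played Berta, the housekeeper, in Two and a Half Men.",
    "Pulp Fiction (1994) was directed by Quentin Tarantino.",
]

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9']+", text.lower())

# Shared vocabulary and a tiny in-memory "vector database" of fact vectors.
VOCAB = sorted({tok for fact in FACTS for tok in tokenize(fact)})

def embed(text: str) -> np.ndarray:
    """Crude bag-of-words vector (a stand-in for a real embedding model)."""
    toks = tokenize(text)
    return np.array([toks.count(w) for w in VOCAB], dtype=float)

INDEX = np.stack([embed(f) for f in FACTS])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k facts most similar to the query by cosine similarity."""
    q = embed(query)
    sims = INDEX @ q / (np.linalg.norm(INDEX, axis=1) * (np.linalg.norm(q) + 1e-9))
    return [FACTS[i] for i in np.argsort(-sims)[:k]]

# The retrieved facts would be pasted into the LLM prompt as grounding context.
context = retrieve("Who played Berta in Two and a Half Men?")
print("\n".join(context))
```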

That being said, I should probably add that LLMs right now aren't even that good at connecting facts, even with all the emphasis on math, logic & reasoning, and adding more random facts will probably just add to the confusion.
It'd be great if they'd learn to properly verify any information (whether they generated it or RAG'd it).
Poems and stuff, however, are an interesting niche use case I haven't even thought about; that's a good point.

phil111 changed discussion status to closed

@ztsvvstz

General knowledge remains relevant for millennia, and your misunderstanding of "logic" is borderline painful. Mathematics builds on philosophical foundations. Try implementing RSA encryption without the resolutions of the Leibniz/Newton calculus feud. I can tell you ahead of time that this won't work.

Other examples are GPT-4 solving coding challenges using patterns from StackOverflow and Jane Austen's social protocol analyses. LLMs don't separate "logic" from knowledge, because transformer architectures process both through shared attention heads (i.e., the same weights generate Wittgenstein's language games and Python syntax).

This techno-reductionism produces unusable models. IBM Watson Health's 62% error rate in cancer trials proved that "pure logic" systems fail without sociological context. We abandoned that approach because parsing modern queries requires both historical knowledge and, let's say... TikTok slang comprehension.
