Gemma A3B
It would be amazing if Google released the A3B model with 27–30 billion parameters. A mixture-of-experts (MoE) architecture is incredibly useful because it allows the model to dynamically activate only the relevant experts for each task, improving efficiency, specialization, and overall performance.
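For anyone unfamiliar with how MoE layers work, here is a minimal, purely illustrative PyTorch sketch of top-k expert routing (a toy layer written for illustration, not Gemma's or any released model's actual implementation): a small router scores every expert per token, but only the top-k experts actually run.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Toy top-k routed MoE feed-forward layer (illustrative only)."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                         # x: (n_tokens, d_model)
        scores = self.router(x)                   # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)      # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e          # tokens that picked expert e in this slot
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

x = torch.randn(5, 64)
print(TinyMoELayer()(x).shape)                    # torch.Size([5, 64])
```

The point is that total parameters (and therefore stored knowledge) can grow with the number of experts while per-token compute stays roughly constant.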
That would be nice. There are three open-source A3B models that I'm aware of (gpt-oss, Ernie 4.5, and Qwen3), but they all lack the broad knowledge and abilities of Gemma 3, so I still use Gemma 3 27b as my daily driver despite it being much slower. It's still pretty fast on modern PCs, but I agree that an MOE version would be amazing.
Hi @kalashshah19, A3B (active 3 billion) means only a small fraction of a model's total parameters (~3 billion out of ~30 billion) are active at a time when generating a response, allowing the model to have nearly the same knowledge and abilities as a ~30 billion parameter dense model while outputting responses ~3x faster.
For example, Qwen3-30B-A3B-Instruct-2507 has 30 billion total parameters, of which only about 3 billion are active when generating each output token.
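To put rough numbers on that (a back-of-the-envelope sketch only; real-world speedups such as the ~3x mentioned above come out smaller than the raw compute ratio because attention, shared weights, and memory bandwidth also matter):

```python
# Back-of-the-envelope only: a hypothetical 30B-total / 3B-active model.
total_params = 30e9       # all parameters must still fit in (V)RAM
active_params = 3e9       # parameters actually used per generated token

active_fraction = active_params / total_params                     # 0.10
dense_to_moe_compute = (2 * total_params) / (2 * active_params)    # ~2 FLOPs/param/token

print(f"active fraction: {active_fraction:.0%}")                   # 10%
print(f"theoretical compute saving: {dense_to_moe_compute:.0f}x")  # 10x
```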
Hello @Maria99934 and @phil111, I am curious what A3B means. What is its full form?
This improves performance very conveniently. Today, Qwen3 is the only model I know of with an architecture that can really think, but it's more analytical; Gemma is very good at history, the humanities, and creating text content.
Maybe MoE models are just bad at knowledge compared to dense models.
@CHNtentes MOE models have the same theoretical maximum storage capacity as dense models.
For example, gpt-oss-120b has only ~5B active parameters yet has the approximate knowledge of 120B dense LLMs, such as a SimpleQA score of ~17.
Another example is Mixtral 8x7b (12.9B active parameters), which was trained by Mistral on pretty much the same corpus as Mistral 7b, yet it scores much higher across the board on knowledge tests, including SimpleQA, MMLU, and my personal broad knowledge test (82 vs 69).
It's "intelligence", not knowledge, that was the primary issue with early MOEs like Mixtral. However, DeepSeek was the first to largely overcome this limitation.
Lastly, the only reason some MOEs, such as Qwen3 30B-A3B, are so profoundly ignorant across numerous domains of knowledge is that Alibaba grossly overfit a handful of domains like coding, math, and STEM. This is confirmed by the fact that the Qwen3 32b dense model is equally ignorant across the same domains of knowledge and scores comparably to Qwen3 30B-A3B on broad knowledge tests like SimpleQA, MMLU, and my personal broad knowledge test.
So in short, MOE models aren't bad at knowledge, and in theory if everything is done right (e.g. routing) they should match the information density of equally sized dense models.
Thank you for the detailed explanation; it is very helpful.
Thanks for the information mate.
@phil111 That's not entirely correct. A 30b dense model would be able to hold a lot more knowledge than a 30b MoE. There's a formula for this somewhere (I'm not near my bookmarks right now).
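For reference, one rule of thumb that gets cited in these discussions (it may or may not be the exact formula I'm thinking of, so treat it as a rough heuristic rather than an exact law) estimates an MoE's dense-equivalent capacity as the geometric mean of its total and active parameters:

```python
# Rule-of-thumb only: effective dense-equivalent size of an MoE is often
# estimated as sqrt(total_params * active_params). Parameter counts below
# are approximate and used purely for illustration.
from math import sqrt

def effective_params_b(total_b, active_b):
    """Geometric-mean heuristic, in billions of parameters."""
    return sqrt(total_b * active_b)

print(f"30B-A3B                   ~ {effective_params_b(30, 3):.1f}B dense-equivalent")   # ~9.5B
print(f"gpt-oss-120b (5B active)  ~ {effective_params_b(120, 5):.1f}B dense-equivalent")  # ~24.5B
print(f"Mixtral 8x7b (13B active) ~ {effective_params_b(47, 13):.1f}B dense-equivalent")  # ~24.7B
```

By that heuristic, a 30B-A3B sits closer to a ~10B dense model than a 30B one, which lines up with the Qwen3 comparison below.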
You're right that we need to compare like for like, i.e. not Mixtral vs Llama 3. So here are some examples:
Qwen3 MoE vs Dense
If you compare the first 30B Qwen3 MoE with the dense models they released at the same time, it was pretty much on par with the 14B dense model, and the 32B dense model was a lot better (at everything).
Mistral MoE vs Dense
Mixtral 8x22b (141B total) was a lot worse than Mistral-Large-123b. Apparently it retained less knowledge than the leaked, quantized 70b Mistral model that was uploaded to HF around that time.
IMO, a gpt-oss 70b dense version would absolutely knock that 120b MoE out of the water!
There's also the issue of fine-tuning. There are hundreds of community fine-tunes of dense models (even Mistral-Large and a few Command-A/Command-R), but practically no MoEs, because they're so expensive and difficult for hobbyists to train.
Cohere seem to be the only company still releasing large, powerful dense models. Personally, I hope Google stick with dense models.
A 70b or 120b dense Gemma would be incredible (but I'm guessing they can't do that as it'd compete with Gemini?)
As you can probably tell, I hope Google don't jump on the MoE bandwagon for Gemma!
@gghfez I honestly agree with your conclusion that the dense design reliably outperforms the MOE design despite having comparable test scores.
For example, even though the test scores of Qwen3 32b are reliably only 1 to 2 points higher than Qwen3 30B-A3B, such as AIME25 of 72.0 vs 70.9 and MultiIF of 73.0 vs 72.2, the real-world performance difference between them is larger than the test scores suggest.
That is, while the two models had comparable knowledge and abilities, the Qwen3 32b dense model was not only less error-prone (e.g. outputting in the wrong language less often), it also showed more variation; for example, stories and synonym lists had more variability at the same temperatures.
However, your second example of Mixtral 8x22b and Mistral Large isn't a fair comparison. The skills, such as story writing and coding, of the Mixtral and Mistral Large/Small series are very different (e.g. Mistral Large writes far better stories than Mixtral 8x22b), and the broad knowledge of Mixtral 8x22b is still higher than any version of Mistral Large. Part of the confusion is that the Mistral Large/Medium/Small series started to overfit domains like coding, making it appear more knowledgeable in some contexts.
Anyway, while I agree that a 70b or 120b dense Gemma would outperform a 70b or 120b MOE, they would be FAR slower on modern PCs, hence FAR fewer people would run them locally. So all things considered, a 120b MOE Gemma with 5b active parameters, like gpt-oss-120b, would be chosen over a 120b dense Gemma by >95% of users. The relatively small performance gain wouldn't come close to making up for the gigantic difference in speed.
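To illustrate the speed gap with rough numbers (assuming single-user decoding on a typical PC is memory-bandwidth-bound and the weights are 4-bit quantized; these are illustrative estimates, not benchmarks):

```python
# Crude single-user decoding estimate: tokens/s ~ memory bandwidth / bytes read
# per token, where bytes per token scale with *active* parameters. Assumes
# 4-bit weights and memory-bandwidth-bound decoding; ignores attention/KV cache.
GB = 1e9

def tokens_per_sec(active_params_billions, bandwidth_gb_s, bytes_per_param=0.5):
    bytes_per_token = active_params_billions * GB * bytes_per_param
    return bandwidth_gb_s * GB / bytes_per_token

bandwidth = 100  # GB/s, roughly a dual-channel DDR5 desktop (assumed figure)
print(f"120b dense   : {tokens_per_sec(120, bandwidth):.1f} tok/s")  # ~1.7
print(f"120b-A5B MoE : {tokens_per_sec(5, bandwidth):.1f} tok/s")    # ~40.0
```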
Hi All,
Thank you so much for all of your inputs and interest in the Gemma models. We'll keep your feedback in mind and escalate it to the respective team for further evaluation. Thanks once again for your contribution to making Gemma models a significant step forward in open-source AI.
Thanks.
What can you do?