It's beating Claude 3.7 on (competitive) programming –a domain Anthropic has been historically really strong at– and it's getting close to o1-mini/R1 on olympiad level coding with just 7B parameters!
Google just dropped an exciting technical report for the brand-new Gemma3 model! 🚀 Here are my personal notes highlighting the most intriguing architectural innovations, design choices, and insights from this release:
1) Architecture choices: > No more softcaping, replace by QK-Norm > Both Pre AND Post Norm > Wider MLP than Qwen2.5, ~ same depth > SWA with 5:1 and 1024 (very small and cool ablation on the paper!) > No MLA to save KV cache, SWA do the job!
2) Long context > Only increase the rope in the global layer (to 1M) > Confirmation that it's harder to do long context for smol models, no 128k for the 1B > Pretrained with 32k context? seems very high > No yarn nor llama3 like rope extension
3) Distillation > Only keep te first 256 logits for the teacher > Ablation on the teacher gap (tl;dr you need some "patience" to see that using a small teacher is better) > On policy distillation yeahh (by @agarwl_ et al), not sure if the teacher gap behave the same here, curious if someone have more info?
4) Others > Checkpoint with QAT, that's very cool > RL using improve version of BOND, WARM/WARP good excuse to look at @ramealexandre papers > Only use Zero3, no TP/PP if i understand correctly ? > Training budget relatively similar than gemma2
Gemma3 family is out! Reading the tech report, and this section was really interesting to me from a methods/scientific fairness pov.
Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**. (Which everybody does, but people usually don't say)
For a tech report, it makes a lot of sense to report model performance when used optimally! On leaderboards on the other hand, comparison will be apples to apples, but in a potentially unoptimal way for a given model family (like some user interact sub-optimally with models)
Also contains a cool section (6) on training data memorization rate too! Important to see if your model will output the training data it has seen as such: always an issue for privacy/copyright/... but also very much for evaluation!
Because if your model knows its evals by heart, you're not testing for generalization.
We are introducing multi-backend support in Hugging Face Text Generation Inference! With new TGI architecture we are now able to plug new modeling backends to get best performances according to selected model and available hardware. This first step will very soon be followed by the integration of new backends (TRT-LLM, llama.cpp, vLLM, Neuron and TPU).
We are polishing the TensorRT-LLM backend which achieves impressive performances on NVIDIA GPUs, stay tuned 🤗 !
We applied the same data-driven approach that led to SOTA English performance in🍷 FineWeb to thousands of languages.
🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.
The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.
We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!