singhsidhukuldeep posted an update Jun 30
📅 Remember when, at the beginning of the year, @Google shared an update on knowledge distillation, introducing a way of learning from Self-Generated Mistakes?

📊 Resulting in significant improvements across tasks:
- 📄 2.1x in summarization
- 🌍 1.7x in translation
- 🧠 1.9x in reasoning tasks

🚀 Well, it looks like Google wasn't messing around! According to the Gemma 2 tech report, knowledge distillation was used to pre-train the 9B model, while the 27B model was pre-trained from scratch.

📈 For post-training, the Gemma 2 team generated completions from a stronger teacher model (unspecified in the report, but presumably Gemini Ultra) and then trained the student models on this synthetic data with SFT. This is quite common, as seen in many open models such as Zephyr and OpenHermes.
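
A minimal sketch of that first, off-policy stage (SFT on teacher-generated completions) might look like the following. The model names and prompts are placeholders, not the actual Gemma 2 setup, and it assumes the teacher and student share a tokenizer:

```python
# Sketch: off-policy distillation = SFT on completions written by the teacher.
# Model names and prompts are placeholders; assumes a shared tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("small-student-model")            # hypothetical
teacher = AutoModelForCausalLM.from_pretrained("big-teacher-model").eval()  # hypothetical
student = AutoModelForCausalLM.from_pretrained("small-student-model")
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

prompts = ["Summarize: ...", "Translate to French: ..."]  # the SFT prompt set

for prompt in prompts:
    inputs = tok(prompt, return_tensors="pt")
    # 1) The teacher writes the completion (the "synthetic data").
    with torch.no_grad():
        seq = teacher.generate(**inputs, max_new_tokens=128)
    # 2) The student is trained with plain next-token prediction on it,
    #    masking out the prompt tokens from the loss.
    labels = seq.clone()
    labels[:, : inputs["input_ids"].shape[1]] = -100
    loss = student(input_ids=seq, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```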

🤔 Sounds too good to be true? The catch: models trained this way suffer from a distribution mismatch between the output sequences seen during training and those the student generates itself at inference time.

📰 This is where the January 2024 paper "On-Policy Distillation of Language Models" comes in...

๐Ÿ” Gemma 2 team used โ€œon-policy distillation,โ€ where the student generates completions from the SFT prompts. These completions are then used to compute the KL divergence between the teacherโ€™s and studentโ€™s logits. By minimizing the KL divergence throughout training, the student learns to model the behavior of the teacher accurately while also minimizing the train-inference mismatch.

📚 Gem🔹 of a blog by @huggingface uncovering everything Gemma 2: https://huggingface.co/blog/gemma2#knowledge-distillation

📄 On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes (arXiv 2306.13649)