Remember when, at the beginning of the year, @Google gave an update on knowledge distillation, introducing a way of learning from Self-Generated Mistakes?
Resulting in significant improvements across tasks:
- 2.1x in summarization
- 1.7x in translation
- 1.9x in reasoning tasks
Well, it looks like Google wasn't messing around! According to the Gemma 2 tech report, knowledge distillation was used to pre-train the 9B model, while the 27B model was pre-trained from scratch.
For post-training, the Gemma 2 team generated completions from a stronger teacher model (unspecified in the report, but presumably Gemini Ultra), and then trained the student models on this synthetic data with SFT. This is a common recipe, seen in many open models such as Zephyr and OpenHermes.
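To make that concrete, here is a minimal, hypothetical sketch of the recipe (sequence-level distillation with Hugging Face transformers); the model names, prompts, and hyperparameters are placeholders I'm assuming, not the actual Gemma 2 setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints -- not the real teacher/student used for Gemma 2.
teacher = AutoModelForCausalLM.from_pretrained("org/strong-teacher")
student = AutoModelForCausalLM.from_pretrained("org/small-student")
tok = AutoTokenizer.from_pretrained("org/strong-teacher")
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

# 1) The stronger teacher generates synthetic completions for the SFT prompts.
prompts = ["Summarize the following article: ...", "Translate to French: ..."]
batch = tok(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    teacher_completions = teacher.generate(**batch, max_new_tokens=256)

# 2) The student is fine-tuned on the (prompt, teacher completion) pairs with the
#    standard next-token cross-entropy loss, i.e. plain SFT on synthetic data.
#    (A real run would mask prompt/pad tokens in the labels and loop over batches.)
loss = student(input_ids=teacher_completions, labels=teacher_completions.clone()).loss
loss.backward()
```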
Sounds too good to be true? Models trained this way suffer from a distribution mismatch between the output sequences seen during training and those the student generates at inference.
This is where the January 2024 paper "On-Policy Distillation of Language Models" comes in...
The Gemma 2 team used "on-policy distillation," where the student generates completions from the SFT prompts. These completions are then used to compute the KL divergence between the teacher's and student's logits. By minimizing the KL divergence throughout training, the student learns to model the teacher's behavior accurately while also minimizing the train-inference mismatch.
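For intuition, here is a rough PyTorch sketch of a single on-policy distillation step under those assumptions; the function name, sampling settings, and the omission of padding/loss masking are my simplifications, not details from the report:

```python
import torch
import torch.nn.functional as F

def on_policy_distillation_step(student, teacher, prompt_ids, max_new_tokens=128):
    # 1) The *student* samples completions from the SFT prompts (on-policy data).
    with torch.no_grad():
        seqs = student.generate(prompt_ids, max_new_tokens=max_new_tokens, do_sample=True)

    # 2) Both models score the student-generated sequences.
    student_logits = student(seqs).logits          # [batch, seq, vocab]
    with torch.no_grad():
        teacher_logits = teacher(seqs).logits

    # 3) Token-level KL(teacher || student) over the generated positions only
    #    (position i predicts token i+1, so start at prompt_len - 1).
    gen = slice(prompt_ids.shape[1] - 1, -1)
    s_logp = F.log_softmax(student_logits[:, gen], dim=-1)
    t_logp = F.log_softmax(teacher_logits[:, gen], dim=-1)
    kl = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")

    kl.backward()  # an optimizer.step() would follow in a full training loop
    return kl.detach()
```

Minimizing this KL on the student's own samples is what closes the gap between what the student sees during training and what it produces at inference.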
Gem of a blog by @huggingface uncovering everything Gemma 2: https://huggingface.co/blog/gemma2#knowledge-distillation
Paper: On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes (2306.13649)