Everything You Need to Know about Knowledge Distillation
🔳 This is one of the hottest topics thanks to DeepSeek. Learn with us: the core idea, its types, scaling laws, real-world cases and useful resources to dive deeper
In the previous episode, we discussed Hugging Face’s “Smol” family of models and their effective strategy for training small LMs through high-quality dataset mixing. Today we want to go further in exploring training techniques for smaller models, making it the perfect time to discuss knowledge distillation (KD). Proposed a decade ago, this method has continued to evolve. For example, DeepSeek’s advancements, particularly the effective distillation of DeepSeek-R1, have recently brought a wave of attention to this approach.
So, what is the key idea behind knowledge distillation? It enables the transfer of knowledge from a larger model, called the teacher, to a smaller one, called the student. This process allows smaller models to inherit the strong capabilities of larger ones, avoiding the need to train from scratch and making powerful models more accessible. Let’s explore how knowledge distillation has evolved over time, the different types of distillation that exist today, the key factors to consider for effective model distillation, and useful resources to master it.
📨 Click follow! If you want to receive our articles straight to your inbox, please subscribe here
In today’s episode, we will cover:
- When did knowledge distillation appear as a technique?
- A detailed explanation of knowledge distillation
- Types of knowledge distillation
- Improved algorithms
- Distillation scaling laws
- Benefits
- Not without limitations
- Real-world effective use cases (why OpenAI got mad at DeepSeek)
- Conclusion
- Sources and further reading
When did knowledge distillation appear as a technique?
The ideas behind knowledge distillation (KD) date back to 2006, when Bucilă, Caruana, and Niculescu-Mizil in their work “Model Compression” showed that an ensemble of models could be compressed into a single smaller model without much loss in accuracy. They demonstrated that a cumbersome model (like an ensemble) could be effectively replaced by a lean model that was easier to deploy.
Later, in 2015, Geoffrey Hinton, Oriol Vinyals, and Jeff Dean coined the term “distillation” in their paper “Distilling the Knowledge in a Neural Network”. The term referred to the process of transferring knowledge from a large, complex AI model or ensemble to a smaller, faster AI model, called the distilled model. Instead of training the smaller model only on the correct answers, the researchers proposed giving it the probability distribution produced by the large model. This helps the smaller model learn not just what the right answer is, but also how confident the big model is about each option. This training concept is closely connected to the softmax function, so let's explore more precisely how this all works at the core.
Image Credit: “Knowledge Distillation: A Survey” paper
A detailed explanation of knowledge distillation
First, we need to clarify what softmax is.
It is a mathematical function used in machine learning, especially in neural networks, to convert raw scores, called logits, into probabilities. It helps a model decide which category, or class, an input belongs to by ensuring that the output values sum to 1, making them interpretable as probabilities.
The important parameter in softmax is the temperature (T): it controls how confident or uncertain a model’s predictions look by adjusting the sharpness of the probability distribution, making it either more confident (sharp) or more uncertain (soft). With T = 1, the default setting, softmax behaves normally and typically puts most of the probability on a single answer, producing sharp, nearly hard targets. When the temperature is increased (T > 1), softmax produces soft targets, meaning the probability mass is spread more evenly across the classes.
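To see the effect concretely, here is a minimal NumPy sketch with made-up logit values (not taken from the article):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Convert raw logits into probabilities; higher T gives a softer distribution."""
    scaled = np.asarray(logits, dtype=float) / T
    scaled -= scaled.max()           # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [8.0, 3.0, 1.0]             # hypothetical scores for three classes
print(softmax_with_temperature(logits, T=1.0))  # sharp: almost all mass on class 0
print(softmax_with_temperature(logits, T=5.0))  # soft: probability spread across classes
```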
Soft targets are useful for distillation and training, and the knowledge distillation process below shows why. It typically involves several steps:
- First, the teacher model is trained on the original task and dataset.
- Next, the teacher model produces logits. These logits are converted into soft targets using a softmax function with a higher temperature to make the probability distribution softer.
- The student model is then trained on these soft targets, often alongside the hard targets (true labels), by minimizing the difference between the student’s output distribution and the teacher’s output distribution.
Image Credit: “Knowledge Distillation: A Survey” paper
During this process the student learns to reproduce not just the correct answers, but also the teacher’s relative confidence in those answers and its mistakes. This knowledge about how the teacher distributes probability mass among the incorrect categories provides rich information that helps the student generalize better. By combining a standard training loss on true labels with a distillation loss on the teacher’s soft labels, the student can achieve accuracy close to the teacher model’s accuracy, despite having far fewer parameters.
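To make this concrete, here is a minimal PyTorch-style sketch of such a combined loss; the temperature T and weight alpha are illustrative values, not ones prescribed by the article:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Standard cross-entropy between the student's predictions and the true labels
    hard_loss = F.cross_entropy(student_logits, labels)

    # KL divergence between the student's and teacher's softened distributions.
    # Multiplying by T**2 keeps gradient magnitudes comparable across temperatures,
    # as suggested in the original distillation paper.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    # Weighted combination of the distillation term and the hard-label term
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```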
If we were to summarize the key idea in one sentence, it would be this: The student is optimized to mimic the teacher’s behavior, not just outputs.
The “Distilling the Knowledge in a Neural Network” paper also proposes another approach: matching logits. Instead of copying probabilities, this method directly makes the small model's logits resemble the large model’s logits. In the high-temperature limit, the two approaches become mathematically equivalent and lead to similar benefits.
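In code, the change from the previous sketch is only the loss term: the softened-probability comparison is replaced by a squared distance between raw logits (again a hypothetical PyTorch sketch):

```python
import torch.nn.functional as F

def logit_matching_loss(student_logits, teacher_logits):
    # Directly pull the student's raw scores toward the teacher's raw scores
    return F.mse_loss(student_logits, teacher_logits)
```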
Types of knowledge distillation
What we have explored so far are just two options for knowledge distillation, but KD can be applied in various ways depending on what knowledge is transferred from teacher to student. These types of distillation are clearly illustrated in the paper “Knowledge Distillation: A Survey” by researchers from the University of Sydney and the University of London. The common techniques include:
Image Credit: “Knowledge Distillation: A Survey” paper
- Response-based distillation (outputs as knowledge): It is exactly what we have discussed above. This classic approach uses the teacher’s final output probabilities as the target for training the student. It works well in different AI tasks, like image classification, object detection, and pose estimation.
- Feature-based distillation (intermediate layers as knowledge): Instead of only copying the final predictions, the student also learns from the intermediate layers, or feature maps, of the teacher. This idea was proposed in the “FitNets: Hints for Thin Deep Nets” paper. Think of this technique as the student learning from the teacher's step-by-step problem-solving process. The student matches its intermediate representations to those of the teacher, learning both the right answers and the reasoning behind them. Different methods, such as attention maps, probability distributions, and layer connections, help match teacher and student features. This especially improves performance in tasks like image recognition and object detection (a minimal code sketch of this idea appears after this list).
Image Credit: “Knowledge Distillation: A Survey” paper
- Relation-based distillation (relationships as knowledge): The student model learns to mimic relationships between different parts of the teacher model – either between layers or between different data samples. For example, the student compares multiple samples and learns how similar they are to one another. This method is more complex but can work with multiple teacher models, merging their knowledge.
Image Credit: “Knowledge Distillation: A Survey” paper
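Here is the promised minimal sketch of the feature-based (FitNets-style) idea, assuming convolutional feature maps; the channel widths are made up, and a small 1x1 convolution maps the student's features to the teacher's width before comparing them:

```python
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """FitNets-style feature matching between one student layer and one teacher layer."""

    def __init__(self, student_channels=64, teacher_channels=256):
        super().__init__()
        # 1x1 convolution so the (narrower) student features can be compared
        # with the (wider) teacher features
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # Mean-squared error between the projected student features and the teacher's
        return F.mse_loss(self.regressor(student_feat), teacher_feat)
```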
This classification covers the three kinds of knowledge that can be transferred, that is, what to distill. But how exactly can the knowledge be transferred during training? There are three main schemes:
- Offline distillation – The teacher is trained first, then teaches the student.
- Online distillation – The teacher and student learn together.
- Self-distillation – The student learns from itself.
In practice, offline distillation is the simplest and most widely used scheme, since it reuses an already trained, frozen teacher; online distillation helps when no strong pre-trained teacher is available, because teacher and student improve together; and self-distillation needs no separate teacher at all, with deeper layers or earlier versions of the same network teaching the rest. A minimal sketch of the online scheme follows below.
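One well-known instantiation of online distillation is deep mutual learning, where two peer models train side by side, each combining the usual label loss with a KL term toward the other. The article does not name this method; the PyTorch sketch below is only an illustration, with T as an illustrative temperature:

```python
import torch.nn.functional as F

def mutual_learning_losses(logits_a, logits_b, labels, T=1.0):
    """Online distillation: each model gets cross-entropy on the labels
    plus a KL term pulling it toward its (detached) peer."""
    kl_a = F.kl_div(F.log_softmax(logits_a / T, dim=-1),
                    F.softmax(logits_b.detach() / T, dim=-1),
                    reduction="batchmean")
    kl_b = F.kl_div(F.log_softmax(logits_b / T, dim=-1),
                    F.softmax(logits_a.detach() / T, dim=-1),
                    reduction="batchmean")
    loss_a = F.cross_entropy(logits_a, labels) + kl_a
    loss_b = F.cross_entropy(logits_b, labels) + kl_b
    return loss_a, loss_b  # each loss updates its own model
```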
But that’s not all. Since the emergence of the knowledge distillation concept, many different approaches have been developed to improve knowledge transfer.
Improved algorithms
Different studies have proposed knowledge distillation algorithms to make it more effective, sometimes by combining it with other techniques. Here are some of them:
- Multi-teacher distillation: Combines knowledge from multiple teachers for a more well-rounded student. This approach can use different teachers for different features, average predictions from all teachers, or randomly select a teacher at each training step. It combines strengths from various models and improves diversity in knowledge (a minimal sketch of prediction averaging appears after this list).
- Cross-modal distillation: Transfers knowledge between different types of data, for example, from image to text or from audio to video.
- Graph-based distillation: Focuses on relationships between different data points to capture hidden relationships and better understand the data's structure.
- Attention-based distillation: The teacher model generates attention maps to highlight important areas in the data, and the student copies these attention maps, learning where to focus.
- Non-Target Class-Enhanced KD (NTCE-KD): Focuses on the often-ignored probabilities of non-target classes in the teacher’s output, so the student also learns how the teacher distributes probability across the incorrect classes.
- Adversarial distillation: Uses Generative Adversarial Networks (GANs) to help the student improve by mimicking the teacher. GANs can generate extra synthetic data to improve student training or use their discriminator, which checks if data is real or fake, to compare teacher and student outputs.
- Data-free distillation: Works without needing the original dataset. Here a GAN is also used to generate synthetic training data based on the teacher model.
- Quantized distillation: Reduces high-precision numbers for calculations (like 32-bit) used in large models to smaller, low-precision values (8-bit or 2-bit), making AI models lighter and faster.
- Speculative Knowledge Distillation (SKD): The student and teacher cooperate during text generation training. The student generates draft tokens and the teacher selectively replaces low-quality tokens, producing on-the-fly high-quality training data aligned with the student’s own distribution.
- Lifelong distillation: The model keeps learning over time, remembering old skills while learning new ones. Some variations of this approach: 1) the model learns how to learn, so it can quickly adapt to new tasks (meta-learning); 2) it learns from very few examples by reusing knowledge from past tasks (few-shot learning); or 3) it keeps a compressed version of its knowledge while training on new tasks (global distillation).
- Distillation in generative models: It refers to distilling complex generation processes into simpler ones. For example, you can distill a multi-step diffusion model into a single-step generative model (often a GAN) to dramatically accelerate inference. Also, in text-to-speech models, distillation can be used to compress autoregressive models into non-autoregressive ones, speeding up speech generation.
- Neural Architecture Search (NAS): Helps find the best student model automatically to match the teacher.
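As promised, here is a minimal sketch of one multi-teacher strategy, uniform averaging of the teachers' softened predictions (PyTorch; weighted averaging or sampling a teacher per step are common alternatives not shown here):

```python
import torch
import torch.nn.functional as F

def multi_teacher_soft_targets(teacher_logits_list, T=4.0):
    # Soften each teacher's distribution with the same temperature,
    # then average them into a single soft target for the student
    probs = [F.softmax(logits / T, dim=-1) for logits in teacher_logits_list]
    return torch.stack(probs).mean(dim=0)
```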
This variety of approaches demonstrates that distillation is a crucial training concept, which continues to evolve and remains an essential part of training small models.
Distillation scaling laws
We have explored different ways to transfer knowledge from a larger model to a smaller one. But can we predict how effective this knowledge distillation will be? How will the distilled model perform, and what factors will it depend on?
This is where Apple and the University of Oxford have made a significant contribution to this vast topic. They developed distillation scaling laws and identified key trends in model behavior after distillation.
Here are their main findings, which you should consider for more effective distillation:
- The distillation scaling law predicts how well a student model will perform based on three key factors:
- Student model's size
- The number of training tokens
- The teacher’s size and quality
The law follows a power-law relationship: performance improves in a predictable way, but only up to a point. Beyond that point, adding more resources no longer improves the student.
In practice, this translates into simple guidance: if the student is small, you can use a smaller teacher to save compute; if the student is large, you need a better and larger teacher for optimal performance. As compute increases, the optimal teacher size grows at first, but then plateaus, because using a very large teacher becomes too expensive.
Image Credit: “Distillation Scaling Laws” paper
- However, a good teacher doesn’t always mean a better student. If a teacher is too strong, the student might struggle to learn from it, leading to worse performance. This is called the capacity gap — it is when the student isn’t powerful enough to properly mimic the teacher.
- Distillation is most efficient when:
- The student is small enough that supervised training would be too expensive.
- A teacher model already exists, so the cost of training one is avoided, or the teacher can be reused to train more than one student.
The researchers recommend using supervised learning instead of distillation:
- If both the teacher and student need to be trained, because teacher training costs outweigh distillation benefits.
- If enough compute and data are available, because supervised learning always outperforms distillation at high compute or data budgets.
- Sometimes a student model can even outperform its teacher. This phenomenon is called weak-to-strong generalization.
Knowing these trends in KD can help avoid costly mistakes when applying this technique to the training of small models. Overall, what are the advantages of using knowledge distillation?
Benefits of knowledge distillation
Here we summarize everything which makes KD a great option for training smaller models.
- Lower memory and compute requirements: Knowledge distillation helps retain much of the performance while significantly reducing computational requirements. Smaller models consume less memory and processing power, making them suitable for deployment on edge devices, mobile applications, and embedded systems.
- Faster inference: The distilled student model is typically smaller and requires fewer computations, leading to lower latency and faster predictions, which is crucial for real-time applications.
- Improved generalization: The student model often learns a more generalized and distilled version of the knowledge, potentially reducing overfitting and improving performance on unseen data.
- Training stability: The student model benefits from the structured knowledge of the teacher model, leading to smoother and more stable training, especially in cases where data is limited or noisy.
- Transfer of specialized, diverse and multi-task knowledge: A student model can be trained with insights from multiple teacher models, allowing it to inherit knowledge from diverse architectures or domains and perform well across different tasks.
- Privacy-preserving AI: Knowledge distillation enables models to be trained without exposing raw data, which is beneficial in scenarios where data privacy regulations are applicable.
- Energy efficiency: With reduced computation, knowledge-distilled models consume less energy, making them environmentally and economically viable for large-scale AI deployments.
Knowledge distillation provides a powerful way to bring the prowess of state-of-the-art models into practical settings through more compact, efficient models. However, some issues remain.
Not without limitations
Like any technique, while knowledge distillation offers many benefits, it also has several limitations and challenges:
- Increased training complexity: Distillation requires training two models, the teacher and the student. This adds an additional step compared to directly training a smaller model from scratch.
- Loss of information: The student model may not capture all the nuances, fine-grained knowledge, or complex reasoning capabilities of the larger teacher model.
- Performance trade-off: While the student model aims to retain most of the teacher’s performance, there is often a trade-off between size and accuracy.
- If the student model is too small, it might not have enough capacity to effectively learn from the teacher. This happens especially when the teacher model is too strong for the student.
- Dependence on teacher model quality: If the teacher model is biased or contains errors, these issues will likely be transferred to the student.
- Sensitivity to temperature and hyperparameters: The effectiveness of KD depends heavily on the choice of the temperature parameter and the loss function. Improper tuning can lead to poor knowledge transfer or a weak student.
- Energy and computational costs: While the student model is efficient, the distillation process itself can be computationally expensive, especially for large-scale models.
Despite these limitations, knowledge distillation is a widespread technique that continues to be used in many cases, even with top-tier models.
Implementation of knowledge distillation: Real-world effective use cases
One of the freshest examples of effective KD is the DeepSeek case, where DeepSeek transferred the reasoning capabilities of DeepSeek-R1 into smaller models to make powerful reasoning models more accessible.
They fine-tuned open-source models like Qwen and Llama on 800,000 high-quality training examples generated with DeepSeek-R1. Surprisingly, these distilled models performed even better than applying reinforcement learning directly to the smaller models.
However, this case also sparked controversy when OpenAI raised concerns about the methods used in the distillation process. OpenAI alleged that DeepSeek may have leveraged outputs from proprietary models, such as ChatGPT, to enhance its training data. Reports suggested that some responses from DeepSeek’s models closely resembled those from OpenAI’s systems, leading to suspicions of unauthorized knowledge extraction.
Further fueling these claims, DeepSeek’s AI model occasionally identified itself as ChatGPT, hinting at potential training overlaps. Additionally, OpenAI and Microsoft reportedly detected unusual activity linked to data extraction efforts by individuals associated with DeepSeek. These findings led to a broader discussion on the ethical boundaries of knowledge distillation and the protection of proprietary AI technologies.
Despite the controversy, the results of DeepSeek’s knowledge distillation were significant.
- The distilled DeepSeek-R1-Distill-Qwen-7B model outperformed much larger models like QwQ-32B on reasoning benchmarks.
- The distilled 32B and 70B versions set new records for open-source AI reasoning tasks, demonstrating the effectiveness of distillation techniques in improving model efficiency and accessibility.
This case highlights both the potential and the challenges of knowledge distillation in AI development.
Image Credit: The original DeepSeek-R1 paper
Another example is a classic one from Natural Language Processing (NLP) – Hugging Face’s DistilBERT. It is a distilled version of the BERT model that retains about 97% of BERT’s language understanding capability while running about 60% faster and achieving a 40% reduction in model size.
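Distilled models like this are easy to try out. Here is a minimal sketch using the Hugging Face transformers fill-mask pipeline with the publicly released distilbert-base-uncased checkpoint (you need transformers plus a backend such as PyTorch installed):

```python
from transformers import pipeline

# Load DistilBERT through the fill-mask pipeline and let it complete a masked sentence
unmasker = pipeline("fill-mask", model="distilbert-base-uncased")
for prediction in unmasker("Knowledge distillation makes models smaller and [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```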
In October 2024, Microsoft researchers distilled a 405-billion-parameter Llama 3.1 model into 70B and 8B variants, using high-quality synthetic data. The distilled models retained comparable accuracy to the 405B teacher and even matched or surpassed the teacher’s zero-shot performance on some benchmarks.
In the vision domain, an example is Meta AI’s Segment Anything Model (SAM), a powerful but resource-heavy segmentation model. Researchers recently proposed KD-SAM, a distilled version of SAM tailored for medical image segmentation. Using a combination of feature-map losses, they transferred both structural and semantic knowledge from SAM’s giant Vision Transformer into a much smaller model. As a result, KD-SAM achieved segmentation accuracy on par with or even superior to the original SAM.
A prime example in speech recognition is Amazon Alexa. Researchers used a KD approach with a huge amount of unlabeled speech: the teacher model processed over 1 million hours of speech to generate soft targets, which then trained a smaller student acoustic model. The distilled student model could run efficiently on Alexa’s consumer devices, yet delivered accurate speech recognition under tight memory and CPU constraints.
Moreover, knowledge distillation helps bring AI to edge devices such as microcontrollers, sensors, and distributed devices that operate on battery power or low energy.
Conclusion
Today, we took a deep dive into the knowledge distillation training technique. This method has proven to be an effective approach for training small models by leveraging the capabilities and knowledge of powerful, larger ones.
Small models are all about AI accessibility, and knowledge distillation is about making knowledge accessible to smaller models. What’s great is that KD techniques are continuously evolving, as seen in the variety of distillation approaches that we have today. We hope that future developments in KD will help overcome current limitations, leading to more effective training.
To explore knowledge distillation further, check out the hands-on tutorials we summarized in our post on Twitter, along with the studies listed below.
Author: Alyona Vert Editor: Ksenia Se
Bonus: Resources to dive deeper
- Distilling the Knowledge in a Neural Network by Geoffrey Hinton, Oriol Vinyals, Jeff Dean
- Knowledge Distillation: A Survey
- Distillation Scaling Laws by Dan Busbridge, Amitis Shidani et al.
- FitNets: Hints for Thin Deep Nets
- Understanding and Improving Knowledge Distillation
- Compact Language Models via Pruning and Knowledge Distillation by Saurav Muralidharan, Sharath Turuvekere Sreenivas, Pavlo Molchanov et al.
- Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application by Yidong Wang et al.
- A Survey on Knowledge Distillation of Large Language Models by Tianyi Zhou
- Distilling Diffusion Models into Conditional GANs
- Explaining Knowledge Distillation by Quantifying the Knowledge
- Unleash Data Generation for Efficient and Effective Data-free Knowledge Distillation
- Knowledge Distillation from Language Model to Acoustic Model: A Hierarchical Multi-Task Learning Approach
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling by Zifeng Wang, Rishabh Agarwal, Chen-Yu Lee et al.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning by Qihao Zhu et al.
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter by Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf
- Knowledge Distillation Using Frontier Open-Source LLMs: Generalizability and the Role of Synthetic Data
- Efficient Knowledge Distillation of SAM for Medical Image Segmentation
- MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection
📨 If you want to receive our articles straight to your inbox, please subscribe here