arxiv:2505.21067

Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning

Published on May 27, 2025

Abstract

AI-generated summary

A simple distillation method using 920 examples outperforms zero-RL in flexibility and advanced cognitive behaviors by enhancing multi-perspective thinking and metacognitive awareness.

Reinforcement learning (RL) has played an important role in improving the reasoning ability of large language models (LLMs). Some studies apply RL directly to smaller base models (known as zero-RL) and also achieve notable progress. In this paper, however, we show that with only 920 examples, a simple distillation method applied to the same base model can clearly outperform zero-RL, which typically requires far more data and compute. By analyzing token frequencies in model outputs, we find that the distilled model exhibits more flexible reasoning: it uses anthropomorphic tokens and logical connectors much more often than the zero-RL model. Further analysis reveals that distillation enhances two advanced cognitive behaviors: Multi-Perspective Thinking or Attempting and Metacognitive Awareness. Frequent occurrences of these two behaviors give rise to flexible reasoning, which is essential for solving complex reasoning problems, whereas zero-RL fails to significantly increase their frequency.
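
The token-frequency analysis mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the marker sets ANTHROPOMORPHIC and LOGICAL_CONNECTORS and the per-1,000-token normalization are assumptions chosen for the example; the paper's actual vocabularies and counting setup may differ.

from collections import Counter
import re

# Hypothetical marker lists (illustrative stand-ins, not the paper's vocabularies)
# for "anthropomorphic tokens" and "logical connectors".
ANTHROPOMORPHIC = {"wait", "hmm", "maybe", "let", "me", "i"}
LOGICAL_CONNECTORS = {"therefore", "however", "because", "thus", "since", "alternatively"}

def marker_frequencies(outputs):
    """Return occurrences per 1,000 tokens of each marker category
    across a list of model-generated reasoning traces."""
    counts = Counter()
    total_tokens = 0
    for text in outputs:
        tokens = re.findall(r"[a-z']+", text.lower())
        total_tokens += len(tokens)
        for tok in tokens:
            if tok in ANTHROPOMORPHIC:
                counts["anthropomorphic"] += 1
            if tok in LOGICAL_CONNECTORS:
                counts["logical_connector"] += 1
    return {k: 1000 * v / max(total_tokens, 1) for k, v in counts.items()}

# Compare outputs from the two models on the same prompts.
distilled = marker_frequencies(["Wait, let me re-check this step. Alternatively, ..."])
zero_rl = marker_frequencies(["The answer is 42 because 6 * 7 = 42."])
print("distilled:", distilled)
print("zero-RL:  ", zero_rl)

Comparing these normalized rates between the distilled and zero-RL models is the kind of evidence the abstract refers to when it says the distilled model uses such markers much more often.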

Community

Great work! Have you noticed whether this is specific to the 32B model, or have you tried other models, such as Qwen2.5-14B? It seems easy to obtain high performance by training Qwen2.5-32B with little training data, but that does not seem to hold for other models, even within the Qwen family, and I don't know why.
