Abstract
Large Reasoning Models (LRMs) can self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but flawed beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, at fixed token intervals during inference, each reasoning path summarizes its intermediate reasoning and shares it with the others through a routing mechanism, enabling every path to incorporate peer insights as it unfolds. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP scores nearly 5 absolute points higher than the baseline on average and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis shows that timely peer insights let LeaP correct errors robustly, with strong error tolerance across tasks of varied difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/.
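The core mechanism is a summarize-share-reflect cycle that runs at fixed token intervals across parallel reasoning paths. The sketch below illustrates that cycle under stated assumptions: `generate_tokens`, `summarize`, and `route_peers` are hypothetical stand-ins for the LRM calls and the paper's routing mechanism, and the interval `T` is an illustrative value, not the authors' setting.

```python
# Minimal sketch of one LeaP-style inference round, assuming N parallel
# reasoning paths decoded in lockstep. All functions here are hypothetical
# stand-ins; the paper's actual decoding, summarization prompts, and
# routing strategy may differ.

T = 512          # assumed token interval between peer-sharing rounds
NUM_PATHS = 4    # number of parallel reasoning paths

def generate_tokens(context: str, n: int) -> str:
    """Stand-in for the LRM continuing `context` by n more tokens."""
    return f" <{n} more reasoning tokens> "

def summarize(context: str) -> str:
    """Stand-in for prompting the LRM to condense its intermediate reasoning."""
    return f"summary of a {len(context)}-char trace"

def route_peers(own: str, all_summaries: list[str], k: int = 2) -> list[str]:
    """Toy router: each path receives up to k summaries from other paths.
    (The paper routes by comparing summaries; a simple filter stands in here.)"""
    return [s for s in all_summaries if s is not own][:k]

def leap_round(contexts: list[str]) -> list[str]:
    # 1. Each path advances T tokens independently.
    contexts = [c + generate_tokens(c, T) for c in contexts]
    # 2. Each path summarizes its intermediate reasoning.
    summaries = [summarize(c) for c in contexts]
    # 3. Routed peer summaries are injected so each path can reflect on them.
    return [
        c + "".join(f"\n[Peer insight] {p}" for p in route_peers(s, summaries))
        for c, s in zip(contexts, summaries)
    ]

# Repeating leap_round until each path emits a final answer completes the scheme.
contexts = ["Problem: ..." for _ in range(NUM_PATHS)]
contexts = leap_round(contexts)
```

In practice the summaries would be produced and consumed inside each path's chain of thought; the list-of-strings representation here is purely for exposition.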
Community
Large Reasoning Models often get stuck when they start reasoning incorrectly (the "Prefix Dominance Trap"). We propose LeaP (Learning from Peers), a method in which parallel reasoning paths share intermediate summaries to learn from one another and self-correct during inference. We also release LeaP-T, a model series fine-tuned for this framework. Experiments show LeaP significantly boosts reasoning performance (e.g., +5 points on average for QwQ-32B) and error recovery on benchmarks such as AIME and GPQA. Our code, datasets, and models are available at https://learning-from-peers.github.io/.
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking (2025)
- ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning (2025)
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models (2025)
- Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time (2025)
- GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models (2025)
- Dynamic Early Exit in Reasoning Models (2025)
- Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods (2025)