Abstract
Large Reasoning Models (LRMs) can self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but flawed beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, at fixed token intervals during inference, each reasoning path summarizes its intermediate reasoning and shares it with the others through a routing mechanism, enabling every path to incorporate peer insights as it unfolds. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP scores nearly 5 absolute points higher than the baseline on average and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis shows that timely peer insights let LeaP correct errors robustly, with strong error tolerance across tasks of varied difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/.
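The core mechanism is a summarize-share-reflect cycle that runs at fixed token intervals across parallel reasoning paths. The sketch below illustrates that cycle under stated assumptions: `generate_tokens`, `summarize`, and `route_peers` are hypothetical stand-ins for the LRM calls and the paper's routing mechanism, and the interval `T` is an illustrative value, not the authors' setting.

```python
# Minimal sketch of one LeaP-style inference round, assuming N parallel
# reasoning paths decoded in lockstep. All functions here are hypothetical
# stand-ins; the paper's actual decoding, summarization prompts, and
# routing strategy may differ.

T = 512          # assumed token interval between peer-sharing rounds
NUM_PATHS = 4    # number of parallel reasoning paths

def generate_tokens(context: str, n: int) -> str:
    """Stand-in for the LRM continuing `context` by n more tokens."""
    return f" <{n} more reasoning tokens> "

def summarize(context: str) -> str:
    """Stand-in for prompting the LRM to condense its intermediate reasoning."""
    return f"summary of a {len(context)}-char trace"

def route_peers(own: str, all_summaries: list[str], k: int = 2) -> list[str]:
    """Toy router: each path receives up to k summaries from other paths.
    (The paper routes by comparing summaries; a simple filter stands in here.)"""
    return [s for s in all_summaries if s is not own][:k]

def leap_round(contexts: list[str]) -> list[str]:
    # 1. Each path advances T tokens independently.
    contexts = [c + generate_tokens(c, T) for c in contexts]
    # 2. Each path summarizes its intermediate reasoning.
    summaries = [summarize(c) for c in contexts]
    # 3. Routed peer summaries are injected so each path can reflect on them.
    return [
        c + "".join(f"\n[Peer insight] {p}" for p in route_peers(s, summaries))
        for c, s in zip(contexts, summaries)
    ]

# Repeating leap_round until each path emits a final answer completes the scheme.
contexts = ["Problem: ..." for _ in range(NUM_PATHS)]
contexts = leap_round(contexts)
```

In practice the summaries would be produced and consumed inside each path's chain of thought; the list-of-strings representation here is purely for exposition.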
Community
Large Reasoning Models often get stuck when they start reasoning incorrectly (the "Prefix Dominance Trap"). We propose LeaP (Learning from Peers), a method in which parallel reasoning paths share intermediate summaries to learn from one another and self-correct during inference. We also release LeaP-T, a model series fine-tuned for this framework. Experiments show LeaP significantly boosts reasoning performance (e.g., +5 points on average for QwQ-32B) and error recovery on benchmarks such as AIME and GPQA. Our code, datasets, and models are available at https://learning-from-peers.github.io/.
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking (2025)
- ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning (2025)
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models (2025)
- Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time (2025)
- GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models (2025)
- Dynamic Early Exit in Reasoning Models (2025)
- Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods (2025)