Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities
Abstract
TAPO is a novel RL framework that augments policy optimization with external high-level guidance ("thought patterns"), improving reasoning performance and exploration over existing RL methods.
Reinforcement learning (RL) has emerged as an effective method for training reasoning models. However, existing RL approaches typically bias the model's output distribution toward reward-maximizing paths without introducing external knowledge. This limits their exploration capacity and results in a narrower reasoning capability boundary compared to base models. To address this limitation, we propose TAPO (Thought-Augmented Policy Optimization), a novel framework that augments RL by incorporating external high-level guidance ("thought patterns"). By adaptively integrating structured thoughts during training, TAPO effectively balances model-internal exploration and external guidance exploitation. Extensive experiments show that our approach significantly outperforms GRPO by 99% on AIME, 41% on AMC, and 17% on Minerva Math. Notably, these high-level thought patterns, abstracted from only 500 prior samples, generalize effectively across various tasks and models. This highlights TAPO's potential for broader applications across multiple tasks and domains. Our further analysis reveals that introducing external guidance produces powerful reasoning models with superior explainability of inference behavior and enhanced output readability.
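To make the mechanism concrete, below is a minimal, illustrative Python sketch of how thought-augmented rollouts could be mixed into a GRPO-style sampling group before computing group-relative advantages. All names here (THOUGHT_PATTERNS, build_prompt, collect_group, guided_ratio) and the fixed guided/unguided split are assumptions for illustration only; the paper's actual method integrates thought patterns adaptively during training rather than with a fixed ratio.

```python
import random
import statistics

# Hypothetical library of high-level thought patterns abstracted from prior
# problems (placeholders, not the paper's actual patterns).
THOUGHT_PATTERNS = [
    "First restate the problem, then enumerate cases and check each one.",
    "Introduce a variable for the unknown, set up an equation, and solve it.",
    "Work backwards from the target quantity to the given conditions.",
]

def build_prompt(question: str, use_thought: bool) -> str:
    """Optionally prepend an external thought pattern to the question."""
    if use_thought:
        pattern = random.choice(THOUGHT_PATTERNS)
        return f"Guidance: {pattern}\n\nProblem: {question}"
    return f"Problem: {question}"

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize rewards within a sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

def collect_group(question: str, generate, reward_fn,
                  group_size: int = 8, guided_ratio: float = 0.5):
    """Sample a mixed group of plain and thought-augmented rollouts.

    `generate` and `reward_fn` are stand-ins for the policy model's sampler
    and the task reward (e.g. answer correctness).
    """
    rollouts, rewards = [], []
    for i in range(group_size):
        use_thought = (i / group_size) < guided_ratio
        prompt = build_prompt(question, use_thought)
        completion = generate(prompt)
        rollouts.append((prompt, completion))
        rewards.append(reward_fn(question, completion))
    return rollouts, group_relative_advantages(rewards)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    fake_generate = lambda prompt: f"<answer to: {prompt[:30]}...>"
    fake_reward = lambda q, c: random.random()
    rollouts, advantages = collect_group("What is 17 * 24?", fake_generate, fake_reward)
    for (prompt, _), adv in zip(rollouts, advantages):
        print(f"advantage={adv:+.2f}  guided={'Guidance:' in prompt}")
```

The sketch only shows the rollout-collection side: plain rollouts preserve model-internal exploration, while guided rollouts exploit external thought patterns, and both compete within the same group-normalized advantage computation.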
Community
We are excited to announce our latest research paper, "Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities"! This work introduces TAPO, a novel reinforcement learning (RL) framework designed to significantly enhance the reasoning capabilities of large language models (LLMs).
Unlike existing RL approaches, which often bias the model toward self-generated reward-maximizing paths without introducing external knowledge, TAPO provides high-level external guidance in the form of "thought patterns". By adaptively integrating these structured thoughts during training, TAPO strikes a balance between model-internal exploration and exploitation of external guidance. We demonstrate that TAPO significantly outperforms GRPO, with gains of 99% on AIME, 41% on AMC, and 17% on Minerva Math. Notably, these high-level thought patterns are abstracted from only 500 prior samples and generalize well across tasks and models. TAPO also yields reasoning models with more explainable inference behavior and more readable outputs.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Learning to Reason under Off-Policy Guidance (2025)
- Entropy-guided sequence weighting for efficient exploration in RL-based LLM fine-tuning (2025)
- Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models (2025)
- AAPO: Enhance the Reasoning Capabilities of LLMs with Advantage Momentum (2025)
- Training Large Language Models to Reason via EM Policy Gradient (2025)
- Improving RL Exploration for LLM Reasoning through Retrospective Replay (2025)
- SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning (2025)