Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities
Abstract
TAPO is a novel RL framework that augments policy optimization with external high-level guidance ("thought patterns"), improving reasoning performance and exploration over existing RL methods.
Reinforcement learning (RL) has emerged as an effective method for training reasoning models. However, existing RL approaches typically bias the model's output distribution toward reward-maximizing paths without introducing external knowledge. This limits their exploration capacity and results in a narrower reasoning capability boundary compared to base models. To address this limitation, we propose TAPO (Thought-Augmented Policy Optimization), a novel framework that augments RL by incorporating external high-level guidance ("thought patterns"). By adaptively integrating structured thoughts during training, TAPO effectively balances model-internal exploration and external guidance exploitation. Extensive experiments show that our approach significantly outperforms GRPO by 99% on AIME, 41% on AMC, and 17% on Minerva Math. Notably, these high-level thought patterns, abstracted from only 500 prior samples, generalize effectively across various tasks and models. This highlights TAPO's potential for broader applications across multiple tasks and domains. Our further analysis reveals that introducing external guidance produces powerful reasoning models with superior explainability of inference behavior and enhanced output readability.
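To make the mechanism concrete, below is a minimal, illustrative Python sketch of how thought-augmented rollouts could be mixed into a GRPO-style sampling group before computing group-relative advantages. All names here (THOUGHT_PATTERNS, build_prompt, collect_group, guided_ratio) and the fixed guided/unguided split are assumptions for illustration only; the paper's actual method integrates thought patterns adaptively during training rather than with a fixed ratio.

```python
import random
import statistics

# Hypothetical library of high-level thought patterns abstracted from prior
# problems (placeholders, not the paper's actual patterns).
THOUGHT_PATTERNS = [
    "First restate the problem, then enumerate cases and check each one.",
    "Introduce a variable for the unknown, set up an equation, and solve it.",
    "Work backwards from the target quantity to the given conditions.",
]

def build_prompt(question: str, use_thought: bool) -> str:
    """Optionally prepend an external thought pattern to the question."""
    if use_thought:
        pattern = random.choice(THOUGHT_PATTERNS)
        return f"Guidance: {pattern}\n\nProblem: {question}"
    return f"Problem: {question}"

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize rewards within a sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

def collect_group(question: str, generate, reward_fn,
                  group_size: int = 8, guided_ratio: float = 0.5):
    """Sample a mixed group of plain and thought-augmented rollouts.

    `generate` and `reward_fn` are stand-ins for the policy model's sampler
    and the task reward (e.g. answer correctness).
    """
    rollouts, rewards = [], []
    for i in range(group_size):
        use_thought = (i / group_size) < guided_ratio
        prompt = build_prompt(question, use_thought)
        completion = generate(prompt)
        rollouts.append((prompt, completion))
        rewards.append(reward_fn(question, completion))
    return rollouts, group_relative_advantages(rewards)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    fake_generate = lambda prompt: f"<answer to: {prompt[:30]}...>"
    fake_reward = lambda q, c: random.random()
    rollouts, advantages = collect_group("What is 17 * 24?", fake_generate, fake_reward)
    for (prompt, _), adv in zip(rollouts, advantages):
        print(f"advantage={adv:+.2f}  guided={'Guidance:' in prompt}")
```

The sketch only shows the rollout-collection side: plain rollouts preserve model-internal exploration, while guided rollouts exploit external thought patterns, and both compete within the same group-normalized advantage computation.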
Community
We are excited to announce our latest research paper, "Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities"! This work introduces TAPO, a novel reinforcement learning (RL) framework designed to significantly enhance the reasoning capabilities of large language models (LLMs).
Unlike existing RL approaches, which often bias the model toward self-generated reward-maximizing paths without introducing external knowledge, TAPO provides high-level external guidance in the form of "thought patterns". By adaptively integrating these structured thoughts during training, TAPO strikes a balance between model-internal exploration and exploitation of external guidance. We demonstrate that TAPO significantly outperforms GRPO, with gains of 99% on AIME, 41% on AMC, and 17% on Minerva Math. Notably, these high-level thought patterns are abstracted from only 500 prior samples and generalize well across tasks and models. TAPO also yields reasoning models with more explainable inference behavior and more readable outputs.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Learning to Reason under Off-Policy Guidance (2025)
- Entropy-guided sequence weighting for efficient exploration in RL-based LLM fine-tuning (2025)
- Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models (2025)
- AAPO: Enhance the Reasoning Capabilities of LLMs with Advantage Momentum (2025)
- Training Large Language Models to Reason via EM Policy Gradient (2025)
- Improving RL Exploration for LLM Reasoning through Retrospective Replay (2025)
- SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning (2025)