arxiv:2505.15692

Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities

Published on May 21 · Submitted by Jinyang23 on May 26

Abstract

TAPO, a novel RL framework, integrates external high-level guidance into training, improving model performance and exploration over existing RL methods.

AI-generated summary

Reinforcement learning (RL) has emerged as an effective method for training reasoning models. However, existing RL approaches typically bias the model's output distribution toward reward-maximizing paths without introducing external knowledge. This limits their exploration capacity and results in a narrower reasoning capability boundary compared to base models. To address this limitation, we propose TAPO (Thought-Augmented Policy Optimization), a novel framework that augments RL by incorporating external high-level guidance ("thought patterns"). By adaptively integrating structured thoughts during training, TAPO effectively balances model-internal exploration and external guidance exploitation. Extensive experiments show that our approach significantly outperforms GRPO by 99% on AIME, 41% on AMC, and 17% on Minerva Math. Notably, these high-level thought patterns, abstracted from only 500 prior samples, generalize effectively across various tasks and models. This highlights TAPO's potential for broader applications across multiple tasks and domains. Our further analysis reveals that introducing external guidance produces powerful reasoning models with superior explainability of inference behavior and enhanced output readability.
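
To make the core idea concrete, here is a minimal, hypothetical sketch of how a high-level "thought pattern" could be prepended to some rollouts in a GRPO-style group, so the policy exploits external guidance while still exploring on its own. The pattern list, prompt format, and all function names are illustrative assumptions, not TAPO's actual implementation; in the paper, pattern selection is adaptive rather than random.

```python
import random
from dataclasses import dataclass

# Toy pool of high-level thought patterns (assumed examples, not the paper's).
THOUGHT_PATTERNS = [
    "First restate the problem, then enumerate the possible cases.",
    "Look for an invariant or symmetry that simplifies the problem.",
    "Work backwards from the desired quantity and verify each step.",
]

@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float

def build_prompt(question: str, guided: bool) -> str:
    """Optionally prepend a structured thought pattern as external guidance."""
    if guided:
        pattern = random.choice(THOUGHT_PATTERNS)  # TAPO selects patterns adaptively; random here
        return f"Guidance: {pattern}\n\nProblem: {question}\nSolution:"
    return f"Problem: {question}\nSolution:"

def sample_group(generate, score, question: str,
                 group_size: int = 8, guidance_ratio: float = 0.5) -> list[Rollout]:
    """Sample a GRPO-style group in which a fraction of rollouts receive external
    guidance, balancing model-internal exploration against guidance exploitation."""
    group = []
    for _ in range(group_size):
        prompt = build_prompt(question, guided=random.random() < guidance_ratio)
        completion = generate(prompt)                     # caller-supplied policy sampler
        group.append(Rollout(prompt, completion, score(question, completion)))
    return group

def group_advantages(group: list[Rollout]) -> list[float]:
    """Group-relative advantages as in GRPO: each reward minus the group mean."""
    mean = sum(r.reward for r in group) / len(group)
    return [r.reward - mean for r in group]
```

The caller supplies `generate` (the policy sampler) and `score` (a rule-based verifier); the resulting advantages would then feed the usual GRPO policy update.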

Community

Paper author and submitter:

🚀 We are excited to announce our latest research paper, "Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities"! This work introduces TAPO, a novel reinforcement learning (RL) framework designed to significantly enhance the reasoning capabilities of large language models (LLMs).

🌟 Unlike existing RL approaches, which tend to bias the model toward self-generated reward-maximizing paths without introducing external knowledge, TAPO incorporates high-level external guidance in the form of "thought patterns". By adaptively integrating these structured thoughts during training, TAPO achieves a powerful balance between model-internal exploration and external guidance exploitation. We demonstrate that TAPO significantly outperforms GRPO, achieving gains of 99% on AIME, 41% on AMC, and 17% on Minerva Math. Notably, these high-level thought patterns are abstracted from only 500 prior samples, showing strong generalization across various tasks and models. Furthermore, TAPO leads to reasoning models with superior explainability of inference behavior and enhanced output readability.
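
As a rough illustration of how high-level thought patterns might be abstracted from a small pool of previously solved problems (the paper reports needing only about 500 such samples), here is a hypothetical sketch. The keyword-based abstraction, the toy samples, and the strategy taxonomy are all assumptions for illustration only; TAPO's actual extraction procedure is described in the paper.

```python
from collections import Counter

# Toy prior samples pairing problems with worked solutions (illustrative only).
PRIOR_SAMPLES = [
    {"problem": "Count the lattice paths ...", "solution": "Enumerate cases by the first step ..."},
    {"problem": "Prove the sum is even ...", "solution": "Use the parity invariant ..."},
    {"problem": "Find x such that ...", "solution": "Work backwards from the target value ..."},
]

# Map surface cues in solutions to reusable high-level strategies (assumed taxonomy).
STRATEGY_CUES = {
    "enumerate": "Case enumeration: split the problem into exhaustive cases.",
    "invariant": "Invariant reasoning: find a quantity preserved by every step.",
    "backwards": "Backward chaining: start from the goal and reverse the derivation.",
}

def abstract_thought_patterns(samples, top_k: int = 3) -> list[str]:
    """Return the most frequent high-level strategies observed in the solution pool."""
    counts = Counter()
    for sample in samples:
        text = sample["solution"].lower()
        for cue, pattern in STRATEGY_CUES.items():
            if cue in text:
                counts[pattern] += 1
    return [pattern for pattern, _ in counts.most_common(top_k)]

if __name__ == "__main__":
    for pattern in abstract_thought_patterns(PRIOR_SAMPLES):
        print(pattern)
```

The resulting pattern pool would then serve as the external guidance injected into rollouts during training, as in the earlier sketch.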

