Abstract
Geometric-Mean Policy Optimization (GMPO) stabilizes policy updates in large language models by maximizing the geometric mean of token-level rewards, improving performance on mathematical benchmarks and a multimodal reasoning benchmark.
Recent advances, such as Group Relative Policy Optimization (GRPO), have enhanced the reasoning capabilities of large language models by optimizing the arithmetic mean of token-level rewards. However, GRPO suffers from unstable policy updates when processing tokens with outlier importance-weighted rewards, which manifest as extreme importance sampling ratios during training, i.e., extreme values of the ratio between the probabilities that the current and old policies assign to a token. In this work, we propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and keeps the importance sampling ratios within a more stable range. In addition, we provide comprehensive theoretical and experimental analysis to justify the design and stability benefits of GMPO. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks (AIME24, AMC, MATH500, OlympiadBench, and Minerva) and by 1.4% on a multimodal reasoning benchmark (Geometry3K). Code is available at https://github.com/callsys/GMPO.
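For intuition, here is a minimal PyTorch sketch of the core computation, assuming a single sampled sequence with a scalar sequence-level advantage. The function name, the `eps` value, and the simplified clamp-style clipping are illustrative assumptions, not the exact implementation from the GMPO repository; the key identity is that the geometric mean of token-level importance ratios equals the exponential of the arithmetic mean of their logarithms, so one outlier token can shift the objective by at most `eps / T` in log space.

```python
import torch

def gmpo_sequence_loss(logp_new, logp_old, advantage, eps=0.4):
    """Sketch of a geometric-mean policy objective for one sequence.

    logp_new, logp_old: (T,) per-token log-probabilities under the
    current and old policies; advantage: scalar sequence advantage.
    """
    # Per-token log importance ratios log(pi_theta / pi_old).
    log_ratio = logp_new - logp_old.detach()
    # Clip in log space: bounds each ratio to [e^-eps, e^eps], so no
    # single outlier token can dominate the averaged objective.
    log_ratio = torch.clamp(log_ratio, -eps, eps)
    # Geometric mean of ratios = exp(arithmetic mean of log ratios).
    geo_mean_ratio = torch.exp(log_ratio.mean())
    # Maximize the advantage-weighted geometric mean (negate for a loss).
    return -(advantage * geo_mean_ratio)

# Toy usage: 16 tokens, positive advantage.
logp_new = torch.randn(16, requires_grad=True)
logp_old = logp_new.detach() + 0.1 * torch.randn(16)
loss = gmpo_sequence_loss(logp_new, logp_old, torch.tensor(1.0))
loss.backward()
```

Note that clipping the log ratio symmetrically, rather than min-clipping the raw ratio as in PPO/GRPO, is a simplification here; in particular, sign handling for negative advantages is deferred to the released code.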
Community
Introducing Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and keeps importance sampling ratios within a more stable range. In addition, comprehensive theoretical and experimental analyses are conducted to justify the design and stability benefits of GMPO.
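In simplified form (omitting clipping and group normalization; the notation is illustrative, with $\hat{A}$ the sequence-level advantage and $|o|$ the response length), the two objectives differ only in how the token-level importance ratios are aggregated:

$$
J_{\text{GRPO}} \propto \frac{1}{|o|}\sum_{t=1}^{|o|}\frac{\pi_\theta(o_t\mid q,o_{<t})}{\pi_{\theta_{\text{old}}}(o_t\mid q,o_{<t})}\,\hat{A},
\qquad
J_{\text{GMPO}} \propto \Bigg(\prod_{t=1}^{|o|}\frac{\pi_\theta(o_t\mid q,o_{<t})}{\pi_{\theta_{\text{old}}}(o_t\mid q,o_{<t})}\Bigg)^{1/|o|}\hat{A}.
$$

A single token whose ratio explodes to, say, $10^3$ shifts the arithmetic mean by $10^3/|o|$, while it shifts the geometric mean only by a factor of $\exp(\ln 10^3/|o|)$, which is why the latter stays in a stable range.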
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- RePO: Replay-Enhanced Policy Optimization (2025)
- Group Sequence Policy Optimization (2025)
- APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization (2025)
- Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR (2025)
- Truncated Proximal Policy Optimization (2025)
- EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework (2025)
- UloRL: An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models' Reasoning Abilities (2025)