Abstract
Geometric-Mean Policy Optimization (GMPO) stabilizes policy updates in large language models by maximizing the geometric mean of token-level rewards, improving performance on mathematical benchmarks and a multimodal reasoning benchmark.
Recent advances, such as Group Relative Policy Optimization (GRPO), have enhanced the reasoning capabilities of large language models by optimizing the arithmetic mean of token-level rewards. However, GRPO suffers from unstable policy updates when processing tokens with outlier importance-weighted rewards, which manifest as extreme importance sampling ratios during training, i.e., extreme values of the ratio between the probabilities that the current and old policies assign to a token. In this work, we propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and keeps the importance sampling ratios within a more stable range. In addition, we provide comprehensive theoretical and experimental analysis to justify the design and stability benefits of GMPO. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks (AIME24, AMC, MATH500, OlympiadBench, and Minerva) and by 1.4% on a multimodal reasoning benchmark (Geometry3K). Code is available at https://github.com/callsys/GMPO.
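For intuition, here is a minimal PyTorch sketch of the core computation, assuming a single sampled sequence with a scalar sequence-level advantage. The function name, the `eps` value, and the simplified clamp-style clipping are illustrative assumptions, not the exact implementation from the GMPO repository; the key identity is that the geometric mean of token-level importance ratios equals the exponential of the arithmetic mean of their logarithms, so one outlier token can shift the objective by at most `eps / T` in log space.

```python
import torch

def gmpo_sequence_loss(logp_new, logp_old, advantage, eps=0.4):
    """Sketch of a geometric-mean policy objective for one sequence.

    logp_new, logp_old: (T,) per-token log-probabilities under the
    current and old policies; advantage: scalar sequence advantage.
    """
    # Per-token log importance ratios log(pi_theta / pi_old).
    log_ratio = logp_new - logp_old.detach()
    # Clip in log space: bounds each ratio to [e^-eps, e^eps], so no
    # single outlier token can dominate the averaged objective.
    log_ratio = torch.clamp(log_ratio, -eps, eps)
    # Geometric mean of ratios = exp(arithmetic mean of log ratios).
    geo_mean_ratio = torch.exp(log_ratio.mean())
    # Maximize the advantage-weighted geometric mean (negate for a loss).
    return -(advantage * geo_mean_ratio)

# Toy usage: 16 tokens, positive advantage.
logp_new = torch.randn(16, requires_grad=True)
logp_old = logp_new.detach() + 0.1 * torch.randn(16)
loss = gmpo_sequence_loss(logp_new, logp_old, torch.tensor(1.0))
loss.backward()
```

Note that clipping the log ratio symmetrically, rather than min-clipping the raw ratio as in PPO/GRPO, is a simplification here; in particular, sign handling for negative advantages is deferred to the released code.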
Community
Introducing Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and keeps importance sampling ratios within a more stable range. In addition, comprehensive theoretical and experimental analyses are conducted to justify the design and stability benefits of GMPO.
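In simplified form (omitting clipping and group normalization; the notation is illustrative, with $\hat{A}$ the sequence-level advantage and $|o|$ the response length), the two objectives differ only in how the token-level importance ratios are aggregated:

$$
J_{\text{GRPO}} \propto \frac{1}{|o|}\sum_{t=1}^{|o|}\frac{\pi_\theta(o_t\mid q,o_{<t})}{\pi_{\theta_{\text{old}}}(o_t\mid q,o_{<t})}\,\hat{A},
\qquad
J_{\text{GMPO}} \propto \Bigg(\prod_{t=1}^{|o|}\frac{\pi_\theta(o_t\mid q,o_{<t})}{\pi_{\theta_{\text{old}}}(o_t\mid q,o_{<t})}\Bigg)^{1/|o|}\hat{A}.
$$

A single token whose ratio explodes to, say, $10^3$ shifts the arithmetic mean by $10^3/|o|$, while it shifts the geometric mean only by a factor of $\exp(\ln 10^3/|o|)$, which is why the latter stays in a stable range.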
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- RePO: Replay-Enhanced Policy Optimization (2025)
- Group Sequence Policy Optimization (2025)
- APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization (2025)
- Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR (2025)
- Truncated Proximal Policy Optimization (2025)
- EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework (2025)
- UloRL: An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models' Reasoning Abilities (2025)