TRL documentation
Paper Index
Section under construction. Feel free to contribute!
Group Sequence Policy Optimization
📜 Paper: https://huggingface.co/papers/2507.18071
GSPO is a GRPO variant that computes importance sampling weights at the sequence level instead of per-token. To reproduce the paper’s setting, use this configuration:
from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",
    loss_type="grpo",
    steps_per_generation=...,
    beta=0.04,  # not explicitly specified in the paper, but they likely used the same value as in the GRPO paper
    epsilon=3e-4,  # https://x.com/ChujieZheng/status/1948933507696525392
)
While the original paper doesn't specify all of the hyperparameters used, sequence-level importance sampling only has an effect when training is slightly off-policy, for example when steps_per_generation > gradient_accumulation_steps or num_iterations > 1. Otherwise, it is effectively equivalent to standard GRPO.
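To make the difference concrete, here is a minimal sketch (with made-up per-token log-probabilities, not TRL internals) contrasting the two importance-sampling levels: GRPO computes one ratio per token, while GSPO computes a single length-normalized ratio per sequence and applies it uniformly to all of that sequence's tokens.

```python
import math

# Hypothetical per-token log-probs for one sampled sequence under the
# current policy and under the (older) policy that generated it.
logp_new = [-1.2, -0.8, -2.1, -0.5]
logp_old = [-1.0, -0.9, -2.0, -0.7]

# Token level (GRPO): one importance ratio per token.
token_ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

# Sequence level (GSPO): the per-token log-ratios are averaged over the
# sequence length, giving a single ratio that is broadcast to every token.
seq_log_ratio = sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new)
seq_ratio = math.exp(seq_log_ratio)

print(token_ratios)  # four distinct per-token ratios
print(seq_ratio)     # one shared sequence-level ratio
```

When training is fully on-policy, logp_new equals logp_old, every ratio is 1, and the two levels coincide, which is why the setting only matters in the slightly off-policy regimes described above.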