GSPO-token

In the paper Group Sequence Policy Optimization, the authors propose a token-level variant of the GSPO objective, called GSPO-token. To use GSPO-token, use the GRPOTrainer class from trl.experimental.gspo_token.

Usage

from trl.experimental.gspo_token import GRPOTrainer
from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence_token",
    ...
)
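
For context, a minimal end-to-end sketch is shown below. The model id, dataset, output directory, and reward function are illustrative placeholders, not part of the original example; it assumes the standard GRPOTrainer constructor arguments (model, reward_funcs, args, train_dataset).

from datasets import load_dataset

from trl import GRPOConfig
from trl.experimental.gspo_token import GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder dataset

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 20 characters.
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(
    output_dir="gspo-token-demo",  # placeholder output directory
    importance_sampling_level="sequence_token",
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # placeholder model id
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()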

To leverage GSPO-token, the user needs to provide the per-token advantage $\hat{A}_{i,t}$ for each token $t$ in sequence $i$, i.e., make $\hat{A}_{i,t}$ vary with $t$, which is not the case in the current implementation, where $\hat{A}_{i,t} = \hat{A}_i$. Otherwise, the GSPO-token gradient is equivalent to that of the original GSPO implementation.
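
To make the distinction concrete, the sketch below (not TRL code; the token-weighting rule is purely an illustrative assumption) contrasts the default broadcast of a sequence-level advantage with a hypothetical token-varying advantage, which is the case GSPO-token is designed to handle.

import torch

# One group of G completions, each of length T (after padding).
G, T = 4, 6
sequence_advantages = torch.tensor([0.5, -0.3, 1.2, -0.8])  # \hat{A}_i, one value per sequence

# Default behaviour: the sequence-level advantage is broadcast to every token,
# so \hat{A}_{i,t} = \hat{A}_i and GSPO-token reduces to GSPO.
broadcast_advantages = sequence_advantages.unsqueeze(1).expand(G, T)

# Hypothetical token-varying advantages: later tokens are down-weighted here
# only as an illustration; any per-token credit-assignment scheme could be used.
decay = torch.linspace(1.0, 0.5, T)
token_advantages = sequence_advantages.unsqueeze(1) * decay  # \hat{A}_{i,t} varies with t

print(broadcast_advantages[0])  # constant across t
print(token_advantages[0])      # varies across t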
