# KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning
Wei Sun, Wen Yang, Pu Jian, Qianlong Du, Fuwei Cui, Shuo Ren, Jiajun Zhang

Institute of Automation, Chinese Academy of Sciences
## Overview
Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models (LLMs), even without supervised fine-tuning (SFT). However, prevalent reinforcement learning algorithms such as GRPO and its variants (e.g., DAPO) suffer from a coarse granularity issue when computing the advantage: they compute rollout-level advantages that assign identical values to every token within a sequence, failing to capture token-specific contributions. To address this limitation, we propose Key-token Advantage Estimation (**KTAE**), a novel algorithm that estimates fine-grained, token-level advantages without introducing additional models. KTAE leverages the correctness of sampled rollouts and applies statistical analysis to quantify the importance of individual tokens within a sequence to the final outcome. This quantified token-level importance is then combined with the rollout-level advantage to obtain a more fine-grained token-level advantage estimate. Empirical results show that models trained with GRPO+KTAE and DAPO+KTAE outperform baseline methods across five mathematical reasoning benchmarks. Notably, they achieve higher accuracy with shorter responses and even surpass R1-Distill-Qwen-1.5B using the same base model.
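For context, the coarse granularity issue comes from GRPO's standard group-normalized advantage, which assigns one scalar per rollout and broadcasts it to every token (this is the standard GRPO estimator, not anything KTAE-specific):

$$
\hat{A}_{i,t} \;=\; \frac{r_i - \operatorname{mean}\!\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\!\left(\{r_j\}_{j=1}^{G}\right)}
\qquad \text{for every token } t \text{ of rollout } i,
$$

so the per-token learning signal is constant within a rollout. KTAE refines exactly this signal.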
In summary, the KTAE algorithm offers several advantages (a schematic sketch of the computation follows this list):

- KTAE provides more fine-grained advantage information without introducing extra models, resulting in lower training costs.
- KTAE directly computes importance differences between tokens using statistical analysis, offering strong interpretability.
- KTAE's key-token value is computed from the correctness of the final answer and retains the original rollout-level advantage, making it less susceptible to reward hacking.
- KTAE makes the model attend more to key tokens and suppresses learning on irrelevant tokens, which effectively reduces response length.
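The sketch below illustrates the general idea only. It assumes a GRPO-style group-normalized rollout advantage, a hypothetical per-token association score built from 2×2 counts of token occurrence in correct vs. incorrect rollouts, and an additive combination with a weighting coefficient `alpha`; the exact statistical test and combination rule in the paper may differ, so treat this as a schematic rather than the repo's implementation.

```python
import numpy as np

def rollout_level_advantage(rewards):
    """Standard GRPO-style group-normalized advantage: one scalar per
    rollout, broadcast to all of its tokens."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def key_token_scores(rollouts, correct):
    """Hypothetical per-token importance score.

    For each token id, count how often it appears in correct vs. incorrect
    rollouts (a 2x2 contingency view) and score it by the difference of the
    two appearance rates. The paper applies a statistical test over such
    counts; the test and normalization used here are illustrative stand-ins.
    """
    vocab = {tok for rollout in rollouts for tok in rollout}
    n_pos = max(sum(correct), 1)                  # number of correct rollouts
    n_neg = max(len(correct) - sum(correct), 1)   # number of incorrect rollouts
    scores = {}
    for tok in vocab:
        in_pos = sum(1 for r, c in zip(rollouts, correct) if c and tok in r)
        in_neg = sum(1 for r, c in zip(rollouts, correct) if not c and tok in r)
        scores[tok] = in_pos / n_pos - in_neg / n_neg
    return scores

def ktae_advantages(rollouts, rewards, alpha=1.0):
    """Token-level advantages: rollout-level advantage plus an alpha-weighted
    key-token score (alpha and the additive combination are assumptions)."""
    adv = rollout_level_advantage(rewards)
    # Assumes a rule-based binary reward where > 0 means a correct answer.
    scores = key_token_scores(rollouts, [r > 0 for r in rewards])
    return [[a + alpha * scores[tok] for tok in rollout]
            for rollout, a in zip(rollouts, adv)]

# Toy usage: three sampled rollouts (token ids) for one prompt.
rollouts = [[3, 7, 7, 2], [3, 5, 2], [9, 5, 2]]
rewards = [1.0, 0.0, 1.0]  # 1 = correct final answer, 0 = incorrect
print(ktae_advantages(rollouts, rewards))
```

Because the key-token score is added on top of the unchanged rollout-level advantage, every token still inherits the outcome signal; the per-token term only redistributes emphasis toward tokens associated with correct answers.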
## Update
- [21/05/2025] Key-token Advantage Estimation is coming!
## Available Models
| Model Size | DAPO+KTAE | GRPO+KTAE |
| --- | --- | --- |
| 1.5B | | |
| 7B | | |
## Setup
Please follow the instructions below to install the required packages.
- Clone this repository

```bash
git clone https://github.com/xiaolizh1/KTAE.git
```
- Install Package
```bash
conda create -n KTAE python=3.10 -y
conda activate KTAE
cd KTAE
pip install -r requirements.txt
```
## Train
Our training is mostly performed on the verl codebase, with some modifications.
### GRPO+KTAE
```bash
bash examples/grpo_trainer/run_qwen2.5_7b.sh        # train the 7B model
bash examples/grpo_trainer/run_qwen2.5_math_1.5b.sh # train the 1.5B model
```
### DAPO+KTAE
```bash
bash recipe/dapo/run_dapo_qwen2.5_7b.sh   # train the 7B model
bash recipe/dapo/run_dapo_qwen2.5_1.5b.sh # train the 1.5B model
```
### Merge Model
```bash
cd scripts
bash merge_model.sh # merge checkpoint
```
## Evaluation
Our evaluation code is based on Dr. GRPO.
```bash
cd eval
bash run_eval.sh
```
## Experiments
We provide some results in this section. More detailed results can be found in our paper.

### Main Results

- Method validation results *(figure)*.
- Comparison with baselines on accuracy *(figure)*.
- Comparison with baselines on efficiency *(figure)*.
### More Analysis

- Ablation analysis *(figure)*.
- Visualization example *(figure)*.
## Citation
If you find this repo useful for your research, please consider citing the paper:
```bibtex
@misc{sun2025ktaemodelfreealgorithmkeytokens,
      title={KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning},
      author={Wei Sun and Wen Yang and Pu Jian and Qianlong Du and Fuwei Cui and Shuo Ren and Jiajun Zhang},
      year={2025},
      eprint={2505.16826},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.16826},
}
```
## Acknowledgement
We would like to thank the following repos for their great work:

- verl for providing the training framework
- vLLM for the efficient, high-throughput inference engine
- transformers for providing the model base and fine-tuning framework
## License
This project is released under the Apache 2.0 license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.