# GRPO-LEAD: Efficient Reasoning Enhancement for Mathematical Tasks

## Overview
GRPO-LEAD (GRPO with Length-dependent rewards, Explicit penalties, and Advantage reweighting for Difficulty) is an advanced reinforcement learning pipeline designed to fine-tune large language models (LLMs) for concise, accurate, and efficient reasoning in mathematical tasks.
## Performance Benchmarks

The following benchmarks were run on the AIME24 and AIME25 datasets with a 14k maximum token budget, temperature 0.6, min-p 0.01, and 32 samples per question.
| Model | AIME24 Cons@32 | AIME24 Pass@1 | AIME24 Avg. Length | AIME25 Cons@32 | AIME25 Pass@1 | AIME25 Avg. Length |
|---|---|---|---|---|---|---|
| DeepSeek-Distilled-14B | 0.800 | 0.614 | 9182 | 0.633 | 0.429 | 10046 |
| Light-R1-14B-DS | 0.833 | 0.641 | 9571 | 0.767 | 0.505 | 10194 |
| LEAD-14B (ours) | 0.867 | 0.650 | 8267 | 0.767 | 0.539 | 8668 |
Our GRPO-LEAD model matches or exceeds both baselines in consistency (Cons@32) and accuracy (Pass@1) while producing noticeably shorter solutions, demonstrating improved reasoning efficiency.
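For reference, Pass@1 is the average per-sample accuracy over the 32 generations and Cons@32 is the accuracy of a majority vote over them. The sketch below illustrates these metrics under their standard definitions; it is not taken from the evaluation code.

```python
# Sketch of the reported metrics for one question, given 32 extracted \boxed{} answers.
# Assumption: standard pass@1 / consensus definitions; the official harness may differ.
from collections import Counter

def pass_at_1(samples: list[str], reference: str) -> float:
    """Fraction of sampled answers matching the reference (averaged over questions for the benchmark score)."""
    return sum(ans == reference for ans in samples) / len(samples)

def cons_at_k(samples: list[str], reference: str) -> bool:
    """Whether the majority-vote answer over the k samples matches the reference."""
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return majority_answer == reference

answers = ["204"] * 20 + ["210"] * 12   # hypothetical 32 samples for one problem
print(pass_at_1(answers, "204"))        # 0.625
print(cons_at_k(answers, "204"))        # True
```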
## Usage

To get the best performance on mathematical problems, use the following prompt format:
```python
[
    {
        "role": "user",
        "content": question + "\nLet's think step by step and output the final answer within \\boxed{}."
    }
]
```
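A minimal generation sketch with Hugging Face Transformers is shown below. The model id is assumed to match the repository name and the sampling settings are taken from the benchmark section above; adjust both to your setup.

```python
# Minimal sketch, assuming the model id "PlanePaper/LEAD-14B" (adjust to the actual
# repository) and the evaluation settings above (temperature 0.6, min-p 0.01, 14k tokens).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PlanePaper/LEAD-14B"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

question = "Find the remainder when 7^2025 is divided by 100."
messages = [
    {
        "role": "user",
        "content": question + "\nLet's think step by step and output the final answer within \\boxed{}."
    }
]

# Build the chat prompt and sample one solution.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=14000, do_sample=True, temperature=0.6, min_p=0.01)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```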
## Code and Documentation
For complete details, codebase, and usage examples, please visit our GitHub repository:
## Dataset: GRPO-LEAD-SFTData
We release GRPO-LEAD-SFTData, a curated collection of 12,153 high-quality mathematical reasoning samples for supervised fine-tuning. The samples were generated with QwQ-32B and are derived primarily from the DeepScaler dataset; we retain only examples with difficulty > 1, targeting challenging problem-solving scenarios. All entries are structured for seamless integration with LLaMA Factory and follow a standardized SFT-ready format.

Used as the training data for GRPO-LEAD's supervised fine-tuning stage, this dataset strengthens the model's base capability in solving mathematical problems.
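A minimal loading sketch with the `datasets` library is shown below; the Hub repository id and field names are assumptions, so check the dataset card for the exact schema.

```python
# Minimal sketch, assuming the dataset is hosted on the Hugging Face Hub under
# "PlanePaper/GRPO-LEAD-SFTData"; verify the repo id and column names before training.
from datasets import load_dataset

dataset = load_dataset("PlanePaper/GRPO-LEAD-SFTData", split="train")
print(len(dataset))        # expected: 12,153 SFT samples
print(dataset[0].keys())   # inspect the LLaMA Factory-style fields
```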
## Citation
If you find our work useful, please cite it as:
```bibtex
@misc{zhang2025grpoleaddifficultyawarereinforcementlearning,
      title={GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models},
      author={Jixiao Zhang and Chunsheng Zuo},
      year={2025},
      eprint={2504.09696},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.09696},
}
```
Enjoy exploring GRPO-LEAD!
Base model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B