---
library_name: transformers
license: apache-2.0
datasets:
- agentica-org/DeepScaleR-Preview-Dataset
language:
- en
base_model:
- Qwen/Qwen2.5-7B
---

# Model Card

This model was trained with supervised fine-tuning (SFT) and reinforcement learning (RL) for mathematical reasoning as part of our MathIF project.

GitHub Repository: https://github.com/TingchenFu/MathIF

## Training Details

We base our experiments on the DeepScaleR dataset, which contains approximately 40k math reasoning samples. We first distill long chain-of-thought (CoT) reasoning traces from QwQ-32B, filtering out samples where QwQ-32B fails to generate a correct answer or where the CoT exceeds 8192 tokens. This results in 18k high-quality examples. Training is conducted on 16 NVIDIA H100 GPUs.

For reinforcement learning, we adopt the GRPO framework and use verifiable outcome-based rewards. The model is trained with the VeRL framework, with most hyperparameters left at their default settings.

## Evaluation

We decode with nucleus sampling (T=1.0, p=0.95) and a maximum generation length of 16,384 tokens, using the vLLM engine for efficient inference.

## Citation

**BibTeX:**

```
@article{fu2025scaling,
  title={Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models},
  author={Fu, Tingchen and Gu, Jiawei and Li, Yafu and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2505.14810},
  year={2025}
}
```
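
## Example: Filtering Distilled Traces

The distillation filter described under Training Details can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the `Sample` record, the `keep_sample` and `filter_samples` helpers, the exact-match answer check, and the whitespace token counter are all assumptions made for this example. A real run would presumably count tokens with the model tokenizer and verify answers with a math-aware checker.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_COT_TOKENS = 8192  # length cap stated in Training Details


@dataclass
class Sample:
    # Hypothetical record layout, chosen for this sketch only.
    question: str
    reference_answer: str
    cot: str               # distilled chain-of-thought trace
    predicted_answer: str  # final answer extracted from the trace


def whitespace_token_count(text: str) -> int:
    # Stand-in for a real tokenizer (assumption for illustration).
    return len(text.split())


def keep_sample(
    sample: Sample,
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_COT_TOKENS,
) -> bool:
    # Keep a trace only if the answer matches and the CoT fits the cap.
    # Exact string match is a simplification of answer verification.
    correct = sample.predicted_answer.strip() == sample.reference_answer.strip()
    short_enough = count_tokens(sample.cot) <= max_tokens
    return correct and short_enough


def filter_samples(
    samples: List[Sample],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_COT_TOKENS,
) -> List[Sample]:
    # Drop traces with a wrong answer or an over-long CoT.
    return [s for s in samples if keep_sample(s, count_tokens, max_tokens)]
```

Passing the token counter as a parameter keeps the sketch independent of any particular tokenizer, so a real tokenizer's length function could be swapped in without touching the filter logic.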