metadata

library_name: transformers
license: apache-2.0
datasets:
  - agentica-org/DeepScaleR-Preview-Dataset
language:
  - en
base_model:
  - Qwen/Qwen2.5-7B

Model Card

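This model is Qwen2.5-7B trained with supervised fine-tuning (SFT) followed by reinforcement learning (RL) for mathematical reasoning, as part of our MathIF project.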

GitHub repository: https://github.com/TingchenFu/MathIF

Training Details

We base our experiments on the DeepScaleR dataset, which contains approximately 40k math reasoning samples. We first distill long chain-of-thought (CoT) reasoning traces from QwQ-32B, discarding samples where QwQ-32B fails to generate a correct answer or where the CoT exceeds 8192 tokens. This leaves 18k high-quality examples.
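The filtering step described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the `Trace` record, the `answers_match` helper, and the exact-match comparison are all hypothetical, and the token count is assumed to have been computed beforehand with the model's tokenizer.

```python
from dataclasses import dataclass
from typing import List

MAX_COT_TOKENS = 8192  # length cap stated in the training details


@dataclass
class Trace:
    """Hypothetical record for one distilled QwQ-32B sample."""
    question: str
    reference_answer: str
    cot: str           # distilled chain-of-thought text
    final_answer: str  # answer extracted from the trace
    num_tokens: int    # CoT length in tokens, computed beforehand


def answers_match(predicted: str, reference: str) -> bool:
    # Assumption: exact string match after stripping whitespace.
    # A real math verifier would normalize expressions first.
    return predicted.strip() == reference.strip()


def filter_traces(traces: List[Trace]) -> List[Trace]:
    """Keep traces whose answer is correct and whose CoT fits the cap."""
    return [
        t for t in traces
        if answers_match(t.final_answer, t.reference_answer)
        and t.num_tokens <= MAX_COT_TOKENS
    ]
```

A trace of exactly 8192 tokens is kept here, because the text only excludes traces that exceed the cap.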

Training is conducted on 16 NVIDIA H100 GPUs. For reinforcement learning, we adopt the GRPO algorithm with verifiable outcome-based rewards. The model is trained with the VeRL framework, with most hyper-parameters left at their default settings.
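To illustrate how GRPO uses outcome rewards, the sketch below scores a group of sampled responses to one prompt and normalizes each reward against the group mean and standard deviation. It is a simplified illustration, not VeRL's implementation: the function names, the exact-match reward, the population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def outcome_reward(predicted: str, reference: str) -> float:
    # Assumption: binary verifiable reward, 1.0 for a matching final answer.
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to the same prompt, reference answer "42".
answers = ["42", "41", "7", "42"]
rewards = [outcome_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)
```

When every response in a group receives the same reward, all advantages are zero, so that prompt contributes no learning signal.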

Evaluation

For decoding, we use nucleus sampling (T=1.0, p=0.95) with a maximum generation length of 16,384 tokens, and the vLLM engine for efficient inference.
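As a reminder of what p=0.95 means, the pure-Python sketch below keeps the smallest set of most-probable tokens whose cumulative probability reaches `top_p`, renormalizes, and samples from that set. It is illustrative only: inference engines such as vLLM apply the same idea to logit tensors, and the function names and toy distribution here are hypothetical. With T=1.0, the model's probabilities are used unchanged.

```python
import random
from typing import Dict


def nucleus(probs: Dict[str, float], top_p: float = 0.95) -> Dict[str, float]:
    """Return the renormalized top-p subset of a token distribution."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:  # stop once the nucleus covers top_p mass
            break
    return {token: p / cumulative for token, p in kept}


def nucleus_sample(probs: Dict[str, float], top_p: float = 0.95,
                   rng: random.Random = None) -> str:
    """Draw one token from the top-p nucleus."""
    rng = rng or random.Random()
    dist = nucleus(probs, top_p)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


# Toy next-token distribution (hypothetical values).
toy = {"a": 0.6, "b": 0.3, "c": 0.06, "d": 0.04}
kept = nucleus(toy, top_p=0.95)  # "d" falls outside the nucleus
```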

Citation

BibTeX:

@article{fu2025scaling,
  title={Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models},
  author={Fu, Tingchen and Gu, Jiawei and Li, Yafu and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2505.14810},
  year={2025}
}