---
library_name: transformers
license: apache-2.0
datasets:
- agentica-org/DeepScaleR-Preview-Dataset
language:
- en
base_model:
- Qwen/Qwen2.5-7B
---

# Model Card 

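This model is built on Qwen2.5-7B and trained with supervised fine-tuning (SFT) and reinforcement learning (RL) for mathematical reasoning, as part of our MathIF project.

GitHub repository: https://github.com/TingchenFu/MathIF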


## Training Details

We base our experiments on the DeepScaleR dataset, which contains approximately 40k math reasoning samples. We first distill long chain-of-thought (CoT) reasoning traces from QwQ-32B, discarding samples where QwQ-32B fails to generate a correct answer or the CoT exceeds 8192 tokens. This leaves 18k high-quality examples.
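
The filtering rule described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the `Trace` record, its field names, and the exact-string correctness check are all assumptions made for the example.

```python
from dataclasses import dataclass

# Length cap stated in the text; traces longer than this are dropped.
MAX_COT_TOKENS = 8192


@dataclass
class Trace:
    """Hypothetical record for one distilled sample (field names are assumptions)."""
    question: str
    reference_answer: str
    predicted_answer: str
    cot_token_count: int


def keep(trace: Trace) -> bool:
    """Keep a trace only if the answer is correct and the CoT fits the cap."""
    # Assumption: correctness is approximated by exact string match after stripping.
    is_correct = trace.predicted_answer.strip() == trace.reference_answer.strip()
    return is_correct and trace.cot_token_count <= MAX_COT_TOKENS


def filter_traces(traces: list[Trace]) -> list[Trace]:
    """Apply the correctness and length filters to a list of traces."""
    return [t for t in traces if keep(t)]
```

A real pipeline would likely use a math-aware answer checker rather than string equality, and would count tokens with the tokenizer of the model being trained.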

Training is conducted on 16 NVIDIA H100 GPUs. For reinforcement learning, we adopt GRPO with verifiable outcome-based rewards. The model is trained with the VeRL framework, with most hyperparameters left at their default settings.
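
To illustrate the idea behind outcome-based rewards in GRPO, the sketch below scores each sampled response with a binary reward and normalizes rewards within the group of responses drawn for the same prompt. It is a simplified illustration under assumptions: the function names, the exact-match reward, and the `eps` constant are hypothetical, and it does not reproduce the VeRL implementation or the settings used for this model.

```python
import statistics


def outcome_reward(predicted: str, reference: str) -> float:
    """Binary verifiable reward: 1.0 if the final answer matches, else 0.0.

    Assumption: exact match after stripping whitespace stands in for a
    proper math answer verifier.
    """
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are judged relative to their siblings rather than by a learned critic.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Usage: four sampled answers to one prompt whose reference answer is "12".
rewards = [outcome_reward(p, "12") for p in ["12", "13", "7", "12"]]
advantages = group_relative_advantages(rewards)
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal for the step.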


## Evaluation

For decoding, we use nucleus sampling (T=1.0, p=0.95) with a maximum generation length of 16,384 tokens, and the vLLM engine for efficient inference.
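
The decoding settings above could be passed to vLLM roughly as shown below. This is a sketch, not the project's evaluation script: the `generate` helper and the `DECODING` dictionary are names introduced for illustration, and `model_id` is left as a parameter because this card does not state the repository ID to load.

```python
# Decoding settings taken from the text: T=1.0, p=0.95, up to 16,384 new tokens.
DECODING = {"temperature": 1.0, "top_p": 0.95, "max_tokens": 16384}


def generate(prompts: list[str], model_id: str) -> list[str]:
    """Generate one completion per prompt with vLLM using the settings above.

    `model_id` is a placeholder for the Hugging Face repository ID or local
    path of this model; it is not specified here.
    """
    # Imported lazily so the settings can be inspected without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_id)
    params = SamplingParams(**DECODING)
    outputs = llm.generate(prompts, params)
    # Each request returns a list of candidates; take the first one.
    return [out.outputs[0].text for out in outputs]
```

Because sampling at T=1.0 is stochastic, repeated runs will produce different outputs unless a seed is fixed.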


## Citation

**BibTeX:**

```
@article{fu2025scaling,
  title={Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models},
  author={Fu, Tingchen and Gu, Jiawei and Li, Yafu and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2505.14810},
  year={2025}
}
```