This model improves the instruction-following capabilities of Qwen-2.5-7B-Instruct through preference tuning on the WildChecklists dataset. It is described in detail in the paper *Checklists Are Better Than Reward Models For Aligning Language Models*.
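For reference, here is a minimal usage sketch with the standard Hugging Face `transformers` chat API. The repository id below is a placeholder rather than the actual model id, and the prompt is purely illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/qwen2.5-7b-instruct-rlcf"  # placeholder: substitute the real repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights are stored in BF16
    device_map="auto",
)

# An instruction with several verifiable constraints, the kind of prompt this model targets.
messages = [
    {
        "role": "user",
        "content": "Summarize the plot of Hamlet in exactly three bullet points, "
                   "each under 15 words, without naming any characters.",
    }
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```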

The model is specifically designed to improve performance on complex or subjective instruction-following tasks:

InFoBench / IFEval:

| Model | InFoBench (Overall) | IFEval (prompt-level strict) | IFEval (prompt-level loose) | IFEval (instr-level strict) | IFEval (instr-level loose) |
|---|---|---|---|---|---|
| Qwen-2.5-7B-Instruct (on-policy) | 78.1 | 72.5 | 75.0 | 79.9 | 81.8 |
| + RLCF | 84.1 | 72.6 | 77.3 | 80.3 | 84.1 |

FollowBench:

| Model | Soft L1 | Soft L2 | Soft L3 | Soft L4 | Soft L5 | Soft Avg | Hard L1 | Hard L2 | Hard L3 | Hard L4 | Hard L5 | Hard Avg | CSL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-7B-Instruct (on-policy) | 87.4 | 84.0 | 83.0 | 79.6 | 79.0 | 82.6 | 87.4 | 80.6 | 72.3 | 62.2 | 54.4 | 71.4 | 3.05 |
| + RLCF | 88.6 | 88.8 | 83.8 | 79.9 | 81.0 | 84.4 | 88.6 | 85.2 | 75.8 | 65.1 | 61.8 | 75.3 | 3.30 |

We find that the model is as good as or slightly worse than the base model at other tasks, such as math reasoning, and that it may slightly change the safety-alignment behavior of Qwen-2.5-7B-Instruct (modestly decreasing the refusal rate for unsafe prompts while considerably decreasing the refusal rate for safe prompts).

If you write a paper using this model, please cite us!

```bibtex
@misc{RLCF,
      title={Checklists Are Better Than Reward Models For Aligning Language Models},
      author={Vijay Viswanathan and Yanchao Sun and Shuang Ma and Xiang Kong and Meng Cao and Graham Neubig and Tongshuang Wu},
      year={2025},
      eprint={2507.18624},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Note: Apple was not involved in training this model or in producing the data used to train it; the model was created exclusively by researchers at Carnegie Mellon University (CMU).

Model size: 7.62B parameters (Safetensors, BF16).