This model improves the instruction-following capabilities of Qwen-2.5-7B-Instruct using preference tuning on the WildChecklists dataset. The model is described in detail in *Checklists Are Better Than Reward Models For Aligning Language Models*.
This model is specifically designed to improve complex or subjective instruction following:
InFoBench/IFEval:
| Model | InFoBench (Overall) | IFEval (prompt-level strict) | IFEval (prompt-level loose) | IFEval (instr-level strict) | IFEval (instr-level loose) |
|---|---|---|---|---|---|
| Qwen-2.5-7B-Instruct (on-policy) | 78.1 | 72.5 | 75.0 | 79.9 | 81.8 |
| + RLCF | 84.1 | 72.6 | 77.3 | 80.3 | 84.1 |
FollowBench:
| Model | Soft L1 | Soft L2 | Soft L3 | Soft L4 | Soft L5 | Soft Avg | Hard L1 | Hard L2 | Hard L3 | Hard L4 | Hard L5 | Hard Avg | CSL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-7B-Instruct (on-policy) | 87.4 | 84.0 | 83.0 | 79.6 | 79.0 | 82.6 | 87.4 | 80.6 | 72.3 | 62.2 | 54.4 | 71.4 | 3.05 |
| + RLCF | 88.6 | 88.8 | 83.8 | 79.9 | 81.0 | 84.4 | 88.6 | 85.2 | 75.8 | 65.1 | 61.8 | 75.3 | 3.30 |
We find that it is as good as, or slightly worse than, Qwen-2.5-7B-Instruct on other tasks, such as math reasoning, and it may slightly change the model's safety alignment behavior (modestly decreasing the refusal rate for unsafe prompts while considerably decreasing the refusal rate for safe prompts).
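Because this checkpoint is a preference-tuned variant of Qwen-2.5-7B-Instruct, it should load and generate like any other Qwen-2.5 chat model. The sketch below is a minimal usage example with Hugging Face `transformers`; the repository id, prompt, and sampling settings are illustrative placeholders, not values from the paper.

```python
# A minimal inference sketch, assuming this checkpoint keeps the standard
# Qwen-2.5 chat interface and loads through Hugging Face transformers.
# MODEL_ID is a placeholder; substitute the repository id of this model.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/this-repository"  # placeholder, not the actual repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

# An instruction with explicit constraints, the kind of prompt this model targets.
messages = [
    {
        "role": "user",
        "content": "Write a product description for a reusable water bottle in "
                   "exactly three sentences, without using the word 'hydration'.",
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling settings here are illustrative defaults, not values from the paper.
output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```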
If you write a paper using this model, please cite us!
```bibtex
@misc{RLCF,
      title={Checklists Are Better Than Reward Models For Aligning Language Models},
      author={Vijay Viswanathan and Yanchao Sun and Shuang Ma and Xiang Kong and Meng Cao and Graham Neubig and Tongshuang Wu},
      year={2025},
      eprint={2507.18624},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Note: Apple was not involved in training this model or in producing the data used to train it; this model was created exclusively by researchers at Carnegie Mellon University (CMU).