This model improves the instruction-following capabilities of Qwen-2.5-7B-Instruct using preference tuning on the WildChecklists dataset. The model is described in detail in *Checklists Are Better Than Reward Models For Aligning Language Models*.
This model is specifically designed to improve complex or subjective instruction following:
InFoBench/IFEval:
| Model | InFoBench (Overall) | IFEval (prompt-level strict) | IFEval (prompt-level loose) | IFEval (instr-level strict) | IFEval (instr-level loose) |
|---|---|---|---|---|---|
| Qwen-2.5-7B-Instruct (on-policy) | 78.1 | 72.5 | 75.0 | 79.9 | 81.8 |
| + RLCF | 84.1 | 72.6 | 77.3 | 80.3 | 84.1 |
FollowBench:
| Model | Soft L1 | Soft L2 | Soft L3 | Soft L4 | Soft L5 | Soft Avg | Hard L1 | Hard L2 | Hard L3 | Hard L4 | Hard L5 | Hard Avg | CSL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-7B-Instruct (on-policy) | 87.4 | 84.0 | 83.0 | 79.6 | 79.0 | 82.6 | 87.4 | 80.6 | 72.3 | 62.2 | 54.4 | 71.4 | 3.05 |
| + RLCF | 88.6 | 88.8 | 83.8 | 79.9 | 81.0 | 84.4 | 88.6 | 85.2 | 75.8 | 65.1 | 61.8 | 75.3 | 3.30 |
We find that it is as good as, or slightly worse than, Qwen-2.5-7B-Instruct on other tasks, such as math reasoning, and it may slightly change the model's safety alignment behavior (modestly decreasing the refusal rate for unsafe prompts while considerably decreasing the refusal rate for safe prompts).
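Because this checkpoint is a preference-tuned variant of Qwen-2.5-7B-Instruct, it should load and generate like any other Qwen-2.5 chat model. The sketch below is a minimal usage example with Hugging Face `transformers`; the repository id, prompt, and sampling settings are illustrative placeholders, not values from the paper.

```python
# A minimal inference sketch, assuming this checkpoint keeps the standard
# Qwen-2.5 chat interface and loads through Hugging Face transformers.
# MODEL_ID is a placeholder; substitute the repository id of this model.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/this-repository"  # placeholder, not the actual repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

# An instruction with explicit constraints, the kind of prompt this model targets.
messages = [
    {
        "role": "user",
        "content": "Write a product description for a reusable water bottle in "
                   "exactly three sentences, without using the word 'hydration'.",
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling settings here are illustrative defaults, not values from the paper.
output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```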
If you write a paper using this model, please cite us!
```bibtex
@misc{RLCF,
      title={Checklists Are Better Than Reward Models For Aligning Language Models},
      author={Vijay Viswanathan and Yanchao Sun and Shuang Ma and Xiang Kong and Meng Cao and Graham Neubig and Tongshuang Wu},
      year={2025},
      eprint={2507.18624},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Note: Apple was not involved in training this model or in producing the data used to train it; this model was created exclusively by researchers at Carnegie Mellon University (CMU).