Policy Filtration in RLHF to Fine-Tune LLM for Code Generation Paper • 2409.06957 • Published Sep 11, 2024 • 7
MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision Paper • 2505.13427 • Published 22 days ago • 25
CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models Paper • 2505.12504 • Published 23 days ago • 23
LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs Paper • 2504.14655 • Published Apr 20 • 19
Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model Paper • 2504.15843 • Published Apr 22 • 18