DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step120-reward Updated May 5 • 3
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step120-actor Updated May 5 • 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step110-reward Updated May 5 • 3
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step110-actor Updated May 5 • 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step100-reward Updated May 5 • 2
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step100-actor Updated May 5 • 3
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step90-reward Updated May 5 • 2
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step80-reward Updated May 5 • 5