Spurious Rewards: Rethinking Training Signals in RLVR
- stellalisy/rethink_rlvr_reproduce-ground_truth-qwen2.5_math_7b-lr5e-7-kl0.00-step50
  Text Generation • 8B • Updated • 41
- stellalisy/rethink_rlvr_reproduce-ground_truth-qwen2.5_math_7b-lr5e-7-kl0.00-step100
  Text Generation • 8B • Updated • 33
- stellalisy/rethink_rlvr_reproduce-ground_truth-qwen2.5_math_7b-lr5e-7-kl0.00-step150
  Text Generation • 8B • Updated • 260
- stellalisy/rethink_rlvr_reproduce-majority_vote-qwen2.5_math_7b-lr5e-7-kl0.00-step50
  Text Generation • 8B • Updated • 39