PeterJinGo/R1-nq_hotpotqa_train-qwen2.5-3b-em-ppo-v0.2
3B
Exploration of a more stable RL pipeline using outcome-only rewards and scaled-up LLMs. https://arxiv.org/abs/2503.09516