GRPO fine-tuned DeepSeek-R1-Qwen3-8B for next-token prediction, following the paper https://huggingface.co/papers/2506.08007
251e747 (verified) · ykarout committed 19 days ago