I trained a Language Model to schedule events with GRPO!
Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo
I have been experimenting with GRPO lately.
I am fascinated by models learning from prompts and rewards alone, with no example answers needed, unlike in Supervised Fine-Tuning.
After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game...
I wanted a different challenge, like teaching a model to create a schedule from a list of events and priorities.
Choosing an original problem forced me to:
- Think about the problem setting
- Generate data
- Choose the right base model
- Design reward functions (and experience reward hacking)
- Run multiple rounds of training, hoping that my model would learn something
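To make the reward-design step a bit more concrete, here is a minimal sketch. It is not the code from the blog post or the repository: the schedule format (tuples of name, start minute, end minute), the `schedule_reward` scoring rule, and the helper names are all assumptions made up for illustration. The only part taken from GRPO itself is the idea of comparing each sampled answer's reward with the rest of its group; using the population standard deviation is a choice made here.

```python
from statistics import mean, pstdev


def schedule_reward(schedule, priorities):
    """Hypothetical reward: priority-weighted minutes scheduled.

    schedule: list of (name, start_minute, end_minute) tuples.
    priorities: dict mapping event name to a weight (default weight is 1).
    Returns 0.0 for malformed or overlapping events, so the model cannot
    inflate its score by double-booking (one simple form of reward hacking).
    """
    ordered = sorted(schedule, key=lambda event: event[1])
    # Reject events that end before (or when) they start.
    if any(end <= start for _, start, end in ordered):
        return 0.0
    # Reject overlaps between consecutive events.
    for (_, _, prev_end), (_, next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            return 0.0
    return float(
        sum(priorities.get(name, 1) * (end - start) for name, start, end in ordered)
    )


def group_relative_advantages(rewards):
    """Score each sampled answer relative to the other answers in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All answers scored the same: nothing to prefer within this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


priorities = {"standup": 2, "lunch": 1}
valid = [("standup", 540, 570), ("lunch", 720, 780)]
overlapping = [("standup", 540, 600), ("lunch", 570, 630)]

rewards = [schedule_reward(s, priorities) for s in (valid, overlapping)]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

In this toy setup the valid schedule gets a positive advantage and the overlapping one a negative advantage, which is the signal the model learns from without ever seeing a reference answer.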
A fun and rewarding experience.
I learned a lot along the way, and I want to share it with you.
Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo
Code: https://github.com/anakin87/qwen-scheduler-grpo
Hugging Face collection (dataset and model): anakin87/qwen-scheduler-grpo-680bcc583e817390525a8837