Malaysian Qwen 2.5 1.5B Instruct Reasoning GRPO
Online Reinforcement learning using GRPO full parameter on warmup reasoning SFT https://huggingface.co/mesolitica/Malaysian-Qwen2.5-1.5B-Reasoning-SFT on highly curated Malaysian Reasoning dataset.
Improvement
- Multitask reasoning, each datapoint been replicated to 4 generations.
- Actual online reinforcement learning.
Better performance
To get better performance, use system prompt You are going to enter reasoning mode. First, you try to think step-by-step in Malay. After that, put your final answer within $\\boxed{}$.
Training session
Finetune on combine/combined-malaysian-reasoning.jsonl, this is train set from mesolitica/Malaysian-Reasoning.
How we train
- GRPO full parameters.
- WanDB at https://wandb.ai/huseinzol05/fpf-Malaysian-Qwen2.5-1.5B-Reasoning-SFT-GRPO
Checkpoints
- Epoch 5.0, revision b4c3d2b391ff08141a0728c6f1868bffed313be6
Source code
Source code at https://github.com/mesolitica/malaya/blob/master/session/qwen2.5/1.5b-grpo-fsdp.sh
Benchmark
All the benchmarks generate using vLLM.
Dialect Translation
Evaluation based on sacrebleu CHRF max@5.
Source code for evaluation at https://github.com/mesolitica/malaya/tree/master/session/qwen2.5/evaluate-dialect
Dialect to standard Malay,
From: johor To: malay, score: 52.23619118965661
From: kedah To: malay, score: 53.10401746444021
From: pahang To: malay, score: 50.99975609997574
From: negeri sembilan To: malay, score: 50.11902142208946
From: kelantan To: malay, score: 42.843012553721046
From: penang To: malay, score: 57.40784069730589
From: melaka To: malay, score: 54.85551785683515
average: 51.65219389771773
Standard Malay to dialect,
From: malay To: johor, score: 51.74985868151864
From: malay To: kedah, score: 47.69151337899059
From: malay To: pahang, score: 56.16867018729128
From: malay To: negeri sembilan, score: 43.81721289079021
From: malay To: kelantan, score: 36.88914960449112
From: malay To: penang, score: 56.94874625842691
From: malay To: melaka, score: 64.25911695574484
average: 51.0748954224648
MalayMMLU
Source code for evaluation at https://github.com/mesolitica/malaya/tree/master/session/qwen2.5/evaluate-malaymmlu
Evaluation based on Accuracy@1,
STEM 60.37658616455178
Language 63.32697201017812
Social science 54.09077768141082
Others 57.7356680259055
Humanities 60.09101251422071
Evaluation based on Accuracy@5,
STEM 61.07245190339746
Language 63.02480916030534
Social science 54.31338537149465
Others 58.21060206284481
Humanities 60.73265073947668
Special thanks
Special thanks to https://www.sns.com.my and Nvidia for 8x H100 node!
- Downloads last month
- 1
Model tree for mesolitica/Malaysian-Qwen2.5-1.5B-Reasoning-GRPO
Base model
mesolitica/Malaysian-Qwen2.5-1.5B-Instruct