Malaysian Qwen 2.5 14B Instruct Reasoning GRPO
Online Reinforcement learning using GRPO full parameter on warmup reasoning SFT https://huggingface.co/mesolitica/Malaysian-Qwen2.5-14B-Reasoning-SFT on highly curated Malaysian Reasoning dataset.
Improvement
- Multitask reasoning, each datapoint been replicated to 4 generations.
- Actual online reinforcement learning.
Better performance
To get better performance, use system prompt You are going to enter reasoning mode. First, you try to think step-by-step in Malay. After that, put your final answer within $\\boxed{}$.
Training session
Finetune on combine/combined-malaysian-reasoning.jsonl, this is train set from mesolitica/Malaysian-Reasoning.
How we train
- GRPO full parameters.
- WanDB at https://wandb.ai/huseinzol05/fpf-Malaysian-Qwen2.5-14B-Reasoning-SFT-GRPO
Checkpoints
- Epoch 1.0, revision cc1032dfe961a56a3e33e36f03c37ed09b33c7fe
- Epoch 2.0, revision 90896edeb1eb18cb48ac682ad606d4ec51172941
Source code
Source code at https://github.com/mesolitica/malaya/blob/master/session/qwen2.5/14b-grpo-fsdp.sh
Benchmark
Dialect Translation
All the benchmarks generate using vLLM, evaluation based on sacrebleu CHRF max@5.
Source code for evaluation at https://github.com/mesolitica/malaya/tree/master/session/qwen2.5/evaluate-dialect
Dialect to standard Malay,
From: johor To: malay, score: 58.84114972328088
From: kedah To: malay, score: 61.23535640853852
From: pahang To: malay, score: 60.538184656921736
From: negeri sembilan To: malay, score: 59.33677942673728
From: kelantan To: malay, score: 53.67899007513317
From: penang To: malay, score: 64.65390412500909
From: melaka To: malay, score: 58.63391894024569
average: 59.55975476512377
Standard Malay to dialect,
From: malay To: johor, score: 55.69851104502618
From: malay To: kedah, score: 56.537698809297844
From: malay To: pahang, score: 61.46337868712478
From: malay To: negeri sembilan, score: 52.483104592534914
From: malay To: kelantan, score: 45.44384811848678
From: malay To: penang, score: 68.91583154150995
From: malay To: melaka, score: 72.5073072144931
average: 59.00709714406764
MalayMMLU
Source code for evaluation at https://github.com/mesolitica/malaya/tree/master/session/qwen2.5/evaluate-malaymmlu
Evaluation based on Accuracy@1,
STEM 79.00122799836267
Language 78.76908396946564
Social science 70.88753975137323
Others 73.23099064523866
Humanities 76.22298065984073
average 75.62236460485619
Evaluation based on Accuracy@5,
STEM 78.87024150634467
Language 79.04580152671755
Social science 70.88464874241109
Others 73.29335572079636
Humanities 76.37315130830488
average 75.69343976091491
Special thanks
Special thanks to https://www.sns.com.my and Nvidia for 8x H100 node!
- Downloads last month
- 10
Model tree for mesolitica/Malaysian-Qwen2.5-14B-Reasoning-GRPO
Base model
mesolitica/Malaysian-Qwen2.5-14B-Instruct