Malaysian Qwen 2.5 1.5B Instruct Reasoning GRPO

Online reinforcement learning with full-parameter GRPO on top of the reasoning SFT warmup model https://huggingface.co/mesolitica/Malaysian-Qwen2.5-1.5B-Reasoning-SFT, trained on a highly curated Malaysian Reasoning dataset.

Improvement

  1. Multitask reasoning; each datapoint is replicated into 4 generations (this is the GRPO group size, see the sketch after this list).
  2. Actual online reinforcement learning.
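
For reference, the 4 generations per datapoint correspond to the GRPO group size G: rewards within a group are normalised into relative advantages instead of relying on a learned value function. The sketch below follows the standard published GRPO formulation; the clip range and KL coefficient used in this particular run are not stated in this card.

```latex
% Group-relative advantage for completion o_i among G samples of the same prompt q
A_i = \frac{r_i - \operatorname{mean}(r_1,\ldots,r_G)}{\operatorname{std}(r_1,\ldots,r_G)}

% Clipped policy objective with a KL penalty towards the reference (SFT warmup) policy,
% where \rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\mathrm{old}}}(o_i \mid q)
J(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
  \min\!\big(\rho_i A_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, A_i\big)\right]
  - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
```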

Better performance

To get better performance, use the system prompt: "You are going to enter reasoning mode. First, you try to think step-by-step in Malay. After that, put your final answer within $\\boxed{}$."
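
A minimal usage sketch with the Hugging Face transformers chat template is shown below; the sampling parameters and the example question are illustrative assumptions, not values taken from this card.

```python
# Minimal sketch: load the model and apply the recommended reasoning system prompt.
# Sampling parameters below are illustrative assumptions, not values from this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mesolitica/Malaysian-Qwen2.5-1.5B-Reasoning-GRPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {
        "role": "system",
        "content": (
            "You are going to enter reasoning mode. First, you try to think "
            "step-by-step in Malay. After that, put your final answer within $\\boxed{}$."
        ),
    },
    # hypothetical example question: "What is the sum of 123 and 456?"
    {"role": "user", "content": "Berapakah hasil tambah 123 dan 456?"},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```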

Training session

Fine-tuned on combine/combined-malaysian-reasoning.jsonl, which is the train set from mesolitica/Malaysian-Reasoning.

How we train

  1. Full-parameter GRPO (a minimal training sketch is shown after this list).
  2. WandB at https://wandb.ai/huseinzol05/fpf-Malaysian-Qwen2.5-1.5B-Reasoning-SFT-GRPO
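
The actual training entrypoint is the shell script linked under Source code below. The sketch here only illustrates what a full-parameter GRPO run of this shape looks like with the trl library; the reward function, batch sizes, and other hyperparameters are hypothetical placeholders, not the values used for this model.

```python
# Illustrative sketch of a full-parameter GRPO run using the trl library.
# The real run is driven by the shell script linked under "Source code";
# the reward function and hyperparameters here are hypothetical placeholders.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPOTrainer expects a "prompt" column; the real script prepares this from the jsonl.
dataset = load_dataset(
    "json", data_files="combine/combined-malaysian-reasoning.jsonl", split="train"
)

def boxed_answer_reward(completions, **kwargs):
    # Hypothetical reward assuming string completions: favour answers that use \boxed{}.
    return [1.0 if "\\boxed{" in c else 0.0 for c in completions]

config = GRPOConfig(
    output_dir="fpf-Malaysian-Qwen2.5-1.5B-Reasoning-SFT-GRPO",
    num_generations=4,           # 4 generations per datapoint, as stated in this card
    num_train_epochs=5,          # the released checkpoint is epoch 5.0
    per_device_train_batch_size=4,
    bf16=True,
    report_to="wandb",
)

trainer = GRPOTrainer(
    model="mesolitica/Malaysian-Qwen2.5-1.5B-Reasoning-SFT",  # warmup SFT model
    args=config,
    train_dataset=dataset,
    reward_funcs=boxed_answer_reward,
)
trainer.train()
```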

Checkpoints

  1. Epoch 5.0, revision b4c3d2b391ff08141a0728c6f1868bffed313be6

Source code

Source code at https://github.com/mesolitica/malaya/blob/master/session/qwen2.5/1.5b-grpo-fsdp.sh

Benchmark

All benchmarks were generated using vLLM.
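
A minimal sketch of this kind of vLLM generation loop is shown below, assuming the same reasoning system prompt; the decoding parameters and the placeholder prompts are assumptions, not the exact benchmark settings.

```python
# Minimal sketch of batched benchmark generation with vLLM.
# Decoding parameters are illustrative assumptions, not the exact benchmark settings.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "mesolitica/Malaysian-Qwen2.5-1.5B-Reasoning-GRPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)
# n=5 samples per prompt, matching the @5 metrics reported below
sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048, n=5)

system = (
    "You are going to enter reasoning mode. First, you try to think step-by-step in Malay. "
    "After that, put your final answer within $\\boxed{}$."
)
questions = ["..."]  # benchmark questions go here (placeholder)

prompts = [
    tokenizer.apply_chat_template(
        [{"role": "system", "content": system}, {"role": "user", "content": q}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for q in questions
]
outputs = llm.generate(prompts, sampling)
for out in outputs:
    candidates = [o.text for o in out.outputs]  # 5 candidates per prompt
```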

Dialect Translation

Evaluation is based on sacrebleu ChrF max@5.

Source code for evaluation at https://github.com/mesolitica/malaya/tree/master/session/qwen2.5/evaluate-dialect
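
The exact scoring script is at the link above; the sketch below only illustrates one reasonable reading of ChrF max@5, i.e. taking the best sentence-level ChrF over the 5 sampled translations per example, then averaging over the test set. The example candidates and reference are placeholders.

```python
# Illustrative sketch of "ChrF max@5": score each of the 5 sampled translations with
# sacrebleu's ChrF, keep the best score per example, then average over the test set.
# This is one reading of the metric; the exact evaluation script is linked above.
from sacrebleu.metrics import CHRF

chrf = CHRF()

def chrf_max_at_k(candidates, reference):
    # candidates: the k sampled translations for one example, reference: the gold translation
    return max(chrf.sentence_score(c, [reference]).score for c in candidates)

# placeholder example with hypothetical candidates and reference
candidates = [
    "saya hendak pergi ke pasar",
    "saya nak pergi pasar",
    "aku hendak ke pasar",
    "saya mahu ke pasar",
    "saya nak ke pasar",
]
reference = "saya hendak pergi ke pasar"
print(chrf_max_at_k(candidates, reference))  # best ChrF among the 5 candidates (here 100.0)
```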

Dialect to standard Malay,

From: johor To: malay, score: 52.23619118965661
From: kedah To: malay, score: 53.10401746444021
From: pahang To: malay, score: 50.99975609997574
From: negeri sembilan To: malay, score: 50.11902142208946
From: kelantan To: malay, score: 42.843012553721046
From: penang To: malay, score: 57.40784069730589
From: melaka To: malay, score: 54.85551785683515
average: 51.65219389771773

Standard Malay to dialect,

From: malay To: johor, score: 51.74985868151864
From: malay To: kedah, score: 47.69151337899059
From: malay To: pahang, score: 56.16867018729128
From: malay To: negeri sembilan, score: 43.81721289079021
From: malay To: kelantan, score: 36.88914960449112
From: malay To: penang, score: 56.94874625842691
From: malay To: melaka, score: 64.25911695574484
average: 51.0748954224648

MalayMMLU

Source code for evaluation at https://github.com/mesolitica/malaya/tree/master/session/qwen2.5/evaluate-malaymmlu

Evaluation is based on Accuracy@1,

STEM 60.37658616455178
Language 63.32697201017812
Social science 54.09077768141082
Others 57.7356680259055
Humanities 60.09101251422071

Evaluation is based on Accuracy@5,

STEM 61.07245190339746
Language 63.02480916030534
Social science 54.31338537149465
Others 58.21060206284481
Humanities 60.73265073947668
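
For the Accuracy@1 and Accuracy@5 numbers above, the sketch below assumes Accuracy@k is aggregated by majority vote over k sampled answers per question. This is an assumption about the metric, not a definition taken from this card; see the evaluation source code linked above for the exact computation.

```python
# Illustrative sketch of an Accuracy@k style metric via majority vote over k sampled answers.
# This aggregation is an assumption; see the linked evaluation source code for the exact definition.
from collections import Counter

def accuracy_at_k_majority(predictions, golds, k=5):
    # predictions: list of lists with k sampled choices per question (e.g. "A".."D")
    # golds: list of gold choices
    correct = 0
    for preds, gold in zip(predictions, golds):
        voted = Counter(preds[:k]).most_common(1)[0][0]  # most frequent sampled answer
        correct += int(voted == gold)
    return 100.0 * correct / len(golds)

preds = [["B", "A", "B", "C", "B"], ["D", "D", "A", "D", "D"]]
golds = ["B", "A"]
print(accuracy_at_k_majority(preds, golds, k=1))  # 50.0 -> only the first sample is used
print(accuracy_at_k_majority(preds, golds, k=5))  # 50.0 -> majority vote per question
```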

Special thanks

Special thanks to https://www.sns.com.my and Nvidia for the 8x H100 node!
