# Qwen2.5-0.5B Reasoning-Finetuned with SFT + GRPO

This model is a reasoning-enhanced version of Qwen2.5-0.5B-Instruct, fine-tuned in two stages:

- Supervised Fine-Tuning (SFT) on the Bespoke-Stratos-17k dataset.
- Group Relative Policy Optimization (GRPO) reinforcement learning on NuminaMath-TIR.

Training was performed on AWS SageMaker using a custom container based on `trl==0.16.1` and `transformers==4.36.2`, with LoRA adapters applied and then merged back into the final model.
## Training Details

### 1. Supervised Fine-Tuning (SFT)

- Base model: `Qwen/Qwen2.5-0.5B-Instruct`
- Dataset: `bespokelabs/Bespoke-Stratos-17k`
- Max sequence length: 1024
- Training epochs: 2
- Batch size: 8 (with gradient accumulation)
- LoRA:
  - Target modules: `q_proj`, `v_proj`
  - Rank: 8, Alpha: 16, Dropout: 0.05
- Instance: `ml.g5.12xlarge` (1x A10G GPU)
- Training script: `train_sft.py`
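The original `train_sft.py` is not reproduced here, but a minimal sketch of this stage with TRL's `SFTTrainer` and a PEFT `LoraConfig`, using the settings listed above, could look like the following. Output paths, the gradient-accumulation factor, and the dataset preprocessing are assumptions, not the author's exact configuration:

```python
# Minimal sketch of the SFT stage, assuming TRL's SFTTrainer with a PEFT LoRA config.
# Hyperparameters mirror the list above; this is NOT the original train_sft.py.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# NOTE: depending on the dataset schema, you may need to map the examples into the
# "messages" chat format (or a plain "text" column) that SFTTrainer expects.
dataset = load_dataset("bespokelabs/Bespoke-Stratos-17k", split="train")

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="qwen2.5-0.5b-sft",   # assumed output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # assumption: the exact accumulation factor is not stated
    num_train_epochs=2,
    max_seq_length=1024,             # argument name may differ across trl versions
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()

# Merge the LoRA adapters back into the base weights, as described above.
merged_model = trainer.model.merge_and_unload()
merged_model.save_pretrained("qwen2.5-0.5b-sft-merged")  # assumed path, reused in the GRPO sketch
```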
### 2. Reinforcement Learning (GRPO)

- Base model: output of the SFT stage (LoRA merged)
- Dataset: `AI-MO/NuminaMath-TIR`
- Reward functions:
  - Accuracy (cosine-scaled)
  - Format adherence
  - Reasoning steps
  - Conciseness
  - Repetition penalty
- Batch size: 4
- Max steps: 2000
- GRPO:
  - Beta: 0.1
  - Generations per step: 2
- Instance: `ml.g5.12xlarge`
- Training script: `train_grpo.py`
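Similarly, a minimal sketch of the GRPO stage with TRL's `GRPOTrainer` could look like this. The reward function shown is an illustrative stand-in for the reward set listed above, and the merged-SFT checkpoint path is an assumption carried over from the previous sketch:

```python
# Minimal sketch of the GRPO stage, assuming TRL's GRPOTrainer.
# The single reward function below is a toy stand-in for the actual reward set
# (accuracy, format adherence, reasoning steps, conciseness, repetition penalty).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPOTrainer expects a "prompt" column; the "problem" column name is an assumption
# about the NuminaMath-TIR schema.
dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train")
dataset = dataset.map(lambda x: {"prompt": x["problem"]})

def format_reward(completions, **kwargs):
    """Toy format-adherence reward: +1 if the completion contains a boxed final answer."""
    return [1.0 if "\\boxed{" in c else 0.0 for c in completions]

training_args = GRPOConfig(
    output_dir="qwen2.5-0.5b-grpo",   # assumed output path
    per_device_train_batch_size=4,
    max_steps=2000,
    beta=0.1,                         # KL penalty coefficient
    num_generations=2,                # generations per prompt at each step
    bf16=True,
)

trainer = GRPOTrainer(
    model="qwen2.5-0.5b-sft-merged",  # assumed path to the merged SFT checkpoint
    reward_funcs=[format_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```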
## Model Files

| File | Description |
|---|---|
| `model.safetensors` | Final model weights |
| `config.json` | Model configuration |
| `tokenizer.json`, `tokenizer_config.json`, `vocab.json` | Tokenizer files |
| `merges.txt` | Byte Pair Encoding merges |
| `special_tokens_map.json`, `added_tokens.json` | Special token mappings |
| `generation_config.json` | Default generation parameters |
## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Essi-Narim/qwen2.5b-sft-grpo-reasoning"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

prompt = "<|im_start|>user\nSolve: (5 + 7) × (3 - 1)<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
The model follows the Qwen Instruct chat format, so it also works in chat UIs and with the tokenizer's chat template (see the sketch below).
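For reference, the same prompt can be built with `tokenizer.apply_chat_template` instead of hand-written special tokens. This is a short sketch reusing the `tokenizer` and `model` objects from the snippet above:

```python
# Build the prompt via the chat template instead of raw <|im_start|> tokens.
messages = [{"role": "user", "content": "Solve: (5 + 7) × (3 - 1)"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```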
## Acknowledgments

- Qwen2.5-0.5B-Instruct – base model.
- DeepSeek R1 / GRPO – training inspiration.
- Bespoke-Stratos-17k dataset.
- NuminaMath-TIR dataset.
## Citation

If you use this model or build upon this pipeline, please cite:

> Esmaeil Narimissa, "Reasoning on a Budget: Miniaturizing DeepSeek R1 with SFT-GRPO Alignment for Instruction-Tuned LLMs," TechRxiv, May 2025. DOI: 10.36227/techrxiv.174742969.91776650/v1
## Related Resources

- Paper: TechRxiv Preprint
- Codebase: GitHub Repository