Qwen2.5-0.5B Reasoning-Finetuned with SFT + GRPO

This model is a reasoning-enhanced version of Qwen2.5-0.5B-Instruct, fine-tuned in two stages:

  1. Supervised Fine-Tuning (SFT) on the Bespoke-Stratos-17k dataset.
  2. Group Relative Policy Optimization (GRPO) reinforcement learning on NuminaMath-TIR.

Training was performed on AWS SageMaker using a custom container based on trl==0.16.1 and transformers==4.36.2. LoRA adapters were trained at each stage and then merged back into the base weights to produce the final standalone model.
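For reference, merging a LoRA adapter back into the base weights can be done with peft's merge_and_unload; this is a minimal sketch, and the adapter/output paths below are placeholders rather than the actual training artifacts:

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model and attach the trained LoRA adapter (paths are illustrative)
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

# Fold the adapter weights into the base model and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("qwen2.5-0.5b-sft-merged")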


πŸ› οΈ Training Details

1. Supervised Fine-Tuning (SFT)

  • Base model: Qwen/Qwen2.5-0.5B-Instruct
  • Dataset: bespokelabs/Bespoke-Stratos-17k
  • Max sequence length: 1024
  • Training epochs: 2
  • Batch size: 8 (with gradient accumulation)
  • LoRA:
    • Target modules: q_proj, v_proj
    • Rank: 8, Alpha: 16, Dropout: 0.05
  • Instance: ml.g5.12xlarge (4Γ— NVIDIA A10G GPUs)

Training script: train_sft.py
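The exact contents of train_sft.py are not included here; below is a minimal sketch of the setup described above using trl's SFTTrainer. The output path and the per-device batch / gradient-accumulation split are assumptions:

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Reasoning traces used for supervised fine-tuning
dataset = load_dataset("bespokelabs/Bespoke-Stratos-17k", split="train")

# LoRA on the attention query/value projections, as listed above
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="qwen2.5-0.5b-sft",   # assumption
    num_train_epochs=2,
    per_device_train_batch_size=2,   # assumption: 2 x 4 accumulation = effective batch 8
    gradient_accumulation_steps=4,
    max_seq_length=1024,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()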


2. Reinforcement Learning (GRPO)

  • Base model: Output of SFT (LoRA merged)
  • Dataset: AI-MO/NuminaMath-TIR
  • Reward Functions:
    • Accuracy (cosine-scaled)
    • Format adherence
    • Reasoning steps
    • Conciseness
    • Repetition penalty
  • Batch size: 4
  • Max steps: 2000
  • GRPO:
    • Beta: 0.1
    • Generations per step: 2
  • Instance: ml.g5.12xlarge (4Γ— NVIDIA A10G GPUs)

Training script: train_grpo.py
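Again a hedged sketch rather than the actual train_grpo.py: trl's GRPOTrainer takes a list of reward functions, each mapping a batch of completions to scalar rewards. Only a simplified format-adherence reward is shown; the accuracy, reasoning-step, conciseness, and repetition rewards would follow the same signature. The prompt-column mapping and output path are assumptions:

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPOTrainer expects a "prompt" column; NuminaMath-TIR provides "problem"
dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train")
dataset = dataset.map(lambda x: {"prompt": x["problem"]})

# Simplified stand-in for the format-adherence reward listed above
def format_reward(completions, **kwargs):
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

args = GRPOConfig(
    output_dir="qwen2.5-0.5b-grpo",   # assumption
    per_device_train_batch_size=4,
    max_steps=2000,
    beta=0.1,                         # KL penalty coefficient
    num_generations=2,                # completions sampled per prompt
)

trainer = GRPOTrainer(
    model="qwen2.5-0.5b-sft-merged",  # merged SFT checkpoint (path is illustrative)
    reward_funcs=[format_reward],
    args=args,
    train_dataset=dataset,
)
trainer.train()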


πŸ“¦ Model Files

  • model.safetensors: Final model weights (494M parameters, BF16)
  • config.json: Model configuration
  • tokenizer.json, tokenizer_config.json, vocab.json: Tokenizer files
  • merges.txt: Byte pair encoding (BPE) merges
  • special_tokens_map.json, added_tokens.json: Special token mappings
  • generation_config.json: Default generation parameters

πŸš€ How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Essi-Narim/qwen2.5b-sft-grpo-reasoning"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Build a ChatML prompt using the tokenizer's chat template
messages = [{"role": "user", "content": "Solve: (5 + 7) Γ— (3 - 1)"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The model follows the Qwen ChatML instruct format, so it works with standard chat UIs out of the box.


πŸ“„ Citation

If you use this model or build upon this pipeline, please cite:

Esmaeil Narimissa.
"Reasoning on a Budget: Miniaturizing DeepSeek R1 with SFT-GRPO Alignment for Instruction-Tuned LLMs"
TechRxiv, May 2025.
DOI: 10.36227/techrxiv.174742969.91776650/v1
