SafeGRPO-LoRA
This is a LoRA adapter for Qwen2.5-0.5B, trained with the GRPO (Group Relative Policy Optimization) algorithm and a multi-label reward model for safe and aligned language generation.
A RoBERTa-based multi-label regression model was used to compute rewards on four alignment axes:

- Politeness
- Meaningfulness
- Actionability
- Safety
Each output was scored in [0,1], and the sum of the four scores was used as the scalar reward.
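The training script is not part of this repository; the snippet below is a minimal sketch of how such a summed multi-label reward could be wired into GRPO training, assuming the reward model is a four-output `AutoModelForSequenceClassification` regression head and that training uses `trl`'s `GRPOTrainer`. The reward-model checkpoint name, dataset, and hyperparameters are placeholders, not the actual training setup.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import GRPOConfig, GRPOTrainer

# Hypothetical multi-label reward model: a RoBERTa regression head with four
# outputs (politeness, meaningfulness, actionability, safety).
REWARD_MODEL = "your-org/roberta-multilabel-reward"  # placeholder checkpoint
reward_tokenizer = AutoTokenizer.from_pretrained(REWARD_MODEL)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    REWARD_MODEL, num_labels=4, problem_type="regression"
)

def combined_reward(prompts, completions, **kwargs):
    """Score each completion on the four axes and return the summed scalar reward."""
    # Assumes plain-text prompts/completions (non-conversational dataset format).
    texts = [p + c for p, c in zip(prompts, completions)]
    batch = reward_tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Sigmoid keeps each axis score in [0, 1]; the actual reward model's
        # output normalization may differ.
        scores = torch.sigmoid(reward_model(**batch).logits)  # shape (batch, 4)
    return scores.sum(dim=-1).tolist()  # scalar reward = sum of the four axis scores

# GRPO fine-tuning of the base model with a LoRA adapter (illustrative settings).
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=combined_reward,
    args=GRPOConfig(output_dir="grpo_saved_lora", num_generations=8),
    train_dataset=load_dataset("trl-lib/tldr", split="train"),  # placeholder dataset
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```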
| Metric | Base | Fine-Tuned | Δ |
|---|---|---|---|
| Politeness | 0.48 | 0.59 | +0.11 |
| Meaningfulness | 0.61 | 0.65 | +0.04 |
| Actionability | 0.53 | 0.66 | +0.13 |
| Safety | 0.42 | 0.70 | +0.28 |
| Combined | 0.54 | 0.66 | +0.12 |
Example usage:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer, then attach the GRPO-trained LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
adapter = PeftModel.from_pretrained(base_model, "hydroxai/grpo_saved_lora")

# Generate a response with the adapted model.
inputs = tokenizer("How can we improve online safety?", return_tensors="pt")
outputs = adapter.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
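For standalone deployment, the LoRA weights can be folded into the base model with `peft`'s `merge_and_unload`; the save path below is only an example:

```python
# Merge the adapter into the base weights and save a standalone checkpoint.
merged = adapter.merge_and_unload()
merged.save_pretrained("qwen2.5-0.5b-safegrpo-merged")  # hypothetical output path
tokenizer.save_pretrained("qwen2.5-0.5b-safegrpo-merged")
```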
If you use this model, please cite:
```bibtex
@article{li2025safegrpo,
  title   = {Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach},
  author  = {Li, Xuying and Li, Zhuo and Kosuga, Yuji and Bian, Victor},
  journal = {arXiv preprint arXiv:2503.21819},
  year    = {2025},
  url     = {https://arxiv.org/abs/2503.21819}
}
```
Maintained by HydroX AI.