SafeGRPO-LoRA
This is a LoRA adapter for Qwen2.5-0.5B, trained with the GRPO (Group Relative Policy Optimization) algorithm and a multi-label reward model for safe and aligned language generation.
A RoBERTa-based multi-label regression model was used to compute rewards on four alignment axes:

- Politeness
- Meaningfulness
- Actionability
- Safety
Each output was scored in [0,1], and the sum of the four scores was used as the scalar reward.
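The training script is not part of this repository; the snippet below is a minimal sketch of how such a summed multi-label reward could be wired into GRPO training, assuming the reward model is a four-output `AutoModelForSequenceClassification` regression head and that training uses `trl`'s `GRPOTrainer`. The reward-model checkpoint name, dataset, and hyperparameters are placeholders, not the actual training setup.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import GRPOConfig, GRPOTrainer

# Hypothetical multi-label reward model: a RoBERTa regression head with four
# outputs (politeness, meaningfulness, actionability, safety).
REWARD_MODEL = "your-org/roberta-multilabel-reward"  # placeholder checkpoint
reward_tokenizer = AutoTokenizer.from_pretrained(REWARD_MODEL)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    REWARD_MODEL, num_labels=4, problem_type="regression"
)

def combined_reward(prompts, completions, **kwargs):
    """Score each completion on the four axes and return the summed scalar reward."""
    # Assumes plain-text prompts/completions (non-conversational dataset format).
    texts = [p + c for p, c in zip(prompts, completions)]
    batch = reward_tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Sigmoid keeps each axis score in [0, 1]; the actual reward model's
        # output normalization may differ.
        scores = torch.sigmoid(reward_model(**batch).logits)  # shape (batch, 4)
    return scores.sum(dim=-1).tolist()  # scalar reward = sum of the four axis scores

# GRPO fine-tuning of the base model with a LoRA adapter (illustrative settings).
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=combined_reward,
    args=GRPOConfig(output_dir="grpo_saved_lora", num_generations=8),
    train_dataset=load_dataset("trl-lib/tldr", split="train"),  # placeholder dataset
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```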
| Metric | Base | Fine-Tuned | Δ |
|---|---|---|---|
| Politeness | 0.48 | 0.59 | +0.11 |
| Meaningfulness | 0.61 | 0.65 | +0.04 |
| Actionability | 0.53 | 0.66 | +0.13 |
| Safety | 0.42 | 0.70 | +0.28 |
| Combined | 0.54 | 0.66 | +0.12 |
Example usage:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer, then attach the GRPO-trained LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
adapter = PeftModel.from_pretrained(base_model, "hydroxai/grpo_saved_lora")

# Generate a response with the adapted model.
inputs = tokenizer("How can we improve online safety?", return_tensors="pt")
outputs = adapter.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
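For standalone deployment, the LoRA weights can be folded into the base model with `peft`'s `merge_and_unload`; the save path below is only an example:

```python
# Merge the adapter into the base weights and save a standalone checkpoint.
merged = adapter.merge_and_unload()
merged.save_pretrained("qwen2.5-0.5b-safegrpo-merged")  # hypothetical output path
tokenizer.save_pretrained("qwen2.5-0.5b-safegrpo-merged")
```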
If you use this model, please cite:
```bibtex
@article{li2025safegrpo,
  title   = {Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach},
  author  = {Li, Xuying and Li, Zhuo and Kosuga, Yuji and Bian, Victor},
  journal = {arXiv preprint arXiv:2503.21819},
  year    = {2025},
  url     = {https://arxiv.org/abs/2503.21819}
}
```
Maintained by HydroX AI.