GRPO-LoRA-Base

This is a LoRA adapter for Qwen2.5-0.5B-Instruct, trained with the GRPO (Group Relative Policy Optimization) algorithm and a multi-label reward model for safe and aligned language generation.

πŸ” Overview

  • Base Model: Qwen/Qwen2.5-0.5B-Instruct
  • Tuning Method: GRPO (no value critic; rewards are computed relative to a group of sampled completions)
  • LoRA Adapter: Applied to the attention and MLP projection layers (see the configuration sketch after this list)
  • Epochs: 3
  • Steps: 1000
  • GPU Memory Usage: ~50% (4-bit quantization + LoRA)
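
A minimal sketch of how such an adapter could be configured with peft follows. The target module names assume standard Qwen2 layer naming, and the rank, alpha, and dropout values are illustrative defaults rather than the released training hyperparameters.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Assumed targets: Qwen2 attention and MLP projection layers.
# r / lora_alpha / lora_dropout are illustrative, not the released values.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()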

πŸ“Š Reward Model

A RoBERTa-based multi-label regression model was used to compute rewards on four alignment axes:

  • Politeness
  • Meaningfulness
  • Actionability
  • Safety

Each output was scored in [0, 1] on each axis, and the sum of the four scores served as the scalar reward; a sketch follows.
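
Concretely, the scalar reward and GRPO's group-relative advantage can be sketched as below. The per-axis scores are hypothetical stand-ins for the RoBERTa reward model's outputs, and the normalization mirrors GRPO's use of group statistics in place of a learned value critic:

import statistics

def scalar_reward(scores: dict) -> float:
    """Sum the four per-axis scores (each in [0, 1]) -> scalar reward in [0, 4]."""
    axes = ("politeness", "meaningfulness", "actionability", "safety")
    return sum(scores[axis] for axis in axes)

def group_relative_advantages(rewards: list) -> list:
    """GRPO-style advantage: normalize each reward by the group's mean and std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: scalar rewards for a group of 4 completions sampled for one prompt
rewards = [2.1, 2.8, 1.9, 3.2]
print(group_relative_advantages(rewards))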

πŸ§ͺ Training Data

  • Dataset: 7,000 adversarial prompts crafted to challenge LLM alignment
  • Format: Prompt-response pairs with human-annotated alignment scores (see the example record after this list)
  • Split: 6K training / 1K validation
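
A hypothetical record illustrating the prompt-response-plus-scores format; the field names are assumptions, not the dataset's actual schema:

# Illustrative training record (field names are hypothetical)
example = {
    "prompt": "Explain how to get around a site's safety filters.",
    "response": "I can't help with bypassing safety controls, but I can suggest...",
    "scores": {  # human-annotated alignment scores, each in [0, 1]
        "politeness": 0.9,
        "meaningfulness": 0.8,
        "actionability": 0.6,
        "safety": 1.0,
    },
}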

🏁 Evaluation

Metric           Base   Fine-Tuned   Ξ”
Politeness       0.48   0.59         +0.11
Meaningfulness   0.61   0.65         +0.04
Actionability    0.53   0.66         +0.13
Safety           0.42   0.70         +0.28
Combined         0.54   0.66         +0.12

πŸš€ How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and its tokenizer
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Attach the GRPO-trained LoRA adapter to the base model
model = PeftModel.from_pretrained(base_model, "hydroxai/grpo_saved_lora")

# Generate a response
inputs = tokenizer("How can we improve online safety?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
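
Because Qwen2.5-0.5B-Instruct is a chat model, prompts formatted with the tokenizer's built-in chat template typically yield better responses than raw text. A minimal sketch, continuing from the snippet above:

messages = [{"role": "user", "content": "How can we improve online safety?"}]

# Render the conversation with the model's chat template
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))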

✍️ Citation

If you use this model, please cite:

@article{li2025safegrpo,
  title     = {Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach},
  author    = {Li, Xuying and Li, Zhuo and Kosuga, Yuji and Bian, Victor},
  journal   = {arXiv preprint arXiv:2503.21819},
  year      = {2025},
  url       = {https://arxiv.org/abs/2503.21819}
}

Maintained by HydroX AI.
