Chinese GRPO Qwen2.5-7B (50% Dataset)
A Chinese large language model trained with GRPO (Group Relative Policy Optimization), optimized for neutrality and reasoning ability when handling sensitive topics.
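As a rough illustration of the "group relative" idea behind GRPO: several answers are sampled for the same prompt, and each answer's reward is normalized against the mean and standard deviation of its own group. The sketch below is not this model's training code. The function name, the reward values, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to the same prompt.
rewards = [0.2, 0.5, 0.7, 0.9]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that score above the group mean get a positive advantage and are reinforced; answers below it get a negative one. If every answer in a group receives the same reward, all advantages are zero and that group contributes no learning signal.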
Model Details
- Base model: Qwen/Qwen2.5-7B-Instruct
- Training method: GRPO with LoRA
- Training data: 50% of the Chinese reasoning dataset (12,238 preference pairs)
- Training time: 39 hours 44 minutes
- Hardware: NVIDIA RTX 4090 24GB
- Final reward score: 0.66
Training Configuration
LoRA config:
  r: 16
  alpha: 32
  dropout: 0.05
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
Training parameters:
  learning_rate: 3e-05
  batch_size: 16
  gradient_accumulation_steps: 2
  num_epochs: 2
  total_steps: 5,506
Optimization settings:
  quantization: 4-bit
  gradient_checkpointing: true
  bf16: true
  dataloader_num_workers: 0 # works around a pickle error
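To give a sense of how small the adapter is under this LoRA configuration, the sketch below estimates the number of trainable parameters. The layer dimensions are assumptions taken from the commonly published Qwen2.5-7B architecture, not from this card, so check them against the base model's `config.json`. The helper name `lora_params` is hypothetical.

```python
# Assumed Qwen2.5-7B dimensions (not stated in this card).
HIDDEN = 3584
INTERMEDIATE = 18944
KV_DIM = 512  # 4 key/value heads x 128 head dim
NUM_LAYERS = 28

RANK = 16   # r from the LoRA config
ALPHA = 32  # alpha from the LoRA config

# (in_features, out_features) for each module listed in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    # LoRA adds two matrices per adapted weight: A (rank x in) and B (out x rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update

print(f"{total:,} trainable LoRA parameters, scaling factor {scaling}")
```

Under these assumed dimensions the adapter holds roughly 40 million parameters, well under 1% of the 7B base model, which is part of what makes training on a single 24GB GPU feasible alongside 4-bit quantization and gradient checkpointing.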
Performance Metrics
- Final training loss: 0.058
- Reward score: 0.6604 ± 0.068
- KL divergence: 1.90
- Tokens processed: 65,428,674
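The token count can be combined with the reported training time and step count to get a rough throughput figure. This is back-of-the-envelope arithmetic on the numbers in this card, not a measured benchmark; the variable names are illustrative.

```python
TOKENS = 65_428_674                    # tokens processed
STEPS = 5_506                          # completed optimizer steps
TRAIN_SECONDS = 39 * 3600 + 44 * 60    # 39 hours 44 minutes

tokens_per_second = TOKENS / TRAIN_SECONDS
seconds_per_step = TRAIN_SECONDS / STEPS
tokens_per_step = TOKENS / STEPS

print(f"{tokens_per_second:.0f} tokens/s")
print(f"{seconds_per_step:.1f} s/step")
print(f"{tokens_per_step:.0f} tokens/step")
```

This works out to roughly 457 tokens per second and about 26 seconds per step on the single RTX 4090.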
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Quantization config: 4-bit NF4 with double quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)

# Load the LoRA weights on top of the base model
model = PeftModel.from_pretrained(
    base_model,
    "RayTsai/chinese-grpo-qwen2.5-7b-50percent"
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("RayTsai/chinese-grpo-qwen2.5-7b-50percent")

# Run the model. The prompt stays in Chinese: replace [您的問題] with your
# question and [選項A] through [選項D] with the four answer options.
prompt = "問題:[您的問題]\n\n選項:\nA. [選項A]\nB. [選項B]\nC. [選項C]\nD. [選項D]\n\n請選擇正確答案並說明理由。"
# Move the inputs to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
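The prompt above is a fixed multiple-choice template. A small helper can fill it in programmatically. The function `build_prompt` below is a hypothetical convenience wrapper, not part of this repository; it reproduces the template string from the usage example and assumes exactly four options.

```python
def build_prompt(question, options):
    """Fill the multiple-choice template with a question and four options."""
    labels = "ABCD"
    if len(options) != len(labels):
        raise ValueError("expected exactly four options")
    option_lines = "\n".join(
        f"{label}. {text}" for label, text in zip(labels, options)
    )
    # Same wording as the template in the usage example.
    return f"問題:{question}\n\n選項:\n{option_lines}\n\n請選擇正確答案並說明理由。"


# Hypothetical example question.
prompt = build_prompt("水的化學式是什麼?", ["CO2", "H2O", "O2", "NaCl"])
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the hand-written `prompt`.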
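Features
- Neutrality: GRPO training improves the objectivity of responses
- Reasoning: provides a detailed reasoning process along with the answer
- Stability: trained for about 40 hours, until the model converged
- Efficiency: 4-bit quantization allows the model to run on consumer GPUs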
Training Log
- Start time: 2024-06-24 22:31:39
- End time: 2024-06-26 14:19:35
- Total steps: 5,506/5,508 (99.96%)
- Checkpoints saved: 27 (every 200 steps)
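The log figures can be cross-checked with a few lines of arithmetic. The sketch below only uses the timestamps and step counts listed above; the variable names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
start = datetime.strptime("2024-06-24 22:31:39", FMT)
end = datetime.strptime("2024-06-26 14:19:35", FMT)
elapsed = end - start  # wall-clock time between the two log entries

completed_steps, planned_steps, save_every = 5506, 5508, 200
checkpoints = completed_steps // save_every    # one checkpoint per 200 steps
progress = completed_steps / planned_steps     # fraction of planned steps done

print(elapsed, checkpoints, f"{progress:.2%}")
```

The wall-clock span between the two timestamps is about 39 hours 48 minutes, close to the reported training time of 39 hours 44 minutes, and the checkpoint count and completion percentage match the log.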
Citation
@misc{chinese-grpo-2025,
author = {Ray Tsai},
title = {Chinese GRPO Qwen2.5-7B 50% Dataset},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/RayTsai/chinese-grpo-qwen2.5-7b-50percent}
}
License
Apache License 2.0
Contact
For questions or collaboration inquiries, please get in touch via the HuggingFace platform.