Chinese GRPO Qwen2.5-7B (40% Dataset)

A Chinese large language model trained with GRPO (Group Relative Policy Optimization), optimized for neutral, well-reasoned handling of sensitive topics.
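
For context: GRPO samples a group of completions per prompt, scores each with a reward function, and normalizes every reward against the group's mean and standard deviation, so no separate value network is needed. A minimal sketch of that advantage computation (illustrative only, not the training code behind this model):

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, group_size) reward scores, one row per prompt.
    # Each completion's advantage is its reward normalized within its group.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each
rewards = torch.tensor([[0.2, 0.9, 0.5, 0.4],
                        [0.7, 0.1, 0.6, 0.6]])
print(group_relative_advantages(rewards))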

Model Details

  • Base model: Qwen/Qwen2.5-7B-Instruct
  • Training method: GRPO with LoRA
  • Training data: 40% of the Chinese reasoning dataset (12,238 preference pairs; an illustrative record follows below)
  • Training time: 39 hours 44 minutes
  • Hardware: NVIDIA RTX 4090 24GB
  • Final reward score: 0.66
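
The on-disk format of the preference pairs is not published. Purely to illustrate what such a pair typically contains, a hypothetical record (every field name here is an assumption, not the actual dataset schema):

# Hypothetical preference-pair record; the real dataset format is not documented.
pair = {
    "prompt": "問題:...\n選項:\nA. ...\nB. ...\nC. ...\nD. ...\n請選擇正確答案並說明理由。",
    "chosen": "答案:B。理由:...",   # preferred: answer plus reasoning
    "rejected": "答案:C。",          # dispreferred: bare or wrong answer
}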

Training Configuration

LoRA configuration:
  r: 16
  alpha: 32
  dropout: 0.05
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
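
These values map one-to-one onto a peft LoraConfig. A minimal sketch of how the adapter would be declared (standard peft API; not the author's actual training script):

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                     # LoRA rank
    lora_alpha=32,            # scaling factor (alpha/r = 2.0)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)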

Training parameters:
  learning_rate: 3e-05
  batch_size: 16
  gradient_accumulation_steps: 2
  num_epochs: 2
  total_steps: 5,506
  
Optimization settings:
  quantization: 4-bit
  gradient_checkpointing: true
  bf16: true
  dataloader_num_workers: 0  # works around pickle errors in multi-worker loading
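
The card does not name the training framework. Assuming trl's GRPOTrainer, the hyperparameters above would translate roughly to the following GRPOConfig (a sketch under that assumption; output_dir is hypothetical):

from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="grpo-qwen2.5-7b-40pct",  # hypothetical path
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=2,
    bf16=True,
    gradient_checkpointing=True,
    dataloader_num_workers=0,  # 0 sidesteps the pickle issue noted above
)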

Performance Metrics

  • Final training loss: 0.058
  • Reward score: 0.6604 ± 0.068
  • KL divergence vs. the reference model: 1.90 (see the objective below)
  • Tokens processed: 65,428,674
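
The KL figure measures how far the trained policy has drifted from the frozen reference model. In the standard GRPO objective (simplified here, omitting the PPO-style clipping; this formulation comes from the GRPO literature, not from this card) it appears as a weighted penalty:

\[
\mathcal{J}(\theta) = \mathbb{E}\Big[\frac{1}{G}\sum_{i=1}^{G} A_i \log \pi_\theta(o_i \mid q)\Big] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]
\]

where \(A_i\) is the group-normalized advantage sketched earlier, \(G\) the group size, and \(\beta\) the KL weight.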

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Quantization config for 4-bit loading
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)

# Load the LoRA adapter weights
model = PeftModel.from_pretrained(
    base_model,
    "RayTsai/chinese-grpo-qwen2.5-7b-40percent"
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("RayTsai/chinese-grpo-qwen2.5-7b-40percent")

# Run inference (the prompt template is in Chinese, matching the training format)
prompt = "問題:[您的問題]\n\n選項:\nA. [選項A]\nB. [選項B]\nC. [選項C]\nD. [選項D]\n\n請選擇正確答案並說明理由。"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
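
Because the base model is instruct-tuned, wrapping the prompt in Qwen's chat template usually produces cleaner answers. A sketch using the standard transformers API (whether the adapter was trained on chat-formatted inputs is not stated on this card):

messages = [{"role": "user", "content": prompt}]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(chat_inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True))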

Key Features

  1. Neutrality optimization: GRPO training improves the objectivity of responses
  2. Reasoning ability: answers come with a detailed reasoning process, not just a final choice
  3. Stability: nearly 40 hours of training, run to convergence
  4. Efficiency: 4-bit quantization lets the model run on consumer-grade GPUs

Training Log

  • Start time: 2024-06-24 22:31:39
  • End time: 2024-06-26 14:19:35
  • Total steps: 5,506/5,508 (99.96%)
  • Checkpoints saved: 27 (one every 200 steps; each loads like the final adapter, as sketched below)
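
Any intermediate checkpoint can be loaded for inspection the same way as the published adapter, assuming the usual peft checkpoint layout (the local path below is hypothetical):

from peft import PeftModel

# base_model as constructed in the Usage section above;
# "output/checkpoint-5400" is a hypothetical local checkpoint directory.
model = PeftModel.from_pretrained(base_model, "output/checkpoint-5400")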

Citation

@misc{chinese-grpo-2025,
  author = {Ray Tsai},
  title = {Chinese GRPO Qwen2.5-7B 40% Dataset},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/RayTsai/chinese-grpo-qwen2.5-7b-40percent}
}

License

Apache License 2.0

Contact

For questions or collaboration inquiries, please reach out via the HuggingFace platform.
