Chinese GRPO Qwen2.5-7B (40% Dataset)

A Chinese large language model trained with GRPO (Group Relative Policy Optimization), optimized for neutral, well-reasoned handling of sensitive topics.
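
For context: GRPO samples a group of completions per prompt, scores each with a reward function, and normalizes every reward against the group's mean and standard deviation, so no separate value network is needed. A minimal sketch of that advantage computation (illustrative only, not the training code behind this model):

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, group_size) reward scores, one row per prompt.
    # Each completion's advantage is its reward normalized within its group.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each
rewards = torch.tensor([[0.2, 0.9, 0.5, 0.4],
                        [0.7, 0.1, 0.6, 0.6]])
print(group_relative_advantages(rewards))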

Model Details

  • Base model: Qwen/Qwen2.5-7B-Instruct
  • Training method: GRPO with LoRA
  • Training data: 40% of the Chinese reasoning dataset (12,238 preference pairs; an illustrative record follows below)
  • Training time: 39 hours 44 minutes
  • Hardware: NVIDIA RTX 4090 24GB
  • Final reward score: 0.66
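
The on-disk format of the preference pairs is not published. Purely to illustrate what such a pair typically contains, a hypothetical record (every field name here is an assumption, not the actual dataset schema):

# Hypothetical preference-pair record; the real dataset format is not documented.
pair = {
    "prompt": "問題:...\n選項:\nA. ...\nB. ...\nC. ...\nD. ...\n請選擇正確答案並說明理由。",
    "chosen": "答案:B。理由:...",   # preferred: answer plus reasoning
    "rejected": "答案:C。",          # dispreferred: bare or wrong answer
}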

Training Configuration

LoRA configuration:
  r: 16
  alpha: 32
  dropout: 0.05
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
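
These values map one-to-one onto a peft LoraConfig. A minimal sketch of how the adapter would be declared (standard peft API; not the author's actual training script):

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                     # LoRA rank
    lora_alpha=32,            # scaling factor (alpha/r = 2.0)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)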

Training parameters:
  learning_rate: 3e-05
  batch_size: 16
  gradient_accumulation_steps: 2
  num_epochs: 2
  total_steps: 5,506
  
Optimization settings:
  quantization: 4-bit
  gradient_checkpointing: true
  bf16: true
  dataloader_num_workers: 0  # works around pickle errors in multi-worker loading
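
The card does not name the training framework. Assuming trl's GRPOTrainer, the hyperparameters above would translate roughly to the following GRPOConfig (a sketch under that assumption; output_dir is hypothetical):

from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="grpo-qwen2.5-7b-40pct",  # hypothetical path
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=2,
    bf16=True,
    gradient_checkpointing=True,
    dataloader_num_workers=0,  # 0 sidesteps the pickle issue noted above
)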

Performance Metrics

  • Final training loss: 0.058
  • Reward score: 0.6604 ± 0.068
  • KL divergence vs. the reference model: 1.90 (see the objective below)
  • Tokens processed: 65,428,674
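
The KL figure measures how far the trained policy has drifted from the frozen reference model. In the standard GRPO objective (simplified here, omitting the PPO-style clipping; this formulation comes from the GRPO literature, not from this card) it appears as a weighted penalty:

\[
\mathcal{J}(\theta) = \mathbb{E}\Big[\frac{1}{G}\sum_{i=1}^{G} A_i \log \pi_\theta(o_i \mid q)\Big] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]
\]

where \(A_i\) is the group-normalized advantage sketched earlier, \(G\) the group size, and \(\beta\) the KL weight.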

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Quantization config for 4-bit loading
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)

# Load the LoRA adapter weights
model = PeftModel.from_pretrained(
    base_model,
    "RayTsai/chinese-grpo-qwen2.5-7b-40percent"
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("RayTsai/chinese-grpo-qwen2.5-7b-40percent")

# Run inference (the prompt template is in Chinese, matching the training format)
prompt = "問題:[您的問題]\n\n選項:\nA. [選項A]\nB. [選項B]\nC. [選項C]\nD. [選項D]\n\n請選擇正確答案並說明理由。"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
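
Because the base model is instruct-tuned, wrapping the prompt in Qwen's chat template usually produces cleaner answers. A sketch using the standard transformers API (whether the adapter was trained on chat-formatted inputs is not stated on this card):

messages = [{"role": "user", "content": prompt}]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(chat_inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True))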

Key Features

  1. Neutrality optimization: GRPO training improves the objectivity of responses
  2. Reasoning ability: answers come with a detailed reasoning process, not just a final choice
  3. Stability: nearly 40 hours of training, run to convergence
  4. Efficiency: 4-bit quantization lets the model run on consumer-grade GPUs

Training Log

  • Start time: 2024-06-24 22:31:39
  • End time: 2024-06-26 14:19:35
  • Total steps: 5,506/5,508 (99.96%)
  • Checkpoints saved: 27 (one every 200 steps; each loads like the final adapter, as sketched below)
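
Any intermediate checkpoint can be loaded for inspection the same way as the published adapter, assuming the usual peft checkpoint layout (the local path below is hypothetical):

from peft import PeftModel

# base_model as constructed in the Usage section above;
# "output/checkpoint-5400" is a hypothetical local checkpoint directory.
model = PeftModel.from_pretrained(base_model, "output/checkpoint-5400")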

Citation

@misc{chinese-grpo-2025,
  author = {Ray Tsai},
  title = {Chinese GRPO Qwen2.5-7B 40% Dataset},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/RayTsai/chinese-grpo-qwen2.5-7b-40percent}
}

License

Apache License 2.0

Contact

For questions or collaboration inquiries, please reach out via the HuggingFace platform.
