zephyr-7b-dpo-qlora

This model is a fine-tuned version of FaeMo/zephyr-7b-sft-qlora on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set (the reward metrics are defined after the list):

  • Loss: 0.4984
  • Rewards/chosen: -1.7141
  • Rewards/rejected: -2.7356
  • Rewards/accuracies: 0.7380
  • Rewards/margins: 1.0215
  • Logps/rejected: -520.1021
  • Logps/chosen: -442.1446
  • Logits/rejected: -0.6472
  • Logits/chosen: -0.8116
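
For context (these definitions are standard for DPO-style trainers and are not stated in the card itself): writing π_θ for the policy, π_ref for the frozen reference model (here the SFT base) and β for the DPO temperature, the implicit reward of a completion y for a prompt x is

```latex
r_\theta(x, y) = \beta \left[ \log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) \right]
```

Rewards/margins is the mean of r_θ(x, y_chosen) − r_θ(x, y_rejected) over the evaluation pairs, Rewards/accuracies is the fraction of pairs with a positive margin (here the preferred answer wins roughly 74% of the time), and the Logps values are the policy's summed log-probabilities of each completion.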

Model description

This is a QLoRA (LoRA) adapter for FaeMo/zephyr-7b-sft-qlora, produced by direct preference optimization (DPO) on binarized UltraFeedback preference pairs. The repository contains only the adapter weights; the base model is loaded separately and the adapter is applied on top at inference time.

Intended uses & limitations

More information needed
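
Pending a fuller write-up, the sketch below shows one way to load the adapter for inference with PEFT. It assumes the adapter lives on the Hub under FaeMo/zephyr-7b-dpo-qlora, that the repo (or its base model) ships a tokenizer with a chat template, and that bf16 weights fit in memory; adjust quantization and device placement to your setup.

```python
# Minimal sketch (not from the card): load the base model plus this adapter via PEFT.
import torch
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM

adapter_id = "FaeMo/zephyr-7b-dpo-qlora"

# Loads the base model recorded in the adapter config, then applies the adapter.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id,
    torch_dtype=torch.bfloat16,  # pass a BitsAndBytesConfig instead to load in 4-bit
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)

messages = [{"role": "user", "content": "Summarize what DPO training does."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

For standalone deployment, the adapter can be folded into the base weights with model.merge_and_unload() and the merged model saved as a regular checkpoint.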

Training and evaluation data

Training and evaluation used HuggingFaceH4/ultrafeedback_binarized, a version of the UltraFeedback dataset binarized into chosen/rejected response pairs for preference optimization.

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto a trainer config follows the list):

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • total_eval_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
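
The card does not name the trainer, and the DPO β is not recorded. Assuming a trl-style DPOTrainer setup, which these hyperparameters resemble, they map onto a DPOConfig roughly as below; beta=0.1 is an illustrative guess, not a value from the card.

```python
# Sketch only: the hyperparameters above expressed as a trl DPOConfig.
# The 2-GPU distributed setup would come from launching with accelerate.
from trl import DPOConfig

training_args = DPOConfig(
    output_dir="zephyr-7b-dpo-qlora",
    learning_rate=5e-6,
    per_device_train_batch_size=4,   # train_batch_size
    per_device_eval_batch_size=8,    # eval_batch_size
    gradient_accumulation_steps=4,   # 2 GPUs x 4 per device x 4 steps = 32 total
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    adam_beta1=0.9,                  # Adam betas and epsilon as listed above
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    beta=0.1,                        # assumed DPO temperature; not in the card
)
```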

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6637 | 0.0523 | 100 | 0.6629 | -0.0221 | -0.0993 | 0.6960 | 0.0772 | -256.4698 | -272.9435 | -2.0116 | -2.1064 |
| 0.602 | 0.1047 | 200 | 0.6024 | -0.4514 | -0.7874 | 0.6980 | 0.3359 | -325.2777 | -315.8761 | -1.9907 | -2.0787 |
| 0.6039 | 0.1570 | 300 | 0.6019 | -1.2635 | -1.7029 | 0.6940 | 0.4394 | -416.8304 | -397.0779 | -2.0386 | -2.1217 |
| 0.5585 | 0.2094 | 400 | 0.5523 | -1.2961 | -1.9215 | 0.7350 | 0.6254 | -438.6909 | -400.3457 | -1.6961 | -1.7646 |
| 0.5064 | 0.2617 | 500 | 0.5590 | -2.1197 | -2.9173 | 0.7180 | 0.7975 | -538.2687 | -482.7069 | -1.4471 | -1.5362 |
| 0.5405 | 0.3141 | 600 | 0.5277 | -1.1388 | -1.8514 | 0.7460 | 0.7126 | -431.6833 | -384.6164 | -1.3715 | -1.4743 |
| 0.5165 | 0.3664 | 700 | 0.5212 | -1.1986 | -1.9850 | 0.7440 | 0.7864 | -445.0399 | -390.5944 | -1.1511 | -1.2683 |
| 0.545 | 0.4187 | 800 | 0.5156 | -1.3880 | -2.1939 | 0.7410 | 0.8059 | -465.9337 | -409.5362 | -0.9338 | -1.0790 |
| 0.5079 | 0.4711 | 900 | 0.5144 | -1.5228 | -2.4328 | 0.7320 | 0.9100 | -489.8168 | -423.0117 | -1.2508 | -1.3680 |
| 0.4872 | 0.5234 | 1000 | 0.5079 | -1.5868 | -2.4746 | 0.7330 | 0.8879 | -494.0053 | -429.4106 | -1.0344 | -1.1743 |
| 0.4962 | 0.5758 | 1100 | 0.5052 | -1.3941 | -2.2720 | 0.7420 | 0.8779 | -473.7390 | -410.1423 | -0.9681 | -1.1130 |
| 0.494 | 0.6281 | 1200 | 0.5027 | -1.7515 | -2.7598 | 0.7390 | 1.0082 | -522.5185 | -445.8872 | -0.7778 | -0.9262 |
| 0.4848 | 0.6805 | 1300 | 0.5030 | -1.4533 | -2.3941 | 0.7420 | 0.9408 | -485.9530 | -416.0602 | -1.0210 | -1.1597 |
| 0.4792 | 0.7328 | 1400 | 0.5000 | -1.7471 | -2.7718 | 0.7390 | 1.0247 | -523.7210 | -445.4379 | -0.5887 | -0.7571 |
| 0.4773 | 0.7851 | 1500 | 0.4987 | -1.6362 | -2.6113 | 0.7370 | 0.9751 | -507.6723 | -434.3538 | -0.6593 | -0.8222 |
| 0.5122 | 0.8375 | 1600 | 0.4988 | -1.5837 | -2.5412 | 0.7420 | 0.9575 | -500.6636 | -429.1013 | -0.7098 | -0.8688 |
| 0.4726 | 0.8898 | 1700 | 0.4981 | -1.7114 | -2.7207 | 0.7380 | 1.0094 | -518.6156 | -441.8715 | -0.6430 | -0.8071 |
| 0.4909 | 0.9422 | 1800 | 0.4984 | -1.7246 | -2.7501 | 0.7390 | 1.0254 | -521.5475 | -443.1978 | -0.6501 | -0.8142 |
| 0.4967 | 0.9945 | 1900 | 0.4984 | -1.7140 | -2.7353 | 0.7390 | 1.0213 | -520.0685 | -442.1331 | -0.6647 | -0.8274 |

Framework versions

  • PEFT 0.14.0
  • Transformers 4.45.0
  • PyTorch 2.6.0+cu124
  • Datasets 3.3.2
  • Tokenizers 0.20.3