llama3-8b-mypo3_sim-full-beta12.5-lr4e-7

This model is a fine-tuned version of princeton-nlp/Llama-3-Base-8B-SFT on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 1.3762
  • Rewards/chosen: 0.0655
  • Rewards/rejected: -0.3701
  • Rewards/accuracies: 0.7560
  • Rewards/margins: 0.4356
  • Logps/rejected: -1.5190
  • Logps/chosen: -1.2659
  • Logits/rejected: -1.1037
  • Logits/chosen: -1.0759
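
A minimal usage sketch, assuming the checkpoint is loaded through the standard transformers causal-LM API; the prompt and generation settings below are illustrative and are not part of the training or evaluation setup.

```python
# Minimal inference sketch (illustrative; prompt and generation settings are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aaaalongaa/llama3-8b-mypo3_sim-full-beta12.5-lr4e-7"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # checkpoint weights are stored in BF16
    device_map="auto",
)

prompt = "Explain the difference between supervised fine-tuning and preference optimization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```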

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 4e-07
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • total_eval_batch_size: 32
  • optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
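
For reference, a hedged sketch of how these settings map onto transformers TrainingArguments. This is illustrative only: the actual training script and the mypo3_sim preference objective (beta 12.5) are not documented here, so only the optimizer, scheduler, and batching settings listed above are reproduced.

```python
# Illustrative mapping of the listed hyperparameters onto TrainingArguments.
# The preference-optimization loss itself is not reproduced in this sketch.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama3-8b-mypo3_sim-full-beta12.5-lr4e-7",
    learning_rate=4e-7,
    per_device_train_batch_size=4,   # 4 devices x 2 accumulation steps -> total batch 32
    per_device_eval_batch_size=8,    # 4 devices -> total eval batch 32
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    bf16=True,                       # train in bfloat16
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```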

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.379 | 0.0523 | 100 | 1.3804 | -0.0143 | -0.0874 | 0.6448 | 0.0731 | -1.4964 | -1.2723 | -1.0455 | -1.0137 |
| 1.3997 | 0.1047 | 200 | 1.4037 | -0.1231 | -0.3189 | 0.7024 | 0.1958 | -1.5149 | -1.2810 | -1.0477 | -1.0177 |
| 1.4069 | 0.1570 | 300 | 1.4016 | 0.1112 | -0.1817 | 0.7302 | 0.2929 | -1.5039 | -1.2623 | -1.0539 | -1.0256 |
| 1.4067 | 0.2094 | 400 | 1.4060 | 0.0174 | -0.3274 | 0.7202 | 0.3448 | -1.5156 | -1.2698 | -1.0205 | -0.9955 |
| 1.4144 | 0.2617 | 500 | 1.3973 | 0.0997 | -0.3029 | 0.7222 | 0.4026 | -1.5136 | -1.2632 | -1.0800 | -1.0518 |
| 1.4259 | 0.3141 | 600 | 1.4098 | 0.0202 | -0.3600 | 0.7242 | 0.3802 | -1.5182 | -1.2695 | -1.0593 | -1.0335 |
| 1.3595 | 0.3664 | 700 | 1.4119 | 0.0323 | -0.3666 | 0.7222 | 0.3989 | -1.5187 | -1.2686 | -1.0663 | -1.0400 |
| 1.449 | 0.4187 | 800 | 1.4198 | -0.0062 | -0.4193 | 0.7242 | 0.4130 | -1.5230 | -1.2716 | -1.0568 | -1.0320 |
| 1.4411 | 0.4711 | 900 | 1.4068 | 0.0924 | -0.3174 | 0.75 | 0.4098 | -1.5148 | -1.2638 | -1.0695 | -1.0427 |
| 1.379 | 0.5234 | 1000 | 1.3951 | 0.1021 | -0.3451 | 0.7460 | 0.4471 | -1.5170 | -1.2630 | -1.0724 | -1.0471 |
| 1.4269 | 0.5758 | 1100 | 1.4001 | 0.2006 | -0.2040 | 0.7321 | 0.4046 | -1.5057 | -1.2551 | -1.0807 | -1.0548 |
| 1.3973 | 0.6281 | 1200 | 1.3843 | 0.0314 | -0.4097 | 0.7421 | 0.4411 | -1.5222 | -1.2686 | -1.0827 | -1.0560 |
| 1.3629 | 0.6805 | 1300 | 1.3831 | 0.0455 | -0.3913 | 0.7421 | 0.4367 | -1.5207 | -1.2675 | -1.0595 | -1.0347 |
| 1.3587 | 0.7328 | 1400 | 1.3861 | 0.1402 | -0.2996 | 0.7440 | 0.4398 | -1.5134 | -1.2599 | -1.0802 | -1.0539 |
| 1.3972 | 0.7851 | 1500 | 1.3793 | 0.0976 | -0.3469 | 0.7401 | 0.4445 | -1.5172 | -1.2633 | -1.0829 | -1.0565 |
| 1.3762 | 0.8375 | 1600 | 1.3783 | 0.0925 | -0.3479 | 0.7480 | 0.4404 | -1.5172 | -1.2637 | -1.0900 | -1.0631 |
| 1.3757 | 0.8898 | 1700 | 1.3774 | 0.0540 | -0.3880 | 0.7480 | 0.4420 | -1.5204 | -1.2668 | -1.0737 | -1.0482 |
| 1.3685 | 0.9422 | 1800 | 1.3773 | 0.0739 | -0.3636 | 0.7480 | 0.4375 | -1.5185 | -1.2652 | -1.0894 | -1.0627 |
| 1.3649 | 0.9945 | 1900 | 1.3769 | 0.0610 | -0.3706 | 0.7460 | 0.4315 | -1.5191 | -1.2663 | -1.1038 | -1.0760 |

Framework versions

  • Transformers 4.43.1
  • Pytorch 2.1.2+cu121
  • Datasets 2.18.0
  • Tokenizers 0.19.1