Llama-2-7b-hf-DPO-LookAhead-5_TTree1.4_TT0.9_TP0.7_TE0.2_V7

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on an unspecified dataset (not documented in this card). It achieves the following results on the evaluation set (the reward columns are explained in the note after the list):

  • Loss: 0.4437
  • Rewards/chosen: -1.5898
  • Rewards/rejected: -2.7509
  • Rewards/accuracies: 0.7000
  • Rewards/margins: 1.1611
  • Logps/rejected: -114.1047
  • Logps/chosen: -92.5540
  • Logits/rejected: -0.0729
  • Logits/chosen: -0.0526
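
The reward columns are DPO implicit rewards, i.e. β-scaled log-probability ratios between the fine-tuned policy and the reference model; the β used for this run is not recorded in the card. Rewards/margins is the mean chosen-minus-rejected gap (here 1.1611 = −1.5898 − (−2.7509)), and Rewards/accuracies is the fraction of evaluation pairs whose chosen reward exceeds the rejected one. As a reference formula:

```latex
% DPO implicit reward for a completion y given prompt x (beta not recorded in this card):
r_\theta(x, y) = \beta \,\bigl[\log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\bigr]

% Reported margin metric:
\text{Rewards/margins} = \mathbb{E}\bigl[\, r_\theta(x, y_{\mathrm{chosen}}) - r_\theta(x, y_{\mathrm{rejected}}) \,\bigr]
```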

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a hedged TRL sketch follows the list):

  • learning_rate: 5e-05
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 4
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 10
  • num_epochs: 3
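
The hyperparameters above map directly onto a TRL `DPOConfig`. The sketch below is an illustration only: the preference dataset, output directory, and DPO β are not documented in this card, and the exact `DPOTrainer` call signature varies across TRL releases.

```python
# Illustrative sketch only: reconstructs the training setup from the listed
# hyperparameters using TRL's DPOTrainer. The actual preference dataset,
# output directory, and DPO beta are not documented in this card.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 has no pad token by default

# Placeholder preference pairs; the real dataset is not documented in the card.
train_dataset = Dataset.from_dict({
    "prompt":   ["Example prompt"],
    "chosen":   ["Preferred completion"],
    "rejected": ["Dispreferred completion"],
})

args = DPOConfig(
    output_dir="dpo-output",           # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,     # effective train batch size of 4
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    seed=42,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,               # "processing_class=" in newer TRL releases
)
trainer.train()
```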

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---------------|--------|------|-----------------|----------------|------------------|--------------------|-----------------|----------------|--------------|-----------------|---------------|
| 0.6702        | 0.2993 | 66   | 0.6613          | 0.0837         | -0.0035          | 0.7000             | 0.0872          | -86.6308       | -75.8190     | 0.3314          | 0.3469        |
| 0.686         | 0.5986 | 132  | 0.5646          | 0.0172         | -0.3322          | 0.8000             | 0.3494          | -89.9173       | -76.4838     | 0.3494          | 0.3651        |
| 0.7758        | 0.8980 | 198  | 0.5747          | 0.0543         | -0.2153          | 0.9000             | 0.2696          | -88.7488       | -76.1133     | 0.3694          | 0.3845        |
| 0.6695        | 1.1973 | 264  | 0.5693          | -0.2661        | -0.6699          | 0.7000             | 0.4038          | -93.2946       | -79.3173     | 0.3321          | 0.3466        |
| 0.5453        | 1.4966 | 330  | 0.5472          | -0.6038        | -1.1332          | 0.6000             | 0.5294          | -97.9278       | -82.6945     | 0.2266          | 0.2424        |
| 0.5922        | 1.7959 | 396  | 0.5142          | -0.9005        | -1.6462          | 0.6000             | 0.7457          | -103.0579      | -85.6614     | 0.1303          | 0.1477        |
| 0.2128        | 2.0952 | 462  | 0.4825          | -1.1082        | -1.9752          | 0.8000             | 0.8670          | -106.3474      | -87.7384     | 0.0713          | 0.0898        |
| 0.1372        | 2.3946 | 528  | 0.4425          | -1.4160        | -2.5347          | 0.8000             | 1.1187          | -111.9428      | -90.8164     | -0.0224         | -0.0028       |
| 0.3622        | 2.6939 | 594  | 0.4437          | -1.5113        | -2.6570          | 0.8000             | 1.1457          | -113.1660      | -91.7698     | -0.0636         | -0.0435       |
| 0.1555        | 2.9932 | 660  | 0.4437          | -1.5898        | -2.7509          | 0.7000             | 1.1611          | -114.1047      | -92.5540     | -0.0729         | -0.0526       |

Framework versions

  • PEFT 0.12.0
  • Transformers 4.45.2
  • Pytorch 2.4.0+cu121
  • Datasets 3.2.0
  • Tokenizers 0.20.3
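
This repository ships a PEFT adapter rather than full model weights, so it must be loaded on top of the meta-llama/Llama-2-7b-hf base model. A minimal sketch, assuming access to the gated base checkpoint; the dtype and device settings are illustrative:

```python
# Minimal sketch: load the DPO adapter on top of the Llama-2-7b base model.
# Assumes access to the gated meta-llama checkpoint and a CUDA-capable GPU.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-2-7b-hf"
adapter_id = "LBK95/Llama-2-7b-hf-DPO-LookAhead-5_TTree1.4_TT0.9_TP0.7_TE0.2_V7"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)

inputs = tokenizer("Hello, world!", return_tensors="pt").to(base.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```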