Llama-2-7b-hf-DPO-LookAhead-5_Q2_TTree1.4_TT0.9_TP0.7_TE0.2_V3

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on the None dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.7186	0.3012	75	0.6922	0.0088	-0.0056	0.6667	0.0144	-159.7317	-125.9045	0.2763	0.3051
0.6878	0.6024	150	0.6645	0.0065	-0.0784	0.6667	0.0850	-160.4602	-125.9270	0.2430	0.2714
0.7115	0.9036	225	0.6671	0.1245	0.0380	0.5833	0.0865	-159.2964	-124.7477	0.2585	0.2872
0.2588	1.2048	300	0.5773	-0.4124	-0.9074	0.6667	0.4951	-168.7503	-130.1161	0.1854	0.2129
0.5429	1.5060	375	0.6801	-0.4887	-0.7667	0.5	0.2780	-167.3426	-130.8791	0.0976	0.1239
0.3313	1.8072	450	0.7539	-0.6406	-0.7950	0.5	0.1545	-167.6264	-132.3980	0.0143	0.0407
0.2905	2.1084	525	0.8112	-1.3875	-1.4781	0.4167	0.0906	-174.4566	-139.8674	-0.1544	-0.1306
0.1737	2.4096	600	0.8469	-1.9078	-2.0075	0.4167	0.0997	-179.7509	-145.0706	-0.2506	-0.2282
0.2314	2.7108	675	0.8622	-1.8032	-1.8934	0.4167	0.0902	-178.6097	-144.0242	-0.2567	-0.2341