Llama-2-7b-hf-DPO-LookAhead-5_TTree1.4_TT0.9_TP0.7_TE0.2_V7

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf. The training dataset is not specified. The card does not list summary evaluation results; per-checkpoint evaluation metrics appear in the training table.

Model description

More information needed


Training hyperparameters are not listed in this card. Training and evaluation metrics were logged every 67 steps (about 0.3 epoch):

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.7096	0.3004	67	0.6970	-0.0217	-0.0183	0.5	-0.0034	-112.5630	-79.6082	0.6024	0.6487
0.6684	0.6009	134	0.6829	-0.0429	-0.0704	0.8000	0.0275	-113.0842	-79.8203	0.5780	0.6246
0.7283	0.9013	201	0.6982	0.0550	0.0616	0.6000	-0.0067	-111.7634	-78.8413	0.5848	0.6319
0.2339	1.2018	268	0.6630	-0.1631	-0.2504	0.7000	0.0873	-114.8840	-81.0225	0.4681	0.5163
0.3526	1.5022	335	0.6523	-0.5545	-0.6837	0.6000	0.1292	-119.2165	-84.9362	0.3518	0.4006
0.2787	1.8027	402	0.6181	-0.4772	-0.6749	0.6000	0.1977	-119.1291	-84.1633	0.3107	0.3615
0.2577	2.1031	469	0.6856	-1.0419	-1.1941	0.5	0.1522	-124.3209	-89.8106	0.1666	0.2190
0.0942	2.4036	536	0.7344	-1.5330	-1.7182	0.6000	0.1852	-129.5615	-94.7212	0.0278	0.0822
0.0952	2.7040	603	0.7581	-1.7006	-1.8759	0.6000	0.1753	-131.1391	-96.3973	-0.0046	0.0503
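
The table can be sanity-checked with a short script. The sketch below is illustrative and not part of the original training code: it assumes that Rewards/margins is Rewards/chosen minus Rewards/rejected (which the logged values match to within rounding), and it picks the checkpoint with the lowest validation loss. The names ROWS, max_margin_error, and best_row are hypothetical helpers; the numbers are copied from the table above.

```python
# Each row: (step, training_loss, validation_loss,
#            rewards_chosen, rewards_rejected, rewards_margins)
# Values are copied from the training table in this card.
ROWS = [
    (67, 0.7096, 0.6970, -0.0217, -0.0183, -0.0034),
    (134, 0.6684, 0.6829, -0.0429, -0.0704, 0.0275),
    (201, 0.7283, 0.6982, 0.0550, 0.0616, -0.0067),
    (268, 0.2339, 0.6630, -0.1631, -0.2504, 0.0873),
    (335, 0.3526, 0.6523, -0.5545, -0.6837, 0.1292),
    (402, 0.2787, 0.6181, -0.4772, -0.6749, 0.1977),
    (469, 0.2577, 0.6856, -1.0419, -1.1941, 0.1522),
    (536, 0.0942, 0.7344, -1.5330, -1.7182, 0.1852),
    (603, 0.0952, 0.7581, -1.7006, -1.8759, 0.1753),
]


def max_margin_error(rows):
    """Largest absolute gap between (chosen - rejected) and the logged margin."""
    return max(abs((chosen - rejected) - margin)
               for _, _, _, chosen, rejected, margin in rows)


def best_row(rows):
    """Row with the lowest validation loss."""
    return min(rows, key=lambda row: row[2])


if __name__ == "__main__":
    print("max |chosen - rejected - margin|:", max_margin_error(ROWS))
    step, train_loss, val_loss, *_ = best_row(ROWS)
    print("lowest validation loss:", val_loss, "at step", step)
    last_step, last_train, last_val, *_ = ROWS[-1]
    print("final step", last_step, "train loss", last_train, "val loss", last_val)
```

On these numbers the validation loss is lowest at step 402 (0.6181) and rises afterwards while the training loss keeps falling, so an earlier checkpoint may generalize better than the final one. Whether intermediate checkpoints were saved is not stated in this card.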