qwen2.5-0.5b-expo-L2EXPO-25-2

This model is a fine-tuned version of hZzy/qwen2.5-0.5b-sft3-25-2 on an unknown dataset. It achieves the following results on the evaluation set:

Model description

More information needed

The following hyperparameters were used during training:

Training results

Training Loss	Epoch	Step	Validation Loss	Objective	Reward Accuracy	Logp Accuracy	Log Diff Policy	Chosen Logps	Rejected Logps	Chosen Rewards	Rejected Rewards	Logits
0.5039	0.1577	50	0.5116	0.5048	0.5470	0.5218	1.0926	-92.9307	-94.0233	-0.0055	-0.0062	-1.2157
0.5118	0.3154	100	0.5106	0.5038	0.5772	0.5386	2.1626	-94.4899	-96.6525	-0.0070	-0.0089	-1.3464
0.5278	0.4731	150	0.5086	0.5014	0.5738	0.5576	5.1593	-135.0134	-140.1726	-0.0475	-0.0524	-1.7394
0.4845	0.6307	200	0.5046	0.4964	0.5755	0.5772	12.1099	-208.8495	-220.9594	-0.1214	-0.1332	-2.1451
0.4953	0.7884	250	0.5007	0.4912	0.5934	0.5923	19.7754	-249.4757	-269.2511	-0.1620	-0.1815	-2.6017
0.4661	0.9461	300	0.4969	0.4857	0.6012	0.5968	27.9289	-288.4738	-316.4027	-0.2010	-0.2286	-2.9416
0.4725	1.1038	350	0.4936	0.4822	0.6124	0.6023	33.0923	-295.9875	-329.0798	-0.2085	-0.2413	-3.2578
0.4881	1.2615	400	0.4913	0.4795	0.6102	0.6113	37.9280	-299.2147	-337.1428	-0.2117	-0.2493	-3.5394
0.4575	1.4192	450	0.4891	0.4761	0.6214	0.6119	42.1253	-322.7853	-364.9105	-0.2353	-0.2771	-3.9786
0.4817	1.5769	500	0.4882	0.4743	0.6214	0.6174	47.9842	-360.6322	-408.6165	-0.2732	-0.3208	-4.2328
0.4459	1.7346	550	0.4858	0.4719	0.6180	0.6158	51.5714	-355.0592	-406.6306	-0.2676	-0.3188	-4.4310
0.4515	1.8922	600	0.4846	0.4700	0.6219	0.6230	55.1050	-364.5329	-419.6379	-0.2771	-0.3318	-4.6179
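
The columns in the table above appear to be related: in every row, Log Diff Policy matches Chosen Logps minus Rejected Logps to within rounding (about 1e-4). This relationship is inferred from the reported numbers and is not stated by the model card. The sketch below is a minimal check of that arithmetic, with the values copied from the table; the names `ROWS` and `max_deviation` are illustrative only.

```python
# Each tuple: (step, log_diff_policy, chosen_logps, rejected_logps),
# copied from the results table. The tuple layout is an illustrative choice.
ROWS = [
    (50, 1.0926, -92.9307, -94.0233),
    (100, 2.1626, -94.4899, -96.6525),
    (150, 5.1593, -135.0134, -140.1726),
    (200, 12.1099, -208.8495, -220.9594),
    (250, 19.7754, -249.4757, -269.2511),
    (300, 27.9289, -288.4738, -316.4027),
    (350, 33.0923, -295.9875, -329.0798),
    (400, 37.9280, -299.2147, -337.1428),
    (450, 42.1253, -322.7853, -364.9105),
    (500, 47.9842, -360.6322, -408.6165),
    (550, 51.5714, -355.0592, -406.6306),
    (600, 55.1050, -364.5329, -419.6379),
]


def max_deviation(rows):
    """Largest |(chosen - rejected) - reported log diff| over all rows."""
    return max(
        abs((chosen - rejected) - diff) for _, diff, chosen, rejected in rows
    )


if __name__ == "__main__":
    for step, diff, chosen, rejected in ROWS:
        gap = chosen - rejected
        print(f"step {step:>3}: chosen - rejected = {gap:8.4f}, reported = {diff:8.4f}")
    print(f"max deviation: {max_deviation(ROWS):.4f}")
```

Read this way, the widening gap (from about 1.09 at step 50 to about 55.1 at step 600) shows the policy assigning increasingly higher log-probability to chosen responses than to rejected ones, even as both log-probabilities fall in absolute terms.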