LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization

This repo provides the checkpoint of Mistral-7B-LongPO-512K in our paper "LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization".

(Note that it is an experimental an experimental version (for rebuttal purposes) that may have not been fully tuned or provided with sufficient data to achieve convergence.)

Highlights of LongPO

Self-evolving long-context alignment without human/superior LLMs annotations.
Extending context length while keeping aligned in one stage.
No degradation on short-context capabilities.

Models and Training Data

Models	Base Model	Training Data	# Data Samples
Mistral-7B-LongPO-128K	Mistral-7B-Instruct-v0.2	HF Link	45K
Qwen2.5-7B-LongPO-128K	Qwen2.5-7B-Instruct	HF Link	32K
Mistral-7B-LongPO-256K-EXP*	Mistral-7B-LongPO-128K	HF Link	16K
Mistral-7B-LongPO-512K-EXP*	Mistral-7B-LongPO-128K	HF Link	2.5K

* indicates an experimental version (for rebuttal purposes) that may have not been fully tuned or provided with sufficient data to achieve convergence.

Evaluation

InfiniteBench

Model	Train/Claimed Length	En.Sum	En.QA	En.MC	AVG.
GPT-4-128K	128K	14.73	22.44	67.25	34.81
Qwen2-72B	128K	24.32ᵇ	7.03ᵇ	72.05ᵇ	34.47ᵇ
LLaMA 3.1-70B	128K	33.55ᵇ	36.08ᵇ	69.00ᵇ	46.21ᵇ
LLaMA 3.1-8B	128K	28.06ᵇ	30.47ᵇ	58.08ᵇ	38.87ᵇ
GLM-4-9B	128K	14.84ᵇ	9.51ᵇ	67.25ᵇ	30.53ᵇ
GLM-4-9B-1M	1M	28.3	9.7	68.6	35.53
LWM-7B-1M	1M	4.33ᵇ	0.0ᵇ	3.06ᵇ	2.46ᵇ
YaRN-Mistral-7B	128K	9.09	9.55	27.95	15.53
Mistral-7B	32K	22.13	4.93	14.41	13.82
- SFT	128K	23.44	13.45	53.21	30.03
- DPO	128K	15.21	10.34	48.14	25.56
- LongPO (iter1)	128K	27.05	23.51	67.25	39.27
- LongPO (iter2)	256K	28.16	24.43	66.35	39.65
- LongPO (iter3)	512K	29.10	27.85	66.67	41.21
Qwen2.5-7B	128K	22.89	6.08	52.4	27.12
- LongPO (iter1)	128K	32.06	17.32	72.05	40.48

Our results are evaluated with greedy decoding.
Baseline results marked with ᵇ are evaluated by us, while unmarked baseline results are sourced from their official report.

RULER

Model	NIAH	VT	AGG	QA	AVG (13 tasks)
Qwen2.5-7B-Instruct	82.10	80.09	74.50	54.30	76.50
Qwen2.5-7B-LongPO-128K	95.82	89.71	78.67	59.40	87.11
Mistral-7B-Instruct-v0.2	72.60	74.40	64.40	52.20	68.40
Mistral-7B-LongPO-128K	96.88	96.49	71.55	64.81	88.02
Mistral-7B-LongPO-256K-EXP	96.80	97.00	69.14	64.87	87.65
Mistral-7B-LongPO-512K-EXP	97.28	97.48	69.22	64.92	88.00

Short Context

Model	MMLU	ARC-C	Hellaswag	Winogrande	Avg
Mistral-7B-Instruct-v0.2	59.15	59.26	83.2	78.4	70.00
Mistral-7B-LongPO-128K	59.99	59.34	82.99	78.53	70.21
Mistral-7B-LongPO-256K-EXP	59.47	60.28	83.14	78.14	70.26
Mistral-7B-LongPO-512K-EXP	59.51	60.58	82.87	77.66	70.16
Qwen2.5-7B-Instruct	74.28	67.15	81.41	74.66	74.38
Qwen2.5-7B-LongPO-128K	73.64	65.70	80.82	74.98	73.79

Citation

If you find our project useful, hope you can star our repo and cite our paper as follows:

@inproceedings{
    chen2025longpo,
    title={Long{PO}: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization},
    author={Guanzheng Chen and Xin Li and Michael Shieh and Lidong Bing},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=qTrEq31Shm}
}

DAMO-NLP-SG
/

Mistral-7B-LongPO-512K-EXP