---
tags:
- qwen2.5
- RL
- reasoning
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
base_model: voidful/Llama-3.2-8B-Instruct
---
# Introduction
**AMPO** is a novel framework that intelligently leverages guidance from multiple, diverse teacher models, intervening only when the on-policy model fails. Our two core contributions, Adaptive Multi-Guidance Replacement and Comprehension-based Guidance Selection, ensure that this external knowledge is used both efficiently and effectively.
[Paper](https://arxiv.org/abs/2510.02227) | [Code](https://github.com/SII-Enigma/AMPO)
### Key Highlights:
- **Adaptive Multi-Guidance Replacement**: Minimizes intervention by providing external guidance only upon complete on-policy failure, preserving self-discovery while enhancing reasoning efficiency.
- **Comprehension-based Guidance Selection**: Improves learning effectiveness by guiding the model to assimilate the most comprehensible external solutions, demonstrably boosting performance (see the illustrative sketch after this list).
- **Superior Performance:** Achieves better performance and efficiency compared to using RL or SFT alone.
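The snippet below is a minimal, hypothetical sketch of how these two mechanisms could interact during rollout collection. It is not the official AMPO implementation: `policy`, `teachers`, `reward_fn`, and the perplexity-based "comprehension" proxy are all illustrative assumptions; see the paper and repository for the actual algorithm.

```python
import math
from typing import Callable, List, Sequence

# NOTE: `policy`, `teachers`, and `reward_fn` are hypothetical stand-ins,
# not objects from the AMPO codebase; this only illustrates the control flow.
def mix_rollouts_with_guidance(
    policy,                      # on-policy model with .generate() and .avg_nll()
    teachers: Sequence,          # teacher models, each with .solve()
    prompt: str,
    reward_fn: Callable[[str, str], float],
    n_rollouts: int = 8,
) -> List[str]:
    # 1) Ordinary on-policy rollouts for this prompt.
    rollouts = [policy.generate(prompt) for _ in range(n_rollouts)]

    # 2) Adaptive Multi-Guidance Replacement: intervene only when *every*
    #    on-policy rollout fails, so successful self-discovery is preserved.
    if any(reward_fn(prompt, r) > 0 for r in rollouts):
        return rollouts

    # 3) Comprehension-based Guidance Selection: among correct teacher
    #    solutions, keep the one the current policy "understands" best,
    #    approximated here by the lowest perplexity under the policy.
    candidates = [t.solve(prompt) for t in teachers]
    candidates = [c for c in candidates if reward_fn(prompt, c) > 0] or candidates

    def policy_perplexity(text: str) -> float:
        return math.exp(policy.avg_nll(prompt, text))  # assumed helper

    best = min(candidates, key=policy_perplexity)

    # 4) Replace one failed rollout with the selected teacher solution.
    rollouts[0] = best
    return rollouts
```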
### Multi-Guidance Pool
Teacher Models: AceReason-Nemotron-1.1-7B, DeepSeek-R1-Distill-Qwen-7B, OpenR1-Qwen-7B, Qwen3-8B (thinking)
## Inference Example
Here’s an example of using AMPO for inference:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "SII-Enigma/Llama3.2-8B-Ins-AMPO"
question = "which number is larger? 9.11 or 9.9?"

# Format the question with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [{"role": "user", "content": question}]
chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate with vLLM.
llm = LLM(model=model_path)
params = SamplingParams(temperature=0.6, max_tokens=8192)
outputs = llm.generate([chat], params)
print(outputs[0].outputs[0].text)
```
# Acknowledgement
AMPO builds upon [LUFFY](https://github.com/ElliottYan/LUFFY), [veRL](https://github.com/volcengine/verl), and [RLPR](https://github.com/OpenBMB/RLPR), and utilizes [vLLM](https://github.com/vllm-project/vllm) for inference. We use [Math-Verify](https://github.com/huggingface/Math-Verify) for math reasoning evaluation. We thank the open-source community for their code, datasets, and backbone models.
# Citation
If you find our model, data, or evaluation code useful, please cite our paper:
```bibtex
@misc{yuan2025teacheradaptivemultiguidancepolicy,
  title={More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration},
  author={Xiaoyang Yuan and Yujuan Ding and Yi Bin and Wenqi Shao and Jinyu Cai and Jingkuan Song and Yang Yang and Heng Tao Shen},
  year={2025},
  eprint={2510.02227},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.02227},
}
```