---
tags:
- qwen2.5
- RL
- reasoning
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
base_model: voidful/Llama-3.2-8B-Instruct
---

# Introduction

**AMPO** is a novel framework that intelligently leverages guidance from multiple, diverse teacher models, intervening only when the on-policy model fails. Our two core contributions, Adaptive Multi-Guidance Replacement and Comprehension-based Guidance Selection, ensure that this external knowledge is used both efficiently and effectively.

[![Paper](https://img.shields.io/badge/paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2510.02227) [![Github](https://img.shields.io/badge/AMPO-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white)](https://github.com/SII-Enigma/AMPO)

### Key Highlights:

- **Adaptive Multi-Guidance Replacement**: Minimizes intervention by providing external guidance only upon complete on-policy failure, preserving self-discovery while enhancing reasoning efficiency.
- **Comprehension-based Guidance Selection**: Improves learning effectiveness by guiding the model to assimilate the most comprehensible external solutions, demonstrably boosting performance.
- **Superior Performance**: Achieves better performance and efficiency than RL or SFT alone.

(A simplified, illustrative sketch of the two mechanisms is included at the end of this card.)

### Multi-Guidance Pool

Teacher Models: AceReason-Nemotron-1.1-7B, DeepSeek-R1-Distill-Qwen-7B, OpenR1-Qwen-7B, Qwen3-8B (thinking)

## Inference Example

Here’s an example of using AMPO for inference:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "SII-Enigma/Llama3.2-8B-Ins-AMPO"
question = "which number is larger? 9.11 or 9.9?"

# Build the prompt with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [{"role": "user", "content": question}]
chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate with vLLM.
llm = LLM(model=model_path)
params = SamplingParams(temperature=0.6, max_tokens=8192)
outputs = llm.generate([chat], params)
print(outputs[0].outputs[0].text)
```

# Acknowledgement

AMPO builds upon [LUFFY](https://github.com/ElliottYan/LUFFY), [veRL](https://github.com/volcengine/verl), and [RLPR](https://github.com/OpenBMB/RLPR), and uses [vLLM](https://github.com/vllm-project/vllm) for inference. We use [Math-Verify](https://github.com/huggingface/Math-Verify) for math reasoning evaluation. We thank the open-source community for the codes, datasets, and backbones.

# Citation

If you find our model, data, or evaluation code useful, please kindly cite our paper:

```bib
@misc{yuan2025teacheradaptivemultiguidancepolicy,
      title={More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration},
      author={Xiaoyang Yuan and Yujuan Ding and Yi Bin and Wenqi Shao and Jinyu Cai and Jingkuan Song and Yang Yang and Heng Tao Shen},
      year={2025},
      eprint={2510.02227},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.02227},
}
```
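# Method Sketch (Illustrative)

The snippet below is a minimal, hypothetical sketch of the two mechanisms named in the Key Highlights, written only from the description on this card; it is not the actual AMPO training code (see the GitHub repository and the paper for the real implementation). The function and helper names (`adaptive_multi_guidance_replacement`, `is_correct`, `comprehension_score`) and the use of policy likelihood as a comprehension proxy are assumptions made for illustration.

```python
# Hypothetical, simplified sketch of AMPO's rollout-level logic.
# All names and the comprehension proxy are illustrative assumptions,
# not the actual implementation.

def adaptive_multi_guidance_replacement(policy_rollouts, teacher_solutions,
                                        is_correct, comprehension_score):
    """Return the rollout group used for the policy update on one prompt.

    policy_rollouts:           on-policy responses sampled for the prompt
    teacher_solutions:         candidate solutions from the multi-guidance pool
                               (e.g. AceReason-Nemotron-1.1-7B, DeepSeek-R1-Distill-Qwen-7B, ...)
    is_correct(resp):          verifier/reward check on the final answer
    comprehension_score(resp): how comprehensible a teacher solution is to the
                               current policy (e.g. its likelihood under the policy)
    """
    # 1) Adaptive replacement: intervene only upon complete on-policy failure.
    if any(is_correct(r) for r in policy_rollouts):
        return policy_rollouts  # self-discovery succeeded; no external guidance

    # 2) Comprehension-based selection: among correct teacher solutions,
    #    pick the one the policy understands best.
    correct_teacher = [t for t in teacher_solutions if is_correct(t)]
    if not correct_teacher:
        return policy_rollouts  # no usable guidance; keep on-policy data

    best = max(correct_teacher, key=comprehension_score)

    # Replace one failed on-policy rollout with the selected guidance.
    return [best] + policy_rollouts[1:]
```

In this reading, external guidance is injected only when every on-policy rollout fails, which keeps most of the training signal on-policy while still unblocking hard prompts with the most comprehensible teacher solution.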