arxiv:2510.19307

Unified Reinforcement and Imitation Learning for Vision-Language Models

Published on Oct 22
· Submitted by Byung-Kwan Lee on Oct 23

Abstract

A unified reinforcement and imitation learning algorithm creates efficient, lightweight vision-language models that match or exceed leading VLMs in performance.

AI-generated summary

Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs. RIL distinctively combines the strengths of reinforcement learning with adversarial imitation learning. This enables smaller student VLMs not only to mimic the sophisticated text generation of large teacher models but also to systematically improve their generative capabilities through reinforcement signals. Key to our imitation framework is an LLM-based discriminator that adeptly distinguishes between student and teacher outputs, complemented by guidance from multiple large teacher VLMs to ensure diverse learning. This unified learning strategy, leveraging both reinforcement and imitation, empowers student models to achieve significant performance gains, making them competitive with leading closed-source VLMs. Extensive experiments on diverse vision-language benchmarks demonstrate that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and, in several instances, surpasses them.
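As a rough illustration of the adversarial imitation component, a GAIL-style discriminator is typically trained with binary cross-entropy to tell teacher responses from student responses. The sketch below assumes the discriminator outputs a probability that a response came from a teacher. The function names and the choice to reuse that probability directly as a similarity reward are hypothetical and are not taken from the paper.

```python
import math


def discriminator_loss(teacher_scores, student_scores, eps=1e-12):
    """Binary cross-entropy for a GAIL-style discriminator (illustrative).

    Each score is the discriminator's estimated probability, in [0, 1],
    that a response was written by a teacher. Teacher responses carry
    label 1 and student responses carry label 0.
    """
    if not teacher_scores and not student_scores:
        raise ValueError("need at least one score")
    # Teacher responses are penalized when scored as unlikely to be teacher.
    teacher_terms = [-math.log(max(p, eps)) for p in teacher_scores]
    # Student responses are penalized when scored as likely to be teacher.
    student_terms = [-math.log(max(1.0 - p, eps)) for p in student_scores]
    terms = teacher_terms + student_terms
    return sum(terms) / len(terms)


def similarity_reward(student_score):
    """Hypothetical reward: how teacher-like the discriminator finds a response."""
    return student_score
```

Under these assumptions, a discriminator that separates the two sources well gets a low loss, while a student whose responses the discriminator cannot tell apart from the teachers' earns a high similarity reward.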

Community

Paper author Paper submitter

ArXiv: https://arxiv.org/abs/2510.19307
Project page: https://byungkwanlee.github.io/RIL-page/

  • Unified Learning: Combines reinforcement learning (GRPO) and imitation learning (GAIL) to help small VLMs mimic both how and what to generate from larger teacher models.

  • Dual Reward System: Integrates a discriminator-based similarity reward with LLM-as-a-Judge accuracy feedback, ensuring responses are both stylistically aligned and factually correct.

  • Teacher Diversity: Learns from multiple large teacher VLMs (e.g., Qwen2.5-VL-72B and InternVL3-78B), improving robustness and generalization.

  • No “think” phase: RIL-trained models keep the same fast inference speed as standard models — ideal for deployment in mobile and resource-constrained environments.
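The dual reward and the group-relative update described in the points above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the equal weights, the function names, and the linear combination are assumptions. The advantage is centered on the group mean. Dividing by the group standard deviation, as in the original GRPO formulation, is left as an option.

```python
from statistics import mean, pstdev


def combined_reward(similarity, accuracy, w_sim=0.5, w_acc=0.5):
    """Blend a discriminator similarity reward with a judge accuracy reward.

    The weights and the linear form are hypothetical choices.
    """
    return w_sim * similarity + w_acc * accuracy


def group_relative_advantages(rewards, normalize=False):
    """Advantage of each sampled response relative to its group.

    rewards: rewards for several responses sampled for the same prompt.
    normalize: if True, also divide by the group standard deviation.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    std = pstdev(rewards)
    if std == 0:
        # All responses scored the same, so there is no learning signal.
        return centered
    return [c / std for c in centered]


# Four sampled responses to one prompt (made-up scores for illustration).
similarities = [0.2, 0.8, 0.5, 0.9]   # discriminator: how teacher-like
accuracies = [0.0, 1.0, 1.0, 0.0]     # judge: correct (1) or not (0)
rewards = [combined_reward(s, a) for s, a in zip(similarities, accuracies)]
advantages = group_relative_advantages(rewards)
```

In this toy group, the response that is both teacher-like and judged correct receives the largest positive advantage, while a teacher-like but incorrect response is still pushed down relative to the group mean.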

Hi, very impressive results.
Do you have ablation studies for the following?

  1. Pure LLM judge with Dr. GRPO vs. LLM judge + discriminator with Dr. GRPO.
  2. Direct token-wise distillation vs. your RM-as-a-proxy RLVR distillation.

Thanks.

Also, what is the metric of the discriminator?

Paper author

A1. If I understand correctly, the LLM judge with Dr. GRPO is already reported in Table 1 and Table 5(d).
Please note that LLM judge + discriminator with Dr. GRPO is simply our proposed method.

A2. We did not directly compare token-wise distillation with RIL because they are different training mechanisms: distillation is closer to SFT, whereas RIL is closer to RL. However, Table 5(f) shows that applying RIL to distillation-based models also works well.

