Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving
Abstract
ReflectDrive uses a reflection mechanism with discrete diffusion and pre-trained Diffusion Language Models to generate safe trajectories for autonomous driving systems.
End-to-End (E2E) solutions have emerged as a mainstream approach for autonomous driving systems, with Vision-Language-Action (VLA) models representing a new paradigm that leverages pre-trained multimodal knowledge from Vision-Language Models (VLMs) to interpret and interact with complex real-world environments. However, these methods remain constrained by the limitations of imitation learning, which struggles to inherently encode physical rules during training. Existing approaches often rely on complex rule-based post-refinement, employ reinforcement learning that remains largely limited to simulation, or utilize diffusion guidance that requires computationally expensive gradient calculations. To address these challenges, we introduce ReflectDrive, a novel learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. We first discretize the two-dimensional driving space to construct an action codebook, enabling the use of pre-trained Diffusion Language Models for planning tasks through fine-tuning. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors. Based on this, we apply local search methods to identify unsafe tokens and determine feasible solutions, which then serve as safe anchors for inpainting-based regeneration. Evaluated on the NAVSIM benchmark, ReflectDrive demonstrates significant advantages in safety-critical trajectory generation, offering a scalable and reliable solution for autonomous driving systems.
Community
ReflectDrive, a novel framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. Our method bridges the gap between imitation learning and safety-critical deployment by enabling gradient-free, iterative self-correction.
Code will be released soon: https://github.com/pixeli99/ReflectDrive
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- IRL-VLA: Training an Vision-Language-Action Policy via Reward World Model (2025)
- DriveDPO: Policy Learning via Safety DPO For End-to-End Autonomous Driving (2025)
- AnchDrive: Bootstrapping Diffusion Policies with Hybrid Trajectory Anchors for End-to-End Driving (2025)
- ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving (2025)
- ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving (2025)
- Drive As You Like: Strategy-Level Motion Planning Based on A Multi-Head Diffusion Model (2025)
- Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper