arXiv:2509.10129

Towards Reliable and Interpretable Document Question Answering via VLMs

Published on Sep 12, 2025

Abstract

DocExplainerV0 is a bounding-box prediction module that decouples answer generation from spatial localization in Vision-Language Models, addressing localization challenges in document understanding.

AI-generated summary

Vision-Language Models (VLMs) have shown strong capabilities in document understanding, particularly in identifying and extracting textual information from complex documents. Despite this, accurately localizing answers within documents remains a major challenge, limiting both interpretability and real-world applicability. To address this, we introduce DocExplainerV0, a plug-and-play bounding-box prediction module that decouples answer generation from spatial localization. This design makes it applicable to existing VLMs, including proprietary systems where fine-tuning is not feasible. Through systematic evaluation, we provide quantitative insights into the gap between textual accuracy and spatial grounding, showing that correct answers often lack reliable localization. Our standardized framework highlights these shortcomings and establishes a benchmark for future research toward VLMs that offer more interpretable and robust document information extraction.
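
The decoupled design described above lends itself to a simple two-stage pipeline: the VLM (possibly a black-box API) first produces the answer text, and a separate trainable module then predicts the answer's bounding box from image and answer features. Below is a minimal sketch of that idea in PyTorch; `BBoxPredictor`, `answer_and_localize`, and the `vlm.generate`/`vlm.encode_*` calls are hypothetical illustrations of the decoupling, not the paper's actual interfaces or architecture.

```python
import torch
import torch.nn as nn

class BBoxPredictor(nn.Module):
    """Hypothetical plug-and-play localizer: maps (image features,
    answer embedding) to a normalized bounding box (x1, y1, x2, y2),
    independently of how the answer text was generated."""

    def __init__(self, img_dim=768, txt_dim=768, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 4),
            nn.Sigmoid(),  # coordinates normalized to [0, 1]
        )

    def forward(self, img_feat, ans_emb):
        # Fuse visual and textual evidence, then regress the box.
        return self.mlp(torch.cat([img_feat, ans_emb], dim=-1))

def answer_and_localize(vlm, localizer, image, question):
    """Stage 1: any VLM (even a proprietary API) generates the answer.
    Stage 2: the separate localizer grounds that answer on the page.
    The vlm.* methods here are stand-ins for whatever interface the
    underlying model exposes."""
    answer = vlm.generate(image, question)   # hypothetical VLM call
    img_feat = vlm.encode_image(image)       # hypothetical feature access
    ans_emb = vlm.encode_text(answer)
    box = localizer(img_feat, ans_emb)
    return answer, box

# Toy usage with random features (no real VLM required):
localizer = BBoxPredictor()
img_feat = torch.randn(1, 768)
ans_emb = torch.randn(1, 768)
print(localizer(img_feat, ans_emb))  # tensor of shape (1, 4) in [0, 1]
```

Because the localizer consumes only features and the generated answer, it can in principle be bolted onto an existing VLM without fine-tuning it, which is the plug-and-play property the abstract emphasizes.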


Models citing this paper: 1

Datasets citing this paper: 0

Spaces citing this paper: 0

Collections including this paper: 0