DocExplainer: Document VQA with Bounding Box Localization

License: CC BY 4.0


Model description

DocExplainer is an approach to Document Visual Question Answering (Document VQA) with bounding box localization.
Unlike standard VLMs that only provide text-based answers, DocExplainer adds visual evidence through bounding boxes, making model predictions more interpretable.
It is designed as a plug-and-play module to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.


Model Details

DocExplainer is a fine-tuned SigLIP2 Giant-based regressor that predicts bounding box coordinates for answer localization in document images. The system operates in a two-stage process:

  1. Question Answering: Any VLM is used as a black-box component to generate a textual answer from a document image and a question.
  2. Bounding Box Explanation: DocExplainer takes the image, question, and generated answer and predicts the coordinates of the supporting evidence (see the sketch below).
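
Because the two stages are decoupled, the whole pipeline reduces to a thin wrapper around any VLM. Below is a minimal sketch, not the released implementation: ask_vlm stands in for whatever VLM inference call you use, and explainer is the DocExplainer model loaded as in the Quick Start.

from typing import Callable, List, Tuple

from PIL import Image

def answer_with_evidence(
    image: Image.Image,
    question: str,
    ask_vlm: Callable[[Image.Image, str], str],  # any black-box VLM: (image, question) -> answer text
    explainer,                                   # DocExplainer loaded via AutoModel (see Quick Start)
) -> Tuple[str, List[float]]:
    # Stage 1: the VLM generates the textual answer.
    answer = ask_vlm(image, question)
    # Stage 2: DocExplainer grounds the answer with a normalized [x1, y1, x2, y2] box.
    bbox = explainer.predict(image, answer)
    return answer, bbox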

Model Architecture

DocExplainer builds on SigLIP2 Giant visual and text embeddings.

Architecture diagram: https://i.postimg.cc/2yN9GYwb/Blank-diagram-5.jpg

Training Procedure

  • Visual and textual embeddings from SigLIP2 are projected into a shared latent space and fused via fully connected layers.
  • A regression head outputs normalized coordinates [x1, y1, x2, y2] (see the illustrative sketch after this list).
  • Backbone: SigLIP2 Giant (frozen).
  • Loss Function: Smooth L1 (Huber) loss applied to normalized coordinates in [0, 1].
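
The exact layer sizes are not published here, so the following is only an illustrative sketch of the projection, fusion, and regression steps listed above; the embedding dimensions, hidden width, and sigmoid bounding are assumptions rather than the released DocExplainer head.

import torch
import torch.nn as nn

class BBoxRegressorSketch(nn.Module):
    """Illustrative head: fuse frozen SigLIP2 image/text embeddings, regress a normalized box."""
    def __init__(self, img_dim: int = 1536, txt_dim: int = 1536, hidden: int = 1024):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)   # project image embedding into the shared space
        self.txt_proj = nn.Linear(txt_dim, hidden)   # project text (question + answer) embedding
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 4)             # regression head -> [x1, y1, x2, y2]

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        fused = self.fusion(torch.cat([self.img_proj(img_emb), self.txt_proj(txt_emb)], dim=-1))
        return torch.sigmoid(self.head(fused))       # keep coordinates in [0, 1]

# Training objective from this card: Smooth L1 (Huber) loss on normalized coordinates.
criterion = nn.SmoothL1Loss()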

Training Setup

  • Dataset: BoundingDocs v2.0
  • Epochs: 20
  • Optimizer: AdamW
  • Hardware: 1 × NVIDIA L40S-1-48G GPU
  • Model Selection: Best checkpoint chosen by highest mean IoU on the validation split.
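
For reference, the mean IoU used for checkpoint selection can be computed directly on normalized boxes; this is a minimal sketch, not the evaluation code behind the reported numbers.

def iou(box_a, box_b):
    """IoU of two normalized [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pred_boxes, gt_boxes):
    return sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(pred_boxes)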

Quick Start

Here is a simple example of how to use DocExplainer to get an answer and its corresponding bounding box from a document image.

import json

import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoModelForImageTextToText, AutoProcessor

url = "https://i.postimg.cc/BvftyvS3/image-1d100e9.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
question = "What is the invoice number?"

# -----------------------
# 1. Load SmolVLM2-2.2B for answer generation
# -----------------------
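# NOTE: attn_implementation="flash_attention_2" requires the flash-attn package;
# remove the argument (or switch to "sdpa") if it is not installed.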
vlm_model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")

PROMPT = """Based only on the document image, answer the following question:
Question: {QUESTION}
Provide ONLY a JSON response in the following format (no trailing commas!):
{{
  "content": "answer"
}}
"""

prompt_text = PROMPT.format(QUESTION=question)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt_text},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(vlm_model.device, dtype=torch.bfloat16)

input_length = inputs['input_ids'].shape[1]
generated_ids = vlm_model.generate(**inputs, do_sample=False, max_new_tokens=2056)

output_ids = generated_ids[:, input_length:]
generated_texts = processor.batch_decode(
    output_ids,
    skip_special_tokens=True,
)

decoded_output = generated_texts[0].replace("Assistant:", "", 1).strip()
answer = json.loads(decoded_output)['content']

print(f"Answer: {answer}")

# -----------------------
# 2. Load DocExplainer for bounding box prediction
# -----------------------
explainer = AutoModel.from_pretrained("letxbe/DocExplainer", trust_remote_code=True)
bbox = explainer.predict(image, answer)
print(f"Predicted bounding box (normalized): {bbox}")

Example Output:

Question: What is the invoice number?
Answer: 3Y8M2d-846

Predicted bounding box (normalized): [0.6353235244750977, 0.03685223311185837, 0.8617828488349915, 0.058749228715896606]

Visualized Answer Location: Invoice with predicted bounding box
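
To reproduce a visualization like this, scale the normalized box back to pixel coordinates and draw it with Pillow. This short sketch reuses the image and bbox variables from the Quick Start.

from PIL import ImageDraw

w, h = image.size
x1, y1, x2, y2 = bbox
pixel_box = [x1 * w, y1 * h, x2 * w, y2 * h]   # scale normalized coordinates to pixels

annotated = image.copy()
ImageDraw.Draw(annotated).rectangle(pixel_box, outline="red", width=3)
annotated.save("answer_location.png")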

Performance

Architecture                  Prompting   ANLS   Mean IoU
SmolVLM2-2.2B                 Zero-shot   0.527  0.011
SmolVLM2-2.2B                 Anchors     0.543  0.026
SmolVLM2-2.2B                 CoT         0.561  0.011
Qwen2-VL-7B                   Zero-shot   0.691  0.048
Qwen2-VL-7B                   Anchors     0.694  0.051
Qwen2-VL-7B                   CoT         0.720  0.038
Claude Sonnet 4               Zero-shot   0.737  0.031
SmolVLM2-2.2B + DocExplainer  Zero-shot   0.572  0.175
Qwen2-VL-7B + DocExplainer    Zero-shot   0.689  0.188
SmolVLM2-2.2B + Naive OCR     Zero-shot   0.556  0.405
Qwen2-VL-7B + Naive OCR       Zero-shot   0.690  0.494

Document VQA performance of different models and prompting strategies on the BoundingDocs v2.0 dataset.

Citation

If you use DocExplainer, please cite:

@misc{chen2025reliableinterpretabledocumentquestion,
      title={Towards Reliable and Interpretable Document Question Answering via VLMs}, 
      author={Alessio Chen and Simone Giovannini and Andrea Gemelli and Fabio Coppini and Simone Marinai},
      year={2025},
      eprint={2509.10129},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.10129}, 
}

Limitations

  • Prototype only: Intended as a first approach, not a production-ready solution.
  • Dataset constraints: Current evaluation is limited to cases where an answer fits in a single bounding box. Answers requiring reasoning over multiple regions, or answers not fully captured by OCR, cannot be properly handled.