This is a temporary repo forked from openbmb/evisrag-7b.
VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation
• Introduction • News • Setup • Training • Evaluation • Usage • License • Contact •
Introduction
EVisRAG (VisRAG 2.0) is an evidence-guided visual retrieval-augmented generation framework that equips VLMs for multi-image question answering: the model first observes the retrieved images and records per-image evidence in natural language, then reasons over those cues to produce the answer. EVisRAG is trained with Reward-Scoped GRPO (RS-GRPO), which applies fine-grained token-level rewards to jointly optimize visual perception and reasoning.
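For illustration, a model response to a two-image question follows the tag structure below (the content here is invented; the tag format matches the prompt shown in the Usage section):

<observe>Image 1 is a restaurant menu with prices; image 2 is a street photo with no readable text.</observe>
<evidence>[1]: the menu lists "Margherita pizza ... $12"
[2]: no relevant information</evidence>
<think>The question asks for the price of the Margherita pizza; image 1 gives $12.</think>
<answer>$12</answer>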
News
- 2025-10-01: Released EVisRAG (VisRAG 2.0), an end-to-end vision-language model, together with our paper on arXiv, our model on Hugging Face, and our code on GitHub.
EVisRAG Pipeline
EVisRAG is an end-to-end framework that equips VLMs with precise visual perception during reasoning in multi-image scenarios. We trained and released VLRMs with EVisRAG built on Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-3B-Instruct.
Setup
git clone https://github.com/OpenBMB/VisRAG.git
conda create --name EVisRAG python=3.10
conda activate EVisRAG
cd EVisRAG
pip install -r EVisRAG_requirements.txt
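As a quick sanity check (our suggestion, not an official setup step), you can verify that the core dependencies used in the Usage example import correctly:

# check_env.py -- hypothetical sanity check, not shipped with the repo
import transformers
import vllm
import qwen_vl_utils  # used by the Usage example below

print("transformers:", transformers.__version__)
print("vllm:", vllm.__version__)
print("qwen_vl_utils imported OK")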
Training
Stage 1: SFT (based on LLaMA-Factory)
git clone https://github.com/hiyouga/LLaMA-Factory.git
bash evisrag_scripts/full_sft.sh
Stage 2: RS-GRPO (based on Easy-R1)
bash evisrag_scripts/run_rsgrpo.sh
Notes:
- The training data is available on Hugging Face under EVisRAG-Train, which is referenced at the beginning of this page.
- We adopt a two-stage training strategy. In the first stage, clone LLaMA-Factory and update the model path in the full_sft.sh script. In the second stage, we built our customized algorithm RS-GRPO on top of Easy-R1, specifically designed for EVisRAG; its implementation can be found in src/RS-GRPO (a conceptual sketch of reward scoping follows below).
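As referenced in the note above, here is a rough conceptual sketch (our own assumptions, not the code in src/RS-GRPO) of what scoping rewards to token spans could look like: perception-related rewards are credited only to tokens inside <observe>/<evidence> spans, and reasoning/answer rewards only to tokens inside <think>/<answer> spans, before being fed into a GRPO-style update.

# Conceptual illustration only (our assumptions), NOT the RS-GRPO implementation in src/RS-GRPO.
import re
from typing import List, Tuple

PERCEPTION_TAGS = ("observe", "evidence")   # tokens here receive the visual-perception reward
REASONING_TAGS = ("think", "answer")        # tokens here receive the reasoning/answer reward

def tag_spans(text: str, tags) -> List[Tuple[int, int]]:
    """Character spans covered by <tag>...</tag> blocks for the given tags."""
    spans = []
    for tag in tags:
        for m in re.finditer(rf"<{tag}>.*?</{tag}>", text, flags=re.DOTALL):
            spans.append(m.span())
    return spans

def scoped_token_rewards(text, token_offsets, r_perception, r_reasoning):
    """Give each generated token the reward of the tag span it falls inside (0.0 elsewhere).

    token_offsets: per-token (char_start, char_end) pairs, e.g. from a fast tokenizer
    called with return_offsets_mapping=True.
    """
    perc = tag_spans(text, PERCEPTION_TAGS)
    reas = tag_spans(text, REASONING_TAGS)
    def inside(s, e, spans):
        return any(a <= s and e <= b for a, b in spans)
    return [
        r_perception if inside(s, e, perc)
        else r_reasoning if inside(s, e, reas)
        else 0.0
        for s, e in token_offsets
    ]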
Evaluation
bash evisrag_scripts/predict.sh
bash evisrag_scripts/eval.sh
Notes:
- The test data is available on Hugging Face under EVisRAG-Test-xxx, as referenced at the beginning of this page.
- To run the evaluation, first execute the predict.sh script; the model outputs will be saved in the preds directory. Then use the eval.sh script to evaluate the predictions. The metrics EM, Accuracy, and F1 are reported directly (a sketch of EM and F1 follows below).
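For reference, the snippet below is a minimal sketch of the standard SQuAD-style definitions of EM and token-level F1 under our own normalization assumptions; the repo's eval.sh may differ in its details.

# Illustrative EM / token-level F1, SQuAD-style; eval.sh may normalize differently.
import re
import string
from collections import Counter

def normalize(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"), f1_score("in Paris, France", "Paris"))  # 1.0 0.5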
Usage
Model on Hugging Face: https://huggingface.co/openbmb/EVisRAG-7B
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info
def evidence_promot_grpo(query):
    return f"""You are an AI Visual QA assistant. I will provide you with a question and several images. Please follow the four steps below:
Step 1: Observe the Images
First, analyze the question and consider what types of images may contain relevant information. Then, examine each image one by one, paying special attention to aspects related to the question. Identify whether each image contains any potentially relevant information.
Wrap your observations within <observe></observe> tags.
Step 2: Record Evidences from Images
After reviewing all images, record the evidence you find for each image within <evidence></evidence> tags.
If you are certain that an image contains no relevant information, record it as: [i]: no relevant information(where i denotes the index of the image).
If an image contains relevant evidence, record it as: [j]: [the evidence you find for the question](where j is the index of the image).
Step 3: Reason Based on the Question and Evidences
Based on the recorded evidences, reason about the answer to the question.
Include your step-by-step reasoning within <think></think> tags.
Step 4: Answer the Question
Provide your final answer based only on the evidences you found in the images.
Wrap your answer within <answer></answer> tags.
Avoid adding unnecessary contents in your final answer, like if the question is a yes/no question, simply answer "yes" or "no".
If none of the images contain sufficient information to answer the question, respond with <answer>insufficient to answer</answer>.
Formatting Requirements:
Use the exact tags <observe>, <evidence>, <think>, and <answer> for structured output.
It is possible that none, one, or several images contain relevant evidence.
If you find no evidence or little evidence, and insufficient to help you answer the question, follow the instructions above for insufficient information.
Question and images are provided below. Please follow the steps as instructed.
Question: {query}
"""
# Path to the EVisRAG checkpoint, e.g. "openbmb/EVisRAG-7B" or a local directory
model_path = "xxx"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, padding_side='left')

# Retrieved image paths and the question (placeholders)
imgs, query = ["imgpath1", "imgpath2", ..., "imgpathX"], "What xxx?"
input_prompt = evidence_promot_grpo(query)

# Build a single user message: the prompt text followed by all retrieved images
content = [{"type": "text", "text": input_prompt}]
for imgP in imgs:
    content.append({
        "type": "image",
        "image": imgP
    })
msg = [{
    "role": "user",
    "content": content,
}]

llm = LLM(
    model=model_path,
    tensor_parallel_size=1,
    dtype="bfloat16",
    limit_mm_per_prompt={"image": 5, "video": 0},
)
sampling_params = SamplingParams(
    temperature=0.1,
    repetition_penalty=1.05,
    max_tokens=2048,
)

# Render the chat template, extract the image inputs, and generate with vLLM
prompt = processor.apply_chat_template(
    msg,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, _ = process_vision_info(msg)
msg_input = [{
    "prompt": prompt,
    "multi_modal_data": {"image": image_inputs},
}]
output_texts = llm.generate(
    msg_input,
    sampling_params=sampling_params,
)
print(output_texts[0].outputs[0].text)
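Since the model replies in the tagged format defined by the prompt above, the individual fields can be pulled out with a simple regex. The helper below is a sketch we added for convenience and is not part of the released code:

# Hypothetical post-processing helper (not part of the released code).
import re

def extract_tag(response: str, tag: str) -> str:
    """Return the content of <tag>...</tag>, or '' if the tag is missing."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
    return m.group(1).strip() if m else ""

response = output_texts[0].outputs[0].text
print("evidence:", extract_tag(response, "evidence"))
print("answer:", extract_tag(response, "answer"))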
License
- The code in this repo is released under the Apache-2.0 License.
- Usage of the EVisRAG model weights must strictly follow MiniCPM Model License.md.
Contact
EVisRAG
- Yubo Sun: [email protected]
- Chunyi Peng: [email protected]