Vision-Language for Reasoning (VLR)
Enesidaon-VLR-7B-no-Thinking is an experimental, high-fidelity vision-language reasoning model designed for fine-grained multimodal comprehension. Built on top of Qwen2.5-VL-7B-Instruct, it improves image captioning, reasoning over sampled video frames, and detailed document understanding. Unlike standard approaches, it explicitly grounds its textual reasoning steps in visual coordinates, enabling precise and explainable multimodal reasoning. The model is trained with supervised fine-tuning (SFT) on visually grounded reasoning traces and further optimized with GRPO reinforcement learning, resulting in stronger chain-of-thought reasoning without overthinking or unnecessary hallucination.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model weights and the matching processor (tokenizer + image preprocessor).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Enesidaon-VLR-7B-no-Thinking",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("prithivMLmods/Enesidaon-VLR-7B-no-Thinking")

# A single-turn request with one image and a text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image with reasoning."},
        ],
    }
]

# Render the chat template and collect the visual inputs referenced in the messages.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens so only the new completion is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
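
Because the model also targets reasoning over sampled video frames, the same pipeline can be pointed at a video input. The sketch below is a minimal example that reuses the model, processor, and process_vision_info from the snippet above; the local path video.mp4 and the 1.0 fps sampling rate are illustrative assumptions, not values prescribed by this model card.

# A minimal video-inference sketch (assumption: reuses `model`, `processor`, and
# `process_vision_info` from the image example; "video.mp4" and fps=1.0 are placeholders).
video_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "video.mp4",  # hypothetical local clip; an http(s) URL also works
                "fps": 1.0,            # frame-sampling rate used by qwen_vl_utils
            },
            {"type": "text", "text": "Summarize what happens in this clip with reasoning."},
        ],
    }
]

video_text = processor.apply_chat_template(
    video_messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(video_messages)
video_batch = processor(
    text=[video_text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**video_batch, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(video_batch.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])

As with the image example, a larger max_new_tokens budget leaves more room for the model's grounded reasoning steps.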
This model is intended for: