UV-CoT: Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization

This repository hosts the UV-CoT model, presented in the paper Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization.

Overview

Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). Existing approaches primarily focus on text CoT, limiting their ability to leverage visual cues. Unsupervised Visual CoT (UV-CoT) introduces a novel framework for image-level CoT reasoning via preference optimization, eliminating the need for extensive labeled bounding-box data.

UV-CoT achieves this through preference comparisons over model-generated bounding boxes. Preference data is generated automatically, and an evaluator MLLM (e.g., OmniLMM-12B) ranks the corresponding responses; these rankings serve as supervision for training the target MLLM (e.g., LLaVA-1.5-7B). The approach emulates human perception, first identifying key regions and then reasoning over them, which improves visual comprehension, particularly on spatial reasoning tasks.
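
The snippet below is a minimal, hedged sketch of how such preference pairs over candidate bounding boxes could be assembled; it is not the authors' implementation. The helper `evaluator_score` is a hypothetical stand-in for the evaluator MLLM's scoring, and the margin filter is an illustrative detail.

from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class PreferencePair:
    question: str
    chosen_bbox: BBox
    rejected_bbox: BBox

def build_preference_pairs(
    question: str,
    candidate_bboxes: List[BBox],
    evaluator_score: Callable[[str, BBox], float],  # hypothetical evaluator-MLLM scorer
    margin: float = 0.1,
) -> List[PreferencePair]:
    """Rank candidate bounding boxes by evaluator score and keep pairs
    whose score gap exceeds a margin, yielding chosen/rejected supervision."""
    # Sort candidates from highest- to lowest-scoring.
    scored = sorted(
        ((evaluator_score(question, b), b) for b in candidate_bboxes),
        key=lambda sb: sb[0],
        reverse=True,
    )
    pairs = []
    # Every ordered pair with a sufficient score gap becomes a preference pair.
    for (s_hi, b_hi), (s_lo, b_lo) in combinations(scored, 2):
        if s_hi - s_lo >= margin:
            pairs.append(PreferencePair(question, b_hi, b_lo))
    return pairs

Pairs constructed this way then serve as the supervision signal for preference optimization of the target MLLM.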

Figure 1: UV-CoT Overview

Visualizations

Qualitative examples demonstrating UV-CoT's visual reasoning:

Figure 5: UV-CoT Visualization 1
Figure 6: UV-CoT Visualization 2

Installation

To set up the environment and install necessary packages, follow these steps:

  1. Clone this repository and navigate to the UV-CoT folder:

    git clone https://github.com/UV-CoT
    cd UV-CoT
    
  2. Create a conda environment and install the package:

    conda create -n uv-cot python=3.10 -y
    conda activate uv-cot
    pip install -e .
    
  3. Install the required spaCy model:

    wget https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3.tar.gz
    pip install en_core_web_trf-3.7.3.tar.gz
    
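To verify that the spaCy model installed correctly, you can run a quick load check (a minimal sanity snippet, not part of the official setup):

import spacy

# Load the transformer-based English pipeline installed above.
nlp = spacy.load("en_core_web_trf")
doc = nlp("The bird is perched on a branch near the feeder.")
print([(token.text, token.pos_) for token in doc])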

Usage

You can load and use the UV-CoT model with the transformers library. For detailed information on preference data curation, training, and evaluation, please refer to the official GitHub repository.

Here's a basic example of how to use the model for inference:

from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import torch

# Load model and processor
model_id = "kesenZhaoNTU/UV-CoT" # Use this model_id to load UV-CoT
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Load an example image
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bird.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Define the conversation prompt
prompt = "Describe the image in detail."
messages = [
    {"role": "user", "content": f"<image>
{prompt}"}
]

# Apply the chat template to format the prompt for the model
text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Prepare inputs for the model
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

# Generate response
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
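
Generation behavior can be tuned through standard transformers generation arguments; the values below are illustrative, not settings recommended by the authors:

# Greedy decoding for a deterministic answer (illustrative settings).
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Or enable sampling for more varied descriptions.
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(processor.decode(output[0], skip_special_tokens=True))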

Citation

If our work helps your research, feel free to give us a star ⭐ or cite us using:

@misc{zhao2025unsupervisedvisualchainofthoughtreasoning,
      title={Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization}, 
      author={Kesen Zhao and Beier Zhu and Qianru Sun and Hanwang Zhang},
      year={2025},
      eprint={2504.18397},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.18397}, 
}