MultimodalQwenLogitReranker-3B
- Model Name: MultimodalQwenLogitReranker-3B
- Model Type: Multilingual Multimodal Reranker
- Base Model: Qwen/Qwen2.5-VL-3B-Instruct
- Architecture Modifications: LoRA fine-tuning; relevance is scored from the "yes" vs "no" token logits passed through a sigmoid (inspired by the Qwen text reranker: https://qwenlm.github.io/blog/qwen3-embedding)
- Training Setup: Resource-constrained (single A100, batch size 2)
Model Description
QwenLogitReranker is a multilingual reranking model trained with a simple but effective strategy inspired by the Alibaba Qwen text reranker. Instead of adding a classification head, it computes relevance scores by applying a sigmoid to the logit difference between the tokens “yes” and “no.”
The model is designed to be lightweight, general-purpose, and compatible with the multimodal Qwen2.5-VL family.
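In other words, for each (query, image) pair the model is prompted to answer “Yes” or “No,” and the relevance score is the sigmoid of the difference between those two token logits. A minimal sketch of that scoring rule (the token ids are the ones used in the inference code further down):

import torch

def relevance_score(last_token_logits):
    # last_token_logits: [batch_size, vocab_size] logits at the final prompt position
    token_id_yes, token_id_no = 9454, 2753  # "Yes" / "No" ids used in the inference code below
    logit_diff = last_token_logits[:, token_id_yes] - last_token_logits[:, token_id_no]
    return torch.sigmoid(logit_diff)  # relevance probability in [0, 1]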
Training Details
Training Dataset: DocVQA (2,000 randomly sampled training examples)
Epochs: 1
Batch Size: 2
Negative Mining: In-batch hard negatives
Loss Function: Binary classification (logit difference between “yes” and “no” passed through a sigmoid); see the sketch after this list
Optimizer: AdamW
Fine-Tuning Method: LoRA + the transformers Trainer (with a specific trick to handle Qwen 2.5 pixel_values being unbatched)
Hardware: Single A100 GPU
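The exact training loop is not shown here, but a minimal sketch of the objective described above (binary cross-entropy on the sigmoid of the yes/no logit difference, with in-batch hard negatives formed by pairing each query with another example's image) could look like the following. The helpers are hypothetical illustrations, not the author's training code:

import torch
import torch.nn.functional as F

def reranker_loss(yes_logits, no_logits, labels):
    # labels: 1.0 for a matching (query, image) pair, 0.0 for an in-batch negative
    prob_yes = torch.sigmoid(yes_logits - no_logits)  # relevance score in [0, 1]
    return F.binary_cross_entropy(prob_yes, labels)

def build_in_batch_pairs(queries, images):
    # Positives: each query with its own image; hard negatives: each query with
    # the image of the next example in the same batch.
    positives = [(q, img, 1.0) for q, img in zip(queries, images)]
    negatives = [(q, img, 0.0) for q, img in zip(queries, images[1:] + images[:1])]
    return positives + negatives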
Evaluation Results (NDCG@5)
| Dataset | Jina Reranker m0 (baseline) | QwenLogitReranker |
|---|---|---|
| UlrickBL/vidore_benchmark_economics_reports_v2_reranker_adapted | 0.735 | 0.799 |
| UlrickBL/vidore_benchmark_2_biomedical_lectures_v2_reranker_adapted | 0.763 | 0.755 |
| UlrickBL/vidore_benchmark_2_esg_reports_human_labeled_v2_reranker_adapted | 0.851 | 0.820 |
| UlrickBL/vidore_benchmark_docvqa_reranker_adapted | 0.767 | 0.747 |
| UlrickBL/vidore_benchmark_2_esg_reports_v2_reranker_adapted | 0.920 | 0.910 |
| Inference time (4898×2810 image, T4 GPU) | 2.212 s | 1.161 s |
Note: Despite less training data, lower data diversity, and less compute, QwenLogitReranker shows competitive or superior performance, especially on the economics reports benchmark.
Limitations
Trained only on a small subset (2000 samples) of DocVQA
One epoch of training — performance could likely improve with more compute/data
Currently uses causal language model decoding to simulate classification, which is slower than embedding-based methods (making it bidirectional, as in collama, could improve performance but would require more compute)
Load model
import torch
from PIL import Image
from torch import nn
from peft import PeftModel, PeftConfig
from huggingface_hub import hf_hub_download
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
class Qwen2_5Reranker(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model

    def forward(self, input_ids, pixel_values, attention_mask, image_grid_thw, original_length=None, labels=None):
        # Re-adapt pixel values: if they were collated with an extra batch dimension,
        # collapse it back to [num_patches, hidden] and truncate to the original patch count
        if len(pixel_values.shape) == 3:
            pixel_values = pixel_values.transpose(0, 1).reshape(-1, pixel_values.shape[-1])
            pixel_values = pixel_values[:original_length[0].item()]
        outputs = self.base_model.forward(
            input_ids=input_ids,
            pixel_values=pixel_values,
            image_grid_thw=image_grid_thw,
            attention_mask=attention_mask,
        )
        logits = outputs.logits
        batch_size = logits.size(0)
        batch_indices = torch.arange(batch_size, device=logits.device)
        # Position of the last non-padded token for each sequence
        lengths = attention_mask.sum(dim=1)
        token_pos = lengths - 1
        token_id_yes = 9454
        token_id_no = 2753
        selected_logits = logits[batch_indices, token_pos]  # [batch_size, vocab_size]
        yes_logits = selected_logits[:, token_id_yes]  # [batch_size]
        no_logits = selected_logits[:, token_id_no]   # [batch_size]
        # Relevance score = sigmoid of the "yes" vs "no" logit difference
        logit_diff = yes_logits - no_logits
        prob_yes = torch.sigmoid(logit_diff)
        return prob_yes
# Load the model
max_pixels = 1080*28*28
model_qwen = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16, device_map="auto", output_hidden_states=True,
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct",max_pixels=max_pixels)
base = PeftModel.from_pretrained(model_qwen, "UlrickBL/qwen_vl_reranker_adapter_V2")
model = Qwen2_5Reranker(base_model=base)
model=model.to("cuda")
model.eval()
Inference code
import time
import requests
from io import BytesIO

start_time = time.time()
# Download a test image
url = "https://oto.hms.harvard.edu/sites/g/files/omnuum8391/files/2025-04/PowerPoint-Presentation-Graphic.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content))
# Build the yes/no relevance prompt around the user query
query = "<|im_start|>system\nYou will be given an picture and a query. Answer 'Yes' if the answer to the query can be found in the picture, else 'No'<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Query : "+"What is the Harvard study departement in the question ?"+" \nAre the picture and query related ?<|im_end|>\n<|im_start|>assistant\n"
inputs = processor(
text=[query],
images=[image],
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
with torch.no_grad():
    batch_scores = model(**inputs)
print(f"Relevance score: {batch_scores.item():.4f}")
end_time = time.time()
print(f"Time taken : {end_time - start_time:.4f} seconds")
Future Work
Expand training with additional multilingual and domain-specific datasets
Increase batch size and number of epochs
Compare with a last-hidden-state + classification-layer approach