Could you release the best model (LLaVA-1.6 + LoRA) reported in the paper?
Hi, thank you for your great work on VLM2Vec! I have a quick question regarding the models you released.
According to the paper, the best-performing model on ImageNet-1K is LLaVA-1.6 fine-tuned with LoRA, which achieves a top-1 accuracy of 0.745. However, the currently available TIGER-Lab/VLM2Vec-LLaVa-Next seems to be fully fine-tuned, as there is no adapter_config.json in the repo.
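For reference, this is roughly how I checked for the adapter config (my own quick one-liner using huggingface_hub, not anything from the repo's docs):

# List the files in the released repo and look for a LoRA adapter config.
from huggingface_hub import list_repo_files

files = list_repo_files("TIGER-Lab/VLM2Vec-LLaVa-Next")
print("adapter_config.json" in files)  # False, so the checkpoint looks fully fine-tuned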
I evaluated this model (TIGER-Lab/VLM2Vec-LLaVa-Next) with the command below and obtained an ImageNet-1K accuracy of only 0.207, far from the reported result. Here is the command I used:
python eval.py \
--model_name TIGER-Lab/VLM2Vec-LLaVa-Next \
--model_backbone llava_next \
--encode_output_path llava_next_outputs/ \
--image_resolution high \
--num_crops 4 \
--max_len 256 \
--pooling last \
--normalize True \
--dataset_name TIGER-Lab/MMEB-eval \
--subset_name ImageNet-1K \
--dataset_split test \
--per_device_eval_batch_size 2 \
--image_dir eval_images/
In contrast, when I evaluated TIGER-Lab/VLM2Vec-LoRA with the LoRA setup, I got an accuracy of 0.68, which is much closer to the expected performance.
Would it be possible to release the LLaVA-1.6 + LoRA model used in the paper, or provide instructions to reproduce it (e.g., adapter weights and configuration)?
Thanks again for your time and amazing work!
Same problem. When I use this command to evaluate the model, I only get 0.015 and 0.029 on MSCOCO_i2t and VisualNews_i2t.
Thanks for letting me know. I will take a look soon and update here.
BTW, this is the model fine-tuned with LoRA; I merged the adapter into the full model so that it is more convenient for people to use.
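For anyone who wants to do the same merge with their own adapter, the step is roughly the sketch below (using PEFT). The base checkpoint and adapter path here are placeholders, not the exact script used for the release:

# Rough sketch: merge a LoRA adapter into its base model with PEFT so the result
# can be loaded as an ordinary full checkpoint.
from transformers import LlavaNextForConditionalGeneration
from peft import PeftModel

base = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf"   # assumed LLaVA-1.6 base; substitute your own
)
model = PeftModel.from_pretrained(base, "path/to/lora_adapter")  # hypothetical adapter path
merged = model.merge_and_unload()   # folds the LoRA deltas into the base weights
merged.save_pretrained("vlm2vec-llava-next-merged")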
Hi @yibingwei @LightSunKing, thanks a lot for bringing up this issue.
The low results were caused by the --max_len 256 parameter, which truncated the image tokens.
You can simply remove this parameter, and the results should then be reproducible.
This parameter can be a bit confusing. For some models' processors, it represents max_text_length, in which case it's fine to use. But for others, it refers to the combined length of image and text tokens, in which case it should be removed.
I'll update the documentation to clarify this and avoid future confusion. As a general rule, it's safer not to use this parameter.
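To illustrate what goes wrong (a toy example, not code from this repo): LLaVA-Next expands each image into hundreds of patch tokens, so a cap of 256 on the combined sequence cuts into the image tokens themselves.

# Toy illustration of why a combined image+text length cap breaks the embedding.
# 576 is an assumed per-crop patch-token count; with several crops the real sequence
# is even longer.
NUM_IMAGE_TOKENS = 576
image_tokens = ["<img>"] * NUM_IMAGE_TOKENS
text_tokens = ["<txt>"] * 40              # stand-in for the text prompt

combined = image_tokens + text_tokens     # order depends on the prompt template
truncated = combined[:256]                # what --max_len 256 effectively does here

kept_image = truncated.count("<img>")
kept_text = truncated.count("<txt>")
print(f"kept {kept_image}/{NUM_IMAGE_TOKENS} image tokens, {kept_text}/{len(text_tokens)} text tokens")
# -> most image tokens (and all text tokens) are dropped, so the scores collapse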
Also, just FYI, our best-performing models are now the VLM2Vec_Qwen series (https://huggingface.co/collections/TIGER-Lab/vlm2vec-6705f418271d085836e0cdd5). Later this week we'll also release the code and models for the VLM2Vec_v2 series, which will offer even better performance.