Performance difference between llava-hf/llava-1.5-7b-hf and liuhaotian/llava-v1.5-7b on the MME benchmark

#44
by NaForAll - opened

I've found a performance difference between the HF version and the original (liuhaotian) version. The results are quite low when I evaluate llava-1.5 hf on the MME benchmark. The original LLaVA-1.5 scores over 1500 on MME in the paper (https://arxiv.org/abs/2310.03744), but across several runs I find that llava-1.5 hf only reaches around 1000 on the MME perception and cognition tasks. This performance gap is quite confusing.

My code and my most recent results are as follows:

import os

import torch
from PIL import Image
from tqdm import tqdm
from transformers import AutoProcessor, LlavaForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = args.model_path  # comes from the surrounding script (not shown)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to(device)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

mme_folder = "MME"  # root folder of the MME benchmark data
mme_type_dict = {
    "Perception": ["existence", "count", "position", "color", "posters", "celebrity", "scene", "landmark", "artwork", "OCR"],
    "Cognition": ["commonsense_reasoning", "numerical_calculation", "text_translation", "code_reasoning"],
}

for task_type, task_list in mme_type_dict.items():
    for task in task_list:
        answer_dir = os.path.join(mme_folder, "gen_answers")
        os.makedirs(answer_dir, exist_ok=True)
        answer_path = os.path.join(answer_dir, "{}.txt".format(task))

        answer_file = open(answer_path, "w")
        gt_file = open(os.path.join(mme_folder, "eval_tool/Your_Results/examples/{}.txt".format(task)), "r", encoding="utf-8")
        gt_lines = gt_file.readlines()

        for gt_line in tqdm(gt_lines, desc=task):
            # each ground-truth line is: image_id \t question \t answer
            img_id, qs, gt_answer = gt_line.split("\t")[:3]

            raw_image = Image.open(os.path.join(mme_folder, "{}/{}".format(task, img_id)))
            question = "<image> " + qs

            inputs = processor(images=raw_image, text=question, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)

            # greedy decoding; short answers are enough for MME's yes/no questions
            output = model.generate(
                **inputs,
                do_sample=False,
                max_new_tokens=32,
                use_cache=True,
            )

            # decode only the newly generated tokens, skipping the prompt
            gen_answer = processor.decode(output[0][inputs.input_ids.size(1):], skip_special_tokens=True).strip().replace("\n", "")

            answer_file.write("{}\t{}\t{}\t{}".format(img_id, qs, gt_answer.replace("\n", ""), gen_answer) + "\n")

        answer_file.close()
        gt_file.close()

(screenshot of the MME scores attached: image.png)


I encountered the same problem.

In my case I'm using lmms-eval for the evaluation. llava-1.5-7b-hf scores 1482 on the MME perception benchmark, which is also lower than the reported 1510. (I'm using transformers==4.50.3.)

Llava Hugging Face org

One thing that differs between the original version and the HF version is that the original uses padding during image processing (see https://huggingface.co/llava-hf/llava-1.5-7b-hf/discussions/26#66cf46a5a523b74b5f90fa72). This was added to transformers in https://github.com/huggingface/transformers/pull/33191. The logits are equivalent.

Hence you could try evaluating by using LlavaImageProcessor instead of CLIPImageProcessor?
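For reference, a minimal sketch of what that swap could look like from user code, assuming a transformers release that ships LlavaImageProcessor and its do_pad option (so the library source does not need to be edited):

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration, LlavaImageProcessor

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# load the default processor, then swap in the padding-aware image processor
# (do_pad=True is assumed here to reproduce the original pad-to-square preprocessing)
processor = AutoProcessor.from_pretrained(model_id)
processor.image_processor = LlavaImageProcessor.from_pretrained(model_id, do_pad=True)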


I changed image_processor_class = "AutoImageProcessor" to image_processor_class = "LlavaImageProcessor" in processing_llava.py and evaluated again, but the MME-p result is still 1482. I'm wondering whether this problem is caused by the evaluation framework or by the transformers library. Are the model weights of llava-1.5-7b-hf equivalent to the original LLaVA weights?
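To rule out the override silently not taking effect, one quick sanity check (a sketch, same model id assumed) is to print which image processor ends up inside the processor and whether padding is enabled:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
print(type(processor.image_processor).__name__)             # CLIPImageProcessor vs. LlavaImageProcessor
print(getattr(processor.image_processor, "do_pad", None))   # should be True when padding is active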


I tried using the evaluation code provided in the original LLaVA repository again, modifying only the model generation part to use the Hugging Face version. The MME-p result was 1456, still lower than the original model. I hope someone can fix this issue, because the inference implementation in the Hugging Face version is much better than the original; the original inference code has caused a lot of trouble for my research.
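For context, the generation swap described above could look roughly like the sketch below; the helper name is made up, and the prompt template is the USER: <image>\n... ASSISTANT: format from the llava-1.5-7b-hf model card:

def hf_generate(model, processor, image, question, device="cuda"):
    # prompt template taken from the llava-1.5-7b-hf model card
    prompt = "USER: <image>\n{} ASSISTANT:".format(question)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
    output = model.generate(**inputs, do_sample=False, max_new_tokens=32)
    # return only the newly generated tokens
    return processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()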
