Performance difference between llava-hf/llava-1.5-7b-hf and liuhaotian/llava-v1.5-7b on the MME benchmark

#44
by NaForAll - opened

I've found a performance difference between the HF version and the original (liuhaotian) version. The results are quite low when I evaluate llava-1.5 hf on the MME benchmark. The original LLaVA-1.5 scores over 1500 on MME in the paper (https://arxiv.org/abs/2310.03744), but across several runs I find that llava-1.5 hf only reaches around 1000 on the MME perception and cognition tasks. This performance gap is quite confusing.

My code and my most recent results are as follows:

import os

import torch
from PIL import Image
from tqdm import tqdm
from transformers import AutoProcessor, LlavaForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = args.model_path  # comes from the surrounding script (not shown)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to(device)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

mme_folder = "MME"  # root folder of the MME benchmark data
mme_type_dict = {
    "Perception": ["existence", "count", "position", "color", "posters", "celebrity", "scene", "landmark", "artwork", "OCR"],
    "Cognition": ["commonsense_reasoning", "numerical_calculation", "text_translation", "code_reasoning"],
}

for task_type, task_list in mme_type_dict.items():
    for task in task_list:
        answer_dir = os.path.join(mme_folder, "gen_answers")
        os.makedirs(answer_dir, exist_ok=True)
        answer_path = os.path.join(answer_dir, "{}.txt".format(task))

        answer_file = open(answer_path, "w")
        gt_file = open(os.path.join(mme_folder, "eval_tool/Your_Results/examples/{}.txt".format(task)), "r", encoding="utf-8")
        gt_lines = gt_file.readlines()

        for gt_line in tqdm(gt_lines, desc=task):
            # each ground-truth line is: image_id \t question \t answer
            img_id, qs, gt_answer = gt_line.split("\t")[:3]

            raw_image = Image.open(os.path.join(mme_folder, "{}/{}".format(task, img_id)))
            question = "<image> " + qs

            inputs = processor(images=raw_image, text=question, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)

            # greedy decoding; short answers are enough for MME's yes/no questions
            output = model.generate(
                **inputs,
                do_sample=False,
                max_new_tokens=32,
                use_cache=True,
            )

            # decode only the newly generated tokens, skipping the prompt
            gen_answer = processor.decode(output[0][inputs.input_ids.size(1):], skip_special_tokens=True).strip().replace("\n", "")

            answer_file.write("{}\t{}\t{}\t{}".format(img_id, qs, gt_answer.replace("\n", ""), gen_answer) + "\n")

        answer_file.close()
        gt_file.close()

(screenshot of the MME scores attached: image.png)


I encountered the same problem.

In my case I'm using lmms-eval for the evaluation. llava-1.5-7b-hf scores 1482 on the MME perception benchmark, which is also lower than the reported 1510. (I'm using transformers==4.50.3.)

Llava Hugging Face org

One thing that differs between the original version and the HF version is that the original uses padding during image processing (see https://huggingface.co/llava-hf/llava-1.5-7b-hf/discussions/26#66cf46a5a523b74b5f90fa72). This was added to transformers in https://github.com/huggingface/transformers/pull/33191. The logits are equivalent.

Hence you could try evaluating by using LlavaImageProcessor instead of CLIPImageProcessor?
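For reference, a minimal sketch of what that swap could look like from user code, assuming a transformers release that ships LlavaImageProcessor and its do_pad option (so the library source does not need to be edited):

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration, LlavaImageProcessor

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# load the default processor, then swap in the padding-aware image processor
# (do_pad=True is assumed here to reproduce the original pad-to-square preprocessing)
processor = AutoProcessor.from_pretrained(model_id)
processor.image_processor = LlavaImageProcessor.from_pretrained(model_id, do_pad=True)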


I changed image_processor_class = "AutoImageProcessor" to image_processor_class = "LlavaImageProcessor" in processing_llava.py and evaluated again, but the MME-p result is still 1482. I'm wondering whether this problem is caused by the evaluation framework or by the transformers library. Are the model weights of llava-1.5-7b-hf equivalent to the original LLaVA weights?
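To rule out the override silently not taking effect, one quick sanity check (a sketch, same model id assumed) is to print which image processor ends up inside the processor and whether padding is enabled:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
print(type(processor.image_processor).__name__)             # CLIPImageProcessor vs. LlavaImageProcessor
print(getattr(processor.image_processor, "do_pad", None))   # should be True when padding is active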


I tried using the evaluation code provided in the original LLaVA repository again, modifying only the model generation part to use the Hugging Face version. The MME-p result was 1456, still lower than the original model. I hope someone can fix this issue, because the inference implementation in the Hugging Face version is much better than the original; the original inference code has caused a lot of trouble for my research.
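For context, the generation swap described above could look roughly like the sketch below; the helper name is made up, and the prompt template is the USER: <image>\n... ASSISTANT: format from the llava-1.5-7b-hf model card:

def hf_generate(model, processor, image, question, device="cuda"):
    # prompt template taken from the llava-1.5-7b-hf model card
    prompt = "USER: <image>\n{} ASSISTANT:".format(question)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
    output = model.generate(**inputs, do_sample=False, max_new_tokens=32)
    # return only the newly generated tokens
    return processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()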
