Performance difference between llava-hf/llava-1.5-7b-hf and liuhaotian/llava-v1.5-7b on the MME benchmark
I noticed a performance difference between the HF version and the original (liuhaotian) version. The results are quite low when I test llava-1.5-7b-hf on the MME benchmark: the original LLaVA-1.5 scores over 1500 on MME in the paper (https://arxiv.org/abs/2310.03744), while across many runs the HF version scores only around 1000 on the MME perception and cognition tasks. This performance gap is pretty confusing.
My code and the most recent results are as follows:
import argparse
import os

import torch
from PIL import Image
from tqdm import tqdm
from transformers import AutoProcessor, LlavaForConditionalGeneration

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", default="llava-hf/llava-1.5-7b-hf")
args = parser.parse_args()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = args.model_path
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

mme_folder = "./MME/"
mme_type_dict = {
    "Perception": ["existence", "count", "position", "color", "posters", "celebrity", "scene", "landmark", "artwork", "OCR"],
    "Cognition": ["commonsense_reasoning", "numerical_calculation", "text_translation", "code_reasoning"],
}

# Generated answers are written in the tab-separated format expected by the MME eval_tool.
answer_dir = os.path.join(mme_folder, "gen_answers")
os.makedirs(answer_dir, exist_ok=True)

for task_type, task_list in mme_type_dict.items():
    for task in task_list:
        answer_path = os.path.join(answer_dir, "{}.txt".format(task))
        answer_file = open(answer_path, "w")
        gt_file = open(os.path.join(mme_folder, "eval_tool/Your_Results/examples/{}.txt".format(task)), "r", encoding="utf-8")
        gt_lines = gt_file.readlines()
        for gt_line in tqdm(gt_lines, desc=task):
            img_id, qs, gt_answer = gt_line.split("\t")[:3]
            raw_image = Image.open(os.path.join(mme_folder, "{}/{}".format(task, img_id)))
            question = "<image> " + qs
            inputs = processor(images=raw_image, text=question, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
            output = model.generate(
                **inputs,
                do_sample=False,
                max_new_tokens=32,
                use_cache=True,
            )
            # Drop the prompt tokens so only the newly generated answer is decoded.
            gen_answer = processor.decode(output[0][inputs.input_ids.size(1):], skip_special_tokens=True).strip().replace("\n", "")
            answer_file.write("{}\t{}\t{}\t{}".format(img_id, qs, gt_answer.replace("\n", ""), gen_answer) + "\n")
        answer_file.close()
        gt_file.close()
I encountered the same problem.
In my case I'm using lmms-eval for the evaluation. llava-1.5-7b-hf scores 1482 on the MME-p benchmark, which is also lower than the original model's 1510. (I'm using transformers==4.50.3.)
One thing which is different between the original version and the HF version is that the original one uses padding for image processing (see https://huggingface.co/llava-hf/llava-1.5-7b-hf/discussions/26#66cf46a5a523b74b5f90fa72). This got added in https://github.com/huggingface/transformers/pull/33191. The logits are equivalent.
Hence you could try evaluating using LlavaImageProcessor instead of CLIPImageProcessor?
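For reference, a minimal sketch of what that swap could look like on the inference side, assuming the LlavaImageProcessor class shipped in recent transformers releases; whether do_pad needs to be set explicitly is an assumption, so check the defaults of your installed version:

from transformers import AutoProcessor, LlavaImageProcessor

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
# Replace the default CLIPImageProcessor with LlavaImageProcessor, which pads
# non-square images to a square before resizing (as the original repo does).
# do_pad=True is an assumption here; verify it against your transformers version.
processor.image_processor = LlavaImageProcessor.from_pretrained(model_id, do_pad=True)
# `processor` can then be used on the same images/text as before.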
I changed image_processor_class = "AutoImageProcessor" to image_processor_class = "LlavaImageProcessor" in processing_llava.py and evaluated again, but the MME-p result is still 1482. I'm wondering whether this problem is caused by the evaluation framework or by the transformers library. Are the model weights of llava-1.5-7b-hf equivalent to the original LLaVA model?
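One way to check whether the swap actually took effect is to print the class of the instantiated image processor and compare pixel values from the two processors on a non-square image. This is only a sketch; the sample image path is a placeholder and the do_pad argument is an assumption:

import torch
from PIL import Image
from transformers import AutoProcessor, CLIPImageProcessor, LlavaImageProcessor

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
print(type(processor.image_processor).__name__)  # which class did the framework actually load?

# If padding is applied, the two processors should produce different pixel values
# for any non-square image.
image = Image.open("some_non_square_image.jpg")  # placeholder path
clip_px = CLIPImageProcessor.from_pretrained(model_id)(image, return_tensors="pt").pixel_values
llava_px = LlavaImageProcessor.from_pretrained(model_id, do_pad=True)(image, return_tensors="pt").pixel_values
print(torch.allclose(clip_px, llava_px))  # expect False if padding changed the preprocessing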
I tried using the evaluation code provided in the original LLaVA repository again, only modifying the model generation part to use the Hugging Face version. The MME-p evaluation result was 1456, still lower than the original model. I hope someone can fix this issue, because the inference implementation in the Hugging Face version is much better than the original; the original inference code has caused a lot of trouble for my research.
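Regarding the earlier question of whether the weights are equivalent: a direct check would be to run the same image and question through both checkpoints and compare the logits, which the comment above suggests should match. Below is a rough sketch of the HF side only, with the prompt format taken from the llava-hf model card; the test image and question are placeholders, and the corresponding forward pass with the original liuhaotian checkpoint would use the original repo's own loading and preprocessing code:

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("some_test_image.jpg")  # placeholder: any test image
prompt = "USER: <image>\nIs there a dog in the image? Please answer yes or no. ASSISTANT:"  # example MME-style question
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

with torch.no_grad():
    hf_logits = model(**inputs).logits[0, -1]  # next-token logits from the HF checkpoint

# Compute the analogous last-token logits with liuhaotian/llava-v1.5-7b using the original
# repo's code, then compare, e.g. torch.allclose(hf_logits, orig_logits, atol=1e-3).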