tuandunghcmut committed on
Commit dae9dfe · verified · Parent: 28829b5

Add files using upload-large-folder tool

This view is limited to 50 files because the commit contains too many changes; see the raw diff for the complete change set.
Files changed (50)
  1. VLMEvalKit/vlmeval/utils/__pycache__/matching_util.cpython-310.pyc +0 -0
  2. VLMEvalKit/vlmeval/vlm/__pycache__/__init__.cpython-310.pyc +0 -0
  3. VLMEvalKit/vlmeval/vlm/__pycache__/base.cpython-310.pyc +0 -0
  4. VLMEvalKit/vlmeval/vlm/__pycache__/cogvlm.cpython-310.pyc +0 -0
  5. VLMEvalKit/vlmeval/vlm/__pycache__/eagle_x.cpython-310.pyc +0 -0
  6. VLMEvalKit/vlmeval/vlm/__pycache__/h2ovl_mississippi.cpython-310.pyc +0 -0
  7. VLMEvalKit/vlmeval/vlm/__pycache__/idefics.cpython-310.pyc +0 -0
  8. VLMEvalKit/vlmeval/vlm/__pycache__/instructblip.cpython-310.pyc +0 -0
  9. VLMEvalKit/vlmeval/vlm/__pycache__/kosmos.cpython-310.pyc +0 -0
  10. VLMEvalKit/vlmeval/vlm/__pycache__/llama_vision.cpython-310.pyc +0 -0
  11. VLMEvalKit/vlmeval/vlm/__pycache__/mgm.cpython-310.pyc +0 -0
  12. VLMEvalKit/vlmeval/vlm/__pycache__/minigpt4.cpython-310.pyc +0 -0
  13. VLMEvalKit/vlmeval/vlm/__pycache__/mixsense.cpython-310.pyc +0 -0
  14. VLMEvalKit/vlmeval/vlm/__pycache__/molmo.cpython-310.pyc +0 -0
  15. VLMEvalKit/vlmeval/vlm/__pycache__/monkey.cpython-310.pyc +0 -0
  16. VLMEvalKit/vlmeval/vlm/__pycache__/moondream.cpython-310.pyc +0 -0
  17. VLMEvalKit/vlmeval/vlm/__pycache__/mplug_owl2.cpython-310.pyc +0 -0
  18. VLMEvalKit/vlmeval/vlm/__pycache__/nvlm.cpython-310.pyc +0 -0
  19. VLMEvalKit/vlmeval/vlm/__pycache__/omchat.cpython-310.pyc +0 -0
  20. VLMEvalKit/vlmeval/vlm/__pycache__/open_flamingo.cpython-310.pyc +0 -0
  21. VLMEvalKit/vlmeval/vlm/__pycache__/paligemma.cpython-310.pyc +0 -0
  22. VLMEvalKit/vlmeval/vlm/__pycache__/pandagpt.cpython-310.pyc +0 -0
  23. VLMEvalKit/vlmeval/vlm/__pycache__/parrot.cpython-310.pyc +0 -0
  24. VLMEvalKit/vlmeval/vlm/__pycache__/phi3_vision.cpython-310.pyc +0 -0
  25. VLMEvalKit/vlmeval/vlm/__pycache__/pixtral.cpython-310.pyc +0 -0
  26. VLMEvalKit/vlmeval/vlm/__pycache__/qh_360vl.cpython-310.pyc +0 -0
  27. VLMEvalKit/vlmeval/vlm/__pycache__/sail_vl.cpython-310.pyc +0 -0
  28. VLMEvalKit/vlmeval/vlm/__pycache__/slime.cpython-310.pyc +0 -0
  29. VLMEvalKit/vlmeval/vlm/__pycache__/transcore_m.cpython-310.pyc +0 -0
  30. VLMEvalKit/vlmeval/vlm/__pycache__/vila.cpython-310.pyc +0 -0
  31. VLMEvalKit/vlmeval/vlm/__pycache__/visualglm.cpython-310.pyc +0 -0
  32. VLMEvalKit/vlmeval/vlm/__pycache__/wemm.cpython-310.pyc +0 -0
  33. VLMEvalKit/vlmeval/vlm/__pycache__/yi_vl.cpython-310.pyc +0 -0
  34. VLMEvalKit/vlmeval/vlm/internvl/__init__.py +3 -0
  35. VLMEvalKit/vlmeval/vlm/internvl/__pycache__/__init__.cpython-310.pyc +0 -0
  36. VLMEvalKit/vlmeval/vlm/internvl/__pycache__/internvl_chat.cpython-310.pyc +0 -0
  37. VLMEvalKit/vlmeval/vlm/internvl/__pycache__/utils.cpython-310.pyc +0 -0
  38. VLMEvalKit/vlmeval/vlm/internvl/internvl_chat.py +353 -0
  39. VLMEvalKit/vlmeval/vlm/internvl/utils.py +349 -0
  40. VLMEvalKit/vlmeval/vlm/llava/__init__.py +4 -0
  41. VLMEvalKit/vlmeval/vlm/llava/__pycache__/__init__.cpython-310.pyc +0 -0
  42. VLMEvalKit/vlmeval/vlm/llava/__pycache__/llava.cpython-310.pyc +0 -0
  43. VLMEvalKit/vlmeval/vlm/llava/__pycache__/llava_xtuner.cpython-310.pyc +0 -0
  44. VLMEvalKit/vlmeval/vlm/llava/llava.py +897 -0
  45. VLMEvalKit/vlmeval/vlm/llava/llava_xtuner.py +239 -0
  46. VLMEvalKit/vlmeval/vlm/misc/blip2_instruct_vicuna13b.yaml +43 -0
  47. VLMEvalKit/vlmeval/vlm/misc/blip2_instruct_vicuna7b.yaml +43 -0
  48. VLMEvalKit/vlmeval/vlm/misc/minigpt4_13b_eval.yaml +37 -0
  49. VLMEvalKit/vlmeval/vlm/misc/minigpt4_7b_eval.yaml +38 -0
  50. VLMEvalKit/vlmeval/vlm/misc/minigptv2_eval.yaml +36 -0
VLMEvalKit/vlmeval/utils/__pycache__/matching_util.cpython-310.pyc ADDED
Binary file (2.01 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (3.17 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/base.cpython-310.pyc ADDED
Binary file (7.59 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/cogvlm.cpython-310.pyc ADDED
Binary file (4.6 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/eagle_x.cpython-310.pyc ADDED
Binary file (6.17 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/h2ovl_mississippi.cpython-310.pyc ADDED
Binary file (4.67 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/idefics.cpython-310.pyc ADDED
Binary file (8.54 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/instructblip.cpython-310.pyc ADDED
Binary file (2.13 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/kosmos.cpython-310.pyc ADDED
Binary file (4.13 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/llama_vision.cpython-310.pyc ADDED
Binary file (7.55 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/mgm.cpython-310.pyc ADDED
Binary file (4.75 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/minigpt4.cpython-310.pyc ADDED
Binary file (2.91 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/mixsense.cpython-310.pyc ADDED
Binary file (1.81 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/molmo.cpython-310.pyc ADDED
Binary file (2.31 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/monkey.cpython-310.pyc ADDED
Binary file (3.17 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/moondream.cpython-310.pyc ADDED
Binary file (5.25 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/mplug_owl2.cpython-310.pyc ADDED
Binary file (4.89 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/nvlm.cpython-310.pyc ADDED
Binary file (5.07 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/omchat.cpython-310.pyc ADDED
Binary file (5.51 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/open_flamingo.cpython-310.pyc ADDED
Binary file (3.11 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/paligemma.cpython-310.pyc ADDED
Binary file (1.76 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/pandagpt.cpython-310.pyc ADDED
Binary file (2.35 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/parrot.cpython-310.pyc ADDED
Binary file (7.52 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/phi3_vision.cpython-310.pyc ADDED
Binary file (4.48 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/pixtral.cpython-310.pyc ADDED
Binary file (2.45 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/qh_360vl.cpython-310.pyc ADDED
Binary file (2.24 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/sail_vl.cpython-310.pyc ADDED
Binary file (15.4 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/slime.cpython-310.pyc ADDED
Binary file (2.73 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/transcore_m.cpython-310.pyc ADDED
Binary file (6.05 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/vila.cpython-310.pyc ADDED
Binary file (3.77 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/visualglm.cpython-310.pyc ADDED
Binary file (1.47 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/wemm.cpython-310.pyc ADDED
Binary file (2.81 kB).
 
VLMEvalKit/vlmeval/vlm/__pycache__/yi_vl.cpython-310.pyc ADDED
Binary file (4.7 kB).
 
VLMEvalKit/vlmeval/vlm/internvl/__init__.py ADDED
@@ -0,0 +1,3 @@
1
+ from .internvl_chat import InternVLChat
2
+
3
+ __all__ = ['InternVLChat']
VLMEvalKit/vlmeval/vlm/internvl/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (244 Bytes).
 
VLMEvalKit/vlmeval/vlm/internvl/__pycache__/internvl_chat.cpython-310.pyc ADDED
Binary file (10.4 kB).
 
VLMEvalKit/vlmeval/vlm/internvl/__pycache__/utils.cpython-310.pyc ADDED
Binary file (11.4 kB).
 
VLMEvalKit/vlmeval/vlm/internvl/internvl_chat.py ADDED
@@ -0,0 +1,353 @@
1
+ import math
2
+ import pandas as pd
3
+ import random
4
+ import re
5
+ import string
6
+ import torch
7
+ import torch.distributed as dist
8
+ import torchvision.transforms as T
9
+ import transformers
10
+ import warnings
11
+ from PIL import Image
12
+ from torchvision.transforms.functional import InterpolationMode
13
+ from transformers import AutoTokenizer, AutoConfig, AutoModel, CLIPImageProcessor
14
+
15
+ from .utils import (build_multi_choice_prompt,
16
+ build_video_prompt,
17
+ build_mpo_prompt,
18
+ build_mcq_cot_prompt,
19
+ build_qa_cot_prompt,
20
+ mpo_post_processing,
21
+ reorganize_prompt,
22
+ split_model, load_image)
23
+ from .utils import mpo_prompt_with_final_answer, mpo_prompt_without_final_answer
24
+ from ..base import BaseModel
25
+ from ...dataset import DATASET_TYPE, DATASET_MODALITY
26
+ from ...smp import *
27
+
28
+
29
+ class InternVLChat(BaseModel):
30
+ INSTALL_REQ = False
31
+ INTERLEAVE = True
32
+
33
+ def __init__(self,
34
+ model_path='OpenGVLab/InternVL-Chat-V1-5',
35
+ load_in_8bit=False,
36
+ use_mpo_prompt=False,
37
+ version='V1.0',
38
+ **kwargs):
39
+
40
+ assert model_path is not None
41
+ assert version_cmp(transformers.__version__, '4.37.2', 'ge')
42
+
43
+ self.use_mpo_prompt = use_mpo_prompt
44
+ self.model_path = model_path
45
+ self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
46
+
47
+ # Regular expression to match the pattern 'Image' followed by a number, e.g. Image1
48
+ self.pattern = r'Image(\d+)'
49
+ # Replacement pattern to insert a hyphen between 'Image' and the number, e.g. Image-1
50
+ self.replacement = r'Image-\1'
51
+
52
+ # Convert InternVL2 response to dataset format
53
+ # e.g. Image-1 -> Image1
54
+
55
+ # Regular expression to match the pattern 'Image-' followed by a number
56
+ self.reverse_pattern = r'Image-(\d+)'
57
+ # Replacement pattern to remove the hyphen (Image-1 -> Image1)
58
+ self.reverse_replacement = r'Image\1'
59
+
60
+ if auto_split_flag():
61
+ device_map, visible_devices = split_model(model_path=model_path)
62
+ self.device = visible_devices[0]
63
+ self.model = AutoModel.from_pretrained(
64
+ model_path,
65
+ torch_dtype=torch.bfloat16,
66
+ load_in_8bit=load_in_8bit,
67
+ trust_remote_code=True,
68
+ low_cpu_mem_usage=True,
69
+ device_map=device_map).eval()
70
+ else:
71
+ self.model = AutoModel.from_pretrained(
72
+ model_path,
73
+ torch_dtype=torch.bfloat16,
74
+ load_in_8bit=load_in_8bit,
75
+ trust_remote_code=True,
76
+ low_cpu_mem_usage=True).eval().cuda()
77
+ self.device = 'cuda'
78
+
79
+ self.image_size = self.model.config.vision_config.image_size
80
+ self.version = version
81
+ kwargs_default = dict(do_sample=False, max_new_tokens=4096, top_p=None)
82
+ kwargs_default.update(kwargs)
83
+ self.kwargs = kwargs_default
84
+
85
+ warnings.warn(f'Following kwargs received: {self.kwargs}, will use as generation config. ')
86
+
87
+ def use_custom_prompt(self, dataset):
88
+ assert dataset is not None
89
+ if listinstr(['MMDU', 'MME-RealWorld', 'MME-RealWorld-CN'], dataset):
90
+ # Multi-turn datasets do not use a custom prompt
91
+ return False
92
+ if DATASET_MODALITY(dataset) == 'VIDEO':
93
+ # Video benchmarks do not use a custom prompt here
94
+ return False
95
+ else:
96
+ return True
97
+
98
+ def build_prompt(self, line, dataset=None):
99
+ assert self.use_custom_prompt(dataset)
100
+ assert dataset is None or isinstance(dataset, str)
101
+ tgt_path = self.dump_image(line, dataset)
102
+
103
+ if dataset is not None and DATASET_TYPE(dataset) == 'Y/N':
104
+ question = line['question']
105
+ if listinstr(['MME'], dataset):
106
+ prompt = question + ' Answer the question using a single word or phrase.'
107
+ elif listinstr(['HallusionBench', 'AMBER'], dataset):
108
+ prompt = question + ' Please answer yes or no. Answer the question using a single word or phrase.'
109
+ else:
110
+ prompt = question
111
+ elif dataset is not None and DATASET_TYPE(dataset) == 'MCQ':
112
+ prompt = build_multi_choice_prompt(line, dataset)
113
+ if os.getenv('USE_COT') == '1':
114
+ prompt = build_mcq_cot_prompt(line, prompt)
115
+ elif dataset is not None and DATASET_TYPE(dataset) == 'VQA':
116
+ question = line['question']
117
+ if listinstr(['LLaVABench', 'WildVision'], dataset):
118
+ prompt = question + '\nAnswer this question in detail.'
119
+ elif listinstr(['OCRVQA', 'TextVQA', 'ChartQA', 'DocVQA', 'InfoVQA', 'OCRBench',
120
+ 'DUDE', 'SLIDEVQA', 'GQA', 'MMLongBench_DOC'], dataset):
121
+ prompt = question + '\nAnswer the question using a single word or phrase.'
122
+ elif listinstr(['MathVista', 'MathVision', 'VCR', 'MTVQA', 'MMVet', 'MathVerse',
123
+ 'MMDU', 'CRPE', 'MIA-Bench', 'MM-Math', 'DynaMath', 'QSpatial'], dataset):
124
+ prompt = question
125
+ if os.getenv('USE_COT') == '1':
126
+ prompt = build_qa_cot_prompt(line, prompt)
127
+ else:
128
+ prompt = question + '\nAnswer the question using a single word or phrase.'
129
+ else:
130
+ # VQA_ex_prompt: OlympiadBench, VizWiz
131
+ prompt = line['question']
132
+ if os.getenv('USE_COT') == '1':
133
+ prompt = build_qa_cot_prompt(line, prompt)
134
+
135
+ message = [dict(type='text', value=prompt)]
136
+ message.extend([dict(type='image', value=s) for s in tgt_path])
137
+
138
+ if self.use_mpo_prompt:
139
+ message = build_mpo_prompt(message, line, dataset)
140
+ return message
141
+
142
+ def set_max_num(self, dataset):
143
+ # The total limit on the number of images processed, set to avoid Out-of-Memory issues.
144
+ self.total_max_num = 64
145
+ if dataset is None:
146
+ self.max_num = 6
147
+ return None
148
+ res_12_datasets = ['ChartQA_TEST', 'MMMU_DEV_VAL', 'MMMU_TEST', 'MME-RealWorld',
149
+ 'VCR_EN', 'VCR_ZH', 'OCRVQA']
150
+ res_18_datasets = ['DocVQA_VAL', 'DocVQA_TEST', 'DUDE', 'MMLongBench_DOC', 'SLIDEVQA']
151
+ res_24_datasets = ['InfoVQA_VAL', 'InfoVQA_TEST', 'OCRBench', 'HRBench4K', 'HRBench8K']
152
+ if DATASET_MODALITY(dataset) == 'VIDEO':
153
+ self.max_num = 1
154
+ elif listinstr(res_12_datasets, dataset):
155
+ self.max_num = 12
156
+ elif listinstr(res_18_datasets, dataset):
157
+ self.max_num = 18
158
+ elif listinstr(res_24_datasets, dataset):
159
+ self.max_num = 24
160
+ else:
161
+ self.max_num = 6
162
+
163
+ def generate_v1_2(self, message, dataset=None):
164
+ self.INTERLEAVE = False
165
+ prompt, image_path = self.message_to_promptimg(message, dataset=dataset)
166
+ image = Image.open(image_path).convert('RGB')
167
+ image = image.resize((self.image_size, self.image_size))
168
+ image_processor = CLIPImageProcessor.from_pretrained(self.model_path)
169
+ pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
170
+ pixel_values = pixel_values.to(torch.bfloat16).to(self.device)
171
+ with torch.no_grad():
172
+ response = self.model.chat(self.tokenizer, pixel_values=pixel_values,
173
+ question=prompt, generation_config=self.kwargs)
174
+ return response
175
+
176
+ def generate_v1_5(self, message, dataset=None):
177
+ image_num = len([x for x in message if x['type'] == 'image'])
178
+ max_num = max(1, min(self.max_num, self.total_max_num // max(image_num, 1)))  # avoid div-by-zero for text-only messages
179
+ prompt = '\n'.join([x['value'] for x in message if x['type'] == 'text'])
180
+
181
+ if DATASET_MODALITY(dataset) == 'VIDEO':
182
+ prompt = build_video_prompt(prompt, dataset)
183
+
184
+ if image_num > 1:
185
+ image_path = [x['value'] for x in message if x['type'] == 'image']
186
+ pixel_values_list = []
187
+ for file_name in image_path:
188
+ pixel_values_list.append(load_image(file_name, max_num=max_num).to(self.device).to(torch.bfloat16))
189
+ pixel_values = torch.cat(pixel_values_list, dim=0)
190
+ elif image_num == 1:
191
+ image_path = [x['value'] for x in message if x['type'] == 'image'][0]
192
+ pixel_values = load_image(image_path, max_num=max_num).to(self.device).to(torch.bfloat16)
193
+ else:
194
+ pixel_values = None
195
+ with torch.no_grad():
196
+ response = self.model.chat(
197
+ self.tokenizer,
198
+ pixel_values=pixel_values,
199
+ question=prompt,
200
+ generation_config=self.kwargs,
201
+ verbose=True)
202
+ return response
203
+
204
+ def generate_v2(self, message, dataset=None):
205
+ image_num = len([x for x in message if x['type'] == 'image'])
206
+ max_num = max(1, min(self.max_num, self.total_max_num // max(image_num, 1)))  # avoid div-by-zero for text-only messages
207
+ prompt = reorganize_prompt(message, image_num, dataset=dataset)
208
+
209
+ if dataset is not None and DATASET_MODALITY(dataset) == 'VIDEO':
210
+ prompt = build_video_prompt(prompt, dataset)
211
+
212
+ if image_num > 1:
213
+ image_path = [x['value'] for x in message if x['type'] == 'image']
214
+ num_patches_list, pixel_values_list = [], []
215
+ for image_idx, file_name in enumerate(image_path):
216
+ upscale_flag = image_idx == 0 and dataset is not None and listinstr(['MMMU'], dataset)
217
+ curr_pixel_values = load_image(
218
+ file_name, max_num=max_num, upscale=upscale_flag).to(self.device).to(torch.bfloat16)
219
+ num_patches_list.append(curr_pixel_values.size(0))
220
+ pixel_values_list.append(curr_pixel_values)
221
+ pixel_values = torch.cat(pixel_values_list, dim=0)
222
+ elif image_num == 1:
223
+ image_path = [x['value'] for x in message if x['type'] == 'image'][0]
224
+ upscale_flag = dataset is not None and listinstr(['MMMU'], dataset)
225
+ pixel_values = load_image(
226
+ image_path, max_num=max_num, upscale=upscale_flag).to(self.device).to(torch.bfloat16)
227
+ num_patches_list = [pixel_values.size(0)]
228
+ else:
229
+ pixel_values = None
230
+ num_patches_list = []
231
+
232
+ with torch.no_grad():
233
+ response = self.model.chat(
234
+ self.tokenizer,
235
+ pixel_values=pixel_values,
236
+ num_patches_list=num_patches_list,
237
+ question=prompt,
238
+ generation_config=self.kwargs,
239
+ verbose=True
240
+ )
241
+
242
+ if self.use_mpo_prompt:
243
+ response = mpo_post_processing(response, dataset)
244
+ return response
245
+
246
+ def generate_inner(self, message, dataset=None):
247
+ self.set_max_num(dataset)
248
+ print(f'InternVL model version: {self.version}')
249
+ if self.version in ['V1.1', 'V1.2']:
250
+ return self.generate_v1_2(message, dataset)
251
+ elif self.version == 'V1.5':
252
+ return self.generate_v1_5(message, dataset)
253
+ elif self.version == 'V2.0':
254
+ return self.generate_v2(message, dataset)
255
+ else:
256
+ raise ValueError(f'Unsupported version: {self.version}')
257
+
258
+ def build_history(self, message):
259
+ # State shared with the nested helper below via closure
260
+ image_path = []
261
+ image_cnt = 0
262
+
263
+ def concat_tilist(tilist):
264
+ nonlocal image_cnt # Declare image_cnt as nonlocal to modify it
265
+ prompt = ''
266
+ for item in tilist:
267
+ # Substitute the pattern in the text
268
+ if item['type'] == 'text':
269
+ prompt += re.sub(self.pattern, self.replacement, item['value'])
270
+ elif item['type'] == 'image':
271
+ image_cnt += 1
272
+ prompt += '<image>\n'
273
+ image_path.append(item['value'])
274
+ return prompt
275
+
276
+ # Only previous messages
277
+ assert len(message) % 2 == 0
278
+ history = []
279
+ for i in range(len(message) // 2):
280
+ m1, m2 = message[2 * i], message[2 * i + 1]
281
+ assert m1['role'] == 'user' and m2['role'] == 'assistant'
282
+ history.append((concat_tilist(m1['content']), concat_tilist(m2['content'])))
283
+
284
+ return history, image_path, image_cnt
285
+
286
+ def chat_inner_v2(self, message, dataset=None):
287
+
288
+ if len(message) > 1:
289
+ history, image_path, image_cnt = self.build_history(message[:-1])
290
+ else:
291
+ history, image_path, image_cnt = None, [], 1
292
+ current_msg = message[-1]
293
+ question = ''
294
+
295
+ # If message is just text in the conversation
296
+ if len(current_msg['content']) == 1 and current_msg['content'][0]['type'] == 'text':
297
+ question = current_msg['content'][0]['value']
298
+ question = re.sub(self.pattern, self.replacement, question) # Fix pattern as per InternVL
299
+ else:
300
+ for msg in current_msg['content']:
301
+ if msg['type'] == 'text':
302
+ question += re.sub(self.pattern, self.replacement, msg['value'])
303
+ elif msg['type'] == 'image':
304
+ image_cnt += 1
305
+ question += '<image>\n'
306
+ image_path.append(msg['value'])
307
+
308
+ if image_cnt > 1:
309
+ num_patches_list = []
310
+ pixel_values_list = []
311
+ for image_idx, file_name in enumerate(image_path):
312
+ upscale_flag = image_idx == 0 and dataset is not None and listinstr(['MMMU_DEV_VAL'], dataset)
313
+ curr_pixel_values = load_image(
314
+ file_name, max_num=self.max_num, upscale=upscale_flag).to(self.device).to(torch.bfloat16)
315
+ num_patches_list.append(curr_pixel_values.size(0))
316
+ pixel_values_list.append(curr_pixel_values)
317
+ pixel_values = torch.cat(pixel_values_list, dim=0)
318
+ elif image_cnt == 1:
319
+ upscale_flag = listinstr(['MMMU_DEV_VAL'], dataset)
320
+ pixel_values = load_image(
321
+ image_path, max_num=self.max_num, upscale=upscale_flag).to(self.device).to(torch.bfloat16)
322
+ num_patches_list = [pixel_values.size(0)]
323
+ else:
324
+ pixel_values = None
325
+ num_patches_list = []
326
+
327
+ response, history = self.model.chat(
328
+ self.tokenizer,
329
+ pixel_values=pixel_values,
330
+ num_patches_list=num_patches_list,
331
+ question=question,
332
+ generation_config=self.kwargs,
333
+ history=history,
334
+ return_history=True
335
+ )
336
+
337
+ response = re.sub(self.reverse_pattern, self.reverse_replacement, response)
338
+
339
+ return response
340
+
341
+ def chat_inner(self, message, dataset=None):
342
+ self.set_max_num(dataset)
343
+
344
+ if self.version in ['V1.1', 'V1.2']:
345
+ raise ValueError(f'Unsupported version for Multi-Turn: {self.version}')
346
+ elif self.version == 'V1.5':
347
+ raise ValueError(f'Unsupported version for Multi-Turn: {self.version}')
348
+ elif self.version == 'V2.0':
349
+ kwargs_default = dict(do_sample=False, max_new_tokens=512, top_p=None, num_beams=1)
350
+ self.kwargs = kwargs_default
351
+ return self.chat_inner_v2(message, dataset)
352
+ else:
353
+ raise ValueError(f'Unsupported version for Multi-Turn: {self.version}')
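A minimal usage sketch of the wrapper defined above, assuming the VLMEvalKit package from this commit is installed; the checkpoint name and image path are placeholders rather than anything pinned by the commit. Generation kwargs passed to the constructor are stored in self.kwargs and forwarded to model.chat() as the generation config.
from vlmeval.vlm.internvl import InternVLChat
# Placeholder checkpoint and image; any InternVL checkpoint with remote code should fit this interface.
model = InternVLChat(model_path='OpenGVLab/InternVL2-8B', version='V2.0', max_new_tokens=256)
message = [
    dict(type='image', value='demo.jpg'),           # interleaved image/text items, as generate_v2 expects
    dict(type='text', value='Describe the image in one sentence.'),
]
print(model.generate_inner(message, dataset=None))  # version 'V2.0' dispatches to generate_v2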
VLMEvalKit/vlmeval/vlm/internvl/utils.py ADDED
@@ -0,0 +1,349 @@
1
+ import math
2
+ import pandas as pd
3
+ import random
4
+ import re
5
+ import string
6
+ import torch
7
+ import torch.distributed as dist
8
+ import torchvision.transforms as T
9
+ import transformers
10
+ import warnings
11
+ from PIL import Image
12
+ from torchvision.transforms.functional import InterpolationMode
13
+ from transformers import AutoTokenizer, AutoConfig, AutoModel, CLIPImageProcessor
14
+
15
+ from ..base import BaseModel
16
+ from ...dataset import DATASET_TYPE, DATASET_MODALITY
17
+ from ...smp import *
18
+
19
+ IMAGENET_MEAN = (0.485, 0.456, 0.406)
20
+ IMAGENET_STD = (0.229, 0.224, 0.225)
21
+
22
+
23
+ def build_transform(input_size):
24
+ MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
25
+ transform = T.Compose([
26
+ T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
27
+ T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
28
+ T.ToTensor(),
29
+ T.Normalize(mean=MEAN, std=STD)
30
+ ])
31
+ return transform
32
+
33
+
34
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
35
+ best_ratio_diff = float('inf')
36
+ best_ratio = (1, 1)
37
+ area = width * height
38
+ for ratio in target_ratios:
39
+ target_aspect_ratio = ratio[0] / ratio[1]
40
+ ratio_diff = abs(aspect_ratio - target_aspect_ratio)
41
+ if ratio_diff < best_ratio_diff:
42
+ best_ratio_diff = ratio_diff
43
+ best_ratio = ratio
44
+ elif ratio_diff == best_ratio_diff:
45
+ if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
46
+ best_ratio = ratio
47
+ return best_ratio
48
+
49
+
50
+ def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
51
+ orig_width, orig_height = image.size
52
+ aspect_ratio = orig_width / orig_height
53
+
54
+ # enumerate candidate tile grids (cols, rows) whose tile count lies in [min_num, max_num]
55
+ target_ratios = set(
56
+ (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
57
+ i * j <= max_num and i * j >= min_num)
58
+ target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
59
+
60
+ # find the closest aspect ratio to the target
61
+ target_aspect_ratio = find_closest_aspect_ratio(
62
+ aspect_ratio, target_ratios, orig_width, orig_height, image_size)
63
+
64
+ # calculate the target width and height
65
+ target_width = image_size * target_aspect_ratio[0]
66
+ target_height = image_size * target_aspect_ratio[1]
67
+ blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
68
+
69
+ # resize the image
70
+ resized_img = image.resize((target_width, target_height))
71
+ processed_images = []
72
+ for i in range(blocks):
73
+ box = (
74
+ (i % (target_width // image_size)) * image_size,
75
+ (i // (target_width // image_size)) * image_size,
76
+ ((i % (target_width // image_size)) + 1) * image_size,
77
+ ((i // (target_width // image_size)) + 1) * image_size
78
+ )
79
+ # split the image
80
+ split_img = resized_img.crop(box)
81
+ processed_images.append(split_img)
82
+ assert len(processed_images) == blocks
83
+ if use_thumbnail and len(processed_images) != 1:
84
+ thumbnail_img = image.resize((image_size, image_size))
85
+ processed_images.append(thumbnail_img)
86
+ return processed_images
87
+
88
+
89
+ def load_image(image_file, input_size=448, max_num=6, upscale=False):
90
+ image = Image.open(image_file).convert('RGB')
91
+ if upscale:
92
+ image = image.resize((image.width * 2, image.height * 2), Image.BILINEAR)
93
+ transform = build_transform(input_size=input_size)
94
+ images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
95
+ pixel_values = [transform(image) for image in images]
96
+ pixel_values = torch.stack(pixel_values)
97
+ return pixel_values
98
+
99
+
100
+ def get_local_rank_and_local_world_size():
101
+ if not dist.is_available():
102
+ return 0, 1
103
+ if not dist.is_initialized():
104
+ return 0, 1
105
+
106
+ if 'SLURM_LOCALID' in os.environ:
107
+ local_rank = int(os.environ['SLURM_LOCALID'])
108
+ local_world_size = int(os.environ['SLURM_NTASKS_PER_NODE'])
109
+ return local_rank, local_world_size
110
+
111
+ if 'LOCAL_RANK' in os.environ and 'LOCAL_WORLD_SIZE' in os.environ:
112
+ return int(os.environ['LOCAL_RANK']), int(os.environ['LOCAL_WORLD_SIZE'])
113
+
114
+ raise NotImplementedError(
115
+ "Fail to get local_rank and local_world_size! "
116
+ "Please ensure that you set the environment variable "
117
+ "`LOCAL_RANK` and `LOCAL_WORLD_SIZE`"
118
+ )
119
+
120
+
121
+ def split_model(model_path):
122
+ num_gpus_per_node = 8
123
+ rank, world_size = get_rank_and_world_size()
124
+ try:
125
+ local_rank, local_world_size = get_local_rank_and_local_world_size()
126
+ except Exception:
127
+ local_rank = rank
128
+
129
+ if 'GPUS_PER_PROCESS' in os.environ:
130
+ gpus_per_process = int(os.environ['GPUS_PER_PROCESS'])
131
+ else:
132
+ gpus_per_process = 8 # default to use 8 GPUs for one model
133
+
134
+ start_gpu = local_rank * gpus_per_process
135
+ end_gpu = start_gpu + gpus_per_process
136
+
137
+ assert end_gpu <= num_gpus_per_node, f"Process {local_rank} tries to access GPU {end_gpu}, " \
138
+ f"but only {num_gpus_per_node} GPUs are available per node."
139
+
140
+ visible_devices = list(range(start_gpu, end_gpu))
141
+
142
+ device_map = {}
143
+ config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
144
+
145
+ num_gpus_for_vit = 0.5
146
+ num_layers = config.llm_config.num_hidden_layers
147
+ num_layers_per_gpu = math.ceil(num_layers / (len(visible_devices) - num_gpus_for_vit))
148
+ num_layers_per_gpu = [num_layers_per_gpu] * len(visible_devices)
149
+ num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
150
+
151
+ layer_cnt = 0
152
+ for i, num_layer in enumerate(num_layers_per_gpu):
153
+ for j in range(num_layer):
154
+ device_map[f'language_model.model.layers.{layer_cnt}'] = visible_devices[i]
155
+ layer_cnt += 1
156
+ device_map['vision_model'] = visible_devices[0]
157
+ device_map['mlp1'] = visible_devices[0]
158
+ device_map['language_model.model.tok_embeddings'] = visible_devices[0]
159
+ device_map['language_model.model.embed_tokens'] = visible_devices[0]
160
+ device_map['language_model.output'] = visible_devices[0]
161
+ device_map['language_model.model.norm'] = visible_devices[0]
162
+ device_map['language_model.lm_head'] = visible_devices[0]
163
+ device_map[f'language_model.model.layers.{num_layers - 1}'] = visible_devices[0]
164
+
165
+ return device_map, visible_devices
166
+
167
+
168
+ def split_model_old(model_name):
169
+ import math
170
+ device_map = {}
171
+ num_gpus = torch.cuda.device_count()
172
+ rank, world_size = get_rank_and_world_size()
173
+ num_gpus = num_gpus // world_size
174
+
175
+ num_layers_map = {
176
+ 'InternVL2-8B': 32,
177
+ 'InternVL2-26B': 48,
178
+ 'InternVL2-40B': 60,
179
+ 'InternVL2-Llama3-76B': 80
180
+ }
181
+
182
+ if model_name not in num_layers_map:
183
+ return 'cuda'
184
+ num_layers = num_layers_map[model_name]
185
+ # Since the first GPU will be used for ViT, treat it as 0.5 GPU.
186
+ num_layers_per_gpu = math.ceil(num_layers / (num_gpus - 0.5))
187
+ num_layers_per_gpu = [num_layers_per_gpu] * num_gpus
188
+ num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
189
+ layer_cnt = 0
190
+ for i, num_layer in enumerate(num_layers_per_gpu):
191
+ for j in range(num_layer):
192
+ device_map[f'language_model.model.layers.{layer_cnt}'] = rank + world_size * i
193
+ layer_cnt += 1
194
+ device_map['vision_model'] = rank
195
+ device_map['mlp1'] = rank
196
+ device_map['language_model.model.tok_embeddings'] = rank
197
+ device_map['language_model.model.embed_tokens'] = rank
198
+ device_map['language_model.output'] = rank
199
+ device_map['language_model.model.norm'] = rank
200
+ device_map['language_model.lm_head'] = rank
201
+ device_map['language_model.model.rotary_emb'] = rank
202
+ device_map[f'language_model.model.layers.{num_layers - 1}'] = rank
203
+ return device_map
204
+
205
+
206
+ def build_mcq_cot_prompt(line, prompt):
207
+ cot_prompt = (
208
+ "Answer the preceding multiple choice question. The last line of your response should follow "
209
+ "this format: 'Answer: \\boxed{$LETTER}' (without quotes), where LETTER is one of the options. "
210
+ "If you are uncertain or the problem is too complex, make a reasoned guess based on the "
211
+ "information provided. Avoid repeating steps indefinitely—provide your best guess even if "
212
+ "unsure. Think step by step logically, considering all relevant information before answering."
213
+ )
214
+ prompt = prompt.replace("Answer with the option's letter from the given choices directly.", '').strip()
215
+ prompt = prompt + '\n' + cot_prompt
216
+
217
+ return prompt
218
+
219
+
220
+ def build_qa_cot_prompt(line, prompt):
221
+ cot_prompt = (
222
+ "Answer the preceding question. The last line of your response should follow this format: "
223
+ "'Answer: \\boxed{$FINAL_ANSWER}' (without quotes), where 'FINAL_ANSWER' is your conclusion "
224
+ "based on the reasoning provided. If you are uncertain or the problem is too complex, make "
225
+ "a reasoned guess based on the information provided. Avoid repeating steps indefinitely—"
226
+ "provide your best guess even if unsure. Think step by step logically, considering all "
227
+ "relevant information before answering."
228
+ )
229
+ prompt = prompt + '\n' + cot_prompt
230
+
231
+ return prompt
232
+
233
+
234
+ def build_multi_choice_prompt(line, dataset=None):
235
+ question = line['question']
236
+ hint = line['hint'] if ('hint' in line and not pd.isna(line['hint'])) else None
237
+ if hint is not None:
238
+ question = hint + '\n' + question
239
+
240
+ options = {
241
+ cand: line[cand]
242
+ for cand in string.ascii_uppercase
243
+ if cand in line and not pd.isna(line[cand])
244
+ }
245
+ for key, item in options.items():
246
+ question += f'\n{key}. {item}'
247
+ prompt = question
248
+
249
+ if len(options):
250
+ prompt += '\n请直接回答选项字母。' if cn_string(
251
+ prompt) else "\nAnswer with the option's letter from the given choices directly."
252
+ else:
253
+ prompt += '\n请直接回答问题。' if cn_string(prompt) else '\nAnswer the question directly.'
254
+
255
+ return prompt
256
+
257
+
258
+ def build_video_prompt(prompt, dataset=None, max_frames=64):
259
+ for start in range(0, max_frames, 8):
260
+ images_to_remove = ''.join([f'<Image-{i}>' for i in range(start + 1, start + 9)])
261
+ prompt = prompt.replace(images_to_remove, '')
262
+ for i in range(max_frames):
263
+ prompt = prompt.replace(f'Image-{i + 1}', f'Frame-{i + 1}')
264
+ if listinstr(['MMBench-Video'], dataset):
265
+ prompt = prompt.replace('\nAnswer:', '')
266
+ elif listinstr(['Video-MME'], dataset):
267
+ prompt = prompt.replace('\nAnswer:', '')
268
+ prompt += "\nAnswer with the option's letter from the given choices directly."
269
+ elif listinstr(['MVBench'], dataset):
270
+ prompt = prompt.replace('Best option:(', '')
271
+
272
+ return prompt
273
+
274
+
275
+ def reorganize_prompt(message, image_num, dataset=None):
276
+ if dataset is not None and listinstr(['MUIRBench'], dataset):
277
+ prompt = '\n'.join([x['value'] for x in message if x['type'] == 'text'])
278
+ images_to_remove = ' '.join(['<image>'] * image_num)
279
+ prompt = prompt.replace(images_to_remove, '')
280
+ for i in range(image_num):
281
+ prompt = prompt.replace('<image>', f'<Image-{i + 1}>', 1)
282
+ prompt = ''.join([f'Image-{i + 1}: <image>\n' for i in range(image_num)]) + prompt
283
+ elif image_num == 1:
284
+ prompt = '<image>\n' + '\n'.join([x['value'] for x in message if x['type'] == 'text'])
285
+ else:
286
+ prompt, image_idx = '', 1
287
+ for x in message:
288
+ if x['type'] == 'text':
289
+ prompt += x['value']
290
+ elif x['type'] == 'image':
291
+ prompt += f'<Image-{image_idx}>'
292
+ image_idx += 1
293
+ prompt = ''.join([f'Image-{i + 1}: <image>\n' for i in range(image_num)]) + prompt
294
+ images_to_remove = ''.join([f'<Image-{i + 1}>' for i in range(image_num)])
295
+ prompt = prompt.replace(images_to_remove, '')
296
+ return prompt
297
+
298
+
299
+ mpo_prompt_with_final_answer = (
300
+ "Your task is to answer the question below. "
301
+ "Give step by step reasoning before you answer, and when you're ready to answer, "
302
+ "please use the format \"Final answer: ..\""
303
+ "\n\n"
304
+ "Question:"
305
+ "\n\n"
306
+ "{question}"
307
+ )
308
+
309
+ mpo_prompt_without_final_answer = (
310
+ "Your task is to answer the question below. "
311
+ "Give step by step reasoning. "
312
+ "\n\n"
313
+ "Question:"
314
+ "\n\n"
315
+ "{question}"
316
+ )
317
+
318
+
319
+ def mpo_post_processing(response, dataset):
320
+
321
+ def extract_answer(text):
322
+ match = re.search(r'(Final answer:|Answer:)\s*(.*)', text, re.IGNORECASE)
323
+ if match:
324
+ return match.group(2).strip()
325
+ return text
326
+
327
+ if dataset is not None and (DATASET_TYPE(dataset) in ['Y/N', 'MCQ'] or listinstr(['CRPE'], dataset)):
328
+ response = extract_answer(response).strip()
329
+ return response
330
+
331
+
332
+ def build_mpo_prompt(message, line, dataset):
333
+ if not listinstr(['LLaVABench'], dataset):
334
+
335
+ if listinstr(['MMVet'], dataset):
336
+ cot_prompt = mpo_prompt_without_final_answer
337
+ else:
338
+ cot_prompt = mpo_prompt_with_final_answer
339
+
340
+ question_orig = line['question']
341
+ if listinstr(['MathVerse', 'MathVision'], dataset):
342
+ question_orig = question_orig.split('Question:', 1)[-1].strip()
343
+ question_orig = question_orig.replace('Choices:\n', '').strip()
344
+
345
+ prompt = cot_prompt.format(question=question_orig)
346
+ else:
347
+ prompt = line['question']
348
+ message[0]['value'] = prompt
349
+ return message
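For reference, a hedged sketch of the tiling helpers defined above: dynamic_preprocess picks the grid of 448x448 tiles whose aspect ratio best matches the input (appending a thumbnail whenever more than one tile is used), and load_image stacks the normalized tiles into a [num_tiles, 3, 448, 448] tensor. The synthetic image and the /tmp path are purely illustrative.
from PIL import Image
from vlmeval.vlm.internvl.utils import dynamic_preprocess, load_image  # module added by this commit
img = Image.new('RGB', (1344, 448))            # synthetic 3:1 image
tiles = dynamic_preprocess(img, max_num=6, image_size=448, use_thumbnail=True)
print(len(tiles))                              # 4: a 3x1 grid of tiles plus the thumbnail
img.save('/tmp/demo.png')                      # load_image expects a file path, not a PIL image
pixel_values = load_image('/tmp/demo.png', max_num=6)
print(pixel_values.shape)                      # torch.Size([4, 3, 448, 448])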
VLMEvalKit/vlmeval/vlm/llava/__init__.py ADDED
@@ -0,0 +1,4 @@
1
+ from .llava import LLaVA, LLaVA_Next, LLaVA_Next2, LLaVA_OneVision, LLaVA_OneVision_HF
2
+ from .llava_xtuner import LLaVA_XTuner
3
+
4
+ __all__ = ['LLaVA', 'LLaVA_Next', 'LLaVA_XTuner', 'LLaVA_Next2', 'LLaVA_OneVision', 'LLaVA_OneVision_HF']
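The exports above cover several LLaVA variants: LLaVA wraps the original liuhaotian checkpoints, LLaVA_Next the llava-hf (transformers-format) checkpoints, LLaVA_Next2 and LLaVA_OneVision the lmms-lab releases, LLaVA_OneVision_HF presumably the transformers-format OneVision models, and LLaVA_XTuner the XTuner-trained variants. Below is a minimal, hedged sketch against the transformers-format wrapper; the image path is a placeholder and the model path is simply the default from llava.py further down.
from vlmeval.vlm.llava import LLaVA_Next
model = LLaVA_Next(model_path='llava-hf/llava-v1.6-vicuna-7b-hf')  # default path in llava.py below
message = [
    dict(type='image', value='demo.jpg'),   # placeholder image path
    dict(type='text', value='What is shown in this picture?'),
]
print(model.generate_inner(message))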
VLMEvalKit/vlmeval/vlm/llava/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (402 Bytes).
 
VLMEvalKit/vlmeval/vlm/llava/__pycache__/llava.cpython-310.pyc ADDED
Binary file (22.1 kB).
 
VLMEvalKit/vlmeval/vlm/llava/__pycache__/llava_xtuner.cpython-310.pyc ADDED
Binary file (6.95 kB).
 
VLMEvalKit/vlmeval/vlm/llava/llava.py ADDED
@@ -0,0 +1,897 @@
1
+ import torch
2
+ from PIL import Image
3
+ from abc import abstractproperty
4
+ import sys
5
+ import os.path as osp
6
+ from ..base import BaseModel
7
+ from ...smp import *
8
+ from ...dataset import DATASET_TYPE, DATASET_MODALITY
9
+ import copy
10
+ import requests
11
+
12
+
13
+ class LLaVA(BaseModel):
14
+
15
+ INSTALL_REQ = True
16
+ INTERLEAVE = True
17
+
18
+ def __init__(self, model_path="liuhaotian/llava_v1.5_7b", **kwargs):
19
+ try:
20
+ from llava.model.builder import load_pretrained_model
21
+ from llava.mm_utils import get_model_name_from_path
22
+ except Exception as err:
23
+ logging.critical(
24
+ "Please install llava from https://github.com/haotian-liu/LLaVA"
25
+ )
26
+ raise err
27
+
28
+ assert osp.exists(model_path) or splitlen(model_path) == 2
29
+ self.system_prompt = (
30
+ "A chat between a curious human and an artificial intelligence assistant. "
31
+ "The assistant gives helpful, detailed, and polite answers to the human's questions. "
32
+ )
33
+ self.stop_str = "</s>"
34
+
35
+ if model_path == "Lin-Chen/ShareGPT4V-7B":
36
+ model_name = "llava-v1.5-7b"
37
+ elif model_path == "Lin-Chen/ShareGPT4V-13B":
38
+ model_name = "llava-v1.5-13b"
39
+ else:
40
+ model_name = get_model_name_from_path(model_path)
41
+
42
+ try:
43
+ self.tokenizer, self.model, self.image_processor, self.context_len = (
44
+ load_pretrained_model(
45
+ model_path=model_path,
46
+ model_base=None,
47
+ model_name=model_name,
48
+ device="cpu",
49
+ device_map="cpu",
50
+ )
51
+ )
52
+ except Exception as err:
53
+ if "ShareGPT4V" in model_path:
54
+ import llava
55
+
56
+ logging.critical(
57
+ "Please manually remove the encoder type check in "
58
+ f"{llava.__path__[0]}/model/multimodal_encoder/builder.py "
59
+ "Line 8 to use the ShareGPT4V model. "
60
+ )
61
+ else:
62
+ logging.critical("Unknown error when loading LLaVA model.")
63
+ raise err
64
+
65
+ self.model = self.model.cuda()
66
+ self.conv_mode = "llava_v1"
67
+
68
+ kwargs_default = dict(
69
+ do_sample=False,
70
+ temperature=0,
71
+ max_new_tokens=512,
72
+ top_p=None,
73
+ num_beams=1,
74
+ use_cache=True,
75
+ ) # noqa E501
76
+ kwargs_default.update(kwargs)
77
+ self.kwargs = kwargs_default
78
+ warnings.warn(
79
+ f"Following kwargs received: {self.kwargs}, will use as generation config. "
80
+ )
81
+
82
+ def use_custom_prompt(self, dataset):
83
+ assert dataset is not None
84
+ if DATASET_TYPE(dataset) == "MCQ":
85
+ return True
86
+ return False
87
+
88
+ def build_prompt(self, line, dataset=None):
89
+ assert self.use_custom_prompt(dataset)
90
+ assert dataset is None or isinstance(dataset, str)
91
+ tgt_path = self.dump_image(line, dataset)
92
+
93
+ question = line["question"]
94
+ hint = line["hint"] if ("hint" in line and not pd.isna(line["hint"])) else None
95
+ if hint is not None:
96
+ question = hint + "\n" + question
97
+
98
+ options = {
99
+ cand: line[cand]
100
+ for cand in string.ascii_uppercase
101
+ if cand in line and not pd.isna(line[cand])
102
+ }
103
+ for key, item in options.items():
104
+ question += f"\n{key}. {item}"
105
+ prompt = question
106
+
107
+ if len(options):
108
+ prompt += (
109
+ "\n请直接回答选项字母。"
110
+ if cn_string(prompt)
111
+ else "\nAnswer with the option's letter from the given choices directly."
112
+ )
113
+ else:
114
+ prompt += (
115
+ "\n请直接回答问题。"
116
+ if cn_string(prompt)
117
+ else "\nAnswer the question directly."
118
+ )
119
+
120
+ message = [dict(type="image", value=s) for s in tgt_path]
121
+ message.append(dict(type="text", value=prompt))
122
+ return message
123
+
124
+ def concat_tilist(self, message):
125
+ text, images = "", []
126
+ for item in message:
127
+ if item["type"] == "text":
128
+ text += item["value"]
129
+ elif item["type"] == "image":
130
+ text += " <image> "
131
+ images.append(item["value"])
132
+ return text, images
133
+
134
+ def chat_inner(self, message, dataset=None):
135
+ from llava.mm_utils import (
136
+ process_images,
137
+ tokenizer_image_token,
138
+ KeywordsStoppingCriteria,
139
+ )
140
+ from llava.constants import IMAGE_TOKEN_INDEX
141
+
142
+ prompt = self.system_prompt
143
+ images = []
144
+ for utter in message:
145
+ prompt += "USER: " if utter["role"] == "user" else "ASSISTANT: "
146
+ content, images_sub = self.concat_tilist(utter["content"])
147
+ prompt += content
148
+ images.extend(images_sub)
149
+ prompt += " " if utter["role"] == "user" else self.stop_str
150
+ assert message[-1]["role"] == "user", message
151
+ prompt += "ASSISTANT: "
152
+
153
+ images = [Image.open(s).convert("RGB") for s in images]
154
+ args = abstractproperty()
155
+ args.image_aspect_ratio = "pad"
156
+ image_tensor = process_images(images, self.image_processor, args).to(
157
+ "cuda", dtype=torch.float16
158
+ )
159
+
160
+ input_ids = (
161
+ tokenizer_image_token(
162
+ prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
163
+ )
164
+ .unsqueeze(0)
165
+ .cuda()
166
+ )
167
+ keywords = [self.stop_str]
168
+ stopping_criteria = KeywordsStoppingCriteria(
169
+ keywords, self.tokenizer, input_ids
170
+ )
171
+ with torch.inference_mode():
172
+ output_ids = self.model.generate(
173
+ input_ids,
174
+ images=image_tensor,
175
+ stopping_criteria=[stopping_criteria],
176
+ **self.kwargs,
177
+ )
178
+ output = self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)[
179
+ 0
180
+ ].strip()
181
+ return output
182
+
183
+ def generate_inner(self, message, dataset=None):
184
+ from llava.mm_utils import (
185
+ process_images,
186
+ tokenizer_image_token,
187
+ KeywordsStoppingCriteria,
188
+ )
189
+ from llava.constants import IMAGE_TOKEN_INDEX
190
+
191
+ # Support interleave text and image
192
+ content, images = self.concat_tilist(message)
193
+
194
+ images = [Image.open(s).convert("RGB") for s in images]
195
+ args = abstractproperty()
196
+ args.image_aspect_ratio = "pad"
197
+ if images:
198
+ image_tensor = process_images(images, self.image_processor, args).to(
199
+ "cuda", dtype=torch.float16
200
+ )
201
+ else:
202
+ image_tensor = None
203
+
204
+ prompt = self.system_prompt + "USER: " + content + " ASSISTANT: "
205
+
206
+ input_ids = (
207
+ tokenizer_image_token(
208
+ prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
209
+ )
210
+ .unsqueeze(0)
211
+ .cuda()
212
+ )
213
+ keywords = [self.stop_str]
214
+ stopping_criteria = KeywordsStoppingCriteria(
215
+ keywords, self.tokenizer, input_ids
216
+ )
217
+ with torch.inference_mode():
218
+ output_ids = self.model.generate(
219
+ input_ids,
220
+ images=image_tensor,
221
+ stopping_criteria=[stopping_criteria],
222
+ **self.kwargs,
223
+ )
224
+
225
+ output = self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)[
226
+ 0
227
+ ].strip()
228
+ return output
229
+
230
+
231
+ class LLaVA_Next(BaseModel):
232
+
233
+ INSTALL_REQ = False
234
+ INTERLEAVE = True
235
+
236
+ def __init__(self, model_path="llava-hf/llava-v1.6-vicuna-7b-hf", **kwargs):
237
+ import transformers
238
+ from transformers import (
239
+ LlavaNextProcessor,
240
+ LlavaNextForConditionalGeneration,
241
+ AutoProcessor,
242
+ LlavaForConditionalGeneration,
243
+ )
244
+
245
+ self.model_path = model_path
246
+ if "34b" in model_path.lower():
247
+ self.processor = LlavaNextProcessor.from_pretrained(
248
+ self.model_path, use_fast=False
249
+ )
250
+ elif "interleave" in model_path.lower():
251
+ self.processor = AutoProcessor.from_pretrained(self.model_path)
252
+ else:
253
+ self.processor = LlavaNextProcessor.from_pretrained(self.model_path)
254
+ flash_attn_flag = False
255
+ try:
256
+ import flash_attn
257
+
258
+ flash_attn_flag = True
259
+ except ImportError:
260
+ pass
261
+
262
+ if flash_attn_flag:
263
+ if "interleave" in model_path.lower():
264
+ model = LlavaForConditionalGeneration.from_pretrained(
265
+ self.model_path,
266
+ torch_dtype=torch.float16,
267
+ low_cpu_mem_usage=True,
268
+ use_flash_attention_2=True,
269
+ )
270
+ else:
271
+ model = LlavaNextForConditionalGeneration.from_pretrained(
272
+ self.model_path,
273
+ torch_dtype=torch.float16,
274
+ low_cpu_mem_usage=True,
275
+ use_flash_attention_2=True,
276
+ )
277
+ else:
278
+ if "interleave" in model_path.lower():
279
+ model = LlavaForConditionalGeneration.from_pretrained(
280
+ self.model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True
281
+ )
282
+ else:
283
+ model = LlavaNextForConditionalGeneration.from_pretrained(
284
+ self.model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True
285
+ )
286
+
287
+ model = model.eval()
288
+ self.model = model.cuda()
289
+ kwargs_default = dict(
290
+ do_sample=False, temperature=0, max_new_tokens=512, top_p=None, num_beams=1
291
+ )
292
+ kwargs_default.update(kwargs)
293
+ self.kwargs = kwargs_default
294
+ warnings.warn(
295
+ f"Following kwargs received: {self.kwargs}, will use as generation config. "
296
+ )
297
+
298
+ def apply_prompt_template(self, prompt):
299
+ model_path = self.model_path.lower()
300
+ if "mistral" in model_path:
301
+ template = "[INST] PLACEHOLDER [/INST]"
302
+ elif "vicuna" in model_path:
303
+ template = (
304
+ "A chat between a curious human and an artificial intelligence assistant. "
305
+ "The assistant gives helpful, detailed, and polite answers to the human's questions. "
306
+ "USER: PLACEHOLDER ASSISTANT:"
307
+ )
308
+ elif "34b" in model_path:
309
+ template = (
310
+ "<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\nPLACEHOLDER<|im_end|>"
311
+ "<|im_start|>assistant\n"
312
+ )
313
+ else:
314
+ raise NotImplementedError(
315
+ f"Prompt template for {model_path} not implemented."
316
+ )
317
+
318
+ prompt = template.replace("PLACEHOLDER", f"<image>\n{prompt}")
319
+ return prompt
320
+
321
+ def output_process(self, answer):
322
+ if "<s>" in answer:
323
+ answer = answer.replace("<s>", "").strip()
324
+ if "[/INST]" in answer:
325
+ answer = answer.split("[/INST]")[1].strip()
326
+ elif "ASSISTANT:" in answer:
327
+ answer = answer.split("ASSISTANT:")[1].strip()
328
+ elif "assistant\n" in answer:
329
+ answer = answer.split("assistant\n")[1].strip()
330
+ elif "<|end_header_id|>\n\n" in answer:
331
+ answer = answer.split("<|end_header_id|>\n\n")[2].strip()
332
+
333
+ if "</s>" in answer:
334
+ answer = answer.split("</s>")[0].strip()
335
+ elif "<|im_end|>" in answer:
336
+ answer = answer.split("<|im_end|>")[0].strip()
337
+ elif "<|eot_id|>" in answer:
338
+ answer = answer.split("<|eot_id|>")[0].strip()
339
+ return answer
340
+
341
+ def use_custom_prompt(self, dataset):
342
+ assert dataset is not None
343
+ if DATASET_TYPE(dataset) == "MCQ":
344
+ return True
345
+ return False
346
+
347
+ def build_prompt(self, line, dataset=None):
348
+ assert self.use_custom_prompt(dataset)
349
+ assert dataset is None or isinstance(dataset, str)
350
+ tgt_path = self.dump_image(line, dataset)
351
+
352
+ question = line["question"]
353
+ hint = line["hint"] if ("hint" in line and not pd.isna(line["hint"])) else None
354
+ if hint is not None:
355
+ question = hint + "\n" + question
356
+
357
+ options = {
358
+ cand: line[cand]
359
+ for cand in string.ascii_uppercase
360
+ if cand in line and not pd.isna(line[cand])
361
+ }
362
+ for key, item in options.items():
363
+ question += f"\n{key}. {item}"
364
+ prompt = question
365
+
366
+ if len(options):
367
+ prompt += (
368
+ "\n请直接回答选项字母。"
369
+ if cn_string(prompt)
370
+ else "\nAnswer with the option's letter from the given choices directly."
371
+ )
372
+ else:
373
+ prompt += (
374
+ "\n请直接回答问题。"
375
+ if cn_string(prompt)
376
+ else "\nAnswer the question directly."
377
+ )
378
+ message = [dict(type="image", value=s) for s in tgt_path]
379
+ message.append(dict(type="text", value=prompt))
380
+ return message
381
+
382
+ def generate_inner(self, message, dataset=None):
383
+ content, images = [], []
384
+ for msg in message:
385
+ if msg["type"] == "text":
386
+ content.append({"type": msg["type"], "text": msg["value"]})
387
+ else:
388
+ content.append({"type": "image"})
389
+ images.append(Image.open(msg["value"]).convert("RGB"))
390
+ conversation = [
391
+ {
392
+ "role": "user",
393
+ "content": content,
394
+ }
395
+ ]
396
+ prompt = self.processor.apply_chat_template(
397
+ conversation, add_generation_prompt=True
398
+ )
399
+ inputs = self.processor(prompt, images, return_tensors="pt").to(
400
+ "cuda", torch.float16
401
+ )
402
+ output = self.model.generate(**inputs, **self.kwargs)
403
+ answer = self.processor.decode(output[0], skip_special_tokens=True)
404
+ answer = self.output_process(answer)
405
+ return answer
406
+
407
+
408
+ class LLaVA_Next2(BaseModel):
409
+ INSTALL_REQ = True
410
+ INTERLEAVE = True
411
+
412
+ DEFAULT_IMAGE_TOKEN = "<image>"
413
+ IMAGE_TOKEN_INDEX = -200
414
+
415
+ def __init__(self, model_path="lmms-lab/llama3-llava-next-8b", **kwargs):
416
+ assert model_path is not None
417
+ try:
418
+ from llava.model.builder import load_pretrained_model
419
+ from llava.conversation import conv_templates, SeparatorStyle
420
+ from llava.mm_utils import (
421
+ get_model_name_from_path,
422
+ tokenizer_image_token,
423
+ KeywordsStoppingCriteria,
424
+ )
425
+ except Exception as err:
426
+ logging.critical(
427
+ "Please `pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git`"
428
+ )
429
+ raise err
430
+
431
+ model_name = get_model_name_from_path(model_path)
432
+ tokenizer, model, image_processor, _ = load_pretrained_model(
433
+ model_path, None, model_name, device_map=None
434
+ )
435
+ model.cuda().eval()
436
+ model.tie_weights()
437
+
438
+ if "llama3" in model_path.lower():
439
+ conv_mode = "llava_llama_3"
440
+ elif "qwen" in model_path.lower():
441
+ conv_mode = "qwen_1_5"
442
+ self.conv_template = conv_mode
443
+ self.conv_templates = conv_templates
444
+ self.tokenizer = tokenizer
445
+ self.model = model
446
+ self.image_processor = image_processor
447
+ self.tokenizer_image_token = tokenizer_image_token
448
+ self.KeywordStoppingCriteria = KeywordsStoppingCriteria
449
+        self.SeparatorStyle = SeparatorStyle
+
+    def generate_inner(self, message, dataset=None):
+        content, images = "", []
+        for msg in message:
+            if msg["type"] == "text":
+                content += msg["value"]
+            else:
+                images.append(Image.open(msg["value"]).convert("RGB"))
+                content += self.DEFAULT_IMAGE_TOKEN + "\n"
+
+        preprocess = self.image_processor.preprocess
+        image_tokenizer = self.tokenizer_image_token
+        image_tensor = [
+            preprocess(f, return_tensors="pt")["pixel_values"][0].half().cuda()
+            for f in images
+        ]
+        image_tensor = torch.stack(image_tensor)
+
+        conv = copy.deepcopy(self.conv_templates[self.conv_template])
+        conv.append_message(conv.roles[0], content)
+        conv.append_message(conv.roles[1], None)
+        prompt_question = conv.get_prompt()
+
+        input_ids = image_tokenizer(
+            prompt_question, self.tokenizer, self.IMAGE_TOKEN_INDEX, return_tensors="pt"
+        )
+        input_ids = input_ids.unsqueeze(0).cuda()
+
+        stop_str = conv.sep if conv.sep_style != self.SeparatorStyle.TWO else conv.sep2
+        keywords = [stop_str]
+        stopping_criteria = self.KeywordStoppingCriteria(
+            keywords, self.tokenizer, input_ids
+        )
+
+        cont = self.model.generate(
+            input_ids,
+            images=image_tensor,
+            do_sample=False,
+            temperature=0,
+            max_new_tokens=512,
+            stopping_criteria=[stopping_criteria],
+        )
+        text_outputs = self.tokenizer.batch_decode(cont, skip_special_tokens=True)[0]
+        return text_outputs
+
+
+class LLaVA_OneVision(BaseModel):
+    INSTALL_REQ = True
+    INTERLEAVE = True
+    VIDEO_LLM = True
+    DEFAULT_IMAGE_TOKEN = "<image>"
+    IMAGE_TOKEN_INDEX = -200
+
+    # Split the 72B variant across GPUs (adapted from the InternVL2-Llama3-76B splitting logic)
+    def split_model(self, model_path):
+        import math
+
+        device_map = {}
+        num_gpus = torch.cuda.device_count()
+        rank, world_size = get_rank_and_world_size()
+        num_gpus = num_gpus // world_size
+        if "72b" not in model_path.lower():
+            return None
+        # embed_tokens, vision_tower, mm_projector, lm_head are treated as 2 layers
+        num_layers = 80 + 8
+        num_layers_per_gpu = math.ceil(num_layers / num_gpus)
+        num_layers_per_gpu = [num_layers_per_gpu] * num_gpus
+        num_layers_per_gpu[0] -= 6
+        num_layers_per_gpu[-1] -= 2
+        layer_cnt = 0
+        for i, num_layer in enumerate(num_layers_per_gpu):
+            for j in range(num_layer):
+                device_map[f"model.layers.{layer_cnt}"] = rank + world_size * i
+                layer_cnt += 1
+        last_gpu = rank + world_size * (num_gpus - 1)
+        device_map["model.image_newline"] = rank
+        device_map["model.embed_tokens"] = rank
+        device_map["model.norm"] = rank
+        device_map["model.vision_tower"] = rank
+        device_map["model.vision_resampler"] = rank
+        device_map["model.mm_projector"] = rank
+        device_map["lm_head"] = last_gpu
+        return device_map
+
+    def __init__(self, model_path="lmms-lab/llava-onevision-qwen2-7b-si", **kwargs):
+        assert model_path is not None
+        try:
+            from llava.model.builder import load_pretrained_model
+            from llava.conversation import conv_templates, SeparatorStyle
+            from llava.mm_utils import (
+                get_model_name_from_path,
+                process_images,
+                tokenizer_image_token,
+                KeywordsStoppingCriteria,
+            )  # noqa: E501
+        except Exception as err:
+            logging.critical(
+                "Please `pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git`"
+            )
+            raise err
+
+        video_kwargs_default = dict(
+            overwrite=True, mm_spatial_pool_mode="average", force_sample=True
+        )
+        video_kwargs_default.update(kwargs)
+        self.video_kwargs = video_kwargs_default
+
+        overwrite_config = None
+        if "video" in model_path.lower():
+            if self.video_kwargs["overwrite"]:
+                overwrite_config = {}
+                overwrite_config["mm_spatial_pool_mode"] = self.video_kwargs[
+                    "mm_spatial_pool_mode"
+                ]
+
+        rank, world_size = get_rank_and_world_size()
+        model_name = get_model_name_from_path(model_path)
+        device_map = self.split_model(model_path)
+
+        if device_map is None:
+            if auto_split_flag():
+                assert world_size == 1, 'Only world_size == 1 is supported when AUTO_SPLIT is set for non-72B LLaVA-OneVision'
+                logging.warning('Currently, we only support splitting the non-72B model across all GPUs.')
+                tokenizer, model, image_processor, _ = load_pretrained_model(
+                    model_path,
+                    None,
+                    model_name,
+                    device_map="auto",
+                    overwrite_config=overwrite_config,
+                )
+            else:
+                tokenizer, model, image_processor, _ = load_pretrained_model(
+                    model_path,
+                    None,
+                    model_name,
+                    device_map="cpu",
+                    overwrite_config=overwrite_config,
+                )
+                model.cuda()
+        else:
+            tokenizer, model, image_processor, _ = load_pretrained_model(
+                model_path,
+                None,
+                model_name,
+                device_map=device_map,
+                overwrite_config=overwrite_config,
+            )
+        model.eval()
+        model.tie_weights()
+
+        if "llava" in model_path.lower():
+            conv_mode = "qwen_1_5"
+        if 'llava-video' in model_path.lower():
+            self.nframe = 64
+        else:
+            self.nframe = 16
+            if "72b" in model_path.lower():
+                self.nframe = 32
+
+        if "video" in model_path.lower():
+            self.force_sample = self.video_kwargs["force_sample"]
+        else:
+            self.force_sample = False
+
+        self.conv_template = conv_mode
+        self.conv_templates = conv_templates
+        self.tokenizer = tokenizer
+        self.model = model
+        self.image_processor = image_processor
+        self.tokenizer_image_token = tokenizer_image_token
+        self.process_images = (
+            process_images  # Store process_images as a class attribute
+        )
+        self.KeywordStoppingCriteria = KeywordsStoppingCriteria
+        self.SeparatorStyle = SeparatorStyle
+
+    def generate_inner_image(self, message, dataset=None):
+        content, images = "", []
+        image_sizes = []  # Store image sizes
+
+        for msg in message:
+            if msg["type"] == "text":
+                content += msg["value"]
+            else:
+                img = Image.open(msg["value"]).convert("RGB")
+                images.append(img)
+                image_sizes.append(img.size)  # Store the size of each image
+                content += self.DEFAULT_IMAGE_TOKEN + "\n"
+
+        # Process images using the class attribute self.process_images
+        image_tensor = self.process_images(
+            images, self.image_processor, self.model.config
+        )
+        image_tensor = [
+            _image.to(dtype=torch.float16, device="cuda") for _image in image_tensor
+        ]
+
+        conv = copy.deepcopy(self.conv_templates[self.conv_template])
+        conv.append_message(conv.roles[0], content)
+        conv.append_message(conv.roles[1], None)
+        prompt_question = conv.get_prompt()
+
+        input_ids = self.tokenizer_image_token(
+            prompt_question, self.tokenizer, self.IMAGE_TOKEN_INDEX, return_tensors="pt"
+        )
+        input_ids = input_ids.unsqueeze(0).cuda()
+
+        stop_str = conv.sep if conv.sep_style != self.SeparatorStyle.TWO else conv.sep2
+        keywords = [stop_str]
+        stopping_criteria = self.KeywordStoppingCriteria(
+            keywords, self.tokenizer, input_ids
+        )
+
+        # Pass image sizes along with other parameters
+        cont = self.model.generate(
+            input_ids,
+            images=image_tensor,
+            image_sizes=image_sizes,  # Pass the image sizes here
+            do_sample=False,
+            temperature=0,
+            max_new_tokens=512,
+            stopping_criteria=[stopping_criteria],
+        )
+        text_outputs = self.tokenizer.batch_decode(cont, skip_special_tokens=True)[0]
+        return text_outputs
+
+    def generate_inner_video(self, message, dataset=None):
+        content, text_content, visual_content, videos = "", "", "", []
+
+        for msg in message:
+            if msg["type"] == "text":
+                text_content += msg["value"]
+            else:
+                videos.append(msg["value"])
+                visual_content += self.DEFAULT_IMAGE_TOKEN + "\n"
+
+        if len(videos) > 1:
+            raise ValueError(
+                "LLaVA-OneVision does not support multiple videos as input."
+            )
+
+        video_frames, frame_time, video_time = self.load_video(
+            videos[0], self.nframe, self.force_sample
+        )
+
+        time_instruction = (
+            f"The video lasts for {video_time:.2f} seconds, "
+            f"and {len(video_frames)} frames are uniformly sampled from it. "
+            f"These frames are located at {frame_time}. "
+            f"Please answer the following questions related to this video.\n"
+        )
+
+        if self.force_sample:
+            content = visual_content + time_instruction + text_content
+        else:
+            content = visual_content + text_content
+
+        image_tensors = []
+        frames = (
+            self.image_processor.preprocess(video_frames, return_tensors="pt")[
+                "pixel_values"
+            ]
+            .half()
+            .cuda()
+        )
+        image_tensors.append(frames)
+
+        conv = copy.deepcopy(self.conv_templates[self.conv_template])
+        conv.append_message(conv.roles[0], content)
+        conv.append_message(conv.roles[1], None)
+        prompt_question = conv.get_prompt()
+
+        input_ids = self.tokenizer_image_token(
+            prompt_question, self.tokenizer, self.IMAGE_TOKEN_INDEX, return_tensors="pt"
+        )
+        input_ids = input_ids.unsqueeze(0).cuda()
+        image_sizes = [frame.size for frame in video_frames]
+        modalities = ["video"] * len(video_frames)
+
+        stop_str = conv.sep if conv.sep_style != self.SeparatorStyle.TWO else conv.sep2
+        keywords = [stop_str]
+        stopping_criteria = self.KeywordStoppingCriteria(
+            keywords, self.tokenizer, input_ids
+        )
+
+        # Pass image sizes along with other parameters
+        cont = self.model.generate(
+            input_ids,
+            images=image_tensors,
+            image_sizes=image_sizes,  # Pass the image sizes here
+            do_sample=False,
+            temperature=0,
+            max_new_tokens=512,
+            modalities=modalities,
+            stopping_criteria=[stopping_criteria],
+        )
+        text_outputs = self.tokenizer.batch_decode(cont, skip_special_tokens=True)[0]
+        return text_outputs
+
+    def load_video(self, video_path, max_frames_num, force_sample=False, fps=1):
+        from decord import VideoReader, cpu
+        import numpy as np
+
+        if max_frames_num == 0:
+            return np.zeros((1, 336, 336, 3))
+        vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
+        total_frame_num = len(vr)
+        video_time = total_frame_num / vr.get_avg_fps()
+        fps = round(vr.get_avg_fps() / fps)
+        frame_idx = [i for i in range(0, len(vr), fps)]
+        frame_time = [i / fps for i in frame_idx]
+        if len(frame_idx) > max_frames_num or force_sample:
+            sample_fps = max_frames_num
+            uniform_sampled_frames = np.linspace(
+                0, total_frame_num - 1, sample_fps, dtype=int
+            )
+            frame_idx = uniform_sampled_frames.tolist()
+            frame_time = [i / vr.get_avg_fps() for i in frame_idx]
+        frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
+        spare_frames = vr.get_batch(frame_idx).asnumpy()
+        return spare_frames, frame_time, video_time
+
+    def generate_inner(self, message, dataset=None):
+        if DATASET_MODALITY(dataset) == 'VIDEO':
+            return self.generate_inner_video(message, dataset)
+        else:
+            return self.generate_inner_image(message, dataset)
+
+
+class LLaVA_OneVision_HF(BaseModel):
+    INSTALL_REQ = True
+    INTERLEAVE = True
+    VIDEO_LLM = True
+    DEFAULT_IMAGE_TOKEN = "<image>"
+    IMAGE_TOKEN_INDEX = -200
+
+    def __init__(self, model_path="llava-hf/llava-onevision-qwen2-0.5b-ov-hf", **kwargs):
+        from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
+        assert model_path is not None, "Model path must be provided."
+        self.model = LlavaOnevisionForConditionalGeneration.from_pretrained(
+            model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True
+        ).to(0)
+        self.processor = AutoProcessor.from_pretrained(model_path)
+
+        self.video_kwargs = kwargs.get("video_kwargs", {})
+        self.force_sample = self.video_kwargs.get("force_sample", False)
+        self.nframe = kwargs.get("nframe", 8)
+        self.fps = 1
+
+    def generate_inner_image(self, message, dataset=None):
+        content, images = "", []
+        image_sizes = []
+
+        for msg in message:
+            if msg["type"] == "text":
+                content += msg["value"]
+            elif msg["type"] == "image":
+                img = Image.open(msg["value"]).convert("RGB")
+                images.append(img)
+                image_sizes.append(img.size)
+                content += self.DEFAULT_IMAGE_TOKEN + "\n"
+
+        conversation = [
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": content.split("\n", 1)[-1]},
+                    {"type": "image"},
+                ],
+            }
+        ]
+        prompt = self.processor.apply_chat_template(conversation, add_generation_prompt=True)
+        inputs = self.processor(images=images, text=prompt, return_tensors="pt").to(0, torch.float16)
+
+        output = self.model.generate(**inputs, max_new_tokens=100)
+        return self.processor.decode(output[0], skip_special_tokens=True)
+
+    def generate_inner_video(self, message, dataset=None):
+        content, text_content, visual_content, videos = "", "", "", []
+
+        for msg in message:
+            if msg["type"] == "text":
+                text_content += msg["value"]
+            elif msg["type"] == "video":
+                videos.append(msg["value"])
+                visual_content += self.DEFAULT_IMAGE_TOKEN + "\n"
+
+        if len(videos) > 1:
+            raise ValueError("LLaVA-OneVision does not support multiple videos as input.")
+
+        video_frames, frame_time, video_time = self.load_video(
+            videos[0], self.nframe, fps=1, force_sample=self.force_sample
+        )
+
+        time_instruction = (
+            f"The video lasts for {video_time:.2f} seconds, "
+            f"and {len(video_frames)} frames are uniformly sampled from it. "
+            f"These frames are located at {frame_time}. "
+            f"Please answer the following questions related to this video.\n"
+        )
+
+        content = visual_content + time_instruction + text_content
+        conversation = [
+            {
+                "role": "user",
+                "content": [{"type": "text", "text": content}, {"type": "video"}],
+            }
+        ]
+        prompt = self.processor.apply_chat_template(conversation, add_generation_prompt=True)
+
+        inputs = self.processor(videos=video_frames, text=prompt, return_tensors="pt").to(0, torch.float16)
+        output = self.model.generate(**inputs, max_new_tokens=512)
+        return self.processor.decode(output[0], skip_special_tokens=True)
+
+    def load_video(self, video_path, max_frames_num, fps=1, force_sample=False):
+        from decord import VideoReader, cpu
+        import numpy as np
+
+        vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
+        total_frame_num = len(vr)
+        avg_fps = vr.get_avg_fps()
+
+        if avg_fps == 0:
+            raise ValueError(f"Video '{video_path}' has an average FPS of 0, which is invalid.")
+        if fps <= 0:
+            raise ValueError("FPS argument must be greater than 0.")
+
+        effective_fps = round(avg_fps / fps)
+        frame_idx = list(range(0, total_frame_num, effective_fps))
+        frame_time = [i / avg_fps for i in frame_idx]
+
+        if len(frame_idx) > max_frames_num or force_sample:
+            uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
+            frame_idx = uniform_sampled_frames.tolist()
+            frame_time = [i / avg_fps for i in frame_idx]
+
+        frame_time_str = ", ".join([f"{t:.2f}s" for t in frame_time])
+        video_frames = vr.get_batch(frame_idx).asnumpy()
+        video_time = total_frame_num / avg_fps
+
+        return video_frames, frame_time_str, video_time
+
+    def generate_inner(self, message, dataset=None):
+        if DATASET_MODALITY(dataset) == "VIDEO":
+            return self.generate_inner_video(message, dataset)
+        else:
+            return self.generate_inner_image(message, dataset)
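A minimal usage sketch for the wrappers above (not part of the diff): they consume VLMEvalKit's message format, a list of {"type": ..., "value": ...} dicts, and return the decoded answer from generate_inner. The import path and the local image path below are illustrative assumptions.

    # Hypothetical import path; adjust to wherever the class is exported in your install.
    from vlmeval.vlm.llava import LLaVA_OneVision_HF

    model = LLaVA_OneVision_HF(model_path="llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
    message = [
        {"type": "image", "value": "demo.jpg"},          # hypothetical local image
        {"type": "text", "value": "Describe this image."},
    ]
    print(model.generate_inner(message))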
VLMEvalKit/vlmeval/vlm/llava/llava_xtuner.py ADDED
@@ -0,0 +1,239 @@
+import os
+import os.path as osp
+import string
+import sys
+import warnings
+
+import pandas as pd
+import torch
+from huggingface_hub import snapshot_download
+from PIL import Image
+from transformers import (AutoModel, AutoModelForCausalLM, AutoTokenizer,
+                          CLIPImageProcessor, CLIPVisionModel,
+                          GenerationConfig, StoppingCriteriaList)
+
+from ..base import BaseModel
+from ...smp import *
+from ...dataset import DATASET_TYPE
+
+
+class LLaVA_XTuner(BaseModel):
+
+    INSTALL_REQ = True
+    INTERLEAVE = False
+
+    def __init__(self,
+                 llava_path,
+                 llm_path=None,
+                 visual_encoder_path='openai/clip-vit-large-patch14-336',
+                 visual_select_layer=-2,
+                 prompt_template=None,
+                 stop_words=[],
+                 torch_dtype=torch.float16):
+        try:
+            from peft import PeftModel
+            from xtuner.utils import PROMPT_TEMPLATE, StopWordStoppingCriteria
+        except Exception as err:
+            logging.critical(
+                'Please install xtuner with `pip install -U xtuner` before '
+                'using LLaVA_XTuner')
+            raise err
+
+        if not osp.isdir(llava_path):
+            cache_path = get_cache_path(llava_path)
+            if cache_path is not None:
+                llava_path = cache_path
+            else:
+                llava_path = snapshot_download(repo_id=llava_path)
+        assert osp.exists(llava_path) and osp.isdir(llava_path)
+
+        # build llm
+        if 'llm' in os.listdir(llava_path):
+            assert llm_path is None, (
+                "Please don't specify the `llm_path` since passed "
+                '`llava_path` contains a LLM!')
+            llm_path = osp.join(llava_path, 'llm')
+        else:
+            assert llm_path is not None, 'Please specify the `llm_path`!'
+
+        llm = AutoModelForCausalLM.from_pretrained(llm_path,
+                                                   trust_remote_code=True,
+                                                   torch_dtype=torch_dtype,
+                                                   device_map='cpu')
+        tokenizer = AutoTokenizer.from_pretrained(llm_path,
+                                                  trust_remote_code=True,
+                                                  encode_special_tokens=True)
+        print(f'Load LLM from {llm_path}')
+
+        # build visual_encoder
+        if 'visual_encoder' in os.listdir(llava_path):
+            assert visual_encoder_path is None, (
+                "Please don't specify the `visual_encoder_path` since passed "
+                '`llava_path` contains a visual encoder!')
+            visual_encoder_path = osp.join(llava_path, 'visual_encoder')
+        else:
+            assert visual_encoder_path is not None, (
+                'Please specify the `visual_encoder_path`!')
+        visual_encoder = CLIPVisionModel.from_pretrained(
+            visual_encoder_path, torch_dtype=torch_dtype, device_map='cpu')
+        image_processor = CLIPImageProcessor.from_pretrained(
+            visual_encoder_path)
+        print(f'Load visual_encoder from {visual_encoder_path}')
+
+        # load adapter
+        if 'llm_adapter' in os.listdir(llava_path):
+            adapter_path = osp.join(llava_path, 'llm_adapter')
+            llm = PeftModel.from_pretrained(llm,
+                                            adapter_path,
+                                            trust_remote_code=True,
+                                            device_map='cpu')
+            print(f'Load LLM adapter from {llava_path}')
+        if 'visual_encoder_adapter' in os.listdir(llava_path):
+            adapter_path = osp.join(llava_path, 'visual_encoder_adapter')
+            visual_encoder = PeftModel.from_pretrained(visual_encoder,
+                                                       adapter_path,
+                                                       trust_remote_code=True,
+                                                       device_map='cpu')
+            print(f'Load visual_encoder adapter from {llava_path}')
+
+        # build projector
+        projector_path = osp.join(llava_path, 'projector')
+        projector = AutoModel.from_pretrained(projector_path,
+                                              trust_remote_code=True,
+                                              torch_dtype=torch_dtype,
+                                              device_map='cpu')
+        print(f'Load projector from {llava_path}')
+
+        llm.eval()
+        visual_encoder.eval()
+        projector.eval()
+
+        self.llm = llm.cuda()
+        self.tokenizer = tokenizer
+        self.visual_encoder = visual_encoder.cuda()
+        self.image_processor = image_processor
+        self.projector = projector.cuda()
+        self.visual_select_layer = visual_select_layer
+        if prompt_template is not None:
+            # modified prompt template
+            if prompt_template == 'llama3_chat':
+                self.prompt_template = dict(
+                    SYSTEM=('<|start_header_id|>system<|end_header_id|>\n\n'
+                            '{system}<|eot_id|>'),
+                    INSTRUCTION=(
+                        '<|start_header_id|>user<|end_header_id|>\n\n{input}<|eot_id|>'
+                        '<|start_header_id|>assistant<|end_header_id|>\n\n'),
+                    SUFFIX='<|eot_id|>',
+                    SUFFIX_AS_EOS=True,
+                    STOP_WORDS=['<|eot_id|>'])
+            else:
+                self.prompt_template = PROMPT_TEMPLATE[prompt_template]
+            stop_words += self.prompt_template.get('STOP_WORDS', [])
+        else:
+            self.prompt_template = None
+
+        self.stop_criteria = StoppingCriteriaList()
+        for word in stop_words:
+            self.stop_criteria.append(
+                StopWordStoppingCriteria(self.tokenizer, word))
+
+    def build_gen_config(self, dataset):
+        gen_kwargs = dict(max_new_tokens=512,
+                          do_sample=True,
+                          temperature=1,
+                          num_beams=5,
+                          eos_token_id=self.tokenizer.eos_token_id,
+                          pad_token_id=self.tokenizer.pad_token_id
+                          if self.tokenizer.pad_token_id is not None else
+                          self.tokenizer.eos_token_id)
+        # For single word generation
+        if (dataset is not None
+                and DATASET_TYPE(dataset) in ['MCQ', 'Y/N']):
+            gen_kwargs.update(
+                dict(max_new_tokens=5, do_sample=False, num_beams=1))
+        return GenerationConfig(**gen_kwargs)
+
+    def use_custom_prompt(self, dataset):
+        assert dataset is not None
+        if DATASET_TYPE(dataset) == 'MCQ':
+            return True
+        return False
+
+    def build_prompt(self, line, dataset=None):
+        assert self.use_custom_prompt(dataset)
+        assert dataset is None or isinstance(dataset, str)
+        tgt_path = self.dump_image(line, dataset)
+
+        question = line['question']
+        hint = line['hint'] if ('hint' in line
+                                and not pd.isna(line['hint'])) else None
+        if hint is not None:
+            question = hint + '\n' + question
+
+        options = {
+            cand: line[cand]
+            for cand in string.ascii_uppercase
+            if cand in line and not pd.isna(line[cand])
+        }
+        for key, item in options.items():
+            question += f'\n{key}. {item}'
+
+        if not cn_string(question):
+            prompt = question + '\n' + ("Answer with the option's letter "
+                                        'from the given choices directly.')
+        else:
+            prompt = question + '\n' + '请直接回答选项字母。'
+
+        message = [dict(type='text', value=prompt)]
+        message.extend([dict(type='image', value=s) for s in tgt_path])
+        return message
+
+    def generate_inner(self, message, dataset=None):
+        from xtuner.dataset.utils import expand2square
+        from xtuner.model.utils import prepare_inputs_labels_for_multimodal
+        from xtuner.utils import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
+        prompt, image_path = self.message_to_promptimg(message, dataset=dataset)
+        prompt = prompt.replace('<image>', '')
+        image = Image.open(image_path).convert('RGB')
+        image = expand2square(
+            image,
+            tuple(int(x * 255) for x in self.image_processor.image_mean))
+        image = self.image_processor.preprocess(
+            image, return_tensors='pt')['pixel_values'][0]
+        image = image.cuda().unsqueeze(0)
+        visual_outputs = self.visual_encoder(image, output_hidden_states=True)
+        pixel_values = self.projector(
+            visual_outputs.hidden_states[self.visual_select_layer][:, 1:])
+
+        inputs = DEFAULT_IMAGE_TOKEN + '\n' + prompt
+
+        if self.prompt_template:
+            inputs = self.prompt_template['INSTRUCTION'].format(input=inputs)
+
+        chunk_encode = []
+        for idx, chunk in enumerate(inputs.split(DEFAULT_IMAGE_TOKEN)):
+            if idx == 0:
+                cur_encode = self.tokenizer(chunk)
+            else:
+                cur_encode = self.tokenizer(chunk, add_special_tokens=False)
+            chunk_encode.append(cur_encode)
+        assert len(chunk_encode) == 2
+        ids = []
+        for idx, cur_chunk_encode in enumerate(chunk_encode):
+            ids.extend(cur_chunk_encode['input_ids'])
+            if idx != len(chunk_encode) - 1:
+                ids.append(IMAGE_TOKEN_INDEX)
+        ids = torch.tensor(ids).cuda().unsqueeze(0)
+        mm_inputs = prepare_inputs_labels_for_multimodal(
+            llm=self.llm, input_ids=ids, pixel_values=pixel_values)
+
+        gen_config = self.build_gen_config(dataset)
+        generate_output = self.llm.generate(
+            **mm_inputs,
+            generation_config=gen_config,
+            streamer=None,
+            bos_token_id=self.tokenizer.bos_token_id,
+            stopping_criteria=self.stop_criteria)
+        predict = self.tokenizer.decode(generate_output[0],
+                                        skip_special_tokens=True).strip()
+        return predict
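A minimal usage sketch for LLaVA_XTuner (not part of the diff). The checkpoint layout follows what __init__ expects above (llm/, visual_encoder/, projector/ subfolders or a HF repo id); the local path and the prompt template key are illustrative assumptions.

    model = LLaVA_XTuner(
        llava_path="/path/to/llava_xtuner_checkpoint",  # hypothetical checkpoint directory
        prompt_template="vicuna",                        # any key defined in xtuner's PROMPT_TEMPLATE
    )
    message = [
        {"type": "image", "value": "demo.jpg"},          # hypothetical local image
        {"type": "text", "value": "What is shown in the picture?"},
    ]
    print(model.generate_inner(message))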
VLMEvalKit/vlmeval/vlm/misc/blip2_instruct_vicuna13b.yaml ADDED
@@ -0,0 +1,43 @@
+ # Copyright (c) 2022, salesforce.com, inc.
+ # All rights reserved.
+ # SPDX-License-Identifier: BSD-3-Clause
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+
+model:
+  arch: instruct_vicuna13b
+  load_finetuned: False
+  load_pretrained: True
+
+  pretrained: "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/instruct_blip_vicuna13b_trimmed.pth"
+  finetuned: ""
+
+  # vit encoder
+  image_size: 224
+  drop_path_rate: 0
+  use_grad_checkpoint: False
+  vit_precision: "fp16"
+  freeze_vit: True
+
+  # Q-Former
+  num_query_token: 32
+
+  # path to Vicuna checkpoint
+  llm_model: "Please set the path to your vicuna-13b-v1.1"
+
+  # generation configs
+  prompt: ""
+
+
+preprocess:
+  vis_processor:
+    train:
+      name: "blip2_image_train"
+      image_size: 224
+    eval:
+      name: "blip_image_eval"
+      image_size: 224
+  text_processor:
+    train:
+      name: "blip_caption"
+    eval:
+      name: "blip_caption"
VLMEvalKit/vlmeval/vlm/misc/blip2_instruct_vicuna7b.yaml ADDED
@@ -0,0 +1,43 @@
+ # Copyright (c) 2022, salesforce.com, inc.
+ # All rights reserved.
+ # SPDX-License-Identifier: BSD-3-Clause
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+
+model:
+  arch: instruct_vicuna7b
+  load_finetuned: False
+  load_pretrained: True
+
+  pretrained: "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/instruct_blip_vicuna7b_trimmed.pth"
+  finetuned: ""
+
+  # vit encoder
+  image_size: 224
+  drop_path_rate: 0
+  use_grad_checkpoint: False
+  vit_precision: "fp16"
+  freeze_vit: True
+
+  # Q-Former
+  num_query_token: 32
+
+  # path to Vicuna checkpoint
+  llm_model: "Please set the path to your vicuna-7b-v1.1"
+
+  # generation configs
+  prompt: ""
+
+
+preprocess:
+  vis_processor:
+    train:
+      name: "blip2_image_train"
+      image_size: 224
+    eval:
+      name: "blip_image_eval"
+      image_size: 224
+  text_processor:
+    train:
+      name: "blip_caption"
+    eval:
+      name: "blip_caption"
VLMEvalKit/vlmeval/vlm/misc/minigpt4_13b_eval.yaml ADDED
@@ -0,0 +1,37 @@
+model:
+  arch: minigpt4
+  model_type: pretrain_vicuna_7b
+  max_txt_len: 160
+  end_sym: "###"
+  low_resource: True
+  prompt_template: '###Human: {} ###Assistant: '
+  ckpt: "please set this value to the path of pretrained checkpoint"
+
+  # vit encoder
+  image_size: 224
+  drop_path_rate: 0
+  use_grad_checkpoint: False
+  vit_precision: "fp16"
+  freeze_vit: True
+  freeze_qformer: True
+
+  # Q-Former
+  num_query_token: 32
+
+  # generation configs
+  prompt: ""
+
+  llama_model: "please set this value to the path of vicuna-13b-v0"
+
+datasets:
+  cc_sbu_align:
+    vis_processor:
+      train:
+        name: "blip2_image_eval"
+        image_size: 224
+    text_processor:
+      train:
+        name: "blip_caption"
+
+run:
+  task: image_text_pretrain
VLMEvalKit/vlmeval/vlm/misc/minigpt4_7b_eval.yaml ADDED
@@ -0,0 +1,38 @@
+model:
+  arch: minigpt4
+  model_type: pretrain_vicuna_7b
+  max_txt_len: 160
+  end_sym: "###"
+  low_resource: True
+  prompt_template: '###Human: {} ###Assistant: '
+  ckpt: "please set this value to the path of pretrained checkpoint"
+
+  # vit encoder
+  image_size: 224
+  drop_path_rate: 0
+  use_grad_checkpoint: False
+  vit_precision: "fp16"
+  freeze_vit: True
+  freeze_qformer: True
+
+  # Q-Former
+  num_query_token: 32
+
+  # generation configs
+  prompt: ""
+
+  llama_model: "please set this value to the path of vicuna-7b-v0"
+
+
+datasets:
+  cc_sbu_align:
+    vis_processor:
+      train:
+        name: "blip2_image_eval"
+        image_size: 224
+    text_processor:
+      train:
+        name: "blip_caption"
+
+run:
+  task: image_text_pretrain
VLMEvalKit/vlmeval/vlm/misc/minigptv2_eval.yaml ADDED
@@ -0,0 +1,36 @@
+model:
+  arch: minigpt_v2
+  model_type: pretrain
+  max_txt_len: 160
+  end_sym: "</s>"
+  low_resource: True
+  prompt_template: '[INST] {} [/INST]'
+  ckpt: "please set this value to the path of pretrained checkpoint"
+  lora_r: 64
+  lora_alpha: 16
+
+  # vit encoder
+  image_size: 448
+  drop_path_rate: 0
+  use_grad_checkpoint: False
+  vit_precision: "fp16"
+  freeze_vit: True
+
+  # generation configs
+  prompt: ""
+
+  # LLM
+  llama_model: "please set this value to the path of llama2-chat-7b"
+
+datasets:
+  cc_sbu_align:
+    vis_processor:
+      train:
+        name: "blip2_image_eval"
+        image_size: 448
+    text_processor:
+      train:
+        name: "blip_caption"
+
+run:
+  task: image_text_pretrain
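The BLIP-2 and MiniGPT-4 configs above ship with placeholder values ("Please set the path ...") that must be filled in before evaluation. A minimal sketch of doing that programmatically with PyYAML (file name and checkpoint paths are illustrative assumptions):

    import yaml

    # Load one of the eval configs shown above and patch its placeholder paths.
    with open("minigpt4_7b_eval.yaml") as f:
        cfg = yaml.safe_load(f)
    cfg["model"]["ckpt"] = "/path/to/minigpt4_stage2_checkpoint.pth"   # hypothetical path
    cfg["model"]["llama_model"] = "/path/to/vicuna-7b-v0"              # hypothetical path
    with open("minigpt4_7b_eval.yaml", "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)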