tuandunghcmut
/

vlm_clone_2

tuandunghcmut committed on Apr 10

Commit

0d2c90e

verified ·

1 Parent(s): 80ed450

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes. See raw diff

Files changed (50)

Ovis/.ipynb_checkpoints/README-checkpoint.md +110 -0
Ovis/Ovis1.6-Gemma2-27B/1c18c1e92281df303545f22c27200f046fc44ec4/__init__.py +0 -0
Ovis/Ovis1.6-Gemma2-27B/1c18c1e92281df303545f22c27200f046fc44ec4/__pycache__/configuration_ovis.cpython-310.pyc +0 -0
Ovis/Ovis1.6-Gemma2-27B/1c18c1e92281df303545f22c27200f046fc44ec4/__pycache__/configuration_ovis.cpython-39.pyc +0 -0
Ovis/Ovis1.6-Gemma2-27B/1c18c1e92281df303545f22c27200f046fc44ec4/__pycache__/modeling_ovis.cpython-310.pyc +0 -0
Ovis/Ovis1.6-Gemma2-27B/1c18c1e92281df303545f22c27200f046fc44ec4/__pycache__/modeling_ovis.cpython-39.pyc +0 -0
Ovis/Ovis1.6-Gemma2-27B/1c18c1e92281df303545f22c27200f046fc44ec4/configuration_ovis.py +201 -0
Ovis/Ovis1.6-Gemma2-27B/1c18c1e92281df303545f22c27200f046fc44ec4/modeling_ovis.py +625 -0
Ovis/Ovis1.6-Gemma2-27B/__init__.py +0 -0
Ovis/docs/license/QWEN_LICENSE +53 -0
Ovis/ovis.egg-info/PKG-INFO +30 -0
Ovis/ovis.egg-info/SOURCES.txt +24 -0
Ovis/ovis.egg-info/dependency_links.txt +1 -0
Ovis/ovis.egg-info/requires.txt +26 -0
Ovis/ovis.egg-info/top_level.txt +1 -0
Ovis/ovis/__init__.py +3 -0
Ovis/ovis/model/__init__.py +2 -0
Ovis/ovis/model/__pycache__/conversation_formatter.cpython-311.pyc +0 -0
Ovis/ovis/model/__pycache__/modeling_ovis.cpython-310.pyc +0 -0
Ovis/ovis/model/modeling_ovis.py +434 -0
Ovis/ovis/model/visual_tokenizer/base_visual_tokenizer.py +264 -0
Ovis/ovis/model/visual_tokenizer/clip_visual_tokenizer.py +41 -0
Ovis/ovis/model/visual_tokenizer/siglip_visual_tokenizer.py +43 -0
Ovis/ovis/serve/__pycache__/runner.cpython-310.pyc +0 -0
Ovis/ovis/serve/__pycache__/runner.cpython-311.pyc +0 -0
Ovis/ovis/train/dataset/__init__.py +0 -0
Ovis/ovis/train/dataset/caption_dataset.py +67 -0
Ovis/ovis/train/dataset/conversation_dataset.py +67 -0
Ovis/ovis/train/dataset/multimodal_dataset.py +72 -0
Ovis/ovis/util/__init__.py +0 -0
Ovis/ovis/util/__pycache__/__init__.cpython-310.pyc +0 -0
Ovis/ovis/util/__pycache__/__init__.cpython-311.pyc +0 -0
Ovis/ovis/util/__pycache__/constants.cpython-310.pyc +0 -0
Ovis/ovis/util/__pycache__/constants.cpython-311.pyc +0 -0
Ovis/ovis/util/__pycache__/utils.cpython-311.pyc +0 -0
VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/.ipynb_checkpoints/Qwen2-VL-2B-Instruct_DSPAR_MINI_6345-checkpoint.json +1 -0
VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6275.json +1 -0
VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6278.json +1 -0
VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6385.json +1 -0
VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6481.json +1 -0
VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6500.json +1 -0
VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6513.json +1 -0
VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6515.json +1 -0
VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6517.json +1 -0
VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6521.json +1 -0
VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6533.json +1 -0
VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6539.json +1 -0
VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6580.json +1 -0
VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6589.json +1 -0
VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6608.json +1 -0

Ovis/.ipynb_checkpoints/README-checkpoint.md ADDED

	@@ -0,0 +1,110 @@

+# Ovis: Structural Embedding Alignment for Multimodal Large Language Model
+Ovis (Open VISion) is a novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings. For a comprehensive introduction, please refer to the [Ovis paper](https://arxiv.org/abs/2405.20797).
+<div style="text-align: center;">
+  <img style="max-width: 100%;" src="docs/ovis-illustration.png" alt="Ovis Illustration"/>
+</div>
+## Release
+- [11/26] 🔥 Announcing [Ovis1.6-Gemma2-27B](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-27B)!
+- [11/04] 🔥 Announcing quantized versions of Ovis1.6: [Ovis1.6-Gemma2-9B-GPTQ-Int4](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B-GPTQ-Int4) and [Ovis1.6-Llama3.2-3B-GPTQ-Int4](https://huggingface.co/AIDC-AI/Ovis1.6-Llama3.2-3B-GPTQ-Int4)!
+- [10/22] 🔥 Announcing Ovis1.6-Llama3.2-3B ([Model](https://huggingface.co/AIDC-AI/Ovis1.6-Llama3.2-3B), [Demo](https://huggingface.co/spaces/AIDC-AI/Ovis1.6-Llama3.2-3B))!
+- [09/19] 🔥 Announcing Ovis1.6-Gemma2-9B ([Model](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B), [Demo](https://huggingface.co/spaces/AIDC-AI/Ovis1.6-Gemma2-9B))! This latest release further enhances high-resolution image processing, is trained on a larger, more diverse, and higher-quality dataset, and refines the training process with DPO training following instruction-tuning.
+- [07/24] 🔥 Introducing Ovis1.5, featuring improved high-resolution image processing and optimized training data for enhanced performance.
+- [06/14] 🔥 Launch of Ovis1.0, the inaugural version of the Ovis model.
+## Contents
+- [Install](#install)
+- [Model](#model)
+- [Performance](#performance)
+- [Finetune](#finetune)
+- [Inference](#inference)
+- [Quantization](#quantization)
+- [Citation](#citation)
+- [Team](#team)
+- [License](#license)
+## Install
+Ovis has been tested with Python 3.10, Torch 2.4.0, Transformers 4.46.2, and DeepSpeed 0.15.4. For a comprehensive list of package dependencies, please consult the `requirements.txt` file. Before finetuning or inference, please install Ovis as follows.
+```bash
+git clone git@github.com:AIDC-AI/Ovis.git
+conda create -n ovis python=3.10 -y
+conda activate ovis
+cd Ovis
+pip install -r requirements.txt
+pip install -e .
+```
+## Model
+Ovis can be instantiated with popular LLMs. We provide the following Ovis MLLMs:
+| Ovis MLLMs        | ViT         | LLM                |                          Model Weights                          | Demo                                                             |
+|:------------------|:-----------:|:------------------:|:---------------------------------------------------------------:|:----------------------------------------------------------------:|
+| Ovis1.6-Gemma2-27B | Siglip-400M | Gemma2-27B-It       | [Huggingface](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-27B) | - |
+| Ovis1.6-Gemma2-9B | Siglip-400M | Gemma2-9B-It       | [Huggingface](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis1.6-Gemma2-9B) |
+| Ovis1.6-Llama3.2-3B | Siglip-400M | Llama-3.2-3B-Instruct       | [Huggingface](https://huggingface.co/AIDC-AI/Ovis1.6-Llama3.2-3B) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis1.6-Llama3.2-3B) |
+## Performance
+With **29B** parameters, **Ovis1.6-Gemma2-27B** achieves exceptional performance in the [OpenCompass](https://github.com/open-compass/VLMEvalKit) benchmark, ranking among the top-tier open-source MLLMs.
+![performance-Ovis1_6-Gemma2-27B](docs/performance/Ovis1_6-Gemma2-27B.png)
+With just **10B** parameters, **Ovis1.6-Gemma2-9B** leads the [OpenCompass](https://github.com/open-compass/VLMEvalKit) benchmark among open-source MLLMs within **30B** parameters.
+![performance-Ovis1_6-Gemma2-9B](docs/performance/Ovis1_6-Gemma2-9B.png)
+**Ovis1.6-Llama3.2-3B** leads the [OpenCompass](https://github.com/open-compass/VLMEvalKit) benchmark among open-source MLLMs under **4B** parameters, even surpassing Llama-3.2-11B-Vision-Instruct.
+![performance-Ovis1_6-Llama3_2-3B](docs/performance/Ovis1_6-Llama3_2-3B.png)
+## Finetune
+Finetuning Ovis1.6-Gemma2-9B is supported in [ms-swift](https://github.com/modelscope/ms-swift).
+## Inference
+We provide an inference wrapper in `ovis/serve/runner.py`, which can be used as:
+```python
+from PIL import Image
+from ovis.serve.runner import RunnerArguments, OvisRunner
+image = Image.open('temp.png')
+text = 'PROMPT'
+runner_args = RunnerArguments(model_path='AIDC-AI/Ovis1.6-Gemma2-27B')
+runner = OvisRunner(runner_args)
+generation = runner.run([image, text])
+```
+Based on [Gradio](https://github.com/gradio-app/gradio), Ovis can also be accessed via a web user interface:
+```bash
+python ovis/serve/server.py --model_path MODEL_PATH --port PORT
+```
+## Quantization
+We quantized Ovis1.6 using AutoGPTQ. For detailed information on running and creating your own quantized version, please refer to the respective Huggingface model cards: [Ovis1.6-Gemma2-9B-GPTQ-Int4](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B-GPTQ-Int4) and [Ovis1.6-Llama3.2-3B-GPTQ-Int4](https://huggingface.co/AIDC-AI/Ovis1.6-Llama3.2-3B-GPTQ-Int4). Quantized Ovis1.6 maintains performance comparable to its non-quantized counterpart while requiring less GPU memory:
+- Benchmark performance:
+![performance-Ovis1_6-Gemma2-9B-GPTQ-Int4](docs/performance/Ovis1_6-Gemma2-9B-GPTQ-Int4.png)
+![performance-Ovis1_6-Llama3_2-3B-GPTQ-Int4](docs/performance/Ovis1_6-Llama3_2-3B-GPTQ-Int4.png)
+- GPU memory usage (max_partition=9):
+![performance-Ovis1_6-VRAM-Comparison](docs/performance/Ovis1_6-VRAM-Comparison.png)
+## Citation
+If you find Ovis useful, please cite the paper:
+```
+@article{lu2024ovis,
+  title={Ovis: Structural Embedding Alignment for Multimodal Large Language Model},
+  author={Shiyin Lu and Yang Li and Qing-Guo Chen and Zhao Xu and Weihua Luo and Kaifu Zhang and Han-Jia Ye},
+  year={2024},
+  journal={arXiv:2405.20797}
+}
+```
+## Team
+This work is a collaborative effort by the MarcoVL team. We would also like to provide links to the following MLLM papers from our team:
+- [Parrot: Multilingual Visual Instruction Tuning](https://arxiv.org/abs/2406.02539)
+- [Wings: Learning Multimodal LLMs without Text-only Forgetting](https://arxiv.org/abs/2406.03496)
+## License
+This project is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt) (SPDX-License-Identifier: Apache-2.0).
+## Disclaimer
+We used compliance-checking algorithms during training to ensure, to the best of our ability, that the trained model is compliant. Given the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.

Ovis/Ovis1.6-Gemma2-27B/1c18c1e92281df303545f22c27200f046fc44ec4/__init__.py ADDED

File without changes

Ovis/Ovis1.6-Gemma2-27B/1c18c1e92281df303545f22c27200f046fc44ec4/__pycache__/configuration_ovis.cpython-310.pyc ADDED

Binary file (6.66 kB).

Ovis/Ovis1.6-Gemma2-27B/1c18c1e92281df303545f22c27200f046fc44ec4/__pycache__/configuration_ovis.cpython-39.pyc ADDED

Binary file (6.6 kB).

Ovis/Ovis1.6-Gemma2-27B/1c18c1e92281df303545f22c27200f046fc44ec4/__pycache__/modeling_ovis.cpython-310.pyc ADDED

Binary file (21.1 kB).

Ovis/Ovis1.6-Gemma2-27B/1c18c1e92281df303545f22c27200f046fc44ec4/__pycache__/modeling_ovis.cpython-39.pyc ADDED

Binary file (21.1 kB).

Ovis/Ovis1.6-Gemma2-27B/1c18c1e92281df303545f22c27200f046fc44ec4/configuration_ovis.py ADDED

	@@ -0,0 +1,201 @@

+from abc import ABC, abstractmethod
+from typing import List, Dict, Union, Optional
+from transformers import PretrainedConfig, AutoConfig
+IGNORE_ID = -100
+IMAGE_TOKEN_ID = -200
+IMAGE_TOKEN = "<image>"
+IMAGE_ATOM_ID = -300
+IMAGE_INDICATOR_IDS = [-301, -302, -303, -304, -305]
+# ----------------------------------------------------------------------
+#                     Visual Tokenizer Configuration
+# ----------------------------------------------------------------------
+class BaseVisualTokenizerConfig(PretrainedConfig):
+    def __init__(
+        self,
+        vocab_size=16384,
+        tokenize_function="softmax",
+        tau=1.0,
+        depths=None,
+        drop_cls_token=False,
+        backbone_config: Optional[Union[PretrainedConfig, dict]] = None,
+        hidden_stride: int = 1,
+        **kwargs
+    ):
+        super().__init__(**kwargs)
+        self.vocab_size = vocab_size
+        self.tokenize_function = tokenize_function
+        self.tau = tau
+        if isinstance(depths, str):
+            depths = [int(x) for x in depths.split('|')]
+        self.depths = depths
+        self.backbone_kwargs = {}
+        self.drop_cls_token = drop_cls_token
+        if backbone_config is not None:
+            assert isinstance(backbone_config, (PretrainedConfig, dict)), \
+                f"expect `backbone_config` to be instance of PretrainedConfig or dict, but got {type(backbone_config)} type"
+            if not isinstance(backbone_config, PretrainedConfig):
+                model_type = backbone_config['model_type']
+                backbone_config.pop('model_type')
+                backbone_config = AutoConfig.for_model(model_type, **backbone_config)
+        self.backbone_config = backbone_config
+        self.hidden_stride = hidden_stride
+class SiglipVisualTokenizerConfig(BaseVisualTokenizerConfig):
+    model_type = "siglip_visual_tokenizer"
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        if self.drop_cls_token:
+            self.drop_cls_token = False
+        if self.depths:
+            assert len(self.depths) == 1
+            self.backbone_kwargs['num_hidden_layers'] = self.depths[0]
+AutoConfig.register("siglip_visual_tokenizer", SiglipVisualTokenizerConfig)
+# ----------------------------------------------------------------------
+#                           Ovis Configuration
+# ----------------------------------------------------------------------
+class OvisConfig(PretrainedConfig):
+    model_type = "ovis"
+    def __init__(
+        self,
+        llm_config: Optional[Union[PretrainedConfig, dict]] = None,
+        visual_tokenizer_config: Optional[Union[PretrainedConfig, dict]] = None,
+        multimodal_max_length=8192,
+        hidden_size=None,
+        conversation_formatter_class=None,
+        llm_attn_implementation=None,
+        disable_tie_weight=False,
+        **kwargs
+    ):
+        super().__init__(**kwargs)
+        if llm_config is not None:
+            assert isinstance(llm_config, (PretrainedConfig, dict)), \
+                f"expect `llm_config` to be instance of PretrainedConfig or dict, but got {type(llm_config)} type"
+            if not isinstance(llm_config, PretrainedConfig):
+                model_type = llm_config['model_type']
+                llm_config.pop('model_type')
+                llm_config = AutoConfig.for_model(model_type, **llm_config)
+        self.llm_config = llm_config
+        if visual_tokenizer_config is not None:
+            assert isinstance(visual_tokenizer_config, (PretrainedConfig, dict)), \
+                f"expect `visual_tokenizer_config` to be instance of PretrainedConfig or dict, but got {type(visual_tokenizer_config)} type"
+            if not isinstance(visual_tokenizer_config, PretrainedConfig):
+                model_type = visual_tokenizer_config['model_type']
+                visual_tokenizer_config.pop('model_type')
+                visual_tokenizer_config = AutoConfig.for_model(model_type, **visual_tokenizer_config)
+        self.visual_tokenizer_config = visual_tokenizer_config
+        self.multimodal_max_length = multimodal_max_length
+        self.hidden_size = hidden_size
+        self.conversation_formatter_class = conversation_formatter_class
+        self.llm_attn_implementation = llm_attn_implementation
+        self.disable_tie_weight = disable_tie_weight
+# ----------------------------------------------------------------------
+#                         Conversation Formatter
+# ----------------------------------------------------------------------
+class ConversationFormatter(ABC):
+    support_tokenizer_types = None
+    def __init__(self, tokenizer):
+        tokenizer_type = type(tokenizer).__name__
+        assert tokenizer_type in self.support_tokenizer_types, \
+            f'Invalid tokenizer type, expected one from `{self.support_tokenizer_types}`, but got `{tokenizer_type}`'
+        self.tokenizer = tokenizer
+        self.image_token = IMAGE_TOKEN
+        self.image_token_id = IMAGE_TOKEN_ID
+        self.ignore_id = IGNORE_ID
+    def _tokenize_with_image_symbol(self, text):
+        text_chunks = [self.tokenizer(chunk, add_special_tokens=False).input_ids for chunk in
+                       text.split(self.image_token)]
+        token_ids = []
+        num_chunk = len(text_chunks)
+        for i, chunk in enumerate(text_chunks):
+            token_ids.extend(chunk)
+            if i < num_chunk - 1:
+                token_ids.append(self.image_token_id)
+        return token_ids
+    @abstractmethod
+    def format(self, conversations: List[Dict], generation_preface=None):
+        pass
+    @abstractmethod
+    def format_query(self, query, generation_preface=""):
+        pass
+class GemmaConversationFormatter(ConversationFormatter):
+    support_tokenizer_types = ['GemmaTokenizer', 'GemmaTokenizerFast']
+    def __init__(self, tokenizer):
+        super().__init__(tokenizer)
+        # Gemma does not support system prompt
+        self.from2role = {
+            "human": "<start_of_turn>user\n",
+            "gpt": "<start_of_turn>model\n",
+        }
+        self.gpt_token_num = None
+        self.im_end = "<end_of_turn>\n"
+        self.bos_token = "<bos>"
+        self.bos_token_ids = None
+    def format(self, conversations: List[Dict], generation_preface=None):
+        if self.gpt_token_num is None:
+            self.gpt_token_num = len(self.tokenizer(self.from2role["gpt"], add_special_tokens=False).input_ids)
+        if self.bos_token_ids is None:
+            self.bos_token_ids = self.tokenizer(self.bos_token, add_special_tokens=False).input_ids
+        if conversations[0]["from"] == "system":
+            raise ValueError("Gemma does not support system prompt")
+        if generation_preface is not None:
+            conversations.append({
+                "from": "gpt",
+                "value": generation_preface
+            })
+        prompt = "" + self.bos_token
+        input_ids = [] + self.bos_token_ids
+        labels = [] + [IGNORE_ID] * len(input_ids)
+        num_conversation = len(conversations)
+        for i, conversation in enumerate(conversations):
+            frm = conversation["from"]
+            role = self.from2role[frm]
+            message = conversation["value"].strip()
+            text = role + message
+            if i < num_conversation - 1 or generation_preface is None:
+                text += self.im_end
+            prompt += text
+            token_ids = self._tokenize_with_image_symbol(text)
+            input_ids.extend(token_ids)
+            label_ids = [self.ignore_id] * len(token_ids)
+            if frm == "gpt":
+                # learning `\n` following `im_end` is meaningless, so the last `\n` token is ignored in label
+                label_ids[self.gpt_token_num:-1] = token_ids[self.gpt_token_num:-1]
+            labels.extend(label_ids)
+        assert self._tokenize_with_image_symbol(prompt) == input_ids
+        assert len(input_ids) == len(labels)
+        return prompt, input_ids, labels
+    def format_query(self, query, generation_preface=""):
+        prompt, input_ids, _ = self.format([{
+            "from": "human",
+            "value": query
+        }], generation_preface=generation_preface)
+        return prompt, input_ids
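
Review note on `configuration_ovis.py`: the two less obvious steps in the formatter above are how `<image>` markers become `IMAGE_TOKEN_ID` entries and how labels are masked for a `gpt` turn. The sketch below reproduces both in isolation. It is a minimal illustration, not the committed code: `ToyTokenizer` is a hypothetical stand-in for a Hugging Face tokenizer (one id per whitespace-separated word), and `mask_labels` is a helper name introduced here to mirror the slice assignment in `format`.

```python
from typing import Dict, List

# Constants copied from configuration_ovis.py
IGNORE_ID = -100
IMAGE_TOKEN_ID = -200
IMAGE_TOKEN = "<image>"


class ToyTokenizer:
    """Hypothetical tokenizer: assigns one integer id per whitespace-separated word."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str) -> List[int]:
        # A new word gets the next free id; a known word keeps its id.
        return [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]


def tokenize_with_image_symbol(tokenizer: ToyTokenizer, text: str) -> List[int]:
    """Tokenize the text around each `<image>` marker and splice in IMAGE_TOKEN_ID."""
    chunks = [tokenizer.encode(chunk) for chunk in text.split(IMAGE_TOKEN)]
    token_ids: List[int] = []
    for i, chunk in enumerate(chunks):
        token_ids.extend(chunk)
        if i < len(chunks) - 1:  # one image token between consecutive chunks
            token_ids.append(IMAGE_TOKEN_ID)
    return token_ids


def mask_labels(token_ids: List[int], role_len: int) -> List[int]:
    """Supervise a `gpt` turn except its role prefix and its final token."""
    labels = [IGNORE_ID] * len(token_ids)
    labels[role_len:-1] = token_ids[role_len:-1]
    return labels


tok = ToyTokenizer()
ids = tokenize_with_image_symbol(tok, "<image> describe this <image> now")
print(ids)                                   # [-200, 0, 1, -200, 2]
print(mask_labels([10, 11, 12, 13, 14], 2))  # [-100, -100, 12, 13, -100]
```

Splitting on the marker before tokenizing keeps the text tokenizer from ever seeing `<image>`, so the negative placeholder id cannot collide with a real vocabulary entry.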

Ovis/Ovis1.6-Gemma2-27B/1c18c1e92281df303545f22c27200f046fc44ec4/modeling_ovis.py ADDED
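
Review note on `modeling_ovis.py`: `construct_image_placeholders` in this file lays out one `IMAGE_ATOM_ID` per sub-image, plus one for the full image, delimited by the five `IMAGE_INDICATOR_IDS`. The standalone sketch below follows the same loop so the layout can be inspected for a given grid. The local names `start`, `overview_end`, `col_sep`, `row_sep`, and `end` are interpretive labels assumed here for readability; the committed code only indexes the list.

```python
from typing import List, Tuple

# Constants copied from configuration_ovis.py
IMAGE_ATOM_ID = -300
IMAGE_INDICATOR_IDS = [-301, -302, -303, -304, -305]


def construct_image_placeholders(grid: Tuple[int, int]) -> List[int]:
    """Return the placeholder id sequence for an image split into rows x cols crops."""
    rows, cols = grid
    # Assumed roles of the five indicator ids, inferred from where each is appended.
    start, overview_end, col_sep, row_sep, end = IMAGE_INDICATOR_IDS
    placeholders = [start, IMAGE_ATOM_ID, overview_end]  # atom for the full image
    if rows * cols > 1:
        for r in range(rows):
            for c in range(cols):
                placeholders.append(IMAGE_ATOM_ID)  # atom for crop (r, c)
                if c < cols - 1:
                    placeholders.append(col_sep)
            if r < rows - 1:
                placeholders.append(row_sep)
    placeholders.append(end)
    return placeholders


print(construct_image_placeholders((1, 1)))  # [-301, -300, -302, -305]
print(construct_image_placeholders((1, 2)))  # [-301, -300, -302, -300, -303, -300, -305]
```

For a 2x2 grid the sequence holds five atoms, which matches `preprocess_image` prepending the full image to the list of crops whenever there is more than one crop.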

	@@ -0,0 +1,625 @@

+# Copyright (C) 2024 AIDC-AI
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import logging
+import os
+import importlib.metadata
+from packaging import version
+from importlib import import_module
+from typing import List, Callable, Union, Optional, Dict
+import PIL.Image
+import torch
+import transformers
+from torch import Tensor
+from torch.nn import init
+from torch.nn.functional import softmax, gumbel_softmax, pad
+from transformers.utils import is_flash_attn_2_available
+from transformers import PreTrainedModel, AutoModel, AutoTokenizer, AutoModelForCausalLM, AutoImageProcessor
+from transformers import SiglipImageProcessor, SiglipVisionModel
+from transformers.cache_utils import HybridCache
+from transformers.generation.utils import GenerateOutput
+from .configuration_ovis import BaseVisualTokenizerConfig, SiglipVisualTokenizerConfig
+from .configuration_ovis import OvisConfig, ConversationFormatter
+from .configuration_ovis import IGNORE_ID, IMAGE_ATOM_ID, IMAGE_INDICATOR_IDS, IMAGE_TOKEN_ID
+# ----------------------------------------------------------------------
+#                            Visual Tokenizer
+# ----------------------------------------------------------------------
+class BaseVisualTokenizer(PreTrainedModel):
+    base_model_prefix = "backbone"
+    main_input_name = None
+    _image_processor_class = None
+    _image_processor_kwargs = {}
+    _backbone_class = None
+    _backbone_name_or_path = None
+    def __init__(self, config: BaseVisualTokenizerConfig, *inputs, **kwargs):
+        super().__init__(config, *inputs, **kwargs)
+        self.image_processor = AutoImageProcessor.from_pretrained(kwargs['image_processor_name_or_path'])
+        self.backbone = AutoModel.from_config(self.config.backbone_config)
+        head_dim = self.config.vocab_size - len(IMAGE_INDICATOR_IDS)  # reserved tokens for IMAGE_INDICATORS
+        self.head = torch.nn.Sequential(
+            torch.nn.Linear(
+                self.backbone.config.hidden_size * self.config.hidden_stride * self.config.hidden_stride, head_dim,
+                bias=False
+            ),
+            torch.nn.LayerNorm(head_dim)
+        )
+        assert all((self.image_processor.do_resize,
+                    not getattr(self.image_processor, 'do_center_crop', False),
+                    self.image_processor.do_rescale,
+                    self.image_processor.do_normalize
+                    )), f"image_processor `{self.image_processor}` is not supported currently"
+    def get_backbone(self):
+        return self.backbone
+    def get_image_processor(self):
+        return self.image_processor
+    def mock_input(self):
+        height, width = self.get_image_size()
+        return torch.zeros(1, 3, height, width), self.construct_image_placeholders((1, 1))
+    def get_head(self):
+        return self.head
+    def get_image_size(self):
+        raise NotImplementedError
+    @staticmethod
+    def construct_image_placeholders(grid):
+        image_placeholders = [IMAGE_INDICATOR_IDS[0], IMAGE_ATOM_ID, IMAGE_INDICATOR_IDS[1]]
+        if grid[0] * grid[1] > 1:
+            for r in range(grid[0]):
+                for c in range(grid[1]):
+                    image_placeholders.append(IMAGE_ATOM_ID)
+                    if c < grid[1] - 1:
+                        image_placeholders.append(IMAGE_INDICATOR_IDS[2])
+                if r < grid[0] - 1:
+                    image_placeholders.append(IMAGE_INDICATOR_IDS[3])
+        image_placeholders.append(IMAGE_INDICATOR_IDS[4])
+        return image_placeholders
+    def preprocess_image(self, image: PIL.Image.Image, max_partition=9, covering_threshold=0.9, convert_to_rgb=True):
+        def _preprocess(img: PIL.Image.Image, side):
+            # first resize and preprocess
+            w, h = img.size
+            if w == h:
+                new_width = new_height = side
+            elif w > h:
+                new_width = side
+                new_height = int(h / w * new_width)
+            else:
+                new_height = side
+                new_width = int(w / h * new_height)
+            new_size = dict(height=new_height, width=new_width)
+            pixel_values = self.image_processor.preprocess(img, size=new_size, return_tensors='pt')['pixel_values']
+            # then pad to square
+            square_values = torch.zeros([1, 3, side, side], dtype=pixel_values.dtype, device=pixel_values.device)
+            new_height, new_width = pixel_values.shape[2:]
+            if new_height == new_width:
+                square_values[:, :, :, :] = pixel_values
+            elif new_height > new_width:
+                from_index = (side - new_width) // 2
+                square_values[:, :, :, from_index:from_index + new_width] = pixel_values
+            else:
+                from_index = (side - new_height) // 2
+                square_values[:, :, from_index:from_index + new_height, :] = pixel_values
+            return square_values
+        def _partition(img, grid):
+            w, h = img.size
+            row_height = h // grid[0]
+            col_width = w // grid[1]
+            partition = []
+            for row in range(grid[0]):
+                for col in range(grid[1]):
+                    left = col * col_width
+                    upper = row * row_height
+                    right = w if col == grid[1] - 1 else (col + 1) * col_width
+                    lower = h if row == grid[0] - 1 else (row + 1) * row_height
+                    partition.append((left, upper, right, lower))
+            return partition
+        def _covering_area(left, upper, right, lower, side):
+            w = right - left
+            h = lower - upper
+            w, h = max(w, h), min(w, h)
+            if w > side:
+                h = h / w * side
+                w = side
+            return w * h
+        def _get_best_grid(img, side):
+            img_area = img.size[0] * img.size[1]
+            candidate_grids = []
+            for i in range(1, max_partition + 1):
+                for j in range(1, max_partition + 1):
+                    if i * j <= max_partition:
+                        candidate_grids.append((i, j))
+            all_grids = []
+            good_grids = []
+            for grid in candidate_grids:
+                partition = _partition(img, grid)
+                covering_ratio = sum([_covering_area(*p, side) for p in partition]) / img_area
+                assert covering_ratio <= 1.0
+                all_grids.append((grid, covering_ratio))
+                if covering_ratio > covering_threshold:
+                    good_grids.append((grid, covering_ratio))
+            if len(good_grids) > 0:
+                # pick the good partition with minimum #sub_images and break the tie using covering_ratio
+                return sorted(good_grids, key=lambda x: (x[0][0] * x[0][1], -x[1]))[0][0]
+            else:
+                # pick the partition with maximum covering_ratio and break the tie using #sub_images
+                return sorted(all_grids, key=lambda x: (-x[1], x[0][0] * x[0][1]))[0][0]
+        if convert_to_rgb and image.mode != 'RGB':
+            image = image.convert('RGB')
+        sides = self.get_image_size()
+        if sides[0] != sides[1]:
+            raise ValueError('get_image_size() returns non-square size')
+        side = sides[0]
+        grid = _get_best_grid(image, side)
+        partition = _partition(image, grid)
+        crops = [image.crop(p) for p in partition]
+        if len(crops) > 1:
+            crops.insert(0, image)
+        pixel_values = torch.cat([_preprocess(crop, side) for crop in crops], dim=0)
+        image_placeholders = self.construct_image_placeholders(grid)
+        return pixel_values, image_placeholders
+    def tokenize(self, logits):
+        def st_argmax(y_soft, dim):  # straight-through argmax: one-hot in the forward pass, gradient of y_soft in the backward pass
+            index = y_soft.max(dim, keepdim=True)[1]
+            y_hard = torch.zeros_like(y_soft, memory_format=torch.legacy_contiguous_format).scatter_(dim, index, 1.0)
+            ret = y_hard - y_soft.detach() + y_soft
+            return ret
+        if self.config.tokenize_function == 'softmax':
+            tokens = softmax(logits, dim=-1)
+        elif self.config.tokenize_function == 'gumbel_argmax':
+            tokens = gumbel_softmax(logits, tau=self.config.tau, hard=True)
+        elif self.config.tokenize_function == 'st_argmax':
+            tokens = st_argmax(logits, dim=-1)
+        else:
+            raise ValueError(
+                f'Invalid `tokenize_function`, expected softmax or gumbel_argmax or st_argmax, but got {self.config.tokenize_function}')
+        return tokens
+    def encode(self, pixel_values):
+        output = self.backbone(pixel_values, output_hidden_states=True, return_dict=True)
+        features = output.hidden_states[-1]
+        if self.config.drop_cls_token:
+            features = features[:, 1:, :]
+        # merge each group of `hidden_stride * hidden_stride` hidden states to reduce the token sequence length
+        # e.g., for hidden_stride=3, this leads to a token length reduction: 729 -> 81 for siglip
+        if self.config.hidden_stride > 1:
+            n, l, d = features.shape  # this `d` may be different from the above `d`
+            sqrt_l = int(l ** 0.5)
+            assert sqrt_l ** 2 == l, "The token sequence length should be a perfect square."
+            features = features.reshape(n, sqrt_l, sqrt_l, d)
+            pl = (self.config.hidden_stride - (sqrt_l % self.config.hidden_stride)) % self.config.hidden_stride
+            features = pad(features, (0, 0, 0, pl, 0, pl), "constant", 0)
+            sqrt_l += pl
+            features = features.reshape(n, sqrt_l // self.config.hidden_stride, self.config.hidden_stride,
+                                        sqrt_l // self.config.hidden_stride, self.config.hidden_stride, d)
+            features = features.permute(0, 1, 3, 2, 4, 5)  # [n, sqrt_l/hs, sqrt_l/hs, hs, hs, d]
+            features = features.flatten(3)  # [n, sqrt_l/hs, sqrt_l/hs, hs*hs*d]
+            features = features.reshape(
+                n, -1, self.config.hidden_stride * self.config.hidden_stride * d)
+        return features
+    def forward(self, pixel_values) -> torch.Tensor:  # [BatchSize, ImageShape] -> [BatchSize, #Token, VocabSize]
+        features = self.encode(pixel_values)
+        logits = self.head(features)
+        tokens = self.tokenize(logits)
+        # tokens has shape [BatchSize, #Token, VocabSize-5]; pad it with zeros of shape [BatchSize, #Token, 5]
+        # (one slot per image indicator id) so that it becomes [BatchSize, #Token, VocabSize]
+        batch_size, token_len, _ = tokens.shape
+        padding_tensor = torch.zeros(size=(batch_size, token_len, len(IMAGE_INDICATOR_IDS)),
+                                     dtype=tokens.dtype,
+                                     device=tokens.device,
+                                     layout=tokens.layout,
+                                     requires_grad=False)
+        tokens = torch.cat((tokens, padding_tensor), dim=2)
+        return tokens
+class SiglipVisualTokenizer(BaseVisualTokenizer):
+    config_class = SiglipVisualTokenizerConfig
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["SiglipVisionTransformer"]
+    _image_processor_class = SiglipImageProcessor
+    _image_processor_kwargs = {}
+    _backbone_class = SiglipVisionModel
+    _backbone_name_or_path = "google/siglip-so400m-patch14-384"
+    def get_image_size(self):
+        height = self.image_processor.size["height"]
+        width = self.image_processor.size["width"]
+        return height, width
+AutoModel.register(SiglipVisualTokenizerConfig, SiglipVisualTokenizer)
+# ----------------------------------------------------------------------
+#                                  Ovis
+# ----------------------------------------------------------------------
+class VisualEmbedding(torch.nn.Embedding):
+    def forward(self, visual_tokens: Tensor) -> Tensor:
+        if visual_tokens.dtype in [torch.int8, torch.int16, torch.int32, torch.int64, torch.long]:
+            return super().forward(visual_tokens)
+        return torch.matmul(visual_tokens, self.weight)
+    def reset_parameters(self, mean=0., std=1.) -> None:
+        init.normal_(self.weight, mean=mean, std=std)
+        self._fill_padding_idx_with_zero()
+class OvisPreTrainedModel(PreTrainedModel):
+    config_class = OvisConfig
+    base_model_prefix = "ovis"
+class Ovis(OvisPreTrainedModel):
+    def __init__(self, config: OvisConfig, *inputs, **kwargs):
+        super().__init__(config, *inputs, **kwargs)
+        attn_kwargs = dict()
+        if self.config.llm_attn_implementation:
+            if self.config.llm_attn_implementation == "sdpa":
+                raise ValueError("`sdpa` is currently not supported")
+            elif self.config.llm_attn_implementation == "flash_attention_2":
+                assert (is_flash_attn_2_available() and
+                        version.parse(importlib.metadata.version("flash_attn")) >= version.parse("2.6.3")), \
+                    "Using `flash_attention_2` requires having `flash_attn>=2.6.3` installed."
+            attn_kwargs["attn_implementation"] = self.config.llm_attn_implementation
+        self.llm = AutoModelForCausalLM.from_config(self.config.llm_config, **attn_kwargs)
+        assert self.config.hidden_size == self.llm.config.hidden_size, "hidden size mismatch"
+        self.text_tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path)
+        self.visual_tokenizer = AutoModel.from_config(self.config.visual_tokenizer_config,
+                                                      image_processor_name_or_path=self.config.name_or_path)
+        self.vte = VisualEmbedding(
+            self.config.visual_tokenizer_config.vocab_size,
+            self.config.hidden_size,
+            device=self.visual_tokenizer.device,
+            dtype=self.visual_tokenizer.dtype
+        )
+        def _merge_modules(modules_list: tuple):
+            merged_modules = []
+            for modules in modules_list:
+                merged_modules.extend(modules if modules else [])
+            return merged_modules
+        self._no_split_modules = _merge_modules((self.llm._no_split_modules, self.visual_tokenizer._no_split_modules))
+        self._skip_keys_device_placement = self.llm._skip_keys_device_placement
+        self._keep_in_fp32_modules = _merge_modules(
+            (self.llm._keep_in_fp32_modules, self.visual_tokenizer._keep_in_fp32_modules))
+        self.is_parallelizable = all((self.llm.is_parallelizable, self.visual_tokenizer.is_parallelizable))
+        self.supports_gradient_checkpointing = all(
+            (self.llm.supports_gradient_checkpointing, self.visual_tokenizer.supports_gradient_checkpointing))
+        self._supports_flash_attn_2 = True
+        self._supports_sdpa = False
+    def get_text_tokenizer(self):
+        return self.text_tokenizer
+    def get_visual_tokenizer(self):
+        return self.visual_tokenizer
+    def tie_weights(self):
+        if not self.config.disable_tie_weight:
+            self.get_llm().tie_weights()
+    def get_llm(self):
+        return self.llm
+    def get_vte(self):
+        return self.vte
+    def get_wte(self):
+        return self.llm.get_input_embeddings()
+    def get_conversation_formatter(self) -> ConversationFormatter:
+        if getattr(self, 'conversation_formatter', None) is None:
+            self.conversation_formatter = getattr(import_module(".configuration_ovis", __package__),
+                                                  self.config.conversation_formatter_class)(self.text_tokenizer)
+        return self.conversation_formatter
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        attention_mask: torch.Tensor,
+        labels: Optional[torch.Tensor],
+        pixel_values: List[Optional[torch.Tensor]],
+        **kwargs
+    ):
+        assert self.training, "`forward` can only be used in training. For inference, use `generate`."
+        _, inputs_embeds, labels, attention_mask = self.merge_multimodal(
+            text_input_ids=input_ids,
+            text_attention_masks=attention_mask,
+            text_labels=labels,
+            pixel_values=pixel_values
+        )
+        return self.llm(inputs_embeds=inputs_embeds, labels=labels, attention_mask=attention_mask, **kwargs)
+    def merge_multimodal(
+        self,
+        text_input_ids: torch.Tensor,
+        text_attention_masks: torch.Tensor,
+        text_labels: Optional[torch.Tensor],
+        pixel_values: List[Optional[torch.Tensor]],
+        left_padding: bool = False
+    ):
+        input_device = text_input_ids.device
+        visual_vocab_size = self.get_visual_tokenizer().config.vocab_size
+        visual_indicator_embeds = self.get_vte()(
+            torch.tensor(
+                list(range(visual_vocab_size - 5, visual_vocab_size)),
+                dtype=torch.long,
+                device=self.get_visual_tokenizer().device
+            )
+        ).to(device=input_device)
+        if self.training:
+            # During training, for compatibility with DeepSpeed ZeRO, every sample must include a pixel_values tensor.
+            # For a text-only sample, pass an all-zero tensor as pixel_values; it is ignored
+            # (see below in this function), so the gradient is not affected.
+            num_images = [x.shape[0] for x in pixel_values]
+            visual_tokens = self.visual_tokenizer(torch.cat([x for x in pixel_values], dim=0))
+            visual_embeds = torch.split(self.get_vte()(visual_tokens).to(dtype=self.dtype, device=input_device),
+                                        split_size_or_sections=num_images, dim=0)
+            visual_input_ids = torch.split(torch.argmax(visual_tokens, dim=-1).to(device=input_device),
+                                           split_size_or_sections=num_images, dim=0)
+            visual_labels = [torch.full(x.shape, IGNORE_ID, dtype=torch.long, device=input_device) for x in
+                             visual_input_ids]
+        else:
+            # At inference time, a sample may be text-only, with `None` as its pixel_values
+            num_images = [x.shape[0] if x is not None else 0 for x in pixel_values]
+            if sum(num_images) > 0:
+                visual_tokens = self.visual_tokenizer(torch.cat([x for x in pixel_values if x is not None], dim=0))
+                visual_embeds = torch.split(self.get_vte()(visual_tokens).to(dtype=self.dtype, device=input_device),
+                                            split_size_or_sections=num_images, dim=0)
+                visual_input_ids = torch.split(torch.argmax(visual_tokens, dim=-1).to(device=input_device),
+                                               split_size_or_sections=num_images, dim=0)
+                visual_labels = [torch.full(x.shape, IGNORE_ID, dtype=torch.long, device=input_device) for x in
+                                 visual_input_ids]
+            else:
+                # just placeholders
+                visual_embeds = [None] * len(num_images)
+                visual_input_ids = [None] * len(num_images)
+                visual_labels = [None] * len(num_images)
+            if text_labels is None:
+                text_labels = torch.full(text_input_ids.shape, IGNORE_ID, dtype=torch.long, device=input_device)
+        input_embeds = []
+        attention_masks = []
+        labels = []
+        for text_input_id, text_label, text_attention_mask, visual_embed, visual_input_id, visual_label in zip(
+                text_input_ids, text_labels, text_attention_masks, visual_embeds, visual_input_ids, visual_labels
+        ):
+            placeholder_token_mask = torch.lt(text_input_id, 0)
+            text_embed = self.get_wte()(torch.masked_fill(text_input_id, placeholder_token_mask, 0))
+            for i, indicator_id in enumerate(IMAGE_INDICATOR_IDS):
+                text_embed[text_input_id == indicator_id] = visual_indicator_embeds[i]
+            image_atom_positions = torch.where(torch.eq(text_input_id, IMAGE_ATOM_ID))[0].tolist()
+            if len(image_atom_positions) > 0:
+                input_embed_parts = []
+                attention_mask_parts = []
+                label_parts = []
+                prev_image_atom_position = -1
+                for index, image_atom_position in enumerate(image_atom_positions):
+                    input_embed_parts.append(
+                        text_embed[prev_image_atom_position + 1:image_atom_position, :])
+                    label_parts.append(
+                        text_label[prev_image_atom_position + 1:image_atom_position])
+                    attention_mask_parts.append(
+                        text_attention_mask[prev_image_atom_position + 1:image_atom_position])
+                    input_embed_parts.append(visual_embed[index])
+                    attention_mask_parts.append(
+                        torch.ones_like(visual_label[index], dtype=torch.bool))
+                    label_parts.append(visual_label[index])
+                    prev_image_atom_position = image_atom_position
+                if prev_image_atom_position + 1 < text_input_id.shape[0]:
+                    input_embed_parts.append(
+                        text_embed[prev_image_atom_position + 1:, :])
+                    attention_mask_parts.append(
+                        text_attention_mask[prev_image_atom_position + 1:])
+                    label_parts.append(
+                        text_label[prev_image_atom_position + 1:])
+                input_embed = torch.cat(input_embed_parts, dim=0)
+                attention_mask = torch.cat(attention_mask_parts, dim=0)
+                label = torch.cat(label_parts, dim=0)
+            else:
+                input_embed = text_embed
+                attention_mask = text_attention_mask
+                label = text_label
+                if self.training:
+                    # Make visual_embed & visual_indicator_embeds involved in the backward graph,
+                    # to be compatible with deepspeed zero and ddp.
+                    input_embed += torch.sum(visual_embed * 0.0) + torch.sum(visual_indicator_embeds * 0.0)
+            input_embeds.append(input_embed)
+            attention_masks.append(attention_mask)
+            labels.append(label)
+        if self.training:  # padding to self.config.multimodal_max_length for increased training speed
+            padding_size = max(0, self.config.multimodal_max_length - len(input_embeds[0]))
+            input_embeds[0] = torch.nn.ConstantPad2d((0, 0, 0, padding_size), 0.0)(input_embeds[0])
+            attention_masks[0] = torch.nn.ConstantPad1d((0, padding_size), False)(attention_masks[0])
+            labels[0] = torch.nn.ConstantPad1d((0, padding_size), IGNORE_ID)(labels[0])
+        batch_input_embeds = self.pad_truncate_sequence(input_embeds, batch_first=True, padding_value=0.0, left_padding=left_padding)
+        batch_attention_mask = self.pad_truncate_sequence(attention_masks, batch_first=True, padding_value=False, left_padding=left_padding)
+        batch_labels = self.pad_truncate_sequence(labels, batch_first=True, padding_value=IGNORE_ID, left_padding=left_padding)
+        return visual_input_ids, batch_input_embeds, batch_labels, batch_attention_mask
+    def pad_truncate_sequence(self, sequences: List[torch.Tensor], batch_first: bool = True, padding_value: float = 0.0, left_padding: bool = False) -> torch.Tensor:
+        if not left_padding:
+            pad_sequence = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=batch_first, padding_value=padding_value)
+            return pad_sequence[:, :self.config.multimodal_max_length]
+        else:
+            # Left padding: reverse each sequence, right-pad, then reverse the padded batch back.
+            pad_sequence = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in sequences], batch_first=True, padding_value=padding_value).flip(dims=[1])
+            return pad_sequence[:, -self.config.multimodal_max_length:]
+    def preprocess_inputs(
+        self,
+        text_or_conversations: Union[List[Dict], str],
+        images: Optional[List[PIL.Image.Image]],
+        max_partition=9,
+        generation_preface='',
+        return_labels=False,
+        propagate_exception=True
+    ):
+        # convert text to conversations
+        if isinstance(text_or_conversations, str):
+            conversations = [{
+                "from": "human",
+                "value": text_or_conversations
+            }]
+        elif isinstance(text_or_conversations, list):
+            conversations = text_or_conversations
+        else:
+            raise ValueError(f'Invalid type of `text_or_conversations`, expected `List[Dict]` or `str`,'
+                             f' but got {type(text_or_conversations)}')
+        # format conversations
+        prompt, raw_input_ids, raw_labels = self.get_conversation_formatter().format(
+            conversations, generation_preface=generation_preface)
+        # place image placeholders
+        input_ids = []
+        labels = []
+        pixel_values = []
+        invalidate_label = False
+        image_token_indices = [i for i, v in enumerate(raw_input_ids) if v == IMAGE_TOKEN_ID]
+        last_image_token_index = -1
+        for i in range(len(image_token_indices)):
+            head = 0 if i == 0 else image_token_indices[i - 1] + 1
+            tail = image_token_indices[i]
+            last_image_token_index = tail
+            input_ids.extend(raw_input_ids[head:tail])
+            labels.extend(raw_labels[head:tail])
+            try:
+                image = images[i]
+                raw_pixel_values, image_placeholders = self.visual_tokenizer.preprocess_image(
+                    image, max_partition=max_partition)
+            except Exception as e:
+                if propagate_exception:
+                    raise e
+                logging.exception(e)
+                invalidate_label = True
+                raw_pixel_values, image_placeholders = self.visual_tokenizer.mock_input()
+            input_ids.extend(image_placeholders)
+            labels.extend([IGNORE_ID] * len(image_placeholders))
+            pixel_values.append(raw_pixel_values)
+        input_ids.extend(raw_input_ids[last_image_token_index + 1:])
+        labels.extend(raw_labels[last_image_token_index + 1:])
+        # return tensors
+        input_ids = torch.tensor(input_ids, dtype=torch.long)
+        labels = torch.tensor([IGNORE_ID] * len(labels) if invalidate_label else labels, dtype=torch.long)
+        pixel_values = torch.cat(pixel_values, dim=0) if len(pixel_values) > 0 else None
+        if return_labels:
+            return prompt, input_ids, pixel_values, labels
+        else:
+            return prompt, input_ids, pixel_values
+    def save_pretrained(
+        self,
+        save_directory: Union[str, os.PathLike],
+        is_main_process: bool = True,
+        state_dict: Optional[dict] = None,
+        save_function: Callable = torch.save,
+        push_to_hub: bool = False,
+        max_shard_size: Union[int, str] = "5GB",
+        safe_serialization: bool = True,
+        variant: Optional[str] = None,
+        token: Optional[Union[str, bool]] = None,
+        save_peft_format: bool = True,
+        **kwargs
+    ):
+        super().save_pretrained(save_directory,
+                                is_main_process=is_main_process,
+                                state_dict=state_dict,
+                                save_function=save_function,
+                                safe_serialization=safe_serialization)
+        self.get_text_tokenizer().save_pretrained(save_directory)
+        self.get_visual_tokenizer().get_image_processor().save_pretrained(save_directory)
+    def _get_hybrid_cache_for_llm(self, batch_size: int, max_cache_len: int):
+        cache_cls = HybridCache
+        llm = self.get_llm()
+        need_new_cache = (
+            not hasattr(llm, "_cache")
+            or (not isinstance(llm._cache, cache_cls))
+            or llm._cache.batch_size != batch_size
+            or llm._cache.max_cache_len < max_cache_len
+        )
+        if need_new_cache:
+            if hasattr(llm.config, "_pre_quantization_dtype"):
+                cache_dtype = llm.config._pre_quantization_dtype
+            else:
+                cache_dtype = llm.dtype
+            llm._cache = cache_cls(
+                config=llm.config,
+                batch_size=batch_size,
+                max_cache_len=max_cache_len,
+                device=llm.device,
+                dtype=cache_dtype,
+            )
+        else:
+            llm._cache.reset()
+        return llm._cache
+    # TODO: support batch generation
+    def generate(
+        self,
+        inputs: Optional[torch.Tensor] = None,
+        **kwargs
+    ) -> Union[GenerateOutput, torch.LongTensor]:
+        _, inputs_embeds, labels, attention_mask = self.merge_multimodal(
+            text_input_ids=inputs,
+            text_attention_masks=kwargs.pop('attention_mask'),
+            text_labels=None,
+            pixel_values=kwargs.pop('pixel_values'),
+            left_padding=True
+        )
+        if getattr(self.generation_config, 'cache_implementation', None) == 'hybrid':  # mainly for Gemma2
+            kwargs['past_key_values'] = self._get_hybrid_cache_for_llm(
+                # `kwargs` is a dict, so read `num_beams` with `.get`; `getattr` would always return the default
+                kwargs.get("num_beams", inputs_embeds.shape[0]), kwargs['max_new_tokens'] + inputs_embeds.shape[-2])
+            self.get_llm()._supports_cache_class = True
+            kwargs['cache_implementation'] = None
+        return self.llm.generate(inputs=None, inputs_embeds=inputs_embeds, attention_mask=attention_mask, **kwargs)
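
A note on the `left_padding=True` path that `generate` relies on: `pad_truncate_sequence` pads on the left by flipping each sequence, right-padding, and flipping the batch back, then keeps the last `multimodal_max_length` positions. The sketch below mirrors that behaviour on plain Python lists rather than tensors, so it runs without PyTorch. `pad_truncate` and its parameter names are hypothetical, chosen for illustration only, and are not part of this commit.

```python
from typing import List, Sequence


def pad_truncate(sequences: Sequence[List[int]], max_length: int,
                 padding_value: int = 0, left_padding: bool = False) -> List[List[int]]:
    """Pad every sequence to the longest one, then clip to max_length.

    Hypothetical list-based stand-in for the tensor method in the diff.
    Assumes max_length >= 1 and at least one sequence.
    """
    longest = max(len(seq) for seq in sequences)
    batch = []
    for seq in sequences:
        pad = [padding_value] * (longest - len(seq))
        if left_padding:
            # Padding goes in front; truncation keeps the LAST max_length items,
            # so the tokens closest to the generation point survive.
            row = (pad + list(seq))[-max_length:]
        else:
            # Padding goes behind; truncation keeps the FIRST max_length items.
            row = (list(seq) + pad)[:max_length]
        batch.append(row)
    return batch


batch = [[1, 2, 3, 4], [5, 6]]
right = pad_truncate(batch, max_length=3)
left = pad_truncate(batch, max_length=3, left_padding=True)
print(right)
print(left)
```

With left padding, every row ends at its real last token, which is what an autoregressive decoder needs when it appends new tokens on the right.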

Ovis/Ovis1.6-Gemma2-27B/__init__.py ADDED

File without changes

Ovis/docs/license/QWEN_LICENSE ADDED

	@@ -0,0 +1,53 @@

+Tongyi Qianwen LICENSE AGREEMENT
+Tongyi Qianwen Release Date: August 3, 2023
+By clicking to agree or by using or distributing any portion or element of the Tongyi Qianwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
+1. Definitions
+    a. This Tongyi Qianwen LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
+    b. "We"(or "Us") shall mean Alibaba Cloud.
+    c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
+    d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You.
+    e. "Tongyi Qianwen" shall mean the large language models (including Qwen model and Qwen-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us.
+    f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement.
+    g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
+    h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+2. Grant of Rights
+You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials.
+3. Redistribution
+You may reproduce and distribute copies of the Materials or derivative works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
+    a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement;
+    b. You shall cause any modified files to carry prominent notices stating that You changed the files;
+    c. You shall retain in all copies of the Materials that You distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Tongyi Qianwen is licensed under the Tongyi Qianwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and
+    d. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such derivative works as a whole, provided Your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement.
+4. Restrictions
+If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us. You cannot exercise your rights under this Agreement without our express authorization.
+5. Rules of use
+    a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials.
+    b. You can not use the Materials or any output therefrom to improve any other large language model (excluding Tongyi Qianwen or derivative works thereof).
+6. Intellectual Property
+    a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for Us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications.
+    b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of Us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials.
+    c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licences granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought.
+7. Disclaimer of Warranty and Limitation of Liability
+    a. We are not obligated to support, update, provide training for, or develop any further version of the Tongyi Qianwen Materials or to grant any license thereto.
+    b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM.
+    c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED.
+    d. You will defend, indemnify and hold harmless Us from and against any claim by any third party arising out of or related to your use or distribution of the Materials.
+8. Survival and Termination.
+    a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
+    b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 7 and 9 shall survive the termination of this Agreement.
+9. Governing Law and Jurisdiction.
+    a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
+    b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.

Ovis/ovis.egg-info/PKG-INFO ADDED

	@@ -0,0 +1,30 @@

+Metadata-Version: 2.1
+Name: ovis
+Version: 1.6.0
+License-File: LICENSE
+Requires-Dist: torch==2.4.0
+Requires-Dist: transformers==4.46.2
+Requires-Dist: tokenizers==0.20.3
+Requires-Dist: sentencepiece==0.1.99
+Requires-Dist: pyarrow==14.0.2
+Requires-Dist: accelerate==1.1.0
+Requires-Dist: pydantic==2.8.2
+Requires-Dist: markdown2[all]
+Requires-Dist: numpy==1.24.3
+Requires-Dist: scikit-learn==1.2.2
+Requires-Dist: requests
+Requires-Dist: httpx
+Requires-Dist: uvicorn
+Requires-Dist: fastapi==0.112.4
+Requires-Dist: einops==0.6.1
+Requires-Dist: einops-exts==0.0.4
+Requires-Dist: timm==0.6.13
+Requires-Dist: tiktoken
+Requires-Dist: transformers_stream_generator==0.0.4
+Requires-Dist: scipy
+Requires-Dist: pandas
+Requires-Dist: torchaudio
+Requires-Dist: xformers
+Requires-Dist: pillow==10.3.0
+Requires-Dist: deepspeed==0.15.4
+Requires-Dist: gradio

Ovis/ovis.egg-info/SOURCES.txt ADDED

	@@ -0,0 +1,24 @@

+LICENSE
+README.md
+setup.py
+ovis/__init__.py
+ovis.egg-info/PKG-INFO
+ovis.egg-info/SOURCES.txt
+ovis.egg-info/dependency_links.txt
+ovis.egg-info/requires.txt
+ovis.egg-info/top_level.txt
+ovis/model/__init__.py
+ovis/model/configuration_ovis.py
+ovis/model/conversation_formatter.py
+ovis/model/modeling_ovis.py
+ovis/train/__init__.py
+ovis/train/arguments.py
+ovis/train/callback.py
+ovis/train/train.py
+ovis/train/dataset/__init__.py
+ovis/train/dataset/caption_dataset.py
+ovis/train/dataset/conversation_dataset.py
+ovis/train/dataset/multimodal_dataset.py
+ovis/util/__init__.py
+ovis/util/constants.py
+ovis/util/utils.py

Ovis/ovis.egg-info/dependency_links.txt ADDED

	@@ -0,0 +1 @@


+

Ovis/ovis.egg-info/requires.txt ADDED

	@@ -0,0 +1,26 @@

+torch==2.4.0
+transformers==4.46.2
+tokenizers==0.20.3
+sentencepiece==0.1.99
+pyarrow==14.0.2
+accelerate==1.1.0
+pydantic==2.8.2
+markdown2[all]
+numpy==1.24.3
+scikit-learn==1.2.2
+requests
+httpx
+uvicorn
+fastapi==0.112.4
+einops==0.6.1
+einops-exts==0.0.4
+timm==0.6.13
+tiktoken
+transformers_stream_generator==0.0.4
+scipy
+pandas
+torchaudio
+xformers
+pillow==10.3.0
+deepspeed==0.15.4
+gradio
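
Since most entries above are exact `==` pins, one way to see whether an environment matches is to compare each pin with the installed version. The following is a minimal sketch, assuming only the simple `name`, `name[extra]`, and `name==version` forms that appear in this file. `parse_requirement` and `find_mismatches` are hypothetical helpers written for illustration and are not part of the package.

```python
import re
from importlib import metadata
from typing import Dict, Iterable, Optional, Tuple

# name, optional [extras], optional ==version (the only forms used in the list above)
_PIN = re.compile(r"^([A-Za-z0-9_.\-]+)(?:\[[^\]]*\])?(?:==(.+))?$")


def parse_requirement(line: str) -> Tuple[str, Optional[str]]:
    """Split a requirement line into (package name, pinned version or None)."""
    match = _PIN.match(line.strip())
    if match is None:
        raise ValueError(f"unsupported requirement line: {line!r}")
    return match.group(1), match.group(2)


def find_mismatches(lines: Iterable[str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map package name -> (wanted, installed) for missing or differently pinned packages."""
    mismatches = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, wanted = parse_requirement(line)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed is None or (wanted is not None and installed != wanted):
            mismatches[name] = (wanted, installed)
    return mismatches


print(parse_requirement("torch==2.4.0"))
print(parse_requirement("markdown2[all]"))
```

Unpinned entries such as `requests` only need to be present; the sketch reports them only when they are missing.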

Ovis/ovis.egg-info/top_level.txt ADDED

	@@ -0,0 +1 @@


+ovis

Ovis/ovis/__init__.py ADDED

	@@ -0,0 +1,3 @@


+import os
+
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
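
The assignment in this file overwrites any `TOKENIZERS_PARALLELISM` value the caller has already exported. If preserving an existing value were desired, a variant could use `setdefault` instead. The sketch below illustrates the difference on plain dictionaries; `set_default_env` is a hypothetical helper and not something this commit defines.

```python
import os
from typing import MutableMapping


def set_default_env(name: str, value: str,
                    environ: MutableMapping[str, str] = os.environ) -> str:
    """Set name=value only if name is not already set; return the effective value."""
    return environ.setdefault(name, value)


# A caller who already chose a value keeps it.
fake_env = {"TOKENIZERS_PARALLELISM": "true"}
kept = set_default_env("TOKENIZERS_PARALLELISM", "false", fake_env)

# With nothing set, the default is applied.
empty_env = {}
filled = set_default_env("TOKENIZERS_PARALLELISM", "false", empty_env)
print(kept, filled)
```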

Ovis/ovis/model/__init__.py ADDED

	@@ -0,0 +1,2 @@


+from .visual_tokenizer.clip_visual_tokenizer import ClipVisualTokenizerConfig, ClipVisualTokenizer
+from .visual_tokenizer.siglip_visual_tokenizer import SiglipVisualTokenizerConfig, SiglipVisualTokenizer

Ovis/ovis/model/__pycache__/conversation_formatter.cpython-311.pyc ADDED

Binary file (12 kB).

Ovis/ovis/model/__pycache__/modeling_ovis.cpython-310.pyc ADDED

Binary file (14 kB).

Ovis/ovis/model/modeling_ovis.py ADDED

	@@ -0,0 +1,434 @@

+import logging
+import os
+from packaging import version
+from datetime import datetime
+from importlib import import_module
+from typing import List, Union, Callable, Optional, Dict
+import PIL.Image
+import deepspeed
+import torch
+import transformers
+from torch import Tensor
+from torch.nn import init
+from transformers import PreTrainedModel, AutoConfig, AutoModel, AutoTokenizer, AutoModelForCausalLM
+from transformers.cache_utils import HybridCache
+from transformers.generation.utils import GenerateOutput
+from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled, deepspeed_config
+from ovis.model.configuration_ovis import OvisConfig
+from ovis.model.conversation_formatter import ConversationFormatter
+from ovis.util.constants import IGNORE_ID, BEGIN_LINE, END_LINE, IMAGE_ATOM_ID, IMAGE_INDICATOR_IDS, \
+    IMAGE_TOKEN_ID
+from ovis.util.utils import rank0_print
+class VisualEmbedding(torch.nn.Embedding):
+    def forward(self, visual_tokens: Tensor) -> Tensor:
+        if visual_tokens.dtype in [torch.int8, torch.int16, torch.int32, torch.int64, torch.long]:
+            return super().forward(visual_tokens)
+        return torch.matmul(visual_tokens, self.weight)
+    def reset_parameters(self, mean=0., std=1.) -> None:
+        init.normal_(self.weight, mean=mean, std=std)
+        self._fill_padding_idx_with_zero()
+class OvisPreTrainedModel(PreTrainedModel):
+    config_class = OvisConfig
+    base_model_prefix = "ovis"
+class Ovis(OvisPreTrainedModel):
+    def __init__(self, config: OvisConfig, *inputs, **kwargs):
+        super().__init__(config, *inputs, **kwargs)
+        if kwargs.get('train_from_scratch'):
+            self.llm = kwargs['llm']
+            self.generation_config = self.llm.generation_config
+            self.config.llm_config = self.llm.config
+            self.config.hidden_size = self.llm.config.hidden_size  # for deepspeed auto configuration
+            self.text_tokenizer = kwargs['text_tokenizer']
+            self.visual_tokenizer = kwargs['visual_tokenizer']
+            self.config.visual_tokenizer_config = self.visual_tokenizer.config
+        else:
+            attn_kwargs = dict()
+            if self.config.llm_attn_implementation:
+                attn_kwargs['attn_implementation'] = self.config.llm_attn_implementation
+            self.llm = AutoModelForCausalLM.from_config(self.config.llm_config, **attn_kwargs)
+            assert self.config.hidden_size == self.llm.config.hidden_size, "hidden size mismatch"
+            self.text_tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path)
+            self.visual_tokenizer = AutoModel.from_config(self.config.visual_tokenizer_config,
+                                                          image_processor_name_or_path=self.config.name_or_path)
+        # initialize vte
+        if is_deepspeed_zero3_enabled():
+            with deepspeed.zero.Init(config_dict_or_path=deepspeed_config()):
+                self.vte = VisualEmbedding(self.config.visual_tokenizer_config.vocab_size, self.config.hidden_size)
+        else:
+            self.vte = VisualEmbedding(self.config.visual_tokenizer_config.vocab_size, self.config.hidden_size,
+                                       device=self.visual_tokenizer.device, dtype=self.visual_tokenizer.dtype)
+        def _merge_modules(modules_list: tuple):
+            merged_modules = []
+            for modules in modules_list:
+                merged_modules.extend(modules if modules else [])
+            return merged_modules
+        self._no_split_modules = _merge_modules((self.llm._no_split_modules, self.visual_tokenizer._no_split_modules))
+        self._skip_keys_device_placement = self.llm._skip_keys_device_placement
+        self._keep_in_fp32_modules = _merge_modules(
+            (self.llm._keep_in_fp32_modules, self.visual_tokenizer._keep_in_fp32_modules))
+        self.is_parallelizable = all((self.llm.is_parallelizable, self.visual_tokenizer.is_parallelizable))
+        self.supports_gradient_checkpointing = all(
+            (self.llm.supports_gradient_checkpointing, self.visual_tokenizer.supports_gradient_checkpointing))
+        self._supports_flash_attn_2 = all(
+            (self.llm._supports_flash_attn_2, self.visual_tokenizer._supports_flash_attn_2))
+        self._supports_sdpa = all((self.llm._supports_sdpa, self.visual_tokenizer._supports_sdpa))
+    def get_text_tokenizer(self):
+        return self.text_tokenizer
+    def get_visual_tokenizer(self):
+        return self.visual_tokenizer
+    def tie_weights(self):
+        if not self.config.disable_tie_weight:
+            self.get_llm().tie_weights()
+    def re_init_vte(self, mean, std):
+        vte = self.get_vte()
+        rank0_print(BEGIN_LINE)
+        rank0_print(f'[{datetime.now()}] Before re-initialization of vte: ')
+        with deepspeed.zero.GatheredParameters([vte.weight]):
+            rank0_print(f'vte.weight: {vte.weight}')
+        with deepspeed.zero.GatheredParameters([vte.weight], modifier_rank=0):
+            if not is_deepspeed_zero3_enabled() or deepspeed.comm.get_rank() == 0:
+                vte.reset_parameters(mean, std)
+        rank0_print(f'[{datetime.now()}] After re-initialization of vte:')
+        with deepspeed.zero.GatheredParameters([vte.weight]):
+            rank0_print(f'vte.weight: {vte.weight}')
+        rank0_print(END_LINE)
+    def get_monitor_tensors(self):
+        monitor_tensors = dict(
+            wte=self.get_wte().weight,
+            lm_head=self.get_lm_head().weight,
+            vte=self.get_vte().weight
+        )
+        monitor_tensors.update(
+            {f'visual_tokenizer_{k}': v for k, v in self.get_visual_tokenizer().get_monitor_tensors().items()})
+        return monitor_tensors
+    def get_lm_head(self):
+        return self.get_llm().get_output_embeddings()
+    def get_llm(self):
+        return self.llm
+    def get_vte(self):
+        return self.vte
+    def get_wte(self):
+        return self.llm.get_input_embeddings()
+    def get_conversation_formatter(self) -> ConversationFormatter:
+        if getattr(self, 'conversation_formatter', None) is None:
+            self.conversation_formatter = getattr(import_module(".conversation_formatter", __package__),
+                                                  self.config.conversation_formatter_class)(self.text_tokenizer)
+        return self.conversation_formatter
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        attention_mask: torch.Tensor,
+        labels: Optional[torch.Tensor],
+        pixel_values: List[Optional[torch.Tensor]],
+        **kwargs
+    ):
+        assert self.training, "`forward` can only be used in training. For inference, use `generate`."
+        _, inputs_embeds, labels, attention_mask = self.merge_multimodal(
+            text_input_ids=input_ids,
+            text_attention_masks=attention_mask,
+            text_labels=labels,
+            pixel_values=pixel_values
+        )
+        return self.llm(inputs_embeds=inputs_embeds, labels=labels, attention_mask=attention_mask, **kwargs)
+    def merge_multimodal(
+        self,
+        text_input_ids: torch.Tensor,
+        text_attention_masks: torch.Tensor,
+        text_labels: Optional[torch.Tensor],
+        pixel_values: List[Optional[torch.Tensor]]
+    ):
+        input_device = text_input_ids.device
+        visual_vocab_size = self.get_visual_tokenizer().config.vocab_size
+        visual_indicator_embeds = self.get_vte()(
+            torch.tensor(
+                list(range(visual_vocab_size - 5, visual_vocab_size)),
+                dtype=torch.long,
+                device=self.get_visual_tokenizer().device
+            )
+        ).to(device=input_device)
+        if self.training:
+            # During training, each sample must include a pixel_values tensor to stay compatible with DeepSpeed ZeRO.
+            # For a text-only sample, pass an all-zero tensor as pixel_values; it is ignored further down in this
+            # function, so the gradient is not affected.
+            num_images = [x.shape[0] for x in pixel_values]
+            visual_tokens = self.visual_tokenizer(torch.cat([x for x in pixel_values], dim=0))
+            visual_embeds = torch.split(self.get_vte()(visual_tokens).to(dtype=self.dtype, device=input_device),
+                                        split_size_or_sections=num_images, dim=0)
+            visual_input_ids = torch.split(torch.argmax(visual_tokens, dim=-1).to(device=input_device),
+                                           split_size_or_sections=num_images, dim=0)
+            visual_labels = [torch.full(x.shape, IGNORE_ID, dtype=torch.long, device=input_device) for x in
+                             visual_input_ids]
+        else:
+            # At inference time, a sample may be text-only, with `None` as its pixel_values entry
+            num_images = [x.shape[0] if x is not None else 0 for x in pixel_values]
+            if sum(num_images) > 0:
+                visual_tokens = self.visual_tokenizer(torch.cat([x for x in pixel_values if x is not None], dim=0))
+                visual_embeds = torch.split(self.get_vte()(visual_tokens).to(dtype=self.dtype, device=input_device),
+                                            split_size_or_sections=num_images, dim=0)
+                visual_input_ids = torch.split(torch.argmax(visual_tokens, dim=-1).to(device=input_device),
+                                               split_size_or_sections=num_images, dim=0)
+                visual_labels = [torch.full(x.shape, IGNORE_ID, dtype=torch.long, device=input_device) for x in
+                                 visual_input_ids]
+            else:
+                # just placeholders
+                visual_embeds = [None] * len(num_images)
+                visual_input_ids = [None] * len(num_images)
+                visual_labels = [None] * len(num_images)
+            # just placeholders
+            text_labels = torch.full(text_input_ids.shape, IGNORE_ID, dtype=torch.long, device=input_device)
+        input_embeds = []
+        attention_masks = []
+        labels = []
+        for text_input_id, text_label, text_attention_mask, visual_embed, visual_input_id, visual_label in zip(
+                text_input_ids, text_labels, text_attention_masks, visual_embeds, visual_input_ids, visual_labels
+        ):
+            placeholder_token_mask = torch.lt(text_input_id, 0)
+            text_embed = self.get_wte()(torch.masked_fill(text_input_id, placeholder_token_mask, 0))
+            for i, indicator_id in enumerate(IMAGE_INDICATOR_IDS):
+                text_embed[text_input_id == indicator_id] = visual_indicator_embeds[i]
+            image_atom_positions = torch.where(torch.eq(text_input_id, IMAGE_ATOM_ID))[0].tolist()
+            if len(image_atom_positions) > 0:
+                input_embed_parts = []
+                attention_mask_parts = []
+                label_parts = []
+                prev_image_atom_position = -1
+                for index, image_atom_position in enumerate(image_atom_positions):
+                    input_embed_parts.append(
+                        text_embed[prev_image_atom_position + 1:image_atom_position, :])
+                    label_parts.append(
+                        text_label[prev_image_atom_position + 1:image_atom_position])
+                    attention_mask_parts.append(
+                        text_attention_mask[prev_image_atom_position + 1:image_atom_position])
+                    input_embed_parts.append(visual_embed[index])
+                    attention_mask_parts.append(
+                        torch.ones_like(visual_label[index], dtype=torch.bool))
+                    label_parts.append(visual_label[index])
+                    prev_image_atom_position = image_atom_position
+                if prev_image_atom_position + 1 < text_input_id.shape[0]:
+                    input_embed_parts.append(
+                        text_embed[prev_image_atom_position + 1:, :])
+                    attention_mask_parts.append(
+                        text_attention_mask[prev_image_atom_position + 1:])
+                    label_parts.append(
+                        text_label[prev_image_atom_position + 1:])
+                input_embed = torch.cat(input_embed_parts, dim=0)
+                attention_mask = torch.cat(attention_mask_parts, dim=0)
+                label = torch.cat(label_parts, dim=0)
+            else:
+                input_embed = text_embed
+                attention_mask = text_attention_mask
+                label = text_label
+                if self.training:
+                    # Make visual_embed & visual_indicator_embeds involved in the backward graph,
+                    # to be compatible with deepspeed zero and ddp.
+                    input_embed += torch.sum(visual_embed * 0.0) + torch.sum(visual_indicator_embeds * 0.0)
+            input_embeds.append(input_embed)
+            attention_masks.append(attention_mask)
+            labels.append(label)
+        if self.training:  # padding to self.config.multimodal_max_length for increased training speed
+            padding_size = max(0, self.config.multimodal_max_length - len(input_embeds[0]))
+            input_embeds[0] = torch.nn.ConstantPad2d((0, 0, 0, padding_size), 0.0)(input_embeds[0])
+            attention_masks[0] = torch.nn.ConstantPad1d((0, padding_size), False)(attention_masks[0])
+            labels[0] = torch.nn.ConstantPad1d((0, padding_size), IGNORE_ID)(labels[0])
+        batch_input_embeds = torch.nn.utils.rnn.pad_sequence(input_embeds, batch_first=True, padding_value=0.0)[:,
+                             :self.config.multimodal_max_length, :]
+        batch_attention_mask = torch.nn.utils.rnn.pad_sequence(attention_masks, batch_first=True, padding_value=False)[
+                               :,
+                               :self.config.multimodal_max_length]
+        batch_labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_ID)[:,
+                       :self.config.multimodal_max_length]
+        return visual_input_ids, batch_input_embeds, batch_labels, batch_attention_mask
+    def preprocess_inputs(
+        self,
+        text_or_conversations: Union[List[Dict], str],
+        images: Optional[List[PIL.Image.Image]],
+        max_partition=9,
+        generation_preface='',
+        return_labels=False,
+        propagate_exception=True
+    ):
+        # convert text to conversations
+        if isinstance(text_or_conversations, str):
+            conversations = [{
+                "from": "human",
+                "value": text_or_conversations
+            }]
+        elif isinstance(text_or_conversations, list):
+            conversations = text_or_conversations
+        else:
+            raise ValueError(f'Invalid type of `text_or_conversations`, expected `List[Dict]` or `str`,'
+                             f' but got {type(text_or_conversations)}')
+        # format conversations
+        prompt, raw_input_ids, raw_labels = self.get_conversation_formatter().format(
+            conversations, generation_preface=generation_preface)
+        # place image placeholders
+        input_ids = []
+        labels = []
+        pixel_values = []
+        invalidate_label = False
+        image_token_indices = [i for i, v in enumerate(raw_input_ids) if v == IMAGE_TOKEN_ID]
+        last_image_token_index = -1
+        for i in range(len(image_token_indices)):
+            head = 0 if i == 0 else image_token_indices[i - 1] + 1
+            tail = image_token_indices[i]
+            last_image_token_index = tail
+            input_ids.extend(raw_input_ids[head:tail])
+            labels.extend(raw_labels[head:tail])
+            try:
+                image = images[i]
+                raw_pixel_values, image_placeholders = self.visual_tokenizer.preprocess_image(
+                    image, max_partition=max_partition)
+            except Exception as e:
+                if propagate_exception:
+                    raise e
+                logging.exception(e)
+                invalidate_label = True
+                raw_pixel_values, image_placeholders = self.visual_tokenizer.mock_input()
+            input_ids.extend(image_placeholders)
+            labels.extend([IGNORE_ID] * len(image_placeholders))
+            pixel_values.append(raw_pixel_values)
+        input_ids.extend(raw_input_ids[last_image_token_index + 1:])
+        labels.extend(raw_labels[last_image_token_index + 1:])
+        # return tensors
+        input_ids = torch.tensor(input_ids, dtype=torch.long)
+        labels = torch.tensor([IGNORE_ID] * len(labels) if invalidate_label else labels, dtype=torch.long)
+        pixel_values = torch.cat(pixel_values, dim=0) if len(pixel_values) > 0 else None
+        if return_labels:
+            return prompt, input_ids, pixel_values, labels
+        else:
+            return prompt, input_ids, pixel_values
+    def save_pretrained(
+        self,
+        save_directory: Union[str, os.PathLike],
+        is_main_process: bool = True,
+        state_dict: Optional[dict] = None,
+        save_function: Callable = torch.save,
+        push_to_hub: bool = False,
+        max_shard_size: Union[int, str] = "5GB",
+        safe_serialization: bool = True,
+        variant: Optional[str] = None,
+        token: Optional[Union[str, bool]] = None,
+        save_peft_format: bool = True,
+        **kwargs
+    ):
+        super().save_pretrained(save_directory,
+                                is_main_process=is_main_process,
+                                state_dict=state_dict,
+                                save_function=save_function,
+                                safe_serialization=safe_serialization)
+        self.get_text_tokenizer().save_pretrained(save_directory)
+        self.get_visual_tokenizer().get_image_processor().save_pretrained(save_directory)
+        # Uncommenting the following lines additionally saves a separate copy of the visual tokenizer
+        # visual_tokenizer_directory = os.path.join(save_directory, 'visual_tokenizer')
+        # self.get_visual_tokenizer().save_pretrained(visual_tokenizer_directory,
+        #                                             is_main_process=is_main_process,
+        #                                             state_dict=None,
+        #                                             save_function=save_function,
+        #                                             safe_serialization=safe_serialization)
+        # self.get_visual_tokenizer().get_image_processor().save_pretrained(visual_tokenizer_directory)
+    def _get_hybrid_cache_for_llm(self, batch_size: int, max_cache_len: int):
+        cache_cls = HybridCache
+        llm = self.get_llm()
+        if version.parse(transformers.__version__) >= version.parse("4.46.0"):
+            need_new_cache = (
+                not hasattr(llm, "_cache")
+                or (not isinstance(llm._cache, cache_cls))
+                or llm._cache.batch_size != batch_size
+                or llm._cache.max_cache_len < max_cache_len
+            )
+        else:
+            need_new_cache = (
+                not hasattr(llm, "_cache")
+                or (not isinstance(llm._cache, cache_cls))
+                or llm._cache.max_batch_size != batch_size
+                or llm._cache.max_cache_len < max_cache_len
+            )
+        if need_new_cache:
+            if hasattr(llm.config, "_pre_quantization_dtype"):
+                cache_dtype = llm.config._pre_quantization_dtype
+            else:
+                cache_dtype = llm.dtype
+            if version.parse(transformers.__version__) >= version.parse("4.46.0"):
+                llm._cache = cache_cls(
+                    config=llm.config,
+                    batch_size=batch_size,
+                    max_cache_len=max_cache_len,
+                    device=llm.device,
+                    dtype=cache_dtype,
+                )
+            else:
+                llm._cache = cache_cls(
+                    config=llm.config,
+                    max_batch_size=batch_size,
+                    max_cache_len=max_cache_len,
+                    device=llm.device,
+                    dtype=cache_dtype,
+                )
+        else:
+            llm._cache.reset()
+        return llm._cache
+    # TODO: support batch generation
+    def generate(
+        self,
+        inputs: Optional[torch.Tensor] = None,
+        **kwargs
+    ) -> Union[GenerateOutput, torch.LongTensor]:
+        assert inputs.shape[0] == 1, 'Currently, only support `batch_size=1`'
+        _, inputs_embeds, labels, attention_mask = self.merge_multimodal(
+            text_input_ids=inputs,
+            text_attention_masks=kwargs.pop('attention_mask'),
+            text_labels=None,
+            pixel_values=kwargs.pop('pixel_values')
+        )
+        if getattr(self.generation_config, 'cache_implementation', None) == 'hybrid':  # mainly for Gemma2
+            # `kwargs` is a dict, so `num_beams` must be read with `.get`; `getattr` would always return the default
+            kwargs['past_key_values'] = self._get_hybrid_cache_for_llm(
+                kwargs.get("num_beams", 1), kwargs['max_new_tokens'] + inputs_embeds.shape[-2])
+            self.get_llm()._supports_cache_class = True
+            kwargs['cache_implementation'] = None
+        return self.llm.generate(inputs=None, inputs_embeds=inputs_embeds, attention_mask=attention_mask, **kwargs)
+AutoConfig.register("ovis", OvisConfig)
+AutoModelForCausalLM.register(OvisConfig, Ovis)
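
The splicing loop in `merge_multimodal` above can be hard to follow with tensors in the way. The following is a minimal pure-Python sketch of the same idea on plain lists: each image-atom placeholder in the text sequence is replaced by the corresponding chunk of visual items, and the labels for the inserted positions are set to the ignore value. The constant values `IMAGE_ATOM_ID = -300` and `IGNORE_ID = -100` and the helper name `splice_at_atoms` are assumptions made for illustration; the real constants live in `ovis.util.constants`, which this diff does not show.

```python
from typing import List, Sequence, Tuple

# Hypothetical values for this sketch only; the real ones are defined in
# ovis.util.constants and are not visible in this diff.
IMAGE_ATOM_ID = -300
IGNORE_ID = -100


def splice_at_atoms(
    text_ids: Sequence,
    text_labels: Sequence,
    visual_chunks: Sequence[Sequence],
    atom_id: int = IMAGE_ATOM_ID,
    ignore_id: int = IGNORE_ID,
) -> Tuple[List, List]:
    """Replace the i-th atom placeholder with the i-th visual chunk."""
    merged_ids: List = []
    merged_labels: List = []
    prev = -1  # position of the previously handled atom placeholder
    atom_positions = [i for i, t in enumerate(text_ids) if t == atom_id]
    for index, pos in enumerate(atom_positions):
        # text between the previous placeholder and this one
        merged_ids.extend(text_ids[prev + 1:pos])
        merged_labels.extend(text_labels[prev + 1:pos])
        # the visual chunk takes the placeholder's place; its labels are ignored
        chunk = visual_chunks[index]
        merged_ids.extend(chunk)
        merged_labels.extend([ignore_id] * len(chunk))
        prev = pos
    # trailing text after the last placeholder
    merged_ids.extend(text_ids[prev + 1:])
    merged_labels.extend(text_labels[prev + 1:])
    return merged_ids, merged_labels


ids, labels = splice_at_atoms(
    text_ids=[1, IMAGE_ATOM_ID, 2, 3, IMAGE_ATOM_ID, 4],
    text_labels=[10, IGNORE_ID, 20, 30, IGNORE_ID, 40],
    visual_chunks=[["v0a", "v0b"], ["v1a"]],
)
print(ids)
print(labels)
```

In the model itself the items are embedding rows rather than list elements, and the attention mask is spliced the same way with ones at the visual positions.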

Ovis/ovis/model/visual_tokenizer/base_visual_tokenizer.py ADDED

	@@ -0,0 +1,264 @@

+from typing import Union, Optional
+import PIL.Image
+import torch
+from torch.nn.functional import softmax, gumbel_softmax, pad
+from transformers import PretrainedConfig, PreTrainedModel, AutoImageProcessor, AutoModel, AutoConfig
+from ovis.util.constants import IMAGE_INDICATOR_IDS, IMAGE_ATOM_ID
+class BaseVisualTokenizerConfig(PretrainedConfig):
+    def __init__(
+        self,
+        vocab_size=16384,
+        tokenize_function="softmax",
+        tau=1.0,
+        depths=None,
+        drop_cls_token=False,
+        backbone_config: Optional[Union[PretrainedConfig, dict]] = None,
+        hidden_stride: int = 1,
+        **kwargs
+    ):
+        super().__init__(**kwargs)
+        self.vocab_size = vocab_size
+        self.tokenize_function = tokenize_function
+        self.tau = tau
+        if isinstance(depths, str):
+            depths = [int(x) for x in depths.split('|')]
+        self.depths = depths
+        self.backbone_kwargs = {}
+        self.drop_cls_token = drop_cls_token
+        if backbone_config is not None:
+            assert isinstance(backbone_config, (PretrainedConfig, dict)), \
+                f"expect `backbone_config` to be instance of PretrainedConfig or dict, but got {type(backbone_config)} type"
+            if not isinstance(backbone_config, PretrainedConfig):
+                model_type = backbone_config.pop('model_type')
+                backbone_config = AutoConfig.for_model(model_type, **backbone_config)
+        self.backbone_config = backbone_config
+        self.hidden_stride = hidden_stride
+class BaseVisualTokenizer(PreTrainedModel):
+    base_model_prefix = "backbone"
+    main_input_name = None
+    _image_processor_class = None
+    _image_processor_kwargs = {}
+    _backbone_class = None
+    _backbone_name_or_path = None
+    def __init__(self, config: BaseVisualTokenizerConfig, *inputs, **kwargs):
+        super().__init__(config, *inputs, **kwargs)
+        if kwargs.get('train_from_scratch'):
+            self.image_processor = self._image_processor_class.from_pretrained(self._backbone_name_or_path,
+                                                                               **self._image_processor_kwargs)
+            self.backbone = self._backbone_class.from_pretrained(self._backbone_name_or_path,
+                                                                 **self.config.backbone_kwargs)
+            self.config.backbone_config = self.backbone.config
+        else:
+            self.image_processor = AutoImageProcessor.from_pretrained(kwargs['image_processor_name_or_path'])
+            self.backbone = AutoModel.from_config(self.config.backbone_config)
+        head_dim = self.config.vocab_size - len(IMAGE_INDICATOR_IDS)  # reserved tokens for IMAGE_INDICATORS
+        self.head = torch.nn.Sequential(
+            torch.nn.Linear(
+                self.backbone.config.hidden_size * self.config.hidden_stride * self.config.hidden_stride, head_dim,
+                bias=False
+            ),
+            torch.nn.LayerNorm(head_dim)
+        )
+        assert all((self.image_processor.do_resize,
+                    not getattr(self.image_processor, 'do_center_crop', False),
+                    self.image_processor.do_rescale,
+                    self.image_processor.do_normalize
+                    )), f"image_processor `{self.image_processor}` is not supported currently"
+    def get_backbone(self):
+        return self.backbone
+    def get_monitor_tensors(self):
+        raise NotImplementedError
+    def get_image_processor(self):
+        return self.image_processor
+    def mock_input(self):
+        height, width = self.get_image_size()
+        return torch.zeros(1, 3, height, width), self.construct_image_placeholders((1, 1))
+    def get_head(self):
+        return self.head
+    def get_image_size(self):
+        raise NotImplementedError
+    @staticmethod
+    def construct_image_placeholders(grid):
+        image_placeholders = [IMAGE_INDICATOR_IDS[0], IMAGE_ATOM_ID, IMAGE_INDICATOR_IDS[1]]
+        if grid[0] * grid[1] > 1:
+            for r in range(grid[0]):
+                for c in range(grid[1]):
+                    image_placeholders.append(IMAGE_ATOM_ID)
+                    if c < grid[1] - 1:
+                        image_placeholders.append(IMAGE_INDICATOR_IDS[2])
+                if r < grid[0] - 1:
+                    image_placeholders.append(IMAGE_INDICATOR_IDS[3])
+        image_placeholders.append(IMAGE_INDICATOR_IDS[4])
+        return image_placeholders
+    def preprocess_image(self, image: PIL.Image.Image, max_partition=9, covering_threshold=0.9, convert_to_rgb=True):
+        def _preprocess(img: PIL.Image.Image, side):
+            # first resize and preprocess
+            w, h = img.size
+            if w == h:
+                new_width = new_height = side
+            elif w > h:
+                new_width = side
+                new_height = int(h / w * new_width)
+            else:
+                new_height = side
+                new_width = int(w / h * new_height)
+            new_size = dict(height=new_height, width=new_width)
+            pixel_values = self.image_processor.preprocess(img, size=new_size, return_tensors='pt')['pixel_values']
+            # then pad to square
+            square_values = torch.zeros([1, 3, side, side], dtype=pixel_values.dtype, device=pixel_values.device)
+            new_height, new_width = pixel_values.shape[2:]
+            if new_height == new_width:
+                square_values[:, :, :, :] = pixel_values
+            elif new_height > new_width:
+                from_index = (side - new_width) // 2
+                square_values[:, :, :, from_index:from_index + new_width] = pixel_values
+            else:
+                from_index = (side - new_height) // 2
+                square_values[:, :, from_index:from_index + new_height, :] = pixel_values
+            return square_values
+        def _partition(img, grid):
+            w, h = img.size
+            row_height = h // grid[0]
+            col_width = w // grid[1]
+            partition = []
+            for row in range(grid[0]):
+                for col in range(grid[1]):
+                    left = col * col_width
+                    upper = row * row_height
+                    right = w if col == grid[1] - 1 else (col + 1) * col_width
+                    lower = h if row == grid[0] - 1 else (row + 1) * row_height
+                    partition.append((left, upper, right, lower))
+            return partition
+        def _covering_area(left, upper, right, lower, side):
+            w = right - left
+            h = lower - upper
+            w, h = max(w, h), min(w, h)
+            if w > side:
+                h = h / w * side
+                w = side
+            return w * h
+        def _get_best_grid(img, side):
+            img_area = img.size[0] * img.size[1]
+            candidate_grids = []
+            for i in range(1, max_partition + 1):
+                for j in range(1, max_partition + 1):
+                    if i * j <= max_partition:
+                        candidate_grids.append((i, j))
+            all_grids = []
+            good_grids = []
+            for grid in candidate_grids:
+                partition = _partition(img, grid)
+                covering_ratio = sum([_covering_area(*p, side) for p in partition]) / img_area
+                assert covering_ratio <= 1.0
+                all_grids.append((grid, covering_ratio))
+                if covering_ratio > covering_threshold:
+                    good_grids.append((grid, covering_ratio))
+            if len(good_grids) > 0:
+                # pick the good partition with minimum #sub_images and break the tie using covering_ratio
+                return sorted(good_grids, key=lambda x: (x[0][0] * x[0][1], -x[1]))[0][0]
+            else:
+                # pick the partition with maximum covering_ratio and break the tie using #sub_images
+                return sorted(all_grids, key=lambda x: (-x[1], x[0][0] * x[0][1]))[0][0]
+        if convert_to_rgb and image.mode != 'RGB':
+            image = image.convert('RGB')
+        sides = self.get_image_size()
+        if sides[0] != sides[1]:
+            raise ValueError('get_image_size() returns non-square size')
+        side = sides[0]
+        grid = _get_best_grid(image, side)
+        partition = _partition(image, grid)
+        crops = [image.crop(p) for p in partition]
+        if len(crops) > 1:
+            crops.insert(0, image)
+        pixel_values = torch.cat([_preprocess(crop, side) for crop in crops], dim=0)
+        image_placeholders = self.construct_image_placeholders(grid)
+        return pixel_values, image_placeholders
+    def get_backbone_layer(self, index):
+        return self.backbone.vision_model.encoder.layers[index]
+    def tokenize(self, logits):
+        def st_argmax(y_soft, dim):  # straight-through argmax: one-hot in the forward pass, gradient of y_soft in the backward pass
+            index = y_soft.max(dim, keepdim=True)[1]
+            y_hard = torch.zeros_like(y_soft, memory_format=torch.legacy_contiguous_format).scatter_(dim, index, 1.0)
+            ret = y_hard - y_soft.detach() + y_soft
+            return ret
+        if self.config.tokenize_function == 'softmax':
+            tokens = softmax(logits, dim=-1)
+        elif self.config.tokenize_function == 'gumbel_argmax':
+            tokens = gumbel_softmax(logits, tau=self.config.tau, hard=True)
+        elif self.config.tokenize_function == 'st_argmax':
+            tokens = st_argmax(logits, dim=-1)
+        else:
+            raise ValueError(
+                f'Invalid `tokenize_function`, expected softmax or gumbel_argmax or st_argmax, but got {self.config.tokenize_function}')
+        return tokens
+    def encode(self, pixel_values):
+        output = self.backbone(pixel_values, output_hidden_states=True, return_dict=True)
+        features = output.hidden_states[-1]
+        if self.config.drop_cls_token:
+            features = features[:, 1:, :]
+        # merge number of `hidden_stride * hidden_stride` hidden states together to reduce token sequence length
+        # e.g., for hidden_stride=3, this leads to a token length reduction: 729 -> 81 for siglip
+        if self.config.hidden_stride > 1:
+            n, l, d = features.shape  # this `d` may be different from the above `d`
+            sqrt_l = int(l ** 0.5)
+            assert sqrt_l ** 2 == l, "The token sequence length should be a perfect square."
+            features = features.reshape(n, sqrt_l, sqrt_l, d)
+            pl = (self.config.hidden_stride - (sqrt_l % self.config.hidden_stride)) % self.config.hidden_stride
+            features = pad(features, (0, 0, 0, pl, 0, pl), "constant", 0)
+            sqrt_l += pl
+            features = features.reshape(n, sqrt_l // self.config.hidden_stride, self.config.hidden_stride,
+                                        sqrt_l // self.config.hidden_stride, self.config.hidden_stride, d)
+            features = features.permute(0, 1, 3, 2, 4, 5)  # [n, sqrt_l/hs, sqrt_l/hs, hs, hs, d]
+            features = features.flatten(3)  # [n, sqrt_l/hs, sqrt_l/hs, hs*hs*d]
+            features = features.reshape(
+                n, -1, self.config.hidden_stride * self.config.hidden_stride * d)
+        return features
+    def forward(self, pixel_values) -> torch.Tensor:  # [BatchSize, ImageShape] -> [BatchSize, #Token, VocabSize]
+        features = self.encode(pixel_values)
+        logits = self.head(features)
+        tokens = self.tokenize(logits)
+        # tokens' shape is [BatchSize, #Token, VocabSize-5], so padding with [BatchSize, #Token, 5], after
+        # which, tokens' shape should become [BatchSize, #Token, VocabSize]
+        batch_size, token_len, _ = tokens.shape
+        padding_tensor = torch.zeros(size=(batch_size, token_len, len(IMAGE_INDICATOR_IDS)),
+                                     dtype=tokens.dtype,
+                                     device=tokens.device,
+                                     layout=tokens.layout,
+                                     requires_grad=False)
+        tokens = torch.cat((tokens, padding_tensor), dim=2)
+        return tokens
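
The grid selection inside `preprocess_image` above depends only on the image width and height, so it can be tried out without PIL or torch. The sketch below ports `_partition`, `_covering_area` and `_get_best_grid` to standalone functions that take `(w, h)` directly; the function names and the example sizes are chosen for illustration, and the source's `assert covering_ratio <= 1.0` sanity check is left out. A grid is `(rows, cols)`, as in the source.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]
Grid = Tuple[int, int]


def partition(w: int, h: int, grid: Grid) -> List[Box]:
    """Split a w x h image into grid[0] rows and grid[1] columns of crop boxes."""
    rows, cols = grid
    row_height, col_width = h // rows, w // cols
    boxes = []
    for row in range(rows):
        for col in range(cols):
            left, upper = col * col_width, row * row_height
            # the last column / row absorbs the remainder pixels
            right = w if col == cols - 1 else (col + 1) * col_width
            lower = h if row == rows - 1 else (row + 1) * row_height
            boxes.append((left, upper, right, lower))
    return boxes


def covering_area(left: int, upper: int, right: int, lower: int, side: int) -> float:
    """Area of the crop that survives once its longer edge is shrunk to `side`."""
    w, h = right - left, lower - upper
    w, h = max(w, h), min(w, h)
    if w > side:
        h = h / w * side
        w = side
    return w * h


def best_grid(w: int, h: int, side: int, max_partition: int = 9,
              covering_threshold: float = 0.9) -> Grid:
    img_area = w * h
    candidates = [(i, j)
                  for i in range(1, max_partition + 1)
                  for j in range(1, max_partition + 1)
                  if i * j <= max_partition]
    scored = [(g, sum(covering_area(*box, side) for box in partition(w, h, g)) / img_area)
              for g in candidates]
    good = [item for item in scored if item[1] > covering_threshold]
    if good:
        # fewest sub-images first, ties broken by higher covering ratio
        return sorted(good, key=lambda x: (x[0][0] * x[0][1], -x[1]))[0][0]
    # otherwise highest covering ratio, ties broken by fewer sub-images
    return sorted(scored, key=lambda x: (-x[1], x[0][0] * x[0][1]))[0][0]


print(best_grid(448, 448, side=448))   # square image that already fits
print(best_grid(896, 448, side=448))   # wide image
print(best_grid(448, 1344, side=448))  # tall image
```

With a side of 448, a square 448x448 image needs no partitioning, a 896x448 image is best served by one row of two columns, and a 448x1344 image by three rows of one column, because those grids yield crops that fit the backbone input without downscaling.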

Ovis/ovis/model/visual_tokenizer/clip_visual_tokenizer.py ADDED

	@@ -0,0 +1,41 @@

+from transformers import AutoConfig, AutoModel
+from transformers import CLIPVisionModel, CLIPImageProcessor
+from .base_visual_tokenizer import BaseVisualTokenizerConfig, BaseVisualTokenizer
+MODEL_TYPE = "clip_visual_tokenizer"
+class ClipVisualTokenizerConfig(BaseVisualTokenizerConfig):
+    model_type = MODEL_TYPE
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        if self.depths:
+            assert len(self.depths) == 1
+            self.backbone_kwargs['num_hidden_layers'] = self.depths[0]
+class ClipVisualTokenizer(BaseVisualTokenizer):
+    config_class = ClipVisualTokenizerConfig
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["CLIPEncoderLayer"]
+    _image_processor_class = CLIPImageProcessor
+    _image_processor_kwargs = dict(do_center_crop=False)
+    _backbone_class = CLIPVisionModel
+    _backbone_name_or_path = "openai/clip-vit-large-patch14-336"
+    def get_monitor_tensors(self):
+        return dict(
+            backbone_bottom=self.backbone.vision_model.encoder.layers[0].self_attn.k_proj.weight,
+            backbone_top=self.backbone.vision_model.encoder.layers[-1].self_attn.out_proj.weight,
+            head=self.head[0].weight
+        )
+    def get_image_size(self):
+        height = self.image_processor.crop_size["height"]
+        width = self.image_processor.crop_size["width"]
+        return height, width
+AutoConfig.register(MODEL_TYPE, ClipVisualTokenizerConfig)
+AutoModel.register(ClipVisualTokenizerConfig, ClipVisualTokenizer)

Ovis/ovis/model/visual_tokenizer/siglip_visual_tokenizer.py ADDED

	@@ -0,0 +1,43 @@

+from transformers import AutoConfig, AutoModel
+from transformers import SiglipVisionModel, SiglipImageProcessor
+from .base_visual_tokenizer import BaseVisualTokenizerConfig, BaseVisualTokenizer
+MODEL_TYPE = "siglip_visual_tokenizer"
+class SiglipVisualTokenizerConfig(BaseVisualTokenizerConfig):
+    model_type = MODEL_TYPE
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        if self.drop_cls_token:
+            self.drop_cls_token = False
+        if self.depths:
+            assert len(self.depths) == 1
+            self.backbone_kwargs['num_hidden_layers'] = self.depths[0]
+class SiglipVisualTokenizer(BaseVisualTokenizer):
+    config_class = SiglipVisualTokenizerConfig
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["SiglipVisionTransformer"]
+    _image_processor_class = SiglipImageProcessor
+    _image_processor_kwargs = {}
+    _backbone_class = SiglipVisionModel
+    _backbone_name_or_path = "google/siglip-so400m-patch14-384"
+    def get_monitor_tensors(self):
+        return dict(
+            backbone_bottom=self.backbone.vision_model.encoder.layers[0].self_attn.k_proj.weight,
+            backbone_top=self.backbone.vision_model.encoder.layers[-1].self_attn.out_proj.weight,
+            head=self.head[0].weight
+        )
+    def get_image_size(self):
+        height = self.image_processor.size["height"]
+        width = self.image_processor.size["width"]
+        return height, width
+AutoConfig.register(MODEL_TYPE, SiglipVisualTokenizerConfig)
+AutoModel.register(SiglipVisualTokenizerConfig, SiglipVisualTokenizer)
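
The config overrides in the two tokenizer configs above can be exercised in isolation. The sketch below uses hypothetical stand-in classes (`TinyVisualTokenizerConfig`, `TinySiglipConfig`) instead of the real `BaseVisualTokenizerConfig`, whose constructor is not shown in this part of the diff. It assumes that `depths` arrives as a list and `backbone_kwargs` as a dict, and only mirrors the logic visible in `SiglipVisualTokenizerConfig.__init__`.

```python
class TinyVisualTokenizerConfig:
    """Hypothetical stand-in for BaseVisualTokenizerConfig (not the real class)."""

    def __init__(self, depths=None, drop_cls_token=False, backbone_kwargs=None):
        self.depths = depths
        self.drop_cls_token = drop_cls_token
        self.backbone_kwargs = dict(backbone_kwargs or {})


class TinySiglipConfig(TinyVisualTokenizerConfig):
    """Mirrors the overrides visible in SiglipVisualTokenizerConfig.__init__."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # The SigLIP variant always switches drop_cls_token off, whatever was passed in.
        if self.drop_cls_token:
            self.drop_cls_token = False
        # A single requested depth is forwarded to the backbone as num_hidden_layers.
        if self.depths:
            assert len(self.depths) == 1
            self.backbone_kwargs["num_hidden_layers"] = self.depths[0]


if __name__ == "__main__":
    cfg = TinySiglipConfig(depths=[26], drop_cls_token=True)
    print(cfg.drop_cls_token, cfg.backbone_kwargs)
```

The CLIP config in the diff applies the same `depths` rule but leaves `drop_cls_token` untouched, presumably because only the CLIP backbone emits a CLS token that can be dropped.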

Ovis/ovis/serve/__pycache__/runner.cpython-310.pyc ADDED

Binary file (3.45 kB).

Ovis/ovis/serve/__pycache__/runner.cpython-311.pyc ADDED

Binary file (6.61 kB).

Ovis/ovis/train/dataset/__init__.py ADDED

File without changes

Ovis/ovis/train/dataset/caption_dataset.py ADDED

	@@ -0,0 +1,67 @@

+import logging
+from datetime import datetime
+from typing import Dict
+import pandas
+import torch
+from ovis.train.dataset.multimodal_dataset import MultimodalDataset
+from ovis.util.constants import IMAGE_TOKEN, IGNORE_ID
+from ovis.util.utils import rank0_print
+class CaptionDataset(MultimodalDataset):
+    def load(self):
+        rank0_print(f"[{datetime.now()}] Loading dataset {self.name} from {self.meta_file} begin")
+        samples = pandas.read_parquet(self.meta_file, engine='pyarrow')
+        rank0_print(f"[{datetime.now()}] Loading dataset {self.name} end")
+        return samples
+    def __getitem__(self, i: int) -> Dict[str, torch.Tensor]:
+        sample = self.samples.iloc[i]
+        text = sample['caption']
+        image_path = sample['image_path']
+        # read and preprocess image
+        pixel_values, image_placeholders = self.visual_tokenizer.mock_input()
+        valid_image = False
+        image, e = self.read_image(image_path)
+        if image is None:
+            logging.warning(
+                f'reading image failed with index: {i}, image path: {image_path}, and exception: {e}')
+        else:
+            try:
+                pixel_values, image_placeholders = self.visual_tokenizer.preprocess_image(
+                    image, max_partition=self.max_partitions[0])
+                valid_image = True
+            except Exception as e:
+                logging.warning(
+                    f'preprocessing image failed with index: {i}, image path: {image_path}, and exception: {e}')
+        # preprocess text
+        if text is None:
+            logging.warning(f'text is `None`, index: {i}')
+            text = ""
+        if not valid_image:
+            logging.warning(f'image is not valid, so set text as empty, index: {i}, image path: {image_path}')
+            text = ""
+        text = text.replace(IMAGE_TOKEN, '').strip()
+        head, tail = self.caption_template.split(IMAGE_TOKEN)
+        head_ids = self.text_tokenizer(head, add_special_tokens=False).input_ids
+        tail_ids = self.text_tokenizer(tail, add_special_tokens=False).input_ids
+        text_ids = self.text_tokenizer(text, add_special_tokens=False).input_ids
+        input_ids = head_ids + image_placeholders + tail_ids + text_ids
+        labels = [IGNORE_ID] * (len(input_ids) - len(text_ids)) + text_ids
+        input_ids = input_ids[:self.text_max_length]
+        labels = labels[:self.text_max_length]
+        input_ids = torch.tensor(input_ids, dtype=torch.long)
+        labels = torch.tensor(labels, dtype=torch.long)
+        return dict(
+            pixel_values=pixel_values,
+            input_ids=input_ids,
+            labels=labels
+        )
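
The label construction in `CaptionDataset.__getitem__` above supervises only the caption tokens: the template head, the image placeholders, and the template tail are all masked. A minimal sketch with made-up token IDs follows. The value `IGNORE_ID = -100` is an assumption (the constant is imported from `ovis.util.constants` and its value is not shown in this diff), and `build_caption_example` is a hypothetical helper name.

```python
IGNORE_ID = -100  # assumption: the real value lives in ovis.util.constants


def build_caption_example(head_ids, image_placeholders, tail_ids, text_ids, text_max_length):
    """Concatenate prompt and caption IDs, masking everything except the caption."""
    input_ids = head_ids + image_placeholders + tail_ids + text_ids
    # Only the trailing caption tokens contribute to the loss.
    labels = [IGNORE_ID] * (len(input_ids) - len(text_ids)) + text_ids
    # Truncation is applied to both sequences in the same way.
    return input_ids[:text_max_length], labels[:text_max_length]


if __name__ == "__main__":
    ids, labels = build_caption_example(
        head_ids=[1, 2],
        image_placeholders=[900, 901, 902],  # illustrative placeholder IDs
        tail_ids=[3],
        text_ids=[10, 11, 12],
        text_max_length=8,
    )
    print(ids)
    print(labels)
```

When the image cannot be read or preprocessed, the dataset sets the caption to an empty string, so `text_ids` is empty and every label in that sample is `IGNORE_ID`.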

Ovis/ovis/train/dataset/conversation_dataset.py ADDED

	@@ -0,0 +1,67 @@

+import copy
+import json
+import logging
+from datetime import datetime
+from typing import Dict
+import torch
+from ovis.train.dataset.multimodal_dataset import MultimodalDataset
+from ovis.util.utils import rank0_print
+class ConversationDataset(MultimodalDataset):
+    def load(self):
+        rank0_print(f"[{datetime.now()}] Loading dataset {self.name} from {self.meta_file} begin")
+        with open(self.meta_file, 'r', encoding='utf-8') as f:
+            samples = json.load(f)
+        rank0_print(f'#samples: {len(samples)}')
+        rank0_print(f'sample: {samples[0]}')
+        rank0_print(f"[{datetime.now()}] Loading dataset {self.name} end")
+        return samples
+    def __getitem__(self, i: int) -> Dict[str, torch.Tensor]:
+        sample = self.samples[i]
+        conversations = copy.deepcopy(sample["conversations"])
+        images = None
+        max_partition = None
+        if 'image' in sample:
+            image_paths = sample['image']
+            if isinstance(image_paths, str):
+                image_paths = [image_paths]
+            images = []
+            for image_path in image_paths:
+                image, e = self.read_image(image_path)
+                if image is None:
+                    logging.warning(
+                        f'reading image failed with index: {i}, image path: {image_path}, and exception: {e}')
+                    images = None
+                    break
+                images.append(image)
+        elif 'video' in sample:
+            raise RuntimeError('video input is not supported yet')
+        if images:
+            max_partition = self.max_partitions[0] if len(images) == 1 else self.max_partitions[1]
+        prompt, input_ids, pixel_values, labels = self.model.preprocess_inputs(
+            conversations,
+            images,
+            max_partition=max_partition,
+            generation_preface=None,
+            return_labels=True,
+            propagate_exception=False
+        )
+        if pixel_values is None:
+            pixel_values, _ = self.visual_tokenizer.mock_input()
+        input_ids = input_ids[:self.text_max_length]
+        labels = labels[:self.text_max_length]
+        return dict(
+            pixel_values=pixel_values,
+            input_ids=input_ids,
+            labels=labels
+        )
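
The image handling in `ConversationDataset.__getitem__` above reduces to two small decisions: normalise the `image` field to a list, and choose a partition limit depending on whether the sample has one image or several. The sketch below restates that logic in plain Python. The helper names are hypothetical, and the spec string `"9|1"` is an illustrative value, not a documented default.

```python
def parse_max_partitions(spec: str) -> list:
    """Parse a '|'-separated spec, as done in MultimodalDataset.__init__."""
    return [int(m.strip()) for m in spec.split('|')]


def normalize_image_paths(image_field) -> list:
    """Accept either a single path or a list of paths."""
    return [image_field] if isinstance(image_field, str) else list(image_field)


def select_max_partition(num_images: int, max_partitions: list):
    """First entry for single-image samples, second entry for multi-image samples."""
    if num_images == 0:
        return None  # text-only sample, or image loading failed
    return max_partitions[0] if num_images == 1 else max_partitions[1]


if __name__ == "__main__":
    limits = parse_max_partitions("9 | 1")
    print(limits)
    print(select_max_partition(len(normalize_image_paths("a.jpg")), limits))
    print(select_max_partition(len(normalize_image_paths(["a.jpg", "b.jpg"])), limits))
```

Because the multi-image branch indexes the second entry, the spec needs at least two values whenever the data contains samples with more than one image.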

Ovis/ovis/train/dataset/multimodal_dataset.py ADDED

	@@ -0,0 +1,72 @@

+import logging
+import os
+from typing import Dict, Sequence, Union, List
+import torch
+from PIL import Image
+from torch.utils.data import Dataset
+from transformers import PreTrainedTokenizer
+from ovis.model.modeling_ovis import Ovis
+from ovis.train.arguments import TrainingArguments
+from ovis.util.constants import IGNORE_ID
+class MultimodalDataset(Dataset):
+    def __init__(self, name: str, info: Dict, model: Ovis, training_args: TrainingArguments):
+        self.name = name
+        self.meta_file = info['meta_file']
+        self.image_dir = info['image_dir']
+        self.caption_template = info.get('caption_template', None)
+        self.text_tokenizer = model.get_text_tokenizer()
+        self.visual_tokenizer = model.get_visual_tokenizer()
+        self.image_height, self.image_width = self.visual_tokenizer.get_image_size()
+        self.model = model
+        self.text_max_length = training_args.text_max_length
+        self.max_partitions = [int(m.strip()) for m in training_args.max_partitions.split('|')]
+        self.samples = self.load()
+    def load(self):
+        raise NotImplementedError
+    def __getitem__(self, i: int) -> Dict[str, torch.Tensor]:
+        raise NotImplementedError
+    def __len__(self):
+        return len(self.samples)
+    def read_image(self, path):
+        try:
+            full_path = os.path.join(self.image_dir, path)
+            image = Image.open(full_path).convert('RGB')
+            return image, None
+        except Exception as e:
+            return None, e
+class DataCollatorForMultimodalDataset:
+    def __init__(self, text_tokenizer: PreTrainedTokenizer):
+        self.text_tokenizer = text_tokenizer
+    def __call__(self, instances: Sequence[Dict]) -> Dict[str, Union[torch.Tensor, List[torch.Tensor]]]:
+        pixel_values, input_ids, labels = tuple([instance[key] for instance in instances]
+                                                for key in ("pixel_values", "input_ids", "labels"))
+        input_ids = torch.nn.utils.rnn.pad_sequence(
+            input_ids,
+            batch_first=True,
+            padding_value=self.text_tokenizer.pad_token_id)
+        attention_mask = torch.ne(input_ids, self.text_tokenizer.pad_token_id)
+        labels = torch.nn.utils.rnn.pad_sequence(
+            labels,
+            batch_first=True,
+            padding_value=IGNORE_ID)
+        num_valid_label = torch.not_equal(labels, IGNORE_ID).sum().item()
+        if num_valid_label == 0:
+            logging.warning(
+                f'[DataCollatorForMultimodalDataset] All labels in a batch are ignored, which may lead to training instability\n{input_ids=}\n{attention_mask=}\n{labels=}')
+        return dict(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            labels=labels,
+            pixel_values=pixel_values
+        )
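
The collator above right-pads `input_ids` with the tokenizer's pad ID, right-pads `labels` with `IGNORE_ID`, and derives the attention mask by comparing against the pad ID. The following sketch emulates that behaviour with plain lists instead of `torch.nn.utils.rnn.pad_sequence`. `IGNORE_ID = -100` and `pad_token_id = 0` are assumptions for illustration, and `pad_batch` and `collate` are hypothetical names.

```python
IGNORE_ID = -100  # assumption: the real value lives in ovis.util.constants


def pad_batch(sequences, padding_value):
    """Right-pad every sequence to the longest one (batch_first layout)."""
    max_len = max(len(s) for s in sequences)
    return [list(s) + [padding_value] * (max_len - len(s)) for s in sequences]


def collate(instances, pad_token_id):
    """List-based emulation of DataCollatorForMultimodalDataset.__call__."""
    input_ids = pad_batch([x["input_ids"] for x in instances], pad_token_id)
    labels = pad_batch([x["labels"] for x in instances], IGNORE_ID)
    # The mask is True wherever the ID differs from the pad ID.
    attention_mask = [[tok != pad_token_id for tok in row] for row in input_ids]
    # Zero here is the condition that triggers the collator's warning.
    num_valid_label = sum(label != IGNORE_ID for row in labels for label in row)
    return dict(input_ids=input_ids, attention_mask=attention_mask,
                labels=labels, num_valid_label=num_valid_label)


if __name__ == "__main__":
    batch = collate(
        [
            {"input_ids": [5, 6, 7], "labels": [IGNORE_ID, 6, 7]},
            {"input_ids": [8], "labels": [IGNORE_ID]},
        ],
        pad_token_id=0,
    )
    print(batch)
```

One consequence of building the mask from the IDs: if the tokenizer reuses a regular token (for example EOS) as its pad token, genuine occurrences of that token in `input_ids` are masked as well. `pixel_values` stays a plain Python list in the real collator, since images can yield different numbers of partitions.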

Ovis/ovis/util/__init__.py ADDED

File without changes

Ovis/ovis/util/__pycache__/__init__.cpython-310.pyc ADDED

Binary file (145 Bytes).

Ovis/ovis/util/__pycache__/__init__.cpython-311.pyc ADDED

Binary file (161 Bytes).

Ovis/ovis/util/__pycache__/constants.cpython-310.pyc ADDED

Binary file (460 Bytes).

Ovis/ovis/util/__pycache__/constants.cpython-311.pyc ADDED

Binary file (503 Bytes).

Ovis/ovis/util/__pycache__/utils.cpython-311.pyc ADDED

Binary file (1.28 kB).

VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/.ipynb_checkpoints/Qwen2-VL-2B-Instruct_DSPAR_MINI_6345-checkpoint.json ADDED

	@@ -0,0 +1 @@

+ "Bức tranh này mô tả một người đàn ông đang đứng trên đường phố. Anh ấy đang nhìn xuống phía dưới, có vẻ như đang tập trung vào một vật gì đó. Anh ấy đang mặc một chiếc áo khoác màu xanh, quần jeans và giày sneaker trắng. Anh ấy có vẻ như đang ở trong một khu vực có nhiều cây xanh."

VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6275.json ADDED

	@@ -0,0 +1 @@

+ "Bức tranh này mô tả một người đàn ông đang đi bộ trên đường. Anh ấy đang đeo một chiếc khẩu trang y tế màu xanh lá cây và đang cầm một chiếc túi xách. Anh ấy có một chiếc đồng hồ đeo tay và một chiếc áo khoác ngắn. Anh ấy đang đi trên một con đường xanh."

VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6278.json ADDED

	@@ -0,0 +1 @@

+ "Bức tranh này mô tả một người đàn ông đang đi bộ trên đường. Anh ấy có một chiếc túi xách trên vai và đang đeo một chiếc kính. Anh ấy có một chiếc quần jeans và đôi giày. Anh ấy có một vẻ mặt buồn bã và đang cố gắng giữ cho mình không rơi vào tầm tay của người khác."

VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6385.json ADDED

	@@ -0,0 +1 @@

+ "Bức tranh này mô tả một người đi bộ trên đường. Người này có một chiếc mũ đen và một chiếc khẩu trang màu đỏ. Người này đang đi bộ trên một con đường xanh."

VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6481.json ADDED

	@@ -0,0 +1 @@

+ "Bức ảnh này mô tả một người đàn ông đang đi bộ trên đường. Anh ấy có mái tóc ngắn, mặc áo hoodie màu xanh nhạt và quần áo thể thao. Anh ấy đang đeo một chiếc túi xách trên vai."

VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6500.json ADDED

	@@ -0,0 +1 @@

+ "Bức tranh này mô tả một người đàn ông đang đi bộ trên đường. Anh ấy có một chiếc áo khoác màu xanh, quần đen và giày trắng. Anh ấy có vẻ như đang đi bộ trên một con đường màu xám."

VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6513.json ADDED

	@@ -0,0 +1 @@

+ "Bức tranh này mô tả một người đi bộ trên đường phố. Người này đang mặc một bộ đồ màu xanh, có một chiếc túi xách trên vai."

VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6515.json ADDED

	@@ -0,0 +1 @@

+ "Bức tranh này mô tả một người phụ nữ đang đứng trên đường phố. Cô ấy đang mặc một chiếc áo thun màu hồng và quần short. Cô ấy có hai chân và hai tay. Cô ấy đang nhìn về phía trước và có vẻ như đang ngắm nhìn hoặc nhìn vào một thứ gì đó. Cô ấy có hai đôi giày màu đen. Bức tranh được chụp ở một góc đường phố."

VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6517.json ADDED

	@@ -0,0 +1 @@

+ "Bức tranh này mô tả một người đàn ông đang đứng trên một bức tường màu xanh. Anh ta đang nhìn về phía trước và có vẻ như đang tập trung vào một điều gì đó. Bức tranh có một màu sắc đơn giản và không có nhiều chi tiết."

VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6521.json ADDED

	@@ -0,0 +1 @@

+ "Bức tranh này mô tả một người phụ nữ đang đi bộ trên đường. Cô ấy đang đeo một chiếc túi xách trên vai và mặc một bộ quần áo màu đen. Cô ấy đang đi trên một con đường có một con đường đi bộ."

VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6533.json ADDED

	@@ -0,0 +1 @@

+ "Bức tranh này mô tả một người đàn ông đang đấm vào người khác trên đường. Người đàn ông đang đấm vào người khác bằng một cây cối."

VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6539.json ADDED

	@@ -0,0 +1 @@

+ "Bức tranh này mô tả một người đàn ông đang đi bộ trên đường. Anh ta đang mặc một bộ đồ bảo hộ lao động màu xanh dương, có một chiếc áo khoác và quần áo dài. Anh ta đang đứng trên một con phố có gạch cống và đường đi xước."

VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6580.json ADDED

	@@ -0,0 +1 @@

+ "Bức tranh này mô tả một người phụ nữ đang đi bộ trên đường. Cô ấy đang đeo một chiếc khẩu trang y tế màu xanh lá cây. Cô ấy đang mặc một chiếc áo khoác màu trắng, quần đen và giày sneaker màu trắng. Cô ấy cũng đang đeo một chiếc túi đeo chéo trên vai."

VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6589.json ADDED

	@@ -0,0 +1 @@

+ "Bức tranh này mô tả một người đàn ông đang cố gắng giữ lại tóc của mình. Anh ta đang cố gắng giữ lại tóc của mình bằng tay. Anh ta đang mặc một chiếc áo màu trắng và quần short màu xám. Anh ta cũng đang đeo một chiếc đồng hồ trên tay."

VLMEvalKit_old/outputs/Qwen2-VL-2B-Instruct/DSPAR_MINI/2024-12-24_08-17-11/Qwen2-VL-2B-Instruct/T2024-12-24_08-17-20_Gd57daa89/json/Qwen2-VL-2B-Instruct_DSPAR_MINI_6608.json ADDED

	@@ -0,0 +1 @@

+ "Bức tranh này mô tả một người đàn ông đang đi bộ trên đường. Anh ấy đang mặc một chiếc áo thun màu xám và một chiếc áo vest màu đỏ. Anh ấy đang cầm một chiếc túi xách màu trắng. Anh ấy có một chiếc kính kính màu đen và một chiếc mũ nón màu đen. Anh ấy có một chiếc điện thoại di động trong túi xách. Anh ấy có một chiếc áo dài màu đỏ và một chiếc áo dài màu xanh. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. 
Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo d��i màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. 
Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. 
Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc áo dài màu xanh và một chiếc áo dài màu đỏ. Anh ấy có một chiếc"