Ivy1997 and lbourdois committed
Commit a84e17f · verified · 1 Parent(s): c0140fc

Improve language tag (#3)


- Improve language tag (df7c3ebdc71ce58b78d1ced1d45363a471327215)


Co-authored-by: Loïck BOURDOIS <[email protected]>

Files changed (1)
  1. README.md +140 -129
README.md CHANGED
@@ -1,130 +1,141 @@
- ---
- license: apache-2.0
- base_model:
- - Qwen/Qwen2.5-3B-Instruct
- - google/siglip-so400m-patch14-384
- tags:
- - multimodal
- - llava
- language:
- - en
- - zh
- pipeline_tag: visual-question-answering
- library_name: transformers
- ---
+ ---
+ license: apache-2.0
+ base_model:
+ - Qwen/Qwen2.5-3B-Instruct
+ - google/siglip-so400m-patch14-384
+ tags:
+ - multimodal
+ - llava
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ pipeline_tag: visual-question-answering
+ library_name: transformers
+ ---

![logo.jpg](logo.jpg)

<code>Ivy-VL</code> is a lightweight multimodal model with only 3B parameters. It accepts both image and text inputs and generates text outputs.

Thanks to its lightweight design, it can be deployed on edge devices such as AI glasses and smartphones, offering low memory usage and high speed while maintaining strong performance on multimodal tasks. Well-known small models in this class include [PaliGemma 3B](https://huggingface.co/google/paligemma-3b-mix-448), [Moondream2](https://huggingface.co/vikhyatk/moondream2), [Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B), [InternVL2-2B](https://huggingface.co/OpenGVLab/InternVL2-2B), and [InternVL2_5-2B](https://huggingface.co/OpenGVLab/InternVL2_5-2B); Ivy-VL outperforms them on multiple benchmarks.

# Model Summary:

* Developed by: AI Safeguard, CMU, and Stanford

* Model type: Multimodal model (image + text)

* Languages: English and Chinese

* License: Apache 2.0

* Architecture: Based on LLaVA-One-Vision (see the config-check sketch after this list)

* LLM: Qwen/Qwen2.5-3B-Instruct

* Vision Encoder: google/siglip-so400m-patch14-384

* Notebook demo: [Ivy-VL-demo.ipynb](https://colab.research.google.com/drive/1D5_8sDRcP1HKlWtlqTH7s64xG8OH9NH0?usp=sharing)

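If you want to confirm that a downloaded checkpoint wires together the components listed above, its `config.json` can be inspected directly. The snippet below is a minimal sketch and not part of the original card; in particular, the `mm_vision_tower` key is an assumption based on how LLaVA-style configs usually record the vision encoder.

```python
# Minimal sketch: read the checkpoint's config and print the fields that should
# match the Model Summary. The "mm_vision_tower" key is assumed from LLaVA-style
# configs and may be named differently in this checkpoint.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(repo_id="AI-Safeguard/Ivy-VL-llava", filename="config.json")
with open(config_path) as f:
    cfg = json.load(f)

print(cfg.get("model_type"))       # LLaVA/Qwen2-based model type
print(cfg.get("mm_vision_tower"))  # expected: google/siglip-so400m-patch14-384
```

Since `hf_hub_download` fetches only the requested file, this check does not require downloading the full model weights.
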
# Evaluation:

![evaluation.jpg](evaluation.jpg)

Most of the performance numbers come from the VLMEvalKit leaderboard or the original papers; we ran our own evaluations with VLMEvalKit, so results may vary slightly due to differences in environments and in the LLMs used as evaluation judges.

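For reference, VLMEvalKit evaluations are typically launched through its `run.py` entry point. The sketch below is a hypothetical invocation, not the exact command behind the scores above: the dataset split and the model identifier `Ivy-VL` are placeholders, and it assumes you run it from a local clone of VLMEvalKit.

```python
# Hypothetical sketch of launching a VLMEvalKit run from Python.
# Run from a local clone of https://github.com/open-compass/VLMEvalKit;
# "MMBench_DEV_EN" and "Ivy-VL" are placeholders -- use the dataset and model
# names registered in your VLMEvalKit installation.
import subprocess

subprocess.run(
    [
        "python", "run.py",
        "--data", "MMBench_DEV_EN",
        "--model", "Ivy-VL",
        "--verbose",
    ],
    check=True,
)
```
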
# How to use:

```python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import requests
import copy
import torch
import warnings

warnings.filterwarnings("ignore")

pretrained = "AI-Safeguard/Ivy-VL-llava"

model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # pass any extra llava_model_args here if needed

model.eval()

# Load an image from a URL
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

# Or load an image from a local path
# url = "./local_image.jpg"
# image = Image.open(url)

image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

conv_template = "qwen_1_5"  # make sure you use the correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]

cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)

text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)

print(text_outputs)
```
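A follow-up question can reuse the same conversation and image tensors. The continuation below is a sketch rather than part of the original card: it assumes, as in LLaVA's reference CLI, that `conv.messages` stores `[role, text]` pairs, so the pending assistant slot can be filled with the first answer before appending the next turn.

```python
# Sketch: ask a second question about the same image in the same conversation.
# Assumes the variables from the snippet above (conv, tokenizer, model,
# image_tensor, image_sizes, device, text_outputs) are still in scope.
conv.messages[-1][-1] = text_outputs[0]  # fill the assistant turn with the first answer

follow_up = "Which model performs best in this chart?"
conv.append_message(conv.roles[0], follow_up)  # the image token is only needed in the first turn
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
output = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    max_new_tokens=512,
)
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```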

# Future Plan:

* We plan to release more versions built on LLMs of different sizes.

* We will focus on improving performance on the video modality.

# Contact:
Feel free to contact us if you have any questions or suggestions 📧:
* Email (Ivy Zhang): [email protected]

# Citation:

If you find our work helpful, please consider citing our model:
```plaintext
@misc{ivy2024ivy-vl,
  title={Ivy-VL: Compact Vision-Language Models Achieving SOTA with Optimal Data},
  url={https://huggingface.co/AI-Safeguard/Ivy-VL-llava},
  author={Ivy Zhang, Wei Peng, Jenny N, Theresa Yu and David Qiu},
  month={December},
  year={2024}
}
```