ABDALLALSWAITI committed
Commit 191e7b6 · verified · 1 Parent(s): b924a95

Update README.md

Files changed (1)
  1. README.md +338 -448
README.md CHANGED
@@ -57,7 +57,7 @@ All notebooks are **beginner friendly**! Add your dataset, click "Run All", and
57
 
58
  ## Introduction
59
 
60
- In the past five months since Qwen2-VLs release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.
61
 
62
  #### Key Enhancements:
63
  * **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but it is highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
@@ -89,482 +89,372 @@ We enhance both training and inference speeds by strategically implementing wind
89
 
90
  We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
91
 
 
92
 
 
93
 
94
- ## Evaluation
95
-
96
- ### Image benchmark
97
-
98
-
99
- | Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B |**Qwen2.5-VL-7B** |
100
- | :--- | :---: | :---: | :---: | :---: | :---: |
101
- | MMMU<sub>val</sub> | 56 | 50.4 | **60**| 54.1 | 58.6|
102
- | MMMU-Pro<sub>val</sub> | 34.3 | - | 37.6| 30.5 | 41.0|
103
- | DocVQA<sub>test</sub> | 93 | 93 | - | 94.5 | **95.7** |
104
- | InfoVQA<sub>test</sub> | 77.6 | - | - |76.5 | **82.6** |
105
- | ChartQA<sub>test</sub> | 84.8 | - |- | 83.0 |**87.3** |
106
- | TextVQA<sub>val</sub> | 79.1 | 80.1 | -| 84.3 | **84.9**|
107
- | OCRBench | 822 | 852 | 785 | 845 | **864** |
108
- | CC_OCR | 57.7 | | | 61.6 | **77.8**|
109
- | MMStar | 62.8| | |60.7| **63.9**|
110
- | MMBench-V1.1-En<sub>test</sub> | 79.4 | 78.0 | 76.0| 80.7 | **82.6** |
111
- | MMT-Bench<sub>test</sub> | - | - | - |**63.7** |63.6 |
112
- | MMStar | **61.5** | 57.5 | 54.8 | 60.7 |63.9 |
113
- | MMVet<sub>GPT-4-Turbo</sub> | 54.2 | 60.0 | 66.9 | 62.0 | **67.1**|
114
- | HallBench<sub>avg</sub> | 45.2 | 48.1 | 46.1| 50.6 | **52.9**|
115
- | MathVista<sub>testmini</sub> | 58.3 | 60.6 | 52.4 | 58.2 | **68.2**|
116
- | MathVision | - | - | - | 16.3 | **25.07** |
117
-
118
- ### Video Benchmarks
119
-
120
- | Benchmark | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
121
- | :--- | :---: | :---: |
122
- | MVBench | 67.0 | **69.6** |
123
- | PerceptionTest<sub>test</sub> | 66.9 | **70.5** |
124
- | Video-MME<sub>wo/w subs</sub> | 63.3/69.0 | **65.1**/**71.6** |
125
- | LVBench | | 45.3 |
126
- | LongVideoBench | | 54.7 |
127
- | MMBench-Video | 1.44 | 1.79 |
128
- | TempCompass | | 71.7 |
129
- | MLVU | | 70.2 |
130
- | CharadesSTA/mIoU | 43.6|
131
-
132
- ### Agent benchmark
133
- | Benchmarks | Qwen2.5-VL-7B |
134
- |-------------------------|---------------|
135
- | ScreenSpot | 84.7 |
136
- | ScreenSpot Pro | 29.0 |
137
- | AITZ_EM | 81.9 |
138
- | Android Control High_EM | 60.1 |
139
- | Android Control Low_EM | 93.7 |
140
- | AndroidWorld_SR | 25.5 |
141
- | MobileMiniWob++_SR | 91.4 |
142
 
143
  ## Requirements
144
- The code of Qwen2.5-VL has been in the latest Hugging face transformers and we advise you to build from source with command:
145
- ```
146
- pip install git+https://github.com/huggingface/transformers accelerate
147
- ```
148
- or you might encounter the following error:
149
- ```
150
- KeyError: 'qwen2_5_vl'
151
- ```
152
-
153
-
154
- ## Quickstart
155
-
156
- Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.
157
-
158
- The code of Qwen2.5-VL has been in the latest Hugging face transformers and we advise you to build from source with command:
159
- ```
160
- pip install git+https://github.com/huggingface/transformers accelerate
161
- ```
162
- or you might encounter the following error:
163
- ```
164
- KeyError: 'qwen2_5_vl'
165
- ```
166
-
167
-
168
- We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:
169
 
170
  ```bash
171
- # It's highly recommanded to use `[decord]` feature for faster video loading.
172
- pip install qwen-vl-utils[decord]==0.0.8
173
- ```
174
-
175
- If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils` which will fall back to using torchvision for video processing. However, you can still [install decord from source](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source) to get decord used when loading video.
176
-
177
- ### Using 🤗 Transformers to Chat
178
-
179
- Here we show a code snippet to show you how to use the chat model with `transformers` and `qwen_vl_utils`:
180
-
181
- ```python
182
- from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
183
- from qwen_vl_utils import process_vision_info
184
-
185
- # default: Load the model on the available device(s)
186
- model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
187
- "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
188
- )
189
-
190
- # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
191
- # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
192
- # "Qwen/Qwen2.5-VL-7B-Instruct",
193
- # torch_dtype=torch.bfloat16,
194
- # attn_implementation="flash_attention_2",
195
- # device_map="auto",
196
- # )
197
-
198
- # default processer
199
- processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
200
-
201
- # The default range for the number of visual tokens per image in the model is 4-16384.
202
- # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
203
- # min_pixels = 256*28*28
204
- # max_pixels = 1280*28*28
205
- # processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
206
-
207
- messages = [
208
- {
209
- "role": "user",
210
- "content": [
211
- {
212
- "type": "image",
213
- "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
214
- },
215
- {"type": "text", "text": "Describe this image."},
216
- ],
217
- }
218
- ]
219
-
220
- # Preparation for inference
221
- text = processor.apply_chat_template(
222
- messages, tokenize=False, add_generation_prompt=True
223
- )
224
- image_inputs, video_inputs = process_vision_info(messages)
225
- inputs = processor(
226
- text=[text],
227
- images=image_inputs,
228
- videos=video_inputs,
229
- padding=True,
230
- return_tensors="pt",
231
- )
232
- inputs = inputs.to("cuda")
233
-
234
- # Inference: Generation of the output
235
- generated_ids = model.generate(**inputs, max_new_tokens=128)
236
- generated_ids_trimmed = [
237
- out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
238
- ]
239
- output_text = processor.batch_decode(
240
- generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
241
- )
242
- print(output_text)
243
  ```
244
- <details>
245
- <summary>Multi image inference</summary>
246
-
247
- ```python
248
- # Messages containing multiple images and a text query
249
- messages = [
250
- {
251
- "role": "user",
252
- "content": [
253
- {"type": "image", "image": "file:///path/to/image1.jpg"},
254
- {"type": "image", "image": "file:///path/to/image2.jpg"},
255
- {"type": "text", "text": "Identify the similarities between these images."},
256
- ],
257
- }
258
- ]
259
 
260
- # Preparation for inference
261
- text = processor.apply_chat_template(
262
- messages, tokenize=False, add_generation_prompt=True
263
- )
264
- image_inputs, video_inputs = process_vision_info(messages)
265
- inputs = processor(
266
- text=[text],
267
- images=image_inputs,
268
- videos=video_inputs,
269
- padding=True,
270
- return_tensors="pt",
271
- )
272
- inputs = inputs.to("cuda")
273
-
274
- # Inference
275
- generated_ids = model.generate(**inputs, max_new_tokens=128)
276
- generated_ids_trimmed = [
277
- out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
278
- ]
279
- output_text = processor.batch_decode(
280
- generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
281
- )
282
- print(output_text)
283
- ```
284
- </details>
285
 
286
- <details>
287
- <summary>Video inference</summary>
288
 
289
  ```python
290
- # Messages containing a images list as a video and a text query
291
- messages = [
292
- {
293
- "role": "user",
294
- "content": [
295
- {
296
- "type": "video",
297
- "video": [
298
- "file:///path/to/frame1.jpg",
299
- "file:///path/to/frame2.jpg",
300
- "file:///path/to/frame3.jpg",
301
- "file:///path/to/frame4.jpg",
302
- ],
303
- },
304
- {"type": "text", "text": "Describe this video."},
305
- ],
306
- }
307
- ]
308
-
309
- # Messages containing a local video path and a text query
310
- messages = [
311
- {
312
- "role": "user",
313
- "content": [
314
- {
315
- "type": "video",
316
- "video": "file:///path/to/video1.mp4",
317
- "max_pixels": 360 * 420,
318
- "fps": 1.0,
319
- },
320
- {"type": "text", "text": "Describe this video."},
321
- ],
322
- }
323
- ]
324
-
325
- # Messages containing a video url and a text query
326
- messages = [
327
- {
328
- "role": "user",
329
- "content": [
330
- {
331
- "type": "video",
332
- "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
333
- },
334
- {"type": "text", "text": "Describe this video."},
335
- ],
336
- }
337
- ]
338
-
339
- #In Qwen 2.5 VL, frame rate information is also input into the model to align with absolute time.
340
- # Preparation for inference
341
- text = processor.apply_chat_template(
342
- messages, tokenize=False, add_generation_prompt=True
343
- )
344
- image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
345
- inputs = processor(
346
- text=[text],
347
- images=image_inputs,
348
- videos=video_inputs,
349
- fps=fps,
350
- padding=True,
351
- return_tensors="pt",
352
- **video_kwargs,
353
  )
354
- inputs = inputs.to("cuda")
355
-
356
- # Inference
357
- generated_ids = model.generate(**inputs, max_new_tokens=128)
358
- generated_ids_trimmed = [
359
- out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
360
- ]
361
- output_text = processor.batch_decode(
362
- generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
363
- )
364
- print(output_text)
365
- ```
366
 
367
- Video URL compatibility largely depends on the third-party library version. The details are in the table below. change the backend by `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.
 
368
 
369
- | Backend | HTTP | HTTPS |
370
- |-------------|------|-------|
371
- | torchvision >= 0.19.0 | ✅ | ✅ |
372
- | torchvision < 0.19.0 | ❌ | ❌ |
373
- | decord | ✅ | ❌ |
374
- </details>
375
-
376
- <details>
377
- <summary>Batch inference</summary>
378
-
379
- ```python
380
- # Sample messages for batch inference
381
- messages1 = [
382
- {
383
- "role": "user",
384
- "content": [
385
- {"type": "image", "image": "file:///path/to/image1.jpg"},
386
- {"type": "image", "image": "file:///path/to/image2.jpg"},
387
- {"type": "text", "text": "What are the common elements in these pictures?"},
388
- ],
389
- }
390
- ]
391
- messages2 = [
392
- {"role": "system", "content": "You are a helpful assistant."},
393
- {"role": "user", "content": "Who are you?"},
394
- ]
395
- # Combine messages for batch processing
396
- messages = [messages1, messages2]
397
 
398
- # Preparation for batch inference
399
- texts = [
400
- processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
401
- for msg in messages
402
- ]
403
- image_inputs, video_inputs = process_vision_info(messages)
404
- inputs = processor(
405
- text=texts,
406
- images=image_inputs,
407
- videos=video_inputs,
408
- padding=True,
409
- return_tensors="pt",
410
- )
411
- inputs = inputs.to("cuda")
412
-
413
- # Batch Inference
414
- generated_ids = model.generate(**inputs, max_new_tokens=128)
415
- generated_ids_trimmed = [
416
- out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
417
- ]
418
- output_texts = processor.batch_decode(
419
- generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
420
  )
421
- print(output_texts)
422
- ```
423
- </details>
424
-
425
- ### 🤖 ModelScope
426
- We strongly advise users especially those in mainland China to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.
427
-
428
-
429
- ### More Usage Tips
430
 
431
- For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
432
 
433
- ```python
434
- # You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
435
- ## Local file path
436
- messages = [
437
- {
438
- "role": "user",
439
- "content": [
440
- {"type": "image", "image": "file:///path/to/your/image.jpg"},
441
- {"type": "text", "text": "Describe this image."},
442
- ],
443
- }
444
- ]
445
- ## Image URL
446
- messages = [
447
- {
448
- "role": "user",
449
- "content": [
450
- {"type": "image", "image": "http://path/to/your/image.jpg"},
451
- {"type": "text", "text": "Describe this image."},
452
- ],
453
- }
454
- ]
455
- ## Base64 encoded image
456
  messages = [
457
  {
458
  "role": "user",
459
  "content": [
460
- {"type": "image", "image": "data:image;base64,/9j/..."},
461
- {"type": "text", "text": "Describe this image."},
462
- ],
463
  }
464
  ]
465
- ```
466
- #### Image Resolution for performance boost
467
-
468
- The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
469
 
470
- ```python
471
- min_pixels = 256 * 28 * 28
472
- max_pixels = 1280 * 28 * 28
473
- processor = AutoProcessor.from_pretrained(
474
- "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
475
- )
476
  ```
477
 
478
- Besides, We provide two methods for fine-grained control over the image size input to the model:
479
 
480
- 1. Define min_pixels and max_pixels: Images will be resized to maintain their aspect ratio within the range of min_pixels and max_pixels.
481
-
482
- 2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.
483
 
484
  ```python
485
- # min_pixels and max_pixels
486
- messages = [
487
- {
488
- "role": "user",
489
- "content": [
490
- {
491
- "type": "image",
492
- "image": "file:///path/to/your/image.jpg",
493
- "resized_height": 280,
494
- "resized_width": 420,
495
- },
496
- {"type": "text", "text": "Describe this image."},
497
- ],
498
- }
499
- ]
500
- # resized_height and resized_width
501
- messages = [
502
- {
503
- "role": "user",
504
- "content": [
505
- {
506
- "type": "image",
507
- "image": "file:///path/to/your/image.jpg",
508
- "min_pixels": 50176,
509
- "max_pixels": 50176,
510
- },
511
- {"type": "text", "text": "Describe this image."},
512
- ],
513
- }
514
- ]
515
- ```
516
-
517
- ### Processing Long Texts
518
-
519
- The current `config.json` is set for context length up to 32,768 tokens.
520
- To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
521
-
522
- For supported frameworks, you could add the following to `config.json` to enable YaRN:
523
-
524
- {
525
- ...,
526
- "type": "yarn",
527
- "mrope_section": [
528
- 16,
529
- 24,
530
- 24
531
- ],
532
- "factor": 4,
533
- "original_max_position_embeddings": 32768
534
- }
535
-
536
- However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.
537
-
538
- At the same time, for long video inputs, since MRoPE itself is more economical with ids, the max_position_embeddings can be directly modified to a larger value, such as 64k.
539
-
540
-
541
-
542
-
543
- ## Citation
544
-
545
- If you find our work helpful, feel free to give us a cite.
546
-
547
- ```
548
- @misc{qwen2.5-VL,
549
- title = {Qwen2.5-VL},
550
- url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
551
- author = {Qwen Team},
552
- month = {January},
553
- year = {2025}
554
- }
555
-
556
- @article{Qwen2VL,
557
- title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
558
- author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
559
- journal={arXiv preprint arXiv:2409.12191},
560
- year={2024}
561
- }
562
-
563
- @article{Qwen-VL,
564
- title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
565
- author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
566
- journal={arXiv preprint arXiv:2308.12966},
567
- year={2023}
568
- }
569
  ```
570
57
 
58
  ## Introduction
59
 
60
+ In the past five months since Qwen2-VL's release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.
61
 
62
  #### Key Enhancements:
63
  * **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but it is highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
 
89
 
90
  We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
91
 
92
+ # Using Qwen2.5-VL 7B with 4-bit Quantization
93
 
94
+ This guide demonstrates how to use a 4-bit quantized version of Qwen2.5-VL-7B-Instruct, a multimodal vision-language model that understands images and generates descriptive text. 4-bit quantization significantly reduces memory requirements while maintaining good performance.
95
 
96
+ ## Table of Contents
97
+ - [Requirements](#requirements)
98
+ - [Standard Implementation](#standard-implementation)
99
+ - [Memory-Efficient Implementation](#memory-efficient-implementation)
100
+ - [Quantization Benefits](#quantization-benefits)
101
+ - [Performance Tips](#performance-tips)
102
 
103
  ## Requirements
104
 
105
  ```bash
106
+ pip install transformers torch bitsandbytes accelerate pillow huggingface_hub
107
+ pip install qwen-vl-utils[decord]==0.0.8 # For video support (recommended)
108
+ # OR
109
+ pip install qwen-vl-utils # Falls back to torchvision for video
110
  ```
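+
+ Qwen2.5-VL support is only present in recent `transformers` releases; older builds fail with `KeyError: 'qwen2_5_vl'` when loading the model. A minimal sanity check, assuming only that the packages above are installed:
+
+ ```python
+ # Confirm that the installed transformers build ships the Qwen2.5-VL classes.
+ # If the import fails, upgrade transformers (e.g. `pip install -U transformers`).
+ import transformers
+ print("transformers version:", transformers.__version__)
+
+ from transformers import Qwen2_5_VLForConditionalGeneration  # raises ImportError on older builds
+ print("Qwen2.5-VL support is available")
+ ```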
111
 
112
+ ## Standard Implementation
113
 
114
+ This implementation provides a good balance between performance and memory efficiency:
 
115
 
116
  ```python
117
+ import torch
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig
+ from huggingface_hub import login
+ import requests
+ from PIL import Image
+ from io import BytesIO
+
+ # Login to Hugging Face with token
+ # You need to use a valid token with access to the model
+ token = "YOUR_HF_TOKEN"  # Replace with your valid token
+ login(token)
+
+ # Configure quantization
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_compute_dtype=torch.float16,
+     bnb_4bit_use_double_quant=True,
+     bnb_4bit_quant_type="nf4"
  )

+ # Model ID
+ model_id = "unsloth/Qwen2.5-VL-7B-Instruct-unsloth-bnb-4bit"

+ # Load processor
+ processor = AutoProcessor.from_pretrained(model_id, token=token)

+ # Load model
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_id,
+     quantization_config=bnb_config,
+     device_map="auto",
+     token=token
  )

+ # Process image from URL
+ image_url = "https://i.pinimg.com/736x/69/cd/59/69cd59a5ee5e041aa00f088465befbad.jpg"
+ response = requests.get(image_url)
+ image = Image.open(BytesIO(response.content)).convert("RGB")

+ # Create message according to Qwen2.5-VL format
  messages = [
      {
          "role": "user",
          "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": "Describe this image in detail."}
+         ]
      }
  ]

+ # Process input
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")
+
+ # Generate response
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=200)
+
+ # Decode response
+ response = processor.batch_decode(
+     output_ids[:, inputs.input_ids.shape[1]:],
+     skip_special_tokens=True
+ )[0]
+
+ print(response)
182
  ```
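+
+ The `qwen-vl-utils` package from the Requirements section also handles video input. The sketch below reuses the `model` and `processor` loaded above; the local path is a placeholder, and the `fps`/`max_pixels` values are just reasonable defaults:
+
+ ```python
+ from qwen_vl_utils import process_vision_info
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "video",
+                 "video": "file:///path/to/video1.mp4",  # placeholder local clip
+                 "max_pixels": 360 * 420,  # cap per-frame resolution to limit visual tokens
+                 "fps": 1.0,               # sample one frame per second
+             },
+             {"type": "text", "text": "Describe this video."},
+         ],
+     }
+ ]
+
+ # qwen-vl-utils loads the clip and samples frames according to fps/max_pixels.
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ ).to("cuda")
+
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=128)
+ print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
+ ```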
183
 
184
+ ## Memory-Efficient Implementation
185
 
186
+ This version adds optimizations for systems with limited resources, including more robust error handling and explicit memory management:
 
 
187
 
188
  ```python
189
+ import torch
+ import transformers
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig
+ from huggingface_hub import login
+ import requests
+ from PIL import Image
+ from io import BytesIO
+ import gc
+ import os
+
+ # Login to Hugging Face with token
+ token = "YOUR_HF_TOKEN"  # Replace with your valid token
+ login(token)
+
+ # Set environment variables to optimize memory usage
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
+
+ def process_vision_info(messages):
+     """Process images and videos from messages"""
+     image_inputs = []
+     video_inputs = None
+
+     for message in messages:
+         if message["role"] == "user" and isinstance(message["content"], list):
+             for content in message["content"]:
+                 if content["type"] == "image":
+                     # Handle image from URL
+                     if isinstance(content["image"], str) and content["image"].startswith("http"):
+                         try:
+                             response = requests.get(content["image"], timeout=10)
+                             response.raise_for_status()
+                             image = Image.open(BytesIO(response.content)).convert("RGB")
+                             image_inputs.append(image)
+                         except (requests.RequestException, IOError) as e:
+                             print(f"Error loading image from URL: {e}")
+                     # Handle base64 images
+                     elif isinstance(content["image"], str) and content["image"].startswith("data:image"):
+                         try:
+                             import base64
+                             # Extract base64 data after the comma
+                             base64_data = content["image"].split(',')[1]
+                             image_data = base64.b64decode(base64_data)
+                             image = Image.open(BytesIO(image_data)).convert("RGB")
+                             image_inputs.append(image)
+                         except Exception as e:
+                             print(f"Error loading base64 image: {e}")
+                     # Handle local file paths
+                     elif isinstance(content["image"], str) and content["image"].startswith("file://"):
+                         try:
+                             file_path = content["image"][7:]  # Remove 'file://'
+                             image = Image.open(file_path).convert("RGB")
+                             image_inputs.append(image)
+                         except Exception as e:
+                             print(f"Error loading local image: {e}")
+                     else:
+                         print("Unsupported image format or source")
+
+     return image_inputs, video_inputs
+
+ # Print versions for debugging
+ print(f"Transformers version: {transformers.__version__}")
+ print(f"PyTorch version: {torch.__version__}")
+ print(f"CUDA available: {torch.cuda.is_available()}")
+ if torch.cuda.is_available():
+     print(f"CUDA device: {torch.cuda.get_device_name(0)}")
+     print(f"CUDA memory allocated: {torch.cuda.memory_allocated(0)/1024**3:.2f} GB")
+     print(f"CUDA memory reserved: {torch.cuda.memory_reserved(0)/1024**3:.2f} GB")
+
+ # Load the 4-bit quantized model from Unsloth
+ model_id = "unsloth/Qwen2.5-VL-7B-Instruct-unsloth-bnb-4bit"
+ try:
+     # Free GPU memory before loading
+     if torch.cuda.is_available():
+         torch.cuda.empty_cache()
+         gc.collect()
+
+     # Load the processor first (less memory intensive)
+     print("Loading processor...")
+     processor = AutoProcessor.from_pretrained(model_id, token=token)
+
+     # Configure quantization parameters
+     quantization_config = BitsAndBytesConfig(
+         load_in_4bit=True,
+         bnb_4bit_compute_dtype=torch.float16,
+         bnb_4bit_use_double_quant=True,
+         bnb_4bit_quant_type="nf4",
+         llm_int8_enable_fp32_cpu_offload=True
+     )
+
+     print("Loading model...")
+     # Try loading with GPU offloading enabled
+     try:
+         model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+             model_id,
+             token=token,
+             device_map="auto",
+             quantization_config=quantization_config,
+             low_cpu_mem_usage=True,
+         )
+         print("Model loaded successfully with GPU acceleration")
+     except (ValueError, RuntimeError, torch.cuda.OutOfMemoryError) as e:
+         print(f"GPU loading failed: {e}")
+         print("Falling back to CPU-only mode")
+
+         # Clean up any partially loaded model
+         if 'model' in locals():
+             del model
+         torch.cuda.empty_cache()
+         gc.collect()
+
+         # Try again with CPU only
+         model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+             model_id,
+             token=token,
+             device_map="cpu",
+             torch_dtype=torch.float32,
+         )
+         print("Model loaded on CPU successfully")
+
+     # Print model's device map if available
+     if hasattr(model, 'hf_device_map'):
+         print("Model device map:")
+         for module, device in model.hf_device_map.items():
+             print(f" {module}: {device}")
+
+     # Example message with an image
+     messages = [
+         {
+             "role": "user",
+             "content": [
+                 {
+                     "type": "image",
+                     "image": "https://i.pinimg.com/736x/69/cd/59/69cd59a5ee5e041aa00f088465befbad.jpg",
+                 },
+                 {"type": "text", "text": "Describe this image in detail."},
+             ],
+         }
+     ]
+
+     # Process the messages
+     print("Processing input...")
+     text = processor.apply_chat_template(
+         messages, tokenize=False, add_generation_prompt=True
+     )
+     image_inputs, video_inputs = process_vision_info(messages)
+
+     # Check if we have valid image inputs
+     if not image_inputs:
+         raise ValueError("No valid images were processed")
+
+     # Prepare inputs for the model
+     inputs = processor(
+         text=[text],
+         images=image_inputs,
+         videos=video_inputs,
+         padding=True,
+         return_tensors="pt",
+     )
+
+     # Determine which device to use based on model's main device
+     if hasattr(model, 'hf_device_map'):
+         # Find the primary device (usually where the first transformer block is)
+         for key, device in model.hf_device_map.items():
+             if 'transformer.blocks.0' in key or 'model.embed_tokens' in key:
+                 input_device = device
+                 break
+         else:
+             # Default to first device in the map
+             input_device = next(iter(model.hf_device_map.values()))
+     else:
+         # If not distributed, use the model's device
+         input_device = next(model.parameters()).device
+
+     print(f"Using device {input_device} for inputs")
+     inputs = {k: v.to(input_device) for k, v in inputs.items()}
+
+     # Generate the response
+     print("Generating response...")
+     with torch.no_grad():
+         generation_config = {
+             "max_new_tokens": 256,
+             "do_sample": True,
+             "temperature": 0.7,
+             "top_p": 0.9,
+         }
+         generated_ids = model.generate(**inputs, **generation_config)
+
+     # Process the output
+     generated_ids_trimmed = [
+         out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs["input_ids"], generated_ids)
+     ]
+     output_text = processor.batch_decode(
+         generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+     )
+
+     # Print the response
+     print("\nModel response:")
+     print(output_text[0])
+ except Exception as e:
+     import traceback
+     print(f"An error occurred: {e}")
+     print(traceback.format_exc())
+ finally:
+     # Clean up
+     if torch.cuda.is_available():
+         torch.cuda.empty_cache()
395
  ```
396
 
397
+ ## Quantization Benefits
398
+
399
+ The 4-bit quantized model offers several advantages:
400
+
401
+ 1. **Reduced Memory Usage**: Uses approximately 4-5 GB of VRAM, compared to 14-16 GB for the full-precision model (see the quick check after this list)
402
+ 2. **Wider Accessibility**: Can run on consumer GPUs with limited VRAM (e.g., RTX 3060, GTX 1660)
403
+ 3. **CPU Fallback**: The memory-efficient implementation can fall back to CPU if GPU memory is insufficient
404
+ 4. **Minimal Performance Loss**: The quantized model maintains most of the reasoning capabilities of the full model
405
+
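+ A quick way to verify the memory figures on your own hardware, assuming the model has already been loaded as shown above:
+
+ ```python
+ import torch
+
+ # Report how much GPU memory the loaded 4-bit model actually occupies.
+ if torch.cuda.is_available():
+     allocated_gb = torch.cuda.memory_allocated(0) / 1024**3
+     reserved_gb = torch.cuda.memory_reserved(0) / 1024**3
+     print(f"Allocated: {allocated_gb:.2f} GB | Reserved: {reserved_gb:.2f} GB")
+ else:
+     print("No CUDA device available; the model is running on CPU.")
+ ```
+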
406
+ ## Performance Tips
407
+
408
+ 1. **Control Image Resolution**:
409
+ ```python
410
+ processor = AutoProcessor.from_pretrained(
411
+ model_id,
412
+ token=token,
413
+ min_pixels=256*28*28, # Lower bound
414
+ max_pixels=1280*28*28 # Upper bound
415
+ )
416
+ ```
417
+
418
+ 2. **Enable Flash Attention 2** for better performance (if supported):
419
+ ```python
420
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
421
+ model_id,
422
+ token=token,
423
+ torch_dtype=torch.bfloat16,
424
+ attn_implementation="flash_attention_2",
425
+ device_map="auto",
426
+ quantization_config=bnb_config
427
+ )
428
+ ```
429
+
430
+ 3. **Memory Management**:
431
+ - Call `torch.cuda.empty_cache()` and `gc.collect()` before and after using the model
432
+ - Set environment variables: `os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"`
433
+ - Use `low_cpu_mem_usage=True` when loading the model
434
+
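+ Putting these suggestions together (a sketch that assumes the same quantized checkpoint as above):
+
+ ```python
+ import os
+ import gc
+ import torch
+ from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration
+
+ # Reduce CUDA allocator fragmentation; set before the first allocation.
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
+
+ # Release cached memory left over from earlier runs.
+ gc.collect()
+ if torch.cuda.is_available():
+     torch.cuda.empty_cache()
+
+ # low_cpu_mem_usage avoids materialising a full copy of the weights in host RAM.
+ bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     "unsloth/Qwen2.5-VL-7B-Instruct-unsloth-bnb-4bit",
+     quantization_config=bnb_config,
+     device_map="auto",
+     low_cpu_mem_usage=True,
+ )
+ ```
+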
435
+ 4. **Generation Parameters**:
436
+ - Adjust `max_new_tokens` based on your needs (lower values use less memory)
437
+ - Use temperature and top_p to control randomness:
438
+ ```python
439
+ generation_config = {
440
+ "max_new_tokens": 256,
441
+ "do_sample": True,
442
+ "temperature": 0.7,
443
+ "top_p": 0.9,
444
+ }
445
+ ```
446
+
447
+ 5. **Multi-Image Processing**:
448
+ When working with multiple images, batch processing them properly can save memory and improve efficiency:
449
+ ```python
450
+ messages = [
451
+ {
452
+ "role": "user",
453
+ "content": [
454
+ {"type": "image", "image": "url_to_image1"},
455
+ {"type": "image", "image": "url_to_image2"},
456
+ {"type": "text", "text": "Compare these two images."}
457
+ ]
458
+ }
459
+ ]
460
+ ```
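+
+ A possible continuation, assuming the `model` and `processor` from the Standard Implementation are already loaded and the two placeholder URLs are replaced with real image links:
+
+ ```python
+ import torch
+ import requests
+ from PIL import Image
+ from io import BytesIO
+
+ # Download every image referenced in the message, in order.
+ images = []
+ for item in messages[0]["content"]:
+     if item["type"] == "image":
+         resp = requests.get(item["image"], timeout=10)
+         resp.raise_for_status()
+         images.append(Image.open(BytesIO(resp.content)).convert("RGB"))
+
+ # A single forward pass sees both images plus the comparison prompt.
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text], images=images, padding=True, return_tensors="pt").to("cuda")
+
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=200)
+ print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
+ ```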