---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
tags:
- multimodal
- qwen
- qwen2
- unsloth
- transformers
- vision
---
<div>
  <p style="margin-bottom: 0;margin-top:0;">
    <em>Unsloth's <a href="https://unsloth.ai/blog/dynamic-4bit">Dynamic 4-bit Quants</a> are selectively quantized, greatly improving accuracy over standard 4-bit quantization.</em>
  </p>
  <div style="display: flex; gap: 5px; align-items: center;margin-top:0; ">
    <a href="https://github.com/unslothai/unsloth/">
      <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
    </a>
    <a href="https://discord.gg/unsloth">
      <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
    </a>
    <a href="https://docs.unsloth.ai/">
      <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
    </a>
  </div>
<h1 style="margin-top: 0rem;">Finetune LLMs 2-5x faster with 70% less memory via Unsloth</h1>
</div>
We have a free Google Colab Tesla T4 notebook for Qwen2-VL (7B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb

## ✨ Finetune for Free

All notebooks are **beginner friendly**! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model that can be exported to GGUF or vLLM, or uploaded to Hugging Face.

| Unsloth supports          |    Free Notebooks                                                                                           | Performance | Memory use |
|-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------|
| **Llama-3.2 (3B)**      | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)               | 2.4x faster | 58% less |
| **Llama-3.2 (11B vision)**      | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)               | 2x faster | 60% less |
| **Qwen2 VL (7B)**      | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb)               | 1.8x faster | 60% less |
| **Qwen2.5 (7B)**      | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(7B)-Alpaca.ipynb)               | 2x faster | 60% less |
| **Llama-3.1 (8B)**      | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb)               | 2.4x faster | 58% less |
| **Phi-3.5 (mini)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_3.5_Mini-Conversational.ipynb)               | 2x faster | 50% less |
| **Gemma 2 (9B)**      | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma2_(9B)-Alpaca.ipynb)               | 2.4x faster | 58% less |
| **Mistral (7B)**    | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-Conversational.ipynb)               | 2.2x faster | 62% less |

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="200"/>](https://docs.unsloth.ai)

- This [Llama 3.2 conversational notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) is useful for ShareGPT ChatML / Vicuna templates.
- This [text completion notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb) is for raw text. This [DPO notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

# Qwen2.5-VL

## Introduction

In the five months since Qwen2-VL's release, numerous developers have built new models on top of the Qwen2-VL vision-language models and provided us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

#### Key Enhancements:
* **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but also highly capable of analyzing text, charts, icons, graphics, and layouts within images.

* **Being agentic**: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically directs tools, making it capable of computer use and phone use.

* **Understanding long videos and capturing events**: Qwen2.5-VL can comprehend videos over an hour long, and it gains a new ability to capture events by pinpointing the relevant video segments.

* **Capable of visual localization in different formats**: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.

* **Generating structured outputs**: for data such as scanned invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting use cases in finance, commerce, and other domains (see the prompt sketch after this list).
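
As a concrete illustration of the localization and structured-output capabilities above, here is a minimal prompt sketch. The image URL and requested fields are hypothetical placeholders; the code for actually loading the model and running such a message appears later in this card.

```python
# Hypothetical prompt sketch: request structured JSON plus bounding boxes
# from a scanned invoice. Replace the URL with a real image before running.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/invoice_scan.jpg"},
            {
                "type": "text",
                "text": (
                    "Extract the invoice number, date, and total amount and return "
                    "them as JSON. For each extracted field, also give a bounding "
                    "box in [x1, y1, x2, y2] pixel coordinates."
                ),
            },
        ],
    }
]
```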


#### Model Architecture Updates:

* **Dynamic Resolution and Frame Rate Training for Video Understanding**:

We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.

<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg" width="80%"/>
</p>


* **Streamlined and Efficient Vision Encoder**

We enhance both training and inference speed by strategically implementing window attention in the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.


We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).

# Using Qwen2.5-VL 7B with 4-bit Quantization

This guide demonstrates how to use the 4-bit quantized version of Qwen2.5-VL, a multimodal vision-language model that can understand images and generate descriptive text. The 4-bit quantization significantly reduces memory requirements while maintaining good performance.

## Table of Contents
- [Requirements](#requirements)
- [Standard Implementation](#standard-implementation)
- [Memory-Efficient Implementation](#memory-efficient-implementation)
- [Quantization Benefits](#quantization-benefits)
- [Performance Tips](#performance-tips)

## Requirements

```bash
pip install transformers torch bitsandbytes accelerate pillow huggingface_hub
pip install "qwen-vl-utils[decord]==0.0.8"  # For video support (recommended)
# OR
pip install qwen-vl-utils  # Falls back to torchvision for video
```
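
Optionally, a quick sanity check that the key dependencies import and a GPU is visible (a minimal sketch, not specific to this model):

```python
import torch
import transformers
import bitsandbytes

# Print versions and GPU visibility before loading anything large.
print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```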

## Standard Implementation

This implementation provides a good balance between performance and memory efficiency:

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig
from huggingface_hub import login
import requests
from PIL import Image
from io import BytesIO

# Login to Hugging Face with token
# You need to use a valid token with access to the model
token = "YOUR_HF_TOKEN"  # Replace with your valid token
login(token)

# Configure quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Model ID
model_id = "ABDALLALSWAITI/Qwen2.5-VL-7B-Instruct-unsloth-bnb-4bit-copy"

# Load processor
processor = AutoProcessor.from_pretrained(model_id, token=token)

# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    token=token
)

# Process image from URL
image_url = "https://i.pinimg.com/736x/69/cd/59/69cd59a5ee5e041aa00f088465befbad.jpg"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content)).convert("RGB")

# Create message according to Qwen2.5-VL format
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Process input
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

# Generate response
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=200)
    
    # Decode response
    response = processor.batch_decode(
        output_ids[:, inputs.input_ids.shape[1]:], 
        skip_special_tokens=True
    )[0]

print(response)
```
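
The same chat format also accepts video input when `qwen-vl-utils` is installed (see [Requirements](#requirements)). Below is a minimal sketch that reuses the `model` and `processor` loaded above; the video path is a placeholder, and the `fps` sampling key follows the upstream Qwen2.5-VL examples, so verify the exact keys against your installed `qwen-vl-utils` version.

```python
from qwen_vl_utils import process_vision_info  # helper from the optional qwen-vl-utils package

# Hypothetical local video path; a lower fps samples fewer frames and uses less memory.
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }
]

text = processor.apply_chat_template(video_messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=200)

print(processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0])
```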

## Memory-Efficient Implementation

This version includes optimizations for systems with limited resources, with better error handling and memory management:

```python
import torch
import transformers
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig
from huggingface_hub import login
import requests
from PIL import Image
from io import BytesIO
import gc
import os

# Login to Hugging Face with token
token = "YOUR_HF_TOKEN"  # Replace with your valid token
login(token)

# Set environment variables to optimize memory usage
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

def process_vision_info(messages):
    """Process images and videos from messages"""
    image_inputs = []
    video_inputs = None
    
    for message in messages:
        if message["role"] == "user" and isinstance(message["content"], list):
            for content in message["content"]:
                if content["type"] == "image":
                    # Handle image from URL
                    if isinstance(content["image"], str) and content["image"].startswith("http"):
                        try:
                            response = requests.get(content["image"], timeout=10)
                            response.raise_for_status()
                            image = Image.open(BytesIO(response.content)).convert("RGB")
                            image_inputs.append(image)
                        except (requests.RequestException, IOError) as e:
                            print(f"Error loading image from URL: {e}")
                    # Handle base64 images
                    elif isinstance(content["image"], str) and content["image"].startswith("data:image"):
                        try:
                            import base64
                            # Extract base64 data after the comma
                            base64_data = content["image"].split(',')[1]
                            image_data = base64.b64decode(base64_data)
                            image = Image.open(BytesIO(image_data)).convert("RGB")
                            image_inputs.append(image)
                        except Exception as e:
                            print(f"Error loading base64 image: {e}")
                    # Handle local file paths
                    elif isinstance(content["image"], str) and content["image"].startswith("file://"):
                        try:
                            file_path = content["image"][7:]  # Remove 'file://'
                            image = Image.open(file_path).convert("RGB")
                            image_inputs.append(image)
                        except Exception as e:
                            print(f"Error loading local image: {e}")
                    else:
                        print("Unsupported image format or source")
    
    return image_inputs, video_inputs

# Print versions for debugging
print(f"Transformers version: {transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA memory allocated: {torch.cuda.memory_allocated(0)/1024**3:.2f} GB")
    print(f"CUDA memory reserved: {torch.cuda.memory_reserved(0)/1024**3:.2f} GB")

# Load the 4-bit quantized model from Unsloth
model_id = "ABDALLALSWAITI/Qwen2.5-VL-7B-Instruct-unsloth-bnb-4bit-copy"
try:
    # Free GPU memory before loading
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        gc.collect()
    
    # Load the processor first (less memory intensive)
    print("Loading processor...")
    processor = AutoProcessor.from_pretrained(model_id, token=token)
    
    # Configure quantization parameters
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        llm_int8_enable_fp32_cpu_offload=True
    )
    
    print("Loading model...")
    # Try loading with GPU offloading enabled
    try:
        model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_id,
            token=token,
            device_map="auto",
            quantization_config=quantization_config,
            low_cpu_mem_usage=True,
        )
        print("Model loaded successfully with GPU acceleration")
    except (ValueError, RuntimeError, torch.cuda.OutOfMemoryError) as e:
        print(f"GPU loading failed: {e}")
        print("Falling back to CPU-only mode")
        
        # Clean up any partially loaded model
        if 'model' in locals():
            del model
            torch.cuda.empty_cache()
            gc.collect()
        
        # Try again with CPU only
        model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_id,
            token=token,
            device_map="cpu",
            torch_dtype=torch.float32,
        )
        print("Model loaded on CPU successfully")
        
    # Print model's device map if available
    if hasattr(model, 'hf_device_map'):
        print("Model device map:")
        for module, device in model.hf_device_map.items():
            print(f"  {module}: {device}")
    
    # Example message with an image
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "https://i.pinimg.com/736x/69/cd/59/69cd59a5ee5e041aa00f088465befbad.jpg",
                },
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }
    ]
    
    # Process the messages
    print("Processing input...")
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    
    # Check if we have valid image inputs
    if not image_inputs:
        raise ValueError("No valid images were processed")
    
    # Prepare inputs for the model
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    
    # Determine which device to use based on model's main device
    if hasattr(model, 'hf_device_map'):
        # Look for the device hosting the first decoder layer or the embeddings
        # (exact key names can vary across transformers versions)
        for key, device in model.hf_device_map.items():
            if 'model.layers.0' in key or 'model.embed_tokens' in key:
                input_device = device
                break
        else:
            # Default to first device in the map
            input_device = next(iter(model.hf_device_map.values()))
    else:
        # If not distributed, use the model's device
        input_device = next(model.parameters()).device
    
    print(f"Using device {input_device} for inputs")
    inputs = {k: v.to(input_device) for k, v in inputs.items()}
    
    # Generate the response
    print("Generating response...")
    with torch.no_grad():
        generation_config = {
            "max_new_tokens": 256,
            "do_sample": True,
            "temperature": 0.7,
            "top_p": 0.9,
        }
        generated_ids = model.generate(**inputs, **generation_config)
        
    # Process the output
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs["input_ids"], generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    
    # Print the response
    print("\nModel response:")
    print(output_text[0])
except Exception as e:
    import traceback
    print(f"An error occurred: {e}")
    print(traceback.format_exc())
finally:
    # Clean up
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

## Quantization Benefits

The 4-bit quantized model offers several advantages:

1. **Reduced Memory Usage**: Uses approximately 4-5 GB of VRAM, compared to 14-16 GB for the full model (a quick way to check this on your own GPU is sketched after this list)
2. **Wider Accessibility**: Can run on consumer GPUs with limited VRAM (e.g., RTX 3060, GTX 1660)
3. **CPU Fallback**: The memory-efficient implementation can fall back to CPU if GPU memory is insufficient
4. **Minimal Performance Loss**: The quantized model maintains most of the reasoning capabilities of the full model
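
To verify the memory figures on your own hardware, you can check peak GPU memory after loading the model and running one generation (a minimal sketch that reuses the earlier setup):

```python
import torch

# Rough check of the 4-5 GB claim: peak VRAM allocated by this process so far.
if torch.cuda.is_available():
    peak_gb = torch.cuda.max_memory_allocated(0) / 1024**3
    print(f"Peak CUDA memory allocated: {peak_gb:.2f} GB")
```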

## Performance Tips

1. **Control Image Resolution**:
   ```python
   processor = AutoProcessor.from_pretrained(
       model_id, 
       token=token,
       min_pixels=256*28*28,  # Lower bound
       max_pixels=1280*28*28  # Upper bound
   )
   ```

2. **Enable Flash Attention 2** for better performance (if supported):
   ```python
   model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
       model_id,
       token=token,
       torch_dtype=torch.bfloat16,
       attn_implementation="flash_attention_2",
       device_map="auto",
       quantization_config=bnb_config
   )
   ```

3. **Memory Management**:
   - Call `torch.cuda.empty_cache()` and `gc.collect()` before and after using the model
   - Set environment variables: `os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"`
   - Use `low_cpu_mem_usage=True` when loading the model

4. **Generation Parameters**:
   - Adjust `max_new_tokens` based on your needs (lower values use less memory)
   - Use temperature and top_p to control randomness:
     ```python
     generation_config = {
         "max_new_tokens": 256,
         "do_sample": True,
         "temperature": 0.7,
         "top_p": 0.9,
     }
     ```

5. **Multi-Image Processing**:
   When working with multiple images, batch processing them properly can save memory and improve efficiency:
   ```python
   messages = [
       {
           "role": "user",
           "content": [
               {"type": "image", "image": "url_to_image1"},
               {"type": "image", "image": "url_to_image2"},
               {"type": "text", "text": "Compare these two images."}
           ]
       }
   ]
   ```
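
   Continuing this example, here is a minimal sketch of how those messages could be processed end to end with the `model` and `processor` from the standard implementation above (the image URLs are placeholders you would replace):
   ```python
   import torch
   import requests
   from PIL import Image
   from io import BytesIO

   # Hypothetical URLs; swap in real images before running.
   urls = ["https://example.com/image1.jpg", "https://example.com/image2.jpg"]
   images = [
       Image.open(BytesIO(requests.get(u, timeout=10).content)).convert("RGB")
       for u in urls
   ]

   messages = [
       {
           "role": "user",
           "content": [
               {"type": "image", "image": images[0]},
               {"type": "image", "image": images[1]},
               {"type": "text", "text": "Compare these two images."},
           ],
       }
   ]

   text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
   inputs = processor(text=[text], images=images, padding=True, return_tensors="pt").to("cuda")

   with torch.no_grad():
       output_ids = model.generate(**inputs, max_new_tokens=200)

   print(processor.batch_decode(
       output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
   )[0])
   ```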