---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
tags:
- multimodal
- qwen
- qwen2
- unsloth
- transformers
- vision
---
<div>
<p style="margin-bottom: 0;margin-top:0;">
<em>Unsloth's <a href="https://unsloth.ai/blog/dynamic-4bit">Dynamic 4-bit Quants</a> is selectively quantized, greatly improving accuracy over standard 4-bit.</em>
</p>
<div style="display: flex; gap: 5px; align-items: center;margin-top:0; ">
<a href="https://github.com/unslothai/unsloth/">
<img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
</a>
<a href="https://discord.gg/unsloth">
<img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
</a>
<a href="https://docs.unsloth.ai/">
<img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
</a>
</div>
<h1 style="margin-top: 0rem;">Finetune LLMs 2-5x faster with 70% less memory via Unsloth</h1>
</div>
We have a free Google Colab Tesla T4 notebook for Qwen2-VL (7B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb
## ✨ Finetune for Free
All notebooks are **beginner friendly**! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------|
| **Llama-3.2 (3B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) | 2.4x faster | 58% less |
| **Llama-3.2 (11B vision)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb) | 2x faster | 60% less |
| **Qwen2 VL (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb) | 1.8x faster | 60% less |
| **Qwen2.5 (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(7B)-Alpaca.ipynb) | 2x faster | 60% less |
| **Llama-3.1 (8B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb) | 2.4x faster | 58% less |
| **Phi-3.5 (mini)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_3.5_Mini-Conversational.ipynb) | 2x faster | 50% less |
| **Gemma 2 (9B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma2_(9B)-Alpaca.ipynb) | 2.4x faster | 58% less |
| **Mistral (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-Conversational.ipynb) | 2.2x faster | 62% less |
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="200"/>](https://docs.unsloth.ai)
- This [Llama 3.2 conversational notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) is useful for ShareGPT ChatML / Vicuna templates.
- This [text completion notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb) is for raw text. This [DPO notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.
# Qwen2.5-VL
## Introduction
In the five months since Qwen2-VL's release, numerous developers have built new models on top of the Qwen2-VL vision-language models and provided us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.
#### Key Enhancements:
* **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
* **Being agentic**: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically directs tools, making it capable of computer use and phone use.
* **Understanding long videos and capturing events**: Qwen2.5-VL can comprehend videos of over 1 hour, and it now has the ability to capture events by pinpointing the relevant video segments.
* **Capable of visual localization in different formats**: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes (see the prompt sketch after this list).
* **Generating structured outputs**: for data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting applications in finance, commerce, and beyond.
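A minimal prompt sketch for the localization capability, assuming `model`, `processor`, and a PIL `image` are already loaded as in the [Standard Implementation](#standard-implementation) section below; the prompt wording and the parsed output format are illustrative rather than a fixed API:
```python
# Hypothetical grounding prompt; `model`, `processor`, and `image` are assumed
# to be set up exactly as in the Standard Implementation section below.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Locate every animal in the image and return a JSON list of objects with 'bbox_2d' and 'label' keys."},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```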
#### Model Architecture Updates:
* **Dynamic Resolution and Frame Rate Training for Video Understanding**:
We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.
<p align="center">
<img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg" width="80%"/>
</p>
* **Streamlined and Efficient Vision Encoder**
We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
# Using Qwen2.5-VL 7B with 4-bit Quantization
This guide demonstrates how to use the 4-bit quantized version of Qwen2.5-VL, a multimodal vision-language model that can understand images and generate descriptive text. The 4-bit quantization significantly reduces memory requirements while maintaining good performance.
## Table of Contents
- [Requirements](#requirements)
- [Standard Implementation](#standard-implementation)
- [Memory-Efficient Implementation](#memory-efficient-implementation)
- [Quantization Benefits](#quantization-benefits)
- [Performance Tips](#performance-tips)
## Requirements
```bash
pip install transformers torch bitsandbytes accelerate pillow huggingface_hub
pip install qwen-vl-utils[decord]==0.0.8 # For video support (recommended)
# OR
pip install qwen-vl-utils # Falls back to torchvision for video
```
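The memory-efficient example below defines its own image loader, but if you installed `qwen-vl-utils` you can let its `process_vision_info` helper prepare image and video inputs for you. Here is a minimal video sketch, assuming `model` and `processor` are loaded as in the implementations below and that the placeholder video path exists; the exact keyword arguments can vary slightly between `qwen-vl-utils` versions:
```python
# Video-input sketch; assumes `model` and `processor` are already loaded
# (see the implementations below) and the placeholder path points to a real file.
from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
            {"type": "text", "text": "Summarize the main events in this video."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```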
## Standard Implementation
This implementation provides a good balance between performance and memory efficiency:
```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig
from huggingface_hub import login
import requests
from PIL import Image
from io import BytesIO
# Login to Hugging Face with token
# You need to use a valid token with access to the model
token = "YOUR_HF_TOKEN" # Replace with your valid token
login(token)
# Configure quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4"
)
# Model ID
model_id = "ABDALLALSWAITI/Qwen2.5-VL-7B-Instruct-unsloth-bnb-4bit-copy"
# Load processor
processor = AutoProcessor.from_pretrained(model_id, token=token)
# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto",
token=token
)
# Process image from URL
image_url = "https://i.pinimg.com/736x/69/cd/59/69cd59a5ee5e041aa00f088465befbad.jpg"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content)).convert("RGB")
# Create message according to Qwen2.5-VL format
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": "Describe this image in detail."}
]
}
]
# Process input
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")
# Generate response
with torch.no_grad():
output_ids = model.generate(**inputs, max_new_tokens=200)
# Decode response
response = processor.batch_decode(
output_ids[:, inputs.input_ids.shape[1]:],
skip_special_tokens=True
)[0]
print(response)
```
## Memory-Efficient Implementation
This version includes optimizations for systems with limited resources, with better error handling and memory management:
```python
import torch
import transformers
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig
from huggingface_hub import login
import requests
from PIL import Image
from io import BytesIO
import gc
import os
# Login to Hugging Face with token
token = "YOUR_HF_TOKEN" # Replace with your valid token
login(token)
# Set environment variables to optimize memory usage
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
def process_vision_info(messages):
"""Process images and videos from messages"""
image_inputs = []
video_inputs = None
for message in messages:
if message["role"] == "user" and isinstance(message["content"], list):
for content in message["content"]:
if content["type"] == "image":
# Handle image from URL
if isinstance(content["image"], str) and content["image"].startswith("http"):
try:
response = requests.get(content["image"], timeout=10)
response.raise_for_status()
image = Image.open(BytesIO(response.content)).convert("RGB")
image_inputs.append(image)
except (requests.RequestException, IOError) as e:
print(f"Error loading image from URL: {e}")
# Handle base64 images
elif isinstance(content["image"], str) and content["image"].startswith("data:image"):
try:
import base64
# Extract base64 data after the comma
base64_data = content["image"].split(',')[1]
image_data = base64.b64decode(base64_data)
image = Image.open(BytesIO(image_data)).convert("RGB")
image_inputs.append(image)
except Exception as e:
print(f"Error loading base64 image: {e}")
# Handle local file paths
elif isinstance(content["image"], str) and content["image"].startswith("file://"):
try:
file_path = content["image"][7:] # Remove 'file://'
image = Image.open(file_path).convert("RGB")
image_inputs.append(image)
except Exception as e:
print(f"Error loading local image: {e}")
else:
print("Unsupported image format or source")
return image_inputs, video_inputs
# Print versions for debugging
print(f"Transformers version: {transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
print(f"CUDA device: {torch.cuda.get_device_name(0)}")
print(f"CUDA memory allocated: {torch.cuda.memory_allocated(0)/1024**3:.2f} GB")
print(f"CUDA memory reserved: {torch.cuda.memory_reserved(0)/1024**3:.2f} GB")
# Load the 4-bit quantized model from Unsloth
model_id = "ABDALLALSWAITI/Qwen2.5-VL-7B-Instruct-unsloth-bnb-4bit-copy"
try:
# Free GPU memory before loading
if torch.cuda.is_available():
torch.cuda.empty_cache()
gc.collect()
# Load the processor first (less memory intensive)
print("Loading processor...")
processor = AutoProcessor.from_pretrained(model_id, token=token)
# Configure quantization parameters
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
llm_int8_enable_fp32_cpu_offload=True
)
print("Loading model...")
# Try loading with GPU offloading enabled
try:
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_id,
token=token,
device_map="auto",
quantization_config=quantization_config,
low_cpu_mem_usage=True,
)
print("Model loaded successfully with GPU acceleration")
except (ValueError, RuntimeError, torch.cuda.OutOfMemoryError) as e:
print(f"GPU loading failed: {e}")
print("Falling back to CPU-only mode")
# Clean up any partially loaded model
if 'model' in locals():
del model
torch.cuda.empty_cache()
gc.collect()
# Try again with CPU only
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_id,
token=token,
device_map="cpu",
torch_dtype=torch.float32,
)
print("Model loaded on CPU successfully")
# Print model's device map if available
if hasattr(model, 'hf_device_map'):
print("Model device map:")
for module, device in model.hf_device_map.items():
print(f" {module}: {device}")
# Example message with an image
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://i.pinimg.com/736x/69/cd/59/69cd59a5ee5e041aa00f088465befbad.jpg",
},
{"type": "text", "text": "Describe this image in detail."},
],
}
]
# Process the messages
print("Processing input...")
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
# Check if we have valid image inputs
if not image_inputs:
raise ValueError("No valid images were processed")
# Prepare inputs for the model
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
# Determine which device to use based on model's main device
if hasattr(model, 'hf_device_map'):
# Find the primary device (usually where the first transformer block is)
for key, device in model.hf_device_map.items():
if 'transformer.blocks.0' in key or 'model.embed_tokens' in key:
input_device = device
break
else:
# Default to first device in the map
input_device = next(iter(model.hf_device_map.values()))
else:
# If not distributed, use the model's device
input_device = next(model.parameters()).device
print(f"Using device {input_device} for inputs")
inputs = {k: v.to(input_device) for k, v in inputs.items()}
# Generate the response
print("Generating response...")
with torch.no_grad():
generation_config = {
"max_new_tokens": 256,
"do_sample": True,
"temperature": 0.7,
"top_p": 0.9,
}
generated_ids = model.generate(**inputs, **generation_config)
# Process the output
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs["input_ids"], generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
# Print the response
print("\nModel response:")
print(output_text[0])
except Exception as e:
import traceback
print(f"An error occurred: {e}")
print(traceback.format_exc())
finally:
# Clean up
if torch.cuda.is_available():
torch.cuda.empty_cache()
```
## Quantization Benefits
The 4-bit quantized model offers several advantages:
1. **Reduced Memory Usage**: Uses approximately 4-5GB of VRAM compared to 14-16GB for the full model (see the quick check after this list)
2. **Wider Accessibility**: Can run on consumer GPUs with limited VRAM (e.g., RTX 3060, GTX 1660)
3. **CPU Fallback**: The memory-efficient implementation can fall back to CPU if GPU memory is insufficient
4. **Minimal Performance Loss**: The quantized model maintains most of the reasoning capabilities of the full model
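A quick way to verify the savings on your own hardware, assuming `model` has been loaded with one of the implementations above (`get_memory_footprint` is a standard `transformers` model method):
```python
# Rough memory check; assumes `model` is already loaded as shown above.
import torch

print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")
if torch.cuda.is_available():
    print(f"CUDA memory allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")
    print(f"CUDA memory reserved:  {torch.cuda.memory_reserved(0) / 1024**3:.2f} GB")
```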
## Performance Tips
1. **Control Image Resolution**:
```python
processor = AutoProcessor.from_pretrained(
model_id,
token=token,
min_pixels=256*28*28, # Lower bound
max_pixels=1280*28*28 # Upper bound
)
```
2. **Enable Flash Attention 2** for better performance (if supported):
```python
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_id,
token=token,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto",
quantization_config=bnb_config
)
```
3. **Memory Management**:
- Call `torch.cuda.empty_cache()` and `gc.collect()` before and after using the model
- Set environment variables: `os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"`
- Use `low_cpu_mem_usage=True` when loading the model (a minimal sketch of these steps follows)
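A minimal sketch of these steps around a single generation call, assuming `model`, `processor`, and `inputs` are prepared as in the implementations above:
```python
# Memory-hygiene sketch; `model`, `processor`, and `inputs` are assumed to be
# prepared exactly as in the implementations above.
import gc
import os
import torch

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # set before large allocations

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]

# Release tensors and clear the CUDA cache once decoding is done
del output_ids
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```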
4. **Generation Parameters**:
- Adjust `max_new_tokens` based on your needs (lower values use less memory)
- Use temperature and top_p to control randomness:
```python
generation_config = {
"max_new_tokens": 256,
"do_sample": True,
"temperature": 0.7,
"top_p": 0.9,
}
```
5. **Multi-Image Processing**:
When working with multiple images, batching them properly can save memory and improve efficiency (a processing sketch follows this example):
```python
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "url_to_image1"},
{"type": "image", "image": "url_to_image2"},
{"type": "text", "text": "Compare these two images."}
]
}
]
```
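A sketch of how those messages could then be processed, under the same assumptions as the implementations above; `image1` and `image2` stand in for PIL images fetched from the placeholder URLs:
```python
# Multi-image sketch; assumes `model` and `processor` are loaded as shown earlier
# and that image1/image2 are PIL images corresponding to the placeholders above.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[image1, image2],  # one entry per image placeholder in the message
    padding=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

print(processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0])
```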