ABDALLALSWAITI committed
Commit 191e7b6 · verified · 1 Parent(s): b924a95

Update README.md

Files changed (1)
  1. README.md +338 -448
README.md CHANGED
@@ -57,7 +57,7 @@ All notebooks are **beginner friendly**! Add your dataset, click "Run All", and
57
 
58
  ## Introduction
59
 
60
- In the past five months since Qwen2-VLs release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.
61
 
62
  #### Key Enhancements:
63
  * **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but it is highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
@@ -89,482 +89,372 @@ We enhance both training and inference speeds by strategically implementing wind
89
 
90
  We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
91
 
 
92
 
 
93
 
94
- ## Evaluation
95
-
96
- ### Image benchmark
97
-
98
-
99
- | Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B |**Qwen2.5-VL-7B** |
100
- | :--- | :---: | :---: | :---: | :---: | :---: |
101
- | MMMU<sub>val</sub> | 56 | 50.4 | **60**| 54.1 | 58.6|
102
- | MMMU-Pro<sub>val</sub> | 34.3 | - | 37.6| 30.5 | 41.0|
103
- | DocVQA<sub>test</sub> | 93 | 93 | - | 94.5 | **95.7** |
104
- | InfoVQA<sub>test</sub> | 77.6 | - | - |76.5 | **82.6** |
105
- | ChartQA<sub>test</sub> | 84.8 | - |- | 83.0 |**87.3** |
106
- | TextVQA<sub>val</sub> | 79.1 | 80.1 | -| 84.3 | **84.9**|
107
- | OCRBench | 822 | 852 | 785 | 845 | **864** |
108
- | CC_OCR | 57.7 | | | 61.6 | **77.8**|
109
- | MMStar | 62.8| | |60.7| **63.9**|
110
- | MMBench-V1.1-En<sub>test</sub> | 79.4 | 78.0 | 76.0| 80.7 | **82.6** |
111
- | MMT-Bench<sub>test</sub> | - | - | - |**63.7** |63.6 |
112
- | MMStar | **61.5** | 57.5 | 54.8 | 60.7 |63.9 |
113
- | MMVet<sub>GPT-4-Turbo</sub> | 54.2 | 60.0 | 66.9 | 62.0 | **67.1**|
114
- | HallBench<sub>avg</sub> | 45.2 | 48.1 | 46.1| 50.6 | **52.9**|
115
- | MathVista<sub>testmini</sub> | 58.3 | 60.6 | 52.4 | 58.2 | **68.2**|
116
- | MathVision | - | - | - | 16.3 | **25.07** |
117
-
118
- ### Video Benchmarks
119
-
120
- | Benchmark | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
121
- | :--- | :---: | :---: |
122
- | MVBench | 67.0 | **69.6** |
123
- | PerceptionTest<sub>test</sub> | 66.9 | **70.5** |
124
- | Video-MME<sub>wo/w subs</sub> | 63.3/69.0 | **65.1**/**71.6** |
125
- | LVBench | | 45.3 |
126
- | LongVideoBench | | 54.7 |
127
- | MMBench-Video | 1.44 | 1.79 |
128
- | TempCompass | | 71.7 |
129
- | MLVU | | 70.2 |
130
- | CharadesSTA/mIoU | 43.6|
131
-
132
- ### Agent benchmark
133
- | Benchmarks | Qwen2.5-VL-7B |
134
- |-------------------------|---------------|
135
- | ScreenSpot | 84.7 |
136
- | ScreenSpot Pro | 29.0 |
137
- | AITZ_EM | 81.9 |
138
- | Android Control High_EM | 60.1 |
139
- | Android Control Low_EM | 93.7 |
140
- | AndroidWorld_SR | 25.5 |
141
- | MobileMiniWob++_SR | 91.4 |
142
 
143
  ## Requirements
144
- The code of Qwen2.5-VL has been in the latest Hugging face transformers and we advise you to build from source with command:
145
- ```
146
- pip install git+https://github.com/huggingface/transformers accelerate
147
- ```
148
- or you might encounter the following error:
149
- ```
150
- KeyError: 'qwen2_5_vl'
151
- ```
152
-
153
-
154
- ## Quickstart
155
-
156
- Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.
157
-
158
- The code of Qwen2.5-VL has been in the latest Hugging face transformers and we advise you to build from source with command:
159
- ```
160
- pip install git+https://github.com/huggingface/transformers accelerate
161
- ```
162
- or you might encounter the following error:
163
- ```
164
- KeyError: 'qwen2_5_vl'
165
- ```
166
-
167
-
168
- We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:
169
 
170
  ```bash
171
- # It's highly recommanded to use `[decord]` feature for faster video loading.
172
- pip install qwen-vl-utils[decord]==0.0.8
173
- ```
174
-
175
- If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils` which will fall back to using torchvision for video processing. However, you can still [install decord from source](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source) to get decord used when loading video.
176
-
177
- ### Using 🤗 Transformers to Chat
178
-
179
- Here we show a code snippet to show you how to use the chat model with `transformers` and `qwen_vl_utils`:
180
-
181
- ```python
182
- from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
183
- from qwen_vl_utils import process_vision_info
184
-
185
- # default: Load the model on the available device(s)
186
- model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
187
- "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
188
- )
189
-
190
- # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
191
- # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
192
- # "Qwen/Qwen2.5-VL-7B-Instruct",
193
- # torch_dtype=torch.bfloat16,
194
- # attn_implementation="flash_attention_2",
195
- # device_map="auto",
196
- # )
197
-
198
- # default processer
199
- processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
200
-
201
- # The default range for the number of visual tokens per image in the model is 4-16384.
202
- # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
203
- # min_pixels = 256*28*28
204
- # max_pixels = 1280*28*28
205
- # processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
206
-
207
- messages = [
208
- {
209
- "role": "user",
210
- "content": [
211
- {
212
- "type": "image",
213
- "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
214
- },
215
- {"type": "text", "text": "Describe this image."},
216
- ],
217
- }
218
- ]
219
-
220
- # Preparation for inference
221
- text = processor.apply_chat_template(
222
- messages, tokenize=False, add_generation_prompt=True
223
- )
224
- image_inputs, video_inputs = process_vision_info(messages)
225
- inputs = processor(
226
- text=[text],
227
- images=image_inputs,
228
- videos=video_inputs,
229
- padding=True,
230
- return_tensors="pt",
231
- )
232
- inputs = inputs.to("cuda")
233
-
234
- # Inference: Generation of the output
235
- generated_ids = model.generate(**inputs, max_new_tokens=128)
236
- generated_ids_trimmed = [
237
- out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
238
- ]
239
- output_text = processor.batch_decode(
240
- generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
241
- )
242
- print(output_text)
243
  ```
244
- <details>
245
- <summary>Multi image inference</summary>
246
-
247
- ```python
248
- # Messages containing multiple images and a text query
249
- messages = [
250
- {
251
- "role": "user",
252
- "content": [
253
- {"type": "image", "image": "file:///path/to/image1.jpg"},
254
- {"type": "image", "image": "file:///path/to/image2.jpg"},
255
- {"type": "text", "text": "Identify the similarities between these images."},
256
- ],
257
- }
258
- ]
259
 
260
- # Preparation for inference
261
- text = processor.apply_chat_template(
262
- messages, tokenize=False, add_generation_prompt=True
263
- )
264
- image_inputs, video_inputs = process_vision_info(messages)
265
- inputs = processor(
266
- text=[text],
267
- images=image_inputs,
268
- videos=video_inputs,
269
- padding=True,
270
- return_tensors="pt",
271
- )
272
- inputs = inputs.to("cuda")
273
-
274
- # Inference
275
- generated_ids = model.generate(**inputs, max_new_tokens=128)
276
- generated_ids_trimmed = [
277
- out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
278
- ]
279
- output_text = processor.batch_decode(
280
- generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
281
- )
282
- print(output_text)
283
- ```
284
- </details>
285
 
286
- <details>
287
- <summary>Video inference</summary>
288
 
289
  ```python
290
- # Messages containing a images list as a video and a text query
291
- messages = [
292
- {
293
- "role": "user",
294
- "content": [
295
- {
296
- "type": "video",
297
- "video": [
298
- "file:///path/to/frame1.jpg",
299
- "file:///path/to/frame2.jpg",
300
- "file:///path/to/frame3.jpg",
301
- "file:///path/to/frame4.jpg",
302
- ],
303
- },
304
- {"type": "text", "text": "Describe this video."},
305
- ],
306
- }
307
- ]
308
-
309
- # Messages containing a local video path and a text query
310
- messages = [
311
- {
312
- "role": "user",
313
- "content": [
314
- {
315
- "type": "video",
316
- "video": "file:///path/to/video1.mp4",
317
- "max_pixels": 360 * 420,
318
- "fps": 1.0,
319
- },
320
- {"type": "text", "text": "Describe this video."},
321
- ],
322
- }
323
- ]
324
-
325
- # Messages containing a video url and a text query
326
- messages = [
327
- {
328
- "role": "user",
329
- "content": [
330
- {
331
- "type": "video",
332
- "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
333
- },
334
- {"type": "text", "text": "Describe this video."},
335
- ],
336
- }
337
- ]
338
-
339
- #In Qwen 2.5 VL, frame rate information is also input into the model to align with absolute time.
340
- # Preparation for inference
341
- text = processor.apply_chat_template(
342
- messages, tokenize=False, add_generation_prompt=True
343
- )
344
- image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
345
- inputs = processor(
346
- text=[text],
347
- images=image_inputs,
348
- videos=video_inputs,
349
- fps=fps,
350
- padding=True,
351
- return_tensors="pt",
352
- **video_kwargs,
353
  )
354
- inputs = inputs.to("cuda")
355
-
356
- # Inference
357
- generated_ids = model.generate(**inputs, max_new_tokens=128)
358
- generated_ids_trimmed = [
359
- out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
360
- ]
361
- output_text = processor.batch_decode(
362
- generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
363
- )
364
- print(output_text)
365
- ```
366
 
367
- Video URL compatibility largely depends on the third-party library version. The details are in the table below. change the backend by `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.
 
368
 
369
- | Backend | HTTP | HTTPS |
370
- |-------------|------|-------|
371
- | torchvision >= 0.19.0 | ✅ | ✅ |
372
- | torchvision < 0.19.0 | ❌ | ❌ |
373
- | decord | ✅ | ❌ |
374
- </details>
375
-
376
- <details>
377
- <summary>Batch inference</summary>
378
-
379
- ```python
380
- # Sample messages for batch inference
381
- messages1 = [
382
- {
383
- "role": "user",
384
- "content": [
385
- {"type": "image", "image": "file:///path/to/image1.jpg"},
386
- {"type": "image", "image": "file:///path/to/image2.jpg"},
387
- {"type": "text", "text": "What are the common elements in these pictures?"},
388
- ],
389
- }
390
- ]
391
- messages2 = [
392
- {"role": "system", "content": "You are a helpful assistant."},
393
- {"role": "user", "content": "Who are you?"},
394
- ]
395
- # Combine messages for batch processing
396
- messages = [messages1, messages2]
397
 
398
- # Preparation for batch inference
399
- texts = [
400
- processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
401
- for msg in messages
402
- ]
403
- image_inputs, video_inputs = process_vision_info(messages)
404
- inputs = processor(
405
- text=texts,
406
- images=image_inputs,
407
- videos=video_inputs,
408
- padding=True,
409
- return_tensors="pt",
410
- )
411
- inputs = inputs.to("cuda")
412
-
413
- # Batch Inference
414
- generated_ids = model.generate(**inputs, max_new_tokens=128)
415
- generated_ids_trimmed = [
416
- out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
417
- ]
418
- output_texts = processor.batch_decode(
419
- generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
420
  )
421
- print(output_texts)
422
- ```
423
- </details>
424
-
425
- ### 🤖 ModelScope
426
- We strongly advise users especially those in mainland China to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.
427
-
428
-
429
- ### More Usage Tips
430
 
431
- For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
432
 
433
- ```python
434
- # You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
435
- ## Local file path
436
- messages = [
437
- {
438
- "role": "user",
439
- "content": [
440
- {"type": "image", "image": "file:///path/to/your/image.jpg"},
441
- {"type": "text", "text": "Describe this image."},
442
- ],
443
- }
444
- ]
445
- ## Image URL
446
- messages = [
447
- {
448
- "role": "user",
449
- "content": [
450
- {"type": "image", "image": "http://path/to/your/image.jpg"},
451
- {"type": "text", "text": "Describe this image."},
452
- ],
453
- }
454
- ]
455
- ## Base64 encoded image
456
  messages = [
457
  {
458
  "role": "user",
459
  "content": [
460
- {"type": "image", "image": "data:image;base64,/9j/..."},
461
- {"type": "text", "text": "Describe this image."},
462
- ],
463
  }
464
  ]
465
- ```
466
- #### Image Resolution for performance boost
467
-
468
- The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
469
 
470
- ```python
471
- min_pixels = 256 * 28 * 28
472
- max_pixels = 1280 * 28 * 28
473
- processor = AutoProcessor.from_pretrained(
474
- "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
475
- )
476
  ```
477
 
478
- Besides, We provide two methods for fine-grained control over the image size input to the model:
479
 
480
- 1. Define min_pixels and max_pixels: Images will be resized to maintain their aspect ratio within the range of min_pixels and max_pixels.
481
-
482
- 2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.
483
 
484
  ```python
485
- # min_pixels and max_pixels
486
- messages = [
487
- {
488
- "role": "user",
489
- "content": [
490
- {
491
- "type": "image",
492
- "image": "file:///path/to/your/image.jpg",
493
- "resized_height": 280,
494
- "resized_width": 420,
495
- },
496
- {"type": "text", "text": "Describe this image."},
497
- ],
498
- }
499
- ]
500
- # resized_height and resized_width
501
- messages = [
502
- {
503
- "role": "user",
504
- "content": [
505
- {
506
- "type": "image",
507
- "image": "file:///path/to/your/image.jpg",
508
- "min_pixels": 50176,
509
- "max_pixels": 50176,
510
- },
511
- {"type": "text", "text": "Describe this image."},
512
- ],
513
- }
514
- ]
515
- ```
516
-
517
- ### Processing Long Texts
518
-
519
- The current `config.json` is set for context length up to 32,768 tokens.
520
- To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
521
-
522
- For supported frameworks, you could add the following to `config.json` to enable YaRN:
523
-
524
- {
525
- ...,
526
- "type": "yarn",
527
- "mrope_section": [
528
- 16,
529
- 24,
530
- 24
531
- ],
532
- "factor": 4,
533
- "original_max_position_embeddings": 32768
534
- }
535
-
536
- However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.
537
-
538
- At the same time, for long video inputs, since MRoPE itself is more economical with ids, the max_position_embeddings can be directly modified to a larger value, such as 64k.
539
-
540
-
541
-
542
-
543
- ## Citation
544
-
545
- If you find our work helpful, feel free to give us a cite.
546
-
547
- ```
548
- @misc{qwen2.5-VL,
549
- title = {Qwen2.5-VL},
550
- url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
551
- author = {Qwen Team},
552
- month = {January},
553
- year = {2025}
554
- }
555
-
556
- @article{Qwen2VL,
557
- title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
558
- author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
559
- journal={arXiv preprint arXiv:2409.12191},
560
- year={2024}
561
- }
562
-
563
- @article{Qwen-VL,
564
- title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
565
- author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
566
- journal={arXiv preprint arXiv:2308.12966},
567
- year={2023}
568
- }
569
  ```
570
57
 
58
  ## Introduction
59
 
60
+ In the past five months since Qwen2-VL's release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.
61
 
62
  #### Key Enhancements:
63
  * **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but it is highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
 
89
 
90
  We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
91
 
92
+ # Using Qwen2.5-VL 7B with 4-bit Quantization
93
 
94
+ This guide demonstrates how to use a 4-bit quantized version of Qwen2.5-VL-7B-Instruct, a multimodal vision-language model that understands images and generates descriptive text. 4-bit quantization significantly reduces memory requirements while maintaining good performance.
95
 
96
+ ## Table of Contents
97
+ - [Requirements](#requirements)
98
+ - [Standard Implementation](#standard-implementation)
99
+ - [Memory-Efficient Implementation](#memory-efficient-implementation)
100
+ - [Quantization Benefits](#quantization-benefits)
101
+ - [Performance Tips](#performance-tips)
102
 
103
  ## Requirements
104
 
105
  ```bash
106
+ pip install transformers torch bitsandbytes accelerate pillow huggingface_hub
107
+ pip install qwen-vl-utils[decord]==0.0.8 # For video support (recommended)
108
+ # OR
109
+ pip install qwen-vl-utils # Falls back to torchvision for video
110
  ```
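+
+ Qwen2.5-VL support is only present in recent `transformers` releases; older builds fail with `KeyError: 'qwen2_5_vl'` when loading the model. A minimal sanity check, assuming only that the packages above are installed:
+
+ ```python
+ # Confirm that the installed transformers build ships the Qwen2.5-VL classes.
+ # If the import fails, upgrade transformers (e.g. `pip install -U transformers`).
+ import transformers
+ print("transformers version:", transformers.__version__)
+
+ from transformers import Qwen2_5_VLForConditionalGeneration  # raises ImportError on older builds
+ print("Qwen2.5-VL support is available")
+ ```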
111
 
112
+ ## Standard Implementation
113
 
114
+ This implementation provides a good balance between performance and memory efficiency:
 
115
 
116
  ```python
117
+ import torch
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig
+ from huggingface_hub import login
+ import requests
+ from PIL import Image
+ from io import BytesIO
+
+ # Login to Hugging Face with token
+ # You need to use a valid token with access to the model
+ token = "YOUR_HF_TOKEN"  # Replace with your valid token
+ login(token)
+
+ # Configure quantization
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_compute_dtype=torch.float16,
+     bnb_4bit_use_double_quant=True,
+     bnb_4bit_quant_type="nf4"
  )

+ # Model ID
+ model_id = "unsloth/Qwen2.5-VL-7B-Instruct-unsloth-bnb-4bit"

+ # Load processor
+ processor = AutoProcessor.from_pretrained(model_id, token=token)

+ # Load model
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_id,
+     quantization_config=bnb_config,
+     device_map="auto",
+     token=token
  )

+ # Process image from URL
+ image_url = "https://i.pinimg.com/736x/69/cd/59/69cd59a5ee5e041aa00f088465befbad.jpg"
+ response = requests.get(image_url)
+ image = Image.open(BytesIO(response.content)).convert("RGB")

+ # Create message according to Qwen2.5-VL format
  messages = [
      {
          "role": "user",
          "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": "Describe this image in detail."}
+         ]
      }
  ]

+ # Process input
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")
+
+ # Generate response
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=200)
+
+ # Decode response
+ response = processor.batch_decode(
+     output_ids[:, inputs.input_ids.shape[1]:],
+     skip_special_tokens=True
+ )[0]
+
+ print(response)
182
  ```
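+
+ The `qwen-vl-utils` package from the Requirements section also handles video input. The sketch below reuses the `model` and `processor` loaded above; the local path is a placeholder, and the `fps`/`max_pixels` values are just reasonable defaults:
+
+ ```python
+ from qwen_vl_utils import process_vision_info
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "video",
+                 "video": "file:///path/to/video1.mp4",  # placeholder local clip
+                 "max_pixels": 360 * 420,  # cap per-frame resolution to limit visual tokens
+                 "fps": 1.0,               # sample one frame per second
+             },
+             {"type": "text", "text": "Describe this video."},
+         ],
+     }
+ ]
+
+ # qwen-vl-utils loads the clip and samples frames according to fps/max_pixels.
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ ).to("cuda")
+
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=128)
+ print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
+ ```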
183
 
184
+ ## Memory-Efficient Implementation
185
 
186
+ This version adds optimizations for systems with limited resources, including more robust error handling and explicit memory management:
 
 
187
 
188
  ```python
189
+ import torch
+ import transformers
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig
+ from huggingface_hub import login
+ import requests
+ from PIL import Image
+ from io import BytesIO
+ import gc
+ import os
+
+ # Login to Hugging Face with token
+ token = "YOUR_HF_TOKEN"  # Replace with your valid token
+ login(token)
+
+ # Set environment variables to optimize memory usage
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
+
+ def process_vision_info(messages):
+     """Process images and videos from messages"""
+     image_inputs = []
+     video_inputs = None
+
+     for message in messages:
+         if message["role"] == "user" and isinstance(message["content"], list):
+             for content in message["content"]:
+                 if content["type"] == "image":
+                     # Handle image from URL
+                     if isinstance(content["image"], str) and content["image"].startswith("http"):
+                         try:
+                             response = requests.get(content["image"], timeout=10)
+                             response.raise_for_status()
+                             image = Image.open(BytesIO(response.content)).convert("RGB")
+                             image_inputs.append(image)
+                         except (requests.RequestException, IOError) as e:
+                             print(f"Error loading image from URL: {e}")
+                     # Handle base64 images
+                     elif isinstance(content["image"], str) and content["image"].startswith("data:image"):
+                         try:
+                             import base64
+                             # Extract base64 data after the comma
+                             base64_data = content["image"].split(',')[1]
+                             image_data = base64.b64decode(base64_data)
+                             image = Image.open(BytesIO(image_data)).convert("RGB")
+                             image_inputs.append(image)
+                         except Exception as e:
+                             print(f"Error loading base64 image: {e}")
+                     # Handle local file paths
+                     elif isinstance(content["image"], str) and content["image"].startswith("file://"):
+                         try:
+                             file_path = content["image"][7:]  # Remove 'file://'
+                             image = Image.open(file_path).convert("RGB")
+                             image_inputs.append(image)
+                         except Exception as e:
+                             print(f"Error loading local image: {e}")
+                     else:
+                         print("Unsupported image format or source")
+
+     return image_inputs, video_inputs
+
+ # Print versions for debugging
+ print(f"Transformers version: {transformers.__version__}")
+ print(f"PyTorch version: {torch.__version__}")
+ print(f"CUDA available: {torch.cuda.is_available()}")
+ if torch.cuda.is_available():
+     print(f"CUDA device: {torch.cuda.get_device_name(0)}")
+     print(f"CUDA memory allocated: {torch.cuda.memory_allocated(0)/1024**3:.2f} GB")
+     print(f"CUDA memory reserved: {torch.cuda.memory_reserved(0)/1024**3:.2f} GB")
+
+ # Load the 4-bit quantized model from Unsloth
+ model_id = "unsloth/Qwen2.5-VL-7B-Instruct-unsloth-bnb-4bit"
+ try:
+     # Free GPU memory before loading
+     if torch.cuda.is_available():
+         torch.cuda.empty_cache()
+         gc.collect()
+
+     # Load the processor first (less memory intensive)
+     print("Loading processor...")
+     processor = AutoProcessor.from_pretrained(model_id, token=token)
+
+     # Configure quantization parameters
+     quantization_config = BitsAndBytesConfig(
+         load_in_4bit=True,
+         bnb_4bit_compute_dtype=torch.float16,
+         bnb_4bit_use_double_quant=True,
+         bnb_4bit_quant_type="nf4",
+         llm_int8_enable_fp32_cpu_offload=True
+     )
+
+     print("Loading model...")
+     # Try loading with GPU offloading enabled
+     try:
+         model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+             model_id,
+             token=token,
+             device_map="auto",
+             quantization_config=quantization_config,
+             low_cpu_mem_usage=True,
+         )
+         print("Model loaded successfully with GPU acceleration")
+     except (ValueError, RuntimeError, torch.cuda.OutOfMemoryError) as e:
+         print(f"GPU loading failed: {e}")
+         print("Falling back to CPU-only mode")
+
+         # Clean up any partially loaded model
+         if 'model' in locals():
+             del model
+         torch.cuda.empty_cache()
+         gc.collect()
+
+         # Try again with CPU only
+         model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+             model_id,
+             token=token,
+             device_map="cpu",
+             torch_dtype=torch.float32,
+         )
+         print("Model loaded on CPU successfully")
+
+     # Print model's device map if available
+     if hasattr(model, 'hf_device_map'):
+         print("Model device map:")
+         for module, device in model.hf_device_map.items():
+             print(f" {module}: {device}")
+
+     # Example message with an image
+     messages = [
+         {
+             "role": "user",
+             "content": [
+                 {
+                     "type": "image",
+                     "image": "https://i.pinimg.com/736x/69/cd/59/69cd59a5ee5e041aa00f088465befbad.jpg",
+                 },
+                 {"type": "text", "text": "Describe this image in detail."},
+             ],
+         }
+     ]
+
+     # Process the messages
+     print("Processing input...")
+     text = processor.apply_chat_template(
+         messages, tokenize=False, add_generation_prompt=True
+     )
+     image_inputs, video_inputs = process_vision_info(messages)
+
+     # Check if we have valid image inputs
+     if not image_inputs:
+         raise ValueError("No valid images were processed")
+
+     # Prepare inputs for the model
+     inputs = processor(
+         text=[text],
+         images=image_inputs,
+         videos=video_inputs,
+         padding=True,
+         return_tensors="pt",
+     )
+
+     # Determine which device to use based on model's main device
+     if hasattr(model, 'hf_device_map'):
+         # Find the primary device (usually where the first transformer block is)
+         for key, device in model.hf_device_map.items():
+             if 'transformer.blocks.0' in key or 'model.embed_tokens' in key:
+                 input_device = device
+                 break
+         else:
+             # Default to first device in the map
+             input_device = next(iter(model.hf_device_map.values()))
+     else:
+         # If not distributed, use the model's device
+         input_device = next(model.parameters()).device
+
+     print(f"Using device {input_device} for inputs")
+     inputs = {k: v.to(input_device) for k, v in inputs.items()}
+
+     # Generate the response
+     print("Generating response...")
+     with torch.no_grad():
+         generation_config = {
+             "max_new_tokens": 256,
+             "do_sample": True,
+             "temperature": 0.7,
+             "top_p": 0.9,
+         }
+         generated_ids = model.generate(**inputs, **generation_config)
+
+     # Process the output
+     generated_ids_trimmed = [
+         out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs["input_ids"], generated_ids)
+     ]
+     output_text = processor.batch_decode(
+         generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+     )
+
+     # Print the response
+     print("\nModel response:")
+     print(output_text[0])
+ except Exception as e:
+     import traceback
+     print(f"An error occurred: {e}")
+     print(traceback.format_exc())
+ finally:
+     # Clean up
+     if torch.cuda.is_available():
+         torch.cuda.empty_cache()
395
  ```
396
 
397
+ ## Quantization Benefits
398
+
399
+ The 4-bit quantized model offers several advantages:
400
+
401
+ 1. **Reduced Memory Usage**: Uses approximately 4-5 GB of VRAM, compared to 14-16 GB for the full-precision model (see the quick check after this list)
402
+ 2. **Wider Accessibility**: Can run on consumer GPUs with limited VRAM (e.g., RTX 3060, GTX 1660)
403
+ 3. **CPU Fallback**: The memory-efficient implementation can fall back to CPU if GPU memory is insufficient
404
+ 4. **Minimal Performance Loss**: The quantized model maintains most of the reasoning capabilities of the full model
405
+
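+ A quick way to verify the memory figures on your own hardware, assuming the model has already been loaded as shown above:
+
+ ```python
+ import torch
+
+ # Report how much GPU memory the loaded 4-bit model actually occupies.
+ if torch.cuda.is_available():
+     allocated_gb = torch.cuda.memory_allocated(0) / 1024**3
+     reserved_gb = torch.cuda.memory_reserved(0) / 1024**3
+     print(f"Allocated: {allocated_gb:.2f} GB | Reserved: {reserved_gb:.2f} GB")
+ else:
+     print("No CUDA device available; the model is running on CPU.")
+ ```
+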
406
+ ## Performance Tips
407
+
408
+ 1. **Control Image Resolution**:
409
+ ```python
410
+ processor = AutoProcessor.from_pretrained(
411
+ model_id,
412
+ token=token,
413
+ min_pixels=256*28*28, # Lower bound
414
+ max_pixels=1280*28*28 # Upper bound
415
+ )
416
+ ```
417
+
418
+ 2. **Enable Flash Attention 2** for better performance (if supported):
419
+ ```python
420
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
421
+ model_id,
422
+ token=token,
423
+ torch_dtype=torch.bfloat16,
424
+ attn_implementation="flash_attention_2",
425
+ device_map="auto",
426
+ quantization_config=bnb_config
427
+ )
428
+ ```
429
+
430
+ 3. **Memory Management**:
431
+ - Call `torch.cuda.empty_cache()` and `gc.collect()` before and after using the model
432
+ - Set environment variables: `os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"`
433
+ - Use `low_cpu_mem_usage=True` when loading the model
434
+
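+ Putting these suggestions together (a sketch that assumes the same quantized checkpoint as above):
+
+ ```python
+ import os
+ import gc
+ import torch
+ from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration
+
+ # Reduce CUDA allocator fragmentation; set before the first allocation.
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
+
+ # Release cached memory left over from earlier runs.
+ gc.collect()
+ if torch.cuda.is_available():
+     torch.cuda.empty_cache()
+
+ # low_cpu_mem_usage avoids materialising a full copy of the weights in host RAM.
+ bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     "unsloth/Qwen2.5-VL-7B-Instruct-unsloth-bnb-4bit",
+     quantization_config=bnb_config,
+     device_map="auto",
+     low_cpu_mem_usage=True,
+ )
+ ```
+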
435
+ 4. **Generation Parameters**:
436
+ - Adjust `max_new_tokens` based on your needs (lower values use less memory)
437
+ - Use temperature and top_p to control randomness:
438
+ ```python
439
+ generation_config = {
440
+ "max_new_tokens": 256,
441
+ "do_sample": True,
442
+ "temperature": 0.7,
443
+ "top_p": 0.9,
444
+ }
445
+ ```
446
+
447
+ 5. **Multi-Image Processing**:
448
+ When working with multiple images, batch processing them properly can save memory and improve efficiency:
449
+ ```python
450
+ messages = [
451
+ {
452
+ "role": "user",
453
+ "content": [
454
+ {"type": "image", "image": "url_to_image1"},
455
+ {"type": "image", "image": "url_to_image2"},
456
+ {"type": "text", "text": "Compare these two images."}
457
+ ]
458
+ }
459
+ ]
460
+ ```
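+
+ A possible continuation, assuming the `model` and `processor` from the Standard Implementation are already loaded and the two placeholder URLs are replaced with real image links:
+
+ ```python
+ import torch
+ import requests
+ from PIL import Image
+ from io import BytesIO
+
+ # Download every image referenced in the message, in order.
+ images = []
+ for item in messages[0]["content"]:
+     if item["type"] == "image":
+         resp = requests.get(item["image"], timeout=10)
+         resp.raise_for_status()
+         images.append(Image.open(BytesIO(resp.content)).convert("RGB"))
+
+ # A single forward pass sees both images plus the comparison prompt.
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text], images=images, padding=True, return_tensors="pt").to("cuda")
+
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=200)
+ print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
+ ```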