---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
tags:
- multimodal
- qwen
- qwen2
- unsloth
- transformers
- vision
---
Unsloth's Dynamic 4-bit Quants quantize the model selectively, greatly improving accuracy over standard 4-bit quantization.
* **Streamlined and Efficient Vision Encoder**

We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.

We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).

# Using Qwen2.5-VL 7B with 4-bit Quantization

This guide demonstrates how to use the 4-bit quantized version of Qwen2.5-VL, a multimodal vision-language model that can understand images and generate descriptive text. The 4-bit quantization significantly reduces memory requirements while maintaining good performance.

## Table of Contents

- [Requirements](#requirements)
- [Standard Implementation](#standard-implementation)
- [Memory-Efficient Implementation](#memory-efficient-implementation)
- [Quantization Benefits](#quantization-benefits)
- [Performance Tips](#performance-tips)

## Requirements

```bash
pip install transformers torch bitsandbytes accelerate pillow huggingface_hub
pip install qwen-vl-utils[decord]==0.0.8  # For video support (recommended)
# OR
pip install qwen-vl-utils  # Falls back to torchvision for video
```

## Standard Implementation

This implementation provides a good balance between performance and memory efficiency:

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig
from huggingface_hub import login
import requests
from PIL import Image
from io import BytesIO

# Login to Hugging Face with a token that has access to the model
token = "YOUR_HF_TOKEN"  # Replace with your valid token
login(token)

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Model ID
model_id = "ABDALLALSWAITI/Qwen2.5-VL-7B-Instruct-unsloth-bnb-4bit-copy"

# Load processor
processor = AutoProcessor.from_pretrained(model_id, token=token)

# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    token=token
)

# Process image from URL
image_url = "https://i.pinimg.com/736x/69/cd/59/69cd59a5ee5e041aa00f088465befbad.jpg"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content)).convert("RGB")

# Create message according to the Qwen2.5-VL chat format
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Process input
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

# Generate response
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=200)

# Decode only the newly generated tokens
response = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True
)[0]

print(response)
```
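If your image lives on disk rather than at a URL, the same pipeline works with a locally opened PIL image. The snippet below is a minimal sketch that reuses the `model` and `processor` loaded above; the file name and prompt are placeholders, not part of the original example:

```python
# Minimal sketch: reuse the already-loaded model/processor with a local image.
# "my_image.jpg" is a placeholder path -- point it at any image on disk.
local_image = Image.open("my_image.jpg").convert("RGB")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": local_image},
            {"type": "text", "text": "What objects are visible in this picture?"}
        ]
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[local_image], return_tensors="pt").to("cuda")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=200)

print(processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True
)[0])
```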
## Memory-Efficient Implementation

This version includes optimizations for systems with limited resources, with better error handling and memory management:

```python
import torch
import transformers
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig
from huggingface_hub import login
import requests
from PIL import Image
from io import BytesIO
import gc
import os

# Login to Hugging Face with token
token = "YOUR_HF_TOKEN"  # Replace with your valid token
login(token)

# Set environment variables to optimize memory usage
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"


def process_vision_info(messages):
    """Process images and videos from messages"""
    image_inputs = []
    video_inputs = None
    for message in messages:
        if message["role"] == "user" and isinstance(message["content"], list):
            for content in message["content"]:
                if content["type"] == "image":
                    # Handle image from URL
                    if isinstance(content["image"], str) and content["image"].startswith("http"):
                        try:
                            response = requests.get(content["image"], timeout=10)
                            response.raise_for_status()
                            image = Image.open(BytesIO(response.content)).convert("RGB")
                            image_inputs.append(image)
                        except (requests.RequestException, IOError) as e:
                            print(f"Error loading image from URL: {e}")
                    # Handle base64 images
                    elif isinstance(content["image"], str) and content["image"].startswith("data:image"):
                        try:
                            import base64
                            # Extract base64 data after the comma
                            base64_data = content["image"].split(',')[1]
                            image_data = base64.b64decode(base64_data)
                            image = Image.open(BytesIO(image_data)).convert("RGB")
                            image_inputs.append(image)
                        except Exception as e:
                            print(f"Error loading base64 image: {e}")
                    # Handle local file paths
                    elif isinstance(content["image"], str) and content["image"].startswith("file://"):
                        try:
                            file_path = content["image"][7:]  # Remove 'file://'
                            image = Image.open(file_path).convert("RGB")
                            image_inputs.append(image)
                        except Exception as e:
                            print(f"Error loading local image: {e}")
                    else:
                        print("Unsupported image format or source")
    return image_inputs, video_inputs


# Print versions for debugging
print(f"Transformers version: {transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA memory allocated: {torch.cuda.memory_allocated(0)/1024**3:.2f} GB")
    print(f"CUDA memory reserved: {torch.cuda.memory_reserved(0)/1024**3:.2f} GB")

# Load the 4-bit quantized model from Unsloth
model_id = "ABDALLALSWAITI/Qwen2.5-VL-7B-Instruct-unsloth-bnb-4bit-copy"

try:
    # Free GPU memory before loading
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        gc.collect()

    # Load the processor first (less memory intensive)
    print("Loading processor...")
    processor = AutoProcessor.from_pretrained(model_id, token=token)

    # Configure quantization parameters
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        llm_int8_enable_fp32_cpu_offload=True
    )

    print("Loading model...")
    # Try loading with GPU offloading enabled
    try:
        model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_id,
            token=token,
            device_map="auto",
            quantization_config=quantization_config,
            low_cpu_mem_usage=True,
        )
        print("Model loaded successfully with GPU acceleration")
    except (ValueError, RuntimeError, torch.cuda.OutOfMemoryError) as e:
        print(f"GPU loading failed: {e}")
        print("Falling back to CPU-only mode")

        # Clean up any partially loaded model
        if 'model' in locals():
            del model
        torch.cuda.empty_cache()
        gc.collect()

        # Try again with CPU only
        model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_id,
            token=token,
            device_map="cpu",
            torch_dtype=torch.float32,
        )
        print("Model loaded on CPU successfully")

    # Print model's device map if available
    if hasattr(model, 'hf_device_map'):
print("Model device map:") for module, device in model.hf_device_map.items(): print(f" {module}: {device}") # Example message with an image messages = [ { "role": "user", "content": [ { "type": "image", "image": "https://i.pinimg.com/736x/69/cd/59/69cd59a5ee5e041aa00f088465befbad.jpg", }, {"type": "text", "text": "Describe this image in detail."}, ], } ] # Process the messages print("Processing input...") text = processor.apply_chat_template( messages, tokenize=False, add_generation_prompt=True ) image_inputs, video_inputs = process_vision_info(messages) # Check if we have valid image inputs if not image_inputs: raise ValueError("No valid images were processed") # Prepare inputs for the model inputs = processor( text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt", ) # Determine which device to use based on model's main device if hasattr(model, 'hf_device_map'): # Find the primary device (usually where the first transformer block is) for key, device in model.hf_device_map.items(): if 'transformer.blocks.0' in key or 'model.embed_tokens' in key: input_device = device break else: # Default to first device in the map input_device = next(iter(model.hf_device_map.values())) else: # If not distributed, use the model's device input_device = next(model.parameters()).device print(f"Using device {input_device} for inputs") inputs = {k: v.to(input_device) for k, v in inputs.items()} # Generate the response print("Generating response...") with torch.no_grad(): generation_config = { "max_new_tokens": 256, "do_sample": True, "temperature": 0.7, "top_p": 0.9, } generated_ids = model.generate(**inputs, **generation_config) # Process the output generated_ids_trimmed = [ out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs["input_ids"], generated_ids) ] output_text = processor.batch_decode( generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False ) # Print the response print("\nModel response:") print(output_text[0]) except Exception as e: import traceback print(f"An error occurred: {e}") print(traceback.format_exc()) finally: # Clean up if torch.cuda.is_available(): torch.cuda.empty_cache() ``` ## Quantization Benefits The 4-bit quantized model offers several advantages: 1. **Reduced Memory Usage**: Uses approximately 4-5GB of VRAM compared to 14-16GB for the full model 2. **Wider Accessibility**: Can run on consumer GPUs with limited VRAM (e.g., RTX 3060, GTX 1660) 3. **CPU Fallback**: The memory-efficient implementation can fall back to CPU if GPU memory is insufficient 4. **Minimal Performance Loss**: The quantized model maintains most of the reasoning capabilities of the full model ## Performance Tips 1. **Control Image Resolution**: ```python processor = AutoProcessor.from_pretrained( model_id, token=token, min_pixels=256*28*28, # Lower bound max_pixels=1280*28*28 # Upper bound ) ``` 2. **Enable Flash Attention 2** for better performance (if supported): ```python model = Qwen2_5_VLForConditionalGeneration.from_pretrained( model_id, token=token, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2", device_map="auto", quantization_config=bnb_config ) ``` 3. **Memory Management**: - Call `torch.cuda.empty_cache()` and `gc.collect()` before and after using the model - Set environment variables: `os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"` - Use `low_cpu_mem_usage=True` when loading the model 4. 
## Performance Tips

1. **Control Image Resolution**:

   ```python
   processor = AutoProcessor.from_pretrained(
       model_id,
       token=token,
       min_pixels=256*28*28,   # Lower bound
       max_pixels=1280*28*28   # Upper bound
   )
   ```

2. **Enable Flash Attention 2** for better performance (if supported):

   ```python
   model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
       model_id,
       token=token,
       torch_dtype=torch.bfloat16,
       attn_implementation="flash_attention_2",
       device_map="auto",
       quantization_config=bnb_config
   )
   ```

3. **Memory Management**:
   - Call `torch.cuda.empty_cache()` and `gc.collect()` before and after using the model
   - Set environment variables: `os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"`
   - Use `low_cpu_mem_usage=True` when loading the model

4. **Generation Parameters**:
   - Adjust `max_new_tokens` based on your needs (lower values use less memory)
   - Use `temperature` and `top_p` to control randomness:

   ```python
   generation_config = {
       "max_new_tokens": 256,
       "do_sample": True,
       "temperature": 0.7,
       "top_p": 0.9,
   }
   ```

5. **Multi-Image Processing**: When working with multiple images, batch processing them properly can save memory and improve efficiency (see the sketch after this list):

   ```python
   messages = [
       {
           "role": "user",
           "content": [
               {"type": "image", "image": "url_to_image1"},
               {"type": "image", "image": "url_to_image2"},
               {"type": "text", "text": "Compare these two images."}
           ]
       }
   ]
   ```
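The batched multi-image messages above still need to go through the same preprocessing and generation steps as a single image. The snippet below is a minimal sketch of one way to do that, assuming the `model`, `processor`, and `process_vision_info` helper from the memory-efficient implementation are already defined and that the placeholder URLs have been replaced with real image links:

```python
# Minimal sketch: run the multi-image request end to end.
# Assumes `model`, `processor`, and `process_vision_info` from the
# memory-efficient implementation above are already defined.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens
trimmed = [out[len(inp):] for inp, out in zip(inputs["input_ids"], generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```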