Bubbl-P4-multimodal-instruct (4-bit Quantized)

This repository contains a 4-bit quantized version of the microsoft/Phi-4-multimodal-instruct model.

Quantization was performed using the bitsandbytes library integrated with transformers.

Model Description

  • Original Model: microsoft/Phi-4-multimodal-instruct
  • Quantization Method: bitsandbytes Post-Training Quantization (PTQ)
  • Precision: 4-bit
  • Quantization Config (reproduced as a code sketch below):
    • load_in_4bit=True
    • bnb_4bit_quant_type="nf4" (NormalFloat 4-bit)
    • bnb_4bit_compute_dtype=torch.bfloat16 (Computation performed in BF16 for compatible GPUs like A100)
    • bnb_4bit_use_double_quant=True (Enables nested quantization for potentially more memory savings)

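The listed configuration corresponds roughly to the following bitsandbytes setup. This is a minimal sketch of how the quantization could be reproduced from the original checkpoint, not the exact script used to build this repository.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Hypothetical reconstruction of the quantization setup described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16 on supported GPUs
    bnb_4bit_use_double_quant=True,         # nested quantization of the quantization constants
)

# Weights are converted to 4-bit as they are loaded from the original checkpoint
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)
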
This version was created to provide the capabilities of Phi-4-multimodal with a significantly reduced memory footprint, making it suitable for deployment on GPUs with lower VRAM.

Intended Use

This quantized model is primarily intended for scenarios where VRAM resources are constrained, but the advanced multimodal reasoning, language understanding, and instruction-following capabilities of Phi-4-multimodal-instruct are desired.

Refer to the original model card for the full range of intended uses and capabilities of the base model.

How to Use

You can load this 4-bit quantized model directly with the transformers library. Make sure the required dependencies are installed (pip install transformers bitsandbytes accelerate torch torchvision pillow soundfile scipy sentencepiece protobuf).

from transformers import AutoModelForCausalLM, AutoProcessor
import torch

model_id = "bubblspace/Bubbl-P4-multimodal-instruct"

# Load the processor (requires trust_remote_code)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load the model with 4-bit quantization enabled
# The quantization config is loaded automatically from the model's config file
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # Essential for Phi-4 models
    device_map="auto",       # Automatically map model layers to available GPU(s)
    # load_in_4bit / torch_dtype are not needed here: the saved quantization
    # config already specifies 4-bit NF4 weights and the BF16 compute dtype.
)

print("4-bit quantized model loaded successfully!")

# --- Example: Text Inference ---
prompt = "<|user|>\nExplain the benefits of model quantization.<|end|>\n<|assistant|>"
inputs = processor(text=prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=150)
# Decode only the newly generated tokens, dropping the prompt and special tokens
response_text = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response_text)

# --- Example: Image Inference Placeholder ---
# from PIL import Image
# import requests
# url = "your_image_url.jpg"
# image = Image.open(requests.get(url, stream=True).raw)
# image_prompt = "<|user|>\n<|image_1|>\nDescribe this image.<|end|>\n<|assistant|>"
# inputs = processor(text=image_prompt, images=image, return_tensors="pt").to(model.device)
# outputs = model.generate(**inputs, max_new_tokens=100)
# response_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
# print(response_text)

# --- Example: Audio Inference Placeholder ---
# import soundfile as sf
# audio_path = "your_audio.wav"
# audio_array, sampling_rate = sf.read(audio_path)
# audio_prompt = "<|user|>\n<|audio_1|>\nTranscribe this audio.<|end|>\n<|assistant|>"
# inputs = processor(text=audio_prompt, audios=[(audio_array, sampling_rate)], return_tensors="pt").to(model.device)
# # ... generate and decode ...

Important: Remember to always pass trust_remote_code=True when loading both the processor and the model for Phi-4 architectures.

Hardware Requirements

  • Requires a CUDA-enabled GPU.
  • The 4-bit quantization significantly reduces VRAM requirements compared to the original BF16 model, whose weights alone need roughly 11-12 GB. This version should fit comfortably on GPUs with ~10 GB of VRAM, and potentially less depending on context length and batch size (evaluation recommended); a quick footprint check is sketched after this list.
  • Any inference-speed gains compared to the original are most likely on GPUs that efficiently handle lower-precision operations (e.g., NVIDIA Ampere or Ada Lovelace series such as the A100, L4, and RTX 30/40 cards); on other hardware, expect memory savings rather than speedups.
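
As a quick sanity check on your own hardware, you can inspect the footprint of the loaded model directly. A minimal sketch, assuming the model has already been loaded as in the usage example above:

import torch

# Weight memory as reported by transformers (includes the 4-bit quantization constants)
print(f"Model weight footprint: {model.get_memory_footprint() / 1024**3:.2f} GiB")

# Peak GPU memory actually allocated so far (weights plus any activations / KV cache)
if torch.cuda.is_available():
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak CUDA memory allocated: {peak:.2f} GiB")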

Limitations and Considerations

  • Potential Accuracy Impact: While 4-bit quantization aims to preserve performance, there might be a slight degradation in accuracy compared to the original BF16 model. Users should evaluate the model's performance on their specific tasks to ensure the trade-off is acceptable.
  • Inference Speed: Memory usage is significantly reduced, but inference speed may or may not improve over the original BF16 model; it depends heavily on the hardware, batch size, sequence length, and specific implementation details. Test on your target hardware (a simple timing sketch follows this list).
  • Multimodal Evaluation: Quantization primarily affects the model weights. Thorough evaluation on specific vision and audio tasks is recommended to confirm performance characteristics for multimodal use cases.
  • Inherited Limitations: This model inherits the limitations, biases, and safety considerations of the original microsoft/Phi-4-multimodal-instruct model. Please refer to its model card for detailed information on responsible AI practices.
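
A simple way to measure the speed trade-off on your own prompts is to time generation directly. The sketch below assumes the model and processor from the usage example above; it is illustrative only, not a rigorous benchmark:

import time
import torch

prompt = "<|user|>\nSummarize the trade-offs of 4-bit quantization.<|end|>\n<|assistant|>"
inputs = processor(text=prompt, return_tensors="pt").to(model.device)

# Warm-up run so CUDA kernels and caches are initialized before timing
_ = model.generate(**inputs, max_new_tokens=16)

torch.cuda.synchronize()
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"Generated {new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tokens/s)")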

License

The model is licensed under the MIT License, consistent with the original microsoft/Phi-4-multimodal-instruct model.

Citation

Please cite the original work if you use this model:

@misc{phi4multimodal2025,
      title={Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs},
      author={Microsoft},
      year={2025},
      eprint={2503.01743},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Additionally, if you use this specific 4-bit quantized version, please acknowledge **Bubblspace** ([bubblspace.com](https://bubblspace.com)) and **AIEDX** ([aiedx.com](https://aiedx.com)) for providing this quantized model. You could add a note such as:

> *"We used the 4-bit quantized version of Phi-4-multimodal-instruct provided by Bubblspace/AIEDX, available at huggingface.co/bubblspace/Bubbl-P4-multimodal-instruct."*