## Model Details

This is an int4 model with group_size 128 and symmetric quantization of google/gemma-3-12b-it, generated by the [intel/auto-round](https://github.com/intel/auto-round) algorithm.

Please follow the license of the original model.
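
For intuition, the snippet below is a minimal, self-contained sketch of what symmetric int4 quantization with group_size 128 does to a single weight matrix. It is an illustration only, with hypothetical names, and uses naive round-to-nearest; AutoRound itself goes further by optimizing the rounding via signed gradient descent.

```python
import torch

def int4_sym_group_quant(weight: torch.Tensor, group_size: int = 128):
    """Toy round-to-nearest symmetric int4 group quantization (illustration only)."""
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    # Symmetric scheme: one scale per group maps the group's max |w| to the
    # int4 range [-8, 7]; no zero-point is stored.
    scale = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7)
    dequantized = (q * scale).reshape(out_features, in_features)
    return q.to(torch.int8), scale, dequantized

w = torch.randn(256, 512)
q, scale, w_hat = int4_sym_group_quant(w)
print(f"max reconstruction error: {(w - w_hat).abs().max():.4f}")
```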

## Inference on CPU

We found that the unquantized layers must run in BF16 or FP32, so CUDA inference is not available at the moment.

### Requirements

```bash
pip install auto-round
pip uninstall intel-extension-for-pytorch
pip install intel-extension-for-transformers
```

```python
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import requests
import torch
from auto_round import AutoRoundConfig

model_id = "OPEA/gemma-3-12b-it-int4-AutoRound-cpu"

quantization_config = AutoRoundConfig(backend="cpu")
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cpu", quantization_config=quantization_config
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image",
             "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
"""Here's a detailed description of the image:

**Overall Impression:**

The image is a close-up shot of a vibrant garden scene, focusing on a pink cosmos flower with a bumblebee visiting it. The overall feel is natural, colorful, and slightly wild.

**Main Elements:**

*   **Cosmos Flower:** The central focus is a large, pink cosmos flower with broad petals. It's in full bloom and appears healthy.
*   **Bumblebee:**"""
```
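
A text-only prompt works the same way; the short variant below reuses `model` and `processor` from the example above (the prompt text is just an illustration):

```python
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Explain int4 quantization in one sentence."}]}
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=50, do_sample=False)

print(processor.decode(generation[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```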

## Generate the model

Here is a sample command to reproduce the model.

```bash
pip install git+https://github.com/intel/auto-round.git@main

auto-round-mllm \
  --model google/gemma-3-12b-it \
  --device 0 \
  --bits 4 \
  --batch_size 1 \
  --gradient_accumulate_steps 8 \
  --format 'auto_round' \
  --output_dir "./tmp_autoround"
```
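
For reference, roughly the same result can be sketched with auto-round's Python API. The class and argument names below (`AutoRoundMLLM`, `bits`, `group_size`, `sym`) follow the auto-round repository at the time of writing; verify them against your installed version before use.

```python
from transformers import AutoProcessor, AutoTokenizer, Gemma3ForConditionalGeneration
from auto_round import AutoRoundMLLM

model_name = "google/gemma-3-12b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

# Mirror the CLI flags above: int4, group_size 128, symmetric quantization.
autoround = AutoRoundMLLM(
    model,
    tokenizer,
    processor=processor,
    bits=4,
    group_size=128,
    sym=True,
    batch_size=1,
    gradient_accumulate_steps=8,
)
autoround.quantize()
autoround.save_quantized("./tmp_autoround", format="auto_round")
```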