File size: 1,955 Bytes
666b3db
 
 
 
 
 
 
 
 
a3b9edc
f2f84f2
a3b9edc
666b3db
15b309e
 
666b3db
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
---
license: gemma
base_model:
- google/gemma-3-12b-it
---
This is an HQQ-quantized version (4-bit, group-size=64) of the <a href="https://huggingface.co/google/gemma-3-12b-it">gemma-3-12b-it</a> model.

## Usage
```Python
#use transformers up to 52cc204dd7fbd671452448028aae6262cea74dc2
#pip install git+https://github.com/huggingface/transformers@52cc204dd7fbd671452448028aae6262cea74dc2

import torch
backend       = "gemlite" 
compute_dtype = torch.bfloat16 
cache_dir     = None
model_id      = 'mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf'

#Load model
from transformers import Gemma3ForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained(model_id, cache_dir=cache_dir)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=compute_dtype,
    attn_implementation="sdpa",
    cache_dir=cache_dir,
    device_map="cuda",
)

#Optimize
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model.language_model, backend=backend, verbose=True)


############################################################################
#Inference
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=compute_dtype)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=128, do_sample=False)[0][input_len:]
    decoded    = processor.decode(generation, skip_special_tokens=True)

print(decoded)


```