---
license: gemma
base_model:
- google/gemma-3-12b-it
---
This is an HQQ-quantized version (4-bit, group-size=64) of the <a href="https://huggingface.co/google/gemma-3-12b-it">gemma-3-12b-it</a> model.
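
For reference, transformers exposes HQQ through `HqqConfig`, so a comparable 4-bit, group-size-64 quantization can be produced on the fly from the original checkpoint. The snippet below is a minimal sketch under that assumption (it is not necessarily the exact pipeline used to build this repository) and assumes the `hqq` package is installed.

```python
# Minimal sketch: on-the-fly HQQ quantization of the original bfloat16 model
# with the same settings advertised above (4-bit weights, group size 64).
# Requires: pip install hqq
import torch
from transformers import Gemma3ForConditionalGeneration, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64)

model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-12b-it",
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
    device_map="cuda",
)
```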

## Performance

| Models            | <a href="https://huggingface.co/google/gemma-3-12b-it">bfp16</a> | <a href="https://huggingface.co/mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf">HQQ 4-bit gs-64</a> | <a href="https://huggingface.co/gaunernst/gemma-3-12b-it-int4-awq">QAT 4-bit gs-32</a> |
|:-------------------:|:--------:|:--------:|:--------:|
| ARC (25-shot)      | 0.724 | 0.701 | 0.690 | 
| HellaSwag (10-shot)| 0.839 | 0.826 | 0.792 | 
| MMLU (5-shot)      | 0.730 | 0.724 | 0.693 |
| TruthfulQA-MC2     | 0.580 | 0.585 | 0.550 |
| Winogrande (5-shot)| 0.766 | 0.774 | 0.755 |
| GSM8K (5-shot)     | 0.874 | 0.862 | 0.808 |
| Average            | 0.752 | 0.745 | 0.715 |

## Usage
```python
# Use transformers no newer than commit 52cc204dd7fbd671452448028aae6262cea74dc2:
# pip install git+https://github.com/huggingface/transformers@52cc204dd7fbd671452448028aae6262cea74dc2

import torch

backend       = "gemlite"        # HQQ inference backend (requires the gemlite package)
compute_dtype = torch.bfloat16   # dtype used for dequantized compute
cache_dir     = None
model_id      = 'mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf'

# Load the quantized model and its processor
from transformers import Gemma3ForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained(model_id, cache_dir=cache_dir)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=compute_dtype,
    attn_implementation="sdpa",
    cache_dir=cache_dir,
    device_map="cuda",
)

# Patch the quantized layers to use the optimized inference backend
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model.language_model, backend=backend, verbose=True)


############################################################################
# Inference: multimodal chat with an image URL and a text prompt
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=compute_dtype)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=128, do_sample=False)[0][input_len:]
    decoded    = processor.decode(generation, skip_special_tokens=True)

print(decoded)
```
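
Note that the usage snippet assumes the `hqq` package (for `prepare_for_inference`) and the `gemlite` kernel library (for the `"gemlite"` backend) are installed, e.g. via `pip install hqq gemlite`. The `prepare_for_inference` call patches the quantized linear layers of the language model with the backend's optimized low-bit kernels before generation.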