---
license: gemma
base_model:
- google/gemma-3-12b-it
---

This is an HQQ-quantized version (4-bit, group-size=64) of the <a href="https://huggingface.co/google/gemma-3-12b-it">gemma-3-12b-it</a> model.
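
If you prefer to quantize on the fly instead of downloading this checkpoint, transformers exposes HQQ through `HqqConfig`. The sketch below assumes the same 4-bit / group-size-64 settings as this repo; the exact options used to produce this checkpoint may differ.

```python
# Minimal sketch (assumption): on-the-fly HQQ quantization of the bf16 base model
# with nbits=4 / group_size=64 to mirror this repo's settings. The checkpoint in
# this repo was prepared separately and may use different options.
import torch
from transformers import Gemma3ForConditionalGeneration, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64)

model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-12b-it",
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
    device_map="cuda",
)
```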

## Performance

| Benchmark           | <a href="https://huggingface.co/google/gemma-3-12b-it">bfp16</a> | <a href="https://huggingface.co/mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf">HQQ 4-bit gs-64</a> | <a href="https://huggingface.co/gaunernst/gemma-3-12b-it-int4-awq">QAT 4-bit gs-32</a> |
|:-------------------:|:--------:|:--------:|:--------:|
| ARC (25-shot)       | 0.724    | 0.701    | 0.690    |
| HellaSwag (10-shot) | 0.839    | 0.826    | 0.792    |
| MMLU (5-shot)       | 0.730    | 0.724    | 0.693    |
| TruthfulQA-MC2      | 0.580    | 0.585    | 0.550    |
| Winogrande (5-shot) | 0.766    | 0.774    | 0.755    |
| GSM8K (5-shot)      | 0.874    | 0.862    | 0.808    |
| Average             | 0.752    | 0.745    | 0.715    |
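
The harness and settings behind these numbers are not stated in this card. If you want to run a comparable evaluation yourself, lm-evaluation-harness is one option; the sketch below covers the ARC row and assumes lm-eval >= 0.4 and that this checkpoint loads through its standard `hf` backend.

```python
# Sketch only (assumptions): lm-evaluation-harness >= 0.4, standard `hf` backend.
# The table above may have been produced with a different harness or settings.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf,dtype=bfloat16",
    tasks=["arc_challenge"],  # ARC (25-shot) row
    num_fewshot=25,
    batch_size=8,
)
print(results["results"])
```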

## Usage

```python
# Use transformers at commit 52cc204dd7fbd671452448028aae6262cea74dc2 (or earlier):
# pip install git+https://github.com/huggingface/transformers@52cc204dd7fbd671452448028aae6262cea74dc2

import torch

backend = "gemlite"
compute_dtype = torch.bfloat16
cache_dir = None
model_id = 'mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf'

# Load model
from transformers import Gemma3ForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained(model_id, cache_dir=cache_dir)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=compute_dtype,
    attn_implementation="sdpa",
    cache_dir=cache_dir,
    device_map="cuda",
)

# Optimize the quantized language model for the selected inference backend
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model.language_model, backend=backend, verbose=True)

############################################################################
# Inference
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=compute_dtype)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=128, do_sample=False)[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)

print(decoded)
```
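
Note: the `gemlite` backend assumes the <a href="https://github.com/mobiusml/gemlite">GemLite</a> kernels are available (e.g. `pip install gemlite`); `prepare_for_inference` patches the quantized language-model layers to run on the selected backend.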