---
license: gemma
base_model:
- google/gemma-3-12b-it
---

This is an HQQ-quantized version (4-bit, group size 64) of the gemma-3-12b-it model.

## Performance

| Benchmark           | bfp16 (baseline) | HQQ 4-bit gs-64 | QAT 4-bit gs-32 |
|:-------------------:|:----------------:|:---------------:|:---------------:|
| ARC (25-shot)       | 0.724            | 0.701           | 0.690           |
| HellaSwag (10-shot) | 0.839            | 0.826           | 0.792           |
| MMLU (5-shot)       | 0.730            | 0.724           | 0.693           |
| TruthfulQA-MC2      | 0.580            | 0.585           | 0.550           |
| Winogrande (5-shot) | 0.766            | 0.774           | 0.755           |
| GSM8K (5-shot)      | 0.874            | 0.862           | 0.808           |
| Average             | 0.752            | 0.745           | 0.715           |

## Usage
```Python
# Use transformers at commit 52cc204dd7fbd671452448028aae6262cea74dc2:
# pip install git+https://github.com/huggingface/transformers@52cc204dd7fbd671452448028aae6262cea74dc2
import torch

backend = "gemlite"
compute_dtype = torch.bfloat16
cache_dir = None
model_id = 'mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf'

# Load the model and processor
from transformers import Gemma3ForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained(model_id, cache_dir=cache_dir)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=compute_dtype,
    attn_implementation="sdpa",
    cache_dir=cache_dir,
    device_map="cuda",
)

# Patch the quantized language model with the optimized inference backend
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model.language_model, backend=backend, verbose=True)

############################################################################
# Inference
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=compute_dtype)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=128, do_sample=False)[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
```
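
Note that the snippet above also assumes the `hqq` and `gemlite` packages are installed (e.g. `pip install hqq gemlite`): `prepare_for_inference` comes from the HQQ library, and `backend = "gemlite"` relies on GemLite's low-bit kernels.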
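
The same `model` and `processor` also handle text-only prompts. A minimal sketch, reusing the objects loaded above (the prompt text is only an illustrative placeholder):

```Python
# Text-only chat, reusing the model and processor from the snippet above.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Give me three facts about bees."}]},
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=128, do_sample=False)[0][input_len:]

print(processor.decode(generation, skip_special_tokens=True))
```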