---
license: gemma
base_model:
- google/gemma-3-12b-it
---

This is an HQQ-quantized version (4-bit, group-size=64) of the <a href="https://huggingface.co/google/gemma-3-12b-it">gemma-3-12b-it</a> model.
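
If you prefer to quantize on the fly instead of downloading this checkpoint, transformers exposes HQQ through `HqqConfig`. The sketch below assumes the same 4-bit / group-size-64 settings as this repo; the exact options used to produce this checkpoint may differ.

```python
# Minimal sketch (assumption): on-the-fly HQQ quantization of the bf16 base model
# with nbits=4 / group_size=64 to mirror this repo's settings. The checkpoint in
# this repo was prepared separately and may use different options.
import torch
from transformers import Gemma3ForConditionalGeneration, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64)

model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-12b-it",
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
    device_map="cuda",
)
```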

## Performance

| Benchmark           | <a href="https://huggingface.co/google/gemma-3-12b-it">bfp16</a> | <a href="https://huggingface.co/mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf">HQQ 4-bit gs-64</a> | <a href="https://huggingface.co/gaunernst/gemma-3-12b-it-int4-awq">QAT 4-bit gs-32</a> |
|:-------------------:|:--------:|:--------:|:--------:|
| ARC (25-shot)       | 0.724    | 0.701    | 0.690    |
| HellaSwag (10-shot) | 0.839    | 0.826    | 0.792    |
| MMLU (5-shot)       | 0.730    | 0.724    | 0.693    |
| TruthfulQA-MC2      | 0.580    | 0.585    | 0.550    |
| Winogrande (5-shot) | 0.766    | 0.774    | 0.755    |
| GSM8K (5-shot)      | 0.874    | 0.862    | 0.808    |
| Average             | 0.752    | 0.745    | 0.715    |
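
The harness and settings behind these numbers are not stated in this card. If you want to run a comparable evaluation yourself, lm-evaluation-harness is one option; the sketch below covers the ARC row and assumes lm-eval >= 0.4 and that this checkpoint loads through its standard `hf` backend.

```python
# Sketch only (assumptions): lm-evaluation-harness >= 0.4, standard `hf` backend.
# The table above may have been produced with a different harness or settings.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf,dtype=bfloat16",
    tasks=["arc_challenge"],  # ARC (25-shot) row
    num_fewshot=25,
    batch_size=8,
)
print(results["results"])
```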

## Usage

```python
# Use transformers at commit 52cc204dd7fbd671452448028aae6262cea74dc2 (or earlier):
# pip install git+https://github.com/huggingface/transformers@52cc204dd7fbd671452448028aae6262cea74dc2

import torch

backend = "gemlite"
compute_dtype = torch.bfloat16
cache_dir = None
model_id = 'mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf'

# Load model
from transformers import Gemma3ForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained(model_id, cache_dir=cache_dir)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=compute_dtype,
    attn_implementation="sdpa",
    cache_dir=cache_dir,
    device_map="cuda",
)

# Optimize the quantized language model for the selected inference backend
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model.language_model, backend=backend, verbose=True)

############################################################################
# Inference
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=compute_dtype)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=128, do_sample=False)[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)

print(decoded)
```
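
Note: the `gemlite` backend assumes the <a href="https://github.com/mobiusml/gemlite">GemLite</a> kernels are available (e.g. `pip install gemlite`); `prepare_for_inference` patches the quantized language-model layers to run on the selected backend.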