This is a requantized version of https://huggingface.co/google/gemma-3-27b-it-qat-q4_0-gguf.

The official QAT weights released by Google use fp16 (instead of Q6_K) for the embeddings table, which makes the model take significantly more memory (and storage) than a Q4_0 quant normally would. Instead of quantizing the table myself, I extracted it from Bartowski's quantized models. Requantizing with llama.cpp achieves a very similar result, as sketched below.
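
For reference, here is a minimal sketch of that requantization route, assuming a recent llama.cpp build of `llama-quantize` that supports the `--allow-requantize` and `--token-embedding-type` options; the file names are placeholders, not the exact files used here:

```python
# Sketch: requantize Google's QAT Q4_0 GGUF so the embeddings table becomes Q6_K
# instead of fp16. Assumes llama-quantize is on PATH and supports these options;
# file names are placeholders.
import subprocess

subprocess.run(
    [
        "llama-quantize",
        "--allow-requantize",              # the input is already quantized (QAT Q4_0)
        "--token-embedding-type", "q6_k",  # shrink the fp16 embeddings table to Q6_K
        "gemma-3-27b-it-qat-q4_0.gguf",    # Google's QAT release (placeholder name)
        "gemma-3-27b-it-qat-q4_0-small.gguf",
        "Q4_0",
    ],
    check=True,
)
```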

Here are some benchmark results:

| Model | File size ↓ | PPL (wiki.test.raw) ↓ | HellaSwag, 4k tasks ↑ |
|---|---|---|---|
| This model | 15.6 GB | 8.2335 +/- 0.06321 | 82.875% [81.6761%, 84.0108%] |
| This model (previous version) | 15.6 GB | 8.2291 +/- 0.06315 | 82.725% [81.5222%, 83.8650%] |
| QAT Q4_0 (Google) | 17.2 GB | 8.2323 +/- 0.06320 | 82.850% [81.6505%, 83.9865%] |
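
Numbers like these are typically produced with llama.cpp's `llama-perplexity` tool. A hedged sketch of such a run follows; the flags are from recent llama.cpp builds and the data file names are assumptions, not necessarily the exact setup used for the table above:

```python
# Sketch: measure perplexity and HellaSwag accuracy with llama.cpp's llama-perplexity.
# Model and data file names are placeholders/assumptions.
import subprocess

MODEL = "gemma-3-27b-it-qat-q4_0-small.gguf"

# Perplexity over the wikitext-2 raw test split
subprocess.run(
    ["llama-perplexity", "-m", MODEL, "-f", "wiki.test.raw"],
    check=True,
)

# HellaSwag accuracy over the first 4000 tasks (data file name is an assumption)
subprocess.run(
    ["llama-perplexity", "-m", MODEL,
     "--hellaswag", "--hellaswag-tasks", "4000",
     "-f", "hellaswag_val_full.txt"],
    check=True,
)
```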

Note that this model ends up smaller than the Q4_0 from Bartowski. That's because llama.cpp switches some tensors to Q4_1 when quantizing models to Q4_0 with an imatrix, whereas this is a static quant. The perplexity scores are essentially identical to Google's original model (the previous version even came out slightly lower), but the differences are within the margin of error, so it's probably just luck.
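
You can see that difference by tallying tensor types straight from the GGUF headers. A small sketch using the `gguf` Python package that ships with llama.cpp (`gguf-py`); the file name is a placeholder:

```python
# Sketch: count how many tensors use each quantization type in a GGUF file,
# e.g. to compare a static Q4_0 quant against an imatrix one that promotes
# some tensors to Q4_1. Uses the gguf package from llama.cpp's gguf-py.
from collections import Counter
from gguf import GGUFReader

def quant_breakdown(path: str) -> Counter:
    reader = GGUFReader(path)
    return Counter(t.tensor_type.name for t in reader.tensors)

if __name__ == "__main__":
    # Placeholder file name
    for qtype, count in sorted(quant_breakdown("gemma-3-27b-it-qat-q4_0-small.gguf").items()):
        print(f"{qtype}: {count}")
```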

I also fixed the control token metadata, which was slightly degrading the model's performance in instruct mode. Shoutout to ngxson for finding the issue, to tdh111 for making me aware of it, and to u/dampflokfreund on Reddit (Dampfinchen on Huggingface) for sharing the steps to fix it.
