Why is the size bigger than regular Q4_0 quants?

#1
by lefromage - opened

This quant is 16GB: gemma-3-27b-it-q4_0.gguf

The same model quantized with llama.cpp is smaller and works better:
bartowski_google_gemma-3-27b-it-GGUF_google_gemma-3-27b-it-Q4_0.gguf is 15GB

token_embd.weight is in fp16 with this model, but Q6_K in the other quant you linked. That tensor alone is about 1.4B params, so in f16 it takes 2.8GB vs 1.15GB when using Q6_K.
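For anyone who wants to sanity-check those numbers, here is the back-of-the-envelope arithmetic. The vocab size and hidden dim are assumptions based on the published Gemma 3 27B config; Q6_K stores roughly 6.5625 bits per weight (256-weight blocks, 6-bit quants plus scales):

```python
# Back-of-the-envelope check of the embedding-table sizes quoted above.
# Shapes are assumptions: ~262k vocab x 5376 hidden dim for Gemma 3 27B.
params = 262_144 * 5376              # token_embd.weight element count, ~1.41B

f16_gb = params * 16 / 8 / 1e9       # f16 stores 16 bits per weight
q6k_gb = params * 6.5625 / 8 / 1e9   # Q6_K stores ~6.5625 bits per weight

print(f"f16:  {f16_gb:.2f} GB")      # -> f16:  2.82 GB
print(f"Q6_K: {q6k_gb:.2f} GB")      # -> Q6_K: 1.16 GB
```

That ~1.7GB difference in the embedding table accounts for essentially the whole gap between the two files.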


> The same model quantized with llama.cpp is smaller and works better:
> bartowski_google_gemma-3-27b-it-GGUF_google_gemma-3-27b-it-Q4_0.gguf is 15GB

You mean the normal imatrix quant works better than this one, which was produced with quantization-aware training (QAT)? On what tasks is the bartowski quant better?

For those who want it, I have uploaded a smaller version of this model with a quantized token embedding table. It doesn't seem to significantly hurt performance.
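In case anyone wants to reproduce something like this locally, a minimal sketch using llama.cpp's llama-quantize; the file names are illustrative, and the exact flag spelling may vary between builds:

```sh
# Requantize, forcing the fp16 token embedding table down to Q6_K.
# --allow-requantize is needed because the input is already Q4_0;
# note that requantizing already-quantized tensors is slightly lossy.
./llama-quantize --allow-requantize --token-embedding-type q6_K \
    gemma-3-27b-it-q4_0.gguf gemma-3-27b-it-q4_0-emb-q6k.gguf Q4_0
```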
