Thank you for your work and contributions to the community!
I've been exploring exactly the same thing. Gemma 3 is a great model, well worth the time, but I was also wondering how to approach this and how to tell whether there's a difference with the QAT model. I'm using your standard BF16/Q4K Gemma 3 27B quant as my daily driver now, and I'll run this one through the same paces. Thank you so much! BF16 embeddings, even with lower-bit weight quants, seem to be the magic sauce. 🫡
If I weren't using all my resources quantizing models at the moment, I would do more in-depth perplexity measurements for the lower quants. I wrote a script for a simple perplexity comparison between two models: https://huggingface.co/Mungert/gemma-3-4b-it-qat-q4_0-GGUF/blob/main/perp_test_2_files.py. Increase the CHUNKS = 1 parameter to get more accurate results. If you want to test your own setup, use text from your interactions with the model as the perplexity data. If you haven't got the time, and considering you get good results with my BF16/Q4K standard Gemma 3 27B, then I'm guessing the standard QAT Q4_0 model that Google has provided will be better than mine. They use F16 (close to BF16) for the embeddings, and the quantized part should work better as well: https://huggingface.co/google/gemma-3-4b-it-qat-q4_0-gguf. Give it a try.
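In case it helps, here's a rough sketch of the idea behind such a comparison, written against the llama-cpp-python bindings. To be clear, this is not the linked perp_test_2_files.py itself: the model filenames, sample.txt, and the llama-cpp-python dependency are all placeholders/assumptions, so substitute your own.

```python
# Rough sketch: compare perplexity of two GGUF quants on the same text.
# Assumes llama-cpp-python is installed (pip install llama-cpp-python).
import math
import numpy as np
from llama_cpp import Llama

MODELS = {
    # Hypothetical filenames -- point these at your actual GGUF files.
    "bf16-embed-q4k": "gemma-3-27b-it-bf16-q4_k.gguf",
    "google-qat-q4_0": "gemma-3-27b-it-qat-q4_0.gguf",
}
TEXT_FILE = "sample.txt"   # ideally text from your own chats with the model
CHUNK_TOKENS = 512         # tokens scored per chunk
CHUNKS = 1                 # increase for more accurate (stabler) results

def perplexity(model_path: str, text: str) -> float:
    # logits_all=True keeps logits for every position, not just the last one.
    llm = Llama(model_path=model_path, n_ctx=CHUNK_TOKENS,
                logits_all=True, verbose=False)
    tokens = llm.tokenize(text.encode("utf-8"))
    nll, count = 0.0, 0
    for c in range(CHUNKS):
        chunk = tokens[c * CHUNK_TOKENS:(c + 1) * CHUNK_TOKENS]
        if len(chunk) < 2:
            break
        llm.reset()
        llm.eval(chunk)
        # Score each token given its predecessors within the chunk.
        for i in range(1, len(chunk)):
            logits = llm.scores[i - 1]           # logits after token i-1
            m = float(logits.max())              # stable log-sum-exp
            log_z = m + math.log(float(np.exp(logits - m).sum()))
            nll += log_z - float(logits[chunk[i]])
            count += 1
    return math.exp(nll / count)

if __name__ == "__main__":
    text = open(TEXT_FILE, encoding="utf-8").read()
    for name, path in MODELS.items():
        print(f"{name}: perplexity = {perplexity(path, text):.3f}")
```

Lower perplexity is better; because chunks are deterministic, both models are scored on exactly the same token windows, so the difference between the two numbers is what matters, not their absolute values.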