Gemma-3 4B Instruct GGUF Models

Experimental requantization! I wanted to test whether the QAT model, requantized, performs better than the bf16 model quantized to the same bit level.

I created an imatrix file from Google's original QAT Q4_0 quantized model. This imatrix is then used to requantize the model down to lower-bit quants.
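For reference, the workflow looks roughly like this. This is a minimal sketch assuming llama.cpp's llama-imatrix and llama-quantize tools; the calibration file and output names here are illustrative, not the exact ones I used:

llama.cpp/llama-imatrix -m gemma-3-4b-it-qat-q4_0.gguf -f calibration_data.txt -o imatrix-qat-q4_0.dat
llama.cpp/llama-quantize --imatrix imatrix-qat-q4_0.dat gemma-3-4b-it-qat-q4_0.gguf gemma-3-4b-it-qat-q4_0-q3_k_l.gguf Q3_K_L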

Please leave feedback.

I tested a model quantized from bf16 against one requantized from the QAT Q4_0 model, both quantized with the same per-tensor quant types.

My results:

python3 ~/code/GGUFModelBuilder/perp_test_2_files.py ./gemma-3-4b-it-qat-q4_0-q3_k_l.gguf ./google_gemma-3-4b-it-q3_k_l.gguf 

Testing model: gemma-3-4b-it-qat-q4_0-q3_k_l.gguf
Running: llama.cpp/llama-perplexity -m gemma-3-4b-it-qat-q4_0-q3_k_l.gguf -f perplexity_test_data.txt --ctx-size 256 --ppl-stride 32 --chunks 1 --threads 4
[✓] Perplexity: 4.0963 (Time: 284.70s)

Testing model: google_gemma-3-4b-it-q3_k_l.gguf
Running: llama.cpp/llama-perplexity -m google_gemma-3-4b-it-q3_k_l.gguf -f perplexity_test_data.txt --ctx-size 256 --ppl-stride 32 --chunks 1 --threads 4
[✓] Perplexity: 4.5557 (Time: 287.15s)

=== Comparison Results ===
Model 1: gemma-3-4b-it-qat-q4_0-q3_k_l.gguf - Perplexity: 4.10 (Time: 284.70s)
Model 2: google_gemma-3-4b-it-q3_k_l.gguf - Perplexity: 4.56 (Time: 287.15s)

Winner: gemma-3-4b-it-qat-q4_0-q3_k_l.gguf (Difference: 0.46)

A different test:

Asking both models to: "write some .net code to test if a website is using quantum safe encryption"

And then asking DeepSeek-R1 to evaluate:

Evaluation of the Two Models' Outputs

Both models attempted to solve the problem of detecting quantum-safe encryption, but the QAT q4_0 model's code is significantly better for the following reasons:

1. Technical Accuracy

   QAT q4_0 Model:
     • Checks both TLS version and cipher suites, which are critical for assessing quantum resistance. While the implementation has flaws (e.g., assuming TLS version is exposed in HTTP headers), the approach aligns with security best practices.
     • Explicitly acknowledges limitations (e.g., "not a definitive test") and avoids overpromising.

   BF16 Model:
     • Relies on checking for a non-standard TLS/1.3 header, which does not exist in HTTP responses. TLS version is part of the SSL/TLS handshake and cannot be retrieved via HttpClient headers.
     • Contains incorrect logic (e.g., client.GetAwaiter().GetResult(null) is nonsensical and throws runtime errors).

2. Code Quality

   QAT q4_0 Model:
     • Uses modern async/await patterns for non-blocking I/O.
     • Separates concerns into methods (CheckTLSVersionAsync, CheckCipherSuiteAsync).
     • Includes robust error handling and logging.

   BF16 Model:
     • Uses blocking synchronous code (GetAwaiter().GetResult()), which violates .NET best practices and risks deadlocks.
     • Poorly structured (e.g., redundant using blocks, unclear variable names like result).

3. Security Relevance

   QAT q4_0 Model:
     • Focuses on cipher suites, which are closer to the actual indicators of quantum resistance (e.g., AES-256-GCM). While not truly quantum-safe, these are stronger than outdated algorithms.
     • Mentions the need to update cipher lists based on NIST guidelines.

   BF16 Model:
     • Misleadingly claims to check for "AES-256-CBC" (a deprecated cipher mode) but never implements it.
     • Fails to address cipher suites entirely, rendering the check meaningless.

4. Realism

   QAT q4_0 Model:
     • Acknowledges the complexity of quantum-safe detection and clarifies that HTTP-based checks are insufficient. This aligns with real-world security practices.

   BF16 Model:
     • Implies that checking for TLS 1.3 guarantees quantum safety, which is false. TLS 1.3 uses classical cryptography and is not inherently quantum-resistant.

5. Usability

   QAT q4_0 Model:
     • Provides clear console output (e.g., "No quantum-resistant cipher suites detected").
     • Includes a working Main method with an example URL.

   BF16 Model:
     • Fails to compile due to syntax errors (e.g., client.GetAwaiter().GetResult(null) is invalid).
     • Lacks meaningful output (e.g., no details about why a site is deemed insecure).

Critical Flaws in Both Models

Header Misuse: Both models incorrectly assume TLS version and cipher suites are exposed in HTTP headers (e.g., Sec-Cipher). In reality, this data is part of the SSL/TLS handshake and requires low-level inspection (e.g., using SslStream or libraries like BouncyCastle).

Quantum-Safe Misunderstanding: Neither code checks for post-quantum algorithms (e.g., CRYSTALS-Kyber). Current TLS 1.3 cipher suites are not quantum-safe, so both models provide false positives.

Final Verdict

The QAT q4_0 model's code is superior because it:

  • Follows better coding practices (async/await, error handling).
  • Attempts a more relevant security analysis (TLS + cipher suites).
  • Explicitly acknowledges limitations.

However, both models fail to solve the original problem due to fundamental misunderstandings of TLS/SSL mechanics. For a production-grade solution, direct inspection of the TLS handshake (e.g., via SslStream) and support for post-quantum algorithms would be required.
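For context, a handshake-level check of the kind described above could look like the following in .NET. This is a minimal sketch, not the code either model produced; it assumes .NET Core 3.0+ (for SslStream.NegotiatedCipherSuite), and the default host is just an illustrative placeholder. It reports what the handshake actually negotiated but, as noted, cannot prove quantum safety:

using System;
using System.Net.Security;
using System.Net.Sockets;
using System.Threading.Tasks;

class TlsInspector
{
    // Opens a raw TCP connection, performs the TLS handshake, and reports
    // what was actually negotiated. Unlike HTTP-header checks, this reads
    // the handshake itself. It still cannot prove quantum safety: no
    // standard TLS cipher suite is post-quantum today.
    static async Task Main(string[] args)
    {
        string host = args.Length > 0 ? args[0] : "example.com"; // illustrative default

        using var tcp = new TcpClient();
        await tcp.ConnectAsync(host, 443);

        using var ssl = new SslStream(tcp.GetStream());
        await ssl.AuthenticateAsClientAsync(host);

        Console.WriteLine($"TLS protocol: {ssl.SslProtocol}");
        // NegotiatedCipherSuite requires .NET Core 3.0+ / .NET 5+
        Console.WriteLine($"Cipher suite: {ssl.NegotiatedCipherSuite}");
    }
}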

Overall the perplexity difference was small (my test set was also small), and the DeepSeek evaluation produced different results on subsequent runs, so I cannot come to a definite conclusion. But I would say it is worth investigating further.


Original Gemma 3 model card

Model Page: Gemma

Resources and Technical Documentation:

  • [Gemma 3 Technical Report][g3-tech-report]
  • [Responsible Generative AI Toolkit][rai-toolkit]
  • [Gemma on Kaggle][kaggle-gemma]
  • [Gemma on Vertex Model Garden][vertex-mg-gemma3]

Terms of Use: [Terms][terms]

Authors: Google DeepMind

Model Information

Summary description and brief definition of inputs and outputs.

Description

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone.

Inputs and outputs

  • Input:

    • Text string, such as a question, a prompt, or a document to be summarized
    • Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
    • Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size
  • Output:

    • Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
    • Total output context of 8192 tokens