Gemma 3n fixes for GGUFs and Ollama

#6
by shimmyshimmer - opened
Unsloth AI org

Hey guys, after a bit of back and forth, the quants now work properly on Ollama!

Thanks to @ngxson and Michael Yang from Ollama for their help. There were 2 issues specifically for GGUFs:

  1. The add_shared_kv_layers value was accidentally encoded as float32, which works but is slightly complicated to decode on Ollama's side - a simple change to uint32 solves the issue.
  2. The per_layer_token_embd tensor should be Q8_0 in precision. Anything lower does not seem to function properly and errors out in the Ollama engine - to reduce issues for our community, we made this tensor Q8_0 in all quants; unfortunately, this does use more space.

More details: https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune#gemma-3n-fixes-analysis
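
For anyone who wants to double-check a downloaded file, here is a minimal sketch using the gguf-dump tool that ships with the gguf Python package (the file name is just an example): the shared-KV-layers field should show up as a uint32 value, and per_layer_token_embd.weight should be listed as Q8_0.

pip install gguf                                                    # provides the gguf-dump CLI
gguf-dump gemma-3n-E4B-it-Q8_0.gguf | grep -i shared_kv             # should be a UINT32 field now
gguf-dump gemma-3n-E4B-it-Q8_0.gguf | grep per_layer_token_embd     # should report Q8_0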

shimmyshimmer pinned discussion

Thank you for the investigation! Will have a look into this as soon as I can.

Is it just me? I get NaNs when running perplexity with a GGUF converted from this model using the latest @ngxson PR (BF16, F16, Q8_0 all produce NaNs after the first or second batch).

I can't seem to use the long context. This might be related to the NaN perplexity part.

Unsloth AI org

@ikawrakow Hmm that's weird - I didn't get NaNs for imatrix, but PPL did shoot up from 2 or 3 to 7, which is somewhat odd since I'm using the chat template directly.

Unsloth AI org

On the other hand, I am getting super high loss when finetuning, i.e. 7 or 8, so maybe it's related - my guess is that because this is a "real" multimodal model, PPL and other metrics don't work anymore, or there are implementation issues.

It happened with Qwen 2.5 VL 72B Instruct, where I got a PPL of around 30.

@danielhanchen But have you run perplexity on Wikitext2? Some say that Wikitext perplexity tells us nothing, but in my case it does tell me a great deal: when developing, if I'm getting NaNs, 100% of the time I have a bug. Either way, I haven't used llama.cpp very often lately, so perhaps something has changed that I don't know about, and that's why I'm getting the NaNs. I did

huggingface-cli download --local-dir gemma3n google/gemma-3n-E4B-it
python3 convert_hf_to_gguf.py --outtype bf16 gemma3n
./bin/llama-perplexity -m gemma3n/Gemma3N-6.9B-BF16.gguf -f ../tests/wiki.test.raw -t 1 -ngl 100

which results in

...
system_info: n_threads = 1 (n_threads_batch = 1) / 32 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 495.384 ms
perplexity: calculating perplexity over 576 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 0.97 seconds per pass - ETA 2.32 minutes
[1]19.5416,[2]2914256.6411,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,[10]nan,[11]nan,[12]nan,[13]nan,[14]nan,[15]nan,[16]nan,[17]nan,[18]nan,[19]nan,[20]nan,[21]nan,[22]nan,[23]nan,[24]nan,^C

When running convert_hf_to_gguf.py I'm getting many warnings of the type

WARNING:hf-to-gguf:ignore token 262144: id is out of range, max=262143

Is this something to worry about?

Interestingly enough, if I add -b 512 to the perplexity command, PPL is high but stays finite (it finishes with Final estimate: PPL = 35.3014 +/- 0.41619)
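
For reference, that is just the earlier command with the batch size capped:

./bin/llama-perplexity -m gemma3n/Gemma3N-6.9B-BF16.gguf -f ../tests/wiki.test.raw -t 1 -ngl 100 -b 512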

Unsloth AI org

@ikawrakow Ok, that PPL is definitely very high indeed - and even the 35 with -b 512 is way too high.

There were some new fixes for Gemma 3N, but I'm assuming you're using the latest llama.cpp.

I can run PPL and get back to you, but from what I remember PPL on imatrix was 2 ish and it grew to 7.

I got 27 with an older run of the model and -b 512, which I reported to llama.cpp. I tried the base model and got a perplexity of ~7.6 (still a bit higher than Gemma 3 4B).

> I got 27 with an older run of the model and -b 512, which I reported to llama.cpp. I tried the base model and got a perplexity of ~7.6 (still a bit higher than Gemma 3 4B).

I think I saw your issue and the response. Instruction tuning does indeed increase PPL, but it is typically a 10-20% effect. I have never seen a factor of 3 or 4 increase in PPL before.

> There were some new fixes for Gemma 3N, but I'm assuming you're using the latest llama.cpp.

I was using @ngxson 's PR that wasn't merged yet when I started looking into Gemma3n.

> I can run PPL and get back to you, but from what I remember PPL on imatrix was 2 ish and it grew to 7.

This must be a very forgiving calibration dataset, then. PPL of 2...3 on Wikitext2 is typical for much larger base models.

I think that a typical 4B model should have a perplexity of around 12

Unsloth AI org

@ikawrakow I think someone else also got high PPL - https://github.com/ggml-org/llama.cpp/issues/14437 but it's around 27 ish.

The dataset I used applies the chat template, so maybe that's why PPL is between 2 and 7.

that's me

Unsloth AI org

OO hi hi :)

I did a new run with the latest llama.cpp and converted a new Q6_K model myself. The NaN perplexity problem persists. If I set -b 512, the perplexity is now around 35-36.
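
(For anyone reproducing this: requantizing from the BF16 conversion is just the standard llama.cpp quantize step - file names below are only examples.)

./bin/llama-quantize gemma3n/Gemma3N-6.9B-BF16.gguf gemma3n/Gemma3N-6.9B-Q6_K.gguf Q6_K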

It seems like I don't need many tokens to make the model crash. This happens with any kind of quantization, Q2_K to Q6_K: when I give it a prompt a bit longer than a few hundred tokens, it starts to either crash or generate gibberish (randomly).

Which is entirely predictable, given the NaNs when computing perplexity with u-batch > 512. My guess is that the implementation in llama.cpp is not quite there yet.

I converted a new Q6_K model from the BF16 model (not F16 anymore). On the CPU I don't see any real problem with the perplexity measurement other than that it's extremely high (about 30-40 or so). With the GPU, the perplexity rose into the millions.

Even funnier is that if I let the GPU do just the prompt processing, the perplexity increases a bit. Not into the millions, but still very weird.
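
A quick way to compare is to toggle offload on the same perplexity run (model path is just an example; note that a CUDA build may still use the GPU for large-batch prompt processing even with -ngl 0):

./bin/llama-perplexity -m gemma3n/Gemma3N-6.9B-Q6_K.gguf -f ../tests/wiki.test.raw -b 512 -ngl 0     # no layer offload
./bin/llama-perplexity -m gemma3n/Gemma3N-6.9B-Q6_K.gguf -f ../tests/wiki.test.raw -b 512 -ngl 100   # full offload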

Small reminder that https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF/blob/main/gemma-3n-E4B-it-Q4_0.gguf is still q5_1 for per_layer_token_embd.weight and that is breaking output when offloaded to GPU.

It's just this specific tensor: per_layer_token_embd.weight; everything else is fine.

The NaN perplexity part seems to be fixed now

> Small reminder that https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF/blob/main/gemma-3n-E4B-it-Q4_0.gguf is still q5_1 for per_layer_token_embd.weight and that is breaking output when offloaded to GPU.
>
> It's just this specific tensor: per_layer_token_embd.weight; everything else is fine.

I don't see a need for offloading it to the GPU, but I think this is a bug that needs to be fixed.
