Gemma 3n fixes for GGUFs and Ollama
Hey guys, after a bit of back and forth, the quants now work properly on Ollama!
Thanks to @ngxson and Michael Yang from Ollama! There were 2 issues specifically for GGUFs:
- The add_shared_kv_layers metadata was accidentally encoded as float32, which works, but is slightly more complicated to decode on Ollama's side - a simple change to uint32 solves the issue.
- The per_layer_token_embd tensor should be Q8_0 in precision. Anything lower seems to not function properly and errors out in the Ollama engine - to reduce issues for our community, we made this tensor Q8_0 in all quants - unfortunately this does use more space.
More details: https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune#gemma-3n-fixes-analysis
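As a quick way to verify a downloaded file, here is a minimal sketch (assuming the gguf-py package that ships with llama.cpp, and using the E4B Q4_0 file name purely as an example) that checks both of the points above:

```python
# Minimal sketch, assuming the gguf-py package from llama.cpp (pip install gguf).
# The file name below is just an example.
from gguf import GGUFReader

reader = GGUFReader("gemma-3n-E4B-it-Q4_0.gguf")

# The shared-KV-layers count should now be stored as uint32, not float32.
for name, field in reader.fields.items():
    if "shared_kv_layers" in name:
        print(name, field.types)  # expect [GGUFValueType.UINT32] after the fix

# per_layer_token_embd.weight should be Q8_0 regardless of the overall quant level.
for tensor in reader.tensors:
    if tensor.name == "per_layer_token_embd.weight":
        print(tensor.name, tensor.tensor_type)  # expect GGMLQuantizationType.Q8_0
```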
Thank you for the investigation! Will have a look into this as soon as I can.
Is it just me? I get NaNs when running perplexity with a GGUF converted from this model using the latest @ngxson PR (bf16, f16, and Q8_0 all produce NaNs after the first or second batch).
I can't seem to use the long context. This might be related to the NaN perplexity part.
@ikawrakow Hmm, that's weird - I didn't get NaNs for imatrix, but PPL did shoot up from 2 or 3 to 7, which is somewhat weird since I'm using the chat template directly.
On the other hand, I am getting super high loss when finetuning, i.e. 7 or 8, so maybe it's related - my guess is that because this is a "real" multimodal model, PPL and other metrics don't work anymore, or there are implementation issues.
It happened with Qwen 2.5 VL 72B Instruct as well, where I got a PPL of like 30 or something.
@danielhanchen But have you run perplexity on Wikitext2? Some say that Wikitext perplexity tells us nothing, but in my case it does tell me a great deal: when developing and getting NaNs, 100% of the time I have a bug. Either way, I don't use llama.cpp very often lately, so perhaps something has changed that I don't know about, and this is why I'm getting the NaNs. I did
huggingface-cli download --local-dir gemma3n google/gemma-3n-E4B-it
python3 convert_hf_to_gguf.py --outtype bf16 gemma3n
./bin/llama-perplexity -m gemma3n/Gemma3N-6.9B-BF16.gguf -f ../tests/wiki.test.raw -t 1 -ngl 100
which results in
...
system_info: n_threads = 1 (n_threads_batch = 1) / 32 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 495.384 ms
perplexity: calculating perplexity over 576 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 0.97 seconds per pass - ETA 2.32 minutes
[1]19.5416,[2]2914256.6411,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,[10]nan,[11]nan,[12]nan,[13]nan,[14]nan,[15]nan,[16]nan,[17]nan,[18]nan,[19]nan,[20]nan,[21]nan,[22]nan,[23]nan,[24]nan,^C
When running convert_hf_to_gguf.py I'm getting many warnings of the type
WARNING:hf-to-gguf:ignore token 262144: id is out of range, max=262143
Is this something to worry about?
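That warning usually just means the tokenizer carries added/special tokens with ids beyond the model's embedding table. One rough way to check where the extra id comes from (a hypothetical snippet, assuming the standard transformers API and the local download from the commands above):

```python
# Rough check (assumption: standard transformers API, local dir from the download above).
from transformers import AutoConfig, AutoTokenizer

model_dir = "gemma3n"
config = AutoConfig.from_pretrained(model_dir)
text_config = getattr(config, "text_config", config)  # Gemma 3n nests its text config
tokenizer = AutoTokenizer.from_pretrained(model_dir)

print("model vocab_size :", getattr(text_config, "vocab_size", None))
print("tokenizer entries:", len(tokenizer))  # includes added/special tokens
```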
Interestingly enough, if I add -b 512 to the perplexity command, PPL is high but stays finite (it finishes with Final estimate: PPL = 35.3014 +/- 0.41619).
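For reference, the value llama-perplexity prints after each chunk is the exponentiated running mean negative log-likelihood over the chunks processed so far,

$$\mathrm{PPL}_N = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\bigl(x_i \mid x_{<i}\bigr)\right),$$

so one NaN logit in any chunk makes the running sum, and hence every subsequent printed value, NaN - which matches the output above, where [1] and [2] are finite and everything from [3] onward is nan.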
@ikawrakow Ok, that PPL is definitely very high indeed - even 35 with -b 512 is way too high.
There were some new fixes for Gemma 3N, but I'm assuming you're using the latest llama.cpp.
I can run PPL and get back to you, but from what I remember PPL on imatrix was 2 ish and it grew to 7.
I got 27 with an older run of the model and -b 512 (which I reported to llama.cpp). I tried the base model and got a perplexity of ~7.6 (still a bit higher than Gemma 3 4B).
> I got 27 with an older run of the model and -b 512 (which I reported to llama.cpp). I tried the base model and got a perplexity of ~7.6 (still a bit higher than Gemma 3 4B).
I think I saw your issue and the response. Instruction tuning does indeed increase PPL, but it is typically a 10-20% effect. I have never seen a factor of 3 or 4 increase in PPL before.
> There were some new fixes for Gemma 3N, but I'm assuming you're using the latest llama.cpp.
I was using @ngxson's PR, which wasn't merged yet when I started looking into Gemma3n.
> I can run PPL and get back to you, but from what I remember PPL on imatrix was 2 ish and it grew to 7.
This must be a very forgiving calibration dataset, then. PPL of 2...3 on Wikitext2 is typical for much larger base models.
I think that a typical 4B model should have a perplexity of around 12.
@ikawrakow I think someone else also got high PPL - https://github.com/ggml-org/llama.cpp/issues/14437 - but it's around 27-ish.
The dataset I used uses the chat template, so maybe that's why PPL is between 2 and 7.
that's me
OO hi hi :)
I did a new run with the latest LCPP and converted a new Q6_K model myself. The NaN perplexity problem persists. If I set -b 512, the perplexity is now around 35-36.
It seems like it doesn't take many tokens to make the model crash. This happens with any kind of quantization, Q2_K to Q6_K: when I give it a prompt a bit longer than a few hundred tokens, it starts to either crash or generate gibberish (randomly).
Which is entirely predictable, given the NaNs when computing perplexity with u-batch > 512. My guess is that the implementation in llama.cpp is not quite there yet.
I converted a new Q6_K model from the BF16 model (not F16 anymore). With the CPU I don't see any real problem in the perplexity measurement other than that it's extremely high (about 30-40 or so). With the GPU, the perplexity rose to the millions.
Even funnier is that if I let the GPU do only the prompt processing work, perplexity still increases a bit. Not into the millions, but still very weird.
Small reminder that https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF/blob/main/gemma-3n-E4B-it-Q4_0.gguf is still Q5_1 for per_layer_token_embd.weight, and that is breaking output when offloaded to GPU. It's just this specific tensor: per_layer_token_embd.weight - everything else is fine.
The NaN perplexity part seems to be fixed now.
> Small reminder that https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF/blob/main/gemma-3n-E4B-it-Q4_0.gguf is still Q5_1 for per_layer_token_embd.weight, and that is breaking output when offloaded to GPU. It's just this specific tensor: per_layer_token_embd.weight - everything else is fine.
I don't see a need to offload it to the GPU, but I think this is a bug that needs to be fixed.