This model breaks apart after a few hundred tokens when the context gets long

#1
by stduhpf - opened

Tested with llama.cpp commit 0c74b04376b0b9efc096480fe10f866afc8d7c1c.

After some time this model consistently starts generating gibberish, sometimes repeating the same token over and over. Typically this starts happening around 600 tokens in, sometimes earlier.

I couldn't replicate this problem with the original model converted to fp16, nor with any other quantized version of it (even q4_0).
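For reference, this is roughly the workflow for producing those comparison files with the standard llama.cpp tooling (a sketch; the checkpoint path and output filenames are placeholders):

```sh
# Convert the original HF checkpoint to an fp16 GGUF
# (convert_hf_to_gguf.py ships with the llama.cpp repo)
python convert_hf_to_gguf.py path/to/gemma-3-1b-it \
  --outtype f16 --outfile models/gemma-3-1b-it-f16.gguf

# Quantize that fp16 file to q4_0 for an apples-to-apples comparison
llama-quantize models/gemma-3-1b-it-f16.gguf \
  models/gemma-3-1b-it-Q4_0-local.gguf q4_0
```

Both of those files generate normally with the same prompt and settings as the repro command below.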

Something is wrong specifically with this QAT version.

To reproduce:
llama-cli.exe -m models\gemma-3-1b-it-Q4_0.gguf -ngl 99 -t 6 -tb 12 -c 16384 -sm none -p "Hello! " --ignore-eos -n 768

The pretrained version doesn't seem to have the same problem; it's really just this one.

Yes, I'm facing the same problem.

I tried the recommended sampling hyper-parameters as well:

temp = 1.0
top-p = 0.95
top-k = 64

but I hit the same problem.
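For reference, passing those settings to llama-cli looks like this (a sketch reusing the model path from the repro command above):

```sh
llama-cli -m models/gemma-3-1b-it-Q4_0.gguf -ngl 99 -c 16384 \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  -p "Hello! " --ignore-eos -n 768
```

The output still degenerates into repeated tokens at roughly the same point.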

I have exactly the same problem with the 1B QAT model.
