UD version not doing great with YaRN compared to non-UD of the same size

#4
by Thireus - opened

After comparing the results of Qwen3-32B-128K-UD-Q8_K_XL.gguf vs Qwen3-32B-128K-Q8_0.gguf against the following prompt: https://thireus.com/REDDIT/Qwen3_Runescape_Massive_Prompt.txt I've observed that the UD version is often unable to retrieve the correct values from the XP table. It appears to have difficulty with the thousands separator (the comma placed every three digits for numbers larger than 999), something that does not happen with the non-UD version.

For example, it would state "user's table shows level 99 with an XP needed of 130,344,31" when in fact it should be 13,034,431. As a result it fails to give the correct answer, whereas the non-UD version mostly doesn't seem to have this issue.
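For reference, the correct grouping is easy to sanity-check with plain Python (a quick illustration, not part of the original test setup):

```python
# Correct thousands grouping for the level-99 XP value from the table.
xp_level_99 = 13_034_431
print(f"{xp_level_99:,}")  # -> 13,034,431

# The UD output instead grouped the digits as "130,344,31",
# which is not a valid comma placement for any integer.
```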

Observations posted here: https://huggingface.co/Qwen/Qwen3-32B/discussions/18#6812a1ba10b870a148d70023

Unsloth AI org

@Thireus Hey! Do you know if this test was done via one shot or 4 shots, as you did in https://huggingface.co/Qwen/Qwen3-32B/discussions/18#6812a1ba10b870a148d70023?

I.e. is it 3/4 like Q8_0, or 0/4?

If it's a misplaced comma, it could be accidental sampling - I could reupload all the 128K quants again if it helps.

But if the Q8_0 128K works, then that's interesting.
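One way to separate the accidental-sampling theory from a genuine quant regression is to rerun both GGUFs with sampling effectively disabled (greedy decoding, fixed seed) and see whether the comma error still appears. A minimal sketch, assuming llama-cpp-python as the runtime (the actual tests may have used the llama.cpp CLI or another server; the path, context size, and token limit below are placeholders):

```python
from llama_cpp import Llama

# Placeholder path and settings; the real test uses the full 128K YaRN context
# and the Runescape XP-table prompt linked above.
MODEL = "Qwen3-32B-128K-UD-Q8_K_XL.gguf"
prompt = open("Qwen3_Runescape_Massive_Prompt.txt").read()

llm = Llama(model_path=MODEL, n_ctx=32768, n_gpu_layers=-1, seed=0)

# temperature=0 plus top_k=1 makes decoding greedy, so a consistently wrong
# answer points at the quant itself rather than at unlucky sampling.
out = llm.create_completion(prompt, max_tokens=1024, temperature=0.0, top_k=1)
print(out["choices"][0]["text"])
```

Running the same script against the Q8_0 file then leaves the quant as the only variable.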

Unsloth AI org

Another theory is that because I upcasted some layers to BF16 in Q8_K_XL, those layers are not calibrated with the 12K-context-length calibration dataset that I used, so maybe, just maybe, the pure 8-bit does better since it was actually calibrated.

I actually forget whether Q8_0 even changes if we use an imatrix or not, but if it does, I'm assuming the BF16 upcast has no effect when the imatrix is provided. That's another possible explanation.
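Whether a given Q8_0 file was actually built with an imatrix should be visible in its GGUF metadata. A small sketch using the gguf Python package, assuming llama.cpp's quantizer wrote its usual quantize.imatrix.* keys (key names can vary between llama.cpp versions):

```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("Qwen3-32B-128K-Q8_0.gguf")  # placeholder path

# llama-quantize records imatrix details (file, dataset, chunk counts) as
# metadata when --imatrix is supplied; no such keys suggests no imatrix.
imatrix_keys = [name for name in reader.fields if "imatrix" in name]
print(imatrix_keys or "no imatrix metadata found")
```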

Thank you for sharing your thoughts on this @danielhanchen. I had done 3 tests in a row for UD Q8_K_XL yesterday but didn't save the results; I should have.

I did more runs, uploaded here: https://huggingface.co/Qwen/Qwen3-32B/discussions/18#681391d8ad0d99809805987e

I'm not sure we can draw any conclusions at this stage. Too much "luck" is involved, and it would need more runs.
I was finally able to reproduce the comma issue with Q8_0, so maybe it is indeed due to accidental sampling in the original model.
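To put a number on how much "luck" is involved: even counts as far apart as the 3/4 vs 0/4 mentioned earlier cannot be statistically distinguished from only four runs each. An illustrative check with SciPy (the counts are the hypothetical ones from earlier in the thread, not measured results):

```python
from scipy.stats import fisher_exact

# Hypothetical pass/fail counts over 4 runs each: Q8_0 gets 3/4, UD gets 0/4.
table = [[3, 1],   # Q8_0: 3 correct, 1 wrong
         [0, 4]]   # UD:   0 correct, 4 wrong
_, p_value = fisher_exact(table)
print(f"two-sided p-value = {p_value:.3f}")  # ~0.143, not significant at 0.05
```

So even a seemingly large gap over four runs is compatible with pure chance, which is why more runs are needed before drawing conclusions.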

Something else I've observed though: Q8_0 is able to self-correct during reasoning, but UD Q8_K_XL is not... It could again be entirely down to luck, so I'm not sure. Also, when UD Q8_K_XL fails it is usually because it invented some numbers, whereas with Q8_0 I can usually see why it failed (e.g. it confused one table row for another, or made a comma-related error).
To me these runs still suggest that UD may at worst be slightly degraded compared to Q8_0, but the difference isn't striking.

BF16 can also fail, but it is also able to self-correct, a behaviour closer to Q8_0 than to UD Q8_K_XL.

UD behaves differently, but there isn't enough evidence to conclude that UD is worse than the non-UD version, since the original model also gets things wrong. Closing the topic.

Thireus changed discussion status to closed
