Speed differences for different quants
So this started because I was curious why the ggml-org/gpt-oss models had tg
speeds up to 50% faster than the F16s even though the file sizes aren't so different. Well, the easy answer is that the ggml-org versions are Q8_0 and the F16s are... F16. This only affects the embedding, attention, and output layers, but it looks like that's enough to make a big difference on some setups (all this testing was done on a Strix Halo device).
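If you want to check which tensors actually differ between two of these GGUFs, the gguf-dump script that ships with llama.cpp's gguf-py package (pip install gguf) prints every tensor with its type. A minimal sketch, assuming the models are sitting in ~/models (the paths and grep patterns are just illustrative):

```sh
# Sketch: list the embedding/attention/output tensors in two GGUFs
# and eyeball the type difference (Q8_0 vs F16 vs MXFP4).
# gguf-dump comes with llama.cpp's gguf-py package: pip install gguf
gguf-dump ~/models/gpt-oss-20b-mxfp4.gguf | grep -iE 'token_embd|attn|output'
gguf-dump ~/models/gpt-oss-20b-F16.gguf   | grep -iE 'token_embd|attn|output'
```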
I did some initial comparisons w/ gpt-oss-120b, but since I didn't feel like wasting time/bandwidth, I did the more extensive testing on gpt-oss-20b to characterize the speed difference by quantization. The gap is a lot bigger than I expected considering how close the file sizes are! Maybe this big difference only shows up on APU platforms? These tests are on an AMD Strix Halo (Ryzen AI Max+ 395) w/ a recent TheRock/ROCm 7 nightly on a recent llama.cpp build compiled w/ rocWMMA FA (I did some brief AMDVLK Vulkan testing on the same hardware and the results were roughly in line). A rough sketch of the llama-bench invocation follows the table:
model | size | test | t/s |
---|---|---|---|
ggml-org gpt-oss-20b MXFP4 | 11.27 GiB | tg128 | 62.15 ± 0.01 |
unsloth gpt-oss-20b F16 | 12.83 GiB | tg128 | 42.93 ± 0.00 |
unsloth gpt-oss-20b UD Q8_K_XL | 12.28 GiB | tg128 | 50.89 ± 0.00 |
unsloth gpt-oss-20b Q8_0 | 11.27 GiB | tg128 | 59.06 ± 0.00 |
unsloth gpt-oss-20b UD Q4_K_XL | 11.04 GiB | tg128 | 62.04 ± 0.01 |
unsloth gpt-oss-20b Q4_K_M | 10.81 GiB | tg128 | 62.21 ± 0.01 |
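For reference, the tg128 numbers above are plain llama-bench runs; something along these lines should reproduce a row of the table (the model path is a placeholder, and any FA/device flags will depend on your build):

```sh
# Sketch of one tg128 row above: skip the prompt test (-p 0),
# generate 128 tokens (-n 128), default 5 repetitions.
./build/bin/llama-bench -m ~/models/gpt-oss-20b-Q8_0.gguf -p 0 -n 128
```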
I did do additional testing on the Q8_0 and was satisfied the runs matched pretty closely. For those looking for a bit more color, I posted some more info in a separate Framework discussion thread, but I thought I'd highlight it here since I hadn't seen any mention of the potentially big speed differences between quants (even when the file sizes don't change much!).
The Q8_0 is supposed to be exactly the same as the MXFP4 one. Some of the close results are most likely noise. You can run that one if you want.
For F16, it is not quantized at all, hence why it's slower and bigger. It's the original precision of the model. But it's good to know about these speed differences.
The reason why we recommend the F16 version is because it is OpenAI's original unquantized precision, and we didn't do anything to it. The Q8 and MXFP4 ones have some quantization involved.
In my linked comment I ran more repetitions (-r 20 instead of the default 5), reversed the order, and gave some time in between to make sure the runs were at relatively equal temperatures, and the Q8_0 came out closer. Interestingly, there was still a slight difference larger than the stddev, but ¯\_(ツ)_/¯
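For anyone who wants to repeat that rerun, a rough sketch (model paths are placeholders; the sleep is just to give the APU time to cool between models):

```sh
# Rerun sketch: 20 repetitions per model (-r 20) instead of the default 5,
# with a pause between models so temperatures are roughly equal.
for m in gpt-oss-20b-mxfp4.gguf gpt-oss-20b-Q8_0.gguf; do
  ./build/bin/llama-bench -m ~/models/"$m" -p 0 -n 128 -r 20
  sleep 120   # cool-down between models
done
```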