Speed differences for different quants
So this started because I was curious why the ggml-org/gpt-oss models had tg
speeds up to 50% faster than the F16s even though the file sizes aren't so different. Well, the easy answer is that the ggml-org versions are Q8_0 and the F16s are... F16. This only affects the embedding, attention, and output layers, but it looks like that's enough to make a big difference on some setups (all this testing was done on a Strix Halo device).
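If you want to check which tensors actually differ between two of these GGUFs, the gguf-dump script that ships with llama.cpp's gguf-py package (pip install gguf) prints every tensor with its type. A minimal sketch, assuming the models are sitting in ~/models (the paths and grep patterns are just illustrative):

```sh
# Sketch: list the embedding/attention/output tensors in two GGUFs
# and eyeball the type difference (Q8_0 vs F16 vs MXFP4).
# gguf-dump comes with llama.cpp's gguf-py package: pip install gguf
gguf-dump ~/models/gpt-oss-20b-mxfp4.gguf | grep -iE 'token_embd|attn|output'
gguf-dump ~/models/gpt-oss-20b-F16.gguf   | grep -iE 'token_embd|attn|output'
```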
I did some initial comparisons w/ gpt-oss-120b, but since I didn't feel like wasting time/bandwidth, I did the more extensive testing on gpt-oss-20b to characterize the speed difference by quantization. The gap is a lot bigger than I expected considering how close the file sizes are! Maybe this big difference only shows up on APU platforms? These tests are on an AMD Strix Halo (Ryzen AI Max+ 395) w/ a recent TheRock/ROCm 7 nightly on a recent llama.cpp build compiled w/ rocWMMA FA (I did some brief AMDVLK Vulkan testing on the same hardware and the results were roughly in line). A rough sketch of the llama-bench invocation follows the table:
model | size | test | t/s |
---|---|---|---|
ggml-org gpt-oss-20b MXFP4 | 11.27 GiB | tg128 | 62.15 ± 0.01 |
unsloth gpt-oss-20b F16 | 12.83 GiB | tg128 | 42.93 ± 0.00 |
unsloth gpt-oss-20b UD Q8_K_XL | 12.28 GiB | tg128 | 50.89 ± 0.00 |
unsloth gpt-oss-20b Q8_0 | 11.27 GiB | tg128 | 59.06 ± 0.00 |
unsloth gpt-oss-20b UD Q4_K_XL | 11.04 GiB | tg128 | 62.04 ± 0.01 |
unsloth gpt-oss-20b Q4_K_M | 10.81 GiB | tg128 | 62.21 ± 0.01 |
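For reference, the tg128 numbers above are plain llama-bench runs; something along these lines should reproduce a row of the table (the model path is a placeholder, and any FA/device flags will depend on your build):

```sh
# Sketch of one tg128 row above: skip the prompt test (-p 0),
# generate 128 tokens (-n 128), default 5 repetitions.
./build/bin/llama-bench -m ~/models/gpt-oss-20b-Q8_0.gguf -p 0 -n 128
```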
I did do additional testing on the Q8_0 and was satisfied the runs matched pretty closely. For those looking for a bit more color, I posted some more info in a separate Framework discussion thread, but I thought I'd highlight it here since I hadn't seen any mention of the potentially big speed differences between quants (even when the file sizes don't change much!).
The Q8_0 is supposed to be exactly the same as the MXFP4 one. Some of the close results are most likely noise. You can run that one if you want.
For F16, it is not quantized at all, hence why it's slower and bigger. It's the original precision of the model. But it's good to know about these speed differences.
The reason why we recommend the F16 version is because it is OpenAI's original unquantized precision, and we didn't do anything to it. The Q8 and MXFP4 ones have some quantization involved.
In my linked comment I ran more repetitions (-r 20 instead of the default 5), reversed the order, and gave some time in between to make sure the runs were at relatively equal temperatures, and the Q8_0 came out closer. Interestingly, there was still a slight difference larger than the stddev, but ¯\_(ツ)_/¯
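For anyone who wants to repeat that rerun, a rough sketch (model paths are placeholders; the sleep is just to give the APU time to cool between models):

```sh
# Rerun sketch: 20 repetitions per model (-r 20) instead of the default 5,
# with a pause between models so temperatures are roughly equal.
for m in gpt-oss-20b-mxfp4.gguf gpt-oss-20b-Q8_0.gguf; do
  ./build/bin/llama-bench -m ~/models/"$m" -p 0 -n 128 -r 20
  sleep 120   # cool-down between models
done
```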