Native FP4 seems to make quantization meaningless

#7
by lingyezhixing - opened

After seeing the list of quantization files, I don't think quantization is very useful for this model, since it was trained natively in FP4. None of the quantizations significantly reduce the model size, and they still fail to bring the VRAM requirement for fully-on-GPU inference below 12GB. Devices with less than 16GB of VRAM still need mixed-precision inference, while devices with at least 16GB can just run the full F16/BF16. Maybe native FP4 training really is the future, haha
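For anyone else below 16GB, this is the kind of fallback setup I mean: a rough sketch assuming the llama-cpp-python bindings, with a placeholder GGUF filename and an arbitrary layer split.

```python
# Minimal sketch of a CPU/GPU split with the llama-cpp-python bindings.
# The GGUF filename and layer count below are placeholders; lower
# n_gpu_layers until the offloaded weights plus KV cache fit in VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="model-Q4_K_XL.gguf",  # placeholder filename
    n_gpu_layers=24,  # offload only part of the layers to the GPU
    n_ctx=4096,       # context window; larger values need more memory
)

out = llm("Explain FP4 quantization in one sentence:", max_tokens=64)
print(out["choices"][0]["text"])
```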

Unsloth AI org

The model was not trained in FP4. It was trained in F16 and then post-trained to FP4.

Also, the quantizations of this model have very similar file sizes due to current llama.cpp limitations, so this is unique to this model. With a proper llama.cpp implementation, you can definitely quantize it down further.

Hi, do you think going to Q4_K_XL is worth it, since it only saves ~2GB of weights? And is the BF16 GGUF any different from the F16 one (speed/accuracy)? Thanks for your work!

Unsloth AI org

If you can fit F16, definitely go for F16. There isn't much performance difference compared to Q4_K_XL, but F16 is the model's full original performance.

shimmyshimmer changed discussion status to closed