Qwen3-32B-unsloth-bnb-4bit vs. bnb-8bit vs. gguf-Q8_0 et al.?
Thank you very much for all the quants!
I see that the dynamic quantization process for this model settled on roughly 8 bits/weight on average, judging by the files (~36 GB total model file size).
That's compared to roughly 20 GB of model files for Qwen3-32B-bnb-4bit, so about +16 GB of encoding budget spent on better accuracy, which is fine if that's what delivers the benefit of Dynamic 2.0 quantization over plain bnb-4bit.
But I wonder how the quality of Qwen3-32B-unsloth-bnb-4bit (~36 GB of files) benchmarks against:
Qwen3-32B-UD-Q8_K_XL.gguf,
Qwen3-32B-UD-Q6_K_XL.gguf,
Qwen3-32B-Q8_0.gguf,
Qwen3-32B-Q6_K.gguf,
Qwen3-32B-FP8,
Qwen3-32B-bnb-8bit (not made but possible using bnb AFAIK)
...which have either similar total model file sizes (FP8, bnb-8bit, and Q8 variants) or smaller file sizes (Q6 variants), and which might be plausible alternatives to Qwen3-32B-unsloth-bnb-4bit (for those who can run such formats) if they use a similar amount of VRAM/RAM when loaded (maybe that is not true at all?).
Obviously there may not be an actual benchmark comparing these, but even roughly / qualitatively, how might one wisely choose among these quants, or among similar 4-8 bit AWQ, GPTQ, GGUF, and BNB quants, when balancing quality vs. RAM size and inference speed? (Not all GPUs / systems can process FP4 or FP8, so those and other options may or may not be practical depending on the user.)
I am assuming that Qwen3-32B-UD-Q8_K_XL and Qwen3-32B-Q8_0 should give excellent quality / accuracy, so the question is really how to rank the smaller quantization / format options below those in terms of RAM size vs. performance (speed, accuracy).
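For a rough sanity check on the file sizes above, this is the back-of-envelope arithmetic I'm using (assuming ~32.8B parameters for Qwen3-32B and rough effective bits/weight figures; actual VRAM use adds KV cache and runtime overhead on top):

```python
# Back-of-envelope weight memory: parameter_count * bits_per_weight / 8.
# KV cache, activations and framework overhead come on top of this.
PARAMS = 32.8e9  # approximate parameter count of Qwen3-32B

# Rough effective bits/weight (including quantization scales/constants):
for name, bits in [
    ("bnb-4bit (NF4)", 4.5),
    ("GGUF Q6_K", 6.6),
    ("GGUF Q8_0 / 8-bit", 8.5),
    ("BF16/FP16", 16.0),
]:
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name:>18}: ~{gb:.0f} GB of weights")
```

Which comes out around 18 GB at ~4.5 bpw and 35 GB at ~8.5 bpw, so the file sizes I quoted seem consistent with a composite ~8 bits/weight for the dynamic quant.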
The general rule of thumb is to use the Safetensors versions for serving/fine-tuning. Accuracy for the dynamic 4-bit vs. the dynamic GGUFs is nearly the same; I think the Dynamic 2.0 GGUFs are ever so slightly better, though, because of the calibration dataset plus more layer selections.
GGUFs can also be used to serve, but I'm unsure whether vLLM supports it.
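As a minimal sketch of what "use the Safetensors version" looks like in practice (assuming the repo ID unsloth/Qwen3-32B-unsloth-bnb-4bit; check the model card for the exact name):

```python
# Minimal sketch: load the dynamic 4-bit Safetensors checkpoint with Unsloth
# for fine-tuning or local inference. Repo ID is assumed, not verified here.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-32B-unsloth-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,  # keep the bitsandbytes dynamic 4-bit weights as-is
)
```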
Regarding this:
"GGUFs can also be used to serve, but I'm unsure whether vLLM supports it."
I tried with the latest vLLM version (0.8.5.post1) and it is still not ready for Qwen3 models: it throws an error when loading the quantized GGUF versions.
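For reference, this is roughly the loading pattern I attempted (the local .gguf path is just a placeholder; vLLM's experimental GGUF support takes the file path as the model and the original Qwen repo for the tokenizer):

```python
# Roughly what I tried with vLLM's experimental GGUF loading.
from vllm import LLM

llm = LLM(
    model="/models/Qwen3-32B-Q8_0.gguf",  # placeholder local path
    tokenizer="Qwen/Qwen3-32B",           # tokenizer comes from the original repo
)
# With vllm==0.8.5.post1 this errors out while loading the quantized Qwen3 GGUF.
```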