3.0bpw

#1 opened by Doctor-Shotgun

Wondering if you could upload the 3.0bpw-h6 if you get a chance? I'm curious to see how well it fits on a single 96 GB GPU without having to split across cards. The K/V cache for this model is rather small from when I tested the GGUF (~6 GB for 32K context at FP16) - I think whether it works will largely depend on the compute buffer size.
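As a rough sanity check on that ~6 GB figure, here's the back-of-envelope estimate I'm working from; the layer count, KV head count, and head dimension are what I believe the Qwen3-235B-A22B config uses, so treat them as assumptions:

```python
# Back-of-envelope KV cache size (config values are assumptions for Qwen3-235B-A22B)
num_layers = 94       # assumed number of hidden layers
num_kv_heads = 4      # assumed GQA key/value heads
head_dim = 128        # assumed per-head dimension
seq_len = 32768       # 32K context
bytes_per_elem = 2    # FP16

# K and V each store num_kv_heads * head_dim values per token per layer.
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"FP16 KV cache @ 32K: ~{kv_bytes / 1024**3:.1f} GiB")      # ~5.9 GiB
print(f"Q8/Q8 cache @ 32K:   ~{kv_bytes / 2 / 1024**3:.1f} GiB")  # ~2.9 GiB (ignoring scale overhead)
```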

And a 3.2bpw would also be appreciated - I have 104 GB, and I think the 3.6bpw may not fit very well.

> Wondering if you could upload the 3.0bpw-h6 if you get a chance? I'm curious to see how well it fits on a single 96 GB GPU without having to split across cards. The K/V cache for this model is rather small from when I tested the GGUF (~6 GB for 32K context at FP16) - I think whether it works will largely depend on the compute buffer size.

Uploading now.

> And a 3.2bpw would also be appreciated - I have 104 GB, and I think the 3.6bpw may not fit very well.

I may not get to uploading this one before my first quants of the new thinking version of this model become available, but we'll see.

@tomt610 @Doctor-Shotgun They're both available.

Thanks!

Just pulled the 3.0bpw quant - I managed to load it with max_seq_len 32768 and cache_mode 8,8 on a single 96 GB RTX PRO 6000. Performance seems fair, with ~700 T/s prompt processing and ~26 T/s generation.
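For a sense of why this fits, here's a very rough budget; it assumes the advertised 3.0 bpw applies uniformly to all ~235B parameters, which isn't exactly how the quant is laid out (the h6 output head and some other tensors use different precision), so read it as an estimate only:

```python
# Very rough single-GPU VRAM budget (estimates, not measurements)
params = 235e9    # ~235B total parameters (assumed to all sit at the average bpw)
bpw = 3.0         # advertised average bits per weight
weights_gib = params * bpw / 8 / 1024**3   # ~82 GiB
kv_q8_gib = 5.9 / 2                        # ~2.9 GiB Q8/Q8 cache @ 32K, from the estimate above

print(f"weights ~{weights_gib:.0f} GiB + cache ~{kv_q8_gib:.1f} GiB ~= {weights_gib + kv_q8_gib:.0f} GiB")
# Roughly 85 GiB, so whatever is left of the 96 GB card goes to compute buffers,
# activations, and the CUDA context - which is why allocator behavior matters.
```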

I'm also curious whether it would be viable to selectively quantize certain tensors at higher or lower precision, like the GGUF folks do, for better low-bit performance.

@MikeRoz

I played around with this model a bit, and I have a couple of observations:

  1. The cudaMallocAsync backend isn't currently working properly the way it's implemented in tabbyAPI. By setting the environment variable manually (PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync), I saved ~8 GB of VRAM while loading 3.0bpw-h6 - fairly significant (it let me do away with quantized cache entirely). It may be worth re-running your size measurements with this enabled, if you haven't already; there's a minimal sketch of setting it after this list.
  2. Inspired by ubergarm's quants, I used recompile to manually tune the per-tensor bitrates and make a slightly larger 3.07bpw-h6 that performs significantly better than the standard 3.0bpw-h6: https://huggingface.co/Doctor-Shotgun/Qwen3-235B-A22B-Instruct-2507-exl3_3.07bpw-h6-custom
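For reference, a minimal sketch of forcing the async allocator. PYTORCH_CUDA_ALLOC_CONF with backend:cudaMallocAsync is standard PyTorch; the key is that it has to be set before torch initializes CUDA, so exporting it in the shell (or using a small wrapper like the one below) before starting tabbyAPI is the reliable route:

```python
# Minimal sketch: force PyTorch's cudaMallocAsync allocator backend.
# The variable must be set before torch initializes CUDA, so do it
# before the first torch import (or export it in the launching shell).
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

import torch  # imported only after the env var is in place

if torch.cuda.is_available():
    # Reports "cudaMallocAsync" instead of the default "native" allocator.
    print(torch.cuda.get_allocator_backend())
```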

If you've run perplexity measurements using exllamav3's eval/ppl.py on your collection of quants here, I'd be interested in seeing how they compare.
