3.0bpw

#1 opened by Doctor-Shotgun

Wondering if you could upload the 3.0bpw-h6 if you get a chance? I'm curious to see how well it fits on a single 96 GB GPU without having to split across cards. The K/V cache for this model is rather small from when I tested the GGUF (~6 GB for 32K context at FP16) - I think whether it works will largely depend on the compute buffer size.
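As a rough sanity check on that ~6 GB figure, here's the back-of-envelope estimate I'm working from; the layer count, KV head count, and head dimension are what I believe the Qwen3-235B-A22B config uses, so treat them as assumptions:

```python
# Back-of-envelope KV cache size (config values are assumptions for Qwen3-235B-A22B)
num_layers = 94       # assumed number of hidden layers
num_kv_heads = 4      # assumed GQA key/value heads
head_dim = 128        # assumed per-head dimension
seq_len = 32768       # 32K context
bytes_per_elem = 2    # FP16

# K and V each store num_kv_heads * head_dim values per token per layer.
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"FP16 KV cache @ 32K: ~{kv_bytes / 1024**3:.1f} GiB")      # ~5.9 GiB
print(f"Q8/Q8 cache @ 32K:   ~{kv_bytes / 2 / 1024**3:.1f} GiB")  # ~2.9 GiB (ignoring scale overhead)
```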

And a 3.2bpw would also be appreciated - I have 104 GB, and I think the 3.6bpw may not fit very well.

> Wondering if you could upload the 3.0bpw-h6 if you get a chance? I'm curious to see how well it fits on a single 96 GB GPU without having to split across cards. The K/V cache for this model is rather small from when I tested the GGUF (~6 GB for 32K context at FP16) - I think whether it works will largely depend on the compute buffer size.

Uploading now.

> And a 3.2bpw would also be appreciated - I have 104 GB, and I think the 3.6bpw may not fit very well.

I may not get to uploading this one before my first quants of the new thinking version of this model become available, but we'll see.

@tomt610 @Doctor-Shotgun They're both available.

Thanks!

Just pulled the 3.0bpw quant - I managed to load it with max_seq_len 32768 and cache_mode 8,8 on a single 96 GB RTX PRO 6000. Performance seems fair, with ~700 T/s prompt processing and ~26 T/s generation.
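For a sense of why this fits, here's a very rough budget; it assumes the advertised 3.0 bpw applies uniformly to all ~235B parameters, which isn't exactly how the quant is laid out (the h6 output head and some other tensors use different precision), so read it as an estimate only:

```python
# Very rough single-GPU VRAM budget (estimates, not measurements)
params = 235e9    # ~235B total parameters (assumed to all sit at the average bpw)
bpw = 3.0         # advertised average bits per weight
weights_gib = params * bpw / 8 / 1024**3   # ~82 GiB
kv_q8_gib = 5.9 / 2                        # ~2.9 GiB Q8/Q8 cache @ 32K, from the estimate above

print(f"weights ~{weights_gib:.0f} GiB + cache ~{kv_q8_gib:.1f} GiB ~= {weights_gib + kv_q8_gib:.0f} GiB")
# Roughly 85 GiB, so whatever is left of the 96 GB card goes to compute buffers,
# activations, and the CUDA context - which is why allocator behavior matters.
```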

I'm also curious whether it would be viable to selectively quantize certain tensors at higher or lower precision, like the GGUF folks do, for better low-bit performance.

@MikeRoz

I played around with this model a bit, and I have a couple of observations:

  1. The cudaMallocAsync backend isn't currently working properly the way it's implemented in tabbyAPI. By setting the environment variable manually (PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync), I saved ~8 GB of VRAM while loading 3.0bpw-h6 - fairly significant (it let me do away with quantized cache entirely). It may be worth re-running your size measurements with this enabled, if you haven't already; there's a minimal sketch of setting it after this list.
  2. Inspired by ubergarm's quants, I used recompile to manually tune the per-tensor bitrates and make a slightly larger 3.07bpw-h6 that performs significantly better than the standard 3.0bpw-h6: https://huggingface.co/Doctor-Shotgun/Qwen3-235B-A22B-Instruct-2507-exl3_3.07bpw-h6-custom
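For reference, a minimal sketch of forcing the async allocator. PYTORCH_CUDA_ALLOC_CONF with backend:cudaMallocAsync is standard PyTorch; the key is that it has to be set before torch initializes CUDA, so exporting it in the shell (or using a small wrapper like the one below) before starting tabbyAPI is the reliable route:

```python
# Minimal sketch: force PyTorch's cudaMallocAsync allocator backend.
# The variable must be set before torch initializes CUDA, so do it
# before the first torch import (or export it in the launching shell).
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

import torch  # imported only after the env var is in place

if torch.cuda.is_available():
    # Reports "cudaMallocAsync" instead of the default "native" allocator.
    print(torch.cuda.get_allocator_backend())
```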

If you've run perplexity measurements using exllamav3's eval/ppl.py on your collection of quants here, I'd be interested in seeing how they compare.
