exl2-6.0bpw

#1
by tatianapoliakova - opened

It seems that there's no official tabbyAPI support for exl3 yet. Would it be possible to have an exl2-6.0bpw version, since it's more precise? For GLM-Z1-Rumination-32B-0414 as well? It seems to be the best model.


Exllamav3 support was merged into tabbyAPI's main branch on May 9. The quantization process is straightforward and is covered in the docs: https://github.com/turboderp-org/exllamav3/blob/master/doc/convert.md
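For reference, a conversion run per that doc is a single script invocation. Below is a minimal sketch that wraps it in Python; the paths are placeholders, and the flag names (`-i`, `-o`, `-w`, `-b`) are quoted from memory, so verify them against convert.md before running:

```python
# Minimal sketch of an exl3 conversion run, following the convert.md doc above.
# All paths are placeholders; flag names are from memory, so check the doc.
import subprocess

subprocess.run(
    [
        "python", "convert.py",
        "-i", "/models/GLM-Z1-Rumination-32B-0414",       # source HF model dir
        "-o", "/models/GLM-Z1-Rumination-32B-0414-exl3",  # output dir for the quant
        "-w", "/tmp/exl3-work",                           # scratch dir for intermediates
        "-b", "6.0",                                      # target bits per weight
    ],
    check=True,  # raise if the conversion script exits with an error
)
```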

A 6.0bpw exl3 quant of GLM-4-32B-0414 (not exl2) is already available: https://huggingface.co/owentruong/GLM-4-32B-0414-EXL3/tree/6.0
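Note that the 6.0bpw files sit on the `6.0` branch of that repo, so you need a revision-aware download. A small sketch using `huggingface_hub` (the local directory is a placeholder):

```python
# Fetch the 6.0bpw exl3 quant; the files live on the "6.0" branch,
# so it has to be passed as the revision.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="owentruong/GLM-4-32B-0414-EXL3",
    revision="6.0",                                 # branch with the 6.0bpw files
    local_dir="models/GLM-4-32B-0414-EXL3-6.0bpw",  # placeholder target path
)
```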

GLM-4-32B models are not supported in Exllamav2, and Turboderp, the creator of Exllama, is now mostly focused on v3, so support most likely won't be added to v2. It's worth switching to v3 if your GPU is Ada or Blackwell based (RTX 40xx / 50xx). Performance on Ampere (RTX 30xx) is still an issue and is being worked on: it's not horrible, and some people run exl3 on an RTX 3090, but performance is expected to improve in the future.

OK, I didn't know that these models aren't supported in exllamav2.

I'm interested in testing GLM-Z1-Rumination-32B-0414, as it uses a deeper thinking process, if I understood correctly. If I use several RTX 30xx cards on PCIe x1 connections, will that dramatically decrease the conversion speed, or is it better to use two GPUs on PCIe x16 (I don't have more than two x16 slots)? Do you know?


I have no information on that matter, sadly. However, you can ask for help on Exllama's Discord server: https://discord.gg/NSFwVuCjRq
Someone there is sure to know. Turboderp is there, too.
