REALLY slow with flash attention and quantized cache.

by Olafangensan - opened

Using Q6_K_L:
I get 10-11 T/s with the Q4 cache and 34 T/s without.
That said, 16k tokens of context take up 5 GB without FA, for a total of 22 GB of VRAM, and the full 32k easily OOMs my 3090. Is this a GGUF issue, or is it on the model's side?
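
For a rough sense of where those gigabytes come from: an unquantized KV cache is roughly 2 (K and V) Γ— layers Γ— KV heads Γ— head dim Γ— 2 bytes (fp16) Γ— tokens, so a model with little or no grouped-query attention can need a surprisingly large cache even though it has fewer parameters. A minimal sketch below, with placeholder configs (not Reka Flash 3's or Mistral's actual numbers):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: float) -> int:
    """Approximate KV-cache size: K and V tensors for every layer and every token."""
    return int(2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem)

GIB = 1024 ** 3

# Hypothetical configs -- NOT the real Reka Flash 3 / Mistral 24B numbers.
wide_kv = dict(n_layers=40, n_kv_heads=16, head_dim=128)   # little/no GQA -> big cache
narrow_kv = dict(n_layers=40, n_kv_heads=8, head_dim=128)  # aggressive GQA -> smaller cache

for name, cfg in [("wide KV", wide_kv), ("narrow KV", narrow_kv)]:
    for label, bpe in [("fp16", 2.0), ("q8_0 (~1 B/elem)", 1.0), ("q4_0 (~0.5 B/elem)", 0.5)]:
        size = kv_cache_bytes(**cfg, n_tokens=16384, bytes_per_elem=bpe)
        print(f"{name:9s} 16k ctx, {label:18s}: {size / GIB:.2f} GiB")
```

With the "wide KV" placeholder config the fp16 cache already lands around 5 GiB at 16k, which is in the same ballpark as what you're seeing.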

Strange, do you observe this with other models?

I would expect the Q4 cache quantization to slow things down a bit (I think), since it has to quantize on the fly, but I still wouldn't have expected that big of a hit πŸ€”
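
If someone wants to isolate this outside of KoboldCpp, here is a minimal llama-cpp-python sketch for A/B-ing cache types with flash attention. It assumes a recent build that exposes flash_attn/type_k/type_v (names can differ by version), and the model filename is just a placeholder:

```python
import time
from llama_cpp import Llama

# ggml type ids from ggml.h: f16 = 1, q4_0 = 2, q8_0 = 8
GGML_TYPE_F16, GGML_TYPE_Q4_0, GGML_TYPE_Q8_0 = 1, 2, 8

def load(cache_type: int) -> Llama:
    return Llama(
        model_path="Reka-Flash-3-Q6_K_L.gguf",  # placeholder path
        n_gpu_layers=-1,     # offload all layers
        n_ctx=16384,
        flash_attn=True,     # llama.cpp needs FA for a quantized V cache
        type_k=cache_type,   # KV cache data type
        type_v=cache_type,
    )

for name, cache_type in [("f16", GGML_TYPE_F16), ("q8_0", GGML_TYPE_Q8_0), ("q4_0", GGML_TYPE_Q4_0)]:
    llm = load(cache_type)
    t0 = time.time()
    out = llm("Write a short story about a mystery box.", max_tokens=256)
    n = out["usage"]["completion_tokens"]
    print(f"{name}: {n / (time.time() - t0):.1f} T/s")
    del llm  # free VRAM before loading the next variant
```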

Not to that extent, no. For comparison, Mistral 24B (Q5_K_L) at 32k with the 8-bit cache is about 19.5 GB on a Windows 11 machine, with 0.7 GB of that being just for Windows.
36.91 T/s

Mistral 24B Thinker v1.1 by Undi95 (Q6_K), 8-bit cache, 32k: 21.7 GB (0.7 GB for Windows).
27.50 T/s

Not tested properly, just throwing the same prompt at them in the Kobold Lite UI.

Lol, I wish I had at least 10-11 T/s and here you are complaining...

In fairness, Q8 may be easier to do on the fly; can you try Q4 with those models?

0.6-0.7 GB for Windows

Mistral 24B (Q5_K_L) at 32k with the 4-bit cache: 17.9 GB, 38.84 T/s

Mistral 24B ArliAI RPMax v1.4 (Q6_K_L) at 32k with the 4-bit cache: 20.5 GB, 33.98 T/s

Reka Flash 3 (Q6_K_L) at 32k with the 8-bit cache: 20.1 GB, 11.5 T/s

If it's of any consolation, it's not just you. I can run Mistral 24B without flash attention, but not Reka Flash 3. This model IS quite demanding, which is kind of a shame because it's a smaller model than Mistral 24B, for crying out loud. But there are thousands of creators and thousands of different model architectures that all behave differently and have different requirements, so trying out a new model is always a bit like opening a mystery box: you never know what you'll get.
