Surprised by performance (speed)
I switched to this model from the previous version, and I was surprised by how much less power-hungry this one is.
Running Q4_K_XL on CPU from RAM, with only 2 layers and the KV cache offloaded to VRAM via LM Studio, and with exactly the same parameters and context, prompt processing is significantly faster on this one than with unsloth/DeepSeek-V3-0324-GGUF/UD-Q4_K_XL. Even RAM temperatures stay relatively steady, whereas with the previous version it was difficult to keep the RAM cool once the context grew.
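For reference, a roughly equivalent setup with the llama.cpp CLI might look like the sketch below. This is an assumption on my part (LM Studio exposes these as GUI settings rather than flags), and the model path is illustrative, not a real file:

```shell
# Sketch: mostly-CPU inference with a couple of layers on the GPU.
# -ngl 2        offloads only 2 layers to VRAM; the rest run on CPU from RAM.
# KV cache offload to VRAM is llama.cpp's default when any layers are on GPU
# (it can be disabled with --no-kv-offload).
llama-cli \
  -m ./DeepSeek-V3-UD-Q4_K_XL.gguf \
  -ngl 2 \
  -c 8192 \
  -p "Hello"
```

The exact flag names can vary between llama.cpp builds, so check `llama-cli --help` for the version LM Studio bundles.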
So I wonder... how did you manage this? Is it entirely expected?
Or, is there a catch, something lost in exchange for speed?
Thank you for your hard work on the quants.
Interesting. It might be because llama.cpp has had many updates to inference lately, which make it run much better now.