Is the 2.51bit model using imatrix?
I've played with both the 2.51bit and 2.22bit R1 models before and 2.22 is much better than 2.51. For the new 0324 2.51bit model, is it using imatrix?
On a side note, 2.22bit is about 20% slower than 2.51bit on KTransformers. I'm not sure whether that's caused by imatrix.
It's not due to imatrix.
2.22 uses more complex scaling to save bits, and with current inference methods this costs time.
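To illustrate the point, here is a toy sketch (NOT the actual llama.cpp/KTransformers kernels; the block sizes and layout are made up) of why a nested scale hierarchy costs more per value than a single per-block scale:

```python
import numpy as np

# Toy comparison: one scale per block vs. a two-level scale hierarchy.
# A scheme that saves bits by sharing a superblock scale across many
# small sub-scales has to do extra lookups and multiplies per value.

def dequant_simple(q, scale):
    # One multiply per value: x = scale * q
    return scale * q

def dequant_nested(q, super_scale, sub_scales, sub_block=8):
    # Two scale levels: each sub-block has its own sub-scale, all
    # multiplied by one superblock scale. A naive inner loop does
    # strictly more work per value than dequant_simple.
    out = np.empty(len(q), dtype=np.float32)
    for i in range(0, len(q), sub_block):
        s = super_scale * sub_scales[i // sub_block]
        out[i:i + sub_block] = s * q[i:i + sub_block]
    return out

q = np.array([1, -2, 3, -4, 5, -6, 7, -8] * 4, dtype=np.float32)
a = dequant_simple(q, 0.05)
b = dequant_nested(q, 0.05, np.ones(4, dtype=np.float32))
assert np.allclose(a, b)  # same values; the nested path just does more work
```

With identical sub-scales the two paths produce the same numbers, which is the point: the extra cost is in the bookkeeping, not the output.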
2.51 is not using imatrix; however, 2.22 is. And yes, imatrix might be making it slower.
It is partially due to imatrix, which makes it slower.
That would be surprising. I'll run measurements to understand why this happens with the 2-bit quant and not the 4-bit one.
I measured IQ4_NL before, and there was no speed difference with or without imatrix.
Yeah, imatrix shouldn't affect the speed much. It's just that K-quants are easier on the CPU compared to IQ quants.
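Roughly, the difference is this (a simplified sketch: the 16-entry table matches the `kvalues_iq4nl` grid in llama.cpp's source as I recall it, but real kernels also handle block scales, mins, and packed layouts):

```python
import numpy as np

# Contrast between a linear "K-quant style" dequant and the non-linear
# lookup that IQ4_NL uses. Only the per-value mapping is shown.

# Non-uniform 4-bit grid used by IQ4_NL (as in llama.cpp's kvalues_iq4nl;
# verify against the current source before relying on it).
IQ4NL_TABLE = np.array([-127, -104, -83, -65, -49, -35, -22, -10,
                           1,   13,   25,  38,  53,  69,  89, 113],
                       dtype=np.int8)

def dequant_linear(q4, scale):
    # K-quant style: the 4-bit code IS the (offset) integer value.
    # One subtract and one multiply per value -> very SIMD-friendly.
    return scale * (q4.astype(np.float32) - 8.0)

def dequant_iq4nl(q4, scale):
    # IQ style: the 4-bit code first indexes a non-uniform table.
    # The extra gather/lookup is what makes naive CPU code slower.
    return scale * IQ4NL_TABLE[q4].astype(np.float32)

codes = np.array([0, 7, 8, 15], dtype=np.uint8)
print(dequant_linear(codes, 0.1))   # uniform steps
print(dequant_iq4nl(codes, 0.1))    # denser near zero, coarser at the tails
```

The non-uniform grid is why IQ quants squeeze more quality out of 4 bits, and the table gather is why they cost more on CPUs without fast shuffle/gather instructions.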
I'm writing a new llama.cpp backend that runs inference on the IQ quant family much more efficiently, even faster than the current K-quants.
IQ4_NL and IQ4_XS will be the first two data types supported, which is why I care about them so much and benchmark in that area.
That's nice to hear. Please open a PR on llama.cpp if you deem it successful, so everyone can benefit.