Is the 2.51bit model using imatrix?
I've played with both the 2.51bit and 2.22bit R1 models before and 2.22 is much better than 2.51. For the new 0324 2.51bit model, is it using imatrix?
On a side note, 2.22bit is about 20% slower than 2.51bit on KTransformers. I'm not sure whether that's caused by imatrix.
It's not due to imatrix.
2.22 uses more complex scaling to save bits, and with current inference methods this costs time.
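To illustrate the point, here is a toy sketch (NOT the actual llama.cpp/KTransformers kernels; the block sizes and layout are made up) of why a nested scale hierarchy costs more per value than a single per-block scale:

```python
import numpy as np

# Toy comparison: one scale per block vs. a two-level scale hierarchy.
# A scheme that saves bits by sharing a superblock scale across many
# small sub-scales has to do extra lookups and multiplies per value.

def dequant_simple(q, scale):
    # One multiply per value: x = scale * q
    return scale * q

def dequant_nested(q, super_scale, sub_scales, sub_block=8):
    # Two scale levels: each sub-block has its own sub-scale, all
    # multiplied by one superblock scale. A naive inner loop does
    # strictly more work per value than dequant_simple.
    out = np.empty(len(q), dtype=np.float32)
    for i in range(0, len(q), sub_block):
        s = super_scale * sub_scales[i // sub_block]
        out[i:i + sub_block] = s * q[i:i + sub_block]
    return out

q = np.array([1, -2, 3, -4, 5, -6, 7, -8] * 4, dtype=np.float32)
a = dequant_simple(q, 0.05)
b = dequant_nested(q, 0.05, np.ones(4, dtype=np.float32))
assert np.allclose(a, b)  # same values; the nested path just does more work
```

With identical sub-scales the two paths produce the same numbers, which is the point: the extra cost is in the bookkeeping, not the output.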
2.51 is not using imatrix; however, 2.22 is. And yes, imatrix might be making it slower.
It is partially due to imatrix, which makes it slower.
That would be surprising. I'll run measurements to understand why this happens with the 2-bit quant and not the 4-bit one.
I measured IQ4_NL before, and there was no speed difference with or without imatrix.
Yeah, imatrix shouldn't affect the speed much. It's just that K-quants are easier on the CPU compared to IQ quants.
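Roughly, the difference is this (a simplified sketch: the 16-entry table matches the `kvalues_iq4nl` grid in llama.cpp's source as I recall it, but real kernels also handle block scales, mins, and packed layouts):

```python
import numpy as np

# Contrast between a linear "K-quant style" dequant and the non-linear
# lookup that IQ4_NL uses. Only the per-value mapping is shown.

# Non-uniform 4-bit grid used by IQ4_NL (as in llama.cpp's kvalues_iq4nl;
# verify against the current source before relying on it).
IQ4NL_TABLE = np.array([-127, -104, -83, -65, -49, -35, -22, -10,
                           1,   13,   25,  38,  53,  69,  89, 113],
                       dtype=np.int8)

def dequant_linear(q4, scale):
    # K-quant style: the 4-bit code IS the (offset) integer value.
    # One subtract and one multiply per value -> very SIMD-friendly.
    return scale * (q4.astype(np.float32) - 8.0)

def dequant_iq4nl(q4, scale):
    # IQ style: the 4-bit code first indexes a non-uniform table.
    # The extra gather/lookup is what makes naive CPU code slower.
    return scale * IQ4NL_TABLE[q4].astype(np.float32)

codes = np.array([0, 7, 8, 15], dtype=np.uint8)
print(dequant_linear(codes, 0.1))   # uniform steps
print(dequant_iq4nl(codes, 0.1))    # denser near zero, coarser at the tails
```

The non-uniform grid is why IQ quants squeeze more quality out of 4 bits, and the table gather is why they cost more on CPUs without fast shuffle/gather instructions.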
I'm writing a new llama.cpp backend that runs inference on the IQ quant family much more efficiently, even faster than the current K-quants.
IQ4_NL and IQ4_XS will be the first two data types supported, which is why I care about them so much and benchmark in that area.
That's nice to hear. Please open a PR on llama.cpp if you deem it successful, so everyone can benefit.