Please consider creating ik_llama.cpp-compatible quants (without llama.cpp-specific MLA tensors)
The MLA implementations in ik_llama.cpp and llama.cpp are not compatible:
https://github.com/ikawrakow/ik_llama.cpp/issues/383
Using convert_hf_to_gguf.py from ik_llama.cpp (or from llama.cpp before MLA was implemented) makes it possible to create quants that are compatible with ik_llama.cpp, for example as sketched below.
One of the advantages of ik_llama.cpp is that it runs DeepSeek 671B models much faster: for example, with a Q4 quant on my hardware (EPYC 7763, 1TB RAM, 4x3090), ik_llama.cpp gives 8 tokens/s while llama.cpp gives less than 4, all other things being equal (same prompt and context size).
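A minimal sketch of that conversion step, assuming ik_llama.cpp's convert_hf_to_gguf.py accepts the same --outfile/--outtype flags as the mainline script (the paths here are hypothetical):

```python
# Sketch only: run the converter from an ik_llama.cpp checkout so the resulting
# GGUF does not contain llama.cpp's MLA-specific tensors. Paths are placeholders.
import subprocess

subprocess.run(
    [
        "python", "ik_llama.cpp/convert_hf_to_gguf.py",  # converter from the ik_llama.cpp repo
        "models/DeepSeek-R1-bf16",                       # local HF checkpoint (already fp8 -> bf16)
        "--outfile", "DeepSeek-R1-bf16.gguf",            # output GGUF to quantize afterwards
        "--outtype", "bf16",
    ],
    check=True,
)
```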
Looks like you got it working, @Lissanro: https://github.com/ikawrakow/ik_llama.cpp/issues/383#issuecomment-2869544925 Nice job!
I've only converted the original fp8 to bf16 directly, using the evshiron fork method and triton-cpu. About the only thing my 3090TI can't do is native fp8.
haha... I generally don't use the python scripts in ik's fork as I'm not completely sure what has been updated, though there have been some updates.
This might also be why your imatrix is a different size, but I'm not sure exactly how MLA tensors are (or are not) converted by the different methods.
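If it helps, here's a rough way to check which layout a given GGUF ended up with. This is just a sketch: I'm assuming the llama.cpp MLA path adds split tensors named attn_k_b/attn_v_b, and that the gguf Python package is installed; the file path is hypothetical.

```python
# Sketch: inspect tensor names in a GGUF to see whether llama.cpp-style MLA
# tensors (assumed to be attn_k_b / attn_v_b) are present. A quant produced
# with ik_llama.cpp's converter should not contain them.
from gguf import GGUFReader

reader = GGUFReader("DeepSeek-R1-bf16.gguf")  # placeholder path
mla_tensors = [
    t.name for t in reader.tensors
    if ".attn_k_b." in t.name or ".attn_v_b." in t.name
]

if mla_tensors:
    print(f"Found {len(mla_tensors)} llama.cpp-style MLA tensors, e.g. {mla_tensors[0]}")
else:
    print("No split MLA tensors found; layout looks like the pre-MLA / ik_llama.cpp format")
```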