eaddario posted an update 3 days ago
Squeezing Tensor Bits: the quest for smaller LLMs

An area of personal interest is finding ways to optimize the inference performance of LLMs when deployed in resource-constrained environments: commodity desktops, laptops, mobiles, edge devices, etc.

The method I'm using to produce these experimental versions (for example, eaddario/DeepSeek-R1-Distill-Llama-8B-GGUF) is explained in https://medium.com/@eaddario/squeezing-tensor-bits-the-quest-for-smaller-llms-86b23bd052ca

At a high level, it involves using a custom version of the llama-quantize tool to selectively quantize different tensors at different precision levels. On average, a reduction in model size of 10% or more is possible with little loss of quality.
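A quick way to see what this kind of per-tensor quantization produces is to dump the quantization type of each tensor in a finished GGUF file. Below is a minimal sketch using the `gguf` Python package that ships with llama.cpp; the file path is a placeholder, and the `GGUFReader` field names (`tensors`, `tensor_type`, `n_bytes`) reflect my reading of that package and may differ between versions:

```python
# Rough sketch: report which quantization type each tensor in a GGUF file
# uses, and how much of the file size each type accounts for.
# Assumes `pip install gguf`; field names may vary by package version.
from collections import defaultdict
from gguf import GGUFReader

reader = GGUFReader("DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf")  # placeholder path

bytes_by_type = defaultdict(int)
for tensor in reader.tensors:
    qtype = tensor.tensor_type.name            # e.g. Q4_K, Q5_K, Q6_K, F32
    bytes_by_type[qtype] += int(tensor.n_bytes)
    print(f"{tensor.name:50s} {qtype:8s} {int(tensor.n_bytes) / 2**20:8.1f} MiB")

total = sum(bytes_by_type.values())
print("\nShare of file size by quantization type:")
for qtype, size in sorted(bytes_by_type.items(), key=lambda kv: -kv[1]):
    print(f"{qtype:8s} {size / 2**20:10.1f} MiB  ({100 * size / total:.1f}%)")
```

The per-type breakdown makes it easy to spot which tensors dominate the file size and are therefore the most rewarding targets for a lower-precision type.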

There are two PRs open to merge these changes back into the core project, but until then the modified version will be available on GitHub: https://github.com/EAddario/llama.cpp/tree/quantize

Would love to hear if you can achieve smaller sizes at higher quality!

Excited to try this out!

Nice to know that GGUF can be optimized further! By the way, I came across another approach, ExLlamaV2's EXL2 format, which also uses selective quantization but with a wider range of bit-widths and mixing within layers. What do you think?

https://github.com/turboderp-org/exllamav2?tab=readme-ov-file#exl2-quantization
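For a rough sense of what mixing bit-widths buys, here is a back-of-the-envelope sketch. The parameter counts and bit allocations below are made up purely for illustration; they are not measured from any real model, nor from EXL2's actual optimizer:

```python
# Toy example: effective bits per weight when different tensor groups are
# quantized at different precisions. All numbers are illustrative only.
groups = {
    # name: (parameter count, bits per weight)
    "attention":   (1.0e9, 5.5),
    "ffn_down":    (2.0e9, 4.5),
    "ffn_up/gate": (4.0e9, 3.5),
    "embeddings":  (0.5e9, 6.5),
}

total_params = sum(params for params, _ in groups.values())
total_bits = sum(params * bits for params, bits in groups.values())

print(f"average bits/weight: {total_bits / total_params:.2f}")
print(f"approx. weight size: {total_bits / 8 / 2**30:.2f} GiB")
```

The point is simply that a handful of large tensors held at lower precision pulls the average bits per weight (and hence the file size) down, while smaller, more sensitive tensors can stay at higher precision.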


From what I gather, it isn't clear that the benefit of adding EXL2 support would justify the effort. Such an enhancement would require extensive changes across the code base, not just to the quantization process itself (comparatively straightforward) but also to loading and serving the models (fairly complex).

In late 2023, Oobabooga published a comparison between different quants showing advantages of EXL2 over GGUF, but GGerganov addressed most of the issues: "A detailed comparison between GPTQ, AWQ, EXL2, q4_K_M, q4_K_S, and load_in_4bit" and "cuda : improve text-generation and batched decoding performance"

Still, this would be an interesting project to learn llama.cpp inside out. Maybe one for the future!