Ed Addario (eaddario)
Squeezing Tensor Bits: the quest for smaller LLMs

An area of personal interest is finding ways to optimize the inference performance of LLMs when deployed in resource-constrained environments like commodity hardware, desktops, laptops, mobiles, edge devices, etc.

The method I'm using to produce these experimental versions (for example, eaddario/DeepSeek-R1-Distill-Llama-8B-GGUF) is explained in https://medium.com/@eaddario/squeezing-tensor-bits-the-quest-for-smaller-llms-86b23bd052ca

At a high level, the approach uses a custom version of the llama-quantize tool to selectively quantize different tensors at different levels. On average, a size reduction of 10% or more is possible with little loss of quality.
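To illustrate the intuition (a minimal back-of-the-envelope sketch, not the author's implementation: the tensor names, parameter counts, and bits-per-weight values below are all made up), the saving comes from spending fewer bits on tensors that tolerate aggressive quantization and more bits on quality-sensitive ones:

```python
# Hypothetical (tensor name -> parameter count) for a toy model
tensors = {
    "attn.weight": 100_000_000,
    "ffn.weight": 400_000_000,
    "output.weight": 50_000_000,
}

# Uniform baseline: every tensor at the same bits-per-weight
BASELINE_BPW = 4.5

# Selective scheme: squeeze the large FFN tensor harder, keep attention at
# the baseline, and protect the quality-sensitive output tensor with more bits
selective_bpw = {
    "attn.weight": 4.5,
    "ffn.weight": 3.6,     # aggressive
    "output.weight": 6.5,  # conservative
}

baseline_bits = sum(n * BASELINE_BPW for n in tensors.values())
selective_bits = sum(n * selective_bpw[name] for name, n in tensors.items())
reduction = 100 * (1 - selective_bits / baseline_bits)
print(f"estimated size reduction: {reduction:.1f}%")
```

With these invented numbers the mixed scheme comes out roughly 10% smaller than the uniform baseline, even though one tensor was given *more* bits, because the largest tensor dominates the total.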

There are two open PRs to merge these changes back into the core llama.cpp project, but until then the modified version is available on GitHub: https://github.com/EAddario/llama.cpp/tree/quantize

Would love to hear if you can achieve smaller sizes at higher quality!

Post

Experimental eaddario/DeepSeek-R1-Distill-Llama-8B-GGUF is now available.

I got close to achieving a 10% reduction in size, but quality started to deteriorate, so this version has been pruned more conservatively. Sizes are, on average, about 8% smaller with only a very small penalty (< 1%) in quality.

After trying different models and parameter counts, I suspect the best I'll be able to do with the current process is between 6% and 8% reduction, so I've decided to try a different approach. I'll publish the process and findings next.

For background: https://huggingface.co/posts/eaddario/832567461491467