Squeezing Tensor Bits: the quest for smaller LLMs
An area of personal interest is finding ways to optimize the inference performance of LLMs deployed in resource-constrained environments on commodity hardware: desktops, laptops, mobiles, edge devices, etc.
The method I'm using to produce these experimental versions (for example, eaddario/DeepSeek-R1-Distill-Llama-8B-GGUF) is explained in https://medium.com/@eaddario/squeezing-tensor-bits-the-quest-for-smaller-llms-86b23bd052ca
At a high level, it involves using a custom version of the llama-quantize tool to selectively quantize different tensors at different precision levels. On average, a 10% or greater reduction in model size is possible with little loss of quality.
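As a rough sketch of the idea (the exact per-tensor flag and its syntax here are an assumption about the modified tool; --token-embedding-type and --output-tensor-type are existing llama-quantize options), a selective quantization run might look something like this:

```bash
# Sketch only: quantize the bulk of the model at Q4_K_M while keeping
# quality-sensitive tensors (embeddings, output head, attention V) at a
# higher precision. The --tensor-type override is assumed from the fork;
# model filenames are illustrative.
./llama-quantize \
  --token-embedding-type q6_k \
  --output-tensor-type q6_k \
  --tensor-type attn_v=q6_k \
  DeepSeek-R1-Distill-Llama-8B-F16.gguf \
  DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
  q4_k_m
```

The trade-off is picked per tensor: tensors that contribute most to output quality stay at higher bit-widths, while the rest are squeezed harder than a uniform quantization scheme would allow.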
There are two open PRs to merge these changes back into the core project, but until then the modified version is available on GitHub: https://github.com/EAddario/llama.cpp/tree/quantize
Would love to hear if you can achieve smaller sizes at higher quality!