Ed Addario PRO (eaddario)

AI & ML interests

None yet

Recent Activity

* Updated a model about 3 hours ago: eaddario/Qwen3-30B-A3B-pruned-GGUF
* Published a model about 7 hours ago: eaddario/Qwen3-30B-A3B-pruned-GGUF
* Updated a model about 12 hours ago: eaddario/Qwen3-30B-A3B-GGUF

Organizations

Hugging Face Discord Community

Posts (9)

Layer-wise and Pruned versions of google/gemma-3-12b-it

After enhancing llama.cpp to handle user-defined quantization levels for arbitrary tensors (https://github.com/ggml-org/llama.cpp/pull/12511), I have added an option to prune whole layers (https://github.com/ggml-org/llama.cpp/pull/13037), and have published two versions of google/gemma-3-12b-it for demo and testing purposes:

* Tensor-wise: eaddario/gemma-3-12b-it-GGUF
* Pruned: eaddario/gemma-3-12b-it-pruned-GGUF

Even though the Perplexity scores of the pruned version are three times higher, the ARC, HellaSwag, MMLU, TruthfulQA and WinoGrande scores hold up remarkably well, considering two layers were removed (26 and 29). This seems to support Xin Men et al.'s conclusions in ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (2403.03853).
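
For anyone who wants to experiment with the idea outside of llama.cpp, the sketch below shows one way to drop whole decoder layers with the transformers API before re-exporting. It is illustrative only: it assumes the common `model.model.layers` ModuleList layout, the loader class and attribute paths may differ for Gemma 3's multimodal checkpoints, and it is not the procedure used to produce the published GGUF files (those were pruned via the llama.cpp PR linked above).

```python
# Illustrative sketch only: drop decoder layers 26 and 29 before re-exporting.
# Assumes the common `model.model.layers` ModuleList layout; Gemma 3's multimodal
# wrapper may expose the decoder blocks under a different attribute path.
import torch
from torch import nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it", torch_dtype=torch.bfloat16
)

layers_to_prune = {26, 29}  # the layers removed in the pruned GGUF above
kept = nn.ModuleList(
    [layer for i, layer in enumerate(model.model.layers) if i not in layers_to_prune]
)
model.model.layers = kept
model.config.num_hidden_layers = len(kept)

model.save_pretrained("gemma-3-12b-it-pruned")
```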

A results summary is in the model card, and the full test results are in the ./scores directory. Questions and feedback are always welcome.
HF community survey: What is an acceptable Perplexity (PPL) degradation?

An area of personal research is finding ways to shrink the size of LLMs without incurring a noticeable loss of capability. All the models in my repo have been generated by quantizing different tensors at different levels based on how much they influence the inference process (see each model's card for more details). This approach produces, on average, a ~10% size reduction with < 1% PPL penalty.
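
To make that concrete, here is a purely illustrative sketch of what a per-tensor quantization plan could look like: the tensors judged to be most influential stay at a higher-precision type, everything else drops to a lower one. The scoring, cutoff and type names below are placeholders for illustration, not the actual recipe used for the models in the repo (that is described in each model card).

```python
# Conceptual sketch only: assign a higher-precision quant type to the tensors
# judged most influential, and a lower-precision type to the rest. The influence
# scores, cutoff and type names are placeholders, not the published recipe.
from typing import Dict

def build_quant_plan(influence: Dict[str, float],
                     top_fraction: float = 0.25,
                     high: str = "Q6_K",
                     low: str = "Q4_K") -> Dict[str, str]:
    """Map each tensor name to a quantization type based on its influence score."""
    ranked = sorted(influence, key=influence.get, reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    return {name: (high if i < cutoff else low) for i, name in enumerate(ranked)}

# Toy example with made-up influence scores
scores = {"blk.0.attn_v.weight": 0.92, "blk.0.ffn_down.weight": 0.61,
          "blk.1.attn_v.weight": 0.88, "blk.1.ffn_up.weight": 0.34}
print(build_quant_plan(scores))
```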

I'm now focusing on pruning (whole-layer removal) as a way to achieve bigger size reductions, but this comes at the cost of much higher PPL degradation.

So, the question for the HF community is: what is the lowest/worst PPL correlation coefficient (𝜌PPL) you'd consider acceptable for a quantized model? (e.g. 99%? 95%? 90%? etc.)

To clarify, by 𝜌PPL I mean the Cor(ln(PPL(Q)), ln(PPL(base))) statistic generated by llama-perplexity.
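
For reference, here is a small illustrative sketch of that statistic: the Pearson correlation between the natural logs of the per-chunk perplexities of the quantized and base models. The arrays below are made-up numbers purely to show the calculation; in practice llama-perplexity reports the value itself, as noted above.

```python
# Illustrative sketch only (not llama-perplexity itself): Pearson correlation
# between the log of per-chunk perplexities of a quantized model and its base.
import numpy as np

def ppl_correlation(ppl_quant: np.ndarray, ppl_base: np.ndarray) -> float:
    """Cor(ln(PPL(Q)), ln(PPL(base))) over matching evaluation chunks."""
    return float(np.corrcoef(np.log(ppl_quant), np.log(ppl_base))[0, 1])

# Made-up per-chunk perplexities, just to show the calculation
base = np.array([7.8, 8.1, 9.4, 6.9, 10.2])
quant = np.array([8.0, 8.5, 9.9, 7.1, 10.8])
print(f"rho_PPL = {ppl_correlation(quant, base):.4f}")
```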