Please make i1 quants of my latest 72b model

#1
by rombodawg - opened

I really appreciate your work. I just released my best model yet. I would love it if you could make some i1 quants of it.

https://huggingface.co/Rombo-Org/Rombo-LLM-V3.0-Qwen-72b

Sure! It's queued, would be a shame if we didn't have quants for it :)

You can check for progress at http://hf.tst.eu/status.html or regularly check the model
summary page at https://hf.tst.eu/model#Rombo-LLM-V3.0-Qwen-72b-GGUF for quants to appear.

mradermacher changed discussion status to closed

@rombodawg hi! i would like to know why we would use i1 quants at all... like - the general consensus i think is that Q4 is pretty good, and anything below is kinda not worth it anymore, where just getting a smaller model with a Q4 makes more sense performance-wise.

i have heard of those 1.58 bit quants, which apparently perform surprisingly well, but im assuming those are not the same...

would you mind explaining it to me? I am very curious :)

i1-IQ4_XS, for example, is better quality than Q4_K_M at ~20% smaller size. And i1-IQ1_S quants are ~1.58 bpw and perform "surprisingly well" (but much worse than Q4_K).

@mradermacher oooh so an i1 quant is not just a "one bit quant"? but some interesting in-between? that is super cool! :o

is that some new fancy quant method i completely missed? or has it just always been around?

(also, please @me, otherwise i don't get the notification)

@Smorty100 No, it's just our naming convention for imatrix quants.

@mradermacher This model looks like a banger, can we get i1 quants of it please? It's an uncensored R1 distill 70B. No China censorship.
https://huggingface.co/perplexity-ai/r1-1776-distill-llama-70b

Queued, on our newest, experimental quant node, too. Keep your fingers crossed that it works.

You can check for progress at http://hf.tst.eu/status.html or regularly check the model
summary page at https://hf.tst.eu/model#r1-1776-distill-llama-70b-GGUF for quants to appear.

@mradermacher Just to let you know, I created a new quantization method with the help of @bartowski; feel free to add it to your quant list.

https://www.reddit.com/r/KoboldAI/comments/1j6bx40/the_highest_quality_quantization_varient_gguf_and/

@rombodawg this myth is often repeated (and in fact, I used these quantization types when I started quantising at the beginning of last year; most of my early quants are of this or a similar type), but the only data anybody has ever shown me indicates that it is not actually true.

So I call bullshit on this. "Me and bartowski figured out" does not cut it in science. You actually have to demonstrate it with evidence. Anecdotal claims are not evidence.

Also, calling these "new quantisation types" feels like false advertising. These are not new quantisation types, merely variants of existing ones that were found not to be worth it in the past.

I've also left a comment on your reddit post, but I am pleased that the majority of comments there agree with this assessment. I don't think bullshitting people is a service to this community.

@mradermacher I can understand your frustration, but I assure you, I heavily tested the quants compared to regular Qx_K_L and they did show noticeable improvements. I personally don't believe in bullshitting anyone, as I also find no value in pretending something is good just for the sake of clout.

I only shared the method and the quant because my own personal testing proved they were superior. I don't test using perplexity, or noise, or even benchmarks. I test models side by side using the same settings and seeds for hours at a time, with a wide variety of prompts that cover a broad range of tasks, and decide for myself whether the results are better or worse.

This method of testing has never let me down to this day, ever since the leak of llama-1. And it's why all the models on my HF pages are high quality.

But we can agree to disagree if you don't feel the same way, I'm indifferent 🙂

That just leaves more fun for me

I only shared the method and the quant because my own personal testing proved they were superior. I don't test using perplexity, or noise, or even benchmarks. I test models side by side using the same settings and seeds for hours at a time, with a wide variety of prompts that cover a broad range of tasks, and decide for myself whether the results are better or worse.

Please measure KL-divergence, correct token probability and same token probability. If you provide convincing numbers to support your claims, I can reproduce them, and if your quants offer a better quality/size ratio we will for sure start doing them. I spent months comparing the quality of different quants. You can find my data under https://www.nicobosshard.ch/LLM-Eval_Quality_v1.tar.zst and see plots under https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/2 - there are good reasons we are skeptical.

Every few weeks someone comes along and introduces their new quant method, yet when we test it, it turns out to be worse than what we currently use. We did not randomly pick which quants we do, but carefully decided based on over 500 GPU hours' worth of quality measurement data I collected.
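
For anyone curious what these three metrics actually measure, here is a minimal sketch computed from per-token logits - purely an illustration with made-up toy arrays, not llama.cpp's implementation:

```python
# Minimal sketch of the three metrics mentioned above, computed from the
# per-position logits of a reference model and a quantized model.
# The toy arrays at the bottom are made up for illustration only.
import numpy as np

def softmax(logits):
    # numerically stable softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def quant_quality_metrics(ref_logits, quant_logits, target_ids):
    """ref_logits/quant_logits: (positions, vocab); target_ids: true next tokens."""
    p = softmax(ref_logits)    # reference distribution
    q = softmax(quant_logits)  # quantized-model distribution
    # KL(P || Q), averaged over positions
    kld = np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1))
    # probability the quantized model assigns to the correct (reference-text) token
    correct_tok_p = np.mean(q[np.arange(len(target_ids)), target_ids])
    # how often both models would pick the same top token
    same_top_p = np.mean(p.argmax(axis=-1) == q.argmax(axis=-1))
    return kld, correct_tok_p, same_top_p

# toy example: 4 positions, vocabulary of 5 tokens
rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 5))
quant = ref + rng.normal(scale=0.05, size=(4, 5))  # quantization ~ small perturbation
targets = np.array([1, 3, 0, 2])
print(quant_quality_metrics(ref, quant, targets))
```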

@rombodawg I'm almost certain that FP32 should never be inside a quant. Models are trained in BF16. Someone on HuggingFace showed that for LLMs going from BF16 to FP16 does not matter, by doing BF16 => FP32 and comparing all the actual floating-point values with BF16 => FP16 => FP32. All values were either exactly the same or so close it won't matter - especially not for quants. So at least store them in FP16 and don't waste everyone's RAM by storing them in FP32.

Someone on HuggingFace showed that for LLMs going from BF16 to FP16 does not matter, by doing BF16 => FP32 and comparing all the actual floating-point values with BF16 => FP16 => FP32

oh hey, that was me (unless someone else did it too)

https://huggingface.co/posts/bartowski/928757596721302

actually it must have been someone different, I did BF16 -> FP32 (calculate) -> FP16 -> FP32 (calculate)
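
A rough toy version of that value comparison could look like the following (it uses a random tensor as a stand-in for real model weights, so the exact percentages will differ from a real checkpoint):

```python
# Toy sketch: upcast BF16 values straight to FP32, and separately round-trip
# them through FP16, then count how many values change. A random tensor
# stands in for real model weights here.
import torch

w_bf16 = torch.randn(1_000_000).to(torch.bfloat16)

direct = w_bf16.to(torch.float32)                                        # BF16 -> FP32
via_fp16 = w_bf16.to(torch.float32).to(torch.float16).to(torch.float32)  # BF16 -> FP32 -> FP16 -> FP32

changed = (direct != via_fp16).float().mean().item()
max_rel = ((direct - via_fp16).abs() / direct.abs().clamp_min(1e-12)).max().item()
print(f"values changed by the FP16 round trip: {changed:.4%}, max relative error: {max_rel:.2e}")
```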

@nicoboss
The reason I went with FP32 is that testing Q8_0 and FP16 on MMLU showed that Q8_0 is often higher quality than FP16. And when hand-testing, the FP32 versions performed better than the Q8_0 versions. That is my justification for using FP32 over Q8_0 and FP16.


oh hey, that was me (unless someone else did it too)
https://huggingface.co/posts/bartowski/928757596721302

Wow, thank you so much for your BF16/FP16 analysis and huge thanks for linking the post. It helped me with many decisions. It indeed was you; I just misremembered some details, as that was half a year ago and I was unable to find the post again to refresh my memory.

Awesome, so only 0.03% were squashed. That's way too little to matter for sure.

The reason I went with FP32 is that testing Q8_0 and FP16 on MMLU showed that Q8_0 is often higher quality than FP16.

MMLU and other evals are not the right tool to compare such small differences. Your measurements are probably all within the margin of error. Let me show you some of my plots, where I also labeled the measurement error.

[Plots for Meta-Llama-3.1-405B-Instruct: KL Divergence; Perplexity; Probability of quant generating the same token; Correct token probability; Eval (ARC, MMLU, Winogrande)]

As you can see, everything except the evals has a low measurement error. The evals fluctuate by multiple percentage points, and this despite me using a weighted average of ARC Easy, ARC Challenge, MMLU, and Winogrande, which further reduces the measurement error.
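
As an illustration of how such a weighted average combines the individual benchmark scores and their error bars (all numbers below are invented, not actual measurements):

```python
# Toy illustration of a weighted average of eval scores and its propagated
# standard error, assuming the benchmarks are independent.
# All scores, errors and weights are invented placeholders.
import math

# (score, standard error, weight)
evals = {
    "ARC Easy":      (0.80, 0.012, 1.0),
    "ARC Challenge": (0.55, 0.015, 1.0),
    "MMLU":          (0.70, 0.004, 2.0),
    "Winogrande":    (0.75, 0.013, 1.0),
}

total_w = sum(w for _, _, w in evals.values())
mean = sum(w * s for s, _, w in evals.values()) / total_w
se = math.sqrt(sum((w / total_w) ** 2 * e ** 2 for _, e, w in evals.values()))
print(f"weighted average: {mean:.3f} +/- {se:.3f}")
```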

I highly recommend you instead plot KL-divergence, correct token probability and same token probability. You can measure them using llama-perplexity. Once you do, I would expect FP16 and FP32 to perform exactly the same. What would be very interesting is whether any meaningful difference between Q8 and FP16/FP32 can be observed. I would be extremely surprised if you still see a difference between FP16 and FP32, but I'm open to surprises.
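
A sketch of the two-pass llama-perplexity workflow for collecting these numbers - the flag names reflect recent llama.cpp builds, so double-check `llama-perplexity --help` for your version, and the file paths are placeholders:

```python
# Two-pass KL-divergence measurement with llama.cpp's llama-perplexity tool.
# Paths and model names are placeholders; flag names may differ per version.
import subprocess

TEXT = "wiki.test.raw"        # evaluation text
BASE = "model-f16.gguf"       # unquantized reference model
QUANT = "model-IQ4_XS.gguf"   # quant under test
LOGITS = "base-logits.kld"    # file holding the reference logits

# pass 1: run the reference model and save its logits
subprocess.run(["llama-perplexity", "-m", BASE, "-f", TEXT,
                "--kl-divergence-base", LOGITS], check=True)

# pass 2: run the quant against the saved logits; the printed statistics
# include the mean KL divergence and top-token agreement versus the reference
subprocess.run(["llama-perplexity", "-m", QUANT, "-f", TEXT,
                "--kl-divergence-base", LOGITS, "--kl-divergence"], check=True)
```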

@nicoboss This is really good info. I'm going to have to retest Q6_K_L, Q8_0, FP16 and FP32 and get to the bottom of this. I really appreciate the insight.

@nicoboss This is really good info. I'm going to have to retest Q6_K_L, Q8_0, FP16 and FP32 and get to the bottom of this. I really appreciate the insight.

Awesome! I'm looking forward to your results. Ideally you would retest with one of the original Qwen 2.5 series of models, as for all of those (from smallest to largest, in base and instruct) we have collected massive amounts of quant quality and quant performance measurements, so we could easily compare your results with ours. If you prefer to retest with a different model, feel free to do so, but please let me know which one you used, as quant quality depends on model size and architecture. For example, large monolithic models tend to lose less quality when quantized.
