Iwan Kawrakow

ikawrakow


ikawrakow's activity


In case this is of interest, here are some notes on quantizing L4-Scout with ik_llama.cpp


Maybe my disagreement comes from a fundamental misunderstanding

The way I understand it, PPL is basically the likelihood that the model's next token prediction matches the corpus

KLD is the likelihood of one model's next token prediction matching another model's prediction

Your understanding is correct. The magic happens because we are not looking at a single token prediction, where we get the predicted probabilities and, in the case of PPL, accumulate -ln(predicted probability of the observed next token), while for KLD we accumulate the sum over all tokens in the vocabulary of p(full model)_i ln(p(full model)_i / p(approximate model)_i). Instead, we run that over a number of contexts (based on which the LLM predicts the next-token probabilities) and average the results. Which means that, in the case of PPL, we are averaging the log likelihoods over the probability of observing a given token with a given preceding context in the test corpus. In the case of KLD, after averaging over the test contexts, and assuming the full model does predict reasonable token probabilities, we end up averaging the log likelihoods over the exact same probability distribution. That's why the two things are equivalent and fall on a straight line.
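For concreteness, here is a minimal sketch of those two accumulations (not the actual llama.cpp implementation; the names and the use of double precision are just for illustration):

```cpp
#include <cmath>
#include <vector>

// Running totals over the evaluated contexts.
struct Accum {
    double nll = 0.0;   // sum of -ln p_quant(observed token)  -> ln PPL after dividing by n
    double kld = 0.0;   // sum of per-context KL divergences   -> mean KLD after dividing by n
    int    n   = 0;     // number of evaluated contexts
};

// p_full and p_quant are the next-token probabilities predicted for the same
// context by the full model and by the approximate (e.g. quantized) model.
void add_context(Accum& a, const std::vector<double>& p_full,
                 const std::vector<double>& p_quant, int observed_token) {
    // PPL contribution: only the token actually observed in the test corpus matters.
    a.nll += -std::log(p_quant[observed_token]);
    // KLD contribution: sum over the whole vocabulary, weighted by the full model.
    for (std::size_t i = 0; i < p_full.size(); ++i)
        a.kld += p_full[i] * std::log(p_full[i] / p_quant[i]);
    a.n += 1;
}

// After all contexts: PPL = exp(nll / n), mean KLD = kld / n.
```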

I don't see a simple explanation for why the top token probability also behaves in this same simple way, but that's what we observe.

I did initially add the KLD stuff to llama.cpp, and I wish I had done it earlier, as the mean KLD uncertainty is much lower than the PPL uncertainty for the same number of evaluated contexts. That would have been immensely useful in the early days. When I started working on quantization for llama.cpp there was no CUDA support at all, and a PPL calculation over Wikitext2 took 70 minutes for a 7B model running on the CPU (on my Ryzen-7950X using OpenBLAS; quantized it was slower). When the first CUDA support was added, the PPL calculation time for a 7B model dropped to 18 minutes (it is less than a minute now on the same 4080 GPU!). That was the performance we had when I published k-quants. But today, given that PPL is so fast, I personally find it a hassle to first compute token probabilities with the base model (which generates a huge file, and I'm notoriously short on disk space, so I never keep these files around) and then run KLD with the quantized model, only to learn essentially nothing new that I didn't already know.


Here are two graphs showing mean KLD and top token probability as a function of ln(PPL). The data is taken from the above table. The correlation coefficients when fitting the data points with a straight line (the red lines in the plots) are 0.988 for KLD and -0.978 for top token probability. Statistically, this means that if I computed (or observed) one of these three quantities, I wouldn't need to compute (or observe) the other two, as I would learn nothing new. I took the liberty of adding error bars to the top token probability, which gives a more realistic view of the departures from the expectation.

[Plot: ppl_vs_kld.png, mean KLD vs ln(PPL)]

[Plot: ppl_vs_top1.png, top token probability vs ln(PPL)]


DeepSeek-V3/R1, which are the best open-weight models, do have a pretty low PPL on Wikitext. Having memorized Wikipedia is not a bad thing to have in your back pocket when you are a large language model. Apart from this, a lower PPL on Wikitext basically always translates to a lower PPL on any other text, and that means the model is better able to anticipate what will come next, given the preceding context. In other words, exactly what we expect an LLM to be able to do.

PPL is the same thing as KLD (I know we disagree here, but if you want an explanation of why PPL goes up with vocabulary size, that's the simplest route), and KLD tends to increase with increasing dimensionality of the distribution. The higher the dimensionality (the vocabulary in this case), the harder it is to match the underlying distribution, and so one tends to observe higher KLD values for approximations of the true distribution. Hence, an equally capable model with a larger vocabulary will tend to produce a higher PPL than a model with a smaller vocabulary.


The vocabulary is 200k tokens, so that necessarily results in a higher PPL. And the more you instruction tune it, the worse the PPL gets.

Interestingly enough, PPL for the full model goes down by a non-negligible amount if you activate 2 experts instead of 1. I wonder if they were planning to have 2 active experts (and did most of the training with that), but then changed their mind at the last moment to improve inference speed, and didn't quite manage to reach optimality with a single active expert. With past MoE models, one could always gain a bit by using more active experts than prescribed by the model for very low-bpw quantization, but the base model always became worse if you tried using more experts.


Nice post, thanks for sharing. What is the test corpus for calculating PPL/KLD/etc.?

replied to bartowski's post 7 months ago

Yes, and ln(PPL(Q)/PPL(base)) from my understanding measures the difference between the probabilities for the "correct" tokens according to the test dataset (at least for the second half of each chunk (same as for KLD)). Which means it would be possible to somehow keep perplexity the same or better while also increasing KLD (by making the non-"correct" tokens have different probabilities).

Imagine we want to know how often drivers in a given city run red traffic lights. I, being old and lazy, just grab a comfy chair, pick a traffic light, and start counting the number of cars passing by and the number of red light violations. You, being young and energetic, convince the city to install a camera with AI image processing and all the jazz on every single traffic light in the city. A day after we started counting, my estimate of the red light violation rate is quite different from yours, and you feel really happy that all the effort you put into this project has paid off. But as time passes, my estimate becomes really close to yours. Provided I was not dumb enough to pick a traffic light in a quiet residential area where no more than 100 drivers ever pass by, and instead selected a traffic light where the people driving through are a statistically representative sample of the drivers in the city, statistics tells us that my estimate will be the same as yours.

With that, let's go back to PPL and KLD. I, being old and lazy, just run PPL and look at a single token (traffic light) at any given time, without any prior preparation. You, being young and energetic, have computed and stored token probabilities for all models you are interested in so that, when you run KLD with a quantized model, you can compare token probabilities for all tokens in the vocabulary (traffic lights in the city). After a few tokens (a short period of traffic light observation), my estimate of ln PPL(Q)/PPL(B), accumulated only over the observed tokens (the probability for someone to run a red light at the traffic light I picked), is quite different from your <KLD(B, Q)> = sum over the vocabulary of p_i ln(p_i/q_i) (the probability for someone to run a red light at all traffic lights in the city). But as I look at more and more contexts and tokens (days of counting traffic light violators, with different drivers passing by), my PPL becomes essentially the same as your KLD, assuming the test dataset I used (the traffic light I picked) is a representative sample of text written by humans (the drivers in the city).

So, while it is possible to increase KLD for a given context by appropriately modifying the probabilities of tokens that were not observed, ln PPL(Q)/PPL(B) will be the same as the expectation value of KLD, provided the test dataset is representative of the kind of language tasks we are interested in.

This makes me wonder: do all of the token probabilities have to match closely for a quantized model to still be good?

Good question. Suppose I'm a speaker of a really niche language and I'm interested in evaluating the performance of an LLM that claims to support my niche language. My language being so niche means that a) there was not a lot of data in my niche language included in the training, and, more importantly, b) the tokens representing my language are a small fraction of the tokens included in the vocabulary. The LLM, being SOTA and multilingual, is quite large, so I can only run it quantized on my computer. I have found a very good test dataset in my language and I have computed PPL(Q), PPL(B) and <KLD(Q, B)>. Which one should I trust more? <KLD(Q, B)>, which is heavily dominated by tokens that are completely irrelevant for my language, or PPL(Q)/PPL(B), where the averaging of ln(1/p_i) took place only over the tokens of interest to me?

To go back to our traffic light example, suppose there is a major intercity road passing through the city, where you also installed cameras. Suppose traffic on that road is so heavy that half of the cars passing through traffic lights in the city do so on that road, with the vast majority of them being from out of town. Guess what: in that case my old-and-lazy approach of sitting at a traffic light that is not on the intercity road and counting might actually give a better answer to the question of how many red light violators there are among our city drivers than your fancy AI camera counting at all traffic lights.

replied to bartowski's post 7 months ago

To that end, do you happen to know if, when quantizing from BF16, it gets converted to FP16 first? Does it even matter? BF16 -> Q8 vs BF16 -> FP16 -> Q8, I wonder how different it would be. Gut instinct says it's in the 0.01% range.

In terms of quality: bf16 -> Q8_0 ≈ bf16 > fp16 >= fp16 -> Q8_0.

bf16 uses 8 bits for the exponent and the remaining 8 bits for sign+mantissa. When you take a bunch of bf16 values and scale them with the maximum, as is done for Q8_0 quantization, the exponent is pretty much gone and we are dealing with 8 bits of sign and precision, so Q8_0 can almost always perfectly match bf16 model weights.
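To make that concrete, here is a minimal sketch of block-wise Q8_0-style quantization. llama.cpp uses blocks of 32 weights with a per-block scale stored in fp16; the sketch keeps the scale as a plain float and skips the storage details:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct BlockQ8 {
    float  d;       // per-block scale = amax / 127
    int8_t q[32];   // quantized weights
};

// Quantize one block of 32 float weights (already converted from bf16).
BlockQ8 quantize_block_q8(const float* x) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    BlockQ8 b;
    b.d = amax / 127.0f;
    const float id = b.d > 0.0f ? 1.0f / b.d : 0.0f;
    // Dividing by the block maximum removes the shared exponent range, so each
    // weight is stored with 8 bits of sign+precision, comparable to the
    // 1 sign + 7 mantissa bits a bf16 weight carries.
    for (int i = 0; i < 32; ++i) b.q[i] = (int8_t)std::lroundf(x[i] * id);
    return b;
}
```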

But even 6-bit quantization with my new IQ6_K quants beats fp16 when quantized from bf16 for Phi-3.5-mini. This holds not just for PPL, but also for HellaSwag and MMLU.

Concerning PPL vs KLD: I'm not sure how it became fashionable around the Internet to dismiss PPL and to prefer KL divergence (KLD), given that they are basically the same thing. KLD measures the difference between 2 probability distributions, typically between a "ground truth" and a model prediction. In the context of LLMs, this is basically the difference between token probabilities computed by a "better" model (let's call it B, as in better) and an "approximate" model (let's call it A, as in approximate). E.g., quantized vs base, but in this context fp16 vs bf16. LLMs need a context to predict probabilities. Looking at just one context does not make sense, as KLD(B, A, context) will vary wildly from one context to another. So one computes an expectation value <KLD(B, A)> over a set of contexts (a test dataset). It is easy to show that

ln PPL(A)/PPL(B) = <KLD(B, A)>

when computed over the same test dataset (just go over the KLD implementation I added to llama.cpp to see it; ln above denotes the natural logarithm).
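As a purely synthetic illustration of this relation (not tied to any real model): draw the "observed" token from the better model's distribution, accumulate the log-likelihood ratio at the observed tokens, and compare it against the KL divergence averaged over the same contexts. The two estimates converge to the same value as the number of contexts grows:

```cpp
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int n_vocab    = 500;     // toy vocabulary size
    const int n_contexts = 50000;   // number of simulated contexts

    std::mt19937 rng(1234);
    std::normal_distribution<double> noise(0.0, 1.0);

    double sum_log_ratio = 0.0;     // accumulates ln p_B(token)/p_A(token) at the observed tokens
    double sum_kld       = 0.0;     // accumulates sum_i p_B,i ln(p_B,i / p_A,i)

    std::vector<double> p_b(n_vocab), p_a(n_vocab);
    for (int c = 0; c < n_contexts; ++c) {
        // Random logits for the "better" model B; A is a perturbed copy of B.
        double zb = 0.0, za = 0.0;
        for (int i = 0; i < n_vocab; ++i) {
            const double lb = noise(rng);
            p_b[i] = std::exp(lb);                      zb += p_b[i];
            p_a[i] = std::exp(lb + 0.3 * noise(rng));   za += p_a[i];
        }
        for (int i = 0; i < n_vocab; ++i) { p_b[i] /= zb; p_a[i] /= za; }

        // The "observed" next token is drawn from the better model's distribution.
        std::discrete_distribution<int> tok(p_b.begin(), p_b.end());
        const int t = tok(rng);
        sum_log_ratio += std::log(p_b[t] / p_a[t]);

        for (int i = 0; i < n_vocab; ++i)
            sum_kld += p_b[i] * std::log(p_b[i] / p_a[i]);
    }

    std::printf("ln PPL(A)/PPL(B) estimate : %.4f\n", sum_log_ratio / n_contexts);
    std::printf("mean KLD(B, A)            : %.4f\n", sum_kld / n_contexts);
    return 0;
}
```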

replied to bartowski's post 7 months ago

There is bf16 GPU support in this repository. On my GPU (RTX-4080) it is about 20% slower than fp16 for imatrix calculations. You can use it exactly the same way as mainline llama.cpp.

Concerning only 0.03% of bf16 weights being squashed when converting to fp16: 0.03% is a very small fraction of the weights, but that does not mean the effect is just as small. There has been a lot of talk about the impact of "outliers", and bf16 weights that need to be "squashed" when converting to fp16 are exactly such outliers. I observe a 0.4% difference in perplexity for Phi-3.5-mini between bf16 and fp16. I have only tested a small model because the GPU implementation in my repository does not (yet) support a bf16 KV-cache, so I have to run on the CPU, where a bf16 KV-cache is available (when the CPU provides native bf16 support, e.g. Zen4). Here is a full breakdown of the perplexity differences for Phi-3.5-mini (the context is the usual 512 tokens):

PPL(fp16 weights, fp16 KV-cache) =  6.5816
PPL(bf16 weights, fp16 KV-cache) =  6.5649
PPL(bf16 weights, bf16 KV-cache) =  6.5556
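As a toy illustration of what "squashed" means here (this is not the conversion code of any actual converter): bf16 keeps the full fp32 exponent range, while fp16 tops out at 65504 and loses precision below its normal range, so a rare large-magnitude weight cannot survive the bf16 -> fp16 round trip unchanged:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Truncate an fp32 value to bf16 by keeping its top 16 bits (rounding ignored).
float to_bf16(float x) {
    uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u &= 0xffff0000u;
    float y;
    std::memcpy(&y, &u, sizeof u);
    return y;
}

// Crude model of fp16's dynamic range only: values above 65504 overflow to
// infinity. Values below the normal range (~6.1e-5) keep fewer significant
// bits as subnormals; that precision loss is not modelled here.
float fp16_range_only(float x) {
    const float FP16_MAX = 65504.0f;
    if (std::fabs(x) > FP16_MAX) return x > 0.0f ? INFINITY : -INFINITY;
    return x;
}

int main() {
    const float weights[] = { 0.0123f, -1.0e5f };   // a typical weight and an "outlier"
    for (float w : weights)
        std::printf("w = %-12g bf16: %-12g bf16 -> fp16: %g\n",
                    w, to_bf16(w), fp16_range_only(to_bf16(w)));
    return 0;
}
```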
New activity in ikawrakow/validation-datasets-for-llama.cpp 7 months ago

Error with test data (#2, opened 8 months ago by fedric95)
New activity in ikawrakow/mixtral-instruct-8x7b-quantized-gguf about 1 year ago

New activity in ikawrakow/mixtral-8x7b-quantized-gguf about 1 year ago

Broken M quants (#2, opened about 1 year ago by Artefact2)