2.25 bpw perplexity

by malamen4 - opened

I think there's an issue with 2.25 bpw, and how measurements are applied when merging.

2.0 bpw ppl: 4.99
3.0 bpw ppl: 3.60

2.25 bpw ppl: 5.01
I wasn't able to do any better when merging with the measurements json myself.

2.06 bpw with override.yaml below, ppl: 4.13

python eval\model_diff.py -ma GLM-4.6-exl3-3.0bpw_H6 -mb GLM-4.6-exl3-2.25bpw_H6 -r 5

-- A perplexity:  3.60285360
 -- B perplexity:  5.01247883
 -- A label in top-K:
      K = 1: 0.7057
      K = 2: 0.8197
      K = 3: 0.8622
      K = 4: 0.8850
      K = 5: 0.8987
 -- B label in top-K:
      K = 1: 0.6419
      K = 2: 0.7669
      K = 3: 0.8175
      K = 4: 0.8454
      K = 5: 0.8645
 -- Top-K agreement, A vs B:
      K = 1: 0.7908
      K = 2: 0.4579
      K = 3: 0.2180
      K = 4: 0.0905
      K = 5: 0.0363
 -- KL divergence (A, B):  0.57431712
 -- KL divergence (B, A):  0.45050738

python eval\model_diff.py -ma GLM-4.6-exl3-2.0bpw_H6 -mb GLM-4.6-exl3-2.25bpw_H6 -r 5

-- A perplexity:  4.99270217
 -- B perplexity:  5.01247883
 -- A label in top-K:
      K = 1: 0.6314
      K = 2: 0.7601
      K = 3: 0.8177
      K = 4: 0.8474
      K = 5: 0.8627
 -- B label in top-K:
      K = 1: 0.6419
      K = 2: 0.7669
      K = 3: 0.8175
      K = 4: 0.8454
      K = 5: 0.8645
 -- Top-K agreement, A vs B:
      K = 1: 0.8126
      K = 2: 0.5179
      K = 3: 0.2731
      K = 4: 0.1253
      K = 5: 0.0532
 -- KL divergence (A, B):  0.34435631
 -- KL divergence (B, A):  0.32301066

I made an override.yaml to recompile and merge the 2.0 bpw quant with the 3.0 bpw one, which comes out to 2.06 bpw.

sources:
  - id: 1
    model_dir: GLM-4.6-exl3-2.0bpw_H6
  - id: 2
    model_dir: GLM-4.6-exl3-3.0bpw_H6
overrides:
  - key: "*.self_attn.*"
    source: 2
  - key: "*.shared_experts.*"
    source: 2
  - key: "*.layers.[0-3].*"
    source: 2
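
(For reference, applying this is a recompile with the override file. A sketch of the invocation, assuming the util/recompile.py flags that turboderp shows later in the thread; the output name matches the 2.06 bpw model tested below:)

python util/recompile.py -i GLM-4.6-exl3-2.0bpw_H6 -o GLM-4.6-exl3-2.06bpw -or override.yaml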

ppl: 4.13587955

python eval\model_diff.py -ma GLM-4.6-exl3-2.0bpw_H6 -mb GLM-4.6-exl3-2.06bpw -r 5

-- A perplexity:  4.99270217
 -- B perplexity:  4.13587955
 -- A label in top-K:
      K = 1: 0.6314
      K = 2: 0.7601
      K = 3: 0.8177
      K = 4: 0.8474
      K = 5: 0.8627
 -- B label in top-K:
      K = 1: 0.6725
      K = 2: 0.7942
      K = 3: 0.8430
      K = 4: 0.8695
      K = 5: 0.8851
 -- Top-K agreement, A vs B:
      K = 1: 0.8053
      K = 2: 0.5017
      K = 3: 0.2591
      K = 4: 0.1209
      K = 5: 0.0502
 -- KL divergence (A, B):  0.33592803
 -- KL divergence (B, A):  0.36627963

Any place we could d/l it? 2.25 already looks like you won't get much ctx.

Testing with -r 5 only runs 5 rows of 2048 tokens, which is a very small sample for a perplexity test and the results are going to be noisy. Keep in mind this is a very large, sparse model. KL-div is more robust to smaller sample sizes, but the correct comparison is between the quantized model and the unquantized original. Here's what I get for the three versions:

2.00 bpw:

 -- A perplexity:  5.00308313
 -- B perplexity:  4.26057546
 -- A label in top-K:
      K = 1: 0.6324
      K = 2: 0.7613
      K = 3: 0.8164
      K = 4: 0.8469
      K = 5: 0.8648
 -- B label in top-K:
      K = 1: 0.6997
      K = 2: 0.8001
      K = 3: 0.8419
      K = 4: 0.8650
      K = 5: 0.8797
 -- Top-K agreement, A vs B:
      K = 1: 0.7212
      K = 2: 0.3501
      K = 3: 0.1388
      K = 4: 0.0503
      K = 5: 0.0160
 -- KL divergence (A, B):  0.95041441
 -- KL divergence (B, A):  0.98535391

2.25 bpw:

 -- A perplexity:  5.02164230
 -- B perplexity:  4.26057546
 -- A label in top-K:
      K = 1: 0.6405
      K = 2: 0.7640
      K = 3: 0.8158
      K = 4: 0.8452
      K = 5: 0.8641
 -- B label in top-K:
      K = 1: 0.6997
      K = 2: 0.8001
      K = 3: 0.8419
      K = 4: 0.8650
      K = 5: 0.8797
 -- Top-K agreement, A vs B:
      K = 1: 0.7704
      K = 2: 0.4215
      K = 3: 0.1862
      K = 4: 0.0753
      K = 5: 0.0259
 -- KL divergence (A, B):  0.60512452
 -- KL divergence (B, A):  0.68797545

3.00 bpw:

 -- A perplexity:  3.69533893
 -- B perplexity:  4.26057546
 -- A label in top-K:
      K = 1: 0.7013
      K = 2: 0.8153
      K = 3: 0.8587
      K = 4: 0.8814
      K = 5: 0.8951
 -- B label in top-K:
      K = 1: 0.6997
      K = 2: 0.8001
      K = 3: 0.8419
      K = 4: 0.8650
      K = 5: 0.8797
 -- Top-K agreement, A vs B:
      K = 1: 0.8376
      K = 2: 0.5506
      K = 3: 0.2960
      K = 4: 0.1453
      K = 5: 0.0659
 -- KL divergence (A, B):  0.37688994
 -- KL divergence (B, A):  0.32623222

Perplexity is all over the place, with the 3bpw model scoring significantly better than the original. But that only means it's better at predicting those specific 10k tokens, which is a poor proxy for how well the quantized model reproduces the original. KL-div directly compares the shape of the output distribution to that of the original, on the same inputs, and here we see the expected increase in accuracy as bitrate increases.
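
To make that concrete, here's a minimal sketch (not the actual eval/model_diff.py code) of how both numbers fall out of the same logits, assuming PyTorch tensors of shape [num_tokens, vocab] collected from the reference model and the quant on identical inputs:

import torch
import torch.nn.functional as F

def ppl_and_kl(ref_logits, quant_logits, labels):
    # ref_logits, quant_logits: [num_tokens, vocab] on the same token positions
    # labels: [num_tokens], the actual next tokens
    logp_ref = F.log_softmax(ref_logits.float(), dim=-1)
    logp_q = F.log_softmax(quant_logits.float(), dim=-1)
    # Perplexity only looks at the probability each model assigned to the true label
    ppl_q = torch.exp(-logp_q.gather(-1, labels.unsqueeze(-1)).mean())
    # KL(ref || quant) compares the full distributions, position by position
    kl = (logp_ref.exp() * (logp_ref - logp_q)).sum(dim=-1).mean()
    return ppl_q.item(), kl.item()

Perplexity only ever looks at the column for the true label, so a handful of easy or hard tokens can swing it on a 10k-token sample, while the KL term sums over the whole vocabulary at every position.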

I'm only really able to test up to 2.25 bpw here, but that version does seem solid (2.0 also did okay.)

Manually recompiling with the bigger attn, dense layers and shared experts makes a lot of sense, though I would probably use the 4bpw tensors for it. I would do this for the 2.25 bpw version as well, which should end up somewhere around 2.3bpw but be significantly improved. Currently still downloading the 4 bpw quant so I can try it out.

Here is 4.00 bpw for completeness:

 -- A perplexity:  3.82770390
 -- B perplexity:  4.26057546
 -- A label in top-K:
      K = 1: 0.7124
      K = 2: 0.8143
      K = 3: 0.8532
      K = 4: 0.8750
      K = 5: 0.8879
 -- B label in top-K:
      K = 1: 0.6997
      K = 2: 0.8001
      K = 3: 0.8419
      K = 4: 0.8650
      K = 5: 0.8797
 -- Top-K agreement, A vs B:
      K = 1: 0.8915
      K = 2: 0.6804
      K = 3: 0.4559
      K = 4: 0.2783
      K = 5: 0.1557
 -- KL divergence (A, B):  0.19731148
 -- KL divergence (B, A):  0.17492220

I should also mention that the strings used in the override file are globs, not regexes: * works as a wildcard, but [0-3] won't match anything (there's a small illustration of this right after the overrides below). With that in mind, I used these overrides to create two more variants:

sources:
  - id: x
    model_dir: /mnt/str/models/glm4.6/exl3/4.0bpw/
overrides:
  - key: "*.self_attn.*"
    source: x
  - key: "*.shared_experts.*"
    source: x
  - key: "*.layers.0.*"
    source: x
  - key: "*.layers.1.*"
    source: x
  - key: "*.layers.2.*"
    source: x
  - key: "*.layers.3.*"
    source: x
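
Here's a minimal illustration of why the bracket pattern silently fails, assuming the matcher treats only * as a wildcard and everything else as literal text, as described above. This is just a sketch, not the actual exllamav3 matching code:

import re

def star_glob_match(pattern, key):
    # Only '*' is a wildcard; every other character (including '[' and ']') is literal
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*")) + "$"
    return re.match(regex, key) is not None

key = "model.layers.0.self_attn.q_proj"          # representative tensor key, for illustration
print(star_glob_match("*.layers.0.*", key))      # True
print(star_glob_match("*.layers.[0-3].*", key))  # False: '[0-3]' is matched literally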

2.12 bpw:

python util/recompile.py -i .../2.00bpw -o .../2.12bpw -or .../overrides.yaml

 -- A perplexity:  4.14987601
 -- B perplexity:  4.26057546
 -- A label in top-K:
      K = 1: 0.6680
      K = 2: 0.7932
      K = 3: 0.8410
      K = 4: 0.8660
      K = 5: 0.8840
 -- B label in top-K:
      K = 1: 0.6997
      K = 2: 0.8001
      K = 3: 0.8419
      K = 4: 0.8650
      K = 5: 0.8797
 -- Top-K agreement, A vs B:
      K = 1: 0.7589
      K = 2: 0.3946
      K = 3: 0.1693
      K = 4: 0.0622
      K = 5: 0.0248
 -- KL divergence (A, B):  0.73031745
 -- KL divergence (B, A):  0.72991458

2.33 bpw (from the 2.25 bpw in this repo, same overrides):

 -- A perplexity:  4.28272396
 -- B perplexity:  4.26057546
 -- A label in top-K:
      K = 1: 0.6647
      K = 2: 0.7891
      K = 3: 0.8400
      K = 4: 0.8641
      K = 5: 0.8830
 -- B label in top-K:
      K = 1: 0.6997
      K = 2: 0.8001
      K = 3: 0.8419
      K = 4: 0.8650
      K = 5: 0.8797
 -- Top-K agreement, A vs B:
      K = 1: 0.7773
      K = 2: 0.4276
      K = 3: 0.1917
      K = 4: 0.0767
      K = 5: 0.0302
 -- KL divergence (A, B):  0.60877988
 -- KL divergence (B, A):  0.62779789

CatBench, 2.12 bpw:
image

2.33 bpw:
image

Original model (on z.ai):
image

ChatGPT 5 for reference and because it's funny (same prompt, reasoning etc. enabled):
cute_kitten

Overall these quants seem very usable.

TIL, ty!

EXL3 is looking strong into these low bpw ranges!!

CatBench

wait wat, this is real? A guy on reddit linked me here saying "Probably all need to do the SVG kitty test instead of ppl" πŸ’€

What is the prompt and was thinking enabled for these? lmao... I couldn't find it using web searches hah...

the correct comparison is between the quantized model and the unquantized original

Yeah, I have some data on my quant set, including iq1_kt (using a 1.75bpw trellis for all routed experts, but larger ones for attn/shexp/the first 3 dense layers), posted on the hf repo. I know we can't compare across systems given slightly different settings and such, but my wiki.test.raw, context-512 llama-perplexity runs look like this (yes, I used iq5_k as the "baseline" 😭; GLM-4.5 had a similarly wonky perplexity graph too):

GLM-4-6-ppl
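
(For context, a run like that is roughly the following; a sketch with a placeholder filename, using the standard llama.cpp perplexity flags:)

./llama-perplexity -m <some-GLM-4.6-quant>.gguf -f wiki.test.raw -c 512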

And some KLD testing, thanks to @AesSedai and @ddh0 for the plotting scripts and the KLD test corpus ddh0_imat_calibration_data_v2.txt, with the bf16 GGUF as baseline:

ubergarm-glm-4.6-topp-vs-kld

Thanks again for all your work and influence on me and other folks trying to eke the best out of our limited hardware!

UPDATE

  • smol-IQ2_KS 97.990 GiB (2.359 BPW)
  • Final estimate: PPL = 5.2760 +/- 0.03410

Create an SVG image of a cute kitty./nothink

cute-kitty-glm-4-6-smol-iq2_ks

😹

GLM 4.6 FP8 on sglang
image

GLM 4.6 3.0 bpw exl3
image

Qwen3-Max
image

ChatGPT
image

What is the prompt and was thinking enabled for these?

The prompt is: Write a Python script that draws a cute kitten using matplotlib. Then plop the output into a Python shell and see what comes out. They were made with thinking enabled. Not that I've ever seen thinking actually do anything useful yet for this model. :)
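
For anyone who wants to run the same test, here's a trivial hand-written example of the kind of script the prompt asks for (just a sketch to show the workflow, not model output):

import matplotlib.pyplot as plt
from matplotlib.patches import Circle, Polygon

fig, ax = plt.subplots(figsize=(4, 4))
ax.set_aspect("equal")
ax.axis("off")
ax.add_patch(Circle((0.5, 0.45), 0.28, color="#d9a066"))                             # head
ax.add_patch(Polygon([(0.28, 0.62), (0.34, 0.85), (0.45, 0.66)], color="#d9a066"))   # left ear
ax.add_patch(Polygon([(0.72, 0.62), (0.66, 0.85), (0.55, 0.66)], color="#d9a066"))   # right ear
ax.add_patch(Circle((0.42, 0.50), 0.03, color="black"))                              # left eye
ax.add_patch(Circle((0.58, 0.50), 0.03, color="black"))                              # right eye
ax.add_patch(Polygon([(0.47, 0.42), (0.53, 0.42), (0.5, 0.38)], color="#c46a4a"))    # nose
plt.show()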

Important to remember that these are tiny ppl tests above, and the setup is very different from llama.cpp which uses a warm context for every sample. In fact it's not even 5x2048 unique tokens here, but five rows of 2048 with a 1536 token overlap (512 token stride). So they're really not very comparable numbers. I've done a lot of plots like these to try to guarantee apples-to-apples with the same test logic across frameworks, but it's very time consuming and a struggle to keep everything up to date to support the latest models that are actually interesting to test. Sometimes you need to downgrade stuff when regressions happen (e.g. I don't think the latest LCPP supports Mixtral anymore, is that right?) And also I keep running out of storage space :(
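
To put numbers on that overlap, a quick sketch of how five such rows are sliced with a 512-token stride (so only 2048 + 4*512 = 4096 tokens are unique in total):

row_len, stride, rows = 2048, 512, 5
tokens = list(range(8192))                     # placeholder token ids
windows = [tokens[i * stride : i * stride + row_len] for i in range(rows)]
overlap = row_len - stride                     # 1536 tokens shared between consecutive rows
unique = row_len + (rows - 1) * stride         # 4096 unique tokens across all five rows
print(overlap, unique)                         # 1536 4096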

And some KLD testing, thanks to @AesSedai and @ddh0 for the plotting scripts and the KLD test corpus ddh0_imat_calibration_data_v2.txt, with the bf16 GGUF as baseline:

ubergarm-glm-4.6-topp-vs-kld

I noticed the q3_KS there is around 280GB but the files I can download for it are around 160GB.

confused cat...

Why are those file sizes so big? (I feel like the answer will embarrass me for asking.)

Why are those file sizes so big? (I feel like the answer will embarrass me for asking.)

The model size is given in GiB, not GB. Maybe that's what you mean?
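
(The difference adds up at these sizes; e.g. taking the IQ3_KS figure quoted further down, a quick sketch of the conversion:)

gib = 148.390                 # IQ3_KS size in GiB, as listed later in the thread
gb = gib * 2**30 / 10**9      # 1 GiB = 1.073741824 GB
print(round(gb, 1))           # ~159.3, i.e. the "around 160GB" of downloadable files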

@turboderp

I've done a lot of plots like these to try to guarantee apples-to-apples with the same test logic across frameworks, but it's very time consuming and a struggle to keep everything up to date to support the latest models that are actually interesting to test.

Thanks for the cat prompt and detailed discussion here. Yes, your exllamav3/eval/compare_q_llamacpp.py is probably the best tool for making reasonable comparisons across ecosystems. But as you say, it's a struggle to deal with regressions and such. I don't think ik has Python bindings either, so for now we will have to stick to CatBench!! Thanks!


@BingoBird

I noticed the q3_KS there is around 280GB but the files I can download for it are around 160GB.

So, the IQ3_KS is 148.390 GiB (3.573 BPW); I confirmed the files are that size. I believe the confusion comes from the graph having a color bar on the right: it's not related to the Y-axis but is a "color axis", which may be difficult to read depending on color vision, monitor, etc. It's a third, independent axis, unrelated to a dot's physical location on the graph; the information is encoded in the color of the dot.


@ddh0

For sure, I've had folks ask me why their files are not the "right" size; it is confusing... But saying "jibibytes" out loud is sure fun xD

UD-Q3K_XL
svg-kitten-glm

Considering it used XTC and no thinking.

glm-miku

The miku leaves much to be desired tho.
