ubergarm committed on
Commit
869d411
1 Parent(s): f07ee20

Add IQ1_S_R4 and IQ4_KS_R4 and secret meme

Files changed (3)
  1. README.md +166 -5
  2. images/buff-mokey-meme.png +3 -0
  3. images/perplexity.png +3 -0
README.md CHANGED
@@ -30,15 +30,97 @@ Excited to share and learn together. Thanks!
30
  ## Quant Collection
31
  So far these are my best recipes, offering the lowest perplexity-per-GiB models suitable for a wide variety of CPU+GPU or CPU *only* rigs.
32
 
33
  * `DeepSeek-R1-0528-Q8_0` 666GiB
34
  - `Final estimate: PPL = 3.2130 +/- 0.01698`
35
  - I didn't upload this, it is for baseline reference only.
36
  * `DeepSeek-R1-0528-IQ3_K_R4` 301GiB
37
  - `Final estimate: PPL = 3.2730 +/- 0.01738`
38
  - Fits 32k context in under 24GiB VRAM
39
  * `DeepSeek-R1-0528-IQ2_K_R4` 220GiB
40
  - `Final estimate: PPL = 3.5069 +/- 0.01893`
41
  - Fits 32k context in under 16GiB VRAM
42
 
43
  #### `IQ3_K_R4` 3.847 BPW (301GiB)
44
  Special mix `IQ4_KS_R4` `ffn_down` and `IQ3_K_R4` `ffn_(up|gate)` routed experts. All other layers `q8_0` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack`.
@@ -47,7 +129,7 @@ Special mix `IQ4_KS_R4` `ffn_down` and `IQ3_K_R4` `ffn_(up|gate)` routed experts
47
 
48
  <summary>👈 Possible VRAM & RAM Combinations</summary>
49
 
50
- This is probably a good size quant for a 368GB RAM rig preferrably with at least a single 24GB VRAM GPU.
51
  It is probably a little out of reach for a 256GB RAM rig unless you have 80+GB VRAM.
52
  You could still run "troll rig" style and page off disk for maybe 5 tok/sec and some hot NVMe drives hahah...
53
 
@@ -224,11 +306,88 @@ custom=$(
224
 
225
  </details>
226
 
227
- #### What Next?
228
- Let me know in the comments or ik_llama.cpp discussion reference at the bottom if there is demand for quants targeting other common hardware configurations e.g.:
229
 
230
- * Possibly slightly larger mix for higher RAM systems if there is interest?
231
- * Possibly an extreme small <= ~1.5BPW quant for 128GB RAM+VRAM? Is this even possible with decent quality? lol...
232
 
233
  ## Quick Start
234
  #### `ik_llama.cpp` API server for GPU+CPU
@@ -257,6 +416,8 @@ CUDA_VISIBLE_DEVICES="0," \
257
  # Adjust number of routed expert layers for additional VRAM on each GPU
258
  # Compile with -DGGML_SCHED_MAX_COPIES=1 for multi-GPUs
259
  # Compile with -DGGML_CUDA_IQK_FORCE_BF16=1 if putting `_R4` tensors on GPU (for DeepSeek only)
260
  ./build/bin/llama-server \
261
  --model /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ3_K_R4.gguf \
262
  --alias ubergarm/DeepSeek-R1-0528-IQ3_K_R4 \
 
30
  ## Quant Collection
31
  So far these are my best recipes, offering the lowest perplexity-per-GiB models suitable for a wide variety of CPU+GPU or CPU *only* rigs.
32
 
33
+ ![Perplexity Chart](images/perplexity.png "Chart showing Perplexity improving as BPW increases.")
34
+
35
  * `DeepSeek-R1-0528-Q8_0` 666GiB
36
  - `Final estimate: PPL = 3.2130 +/- 0.01698`
37
  - I didn't upload this, it is for baseline reference only.
38
+ * `DeepSeek-R1-0528-IQ4_KS_R4` 368GiB
39
+ - `Final estimate: PPL = 3.2286 +/- 0.01710`
40
+ - Fits 32k context in under 24GiB VRAM
41
  * `DeepSeek-R1-0528-IQ3_K_R4` 301GiB
42
  - `Final estimate: PPL = 3.2730 +/- 0.01738`
43
  - Fits 32k context in under 24GiB VRAM
44
  * `DeepSeek-R1-0528-IQ2_K_R4` 220GiB
45
  - `Final estimate: PPL = 3.5069 +/- 0.01893`
46
  - Fits 32k context in under 16GiB VRAM
47
+ - Fits 64k context in under 24GiB VRAM
48
+ * `DeepSeek-R1-0528-IQ1_S_R4` 131GiB
49
+ - `Final estimate: PPL = 4.8831 +/- 0.02878`
50
+ - Fits 32k+ context in under 16GiB VRAM
51
+ - Fits 90k+ context in under 24GiB VRAM
52
+ - "Only for the desperate."
53
+ - Technically "better" (lower) PPL than `Qwen3-235B-A22B-Q8_0 @ ~5.31`
54
+ * I might try an `iqN_kt` "QTIP/exl3/trellis" style quant, but it is rather experimental, takes a long time to cook, and is slow for CPU inference...
55
+
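The `Final estimate` PPL values above come from perplexity runs; a minimal sketch of such a run with `llama-perplexity` follows. The test file, thread count, and model path here are placeholders, not necessarily the exact command used for these numbers.

```bash
#!/usr/bin/env bash
# Hedged sketch of a typical perplexity run with ik_llama.cpp's llama-perplexity.
# Assumes the usual wiki.test.raw evaluation file; adjust paths and threads for your rig.
./build/bin/llama-perplexity \
    --model /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ3_K_R4.gguf \
    -f wiki.test.raw \
    --threads 32
# The "Final estimate: PPL = ..." line at the end of the run is the number reported above.
```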
56
+ #### `IQ4_KS_R4` 4.701 BPW (368GiB)
57
+ Special mix `IQ5_KS_R4` `ffn_down` and `IQ4_KS_R4` `ffn_(up|gate)` routed experts. All other layers `q8_0` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack`.
58
+
59
+ <details>
60
+
61
+ <summary>👈 Secret Recipe</summary>
62
+
63
+ This quant might be fairly fast despite the larger size given the `_KS` quant inferencing optimizations. Made this as there were some requests for a larger size. This one *might* fit on 368GB RAM if you have more than average VRAM, or comfortably on a 512GB RAM rig, preferably with 24GB VRAM, though it is fine for CPU only as well.
64
+
65
+ ```bash
66
+ #!/usr/bin/env bash
67
+
68
+ custom="
69
+ # Token embedding and output tensors (GPU)
70
+ token_embd\.weight=q8_0
71
+ output\.weight=q8_0
72
+ output_norm\.weight=q8_0
73
+
74
+ # First 3 dense layers (0-2) (GPU)
75
+ blk\.[0-2]\..*=q8_0
76
+
77
+ # All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
78
+ blk\.[3-9]\.attn_.*=q8_0
79
+ blk\.[1-5][0-9]\.attn_.*=q8_0
80
+ blk\.60\.attn_.*=q8_0
81
+
82
+ blk\.[3-9]\.ffn_norm\.weight=q8_0
83
+ blk\.[1-5][0-9]\.ffn_norm\.weight=q8_0
84
+ blk\.60\.ffn_norm\.weight=q8_0
85
+
86
+ blk\.[3-9]\.exp_probs_b\.bias=q8_0
87
+ blk\.[1-5][0-9]\.exp_probs_b\.bias=q8_0
88
+ blk\.60\.exp_probs_b\.bias=q8_0
89
+
90
+ # Shared Experts (3-60) (GPU)
91
+ blk\.[3-9]\.ffn_down_shexp\.weight=q8_0
92
+ blk\.[1-5][0-9]\.ffn_down_shexp\.weight=q8_0
93
+ blk\.60\.ffn_down_shexp\.weight=q8_0
94
+
95
+ blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=q8_0
96
+ blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=q8_0
97
+ blk\.60\.ffn_(gate|up)_shexp\.weight=q8_0
98
+
99
+ # MoE Experts (3-60) (CPU)
100
+ blk\.[3-9]\.ffn_down_exps\.weight=iq5_ks_r4
101
+ blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq5_ks_r4
102
+ blk\.60\.ffn_down_exps\.weight=iq5_ks_r4
103
+
104
+ blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq4_ks_r4
105
+ blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq4_ks_r4
106
+ blk\.60\.ffn_(gate|up)_exps\.weight=iq4_ks_r4
107
+ "
108
+
109
+ custom=$(
110
+ echo "$custom" | grep -v '^#' | \
111
+ sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
112
+ )
113
+
114
+ ./build/bin/llama-quantize \
115
+ --custom-q "$custom" \
116
+ --imatrix /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/imatrix-DeepSeek-R1-0528.dat \
117
+ /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-256x21B-0528-BF16-00001-of-00030.gguf \
118
+ /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ4_KS_R4.gguf \
119
+ IQ4_KS_R4 \
120
+ 24
121
+ ```
122
+
123
+ </details>
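The `custom=$( ... )` step in the recipe above (and in the other recipes) just flattens the multi-line spec into the comma-separated string that `--custom-q` expects. A tiny toy demo of that transformation, shown here only as an illustration using two of the rules:

```bash
#!/usr/bin/env bash
# Toy demo: comment lines are dropped by grep, remaining lines are joined with commas by sed.
custom="
# comments like this are stripped
token_embd\.weight=q8_0
blk\.[0-2]\..*=q8_0
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

echo "$custom"
# prints: token_embd\.weight=q8_0,blk\.[0-2]\..*=q8_0
```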
124
 
125
  #### `IQ3_K_R4` 3.847 BPW (301GiB)
126
  Special mix `IQ4_KS_R4` `ffn_down` and `IQ3_K_R4` `ffn_(up|gate)` routed experts. All other layers `q8_0` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack`.
 
129
 
130
  <summary>👈 Possible VRAM & RAM Combinations</summary>
131
 
132
+ This is probably a good size quant for a 368GB RAM rig preferably with at least a single 24GB VRAM GPU.
133
  It is probably a little out of reach for a 256GB RAM rig unless you have 80+GB VRAM.
134
  You could still run "troll rig" style and page off disk for maybe 5 tok/sec and some hot NVMe drives hahah...
135
 
 
306
 
307
  </details>
308
 
309
+ #### `IQ1_S_R4` 1.664 BPW (131GiB)
310
+ Special mix `IQ1_M_R4` `ffn_down` and `IQ1_S_R4` `ffn_(up|gate)` routed experts. All other layers mostly `iq4_ks` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack`.
311
+
312
+ <details>
313
+
314
+ ![Reverse Buff Mokey Meme](images/buff-mokey-meme.png "Reverse Buff Mokey Meme Comparing full R1-671B fp8 to smol iq1_s quant.")
315
+
316
+ Possibly useful for 128GiB RAM + 16GB+ VRAM? Maybe? It does actually work and can read python code okay. For all I know it might be better than Qwen3-235B-A22B given the iq1_s_r4 actually has lower PPL!
317
+
318
+ Not recommended unless this is the *only* thing you can fit completely in RAM+VRAM: this quant is less optimized for inferencing, and in testing it has slower TG and worse quality (higher perplexity) than the larger quants. I'm also not sure you can use it with multi-GPU offload, so check the ik_llama.cpp PRs, as these tiny quants are less used.
319
+
320
+ <summary>👈 Secret Recipe</summary>
321
+
322
+ ```bash
323
+ #!/usr/bin/env bash
324
+
325
+ custom="
326
+ # Token embedding and output tensors (GPU)
327
+ # note: token_embd cannot be a repacked quant type
328
+ token_embd\.weight=iq4_ks
329
+ output\.weight=iq4_ks
330
+ output_norm\.weight=iq4_ks
331
+
332
+ # First 3 dense layers (0-2) (GPU)
333
+ # Except blk.*.attn_k_b.weight, whose row size is not divisible by 256, so it only supports qN_0
334
+ blk\.[0-2]\.attn_k_b.*=q4_0
335
+ blk\.[0-2]\.attn_.*=iq4_ks
336
+ blk\.[0-2]\..*=iq4_ks
337
+
338
+ # All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
339
+ # Except blk.*.attn_k_b.weight, whose row size is not divisible by 256, so it only supports qN_0
340
+ blk\.[3-9]\.attn_k_b.*=q4_0
341
+ blk\.[1-5][0-9]\.attn_k_b.*=q4_0
342
+ blk\.60\.attn_k_b.*=q4_0
343
+
344
+ blk\.[3-9]\.attn_.*=iq4_ks
345
+ blk\.[1-5][0-9]\.attn_.*=iq4_ks
346
+ blk\.60\.attn_.*=iq4_ks
347
+
348
+ blk\.[3-9]\.ffn_norm\.weight=iq4_ks
349
+ blk\.[1-5][0-9]\.ffn_norm\.weight=iq4_ks
350
+ blk\.60\.ffn_norm\.weight=iq4_ks
351
+
352
+ blk\.[3-9]\.exp_probs_b\.bias=iq4_ks
353
+ blk\.[1-5][0-9]\.exp_probs_b\.bias=iq4_ks
354
+ blk\.60\.exp_probs_b\.bias=iq4_ks
355
+
356
+ # Shared Experts (3-60) (GPU)
357
+ blk\.[3-9]\.ffn_down_shexp\.weight=iq4_ks
358
+ blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq4_ks
359
+ blk\.60\.ffn_down_shexp\.weight=iq4_ks
360
+
361
+ blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
362
+ blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
363
+ blk\.60\.ffn_(gate|up)_shexp\.weight=iq4_ks
364
+
365
+ # Routed Experts (3-60) (CPU)
366
+ blk\.[3-9]\.ffn_down_exps\.weight=iq1_m_r4
367
+ blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq1_m_r4
368
+ blk\.60\.ffn_down_exps\.weight=iq1_m_r4
369
+
370
+ blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq1_s_r4
371
+ blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq1_s_r4
372
+ blk\.60\.ffn_(gate|up)_exps\.weight=iq1_s_r4
373
+ "
374
+
375
+ custom=$(
376
+ echo "$custom" | grep -v '^#' | \
377
+ sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
378
+ )
379
+
380
+ ./build/bin/llama-quantize \
381
+ --custom-q "$custom" \
382
+ --imatrix /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/imatrix-DeepSeek-R1-0528.dat \
383
+ /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-256x21B-0528-BF16-00001-of-00030.gguf \
384
+ /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ1_S_R4.gguf \
385
+ IQ1_S_R4 \
386
+ 24
387
+ ```
388
+
389
+ </details>
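Each of the mixes above suggests `--run-time-repack` for max speed on CPU *only* rigs; here is a minimal CPU-only launch sketch. The quant file, thread count, and context size are assumptions to adjust for your own rig, not a tested configuration.

```bash
#!/usr/bin/env bash
# CPU-only sketch: --run-time-repack repacks tensors at load time for max CPU speed.
# Model path, threads, and context size below are placeholders.
./build/bin/llama-server \
    --model /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ2_K_R4.gguf \
    --alias ubergarm/DeepSeek-R1-0528-IQ2_K_R4 \
    --run-time-repack \
    --ctx-size 32768 \
    --threads 64 \
    --host 127.0.0.1 \
    --port 8080
```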
390
 
391
 
392
  ## Quick Start
393
  #### `ik_llama.cpp` API server for GPU+CPU
 
416
  # Adjust number of routed expert layers for additional VRAM on each GPU
417
  # Compile with -DGGML_SCHED_MAX_COPIES=1 for multi-GPUs
418
  # Compile with -DGGML_CUDA_IQK_FORCE_BF16=1 if putting `_R4` tensors on GPU (for DeepSeek only)
419
+ # (might go faster or slower with FORCE_BF16 depending on GPU model)
420
+ # If you have extra VRAM go with `-b 4096 -ub 4096` for potential big PP gains!
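# Hedged build sketch (assumed cmake invocation) showing where the compile flags above go:
#   cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
#   cmake --build build --config Release -j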
421
  ./build/bin/llama-server \
422
  --model /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ3_K_R4.gguf \
423
  --alias ubergarm/DeepSeek-R1-0528-IQ3_K_R4 \
images/buff-mokey-meme.png ADDED

Git LFS Details

  • SHA256: 15e8f8fc60de0158eb5639513b9629edaa91c47d996becbc496d0cdea4c56eb1
  • Pointer size: 131 Bytes
  • Size of remote file: 119 kB
images/perplexity.png ADDED

Git LFS Details

  • SHA256: 4295b96c2ba5d0d14d6a2a14baf0dc35f3431773a56095d3b67d25d61c89821b
  • Pointer size: 131 Bytes
  • Size of remote file: 128 kB