ubergarm committed on
Commit
869d411
1 Parent(s): f07ee20

Add IQ1_S_R4 and IQ4_KS_R4 and secret meme

Files changed (3)
  1. README.md +166 -5
  2. images/buff-mokey-meme.png +3 -0
  3. images/perplexity.png +3 -0
README.md CHANGED
@@ -30,15 +30,97 @@ Excited to share and learn together. Thanks!
30
  ## Quant Collection
31
  So far these are my best recipes, offering the lowest perplexity-per-GiB models suitable for a wide variety of CPU+GPU or CPU *only* rigs.
32
 
33
  * `DeepSeek-R1-0528-Q8_0` 666GiB
34
  - `Final estimate: PPL = 3.2130 +/- 0.01698`
35
  - I didn't upload this, it is for baseline reference only.
36
  * `DeepSeek-R1-0528-IQ3_K_R4` 301GiB
37
  - `Final estimate: PPL = 3.2730 +/- 0.01738`
38
  - Fits 32k context in under 24GiB VRAM
39
  * `DeepSeek-R1-0528-IQ2_K_R4` 220GiB
40
  - `Final estimate: PPL = 3.5069 +/- 0.01893`
41
  - Fits 32k context in under 16GiB VRAM
42
 
43
  #### `IQ3_K_R4` 3.847 BPW (301GiB)
44
  Special mix `IQ4_KS_R4` `ffn_down` and `IQ3_K_R4` `ffn_(up|gate)` routed experts. All other layers `q8_0` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack`.
@@ -47,7 +129,7 @@ Special mix `IQ4_KS_R4` `ffn_down` and `IQ3_K_R4` `ffn_(up|gate)` routed experts
47
 
48
  <summary>👈 Possible VRAM & RAM Combinations</summary>
49
 
50
- This is probably a good size quant for a 368GB RAM rig preferrably with at least a single 24GB VRAM GPU.
51
  It is probably a little out of reach for a 256GB RAM rig unless you have 80+GB VRAM.
52
  You could still run "troll rig" style and page off disk for maybe 5 tok/sec and some hot NVMe drives hahah...
53
 
@@ -224,11 +306,88 @@ custom=$(
224
 
225
  </details>
226
 
227
- #### What Next?
228
- Let me know in the comments or ik_llama.cpp discussion reference at the bottom if there is demand for quants targeting other common hardware configurations e.g.:
229
 
230
- * Possibly slightly larger mix for higher RAM systems if there is interest?
231
- * Possibly an extreme small <= ~1.5BPW quant for 128GB RAM+VRAM? Is this even possible with decent quality? lol...
232
 
233
  ## Quick Start
234
  #### `ik_llama.cpp` API server for GPU+CPU
@@ -257,6 +416,8 @@ CUDA_VISIBLE_DEVICES="0," \
257
  # Adjust number of routed expert layers for additional VRAM on each GPU
258
  # Compile with -DGGML_SCHED_MAX_COPIES=1 for multi-GPUs
259
  # Compile with -DGGML_CUDA_IQK_FORCE_BF16=1 if putting `_R4` tensors on GPU (for DeepSeek only)
260
  ./build/bin/llama-server \
261
  --model /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ3_K_R4.gguf \
262
  --alias ubergarm/DeepSeek-R1-0528-IQ3_K_R4 \
 
30
  ## Quant Collection
31
  So far these are my best recipes, offering the lowest perplexity-per-GiB models suitable for a wide variety of CPU+GPU or CPU *only* rigs.
32
 
33
+ ![Perplexity Chart](images/perplexity.png "Chart showing Perplexity improving as BPW increases.")
34
+
35
  * `DeepSeek-R1-0528-Q8_0` 666GiB
36
  - `Final estimate: PPL = 3.2130 +/- 0.01698`
37
  - I didn't upload this, it is for baseline reference only.
38
+ * `DeepSeek-R1-0528-IQ4_KS_R4` 368GiB
39
+ - `Final estimate: PPL = 3.2286 +/- 0.01710`
40
+ - Fits 32k context in under 24GiB VRAM
41
  * `DeepSeek-R1-0528-IQ3_K_R4` 301GiB
42
  - `Final estimate: PPL = 3.2730 +/- 0.01738`
43
  - Fits 32k context in under 24GiB VRAM
44
  * `DeepSeek-R1-0528-IQ2_K_R4` 220GiB
45
  - `Final estimate: PPL = 3.5069 +/- 0.01893`
46
  - Fits 32k context in under 16GiB VRAM
47
+ - Fits 64k context in under 24GiB VRAM
48
+ * `DeepSeek-R1-0528-IQ1_S_R4` 131GiB
49
+ - `Final estimate: PPL = 4.8831 +/- 0.02878`
50
+ - Fits 32k+ context in under 16GiB VRAM
51
+ - Fits 90k+ context in under 24GiB VRAM
52
+ - "Only for the desperate."
53
+ - Technically "better" (lower) PPL than `Qwen3-235B-A22B-Q8_0 @ ~5.31`
54
+ * I might try an `iqN_kt` "QTIP/exl3/trellis" style quant, but it is rather experimental, takes a long time to cook, and is slow for CPU inference...
55
+
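The `Final estimate` PPL values above come from perplexity runs; a minimal sketch of such a run with `llama-perplexity` follows. The test file, thread count, and model path here are placeholders, not necessarily the exact command used for these numbers.

```bash
#!/usr/bin/env bash
# Hedged sketch of a typical perplexity run with ik_llama.cpp's llama-perplexity.
# Assumes the usual wiki.test.raw evaluation file; adjust paths and threads for your rig.
./build/bin/llama-perplexity \
    --model /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ3_K_R4.gguf \
    -f wiki.test.raw \
    --threads 32
# The "Final estimate: PPL = ..." line at the end of the run is the number reported above.
```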
56
+ #### `IQ4_KS_R4` 4.701 BPW (368GiB)
57
+ Special mix `IQ5_KS_R4` `ffn_down` and `IQ4_KS_R4` `ffn_(up|gate)` routed experts. All other layers `q8_0` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack`.
58
+
59
+ <details>
60
+
61
+ <summary>👈 Secret Recipe</summary>
62
+
63
+ This quant might be fairly fast despite the larger size given the `_KS` quant inferencing optimizations. Made this as there were some requests for a larger size. This one *might* fit on 368GB RAM if you have more than average VRAM, or comfortably on a 512GB RAM rig, preferably with 24GB VRAM, though it is fine for CPU only as well.
64
+
65
+ ```bash
66
+ #!/usr/bin/env bash
67
+
68
+ custom="
69
+ # Token embedding and output tensors (GPU)
70
+ token_embd\.weight=q8_0
71
+ output\.weight=q8_0
72
+ output_norm\.weight=q8_0
73
+
74
+ # First 3 dense layers (0-2) (GPU)
75
+ blk\.[0-2]\..*=q8_0
76
+
77
+ # All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
78
+ blk\.[3-9]\.attn_.*=q8_0
79
+ blk\.[1-5][0-9]\.attn_.*=q8_0
80
+ blk\.60\.attn_.*=q8_0
81
+
82
+ blk\.[3-9]\.ffn_norm\.weight=q8_0
83
+ blk\.[1-5][0-9]\.ffn_norm\.weight=q8_0
84
+ blk\.60\.ffn_norm\.weight=q8_0
85
+
86
+ blk\.[3-9]\.exp_probs_b\.bias=q8_0
87
+ blk\.[1-5][0-9]\.exp_probs_b\.bias=q8_0
88
+ blk\.60\.exp_probs_b\.bias=q8_0
89
+
90
+ # Shared Experts (3-60) (GPU)
91
+ blk\.[3-9]\.ffn_down_shexp\.weight=q8_0
92
+ blk\.[1-5][0-9]\.ffn_down_shexp\.weight=q8_0
93
+ blk\.60\.ffn_down_shexp\.weight=q8_0
94
+
95
+ blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=q8_0
96
+ blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=q8_0
97
+ blk\.60\.ffn_(gate|up)_shexp\.weight=q8_0
98
+
99
+ # MoE Experts (3-60) (CPU)
100
+ blk\.[3-9]\.ffn_down_exps\.weight=iq5_ks_r4
101
+ blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq5_ks_r4
102
+ blk\.60\.ffn_down_exps\.weight=iq5_ks_r4
103
+
104
+ blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq4_ks_r4
105
+ blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq4_ks_r4
106
+ blk\.60\.ffn_(gate|up)_exps\.weight=iq4_ks_r4
107
+ "
108
+
109
+ custom=$(
110
+ echo "$custom" | grep -v '^#' | \
111
+ sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
112
+ )
113
+
114
+ ./build/bin/llama-quantize \
115
+ --custom-q "$custom" \
116
+ --imatrix /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/imatrix-DeepSeek-R1-0528.dat \
117
+ /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-256x21B-0528-BF16-00001-of-00030.gguf \
118
+ /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ4_KS_R4.gguf \
119
+ IQ4_KS_R4 \
120
+ 24
121
+ ```
122
+
123
+ </details>
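The `custom=$( ... )` step in the recipe above (and in the other recipes) just flattens the multi-line spec into the comma-separated string that `--custom-q` expects. A tiny toy demo of that transformation, shown here only as an illustration using two of the rules:

```bash
#!/usr/bin/env bash
# Toy demo: comment lines are dropped by grep, remaining lines are joined with commas by sed.
custom="
# comments like this are stripped
token_embd\.weight=q8_0
blk\.[0-2]\..*=q8_0
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

echo "$custom"
# prints: token_embd\.weight=q8_0,blk\.[0-2]\..*=q8_0
```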
124
 
125
  #### `IQ3_K_R4` 3.847 BPW (301GiB)
126
  Special mix `IQ4_KS_R4` `ffn_down` and `IQ3_K_R4` `ffn_(up|gate)` routed experts. All other layers `q8_0` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack`.
 
129
 
130
  <summary>👈 Possible VRAM & RAM Combinations</summary>
131
 
132
+ This is probably a good size quant for a 368GB RAM rig preferably with at least a single 24GB VRAM GPU.
133
  It is probably a little out of reach for a 256GB RAM rig unless you have 80+GB VRAM.
134
  You could still run "troll rig" style and page off disk for maybe 5 tok/sec and some hot NVMe drives hahah...
135
 
 
306
 
307
  </details>
308
 
309
+ #### `IQ1_S_R4` 1.664 BPW (131GiB)
310
+ Special mix `IQ1_M_R4` `ffn_down` and `IQ1_S_R4` `ffn_(up|gate)` routed experts. All other layers mostly `iq4_ks` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack`.
311
+
312
+ <details>
313
+
314
+ ![Reverse Buff Mokey Meme](images/buff-mokey-meme.png "Reverse Buff Mokey Meme Comparing full R1-671B fp8 to smol iq1_s quant.")
315
+
316
+ Possibly useful for 128GiB RAM + 16GB+ VRAM? Maybe? It does actually work and can read python code okay. For all I know it might be better than Qwen3-235B-A22B given the iq1_s_r4 actually has lower PPL!
317
+
318
+ Not recommended unless this is the *only* thing you can fit completely in RAM+VRAM: this quant is less optimized for inferencing, and in testing it has slower TG and worse quality (higher perplexity) than the larger quants. I'm also not sure you can use it with multi-GPU offload, so check the ik_llama.cpp PRs, as these tiny quants are less used.
319
+
320
+ <summary>👈 Secret Recipe</summary>
321
+
322
+ ```bash
323
+ #!/usr/bin/env bash
324
+
325
+ custom="
326
+ # Token embedding and output tensors (GPU)
327
+ # note: token_embd cannot be a repacked quant type
328
+ token_embd\.weight=iq4_ks
329
+ output\.weight=iq4_ks
330
+ output_norm\.weight=iq4_ks
331
+
332
+ # First 3 dense layers (0-2) (GPU)
333
+ # Except blk.*.attn_k_b.weight, whose row size is not divisible by 256, so it only supports qN_0
334
+ blk\.[0-2]\.attn_k_b.*=q4_0
335
+ blk\.[0-2]\.attn_.*=iq4_ks
336
+ blk\.[0-2]\..*=iq4_ks
337
+
338
+ # All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
339
+ # Except blk.*.attn_k_b.weight, whose row size is not divisible by 256, so it only supports qN_0
340
+ blk\.[3-9]\.attn_k_b.*=q4_0
341
+ blk\.[1-5][0-9]\.attn_k_b.*=q4_0
342
+ blk\.60\.attn_k_b.*=q4_0
343
+
344
+ blk\.[3-9]\.attn_.*=iq4_ks
345
+ blk\.[1-5][0-9]\.attn_.*=iq4_ks
346
+ blk\.60\.attn_.*=iq4_ks
347
+
348
+ blk\.[3-9]\.ffn_norm\.weight=iq4_ks
349
+ blk\.[1-5][0-9]\.ffn_norm\.weight=iq4_ks
350
+ blk\.60\.ffn_norm\.weight=iq4_ks
351
+
352
+ blk\.[3-9]\.exp_probs_b\.bias=iq4_ks
353
+ blk\.[1-5][0-9]\.exp_probs_b\.bias=iq4_ks
354
+ blk\.60\.exp_probs_b\.bias=iq4_ks
355
+
356
+ # Shared Experts (3-60) (GPU)
357
+ blk\.[3-9]\.ffn_down_shexp\.weight=iq4_ks
358
+ blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq4_ks
359
+ blk\.60\.ffn_down_shexp\.weight=iq4_ks
360
+
361
+ blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
362
+ blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
363
+ blk\.60\.ffn_(gate|up)_shexp\.weight=iq4_ks
364
+
365
+ # Routed Experts (3-60) (CPU)
366
+ blk\.[3-9]\.ffn_down_exps\.weight=iq1_m_r4
367
+ blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq1_m_r4
368
+ blk\.60\.ffn_down_exps\.weight=iq1_m_r4
369
+
370
+ blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq1_s_r4
371
+ blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq1_s_r4
372
+ blk\.60\.ffn_(gate|up)_exps\.weight=iq1_s_r4
373
+ "
374
+
375
+ custom=$(
376
+ echo "$custom" | grep -v '^#' | \
377
+ sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
378
+ )
379
+
380
+ ./build/bin/llama-quantize \
381
+ --custom-q "$custom" \
382
+ --imatrix /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/imatrix-DeepSeek-R1-0528.dat \
383
+ /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-256x21B-0528-BF16-00001-of-00030.gguf \
384
+ /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ1_S_R4.gguf \
385
+ IQ1_S_R4 \
386
+ 24
387
+ ```
388
+
389
+ </details>
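Each of the mixes above suggests `--run-time-repack` for max speed on CPU *only* rigs; here is a minimal CPU-only launch sketch. The quant file, thread count, and context size are assumptions to adjust for your own rig, not a tested configuration.

```bash
#!/usr/bin/env bash
# CPU-only sketch: --run-time-repack repacks tensors at load time for max CPU speed.
# Model path, threads, and context size below are placeholders.
./build/bin/llama-server \
    --model /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ2_K_R4.gguf \
    --alias ubergarm/DeepSeek-R1-0528-IQ2_K_R4 \
    --run-time-repack \
    --ctx-size 32768 \
    --threads 64 \
    --host 127.0.0.1 \
    --port 8080
```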
390
 
391
 
392
  ## Quick Start
393
  #### `ik_llama.cpp` API server for GPU+CPU
 
416
  # Adjust number of routed expert layers for additional VRAM on each GPU
417
  # Compile with -DGGML_SCHED_MAX_COPIES=1 for multi-GPUs
418
  # Compile with -DGGML_CUDA_IQK_FORCE_BF16=1 if putting `_R4` tensors on GPU (for DeepSeek only)
419
+ # (might go faster or slower with FORCE_BF16 depending on GPU model)
420
+ # If you have extra VRAM go with `-b 4096 -ub 4096` for potential big PP gains!
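# Hedged build sketch (assumed cmake invocation) showing where the compile flags above go:
#   cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
#   cmake --build build --config Release -j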
421
  ./build/bin/llama-server \
422
  --model /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ3_K_R4.gguf \
423
  --alias ubergarm/DeepSeek-R1-0528-IQ3_K_R4 \
images/buff-mokey-meme.png ADDED

Git LFS Details

  • SHA256: 15e8f8fc60de0158eb5639513b9629edaa91c47d996becbc496d0cdea4c56eb1
  • Pointer size: 131 Bytes
  • Size of remote file: 119 kB
images/perplexity.png ADDED

Git LFS Details

  • SHA256: 4295b96c2ba5d0d14d6a2a14baf0dc35f3431773a56095d3b67d25d61c89821b
  • Pointer size: 131 Bytes
  • Size of remote file: 128 kB