Smaller quants, like `UD-Q2_K_XL`, are much faster when generating tokens, but often produce code that fails to run or contains bugs. Based on empirical observations, coding quality seems to be strongly affected by model quantization. So we use larger quantization where it matters, to reduce perplexity while staying within the target system constraints of 24GB-32GB VRAM and 512GB RAM.

### Quantization Approach
When running with **Flash MLA** optimization enabled, **ik_llama** will unpack the **attention** tensors into `Q8_0`, so we match that in our model (similar to ubergarm's ik_llama.cpp quants). We also keep all the other small tensors as `Q8_0`, while leaving any `F32` tensors untouched. The MoE tensors make up the bulk of the model. The **ffn_down_exps** tensors are especially sensitive to quantization (we borrow this idea from `unsloth` quants), so we quantize them as `Q6_K_R4`. Finally, all the other large MoE tensors (**ffn_up_exps**, **ffn_gate_exps**) are quantized as `Q4_K_R4`. The full recipe is sketched as a quantize command after the summary below.

Quantization Summary:
- Keep all the small `F32` tensors untouched
- Quantize all the **attention** and related tensors to `Q8_0`
- Quantize all the **ffn_down_exps** tensors to `Q6_K_R4`
- Quantize all the **ffn_up_exps** and **ffn_gate_exps** tensors to `Q4_K_R4`
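
Putting the recipe together, the sketch below shows how these overrides might be passed to `ik_llama.cpp`'s `llama-quantize`. The `--custom-q` flag, the regex patterns, the lowercase type names, and the file paths are assumptions based on ubergarm-style recipes rather than a verbatim command, so check `llama-quantize --help` on your build first:

```bash
# Sketch only: assumes ik_llama.cpp's llama-quantize accepts per-tensor
# overrides as comma-separated regex=type pairs via --custom-q, and that
# the *_r4 type names match your build. Paths are hypothetical.
./build/bin/llama-quantize \
  --custom-q "attn_=q8_0,ffn_down_exps=q6_k_r4,ffn_up_exps=q4_k_r4,ffn_gate_exps=q4_k_r4" \
  model-bf16.gguf \
  model-q4_k_r4.gguf \
  Q8_0   # fallback type: keeps the remaining small tensors at Q8_0
```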
The **attn_kv_b** tensors are included in the original model, but they contain the same information as the **attn_k_b** and **attn_v_b** tensors. Some quants, like `unsloth`, remove the **attn_k_b** and **attn_v_b** tensors altogether. We keep all these tensors for completeness, but push **attn_kv_b** out of VRAM with `attn_kv_b=CPU`, since `ik_llama` prefers to use **attn_k_b** and **attn_v_b** when all the tensors are available. This behavior may change between releases, so also try `attn_k_b=CPU,attn_v_b=CPU` and check which option gives you the best performance!
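
As an illustration, the override can be passed at launch time with `-ot` (`--override-tensor`); the model path and the other flags below are placeholders rather than a recommended configuration:

```bash
# Keep the attn_kv_b weights in host memory so they do not take up VRAM;
# ik_llama still prefers the separate attn_k_b / attn_v_b tensors on GPU.
./build/bin/llama-server \
  -m model-q4_k_r4.gguf \
  -ot "attn_kv_b=CPU" \
  -ngl 99
# Alternative to benchmark, per the note above:
#   -ot "attn_k_b=CPU,attn_v_b=CPU"
```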
### No imatrix
Generally, an imatrix is not recommended for Q4 and larger quants. The problem with an imatrix is that it guides what the model remembers, while anything not covered by the text sample used to generate the imatrix is more likely to be forgotten. For example, an imatrix derived from a Wikipedia sample is likely to negatively affect tasks like coding. In other words, while an imatrix can improve specific benchmarks that are similar to the imatrix input sample, it also skews the model's performance towards tasks similar to the imatrix sample at the expense of other tasks.