Model Card

High-quality quantization of Kimi-K2-Instruct, without using an imatrix.

Run

System Requirements

  • 24GB VRAM
  • 768GB RAM

You may be able to run with 512GB of RAM by removing --no-mmap and -rtr, at a performance cost that depends on which MoE experts are activated by your prompt.

Run with ik_llama.cpp, 32GB VRAM

./build/bin/llama-server \
    --alias anikifoss/Kimi-K2-Instruct-DQ4_K \
    --model /mnt/data/Models/anikifoss/Kimi-K2-Instruct-DQ4_K/Kimi-K2-Instruct-DQ4_K-00001-of-00014.gguf \
    --no-mmap -rtr \
    --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 \
    --ctx-size 131072 \
    -ctk f16 \
    -mla 3 -fa \
    -amb 512 \
    -b 2048 -ub 2048 \
    -fmoe \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 32 \
    --threads-batch 64 \
    --host 127.0.0.1 \
    --port 8090

Run with llama.cpp, 32GB VRAM

./build/bin/llama-server \
    --alias anikifoss/Kimi-K2-Instruct-DQ4_K \
    --model /mnt/data/Models/anikifoss/Kimi-K2-Instruct-DQ4_K/Kimi-K2-Instruct-DQ4_K-00001-of-00014.gguf \
    --no-mmap \
    --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 \
    --ctx-size 131072 \
    -ctk f16 \
    -fa \
    -b 2048 -ub 2048 \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 32 \
    --threads-batch 64 \
    --host 127.0.0.1 \
    --port 8090
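
Once either server is up, it exposes an OpenAI-compatible API. A minimal sketch of a request against the endpoint configured above (the prompt and sampling fields are placeholders; adjust them to your use case):

curl http://127.0.0.1:8090/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "anikifoss/Kimi-K2-Instruct-DQ4_K",
        "messages": [{"role": "user", "content": "Write a haiku about quantization."}],
        "temperature": 0.5
    }'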

Quantization Approach

  • Keep all the small F32 tensors untouched
  • Quantize all the attention and related tensors to Q8_0
  • Quantize all the ffn_down_exps tensors to Q6_K
  • Quantize all the ffn_up_exps and ffn_gate_exps tensors to Q4_K
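
For reference, a recipe along these lines can be approximated with per-tensor overrides in llama-quantize. The sketch below assumes a recent llama.cpp build that supports --tensor-type (ik_llama.cpp uses its own custom-quant syntax); the file names and tensor-name patterns are illustrative only:

./build/bin/llama-quantize \
    --output-tensor-type q8_0 \
    --tensor-type "attn=q8_0" \
    --tensor-type "ffn_down_exps=q6_k" \
    --tensor-type "ffn_up_exps=q4_k" \
    --tensor-type "ffn_gate_exps=q4_k" \
    Kimi-K2-Instruct-BF16.gguf Kimi-K2-Instruct-DQ4_K.gguf Q4_K 64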

No imatrix

Generally, an imatrix is not recommended for Q4 and larger quants. The problem with an imatrix is that it guides what the model remembers, while anything not covered by the text sample used to generate the imatrix is more likely to be forgotten. For example, an imatrix derived from a Wikipedia sample is likely to hurt tasks like coding. In other words, while an imatrix can improve specific benchmarks that resemble the imatrix input sample, it also skews model performance towards tasks similar to that sample at the expense of other tasks.
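
For context, an importance matrix is normally generated ahead of quantization by running the full-precision model over a calibration text with llama-imatrix, roughly as sketched below (file names are placeholders); that step was deliberately skipped for this release:

./build/bin/llama-imatrix \
    -m Kimi-K2-Instruct-BF16.gguf \
    -f calibration.txt \
    -o imatrix.dat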

GGUF

  • Model size: 1,026B params
  • Architecture: deepseek2