Model Card

High-quality quantization of Kimi-K2-Instruct, without using an imatrix.

Run

System Requirements

  • 24GB VRAM
  • 768GB RAM

You may be able to run with 512GB of RAM by removing --no-mmap and -rtr, at a performance cost that depends on which MoE experts are activated by your prompt.

Run with ik_llama.cpp, 32GB VRAM

./build/bin/llama-server \
    --alias anikifoss/Kimi-K2-Instruct-DQ4_K \
    --model /mnt/data/Models/anikifoss/Kimi-K2-Instruct-DQ4_K/Kimi-K2-Instruct-DQ4_K-00001-of-00014.gguf \
    --no-mmap -rtr \
    --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 \
    --ctx-size 131072 \
    -ctk f16 \
    -mla 3 -fa \
    -amb 512 \
    -b 2048 -ub 2048 \
    -fmoe \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 32 \
    --threads-batch 64 \
    --host 127.0.0.1 \
    --port 8090

Run with llama.cpp, 32GB VRAM

./build/bin/llama-server \
    --alias anikifoss/Kimi-K2-Instruct-DQ4_K \
    --model /mnt/data/Models/anikifoss/Kimi-K2-Instruct-DQ4_K/Kimi-K2-Instruct-DQ4_K-00001-of-00014.gguf \
    --no-mmap \
    --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 \
    --ctx-size 131072 \
    -ctk f16 \
    -fa \
    -b 2048 -ub 2048 \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 32 \
    --threads-batch 64 \
    --host 127.0.0.1 \
    --port 8090
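
Once either server is up, it exposes an OpenAI-compatible API. A minimal sketch of a request against the endpoint configured above (the prompt and sampling fields are placeholders; adjust them to your use case):

curl http://127.0.0.1:8090/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "anikifoss/Kimi-K2-Instruct-DQ4_K",
        "messages": [{"role": "user", "content": "Write a haiku about quantization."}],
        "temperature": 0.5
    }'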

Quantization Approach

  • Keep all the small F32 tensors untouched
  • Quantize all the attention and related tensors to Q8_0
  • Quantize all the ffn_down_exps tensors to Q6_K
  • Quantize all the ffn_up_exps and ffn_gate_exps tensors to Q4_K
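
For reference, a recipe along these lines can be approximated with per-tensor overrides in llama-quantize. The sketch below assumes a recent llama.cpp build that supports --tensor-type (ik_llama.cpp uses its own custom-quant syntax); the file names and tensor-name patterns are illustrative only:

./build/bin/llama-quantize \
    --output-tensor-type q8_0 \
    --tensor-type "attn=q8_0" \
    --tensor-type "ffn_down_exps=q6_k" \
    --tensor-type "ffn_up_exps=q4_k" \
    --tensor-type "ffn_gate_exps=q4_k" \
    Kimi-K2-Instruct-BF16.gguf Kimi-K2-Instruct-DQ4_K.gguf Q4_K 64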

No imatrix

Generally, an imatrix is not recommended for Q4 and larger quants. The problem with an imatrix is that it guides what the model remembers, while anything not covered by the text sample used to generate the imatrix is more likely to be forgotten. For example, an imatrix derived from a Wikipedia sample is likely to hurt tasks like coding. In other words, while an imatrix can improve specific benchmarks that resemble the imatrix input sample, it also skews model performance towards tasks similar to that sample at the expense of other tasks.
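
For context, an importance matrix is normally generated ahead of quantization by running the full-precision model over a calibration text with llama-imatrix, roughly as sketched below (file names are placeholders); that step was deliberately skipped for this release:

./build/bin/llama-imatrix \
    -m Kimi-K2-Instruct-BF16.gguf \
    -f calibration.txt \
    -o imatrix.dat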

GGUF

  • Model size: 1,026B params
  • Architecture: deepseek2