ik_llama.cpp imatrix Quantizations of Kimi-Dev-72B

This quant collection REQUIRES the ik_llama.cpp fork to support its advanced non-linear SotA quants. Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.! They might work in Nexesenex's croco.cpp kobold fork, though that is untested.

smol-IQ3_K 32.273 GiB (3.813 BPW)

  • type f32: 401 tensors
  • type q4_K: 1 tensor token_embd
  • type q6_K: 1 tensor output ("head")
  • type iq4_nl: 80 tensors ffn_down
  • type iq3_k: 320 tensors attn_(q|o), ffn_(gate|up)
  • type iq4_k: 160 tensors attn_(k|v)
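
If you want to pull just this file from Hugging Face for the Quickstart below, something like the following works (the huggingface_hub CLI is my suggestion, not part of the original instructions; adjust --local-dir to wherever you keep models):

# Fetch only the smol-IQ3_K file (requires: pip install -U "huggingface_hub[cli]")
huggingface-cli download ubergarm/Kimi-Dev-72B-GGUF \
    --include "Kimi-Dev-72B-smol-IQ3_K.gguf" \
    --local-dir /mnt/models/ubergarm/Kimi-Dev-72B-GGUF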

Quickstart

# Clone
git clone git@github.com:ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp

# Build (might try adding -DGGML_CUDA_IQK_FORCE_BF16=1 for 3090s and older)
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)
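
If you are on a 3090-class or older card and want to try the flag mentioned in the comment above, it simply gets appended to the same configure step, e.g.:

# Alternative configure with the BF16 fallback for Ampere and older GPUs
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)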

# Run (set --threads to your number of physical CPU cores; drop --no-mmap if you prefer mmap's faster startup; adjust ctx/ngl as needed)
./build/bin/llama-server \
    --model /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf \
    --ctx-size 8192 \
    -ctk q8_0 -ctv q8_0 \
    -fa \
    --no-mmap \
    -ngl 48 \
    --threads 16 \
    --parallel 1 \
    --host 127.0.0.1 \
    --port 8080
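
Once the server is up, you can sanity-check it against the standard llama.cpp OpenAI-compatible endpoint (assumed unchanged in the ik_llama.cpp fork):

# Quick smoke test against the chat completions endpoint
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Write hello world in Python."}], "max_tokens": 128}'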

Benchmarks

Speed

  • High-end Gaming Rig Hardware
    • AMD Ryzen 9 9950X
    • Overclocked Infinity Fabric, "gear 1" clocks
    • 2x 48GB DDR5-6400 RAM (~87 GB/s benchmarked)
    • RTX 3090 Ti FE, 24GB VRAM @ 450 W (power limit uncapped)
  • PP ~500 tok/sec with 2k batches
  • TG ~5 tok/sec, limited by RAM I/O bandwidth
./build/bin/llama-sweep-bench \
    --model /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf \
    --ctx-size 6144 \
    -ctk q8_0 -ctv q8_0 \
    -fa \
    --no-mmap \
    -ub 2048 -b 2048 \
    -ngl 48 \
    --warmup-batch \
    --threads 16

ubergarm/Kimi-Dev-72B-smol-IQ3_K

| PP   | TG  | N_KV | T_PP s | S_PP t/s | T_TG s  | S_TG t/s |
|------|-----|------|--------|----------|---------|----------|
| 2048 | 512 | 0    | 3.925  | 521.77   | 103.624 | 4.94     |
| 2048 | 512 | 2048 | 4.058  | 504.63   | 105.265 | 4.86     |
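
Those TG numbers line up with a rough back-of-envelope (my own estimate, not from the original card): with -ngl 48 of 80 layers offloaded, roughly 40% of the ~34.7 GB of weights still stream from system RAM every token, so ~87 GB/s of bandwidth caps TG around 6 tok/sec:

# Back-of-envelope TG ceiling: RAM bandwidth / (bytes of weights read from RAM per token)
awk 'BEGIN { printf "%.1f tok/sec\n", 87 / (34.7 * 0.4) }'   # ~6.3, consistent with the measured ~4.9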

Quality

I tested perplexity on a bunch of experimental quants and decided this one was a decent trade-off between quality and speed.
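
If you want to run your own comparison, the usual tool is llama-perplexity; the corpus and settings below are just my assumptions for illustration, not necessarily what was used for the chart:

# Example perplexity run (wiki.test.raw and these settings are assumptions, not the exact setup used here)
./build/bin/llama-perplexity \
    --model /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf \
    -f wiki.test.raw \
    -fa \
    -ngl 48 \
    --threads 16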

Perplexity Chart

FAQ

  1. Why is it smol?
  • I ran out of names making a bunch of similar-sized quants for the perplexity graph above, lol.
  2. Will you make larger GGUFs?
  • Naw, you can already get good mainline llama.cpp GGUFs from others like bartowski and bullerwins.
  3. Where can I get those hot new EXL3 quants?
  4. What about the new iqK_kt QTIP trellis-style quants?
  • I may release something eventually, but they are still pretty fresh, so I'm gonna wait a minute and see if any breaking changes happen before releasing.
  • Also, the column dimension of the ffn_down tensor is not divisible by 256, so I had to use iq4_nl there unless something changes (quick check below).
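
For the curious, that last point is easy to check: Qwen2.5-72B's FFN intermediate size (the row length of ffn_down) is 29568 per the upstream config, ik's K/trellis quants pack 256-element superblocks, and iq4_nl packs 32-element blocks:

# 29568 = ffn intermediate size of Qwen2.5-72B (value taken from the upstream model config)
echo $(( 29568 % 256 ))   # 128 -> not divisible, so 256-superblock quants can't be used for ffn_down
echo $(( 29568 % 32 ))    # 0   -> iq4_nl's 32-element blocks fit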
