ik_llama.cpp imatrix Quantizations of Kimi-Dev-72B
This quant collection REQUIRES the ik_llama.cpp fork to support its advanced non-linear SotA quants. Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.! Though it might work in Nexesenex's croco.cpp kobold fork (untested).
smol-IQ3_K 32.273 GiB (3.813 BPW)
- type f32: 401 tensors
- type q4_K: 1 tensor token_embd
- type q6_K: 1 tensor output ("head")
- type iq4_nl: 80 tensors ffn_down
- type iq3_k: 320 tensors attn_(q|output) and ffn_(gate|up)
- type iq4_k: 160 tensors attn_(k|v)
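For reference, a per-tensor mix like this can be expressed with ik_llama.cpp's llama-quantize --custom-q regex overrides. The sketch below only illustrates that mechanism and is not the exact recipe used: the imatrix and model paths are placeholders, and the tensor-name regexes are assumptions based on the list above.

#!/usr/bin/env bash
# Illustrative only: build a similar mixed recipe via --custom-q overrides.
# Paths and the imatrix file below are hypothetical placeholders.
custom_q="\
token_embd\.weight=q4_K,\
output\.weight=q6_K,\
blk\..*\.attn_k\.weight=iq4_k,\
blk\..*\.attn_v\.weight=iq4_k,\
blk\..*\.attn_q\.weight=iq3_k,\
blk\..*\.attn_output\.weight=iq3_k,\
blk\..*\.ffn_down\.weight=iq4_nl,\
blk\..*\.ffn_(gate|up)\.weight=iq3_k"

./build/bin/llama-quantize \
    --imatrix /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/imatrix-Kimi-Dev-72B.dat \
    --custom-q "$custom_q" \
    /mnt/models/Kimi-Dev-72B-BF16.gguf \
    /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf \
    IQ3_K \
    16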
Quickstart
# Clone
git clone git@github.com:ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
# Build (might try adding -DGGML_CUDA_IQK_FORCE_BF16=1 for 3090s and older)
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)
# Run (set threads to the number of physical CPU cores; drop --no-mmap if you prefer mmap for faster startup; adjust ctx/ngl as needed)
./build/bin/llama-server \
--model /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf \
--ctx-size 8192 \
-ctk q8_0 -ctv q8_0 \
-fa \
--no-mmap \
-ngl 48 \
--threads 16 \
--parallel 1 \
--host 127.0.0.1 \
--port 8080
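Once the server is up, a quick way to sanity-check it is an OpenAI-style chat completion request. This assumes llama-server's usual /v1/chat/completions endpoint; adjust host/port to match the flags above.

# Sanity-check the running server (endpoint path is an assumption; match host/port to your flags)
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a hello world in Python."}],
        "max_tokens": 128
      }'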
Benchmarks
Speed
- High-end Gaming Rig Hardware
- AMD 9950X
- Overclocked infinity fabric "gear 1" clocks
- 2x 48GB DDR5@6400 RAM (~87GB/s benchmarked)
- 3090 TI FE 24GB VRAM @ 450 Watts (uncapped)
- PP ~500 tok/sec with 2k batches
- TG ~5 tok/sec limited by RAM i/o bandwidth
./build/bin/llama-sweep-bench \
--model /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf \
--ctx-size 6144 \
-ctk q8_0 -ctv q8_0 \
-fa \
--no-mmap \
-ub 2048 -b 2048 \
-ngl 48 \
--warmup-batch \
--threads 16
ubergarm/Kimi-Dev-72B-smol-IQ3_K
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 3.925 | 521.77 | 103.624 | 4.94 |
| 2048 | 512 | 2048 | 4.058 | 504.63 | 105.265 | 4.86 |
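As a rough sanity check on the RAM-bandwidth claim: with -ngl 48 of the model's 80 layers offloaded (the layer count is an assumption from the Qwen2.5-72B base), roughly 32/80 of the ~32.3 GiB of weights stream out of system RAM for every generated token, which puts the ceiling in the same ballpark as the measured ~4.9 tok/sec.

# Back-of-envelope token-generation ceiling from RAM bandwidth alone
# (layer split and sizes are rough assumptions, not measurements)
awk 'BEGIN {
  weights_gib = 32.273;            # total quantized model size (GiB)
  cpu_frac    = (80 - 48) / 80;    # share of layers left in system RAM with -ngl 48
  ram_bw_gbs  = 87;                # benchmarked RAM bandwidth (GB/s)
  gib_per_tok = weights_gib * cpu_frac;
  printf "~%.1f tok/s upper bound\n", ram_bw_gbs / (gib_per_tok * 1.074);
}'
# prints roughly "~6.3 tok/s upper bound"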
Quality
I tested perplexity across a bunch of experimental quants and decided this one was a decent trade-off between quality and speed.
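For reference, the comparisons were done with runs along these lines; the wiki.test.raw corpus and the exact settings below are assumptions about the setup, not a statement of the precise methodology.

# Example perplexity run for comparing quants (corpus and settings are assumptions)
./build/bin/llama-perplexity \
    --model /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf \
    -f wiki.test.raw \
    -fa \
    -ctk q8_0 -ctv q8_0 \
    -ngl 48 \
    --threads 16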
FAQ
- Why is it smol?
  - I ran out of names making a bunch of similar sized quants for the Perplexity graph above lol.
- Will you make larger GGUFs?
  - Naw, you can get good mainline llama.cpp GGUFs from others already like bartowski and bullerwins.
- Where can I get those hot new EXL3 quants?
  - Check out ArtusDev's collection.
- What about the new iqK_kt QTIP Trellis style quants?
  - I may release something eventually, but they are still pretty fresh, so I'm gonna wait a minute and see if any breaking changes happen before releasing.
  - Also, the column dimension of the ffn_down tensor is not divisible by 256, so I had to use iq4_nl unless something changes. See the quick check below.
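To make the ffn_down point concrete: assuming Kimi-Dev-72B keeps the Qwen2.5-72B FFN intermediate size of 29568 (an assumption based on the base model), that row length is not a multiple of the 256-wide blocks the iqK-style quants need, but it is a multiple of iq4_nl's 32-wide blocks.

# Quick check of the block-size constraint (29568 is an assumed intermediate size)
awk 'BEGIN { n = 29568; printf "n %% 256 = %d, n %% 32 = %d\n", n % 256, n % 32 }'
# prints "n % 256 = 128, n % 32 = 0"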