ik_llama.cpp imatrix Quantizations of Kimi-Dev-72B
This quant collection REQUIRES the ik_llama.cpp fork to support its advanced non-linear SotA quants. Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.! Though it might work in Nexesenex's croco.cpp kobold fork (untested).
smol-IQ3_K 32.273 GiB (3.813 BPW)
- type f32: 401 tensors
- type q4_K: 1 tensor token_embd
- type q6_K: 1 tensor output ("head")
- type iq4_nl: 80 tensors ffn_down
- type iq3_k: 320 tensors attn_(q|output) and ffn_(gate|up)
- type iq4_k: 160 tensors attn_(k|v)
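For reference, a per-tensor mix like this can be expressed with ik_llama.cpp's llama-quantize --custom-q regex overrides. The sketch below only illustrates that mechanism and is not the exact recipe used: the imatrix and model paths are placeholders, and the tensor-name regexes are assumptions based on the list above.

#!/usr/bin/env bash
# Illustrative only: build a similar mixed recipe via --custom-q overrides.
# Paths and the imatrix file below are hypothetical placeholders.
custom_q="\
token_embd\.weight=q4_K,\
output\.weight=q6_K,\
blk\..*\.attn_k\.weight=iq4_k,\
blk\..*\.attn_v\.weight=iq4_k,\
blk\..*\.attn_q\.weight=iq3_k,\
blk\..*\.attn_output\.weight=iq3_k,\
blk\..*\.ffn_down\.weight=iq4_nl,\
blk\..*\.ffn_(gate|up)\.weight=iq3_k"

./build/bin/llama-quantize \
    --imatrix /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/imatrix-Kimi-Dev-72B.dat \
    --custom-q "$custom_q" \
    /mnt/models/Kimi-Dev-72B-BF16.gguf \
    /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf \
    IQ3_K \
    16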
Quickstart
# Clone
git clone git@github.com:ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
# Build (might try adding -DGGML_CUDA_IQK_FORCE_BF16=1 for 3090s and older)
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)
# Run (set threads to the number of physical CPU cores; drop --no-mmap if you prefer mmap for faster startup; adjust ctx/ngl as needed)
./build/bin/llama-server \
--model /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf \
--ctx-size 8192 \
-ctk q8_0 -ctv q8_0 \
-fa \
--no-mmap \
-ngl 48 \
--threads 16 \
--parallel 1 \
--host 127.0.0.1 \
--port 8080
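Once the server is up, a quick way to sanity-check it is an OpenAI-style chat completion request. This assumes llama-server's usual /v1/chat/completions endpoint; adjust host/port to match the flags above.

# Sanity-check the running server (endpoint path is an assumption; match host/port to your flags)
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a hello world in Python."}],
        "max_tokens": 128
      }'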
Benchmarks
Speed
- High-end Gaming Rig Hardware
- AMD 9950X
- Overclocked infinity fabric "gear 1" clocks
- 2x 48GB DDR5@6400 RAM (~87GB/s benchmarked)
- 3090 TI FE 24GB VRAM @ 450 Watts (uncapped)
- PP ~500 tok/sec with 2k batches
- TG ~5 tok/sec limited by RAM i/o bandwidth
./build/bin/llama-sweep-bench \
--model /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf \
--ctx-size 6144 \
-ctk q8_0 -ctv q8_0 \
-fa \
--no-mmap \
-ub 2048 -b 2048 \
-ngl 48 \
--warmup-batch \
--threads 16
ubergarm/Kimi-Dev-72B-smol-IQ3_K
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 3.925 | 521.77 | 103.624 | 4.94 |
| 2048 | 512 | 2048 | 4.058 | 504.63 | 105.265 | 4.86 |
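As a rough sanity check on the RAM-bandwidth claim: with -ngl 48 of the model's 80 layers offloaded (the layer count is an assumption from the Qwen2.5-72B base), roughly 32/80 of the ~32.3 GiB of weights stream out of system RAM for every generated token, which puts the ceiling in the same ballpark as the measured ~4.9 tok/sec.

# Back-of-envelope token-generation ceiling from RAM bandwidth alone
# (layer split and sizes are rough assumptions, not measurements)
awk 'BEGIN {
  weights_gib = 32.273;            # total quantized model size (GiB)
  cpu_frac    = (80 - 48) / 80;    # share of layers left in system RAM with -ngl 48
  ram_bw_gbs  = 87;                # benchmarked RAM bandwidth (GB/s)
  gib_per_tok = weights_gib * cpu_frac;
  printf "~%.1f tok/s upper bound\n", ram_bw_gbs / (gib_per_tok * 1.074);
}'
# prints roughly "~6.3 tok/s upper bound"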
Quality
I tested perplexity across a bunch of experimental quants and decided this one was a decent trade-off between quality and speed.
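For reference, the comparisons were done with runs along these lines; the wiki.test.raw corpus and the exact settings below are assumptions about the setup, not a statement of the precise methodology.

# Example perplexity run for comparing quants (corpus and settings are assumptions)
./build/bin/llama-perplexity \
    --model /mnt/models/ubergarm/Kimi-Dev-72B-GGUF/Kimi-Dev-72B-smol-IQ3_K.gguf \
    -f wiki.test.raw \
    -fa \
    -ctk q8_0 -ctv q8_0 \
    -ngl 48 \
    --threads 16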
FAQ
- Why is it smol?
  - I ran out of names making a bunch of similar sized quants for the Perplexity graph above lol.
- Will you make larger GGUFs?
  - Naw, you can get good mainline llama.cpp GGUFs from others already like bartowski and bullerwins.
- Where can I get those hot new EXL3 quants?
  - Check out ArtusDev's collection.
- What about the new iqK_kt QTIP Trellis style quants?
  - I may release something eventually, but they are still pretty fresh, so I'm gonna wait a minute and see if any breaking changes happen before releasing.
  - Also, the column dimension of the ffn_down tensor is not divisible by 256, so I had to use iq4_nl unless something changes. See the quick check below.
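To make the ffn_down point concrete: assuming Kimi-Dev-72B keeps the Qwen2.5-72B FFN intermediate size of 29568 (an assumption based on the base model), that row length is not a multiple of the 256-wide blocks the iqK-style quants need, but it is a multiple of iq4_nl's 32-wide blocks.

# Quick check of the block-size constraint (29568 is an assumed intermediate size)
awk 'BEGIN { n = 29568; printf "n %% 256 = %d, n %% 32 = %d\n", n % 256, n % 32 }'
# prints "n % 256 = 128, n % 32 = 0"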