Slow Token Generation on A100

#13
by kingabzpro - opened

Please tell me what I am doing wrong. I'm seeing 95% GPU RAM usage but 0% GPU compute and 99% CPU load.

Script:

./llama.cpp/llama-cli \
  --model /unsloth/Kimi-K2-Instruct-GGUF/UD-TQ1_0/Kimi-K2-Instruct-UD-TQ1_0-00001-of-00005.gguf \
  --cache-type-k q4_0 \
  --threads -1 \
  --n-gpu-layers 99 \
  --temp 0.6 \
  --min_p 0.01 \
  --ctx-size 16384 \
  --seed 3407 \
  -ot ".ffn_(up|down)_exps.=CPU" \
  --prompt "Hey"
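For reference, the `-ot` (tensor override) regex above keeps the MoE experts' up/down projection weights on CPU while the rest of the layers go to the GPU. A minimal sketch of what that pattern matches, using illustrative tensor names in llama.cpp's GGUF naming scheme (the specific names below are examples, not taken from this model's actual tensor list):

```shell
# Print some example tensor names and filter them with the same
# pattern passed to -ot; only the expert up/down projections match.
printf '%s\n' \
  blk.0.ffn_up_exps.weight \
  blk.0.ffn_down_exps.weight \
  blk.0.ffn_gate_exps.weight \
  blk.0.attn_q.weight \
| grep -E '\.ffn_(up|down)_exps\.'
# Matches: blk.0.ffn_up_exps.weight and blk.0.ffn_down_exps.weight
```

Note that the gate projections (`ffn_gate_exps`) are not matched, so they stay on the GPU with everything else.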

My specs: 1× A100 SXM, 250 GB system RAM + 80 GB VRAM.
[Screenshots of GPU/CPU utilization, 2025-07-18]

Your system RAM usage looks too high. Those separate CPU buffers are inefficient; that happens when you forget to disable memory mapping. Try adding:
--no-mmap
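For example, the same invocation with memory mapping disabled (a sketch using the paths and options from the original post, not a verified fix):

```shell
./llama.cpp/llama-cli \
  --model /unsloth/Kimi-K2-Instruct-GGUF/UD-TQ1_0/Kimi-K2-Instruct-UD-TQ1_0-00001-of-00005.gguf \
  --no-mmap \
  --cache-type-k q4_0 \
  --threads -1 \
  --n-gpu-layers 99 \
  --temp 0.6 \
  --min_p 0.01 \
  --ctx-size 16384 \
  --seed 3407 \
  -ot ".ffn_(up|down)_exps.=CPU" \
  --prompt "Hey"
```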

Also consider using ik_llama with -fmoe and -mla3 to improve speed and reduce memory usage.

--no-mmap didn't improve the speed. I think I will try ik_llama later.
