Slow Token Generation on A100

#13
by kingabzpro - opened

Please tell me what I am doing wrong. I'm seeing 95% GPU RAM usage but 0% GPU compute and 99% CPU load.

Script:

./llama.cpp/llama-cli \
  --model /unsloth/Kimi-K2-Instruct-GGUF/UD-TQ1_0/Kimi-K2-Instruct-UD-TQ1_0-00001-of-00005.gguf \
  --cache-type-k q4_0 \
  --threads -1 \
  --n-gpu-layers 99 \
  --temp 0.6 \
  --min_p 0.01 \
  --ctx-size 16384 \
  --seed 3407 \
  -ot ".ffn_(up|down)_exps.=CPU" \
  --prompt "Hey"
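For reference, the `-ot` (tensor override) regex above keeps the MoE experts' up/down projection weights on CPU while the rest of the layers go to the GPU. A minimal sketch of what that pattern matches, using illustrative tensor names in llama.cpp's GGUF naming scheme (the specific names below are examples, not taken from this model's actual tensor list):

```shell
# Print some example tensor names and filter them with the same
# pattern passed to -ot; only the expert up/down projections match.
printf '%s\n' \
  blk.0.ffn_up_exps.weight \
  blk.0.ffn_down_exps.weight \
  blk.0.ffn_gate_exps.weight \
  blk.0.attn_q.weight \
| grep -E '\.ffn_(up|down)_exps\.'
# Matches: blk.0.ffn_up_exps.weight and blk.0.ffn_down_exps.weight
```

Note that the gate projections (`ffn_gate_exps`) are not matched, so they stay on the GPU with everything else.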

My specs: 1× A100 SXM, 250 GB system RAM + 80 GB VRAM.
[Screenshots of GPU/CPU utilization, 2025-07-18]

Your system RAM usage looks too high. Those separate CPU buffers are inefficient; that happens when you forget to disable memory mapping. Try adding:
--no-mmap
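For example, the same invocation with memory mapping disabled (a sketch using the paths and options from the original post, not a verified fix):

```shell
./llama.cpp/llama-cli \
  --model /unsloth/Kimi-K2-Instruct-GGUF/UD-TQ1_0/Kimi-K2-Instruct-UD-TQ1_0-00001-of-00005.gguf \
  --no-mmap \
  --cache-type-k q4_0 \
  --threads -1 \
  --n-gpu-layers 99 \
  --temp 0.6 \
  --min_p 0.01 \
  --ctx-size 16384 \
  --seed 3407 \
  -ot ".ffn_(up|down)_exps.=CPU" \
  --prompt "Hey"
```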

Also consider using ik_llama with -fmoe and -mla3 to improve speed and reduce memory usage.

--no-mmap didn't improve the speed. I think I will try ik_llama later.
