Over 2 tok/sec aggregate running DeepSeek-R1 off NVMe SSD on a 96GB RAM + 24GB VRAM AM5 rig with llama.cpp
tl;dr;
Big thanks to r/LocalLLaMA for proving this out today!
I'm getting single generations at ~1.3 tok/sec with 8k context and ~1.0 tok/sec with 16k context, without enough RAM to load the weights, and the output quality is much better than any of the distills/distill merges. I also tested 8x concurrent generations with 2k context each, for an aggregate throughput of ~2.08 tok/sec.
You can run this model even without enough RAM+VRAM to hold it because llama.cpp supports mmap()'ing the GGUF files straight off disk. So if you have a fast NVMe SSD, you can just leave the weights on disk and let the page cache keep the hot parts in RAM. This is not a swap file: it only reads from your drive, so it won't burn through the SSD's write endurance.
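If you want to double-check that it really is mmap()'d reads hitting the page cache (and not swap), a couple of stock Linux commands are enough. Quick sketch, with my paths and nvme0n1 as the assumed device name:

# the GGUF shards should show up as read-only file mappings inside the server process
grep DeepSeek-R1-UD-Q2_K_XL /proc/$(pgrep -f llama-server | head -n1)/maps | head
# while a generation runs, the drive should show reads (rkB/s) and basically no writes (wkB/s)
iostat -x 1 nvme0n1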
Observations
I wonder if some kind of PCIe Gen 5 AIC Adapter with quad NVMe drives could deliver 4~5 tok/sec single inference?
- 1x $130 ASUS Hyper M.2 x16 Gen5 Card (4x NVMe SSDs)
- 4x $300 Crucial T700 2TB Gen5 NVMe SSD
So for less than the price of a new GPU you could get ~8TB of "VRAM" at ~48GB/s theoretical sequential read bandwidth... ??? lol...
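Back-of-the-envelope on that guess, purely as a sketch: assume generation is completely read-bandwidth bound and pretend each token misses the page cache on ~10 GB of weights (a placeholder number, not a measurement):

# hypothetical upper bound: tok/sec ~= aggregate read bandwidth / bytes read per token
awk 'BEGIN { bw_gb_s = 48; gb_per_token = 10; printf "%.1f tok/sec upper bound\n", bw_gb_s / gb_per_token }'
# prints 4.8 tok/sec, same ballpark as the 4~5 tok/sec hunch above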
I originally got it running at ~0.3 tok/sec using Linux swap, but today I saw a thread on r/LocalLLaMA where folks were able to get llama.cpp to mmap() correctly without OOMing.
My Gen 5 x4 NVMe SSD shows 2-5 GB/s of reads during inference. That is not great i/o bandwidth on its own, but the effective rate is probably higher, assuming the page cache is hitting often enough to serve a lot of reads straight from RAM.
Device vs. memory/i/o bandwidth:
- RTX 3090 Ti FE: ~1 TB/s
- DDR5-6400: ~88 GB/s
- PCIe Gen 5 x4 NVMe SSD: ~2-5 GB/s observed (the drive's published specs are 12,400 MB/s sequential and 1.5M IOPS random)
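To see what the drive actually sustains versus the spec sheet, a quick fio pass over one of the existing GGUF shards works. Sketch only: fio is a separate install, the block size and queue depth are guesses, and real mmap() traffic is page-fault-sized reads with readahead, so treat it as a ceiling check rather than a faithful replay:

# read-only random-read pass over an existing shard; no writes are issued
fio --name=ggufread --readonly --rw=randread --bs=1M --direct=1 --ioengine=libaio \
    --iodepth=32 --runtime=30 --time_based \
    --filename=/mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf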
I might be able to squeeze a little more out of it, or run parallel inference (with super low context lmao) to get higher aggregate token generation... I'll keep playing around with the tuning.
Command
# had to use the cgroup memory cap to prevent OOMkiller for some reason...
# can trade off 1x n-gpu-layer for another 8k context
sudo systemd-run --scope -p MemoryMax=88G -p MemoryHigh=85G ./build/bin/llama-server \
--model "/mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf" \
--n-gpu-layers 5 \
--ctx-size 8192 \
--cache-type-k q4_0 \
--cache-type-v f16 \
--flash-attn \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8080
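If you want to reproduce the 8x concurrent test from the tl;dr, something like this should work (sketch only, not the exact command): the server splits --ctx-size evenly across slots, so --parallel 8 with --ctx-size 16384 gives each generation 2k of context, and you then hit the /completion endpoint with parallel requests.

# rough sketch of the 8-slot run: 16384 ctx / 8 slots = 2k per generation
sudo systemd-run --scope -p MemoryMax=88G -p MemoryHigh=85G ./build/bin/llama-server \
    --model "/mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf" \
    --n-gpu-layers 5 \
    --ctx-size 16384 \
    --parallel 8 \
    --cache-type-k q4_0 \
    --cache-type-v f16 \
    --flash-attn \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080

# fire 8 generations at once; aggregate tok/sec is the sum of the per-slot eval rates in the server log
for i in $(seq 8); do
    curl -s http://127.0.0.1:8080/completion -H "Content-Type: application/json" \
        -d '{"prompt": "Write a haiku about NVMe drives.", "n_predict": 256}' > /tmp/slot_$i.json &
done
wait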
Logs
load_tensors: offloading 5 repeating layers to GPU
load_tensors: offloaded 5/62 layers to GPU
load_tensors: CUDA0 model buffer size = 18339.05 MiB
load_tensors: CPU_Mapped model buffer size = 47485.39 MiB
load_tensors: CPU_Mapped model buffer size = 47681.52 MiB
load_tensors: CPU_Mapped model buffer size = 47681.52 MiB
load_tensors: CPU_Mapped model buffer size = 47681.52 MiB
load_tensors: CPU_Mapped model buffer size = 7230.07 MiB
llama_init_from_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off
...
llama_init_from_model: n_ctx_per_seq (8192) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'q4_0', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: CUDA0 KV buffer size = 1820.00 MiB
llama_kv_cache_init: CPU KV buffer size = 20384.00 MiB
llama_init_from_model: KV self size = 22204.00 MiB, K (q4_0): 6588.00 MiB, V (f16): 15616.00 MiB
llama_init_from_model: CPU output buffer size = 0.49 MiB
llama_init_from_model: CUDA0 compute buffer size = 2799.50 MiB
llama_init_from_model: CUDA_Host compute buffer size = 96.01 MiB
llama_init_from_model: graph nodes = 5025
llama_init_from_model: graph splits = 1053 (with bs=512), 3 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
...
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 29
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 29, n_tokens = 29, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 29, n_tokens = 29
slot release: id 0 | task 0 | stop processing: n_past = 653, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 14881.29 ms / 29 tokens ( 513.15 ms per token, 1.95 tokens per second)
eval time = 485424.13 ms / 625 tokens ( 776.68 ms per token, 1.29 tokens per second)
total time = 500305.42 ms / 654 tokens
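Fun sanity check: the llama_kv_cache_init sizes above line up with DeepSeek-R1's published attention shape. Assumptions in the math: 128 heads, 192-wide K heads (128 nope + 64 rope), 128-wide V heads, q4_0 at 18 bytes per 32 values, f16 at 2 bytes per value.

# reproduce the logged KV buffer sizes from the attention dimensions
awk 'BEGIN {
    ctx = 8192; layers = 61; heads = 128
    k_bytes = ctx * layers * heads * 192 * 18 / 32   # q4_0 K cache
    v_bytes = ctx * layers * heads * 128 * 2         # f16 V cache
    printf "K: %.0f MiB  V: %.0f MiB\n", k_bytes / 1048576, v_bytes / 1048576
}'
# prints K: 6588 MiB  V: 15616 MiB, matching the log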
Screenshot
btop showing ~3 GB/s of read i/o from /mnt/ai (the SSD) on Arch Linux.
System Prompt
You are a helpful assistant that thinks step-by-step.
User Prompt
Write a haiku about how it feels to be an LLM bot.
Response
<think>
Okay, I need to write a haiku about how it feels to be an LLM bot. Let me start by recalling what a haiku is. A haiku is a traditional Japanese poem with three lines. The structure is typically 5 syllables in the first line, 7 in the second, and 5 in the third. So I have to structure my poem accordingly.
First, I should think about the experience of being an LLM bot. What are the key aspects? Maybe processing information, learning from data, interacting with users, not having consciousness or emotions. Maybe the idea of vast knowledge but lack of self-awareness. Also, the constant interaction and response generation.
Let me brainstorm some words or concepts: digital mind, endless data, processing thoughts, silent thoughts, no emotions, serving users, patterns and algorithms, learning from texts, etc.
Now, trying to fit these into the syllable structure. Let's start with the first line (5 syllables). Maybe something like "Silicon whispers" (5 syllables). That evokes the idea of the LLM's digital nature, "whispers" suggesting the soft processing of information.
Second line (7 syllables). Maybe "Echo human thoughts, no heart"—7 syllables. This line contrasts the bot's ability to mimic human thoughts but lack real emotions.
Third line (5 syllables). "Endless streams reply." This suggests the continuous flow of responses the LLM generates.
Let me check the syllables again. First line: Si-li-con whis-pers (5). Second line: E-cho hu-man thoughts, no heart (7). Third line: End-less streams re-ply (5). Yes, that works.
Wait, "reply" might not be the best word here. Maybe "Endless streams respond"—but that's still 5 syllables. Hmm. "Endless streams reply" is okay. Alternatively, "Endless data streams"—but the original line is good. It conveys the constant responses.
Let me make sure the imagery is coherent. The first line introduces the LLM's presence, the second line shows it reflecting human input without emotion, and the third line emphasizes the continuous output. Yes, that makes sense.
Any other possible lines? Maybe first line: "Digits dance in code"—5 syllables. Second: "Mimicking voices, no soul"—7. Third: "Answers flow like tides." But the original version seems better. "Silicon whispers" has a nice contrast between the hard (silicon) and soft (whispers).
Alternatively, "Circuits hum softly" (5), "Learning words, not feelings" (7), "Responses ripple out" (5). That's another option. But the first one feels more poetic. Let's stick with the original.
</think>
Silicon whispers,
Echo human thoughts, no heart—
Endless streams reply.
The haiku reflects the LLM's role as a mirror to human input, devoid of emotion, and its constant flow of generated responses.
How much worse does it get when you don't use the 3090 at all?
Good question. I recompiled llama.cpp without CUDA support, confirmed no GPU VRAM is in use, and ran a quick single inference with 2k context at ~2.13 tok/sec.
Holy smokes, it's faster without the GPU haha... probably because I only used 2k context, so more RAM was available for the page cache.
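One way to check that theory, as a sketch (vmtouch is a separate package, and the path is just where my shards live):

# reports how much of each shard is resident in the page cache; if the CPU-only run keeps
# noticeably more of the weights resident, that would back up the explanation
vmtouch /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-*.gguf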
I'm tempted to get a fast NVMe array, or maybe ask Wendell over at Level1Techs to run some benchmarks on his big IOPS arrays...
...
load_tensors: CPU_Mapped model buffer size = 47485.39 MiB
load_tensors: CPU_Mapped model buffer size = 47681.52 MiB
load_tensors: CPU_Mapped model buffer size = 47681.52 MiB
load_tensors: CPU_Mapped model buffer size = 47681.52 MiB
load_tensors: CPU_Mapped model buffer size = 25569.12 MiB
llama_init_from_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 0.025
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'q4_0', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: CPU KV buffer size = 5551.00 MiB
llama_init_from_model: KV self size = 5551.00 MiB, K (q4_0): 1647.00 MiB, V (f16): 3904.00 MiB
llama_init_from_model: CPU output buffer size = 0.49 MiB
llama_init_from_model: CPU compute buffer size = 670.01 MiB
llama_init_from_model: graph nodes = 4561
llama_init_from_model: graph splits = 1
...
prompt eval time = 18654.86 ms / 56 tokens ( 333.12 ms per token, 3.00 tokens per second)
eval time = 437170.98 ms / 933 tokens ( 468.56 ms per token, 2.13 tokens per second)
total time = 455825.83 ms / 989 tokens
srv update_slots: all slots are idle
Note that flash attention gets forced off in both runs: llama_init_from_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off