Over 2 tok/sec aggregate running DeepSeek-R1 off NVMe SSD on a 96GB RAM + 24GB VRAM AM5 rig with llama.cpp
tl;dr;
Big thanks to r/LocalLLaMA for proving this out today!
I'm getting single generations at ~1.3 tok/sec with 8k context and ~1.0 tok/sec with 16k context, without enough RAM to load the weights, and the output quality is much better than any of the distills/distill merges. I also tested 8x concurrent generations with 2k context each, for an aggregate throughput of ~2.08 tok/sec.
You can run this model even without enough RAM+VRAM to hold it because llama.cpp supports mmap()'ing the GGUF files straight off disk. So if you have a fast NVMe SSD, you can just leave the weights on disk and let the page cache keep the hot parts in RAM. This is not a swap file: it only reads from your drive, so it won't burn through the SSD's write endurance.
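If you want to double-check that it really is mmap()'d reads hitting the page cache (and not swap), a couple of stock Linux commands are enough. Quick sketch, with my paths and nvme0n1 as the assumed device name:

# the GGUF shards should show up as read-only file mappings inside the server process
grep DeepSeek-R1-UD-Q2_K_XL /proc/$(pgrep -f llama-server | head -n1)/maps | head
# while a generation runs, the drive should show reads (rkB/s) and basically no writes (wkB/s)
iostat -x 1 nvme0n1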
Observations
I wonder if some kind of PCIe Gen 5 AIC Adapter with quad NVMe drives could deliver 4~5 tok/sec single inference?
- 1x $130 ASUS Hyper M.2 x16 Gen5 Card (4x NVMe SSDs)
- 4x $300 Crucial T700 2TB Gen5 NVMe SSD
So for less than the price of a new GPU you could get ~8TB of "VRAM" at ~48GB/s theoretical sequential read bandwidth... ??? lol...
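Back-of-the-envelope on that guess, purely as a sketch: assume generation is completely read-bandwidth bound and pretend each token misses the page cache on ~10 GB of weights (a placeholder number, not a measurement):

# hypothetical upper bound: tok/sec ~= aggregate read bandwidth / bytes read per token
awk 'BEGIN { bw_gb_s = 48; gb_per_token = 10; printf "%.1f tok/sec upper bound\n", bw_gb_s / gb_per_token }'
# prints 4.8 tok/sec, same ballpark as the 4~5 tok/sec hunch above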
I originally got it running at ~0.3 tok/sec using Linux swap, but today I saw a thread on r/LocalLLaMA where folks were able to get llama.cpp to mmap() correctly without OOMing.
My Gen 5 x4 NVMe SSD shows 2-5 GB/s of reads during inference. That is not great i/o bandwidth on its own, but the effective rate is probably higher, assuming the page cache is hitting often enough to serve a lot of reads straight from RAM.
Device vs. memory/i/o bandwidth:
- RTX 3090 Ti FE: ~1 TB/s
- DDR5-6400: ~88 GB/s
- PCIe Gen 5 x4 NVMe SSD: ~2-5 GB/s observed (the drive's published specs are 12,400 MB/s sequential and 1.5M IOPS random)
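To see what the drive actually sustains versus the spec sheet, a quick fio pass over one of the existing GGUF shards works. Sketch only: fio is a separate install, the block size and queue depth are guesses, and real mmap() traffic is page-fault-sized reads with readahead, so treat it as a ceiling check rather than a faithful replay:

# read-only random-read pass over an existing shard; no writes are issued
fio --name=ggufread --readonly --rw=randread --bs=1M --direct=1 --ioengine=libaio \
    --iodepth=32 --runtime=30 --time_based \
    --filename=/mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf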
I might be able to squeeze a little more out of it, or run parallel inference (with super low context lmao) to get higher aggregate token generation... I'll keep playing around with the tuning.
Command
# had to use the cgroup memory cap to prevent OOMkiller for some reason...
# can trade off 1x n-gpu-layer for another 8k context
sudo systemd-run --scope -p MemoryMax=88G -p MemoryHigh=85G ./build/bin/llama-server \
--model "/mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf" \
--n-gpu-layers 5 \
--ctx-size 8192 \
--cache-type-k q4_0 \
--cache-type-v f16 \
--flash-attn \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8080
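If you want to reproduce the 8x concurrent test from the tl;dr, something like this should work (sketch only, not the exact command): the server splits --ctx-size evenly across slots, so --parallel 8 with --ctx-size 16384 gives each generation 2k of context, and you then hit the /completion endpoint with parallel requests.

# rough sketch of the 8-slot run: 16384 ctx / 8 slots = 2k per generation
sudo systemd-run --scope -p MemoryMax=88G -p MemoryHigh=85G ./build/bin/llama-server \
    --model "/mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf" \
    --n-gpu-layers 5 \
    --ctx-size 16384 \
    --parallel 8 \
    --cache-type-k q4_0 \
    --cache-type-v f16 \
    --flash-attn \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080

# fire 8 generations at once; aggregate tok/sec is the sum of the per-slot eval rates in the server log
for i in $(seq 8); do
    curl -s http://127.0.0.1:8080/completion -H "Content-Type: application/json" \
        -d '{"prompt": "Write a haiku about NVMe drives.", "n_predict": 256}' > /tmp/slot_$i.json &
done
wait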
Logs
load_tensors: offloading 5 repeating layers to GPU
load_tensors: offloaded 5/62 layers to GPU
load_tensors: CUDA0 model buffer size = 18339.05 MiB
load_tensors: CPU_Mapped model buffer size = 47485.39 MiB
load_tensors: CPU_Mapped model buffer size = 47681.52 MiB
load_tensors: CPU_Mapped model buffer size = 47681.52 MiB
load_tensors: CPU_Mapped model buffer size = 47681.52 MiB
load_tensors: CPU_Mapped model buffer size = 7230.07 MiB
llama_init_from_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off
...
llama_init_from_model: n_ctx_per_seq (8192) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'q4_0', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: CUDA0 KV buffer size = 1820.00 MiB
llama_kv_cache_init: CPU KV buffer size = 20384.00 MiB
llama_init_from_model: KV self size = 22204.00 MiB, K (q4_0): 6588.00 MiB, V (f16): 15616.00 MiB
llama_init_from_model: CPU output buffer size = 0.49 MiB
llama_init_from_model: CUDA0 compute buffer size = 2799.50 MiB
llama_init_from_model: CUDA_Host compute buffer size = 96.01 MiB
llama_init_from_model: graph nodes = 5025
llama_init_from_model: graph splits = 1053 (with bs=512), 3 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
...
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 29
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 29, n_tokens = 29, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 29, n_tokens = 29
slot release: id 0 | task 0 | stop processing: n_past = 653, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 14881.29 ms / 29 tokens ( 513.15 ms per token, 1.95 tokens per second)
eval time = 485424.13 ms / 625 tokens ( 776.68 ms per token, 1.29 tokens per second)
total time = 500305.42 ms / 654 tokens
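Fun sanity check: the llama_kv_cache_init sizes above line up with DeepSeek-R1's published attention shape. Assumptions in the math: 128 heads, 192-wide K heads (128 nope + 64 rope), 128-wide V heads, q4_0 at 18 bytes per 32 values, f16 at 2 bytes per value.

# reproduce the logged KV buffer sizes from the attention dimensions
awk 'BEGIN {
    ctx = 8192; layers = 61; heads = 128
    k_bytes = ctx * layers * heads * 192 * 18 / 32   # q4_0 K cache
    v_bytes = ctx * layers * heads * 128 * 2         # f16 V cache
    printf "K: %.0f MiB  V: %.0f MiB\n", k_bytes / 1048576, v_bytes / 1048576
}'
# prints K: 6588 MiB  V: 15616 MiB, matching the log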
Screenshot
btop showing ~3 GB/s of read i/o from /mnt/ai (the SSD) on Arch Linux.
System Prompt
You are a helpful assistant that thinks step-by-step.
User Prompt
Write a haiku about how it feels to be an LLM bot.
Response
<think>
Okay, I need to write a haiku about how it feels to be an LLM bot. Let me start by recalling what a haiku is. A haiku is a traditional Japanese poem with three lines. The structure is typically 5 syllables in the first line, 7 in the second, and 5 in the third. So I have to structure my poem accordingly.
First, I should think about the experience of being an LLM bot. What are the key aspects? Maybe processing information, learning from data, interacting with users, not having consciousness or emotions. Maybe the idea of vast knowledge but lack of self-awareness. Also, the constant interaction and response generation.
Let me brainstorm some words or concepts: digital mind, endless data, processing thoughts, silent thoughts, no emotions, serving users, patterns and algorithms, learning from texts, etc.
Now, trying to fit these into the syllable structure. Let's start with the first line (5 syllables). Maybe something like "Silicon whispers" (5 syllables). That evokes the idea of the LLM's digital nature, "whispers" suggesting the soft processing of information.
Second line (7 syllables). Maybe "Echo human thoughts, no heart"—7 syllables. This line contrasts the bot's ability to mimic human thoughts but lack real emotions.
Third line (5 syllables). "Endless streams reply." This suggests the continuous flow of responses the LLM generates.
Let me check the syllables again. First line: Si-li-con whis-pers (5). Second line: E-cho hu-man thoughts, no heart (7). Third line: End-less streams re-ply (5). Yes, that works.
Wait, "reply" might not be the best word here. Maybe "Endless streams respond"—but that's still 5 syllables. Hmm. "Endless streams reply" is okay. Alternatively, "Endless data streams"—but the original line is good. It conveys the constant responses.
Let me make sure the imagery is coherent. The first line introduces the LLM's presence, the second line shows it reflecting human input without emotion, and the third line emphasizes the continuous output. Yes, that makes sense.
Any other possible lines? Maybe first line: "Digits dance in code"—5 syllables. Second: "Mimicking voices, no soul"—7. Third: "Answers flow like tides." But the original version seems better. "Silicon whispers" has a nice contrast between the hard (silicon) and soft (whispers).
Alternatively, "Circuits hum softly" (5), "Learning words, not feelings" (7), "Responses ripple out" (5). That's another option. But the first one feels more poetic. Let's stick with the original.
</think>
Silicon whispers,
Echo human thoughts, no heart—
Endless streams reply.
The haiku reflects the LLM's role as a mirror to human input, devoid of emotion, and its constant flow of generated responses.
How much worse does it get when you don't use the 3090 at all?
Good question. I recompiled llama.cpp without CUDA support, confirmed no GPU VRAM is in use, and ran a quick single inference with 2k context at ~2.13 tok/sec.
Holy smokes, it's faster without the GPU haha... probably because I only used 2k context, so more RAM was available for the page cache.
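One way to check that theory, as a sketch (vmtouch is a separate package, and the path is just where my shards live):

# reports how much of each shard is resident in the page cache; if the CPU-only run keeps
# noticeably more of the weights resident, that would back up the explanation
vmtouch /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-*.gguf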
I'm tempted to get a fast NVMe array, or maybe ask Wendell over at Level1Techs to run some benchmarks on his big IOPS arrays...
...
load_tensors: CPU_Mapped model buffer size = 47485.39 MiB
load_tensors: CPU_Mapped model buffer size = 47681.52 MiB
load_tensors: CPU_Mapped model buffer size = 47681.52 MiB
load_tensors: CPU_Mapped model buffer size = 47681.52 MiB
load_tensors: CPU_Mapped model buffer size = 25569.12 MiB
llama_init_from_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 0.025
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'q4_0', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: CPU KV buffer size = 5551.00 MiB
llama_init_from_model: KV self size = 5551.00 MiB, K (q4_0): 1647.00 MiB, V (f16): 3904.00 MiB
llama_init_from_model: CPU output buffer size = 0.49 MiB
llama_init_from_model: CPU compute buffer size = 670.01 MiB
llama_init_from_model: graph nodes = 4561
llama_init_from_model: graph splits = 1
...
prompt eval time = 18654.86 ms / 56 tokens ( 333.12 ms per token, 3.00 tokens per second)
eval time = 437170.98 ms / 933 tokens ( 468.56 ms per token, 2.13 tokens per second)
total time = 455825.83 ms / 989 tokens
srv update_slots: all slots are idle
Note that flash attention gets forced off in both runs: llama_init_from_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off