Good job, your work is on time and so cool.
And good news: glm4moe is now supported in llama.cpp (b6085).
@huccjj yep folks have been working together to get support added to (ik_)llama.cpp.
be aware i'll end up deleting this existing EXPERIMENTAL gguf and replacing it with new ones once this PR on ik_llama.cpp is updated and finalized: https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3152087986
Coming Soon :tm: haha... thanks!
I'll wait till the new one is out :) been downloading like crazy recently, thanks sir for the hard work!!
The new Air is available, with instructions on how to get it working until PR support is merged into the main branch of ik_llama.cpp.
And right!? So many models, OMG haha.. But GLM-4.5 and the Air version seem pretty good for the size. I hope to have some quants of the bigger one tomorrow, and the imatrix is already cooking!
lol ok! i just downloaded and i am building ik. :)
I purchased 5x 5090s and I am planning to sell the 2x 6000 Pros. I cannot justify keeping them. I will still have significantly good processing power and 160 GB of VRAM across the 5x 5090s though.
I will be able to run the GLM Air models on 3x 5090s, one of the R1-0528 models on 1x 5090, and some image generation on 1x 5090. I think I have this all set up correctly how I want it. :)
Interesting, 3x 5090s for GLM Air...
I'm running a 6000 Pro myself and I'm curious if there's any noticeable performance boost using ik_llama.cpp over standard mainline llama.cpp for pure-CUDA inference, given that there are some MoE-specific optimizations?
I'd otherwise try hybrid inference on the larger GLM 4.5 lol, if it weren't for the fact that the big GPU is confined to my Windows gaming machine for the moment, and Windows is suboptimal for hybrid inference (doesn't seem to be a way to avoid RAM OOM or paging out if the weights are larger than system RAM).
For a full offload situation it can vary; a lot of the CUDA implementations are somewhat similar to mainline, but you have access to better quality quants and stuff like `-fmoe`. You can A/B test a specific quant and offload configuration using `llama-sweep-bench` (basically just replace your usual `llama-server ...` command, add `--warmup-batch`, and use like 20k context). You can use my mainline branch which has it too, here: https://github.com/ubergarm/llama.cpp/tree/ug/port-sweep-bench
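As a rough sketch (the model path and flag values here are just placeholders, not a recommendation), the A/B comparison could look something like:

```bash
# ik_llama.cpp build: take the flags you would normally pass to llama-server,
# swap in llama-sweep-bench, add --warmup-batch, and use a ~20k context.
./build/bin/llama-sweep-bench \
  -m /models/GLM-4.5-Air-Q5_K_S.gguf \
  -c 20480 -ngl 999 -fa -fmoe --warmup-batch

# For the mainline side, build the ug/port-sweep-bench branch linked above and
# run the equivalent command there (dropping ik-specific flags such as -fmoe).
```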
For Windows, right, some folks have complained about multi-GPU issues, and yeah I'd not expect it to handle the paging/full-RAM situation as gracefully (maybe disabling swap on Windows somehow would help?). There are some Windows builds here if you want to test: https://github.com/Thireus/ik_llama.cpp/releases but I'm not sure if the new GLM-4.5 branch that just got merged into main is in that release yet.
Keep us posted what you find and feel free to share your full commands for workshopping etc!
Also, specifically for GLM-4.5, this performance PR just got made.
You beat me to it. I wanted to wait for the `sweep-bench` results from @Thireus before announcing the PR to a wider audience.
Concerning the `-amb` flag: the flag has no effect when using flash attention and not dealing with a DeepSeek model. It was implemented specifically for DeepSeek R1, V3, etc., initially to reduce the compute buffer size without FA because FA did not work for DeepSeek. Later it was extended to also cover the more advanced MLA implementation in ik_llama.cpp (`mla = 2` or `mla = 3`). So, this will not solve the OOM problem. But depending on the use case (e.g., best possible PP performance is more important than TG performance), one can just leave the routed experts for a few layers on the CPU. How many layers are needed depends on how much VRAM one has, the quantization type of the experts, and the `-b` / `-ub` one wants to use. If the number of required layers is small, this will have a relatively minor impact on TG performance. For GLM-4.5-Air, the routed experts are 2.2B parameters per layer, so for the `Q5_K_S` model that was discussed above each layer left on the CPU will free up 1.42 GiB of VRAM to use for compute buffers. I see the CUDA compute buffer being 813 MiB with the default batch/u-batch size, and 2432 MiB with `-b 4096 -ub 4096`, so potentially just a single layer is enough, and almost for sure not more than 2.
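As a concrete sketch (not from the post above: the layer indices, tensor-name regex, and model path are assumptions for illustration), leaving the routed experts of the last couple of layers on the CPU via `-ot`/`--override-tensor` would look roughly like:

```bash
# Offload everything except the routed experts (the ffn_*_exps tensors) of two
# layers, which stay on the CPU; for the Q5_K_S Air quant each such layer frees
# up roughly 1.4 GiB of VRAM for the larger -b/-ub compute buffers.
./build/bin/llama-server \
  -m /models/GLM-4.5-Air-Q5_K_S.gguf \
  -ngl 999 -fa -fmoe \
  -b 4096 -ub 4096 \
  -ot "blk\.(44|45)\.ffn_.*_exps.*=CPU"
```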
So, people have confirmed that the ik_llama.cpp performance issue for GLM-4.5 models has been fixed with this PR. Depending on GPU, OS, and context length, ik_llama.cpp is now between slightly slower and much faster than llama.cpp.
Built and tested the new PR for GLM 4.5 - it seems that now TG speed is on par with mainline llama.cpp on my setup.
./llama-sweep-bench.exe -m "C:\ML\GGUF\GLM-4.5-Air-Q5_K_S.gguf" -c 32768 -ngl 999 -fa -fmoe --no-mmap -t 1 --warmup-batch
Prompt processing in ik_llama.cpp is still far slower than mainline llama.cpp at `-ub 512` though, which is what I'm still trying to figure out. It's not a small gap at all: around 40-50% slower depending on the context length. ik_llama.cpp seems to be very sensitive to `-ub`, where smaller values tank the prompt processing speed, while mainline llama.cpp doesn't experience this. Empirically, at least `-ub 2048` seems to be needed for ik_llama.cpp to surpass mainline llama.cpp's prompt processing speed on my setup - wondering if this is a bug? Mainline here (also at `-ub 512`):
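Purely for illustration (the batch values are just an example and the rest mirrors the command above, not the poster's exact runs), the larger-`-ub` sweep on the ik_llama.cpp side would be along the lines of:

```bash
# Same sweep as before, but with -b/-ub raised to 2048, which is roughly where
# ik_llama.cpp's prompt processing was reported to catch up to mainline here.
./llama-sweep-bench.exe -m "C:\ML\GGUF\GLM-4.5-Air-Q5_K_S.gguf" -c 32768 -ngl 999 -fa -fmoe --no-mmap -t 1 --warmup-batch -b 2048 -ub 2048
```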
Okay this is great! That PR solved the performance issue for me.
GLM-4.5 (big version) is now on par with llama.cpp for text generation, and faster for prompt processing. This is on Linux with 6x3090 fully offloaded.
Question: Is there a way to stop the full prompt/response being dumped to the terminal? The latest build is doing this.
I noticed that recently too, that client requests are getting logged to stdout or stderr or something on llama-server. I'm not sure how to turn that off; I don't recall that being a thing until maybe a couple of weeks ago?
$ ./build/bin/llama-server --help
-v, --verbose print verbose information
--verbosity N set specific verbosity level (default: 0)
--verbose-prompt print a verbose prompt before generation (default: false)
--no-display-prompt don't print prompt at generation (default: false)
I have been looking for a way to turn it off too; `--log-disable` didn't do it if I recall. Perhaps this `--no-display-prompt` will do it? I'll have to try too, let me know if you figure it out haha...
Yeah, I think it started happening around the time they were working on tool calls, so I figure it was leftover hard-coded debug output.
I've been working around it by appending this:
`| grep -v 'format_partial_response_oaicompat'`
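In context, the full invocation ends up looking something like this (the model path and other flags are just placeholders; the `2>&1` is added on the assumption the log might go to stderr):

```bash
# Run llama-server as usual, but filter the hard-coded request/response dump
# out of the console output; 2>&1 merges stderr so the grep catches the log
# lines regardless of which stream they are written to.
./build/bin/llama-server -m /models/GLM-4.5-Air-Q5_K_S.gguf -ngl 999 -fa -fmoe 2>&1 \
  | grep -v 'format_partial_response_oaicompat'
```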
I'll try that `--no-display-prompt`.