Good job, your work is on time and so cool.
And good news: glm4moe is now supported in llama.cpp (b6085).
@huccjj yep folks have been working together to get support added to (ik_)llama.cpp.
be aware i'll end up deleting this existing EXPERIMENTAL gguf and replacing it with new ones once this PR on ik_llama.cpp is updated and finalized: https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3152087986
Coming Soon :tm: haha... thanks!
I'll wait till the new one is out :) been downloading like crazy recently, thanks sir for the hard work!!
The new Air is available, with instructions on how to get it working until PR support is merged into the main branch of ik_llama.cpp.
And right!? So many models, OMG haha.. But GLM-4.5 and the Air version seem pretty good for the size. I hope to have some quants of the bigger one tomorrow, and the imatrix is already cooking!
lol ok! i just downloaded and i am building ik. :)
I purchased 5x 5090s and I am planning to sell the 2x 6000 Pros. I cannot justify keeping them. I will still have significantly good processing power and 160 GB of VRAM across the 5x 5090s though.
I will be able to run the GLM Air models on 3x 5090s, one of the R1-0528 models on 1x 5090, and some image generation on 1x 5090. I think I have this all set up correctly how I want it. :)
Interesting, 3x 5090s for GLM Air...
I'm running a 6000 Pro myself and I'm curious if there's any noticeable performance boost using ik_llama.cpp over standard mainline llama.cpp for pure-CUDA inference, given that there are some MoE-specific optimizations?
I'd otherwise try hybrid inference on the larger GLM 4.5 lol, if it weren't for the fact that the big GPU is confined to my Windows gaming machine for the moment, and Windows is suboptimal for hybrid inference (doesn't seem to be a way to avoid RAM OOM or paging out if the weights are larger than system RAM).
For a full offload situation it can vary; a lot of the CUDA implementations are somewhat similar to mainline, but you have access to better quality quants and stuff like `-fmoe`. You can A/B test a specific quant and offload configuration using `llama-sweep-bench` (basically just replace your usual `llama-server ...` command, add `--warmup-batch`, and use like 20k context). You can use my mainline branch which has it too, here: https://github.com/ubergarm/llama.cpp/tree/ug/port-sweep-bench
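As a rough sketch (the model path and flag values here are just placeholders, not a recommendation), the A/B comparison could look something like:

```bash
# ik_llama.cpp build: take the flags you would normally pass to llama-server,
# swap in llama-sweep-bench, add --warmup-batch, and use a ~20k context.
./build/bin/llama-sweep-bench \
  -m /models/GLM-4.5-Air-Q5_K_S.gguf \
  -c 20480 -ngl 999 -fa -fmoe --warmup-batch

# For the mainline side, build the ug/port-sweep-bench branch linked above and
# run the equivalent command there (dropping ik-specific flags such as -fmoe).
```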
For Windows, right, some folks have complained about multi-GPU issues, and yeah I'd not expect it to handle the paging/full-RAM situation as gracefully (maybe disabling swap on Windows somehow would help?). There are some Windows builds here if you want to test: https://github.com/Thireus/ik_llama.cpp/releases but I'm not sure if the new GLM-4.5 branch that just got merged into main is in that release yet.
Keep us posted what you find and feel free to share your full commands for workshopping etc!
Also, specifically for GLM-4.5, this performance PR just got made.
You beat me to it. I wanted to wait for the `sweep-bench` results from @Thireus before announcing the PR to a wider audience.
Concerning the `-amb` flag: the flag has no effect when using flash attention and not dealing with a DeepSeek model. It was implemented specifically for DeepSeek R1, V3, etc., initially to reduce the compute buffer size without FA because FA did not work for DeepSeek. Later it was extended to also cover the more advanced MLA implementation in ik_llama.cpp (`mla = 2` or `mla = 3`). So, this will not solve the OOM problem. But depending on the use case (e.g., best possible PP performance is more important than TG performance), one can just leave the routed experts for a few layers on the CPU. How many layers are needed depends on how much VRAM one has, the quantization type of the experts, and the `-b` / `-ub` one wants to use. If the number of required layers is small, this will have a relatively minor impact on TG performance. For GLM-4.5-Air, the routed experts are 2.2B parameters per layer, so for the `Q5_K_S` model that was discussed above each layer left on the CPU will free up 1.42 GiB of VRAM to use for compute buffers. I see the CUDA compute buffer being 813 MiB with the default batch/u-batch size, and 2432 MiB with `-b 4096 -ub 4096`, so potentially just a single layer is enough, and almost for sure not more than 2.
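As a concrete sketch (not from the post above: the layer indices, tensor-name regex, and model path are assumptions for illustration), leaving the routed experts of the last couple of layers on the CPU via `-ot`/`--override-tensor` would look roughly like:

```bash
# Offload everything except the routed experts (the ffn_*_exps tensors) of two
# layers, which stay on the CPU; for the Q5_K_S Air quant each such layer frees
# up roughly 1.4 GiB of VRAM for the larger -b/-ub compute buffers.
./build/bin/llama-server \
  -m /models/GLM-4.5-Air-Q5_K_S.gguf \
  -ngl 999 -fa -fmoe \
  -b 4096 -ub 4096 \
  -ot "blk\.(44|45)\.ffn_.*_exps.*=CPU"
```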
So, people have confirmed that the ik_llama.cpp performance issue for GLM-4.5 models has been fixed with this PR. Depending on GPU, OS, and context length, ik_llama.cpp is now between slightly slower and much faster than llama.cpp.
Built and tested the new PR for GLM 4.5 - it seems that now TG speed is on par with mainline llama.cpp on my setup.
./llama-sweep-bench.exe -m "C:\ML\GGUF\GLM-4.5-Air-Q5_K_S.gguf" -c 32768 -ngl 999 -fa -fmoe --no-mmap -t 1 --warmup-batch
Prompt processing in ik_llama.cpp is still far slower than mainline llama.cpp at `-ub 512` though, which is what I'm still trying to figure out. It's not a small gap at all: around 40-50% slower depending on the context length. ik_llama.cpp seems to be very sensitive to `-ub`, where smaller values tank the prompt processing speed, while mainline llama.cpp doesn't experience this. Empirically, at least `-ub 2048` seems to be needed for ik_llama.cpp to surpass mainline llama.cpp's prompt processing speed on my setup - wondering if this is a bug? Mainline here (also at `-ub 512`):
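Purely for illustration (the batch values are just an example and the rest mirrors the command above, not the poster's exact runs), the larger-`-ub` sweep on the ik_llama.cpp side would be along the lines of:

```bash
# Same sweep as before, but with -b/-ub raised to 2048, which is roughly where
# ik_llama.cpp's prompt processing was reported to catch up to mainline here.
./llama-sweep-bench.exe -m "C:\ML\GGUF\GLM-4.5-Air-Q5_K_S.gguf" -c 32768 -ngl 999 -fa -fmoe --no-mmap -t 1 --warmup-batch -b 2048 -ub 2048
```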
Okay this is great! That PR solved the performance issue for me.
GLM-4.5 (big version) is now on par with llama.cpp for text generation, and faster for prompt processing. This is on Linux with 6x3090 fully offloaded.
Question: Is there a way to stop the full prompt/response being dumped to the terminal? The latest build is doing this.
I noticed that recently too, that client requests are getting logged to stdout or stderr or something on llama-server. I'm not sure how to turn that off; I don't recall that being a thing until maybe a couple of weeks ago?
$ ./build/bin/llama-server --help
-v, --verbose print verbose information
--verbosity N set specific verbosity level (default: 0)
--verbose-prompt print a verbose prompt before generation (default: false)
--no-display-prompt don't print prompt at generation (default: false)
I have been looking for a way to turn it off too; `--log-disable` didn't do it if I recall. Perhaps this `--no-display-prompt` will do it? I'll have to try too, let me know if you figure it out haha...
Yeah, I think it started happening around the time they were working on tool calls, so I figure it was leftover hard-coded debug output.
I've been working around it by appending this:
`| grep -v 'format_partial_response_oaicompat'`
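In context, the full invocation ends up looking something like this (the model path and other flags are just placeholders; the `2>&1` is added on the assumption the log might go to stderr):

```bash
# Run llama-server as usual, but filter the hard-coded request/response dump
# out of the console output; 2>&1 merges stderr so the grep catches the log
# lines regardless of which stream they are written to.
./build/bin/llama-server -m /models/GLM-4.5-Air-Q5_K_S.gguf -ngl 999 -fa -fmoe 2>&1 \
  | grep -v 'format_partial_response_oaicompat'
```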
I'll try that `--no-display-prompt`.