Can you quantize this model so that it will run on 48 GB of video memory + 128 GB of RAM?

#1
by xldistance - opened

I would very much like to run this model

You're fast, I hadn't even uploaded the README lol.

So you want something under 176 GiB total size, with some room left for context and buffers. That puts it in the sub-3 BPW range. I think I can do that and maintain half-decent perplexity. The IQ2_KS will probably be the right combination of small size and a good speed vs quality trade-off.
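As a rough sanity check on that budget (a back-of-envelope sketch; the 176 GiB figure and the ~480 B parameter count are just the numbers from this thread):

# max BPW that fits a ~480.155e9-weight model into a 176 GiB budget
echo "176 * 1024^3 * 8 / 480.155 / 10^9" | bc -l   # ≈ 3.15 BPW, so sub-3 BPW leaves headroom for KV cache and buffers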

I'm making the iqN_k quants a little juiced up in the attn for better perplexity at slightly slower TG speed. The KS will be balanced quality with a little more speed. The KT is experimental; I'm hoping to land it for full offload on dual RTX 6000 PRO Blackwells at < 198 GiB.

No rest for you with all these MoEs lol

Owner
β€’
edited Jul 23
145G    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KS.gguf

llm_load_print_meta: model ftype      = IQ2_KS - 2.1875 bpw
llm_load_print_meta: model params     = 480.155 B
llm_load_print_meta: model size       = 144.126 GiB (2.578 BPW)
llm_load_print_meta: repeating layers = 142.917 GiB (2.567 BPW, 478.288 B parameters)
llm_load_print_meta: general.name     = Qwen3 Coder 480B A35B Instruct

@xldistance uploading now! good luck! i vibe checked it and it ran okay in quick tests. getting more perplexity data in the coming days for comparisons.

you will probably want to use -ot ... =CUDA0 -ot ... =CUDA1 to offload additional ffn layers onto each GPU to fit it all. Should have plenty of space left for good context length as well at q8_0 kv-cache quantization.
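For anyone unfamiliar with the -ot syntax, here is a minimal sketch of what that looks like on a two-GPU box (the layer ranges and filename are illustrative only, not a recommendation; tune them to your VRAM):

# illustrative split: ffn tensors of layers 0-9 -> CUDA0, 10-19 -> CUDA1, the rest stay in CPU RAM
./llama-server \
  --model Qwen3-Coder-480B-A35B-Instruct-IQ2_KS.gguf \
  -fa -fmoe \
  -ctk q8_0 -ctv q8_0 \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  -ot "blk\.[0-9]\.ffn.*=CUDA0" \
  -ot "blk\.1[0-9]\.ffn.*=CUDA1" \
  -ot "blk.*\.ffn.*=CPU"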

lemme know how it goes!

This will replace the 235b model for me :) so excited for your quant.

@mtcl

you can fit a larger one, i'll upload more as soon as i get some sleep, they're cooking overnight haha...

Unacceptable! No sleep for you!
Jk jk :) thank you for everything that you do! ❀️

Owner
β€’
edited Jul 23

I possibly noticed some pauses when it prints out a , character. I'm too sleepy to know for sure, but here's a possibly related thing to look at tomorrow if it comes up: --override-kv tokenizer.ggml.bos_token_id=int:-1 https://github.com/ikawrakow/ik_llama.cpp/pull/573#issuecomment-3053501803 I may be totally wrong; just noting it in case anyone else notices pauses after it generates ,

but it might not help at all and could be a wild goose chase: https://github.com/ikawrakow/ik_llama.cpp/issues/464#issuecomment-3104837441

It's not just your model: I tried the UD_Q4_K_XL from Unsloth on ik_llama and it pauses after a comma too, but the same model doesn't pause on llama.cpp.

but it might not help at all and could be a wild goose chase: https://github.com/ikawrakow/ik_llama.cpp/issues/464#issuecomment-3104837441

It is one.

Downloading now, thank you very much!

Weirdly, the new Qwen3 MoE (not this coding model) was doing the "pause after comma" thing for me on OpenRouter + OpenWebUI a few hours after it was released there.

I experienced the pause-after-comma thing a little bit, but IQ2_KS was surprisingly capable! Still waiting on my new mobo setup, but I was able to one-shot a simple spaceship lander game just fine.

[screenshots of the spaceship lander game]

I've been meaning to test ik_llama on Linux, because I feel something is seriously wrong with Windows and multi-GPU... I've been chalking it up to my 192 GB of DDR5 running at 4800 MHz on AM5, but honestly with this setup I'd expect a little more token gen speed...

7950X
192 GB DDR5 @ 4800 MHz
1x 4090
2x 3090

 .\llama-server.exe ^
  --model "%MODEL_PATH%" ^
  --alias ubergarm/Qwen3-Coder-480B-A35B-Instruct ^
  -fa -fmoe ^
  -ctk q8_0 -ctv q8_0 ^
  --ctx-size 32768 ^
  --n-gpu-layers 99 ^
  -ot "blk\.[0-7]\.ffn.*=CUDA0" ^
  -ot "blk\.1[0-7]\.ffn.*=CUDA1" ^
  -ot "blk\.2[0-7]\.ffn.*=CUDA2" ^
  -ot "blk.*\.ffn.*=CPU" ^
  --threads 16 ^
  --batch-size 4096 ^
  -ub 4096 ^
  --host 0.0.0.0 ^
  --port 8081

--- EDIT

I installed Fedora on a spare NVMe and tested there, seeing speed increases of as much as 4-7x depending on the model.

tl;dr: don't use Windows for bleeding-edge LLM optimizations, you will leave a lot of performance on the table!

@phakio @ubergarm Add this flag to solve the comma problem: --override-kv tokenizer.ggml.bos_token_id=int:151643
The problem was tracked down in this post: https://github.com/ikawrakow/ik_llama.cpp/issues/464#issuecomment-3104837441


Hi, do I have to add this to the command I'm running?

Yeah, just add this to your llama-server command line
--override-kv tokenizer.ggml.bos_token_id=int:151643
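For example, tacked onto whatever llama-server command you already run (the model path here is just a placeholder):

llama-server \
  --model /path/to/Qwen3-Coder-480B-A35B-Instruct-IQ2_KS.gguf \
  --override-kv tokenizer.ggml.bos_token_id=int:151643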

@gopi87 @gghfez @xldistance @phakio and all

Good morning, uploading some more models now!

Also, pull and rebuild the latest tip of main on ik_llama.cpp for what looks like a fix for the pause-after-, issue; then you should no longer need to pass that --override-kv .. business: https://github.com/ikawrakow/ik_llama.cpp/issues/464#issuecomment-3106789674
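If it helps, a typical pull-and-rebuild sequence looks something like this (a sketch assuming a CUDA build; check the repo README for the right flags on your platform):

cd ik_llama.cpp
git pull
# GGML_CUDA=ON is the usual CUDA toggle; adjust to your backend
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j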

Anecdotally, the first chat also goes much faster now; before, it was as if the first chat was routing through all the experts for every character, maybe. I dunno, but after pulling and rebuilding it seems smoother for all chats, starting from the very first one!

thanks!
