Any chance of a 128k version so we can use it as a draft model for the larger 128k models?

#3
by smcleod - opened

Thanks!

I was literally just looking for the 128k quants from Unsloth and sat here scratching my head like, where are they?

Looking for it as well, but also wondering if a 128k version is really necessary...

I'm using Qwen3-32B-128K-Q8_0.gguf with context size of 131072.

--model-draft Qwen3-0.6B-Q8_0.gguf
--draft-max 8
--draft-min 0
--ctx-size-draft 32768
--draft-p-min 0.5
--gpu-layers-draft 65
--override-kv tokenizer.ggml.bos_token_id=int:151643
--device-draft CUDA0

I have not tuned these params yet, so they are far from optimal, and using Q8 instead of Q4 for the draft model is certainly not a good idea here.

Along with YaRN:

--rope-scaling yarn
--rope-scale 4
--yarn-orig-ctx 32768
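Putting the pieces together, the flags above might be combined into a single llama.cpp server invocation roughly like this (a sketch only: the `llama-server` binary name and the model file paths are assumptions based on the filenames mentioned earlier in the thread, and the GPU layer count will depend on your hardware):

```shell
# Hedged sketch of a combined llama.cpp invocation using the flags from this
# thread. Model paths and binary name are assumptions, not a verified setup.
llama-server \
  --model Qwen3-32B-128K-Q8_0.gguf \
  --ctx-size 131072 \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 32768 \
  --model-draft Qwen3-0.6B-Q8_0.gguf \
  --ctx-size-draft 32768 \
  --draft-max 8 \
  --draft-min 0 \
  --draft-p-min 0.5 \
  --gpu-layers-draft 65 \
  --device-draft CUDA0 \
  --override-kv tokenizer.ggml.bos_token_id=int:151643
```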
Unsloth AI org

Hey guys, as much as we'd love to release 128K quants, the small Qwen3 models don't support 128K context so only the large ones work :)

CC: @smcleod @SamuraiBarbi @Thireus @siddhesh22


I see, I see! Thank you for clarifying, makes sense now :) Appreciate you taking the time to respond to our inquiry!

Thanks Shimmy! Appreciate you taking the time to respond and for all your hard work.

smcleod changed discussion status to closed

@Thireus out of interest, what was your reasoning for --override-kv tokenizer.ggml.bos_token_id=int:151643 ?

@smcleod - Without it I get the following error: "draft vocab special tokens must match target vocab to use speculation". Don't you?

Note: I've tested several configs, including switching to the 4B model. My use-case is long context sizes with YaRN, and I've noticed that using a draft model inevitably lowers the quality of the Qwen-32B model's output (it increases hallucinations), which I cannot afford. So I've dropped the idea of using a draft model, unless there is something I was not doing right...

My draft acceptance rate was between 0.4 and 0.56 across my different attempts, draft models, and params. I also didn't notice a significant speed increase at large context sizes.
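A rough back-of-the-envelope (my own sketch, assuming independent per-token acceptance, which is a simplification) suggests why those acceptance rates yield little speedup: with acceptance probability p and draft length k, the expected tokens emitted per target-model forward pass is a geometric sum.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass in
    speculative decoding, assuming each of the k draft tokens is accepted
    independently with probability p (plus the one token the target model
    always produces). Geometric series: 1 + p + p^2 + ... + p^k."""
    return (1 - p ** (k + 1)) / (1 - p)

# At the acceptance rates observed in this thread with --draft-max 8:
print(expected_tokens_per_step(0.4, 8))   # ~1.67 tokens per target pass
print(expected_tokens_per_step(0.56, 8))  # ~2.26 tokens per target pass
```

So at best a bit over two tokens per target pass, before subtracting the cost of running the draft model itself, which is consistent with seeing no significant speedup.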

Ah, I'd wondered what the fix was for that. Nevertheless, it sounds like the impact of YaRN on quality might be too much of a trade-off to be worth it.

smcleod changed discussion status to open
smcleod changed discussion status to closed
