Any chance of a 128k version so we can use it as a draft model for the larger 128k models?

#3
by smcleod - opened

Thanks!

I was literally just looking for the 128k quants from Unsloth and sat here scratching my head like, where are they?

Looking for it as well, but also wondering if a 128k version is really necessary...

I'm using Qwen3-32B-128K-Q8_0.gguf with context size of 131072.

--model-draft Qwen3-0.6B-Q8_0.gguf
--draft-max 8
--draft-min 0
--ctx-size-draft 32768
--draft-p-min 0.5
--gpu-layers-draft 65
--override-kv tokenizer.ggml.bos_token_id=int:151643
--device-draft CUDA0

I have not tuned these params yet, so they are far from optimal, and using Q8 instead of Q4 for the draft model is certainly not a good idea here.

Along with YaRN:

--rope-scaling yarn
--rope-scale 4
--yarn-orig-ctx 32768
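Putting the pieces together, the flags above might be combined into a single llama.cpp server invocation roughly like this (a sketch only: the `llama-server` binary name and the model file paths are assumptions based on the filenames mentioned earlier in the thread, and the GPU layer count will depend on your hardware):

```shell
# Hedged sketch of a combined llama.cpp invocation using the flags from this
# thread. Model paths and binary name are assumptions, not a verified setup.
llama-server \
  --model Qwen3-32B-128K-Q8_0.gguf \
  --ctx-size 131072 \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 32768 \
  --model-draft Qwen3-0.6B-Q8_0.gguf \
  --ctx-size-draft 32768 \
  --draft-max 8 \
  --draft-min 0 \
  --draft-p-min 0.5 \
  --gpu-layers-draft 65 \
  --device-draft CUDA0 \
  --override-kv tokenizer.ggml.bos_token_id=int:151643
```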
Unsloth AI org

Hey guys, as much as we'd love to release 128K quants, the small Qwen3 models don't support 128K context so only the large ones work :)

CC: @smcleod @SamuraiBarbi @Thireus @siddhesh22


I see, I see! Thank you for clarifying, makes sense now :) Appreciate you taking the time to respond to our inquiry!

Thanks Shimmy! Appreciate you taking the time to respond and for all your hard work.

smcleod changed discussion status to closed

@Thireus out of interest, what was your reasoning for --override-kv tokenizer.ggml.bos_token_id=int:151643 ?

@smcleod - Without it I get the following error: "draft vocab special tokens must match target vocab to use speculation". Don't you?

Note: I've tested several configs, including switching to the 4B model. My use-case is long context sizes with YaRN, and I've noticed that using a draft model inevitably lowers the quality of the Qwen-32B model's output (it increases hallucinations), which I cannot afford. So I've dropped the idea of using a draft model, unless there is something I was not doing right...

My draft acceptance rate was between 0.4 and 0.56 across my different attempts, draft models, and params. I also didn't notice a significant speed increase at large context sizes.
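A rough back-of-the-envelope (my own sketch, assuming independent per-token acceptance, which is a simplification) suggests why those acceptance rates yield little speedup: with acceptance probability p and draft length k, the expected tokens emitted per target-model forward pass is a geometric sum.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass in
    speculative decoding, assuming each of the k draft tokens is accepted
    independently with probability p (plus the one token the target model
    always produces). Geometric series: 1 + p + p^2 + ... + p^k."""
    return (1 - p ** (k + 1)) / (1 - p)

# At the acceptance rates observed in this thread with --draft-max 8:
print(expected_tokens_per_step(0.4, 8))   # ~1.67 tokens per target pass
print(expected_tokens_per_step(0.56, 8))  # ~2.26 tokens per target pass
```

So at best a bit over two tokens per target pass, before subtracting the cost of running the draft model itself, which is consistent with seeing no significant speedup.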

Ah, I'd wondered what the fix was for that. Nevertheless, it sounds like the impact of YaRN on quality might be too much of a trade-off to be worth it.

smcleod changed discussion status to open
smcleod changed discussion status to closed
