What is the significance of the data in the model card?
What does the table mean? Is that comparing the quanted models vs KV cache quantization on HumanEval, or something else?
Yeah, it is the result of

```
python eval/humaneval.py --model_dir [model] -cq [cache] -pf qwen3 -spt 1 -temp 0 --max_tokens 1024
```

using humaneval.py, where Q4 means `-cq 4`, etc.
So basically it is all the same on HumanEval, since 1% is not statistically significant for 164 tasks.
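To put a number on that: with only 164 problems, the per-run noise on pass@1 is a few percentage points, so a 1% gap is less than two tasks flipping. A quick check (the 0.80 pass rate below is just an assumed ballpark for illustration, not a measured figure):

```python
# Rough binomial standard error for pass@1 on HumanEval's 164 problems.
# The 0.80 pass rate is an assumed ballpark for illustration, not a measured result.
n, p = 164, 0.80
se = (p * (1 - p) / n) ** 0.5
print(f"standard error: ~{se * 100:.1f} percentage points")  # ~3.1
print(f"one task flipping: {100 / n:.2f} points")            # ~0.61
```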
Ok cool, it wasn't clear initially from how it was laid out.
Though something you might be able to explore if you want: K and V can be separately defined in EXL3, and it accepts anything in the range of 2-8.
I'm currently messing with exllamav3 0.2 on Python 3.11 using the experimental exl3 branch of tabby and CUDA malloc, and loading this specific repo's model only takes ~4.5 GiB at 16K context before the first gen, with K at 4 bits and V at 3 bits. (How it's smaller than the actual file size of the weights, I have no idea. :v Frickin' turboderp space magic. [On further examination it might be tactically sideloading part of the weights to system RAM? Still, doing that at usable speeds is something else.] Expands to 5 GiB on first gen.)
> How it's smaller than the actual file size of the weights, I have no idea

Noticed this with other models too. A surprise to be sure, but a welcome one.
The non-embedding parameters here are 3.675 GiB (HF reports GB, while nvtop prefers MiB and GiB), and the default cache size is 8192 tokens, so we get around `2 * 8192 * num_hidden_layers * num_attention_heads * head_dim * 3.5 / 8`, or 0.984 GiB; with PyTorch stuff around 0.3 GiB it should be around 5 GiB.
IIRC you will get an exception if you go beyond `cache_size` (aka `max_num_tokens` in `Cache`), since it should be `seq_len * batch_size <= max_num_tokens`, so for 16K it should be `-cs 16384` and another GiB.
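If you want to plug numbers in yourself, here is that estimate as a few lines of Python. The config values (36 layers, 32 attention heads, head_dim 128) are taken from the figures above; note the follow-up further down swaps in num_key_value_heads, which shrinks the cache term further:

```python
# Rough VRAM estimate as stated above: weights + KV cache + overhead.
# 3.5 is the average of K=4 and V=3 bits; the correction below uses num_key_value_heads instead.
num_hidden_layers, num_attention_heads, head_dim = 36, 32, 128
weights_gib, overhead_gib = 3.675, 0.3

def cache_gib(max_num_tokens, avg_bits=3.5):
    # 2x for K and V, avg_bits per element, 8 bits per byte
    return 2 * max_num_tokens * num_hidden_layers * num_attention_heads * head_dim * avg_bits / 8 / 2**30

for tokens in (8192, 16384):
    print(tokens, round(weights_gib + cache_gib(tokens) + overhead_gib, 2))
# 8192  -> ~4.96 GiB
# 16384 -> ~5.94 GiB
```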
Well, on tabby I can run Josiefied-Qwen3-8B-abliterated-v1-exl3-4bpw at
- context length: 65536
- cache_mode: 5,4
- chunk_size: 2048
- torch CUDA malloc backend: true
- exllamav3 0.2, cp311, CUDA 12.8, torch 2.7

with 7.3 GiB of VRAM and about 2.3 GiB of system RAM used after the first gen. Shared VRAM is disabled, so it's not overflow.
(Just tested with malloc off and it didn't seem to really change anything.)
Yeah, swap `num_attention_heads` for `num_key_value_heads` in the previous formula, so the actual cache size is `num_hidden_layers * max_num_tokens * num_key_value_heads * head_dim * (k_bits + v_bits + 1) / 8`, as we can check here. So for this model it is `model_size - embedding_size (= 2 * vocab_size * hidden_size) + cache_size + overhead`, and `(5.2e9 - 2 * 151936 * 4096 + 36 * 65536 * 8 * 128 * (5 + 4 + 1) / 8 + 0.5e9) / 2**30` is indeed around 7 GiB. RAM is always around 2.5 GiB anyway.
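Spelled out as a small script (all constants come straight from the numbers above; the 0.5e9 overhead is a rough guess):

```python
# VRAM estimate for the 8B model at 64K context with a Q5/Q4 cache, using the corrected formula.
# All constants are the ones quoted above; 0.5e9 is a rough overhead guess.
model_size = 5.2e9                          # bytes on disk
vocab_size, hidden_size = 151936, 4096
embedding = 2 * vocab_size * hidden_size    # embedding weights (2 bytes/param), subtracted per the formula

num_hidden_layers, num_key_value_heads, head_dim = 36, 8, 128
max_num_tokens, k_bits, v_bits = 65536, 5, 4
cache = num_hidden_layers * max_num_tokens * num_key_value_heads * head_dim * (k_bits + v_bits + 1) / 8

overhead = 0.5e9
print((model_size - embedding + cache + overhead) / 2**30)  # ~6.96, i.e. around 7 GiB
```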
If that's a generalizable equation, then how come model inference engines make us define context limits manually? You could easily read the values from the config and safetensors, rearrange the math to hold VRAM constant, and solve for `cache_size` automatically.
And if you use a platform that handles parallel requests, then you just set a `cache_size` limit and solve for the number of parallel requests allowed. :v
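Something like this hypothetical helper would do it, just rearranging the cache formula above so the token count is the unknown (none of this is an actual exllamav3 or tabby API):

```python
# Hypothetical sketch: largest cache size that fits a VRAM budget, by rearranging the formula above.
# Not an exllamav3/tabby API; the rounding granularity of 256 is just an assumption.
def max_cache_tokens(vram_budget, weight_bytes, overhead_bytes,
                     num_hidden_layers, num_key_value_heads, head_dim,
                     k_bits, v_bits, granularity=256):
    bytes_per_token = num_hidden_layers * num_key_value_heads * head_dim * (k_bits + v_bits + 1) / 8
    tokens = (vram_budget - weight_bytes - overhead_bytes) / bytes_per_token
    return int(tokens // granularity * granularity)

# Example: the same 8B model as above, an 8 GiB budget, Q5/Q4 cache
weights = 5.2e9 - 2 * 151936 * 4096
print(max_cache_tokens(8 * 2**30, weights, 0.5e9, 36, 8, 128, 5, 4))  # ~89k tokens
```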
I think by my really bad napkin math, if exllamav3 supported gemma3 models with the vision part removed (like GGUFs do it), then a 3.5 bpw gemma3-12b, with perplexity similar to Q4_K_S, could do 16384 context at K=4 and V=4 and just about squeeze into 8 GB of VRAM (though I'd probably do K=4, V=3 just to be safe on overhead).
And at least from your HumanEval testing, exllamav3's KV-cache quanting doesn't seem to really impact that benchmark all too much.
You could, but such a formula is approximate, so usually it is done dynamically, like `find_executable_batch_size` from accelerate. You can do the same for sequence length, i.e. pick some maximal `max_seq_len` and keep decreasing it until you find a working value. For llama.cpp, you can also ask why it isn't detecting `-ngl` dynamically, etc. Just things that aren't implemented.
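A minimal sketch of that retry pattern for sequence length, in the spirit of accelerate's `find_executable_batch_size` (the `load_model` callable is hypothetical, standing in for whatever loader you use):

```python
import torch

# Keep halving max_seq_len until the model loads, mirroring the batch-size version in accelerate.
# `load_model` is a hypothetical callable standing in for your actual loader.
def find_executable_seq_len(load_model, start=131072, floor=2048):
    seq_len = start
    while seq_len >= floor:
        try:
            return load_model(max_seq_len=seq_len), seq_len
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()   # free what the failed attempt allocated
            seq_len //= 2
    raise RuntimeError("no sequence length fits in the available VRAM")
```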
Mistral-Nemo-Instruct-2407-exl3-4bpw is another 12B model; it uses 7.5 GiB at 32K context with Q4 cache. It is maybe weaker than gemma-3-12b (both should be weaker than Qwen3-8B anyway), but the vibe might be better. You can also try 14B models at 3bpw.