What is the significance of the data in the model card?
What does the table mean? Is that comparing the quanted models vs KV cache quantization on HumanEval, or something else?
Yeah, it is the result of

```
python eval/humaneval.py --model_dir [model] -cq [cache] -pf qwen3 -spt 1 -temp 0 --max_tokens 1024
```

using humaneval.py, where Q4 means `-cq 4`, etc.
So basically it is all the same on HumanEval, since 1% is not statistically significant for 164 tasks.
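To put a number on that: with only 164 problems, the per-run noise on pass@1 is a few percentage points, so a 1% gap is less than two tasks flipping. A quick check (the 0.80 pass rate below is just an assumed ballpark for illustration, not a measured figure):

```python
# Rough binomial standard error for pass@1 on HumanEval's 164 problems.
# The 0.80 pass rate is an assumed ballpark for illustration, not a measured result.
n, p = 164, 0.80
se = (p * (1 - p) / n) ** 0.5
print(f"standard error: ~{se * 100:.1f} percentage points")  # ~3.1
print(f"one task flipping: {100 / n:.2f} points")            # ~0.61
```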
Ok cool, it wasn't clear initially from how it was laid out.
Though something you might be able to explore if you want: K and V can be separately defined in EXL3, and it accepts anything in the range of 2-8.
I'm currently messing with exllamav3 0.2 on Python 3.11 using the experimental exl3 branch of tabby and CUDA malloc, and loading this specific repo's model only takes ~4.5 GiB at 16K context before the first gen, with K at 4 bits and V at 3 bits. (How it's smaller than the actual file size of the weights, I have no idea. :v Frickin' turboderp space magic. [On further examination it might be tactically sideloading part of the weights to system RAM? Still, doing that at usable speeds is something else.] Expands to 5 GiB on first gen.)
> How it's smaller than the actual file size of the weights, I have no idea

Noticed this with other models too. A surprise to be sure, but a welcome one.
The non-embedding parameters here are 3.675 GiB (HF reports GB, while nvtop prefers MiB and GiB), and the default cache size is 8192 tokens, so we get around `2 * 8192 * num_hidden_layers * num_attention_heads * head_dim * 3.5 / 8`, or 0.984 GiB; with PyTorch stuff around 0.3 GiB it should be around 5 GiB.
IIRC you will get an exception if you go beyond `cache_size` (aka `max_num_tokens` in `Cache`), since it should be `seq_len * batch_size <= max_num_tokens`, so for 16K it should be `-cs 16384` and another GiB.
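If you want to plug numbers in yourself, here is that estimate as a few lines of Python. The config values (36 layers, 32 attention heads, head_dim 128) are taken from the figures above; note the follow-up further down swaps in num_key_value_heads, which shrinks the cache term further:

```python
# Rough VRAM estimate as stated above: weights + KV cache + overhead.
# 3.5 is the average of K=4 and V=3 bits; the correction below uses num_key_value_heads instead.
num_hidden_layers, num_attention_heads, head_dim = 36, 32, 128
weights_gib, overhead_gib = 3.675, 0.3

def cache_gib(max_num_tokens, avg_bits=3.5):
    # 2x for K and V, avg_bits per element, 8 bits per byte
    return 2 * max_num_tokens * num_hidden_layers * num_attention_heads * head_dim * avg_bits / 8 / 2**30

for tokens in (8192, 16384):
    print(tokens, round(weights_gib + cache_gib(tokens) + overhead_gib, 2))
# 8192  -> ~4.96 GiB
# 16384 -> ~5.94 GiB
```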
Well, on tabby I can run Josiefied-Qwen3-8B-abliterated-v1-exl3-4bpw at
- context length: 65536
- cache_mode: 5,4
- chunk_size: 2048
- torch CUDA malloc backend: true
- exllamav3 0.2, cp311, CUDA 12.8, torch 2.7

with 7.3 GiB of VRAM and about 2.3 GiB of system RAM used after the first gen. Shared VRAM is disabled, so it's not overflow.
(Just tested with malloc off and it didn't seem to really change anything.)
Yeah, swap `num_attention_heads` for `num_key_value_heads` in the previous formula, so the actual cache size is `num_hidden_layers * max_num_tokens * num_key_value_heads * head_dim * (k_bits + v_bits + 1) / 8`, as we can check here. So for this model it is `model_size - embedding_size (= 2 * vocab_size * hidden_size) + cache_size + overhead`, and `(5.2e9 - 2 * 151936 * 4096 + 36 * 65536 * 8 * 128 * (5 + 4 + 1) / 8 + 0.5e9) / 2**30` is indeed around 7 GiB. RAM is always around 2.5 GiB anyway.
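Spelled out as a small script (all constants come straight from the numbers above; the 0.5e9 overhead is a rough guess):

```python
# VRAM estimate for the 8B model at 64K context with a Q5/Q4 cache, using the corrected formula.
# All constants are the ones quoted above; 0.5e9 is a rough overhead guess.
model_size = 5.2e9                          # bytes on disk
vocab_size, hidden_size = 151936, 4096
embedding = 2 * vocab_size * hidden_size    # embedding weights (2 bytes/param), subtracted per the formula

num_hidden_layers, num_key_value_heads, head_dim = 36, 8, 128
max_num_tokens, k_bits, v_bits = 65536, 5, 4
cache = num_hidden_layers * max_num_tokens * num_key_value_heads * head_dim * (k_bits + v_bits + 1) / 8

overhead = 0.5e9
print((model_size - embedding + cache + overhead) / 2**30)  # ~6.96, i.e. around 7 GiB
```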
If that's a generalizable equation, then how come model inference engines make us define context limits manually? You could easily read the values from the config and safetensors, rearrange the math to hold VRAM constant, and solve for `cache_size` automatically.
And if you use a platform that handles parallel requests, then you just set a `cache_size` limit and solve for the number of parallel requests allowed. :v
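Something like this hypothetical helper would do it, just rearranging the cache formula above so the token count is the unknown (none of this is an actual exllamav3 or tabby API):

```python
# Hypothetical sketch: largest cache size that fits a VRAM budget, by rearranging the formula above.
# Not an exllamav3/tabby API; the rounding granularity of 256 is just an assumption.
def max_cache_tokens(vram_budget, weight_bytes, overhead_bytes,
                     num_hidden_layers, num_key_value_heads, head_dim,
                     k_bits, v_bits, granularity=256):
    bytes_per_token = num_hidden_layers * num_key_value_heads * head_dim * (k_bits + v_bits + 1) / 8
    tokens = (vram_budget - weight_bytes - overhead_bytes) / bytes_per_token
    return int(tokens // granularity * granularity)

# Example: the same 8B model as above, an 8 GiB budget, Q5/Q4 cache
weights = 5.2e9 - 2 * 151936 * 4096
print(max_cache_tokens(8 * 2**30, weights, 0.5e9, 36, 8, 128, 5, 4))  # ~89k tokens
```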
I think by my really bad napkin math, if exllamav3 supported gemma3 models with the vision part removed (like GGUFs do it), then a 3.5 bpw gemma3-12b, with perplexity similar to Q4_K_S, could do 16384 context at K=4 and V=4 and just about squeeze into 8 GB of VRAM (though I'd probably do K=4, V=3 just to be safe on overhead).
And at least from your HumanEval testing, exllamav3's KV-cache quanting doesn't seem to really impact that benchmark all too much.
You could, but such a formula is approximate, so usually it is done dynamically, like `find_executable_batch_size` from accelerate. You can do the same for sequence length, i.e. pick some maximal `max_seq_len` and keep decreasing it until you find a working value. For llama.cpp, you can also ask why it isn't detecting `-ngl` dynamically, etc. Just things that aren't implemented.
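A minimal sketch of that retry pattern for sequence length, in the spirit of accelerate's `find_executable_batch_size` (the `load_model` callable is hypothetical, standing in for whatever loader you use):

```python
import torch

# Keep halving max_seq_len until the model loads, mirroring the batch-size version in accelerate.
# `load_model` is a hypothetical callable standing in for your actual loader.
def find_executable_seq_len(load_model, start=131072, floor=2048):
    seq_len = start
    while seq_len >= floor:
        try:
            return load_model(max_seq_len=seq_len), seq_len
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()   # free what the failed attempt allocated
            seq_len //= 2
    raise RuntimeError("no sequence length fits in the available VRAM")
```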
Mistral-Nemo-Instruct-2407-exl3-4bpw is another 12B model; it uses 7.5 GiB at 32K context with Q4 cache. It is maybe weaker than gemma-3-12b (both should be weaker than Qwen3-8B anyway), but the vibe might be better. You can also try 14B models at 3bpw.