Amazing on old hardware

#1
by nobeardbugs - opened

I'm currently running this on my 1080 Ti, and it's just incredible how capable it is under heavy quantization. My testing so far has been limited to high-level technical discussions and brainstorming. I wouldn't trust it for zero-shot coding, but it seems like it could throw out some very useful pseudocode. It has a nice range of knowledge and excellent depth, and so far it hasn't hallucinated or gotten details mixed up. Of course, it's not going to match larger, less-quantized models, but for a modern LLM running on an 8-year-old 11 GB card, the quality and performance are nuts.

Using llama.cpp with additional quantization support:

  • IQ3_XXS quant with 16k context
  • K-cache and V-cache both quantized to Q5_1
  • flash attention enabled
  • nvidia-smi reports 10,820 MiB VRAM used (out of 11,264 MiB)
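For anyone wanting to replicate this, a launch command along these lines should match the settings above. The model filename and `-ngl` layer count are placeholders; double-check flag spellings against your llama.cpp build:

```shell
# Hypothetical launch command matching the setup above (model path and
# -ngl value are placeholders for your own files/GPU):
#   -c 16384       -> 16k context
#   -fa            -> enable flash attention
#   -ctk/-ctv q5_1 -> quantize K- and V-cache to Q5_1
#   -ngl 99        -> offload all layers to the GPU
llama-server -m model-IQ3_XXS.gguf -c 16384 -fa -ctk q5_1 -ctv q5_1 -ngl 99
```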

Typical speeds:

  • Context empty/low (0-1k / 16k):
    • pp: 300 tps
    • tg: 14 tps
  • Context 25% full (4k / 16k):
    • pp: 290 tps
    • tg: 11 tps
  • Context 50% full (8k / 16k):
    • pp: 240 tps
    • tg: 9 tps
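If anyone wants to reproduce numbers like these, llama.cpp ships a `llama-bench` tool; something like the following should report pp/tg throughput with the same cache and flash-attention settings (model path is a placeholder):

```shell
# Hypothetical benchmark run: measures prompt processing (pp) and token
# generation (tg) speed with flash attention on and Q5_1 KV-cache.
llama-bench -m model-IQ3_XXS.gguf -fa 1 -ctk q5_1 -ctv q5_1 -p 512 -n 128
```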

Excellent; thank you for the write-up.
Mistrals are pretty tough; the "older ones" (7B) can actually run at IQ1_M, provided they were quantized post-October 2024 (changes to llama.cpp/imatrix).

Suggestion: you might want to try the model with flash attention off; sometimes you can get a performance improvement.

Good tip, thanks. Just tested that, and it looks like it does help, but only by about 5-10% as the context fills up. With an empty context, speeds are the same, but by the time I was over 6k of context I was still getting over 11 tps.
Unfortunately, because disabling flash attention (and the V-cache quant) increases the VRAM hit, I had to drop the context down to 8k. Juuuust squeezed it in: 11,033 MiB / 11,264 MiB. I'll probably stick with the higher context for its extra usefulness, and also because I want to play with a reasoning prompt, and that's going to chew up tokens.
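For the record, the flash-attention-off run would look something like this (model path is a placeholder; llama.cpp only supports a quantized V-cache together with flash attention, so the V-cache falls back to f16, which is why the context had to drop to 8k):

```shell
# Hypothetical FA-off variant: no -fa flag, so the V-cache stays at f16
# (a quantized V-cache requires flash attention in llama.cpp) and the
# context is cut to 8k to fit in 11 GiB of VRAM.
llama-server -m model-IQ3_XXS.gguf -c 8192 -ctk q5_1 -ngl 99
```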
