Amazing on old hardware

#1
by nobeardbugs - opened

I'm currently running this on my 1080 Ti, and it's just incredible how capable it is under heavy quantization. My testing so far has been limited to high-level technical discussions and brainstorming. I wouldn't trust it for zero-shot coding, but it seems like it could throw out some very useful pseudocode. It has a nice range of knowledge and excellent depth, and so far it hasn't hallucinated or gotten details mixed up. Of course, it's not going to match larger, less-quantized models, but for a modern LLM running on an 8-year-old 11 GB card, the quality and performance are nuts.

Using llama.cpp with additional quantization support:

  • IQ3_XXS quant with 16k context
  • K-cache and V-cache both quantized to Q5_1
  • flash attention enabled
  • nvidia-smi reports 10,820 MiB VRAM used (out of 11,264 MiB)
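For anyone wanting to replicate this, a launch command along these lines should match the settings above. The model filename and `-ngl` layer count are placeholders; double-check flag spellings against your llama.cpp build:

```shell
# Hypothetical launch command matching the setup above (model path and
# -ngl value are placeholders for your own files/GPU):
#   -c 16384       -> 16k context
#   -fa            -> enable flash attention
#   -ctk/-ctv q5_1 -> quantize K- and V-cache to Q5_1
#   -ngl 99        -> offload all layers to the GPU
llama-server -m model-IQ3_XXS.gguf -c 16384 -fa -ctk q5_1 -ctv q5_1 -ngl 99
```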

Typical speeds:

  • Context empty/low (0-1k / 16k):
    • pp: 300 tps
    • tg: 14 tps
  • Context 25% full (4k / 16k):
    • pp: 290 tps
    • tg: 11 tps
  • Context 50% full (8k / 16k):
    • pp: 240 tps
    • tg: 9 tps
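If anyone wants to reproduce numbers like these, llama.cpp ships a `llama-bench` tool; something like the following should report pp/tg throughput with the same cache and flash-attention settings (model path is a placeholder):

```shell
# Hypothetical benchmark run: measures prompt processing (pp) and token
# generation (tg) speed with flash attention on and Q5_1 KV-cache.
llama-bench -m model-IQ3_XXS.gguf -fa 1 -ctk q5_1 -ctv q5_1 -p 512 -n 128
```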

Excellent; thank you for the write-up.
Mistrals are pretty tough; the "older ones" (7B) can actually run at IQ1_M, provided they were quantized post-October 2024 (changes to llama.cpp/imatrix).

Suggestion: you might want to try the model with flash attention off; sometimes you can get a performance improvement.

Good tip, thanks. Just tested that, and it looks like it does help, but only by about 5-10% as the context fills up. With an empty context, speeds are the same, but by the time I was over 6k of context I was still getting over 11 tps.
Unfortunately, because disabling flash attention (and the V-cache quant) increases the VRAM hit, I had to drop the context down to 8k. Juuuust squeezed it in: 11,033 MiB / 11,264 MiB. I'll probably stick with the higher context for its extra usefulness, and also because I want to play with a reasoning prompt, and that's going to chew up tokens.
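For the record, the flash-attention-off run would look something like this (model path is a placeholder; llama.cpp only supports a quantized V-cache together with flash attention, so the V-cache falls back to f16, which is why the context had to drop to 8k):

```shell
# Hypothetical FA-off variant: no -fa flag, so the V-cache stays at f16
# (a quantized V-cache requires flash attention in llama.cpp) and the
# context is cut to 8k to fit in 11 GiB of VRAM.
llama-server -m model-IQ3_XXS.gguf -c 8192 -ctk q5_1 -ngl 99
```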
