smaller quant
Would it be possible to do a smaller quant for 16GB 4080 owners? Maybe 2.4bpw? I'd like to test it, too.
Btw, did you base it on instruct-v0.1 or instruct-v0.2?
Thank you!
Why not ... I'll do a 2.40 for ya. :-)
The original model is here, and I'm not sure about the specifics of the recipe. I just know it (1) isn't dumber than a bag of hair, and (2) when it works, it puts out some hot stuff. https://huggingface.co/NeverSleep/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss
On the downside, it can be prone to getting stuck in a loop once you are past the context window - even at 32k. I'm not sure how to avoid it, but it's very annoying, especially since one of my chats went over 800 messages and I was really loving it. Then it fell into a loop yesterday, and I couldn't get it unstuck. I'm sure it's my fault, but I'm not sure what to do to move it along other than starting a fresh chat. Even so, 800 messages feels like a huge win compared to most others.
I think it might still be a touch too large for a 16GB card while using longer context. The problem is, we are getting further down the accuracy curve, so I don't know how far you want to go. I'm not a big fan of 2.0 ... but I'll try a 2.25 to see if that might still have enough of the original flavor in it to get you by.
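Rough back-of-the-envelope, assuming the commonly quoted ~46.7B total parameters for Mixtral 8x7B (the cache/overhead numbers are rough assumptions, not measurements):

```python
# Back-of-the-envelope VRAM estimate for an EXL2 quant of Mixtral 8x7B.
# ~46.7B is the usual total parameter count for the base model; this ignores
# the exact per-layer bit allocation and just uses the average bpw.
params = 46.7e9          # total parameters, experts included
bpw = 2.25               # average bits per weight for the quant

weight_gib = params * bpw / 8 / 1024**3
print(f"weights: ~{weight_gib:.1f} GiB")            # ~12.2 GiB

# Whatever is left on a 16 GiB card has to hold the KV cache, activations,
# and CUDA/driver overhead - which is why long context gets tight.
print(f"headroom on 16 GiB: ~{16 - weight_gib:.1f} GiB")
```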
Okay: I haven't tried this one, but I hope it works great for ya!
I will check and report back! Thanks for the quants and your time, appreciated.
Btw, did you base it on instruct-v0.1 or instruct-v0.2?
Again, I'm not sure what's under the hood. You'd have to ask Undi and Ikaridev.
The 2.25bpw loads with 16k context on a 4080, and it even still passes one of my standard reasoning tests (only a few models get that right).
You: If I have 7 apples today, and ate 3 last week, how many do I have now?
AI: You currently have 7 apples. Your consumption of 3 apples from last week does not affect your current apple count.
Looks promising, I'll check the RP stuff later. Thanks again for your time!
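For reference, loading the 2.25bpw quant at 16k context with the exllamav2 Python API looks roughly like this; a minimal sketch rather than my exact setup, with the local model path and sampler values as placeholders:

```python
# Rough sketch: load the 2.25bpw EXL2 quant with a capped 16k context.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "models/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss-2.25bpw-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 16384   # cap context so the KV cache fits next to the weights

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # also works on a single GPU

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.token_repetition_penalty = 1.1  # mild penalty; may help with the looping mentioned earlier

prompt = "If I have 7 apples today, and ate 3 last week, how many do I have now?"
print(generator.generate_simple(prompt, settings, num_tokens=64))
```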
I had a bit of time on my hands and can say that the 2.25bpw works very well on my end, no obvious shortcomings, and it ranks pretty high on my personal favorite list!
On a side note, do you know if exl2 quants already support this new 2-bit SOTA stuff that was recently merged into llama.cpp?
I actually don't know. I started playing with Aphrodite-engine on Thursday, and it doesn't even support EXL2. I've had to use GPTQ, and this model doesn't work because I can't split it across cards. That said, the inference is INSANELY fast: an 8k-context response in 4 seconds, 32k context in like 9-12 seconds. It definitely changes the experience, but I've had to drop model size down to MistralTrix. https://huggingface.co/zaq-hack/MistralTrix-v1-GPTQ
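For the curious, the Aphrodite side is roughly this; a minimal sketch assuming it keeps vLLM's Python interface (LLM / SamplingParams), with placeholder settings rather than my exact config:

```python
# Sketch: serve the GPTQ quant linked above through Aphrodite's (vLLM-style)
# offline Python API on a single card. Settings are illustrative placeholders.
from aphrodite import LLM, SamplingParams

llm = LLM(
    model="zaq-hack/MistralTrix-v1-GPTQ",  # the GPTQ repo linked above
    quantization="gptq",
    max_model_len=32768,          # the 32k context mentioned above
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.8, max_tokens=256)
out = llm.generate(["Write a short greeting."], params)
print(out[0].outputs[0].text)
```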