Please quantize the rocket 3b model to 2-bit?
Can you quantize the rocket 3b model? For chat.
So far I have only looked into improving the quantization of general-purpose models. The method I use relies on an "importance matrix" that helps ensure more accurate quantized values for the more important model weights. This matrix is derived from a calibration run on a training dataset. For base models one can use basically any sufficiently broad text dataset. My best guess is that for chat/instruct-tuned models I need to find a good training dataset that is geared towards chat/instruct tuning.
So, in short: yes, I could do it, but I need some time to get into quantizing this type of model.
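To make the "importance matrix" idea a bit more concrete, here is a minimal sketch of importance-weighted quantization. It is not the actual llama.cpp implementation: the function names, the symmetric round-to-nearest scheme, and the simple scale search are all assumptions chosen for illustration. The only point it shows is that the per-weight importance values change which scale gets picked, so that the weights that matter more end up with smaller error.

```python
import numpy as np

def weighted_error(weights, q, scale, importance):
    """Importance-weighted squared error between weights and their dequantized values."""
    return float(np.sum(importance * (weights - scale * q) ** 2))

def quantize_block(weights, importance, bits=2, n_candidates=64):
    """Hypothetical helper: choose the scale of a symmetric round-to-nearest
    quantizer that minimizes the importance-weighted squared error."""
    qmin = -(2 ** (bits - 1))       # -2 for 2 bits
    qmax = 2 ** (bits - 1) - 1      # +1 for 2 bits
    max_abs = float(np.max(np.abs(weights)))
    if max_abs == 0.0:
        # All-zero block: any scale works, every quant is zero.
        return np.zeros(weights.shape, dtype=np.int8), 1.0

    best_err, best_q, best_scale = None, None, None
    # Try a range of scales around the naive "max / |qmin|" choice.
    for frac in np.linspace(0.3, 1.2, n_candidates):
        scale = frac * max_abs / abs(qmin)
        q = np.clip(np.round(weights / scale), qmin, qmax).astype(np.int8)
        err = weighted_error(weights, q, scale, importance)
        if best_err is None or err < best_err:
            best_err, best_q, best_scale = err, q, scale
    return best_q, best_scale

# Example with made-up numbers: in a real calibration run the importance
# would come from activation statistics, here it is just random.
rng = np.random.default_rng(0)
w = rng.normal(size=32)
imp = rng.uniform(0.1, 10.0, size=32)
q, scale = quantize_block(w, imp)
print(q, scale)
```

Passing `np.ones_like(w)` as the importance gives the plain unweighted behaviour, which makes it easy to compare the two.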
Can you go for a 2-bit version of Mixtral Instruct? We don't usually use the Mixtral base model.
Same answer as above: give me some time to get into quantizing instruct models before I start posting.
I'd advise patience. When Ika's PR is merged in llama.cpp, every new model released after that will benefit from the recent changes. TheBloke might requant some older models too, if requested.
If it's not too hard for you. A smaller version would be better. Thank you very much.
The Mixtral-instruct-8x7b quantizations are now posted.
@Shqmil Is the interest in 2-bit quants for rocket-3b in the very small 2-bit version (2.06 bits per weight), or in the better-quality but larger 2-bit version (2.56 bits per weight)?
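For a rough feel of what the two bits-per-weight options mean in file size, here is a small back-of-the-envelope sketch. The parameter count of 3e9 is a round number assumed purely for illustration (not the exact size of rocket-3b), and real GGUF files also carry metadata that is ignored here.

```python
def quantized_size_gb(n_params, bits_per_weight):
    """Approximate size in GB (1e9 bytes) of n_params weights stored at bits_per_weight."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 3e9  # assumption: a round "3B" figure, not the exact model size

for bpw in (2.06, 2.56):
    print(f"{bpw} bpw -> about {quantized_size_gb(N_PARAMS, bpw):.2f} GB")
```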
Based on the ranking in this post: https://www.reddit.com/r/LocalLLaMA/comments/1916896/llm_comparisontest_confirm_leaderboard_big_news/
https://huggingface.co/cloudyu/Mixtral_34Bx2_MoE_60B
It would be interesting if it got a 2-bit quant as well. The Q2_K in https://huggingface.co/TheBloke/Mixtral_34Bx2_MoE_60B-GGUF isn't all that interesting, but Q3_K_M is too big for my 24GB GPU.
Can you go for a Q5_K_M version of Mixtral Instruct too?
https://huggingface.co/ikawrakow/mixtral-instruct-8x7b-quantized-gguf
I thought it already existed.
No, there is Q5_K_S but not Q5_K_M.
For Mixtral-instruct-8x7b, Q5_K_M has about the same performance as Q5_K_S, so I did not publish it. Where I live, 100 Mb/s is the best one can get. The way HF is set up for these models (using git-lfs), a 35 GB file requires ~70 GB to be uploaded, so that takes nearly 2 hours with my Internet speed.
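The upload-time estimate is easy to check with a few lines of arithmetic. This is just a sketch that assumes the link runs at the stated rate the whole time and that 1 GB = 1e9 bytes; protocol overhead is ignored, so real uploads will take somewhat longer.

```python
def upload_hours(gigabytes, megabits_per_second):
    """Hours needed to push `gigabytes` of data over a link of `megabits_per_second`."""
    bits = gigabytes * 8e9                       # 1 GB = 8e9 bits
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 3600

# ~70 GB actually transferred for a 35 GB file, on a 100 Mb/s line
print(f"{upload_hours(70, 100):.2f} hours")
```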
Thank God TheBloke exists, he's the man to do this really tedious work for us lmao
Can you quantize the rocket 3b model? For chat.
Have added two 2-bit models (rocket-3b-2.31bpw.gguf and rocket-3b-2.76bpw.gguf). But their perplexity is so high, especially for the 2.31 bpw model, that I have doubts they will be useful for anything.
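For anyone not familiar with the metric: perplexity is the exponential of the average negative log-probability the model assigns to each token of a test text, so lower is better. A minimal sketch follows; the function name and the toy inputs are made up for illustration and this is not how llama.cpp's perplexity tool is implemented.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from a list of natural-log probabilities, one per evaluated token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)

# Toy example: a model that gives every token probability 0.25
# is as uncertain as a uniform choice among 4 tokens.
print(perplexity([math.log(0.25)] * 10))
```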
Wow, 100 Mb/s is horribly slow by today's standards. I guess you live somewhere in the US.
Haha. It is Italy, actually. They claim my house is too far from the next switch station, yet I can clearly see how they do network shaping: it starts way better than 100 Mb/s, then drops well below 100 Mb/s, and then, on a larger data transfer, stabilizes around 96 Mb/s. Why they wouldn't sell me better than 100 Mb/s is beyond me. But my Italian is not quite there yet, so trying to have a conversation with their representatives in my broken Italian or their broken English doesn't really help.
https://huggingface.co/OpenBuddy/openbuddy-mixtral-7bx8-v16.3-32k
Could you give this one a try? It offers multi-language support.
@ikawrakow
Did you try
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
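In case it helps, here is roughly how that setting would be used from Python. This is a sketch under a few assumptions: the `hf_transfer` package is installed, the variable is set before `huggingface_hub` is imported (as far as I know the library reads it at import time), and `upload_gguf` is a hypothetical helper name. It will not help if the line itself is the bottleneck.

```python
import os

# Set this before huggingface_hub is imported so the library picks it up.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

def upload_gguf(local_path, repo_id, path_in_repo=None):
    """Hypothetical helper: upload one file to a Hugging Face model repo."""
    # Imported lazily so the environment variable above is already in place.
    from huggingface_hub import HfApi

    api = HfApi()
    return api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=path_in_repo or os.path.basename(local_path),
        repo_id=repo_id,
    )
```

Calling something like `upload_gguf("model.gguf", "your-name/your-repo")` (both placeholder values) would then go through the accelerated transfer path, assuming you are logged in with a write token.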