Is this the MiMo-7B GGUF format?

#3
by wenguoli - opened

https://huggingface.co/jedisct1/MiMo-7B-RL-GGUF

When will the official team release a GGUF version, or add Ollama support?

As of now, I wouldn't recommend using any of the existing GGUF conversions of this model. The author of one of them explicitly states the following in the model card:

I have deleted the mtp layers in order to make it work with llama.cpp. Quality might be degraded.

A proper implementation would be better, but this will work until that is implemented.

This means that the conversion was impossible without this "surgery", so anyone who converted this model into GGUF format ended up with quants that are either broken or of degraded quality.
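The "surgery" in question can be pictured as filtering the MTP tensors out of the checkpoint before conversion. Here is a minimal sketch, assuming MTP tensor names simply contain "mtp" (an assumption; the real MiMo checkpoint naming may differ):

```python
# Toy sketch of the "surgery": drop MTP (multi-token prediction) tensors
# from a checkpoint's state dict before GGUF conversion.
# Assumption: MTP tensor names contain "mtp"; real MiMo naming may differ.
def strip_mtp_layers(state_dict):
    """Return a copy of state_dict without MTP tensors."""
    return {name: tensor for name, tensor in state_dict.items()
            if "mtp" not in name.lower()}

# Tiny fake state dict standing in for real tensors.
ckpt = {
    "model.layers.0.self_attn.q_proj.weight": "tensor A",
    "model.mtp_layers.0.proj.weight": "tensor B",
}
print(strip_mtp_layers(ckpt))
# -> {'model.layers.0.self_attn.q_proj.weight': 'tensor A'}
```

Everything the MTP head contributed at inference time is simply gone after this step, which is why quality may degrade.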

Let's be honest: even when a model is perfectly supported by llama.cpp, quantization by its nature prevents the model from performing at its full potential. If you must remove parts of it just to make the conversion possible, quality degrades even further.
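To make the quantization-loss point concrete, here is a toy round-trip of random weights through a symmetric b-bit grid with absmax scaling. This is a deliberate simplification: real GGUF quants work block-wise with more elaborate schemes, but the trend (fewer bits, larger error) is the same.

```python
import random

def quantize_dequantize(weights, bits):
    """Round-trip weights through a symmetric signed b-bit grid
    using absmax scaling (a toy stand-in for GGUF block quants)."""
    levels = 2 ** (bits - 1) - 1          # e.g. 7 usable levels for 4-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

random.seed(0)
w = [random.gauss(0, 1) for _ in range(1000)]

errors = {}
for bits in (8, 4, 2):
    wq = quantize_dequantize(w, bits)
    errors[bits] = sum((a - b) ** 2 for a, b in zip(w, wq)) / len(w)
    print(f"{bits}-bit MSE: {errors[bits]:.6f}")
```

The mean squared error grows as the bit width shrinks, which is the quality loss being discussed.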

This is a small 7B model, and small models are usually the ones that suffer the most from quantization itself. If the quality is degraded even further, chances are it will not even reach the performance of smaller models. Until it gets proper support in llama.cpp, which usually requires full cooperation between the model authors and the llama.cpp team, using these models in GGUF format is not advisable.

Hi guys,

Yes, @MrDevolver is correct πŸ‘ That quote is from my repo here: https://huggingface.co/quantflex/MiMo-7B-RL-nomtp-GGUF

I also put "nomtp" in the title of the repo because I wanted to make it very clear that these are not the final quants as llama.cpp doesn't have an official implementation yet.

I should note that the official XiaomiMiMo repo says this:

Or, you can register a vLLM loader for MiMo without loading MTP parameters.

So maybe it's not too bad, but I'm not sure. It is my understanding that MTP (multi-token prediction) is used both for quality and efficiency.
