Tutorial on making GGUF quants for this model (until proper support is implemented)
Here is how you can make your own GGUFs if anyone is interested (without the MTP layers).
You can also find them premade in my repo here: https://huggingface.co/quantflex/MiMo-7B-RL-nomtp-GGUF
NOTE: This should be treated as a temporary solution until we have proper llama.cpp support, as dropping the MTP layers might degrade quality.
Related discussion: https://huggingface.co/XiaomiMiMo/MiMo-7B-RL/discussions/3
First, in config.json, change this (on line 3):
"MiMoForCausalLM"
to:
"Qwen2ForCausalLM"
This is done because the model is essentially the Qwen2 architecture with very few modifications.
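If you would rather make this edit programmatically, here is a minimal sketch in Python; it assumes the value sits in the usual "architectures" list of config.json:

import json

# Load the existing config, swap the architecture name, and write it back.
with open("config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

config["architectures"] = ["Qwen2ForCausalLM"]

with open("config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)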
After that, save the following as a Python script (e.g. deletemtp.py) in the safetensors directory:
from safetensors import safe_open
from safetensors.torch import save_file

shard_path = "model-00004-of-00004.safetensors"

# Load every tensor from the last shard except the MTP layers.
with safe_open(shard_path, framework="pt") as f:
    tensors = {
        k: f.get_tensor(k)
        for k in f.keys()
        if not k.startswith("model.mtp_layers.0.")
    }

# Overwrite the shard with the filtered tensors.
save_file(tensors, shard_path)
This will delete the MTP layers from the shard, so just run the script with python deletemtp.py.
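Optionally, you can sanity-check that no MTP tensors remain in the shard afterwards (this check is not part of the original steps, just a quick verification):

from safetensors import safe_open

# List any remaining keys that still reference the MTP layers; this should print an empty list.
with safe_open("model-00004-of-00004.safetensors", framework="pt") as f:
    leftover = [k for k in f.keys() if "mtp" in k]

print("Remaining MTP tensors:", leftover)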
And finally, in model.safetensors.index.json, delete all the lines that contain the word mtp (or use the small script shown after this list):
"model.mtp_layers.0.final_layernorm.weight": "model-00004-of-00004.safetensors",
"model.mtp_layers.0.hidden_layernorm.weight": "model-00004-of-00004.safetensors",
"model.mtp_layers.0.input_layernorm.weight": "model-00004-of-00004.safetensors",
"model.mtp_layers.0.input_proj.weight": "model-00004-of-00004.safetensors",
"model.mtp_layers.0.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
"model.mtp_layers.0.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
"model.mtp_layers.0.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
"model.mtp_layers.0.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
"model.mtp_layers.0.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
"model.mtp_layers.0.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
"model.mtp_layers.0.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
"model.mtp_layers.0.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
"model.mtp_layers.0.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
"model.mtp_layers.0.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
"model.mtp_layers.0.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
"model.mtp_layers.0.token_layernorm.weight": "model-00004-of-00004.safetensors",
Then convert the model to GGUF as usual with convert_hf_to_gguf.py.
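For example, something like the following (the paths and output filename here are just placeholders; convert_hf_to_gguf.py comes from the llama.cpp repo):

python convert_hf_to_gguf.py /path/to/MiMo-7B-RL --outfile MiMo-7B-RL-nomtp-f16.gguf --outtype f16

You can then quantize the resulting f16 GGUF with llama.cpp's llama-quantize tool if you want smaller files.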
@quantflex hi, could you please temporarily remove this guide?
It will produce a wrong GGUF which does not support MTP; the support will be added later: https://github.com/ggml-org/llama.cpp/pull/13236
Once it is added, models produced using your guide will not be compatible.
Edit: hmm, ok, maybe you can keep this for a while, as it's the same situation as DeepSeek V3 (I completely forgot that it also has MTP, but the GGUFs don't support it).
But it's up to you anyway. Personally, I think the 7B is small enough that maybe more people will help finish the PR.