Tutorial on making GGUF quants for this model (until proper support is implemented)

#5
by quantflex - opened

Here is how you can make your own GGUFs, if anyone is interested (without the MTP layers).
You can also find them premade in my repo here: https://huggingface.co/quantflex/MiMo-7B-RL-nomtp-GGUF

NOTE: This should be treated as a temporary solution until we have proper llama.cpp support, as it might degrade quality.

Related discussion: https://huggingface.co/XiaomiMiMo/MiMo-7B-RL/discussions/3

First, in config.json, change this (on line 3):

    "MiMoForCausalLM"

to:

    "Qwen2ForCausalLM"

This is done because the model is practically based on the Qwen2 architecture with only minor modifications.
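
If you prefer to script that edit instead of changing the file by hand, here is a minimal sketch (assuming config.json is in the current directory and has the usual Hugging Face layout):

    import json

    # Load the existing config
    with open("config.json") as f:
        cfg = json.load(f)

    # Point the architecture at Qwen2 so the conversion script takes the Qwen2 code path
    cfg["architectures"] = ["Qwen2ForCausalLM"]

    with open("config.json", "w") as f:
        json.dump(cfg, f, indent=2)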

After that, save the following as a Python script named deletemtp.py in the safetensors directory:

    # deletemtp.py - removes the MTP (multi-token prediction) tensors from the last shard
    from safetensors import safe_open
    from safetensors.torch import save_file

    shard_path = "model-00004-of-00004.safetensors"

    # Load every tensor except the MTP layer weights into memory
    with safe_open(shard_path, framework="pt") as f:
        tensors = {
            k: f.get_tensor(k)
            for k in f.keys()
            if not k.startswith("model.mtp_layers.0.")
        }

    # Overwrite the shard in place, keeping the standard PyTorch format tag in the metadata
    save_file(tensors, shard_path, metadata={"format": "pt"})

This deletes the MTP layers; just run the script with Python.
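
To double-check the result, here is a minimal sketch (assuming the same shard name as above) that lists any MTP tensors left in the shard:

    from safetensors import safe_open

    # After running deletemtp.py, no key should start with "model.mtp_layers."
    with safe_open("model-00004-of-00004.safetensors", framework="pt") as f:
        leftover = [k for k in f.keys() if k.startswith("model.mtp_layers.")]

    print("remaining MTP tensors:", leftover)  # expect an empty list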

And finally, in model.safetensors.index.json, delete every line that contains the word mtp (a scripted alternative follows right after this list):

    "model.mtp_layers.0.final_layernorm.weight": "model-00004-of-00004.safetensors",
    "model.mtp_layers.0.hidden_layernorm.weight": "model-00004-of-00004.safetensors",
    "model.mtp_layers.0.input_layernorm.weight": "model-00004-of-00004.safetensors",
    "model.mtp_layers.0.input_proj.weight": "model-00004-of-00004.safetensors",
    "model.mtp_layers.0.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
    "model.mtp_layers.0.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
    "model.mtp_layers.0.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
    "model.mtp_layers.0.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
    "model.mtp_layers.0.self_attn.k_proj.bias": "model-00004-of-00004.safetensors",
    "model.mtp_layers.0.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
    "model.mtp_layers.0.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
    "model.mtp_layers.0.self_attn.q_proj.bias": "model-00004-of-00004.safetensors",
    "model.mtp_layers.0.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
    "model.mtp_layers.0.self_attn.v_proj.bias": "model-00004-of-00004.safetensors",
    "model.mtp_layers.0.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
    "model.mtp_layers.0.token_layernorm.weight": "model-00004-of-00004.safetensors",

Then convert the model to GGUF as usual with convert_hf_to_gguf.py.
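
For reference, a typical invocation from the llama.cpp directory looks like this (the model path, output filename, and output type are just examples):

    python convert_hf_to_gguf.py /path/to/MiMo-7B-RL --outfile MiMo-7B-RL-nomtp-f16.gguf --outtype f16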

@quantflex hi, could you please temporarily remove this guide?

This will produce a wrong GGUF which does not support MTP; the support will be added later: https://github.com/ggml-org/llama.cpp/pull/13236

Once it is added, models produced using your guide will not be compatible.

Edit: hmm ok, maybe you can keep this for a while, as this is the same situation with DeepSeek V3 (I completely forgot that it also has MTP, but the GGUFs don't support MTP)

But it's up to you anyway. Personally, I think the 7B is small enough that maybe more people will help finish the PR.
