How did you approach merging these models?

#1
opened by Evi1ran

Hello there! 😊
I hope you're doing well. I'm really impressed with your work on combining multiple models into a Mixture of Experts (MoE); it's quite inspiring!

I was wondering if you'd be kind enough to share how you approached merging these models or how you stacked them together to form the MoE structure. If possible, would you mind sharing some code examples or even just the general idea behind your method? I'd greatly appreciate any insights you could offer!

Thank you so much for taking the time to read this, and I look forward to hearing from you! 🙏

https://huggingface.co/huihui-ai/Huihui-MoE-1.3B-A0.6B-abliterated#training

Conversion: The model copies embeddings, self-attention, and normalization weights from Qwen3-0.6B, replacing MLP layers with MoE layers (4 experts). Gating weights are randomly initialized.

Architecture: Qwen3MoeForCausalLM model with 4 experts per layer (num_experts=4), activating 1 expert per token (num_experts_per_tok=1).

See the model's config.json for these settings.
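For reference, a conversion along these lines could be sketched as below. This is a minimal sketch based only on the description above, not the author's actual script: the expert ordering, the reuse of the dense intermediate size as moe_intermediate_size, the norm_topk_prob setting, and the output path are assumptions, and it needs a transformers release that ships Qwen3MoeForCausalLM.

```python
# Minimal sketch of the dense-to-MoE conversion described above (not the author's script).
# Assumptions: expert order, moe_intermediate_size == dense intermediate_size, output path.
import torch
from transformers import AutoModelForCausalLM, Qwen3MoeConfig, Qwen3MoeForCausalLM

BASE = "huihui-ai/Qwen3-0.6B-abliterated"        # donor for embeddings, attention, norms
EXPERT_DONORS = [                                 # one dense donor per expert slot (assumed order)
    "huihui-ai/Qwen3-0.6B-abliterated",
    "suayptalha/Qwen3-0.6B-Code-Expert",
    "suayptalha/Qwen3-0.6B-Math-Expert",
    "suayptalha/Qwen3-0.6B-Medical-Expert",
]

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
cfg = base.config

# MoE config mirroring the dense model: 4 experts per layer, 1 active expert per token.
moe_cfg = Qwen3MoeConfig(
    vocab_size=cfg.vocab_size,
    hidden_size=cfg.hidden_size,
    intermediate_size=cfg.intermediate_size,
    moe_intermediate_size=cfg.intermediate_size,  # each expert keeps the dense MLP size (assumed)
    num_hidden_layers=cfg.num_hidden_layers,
    num_attention_heads=cfg.num_attention_heads,
    num_key_value_heads=cfg.num_key_value_heads,
    head_dim=cfg.head_dim,
    rms_norm_eps=cfg.rms_norm_eps,
    rope_theta=cfg.rope_theta,
    max_position_embeddings=cfg.max_position_embeddings,
    tie_word_embeddings=cfg.tie_word_embeddings,
    num_experts=4,
    num_experts_per_tok=1,
    decoder_sparse_step=1,                        # every layer gets an MoE block
    norm_topk_prob=True,                          # give the single routed expert full weight (assumed)
)
moe = Qwen3MoeForCausalLM(moe_cfg).to(torch.bfloat16)

# Shared weights (embeddings, final norm, lm_head) come from the base model.
moe.model.embed_tokens.load_state_dict(base.model.embed_tokens.state_dict())
moe.model.norm.load_state_dict(base.model.norm.state_dict())
moe.lm_head.load_state_dict(base.lm_head.state_dict())

donors = [AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
          for name in EXPERT_DONORS]

for i, layer in enumerate(moe.model.layers):
    src = base.model.layers[i]
    # Self-attention and normalization weights are copied from the base model.
    layer.self_attn.load_state_dict(src.self_attn.state_dict())
    layer.input_layernorm.load_state_dict(src.input_layernorm.state_dict())
    layer.post_attention_layernorm.load_state_dict(src.post_attention_layernorm.state_dict())
    # Each expert slot receives a verbatim copy of one donor's dense MLP;
    # the router (layer.mlp.gate) keeps its random initialization.
    for e, donor in enumerate(donors):
        layer.mlp.experts[e].load_state_dict(donor.model.layers[i].mlp.state_dict())

moe.save_pretrained("Huihui-MoE-4x0.6B-sketch")
```

Because the gating weights start out random, the router initially picks experts arbitrarily; presumably that is what the training step linked above (the #training section of the model card) addresses.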

If the model family provides an MoE (Mixture of Experts) architecture, as Qwen and DeepSeek do, you can simply convert models with consistent parameters into an MoE-type model.
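As one concrete reading of "consistent parameters" (my interpretation, not a statement of the author's exact criteria), the donor checkpoints should at least agree on the dimensions that the shared and expert weights reuse. A quick check over the models discussed later in this thread:

```python
# Rough check of what "consistent parameters" can mean in practice (my interpretation):
# all donor models must agree on the shapes that shared and expert weights will reuse.
from transformers import AutoConfig

DONORS = [
    "huihui-ai/Qwen3-0.6B-abliterated",
    "suayptalha/Qwen3-0.6B-Code-Expert",
    "suayptalha/Qwen3-0.6B-Math-Expert",
    "suayptalha/Qwen3-0.6B-Medical-Expert",
]
FIELDS = ["hidden_size", "intermediate_size", "num_hidden_layers",
          "num_attention_heads", "num_key_value_heads", "vocab_size"]

configs = [AutoConfig.from_pretrained(name) for name in DONORS]
for field in FIELDS:
    values = {getattr(c, field) for c in configs}
    assert len(values) == 1, f"Donor models disagree on {field}: {values}"
print("Donor models are dimensionally compatible for MoE conversion.")
```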

Thank you so much for taking the time to reply! 😊
I’m truly curious about how you merged these specific models: suayptalha/Qwen3-0.6B-Code-Expert, suayptalha/Qwen3-0.6B-Math-Expert, suayptalha/Qwen3-0.6B-Medical-Expert, and huihui-ai/Qwen3-0.6B-abliterated into a single unified model.

What caught my attention is that the merged version you shared exists as one consolidated file, while the model at https://huggingface.co/suayptalha/Arcana-Qwen3-2.4B-A0.6B appears quite different in structure. This discrepancy feels intriguing, and I’d love to learn more about your approach if possible!

Could you kindly share any insights into the methodology or design choices behind this merging process?

You can compare the config.json files of Qwen3-30B-A3B, Qwen3-0.6B, and Huihui-MoE-1B-A0.6B. You should be able to discover some differences or patterns.
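One way to do that comparison (a sketch of my own, not the author's tooling) is to load each config with transformers and print only the fields whose values differ; the repo ids below follow the names used in this thread and may need adjusting:

```python
# Load the three configs and print only the fields whose values differ.
# Repo ids follow the names used in this thread; adjust them if they differ on the Hub.
from transformers import AutoConfig

REPOS = [
    "Qwen/Qwen3-30B-A3B",
    "Qwen/Qwen3-0.6B",
    "huihui-ai/Huihui-MoE-1B-A0.6B",
]
configs = {repo: AutoConfig.from_pretrained(repo).to_dict() for repo in REPOS}

all_keys = sorted(set().union(*(c.keys() for c in configs.values())))
for key in all_keys:
    values = [configs[repo].get(key) for repo in REPOS]
    if len({repr(v) for v in values}) > 1:        # keep only the differing fields
        print(f"{key}: " + ", ".join(f"{repo}={v}" for repo, v in zip(REPOS, values)))
```

The differences to look for are model_type ("qwen3" vs. "qwen3_moe") and the MoE-specific fields such as num_experts, num_experts_per_tok, and moe_intermediate_size.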

If you want to merge multiple models from the Qwen3 series using MoE (Mixture of Experts) fusion, we can certainly give it a try.
