Try MoE with models of different sizes!
For example:
Expert 1: 0.6B model
Expert 2: 1.7B model
Expert 3: 32B model
...
Dual-stage selection:
Stage 1 - Expert 1:
Stage 2: (0.6B-model-1, 0.6B-model-2, 0.6B-model-3, 0.6B-model-4)
Stage 1 - Expert 2:
Stage 2: (1.7B-model-1, 1.7B-model-2, 1.7B-model-3, 1.7B-model-4)
Stage 1 - Expert 3:
Stage 2: (32B-model-1, 32B-model-2, 32B-model-3, 32B-model-4)
and so on.
What's the difference if all the models are in one stage? :)
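For illustration, here is a minimal PyTorch sketch of the two-stage selection above, next to a flat single-stage gate for comparison. The hidden size, class count, and instances per class are placeholder values, not tied to any real models:

```python
# Hypothetical sketch: two-stage routing vs. a flat single-stage router.
# Stage 1 picks a size class (e.g. 0.6B / 1.7B / 32B), Stage 2 picks one
# instance inside that class. All sizes and counts below are made up.
import torch
import torch.nn as nn


class TwoStageRouter(nn.Module):
    def __init__(self, hidden_size: int, num_classes: int, instances_per_class: int):
        super().__init__()
        # Stage 1: choose a size class.
        self.class_gate = nn.Linear(hidden_size, num_classes)
        # Stage 2: one gate per class, choosing an instance within that class.
        self.instance_gates = nn.ModuleList(
            [nn.Linear(hidden_size, instances_per_class) for _ in range(num_classes)]
        )

    def forward(self, h: torch.Tensor):
        # h: [batch, hidden_size] routing features (e.g. pooled hidden states).
        class_probs = torch.softmax(self.class_gate(h), dim=-1)   # [batch, C]
        chosen_class = class_probs.argmax(dim=-1)                 # [batch]
        instance_idx = torch.empty_like(chosen_class)
        for c, gate in enumerate(self.instance_gates):
            mask = chosen_class == c
            if mask.any():
                inst_probs = torch.softmax(gate(h[mask]), dim=-1)  # [n_c, I]
                instance_idx[mask] = inst_probs.argmax(dim=-1)
        return chosen_class, instance_idx


class FlatRouter(nn.Module):
    """Single-stage alternative: one gate over all class*instance experts."""

    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts)

    def forward(self, h: torch.Tensor):
        return torch.softmax(self.gate(h), dim=-1).argmax(dim=-1)  # [batch]


if __name__ == "__main__":
    h = torch.randn(4, 256)                 # dummy routing features
    two_stage = TwoStageRouter(256, num_classes=3, instances_per_class=4)
    flat = FlatRouter(256, num_experts=3 * 4)
    print(two_stage(h))                     # (size class, instance) per input
    print(flat(h))                          # single expert index per input
```

In this sketch the flat router collapses both decisions into a single softmax; the two-stage version mainly makes the "which size class" choice explicit, which may make it easier to reason about cost per request (an assumption, not a measured result).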
Yes, it is possible to combine models of different sizes, such as a 0.6B (600M-parameter) model and a 1.7B model, using a Mixture of Experts (MoE) architecture. MoE works by assigning tasks to multiple "expert" models, with a gating network dynamically deciding which experts handle each input. Here's a concise explanation of how this can be done.
Combining a 0.6B and 1.7B model via MoE is feasible and can balance efficiency and performance. The key is designing an effective gating network and addressing model compatibility.
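For reference, a bare-bones sketch of the gating step when the experts already produce outputs in the same space. The expert stubs and dimensions below are placeholders, not the actual 0.6B/1.7B models:

```python
# Hypothetical sketch: a gating network mixing the outputs of two experts
# that share an output dimension. The "experts" here are small stand-ins
# for a smaller and a larger model, not the real ones.
import torch
import torch.nn as nn


class MoEOfTwo(nn.Module):
    def __init__(self, hidden_size: int, expert_small: nn.Module, expert_large: nn.Module):
        super().__init__()
        self.experts = nn.ModuleList([expert_small, expert_large])
        self.gate = nn.Linear(hidden_size, len(self.experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, hidden_size]
        weights = torch.softmax(self.gate(x), dim=-1)                # [batch, 2]
        outputs = torch.stack([e(x) for e in self.experts], dim=1)   # [batch, 2, hidden_size]
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)          # weighted mixture


# Stand-in experts with matching input/output dimensions.
small = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
large = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
moe = MoEOfTwo(256, small, large)
print(moe(torch.randn(4, 256)).shape)  # torch.Size([4, 256])
```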
Direct integration does not work out of the box, though; the combined system still requires some subsequent training (of the gating network at a minimum).
If the 0.6B and 1.7B models come from the same family, they typically share the same vocabulary and input/output dimensions (e.g., hidden size), so they can be plugged into the MoE architecture without extra adapter layers.
If the model architectures differ (e.g., different hidden sizes or layer counts), adapter layers must be added at each expert's input to project into a unified feature space, or the dimensions must be aligned at the output.
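A rough sketch of the adapter idea, with hidden sizes assumed purely for illustration (2048 for the expert, 1024 for the shared space):

```python
# Hypothetical sketch: adapters that project a mismatched expert into a
# unified feature space. The hidden sizes (2048 for the expert, 1024 for
# the shared space) are assumed for illustration only.
import torch
import torch.nn as nn


class AdaptedExpert(nn.Module):
    """Wraps an expert whose hidden size differs from the shared MoE space."""

    def __init__(self, expert: nn.Module, expert_dim: int, unified_dim: int):
        super().__init__()
        self.proj_in = nn.Linear(unified_dim, expert_dim)    # unified -> expert space
        self.expert = expert
        self.proj_out = nn.Linear(expert_dim, unified_dim)   # expert -> unified space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj_out(self.expert(self.proj_in(x)))


# Stand-in for a larger expert with a 2048-wide hidden state.
big_expert = nn.Sequential(nn.Linear(2048, 2048), nn.GELU(), nn.Linear(2048, 2048))
adapted = AdaptedExpert(big_expert, expert_dim=2048, unified_dim=1024)
print(adapted(torch.randn(4, 1024)).shape)  # torch.Size([4, 1024])
```

In a setup like this, the adapters and the gate are typically the pieces that need the subsequent training mentioned above, even if the expert models themselves stay frozen.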
Thank you for the idea. We will see whether we can achieve the goal without fine-tuning.