MoE Experiments (proper sparse MoEs)
Based on SmolLM2 (a Llama-architecture model), MoE-ified and then further trained on a general dataset. A minimal sketch of the MoE block follows the configuration below.
MoE layers: [8, 12, 16, 20, 24, 28]
Top-k: 2 (activates 50.0% of experts per token)
Hidden size: 960
Total parameters: 494,554,560
Trainable parameters: 494,554,560
Auxiliary loss weight: 0.01
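The sketch below illustrates the kind of sparse MoE block this configuration implies: a router over a small pool of experts (4 experts is inferred from "top-k = 2 activates 50.0%"), top-2 gating, and a Switch-Transformer-style load-balancing auxiliary loss. Expert width (`ffn_dim`), module names, and the exact auxiliary-loss formulation are assumptions for illustration, not the training code used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    # Illustrative sparse MoE block: 4 experts assumed (2/4 = 50% activated per token).
    def __init__(self, hidden_size=960, num_experts=4, top_k=2, ffn_dim=2560):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, ffn_dim),
                nn.SiLU(),
                nn.Linear(ffn_dim, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (batch, seq, hidden) -> flatten to a token dimension
        tokens = x.reshape(-1, x.size(-1))
        logits = self.router(tokens)                      # (tokens, experts)
        probs = F.softmax(logits, dim=-1)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)   # route each token to its top-k experts
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalise gate weights over the chosen experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = (top_idx == e)                         # tokens that selected expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += top_p[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])

        # Load-balancing auxiliary loss (Switch-Transformer style, assumed):
        # fraction of tokens routed to each expert times mean router probability.
        num_experts = probs.size(-1)
        load = F.one_hot(top_idx, num_experts).float().sum(dim=1).mean(dim=0)
        importance = probs.mean(dim=0)
        aux_loss = num_experts * (load * importance).sum()
        return out.reshape_as(x), aux_loss
```

In a setup like this, the dense MLPs at the listed layer indices ([8, 12, 16, 20, 24, 28]) are replaced by MoE blocks, while the remaining layers keep their original feed-forward networks.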
Training loss:
Total Loss = 6.4659, LM Loss = 5.9851, Aux Loss = 48.0835
Validation loss:
Total Loss = 0.8298, LM Loss = 0.7697, Aux Loss = 6.0092
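The reported totals are consistent with Total Loss = LM Loss + aux_weight × Aux Loss, using the auxiliary loss weight of 0.01 listed above:

```python
# Sanity check: reported totals match LM loss + aux_weight * aux loss.
aux_weight = 0.01
train_total = 5.9851 + aux_weight * 48.0835   # = 6.4659 (matches the reported training total)
val_total   = 0.7697 + aux_weight * 6.0092    # = 0.8298 (matches the reported validation total)
```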