UMoE: Unifying Attention and FFN with Shared Experts
Abstract
Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending the MoE paradigm to attention layers to enhance model performance. However, existing attention-based MoE layers require specialized implementations and demonstrate suboptimal performance compared to their FFN-based counterparts. In this paper, we aim to unify the MoE designs in attention and FFN layers by introducing a novel reformulation of the attention mechanism, revealing an underlying FFN-like structure within attention modules. Our proposed architecture, UMoE, achieves superior performance through attention-based MoE layers while enabling efficient parameter sharing between FFN and attention components.
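To make the unification idea concrete, here is a minimal sketch (not the authors' code) of how a pool of experts could be shared between an FFN-style MoE layer and an attention-style MoE layer: the attention layer first mixes tokens with standard softmax attention weights, then routes the *mixed* token representations through the same experts that the FFN layer uses. All names (`SharedExpert`, `SharedExpertMoE`, `AttentionMoE`, `top_k`) are illustrative assumptions, not the paper's API.

```python
# Hedged sketch of shared experts between FFN-MoE and attention-MoE layers.
# Assumes the abstract's reformulation: attention ~= token mixing followed by
# an FFN-like transform, so the FFN-like part can be expert-shared.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpert(nn.Module):
    """A plain two-layer FFN; the same module is reused by both MoE layers."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(F.relu(self.up(x)))

class SharedExpertMoE(nn.Module):
    """Top-k routing over an expert pool that other layers can also reuse."""
    def __init__(self, experts, d_model, top_k=2):
        super().__init__()
        self.experts = experts                      # shared nn.ModuleList
        self.router = nn.Linear(d_model, len(experts))
        self.top_k = top_k

    def forward(self, x):                           # x: (batch, seq, d_model)
        scores = self.router(x).softmax(dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e             # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

class AttentionMoE(nn.Module):
    """Attention layer rewritten as token mixing followed by shared experts."""
    def __init__(self, moe, d_model):
        super().__init__()
        self.moe = moe                              # the *same* SharedExpertMoE
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)

    def forward(self, x):
        attn = (self.q(x) @ self.k(x).transpose(-2, -1)) / x.shape[-1] ** 0.5
        mixed = attn.softmax(dim=-1) @ x            # mix tokens first ...
        return self.moe(mixed)                      # ... then apply shared experts
```

Under these assumptions, parameter sharing comes from constructing a single `nn.ModuleList` of `SharedExpert` modules and passing the resulting `SharedExpertMoE` both to the FFN position of a Transformer block and to `AttentionMoE`; the sketch omits multi-head structure, value/output projections, and load-balancing losses for brevity.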
Community
This is an automated message from Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing (2025)
- MoxE: Mixture of xLSTM Experts with Entropy-Aware Routing for Efficient Language Modeling (2025)
- Beyond Standard MoE: Mixture of Latent Experts for Resource-Efficient Language Models (2025)
- Optimal Scaling Laws for Efficiency Gains in a Theoretical Transformer-Augmented Sectional MoE Framework (2025)
- Mixture of Group Experts for Learning Invariant Representations (2025)
- S'MoRE: Structural Mixture of Residual Experts for LLM Fine-tuning (2025)
- Sparse Mixture of Experts as Unified Competitive Learning (2025)