arxiv:2503.02495

Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer

Published on Mar 4
· Submitted by yjyangwork on Mar 7

Abstract

Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well-suited for large-scale applications. However, experts in existing MoE paradigms work as individuals, lacking high-quality expert interactions. Moreover, MoE has not been effectively extended to attention blocks, which constrains further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes a transformer into an equivalent group of experts and then implements dynamic routing over input data and experts. Our approach advances MoE design with four key innovations: (1) We conduct equivalent expert decomposition on both MLP blocks and attention blocks, based on the matrix partition used in tensor parallelism. (2) We develop two routing paradigms, patch-wise data selection and expert selection, to apply routing at different levels. (3) We design the architecture of the UoE model, including Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). (4) We develop a parallel implementation of UoE's routing and computation operations, and optimize efficiency based on hardware processing analysis. The experiments demonstrate that models employing UoE surpass Full Attention, state-of-the-art MoEs, and efficient transformers in several tasks across image and natural-language domains. The source code is available at https://github.com/YujiaoYang-work/UoE.
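To make innovation (1) concrete, here is a minimal PyTorch sketch of the decomposition idea under my own simplifying assumptions (an illustration, not the authors' released code): the MLP's first weight matrix is partitioned column-wise and the second row-wise, as in tensor parallelism, so the sum of all experts' partial outputs reproduces the dense MLP exactly; routing then only decides which partial outputs to compute.

```python
import torch
import torch.nn as nn

# Illustration only (my assumption, not the authors' code): equivalently
# decompose a dense MLP into n_experts by partitioning W1 column-wise and
# W2 row-wise, tensor-parallelism style. Summing every expert's partial
# output reproduces the dense MLP exactly.

d_model, d_ff, n_experts = 64, 256, 4
torch.manual_seed(0)

dense = nn.Sequential(
    nn.Linear(d_model, d_ff, bias=False),
    nn.ReLU(),
    nn.Linear(d_ff, d_model, bias=False),
)
W1, W2 = dense[0].weight, dense[2].weight   # (d_ff, d_model), (d_model, d_ff)
chunk = d_ff // n_experts

def expert_partial(x, i):
    """Partial output of expert i, which owns one slice of the hidden dimension."""
    sl = slice(i * chunk, (i + 1) * chunk)
    h = torch.relu(x @ W1[sl].T)            # (tokens, chunk)
    return h @ W2[:, sl].T                  # (tokens, d_model)

x = torch.randn(8, d_model)
full_out = dense(x)
union_out = sum(expert_partial(x, i) for i in range(n_experts))
print(torch.allclose(full_out, union_out, atol=1e-5))   # True: decomposition is exact
```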

Community

Paper author · Paper submitter

[Figure: model_architecture.png — the UoE model architecture]

This paper proposes a new method: Union-of-Experts (UoE). Compared to existing MoE methods, it builds expert groups by equivalently decomposing a whole model rather than combining multiple individual models. This approach allows the experts to operate as a larger whole instead of a mixture of individuals, which fully leverages the scale effect of the model.

The architecture of the UoE model includes Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). SMHA shares some similarities with the NSA introduced by DeepSeek half a month ago and the MoBA from Moonshot.AI, even though it was developed independently by the author over the course of a year.
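For intuition about the patch-wise data selection in SMHA, below is a rough single-head sketch under my own assumptions (the patch-summary scoring, patch size, and top-k values are illustrative choices; multi-head structure and causal masking are omitted): keys and values are grouped into patches, each query scores the patches with a cheap summary, and attention is computed only over the selected patches.

```python
import torch
import torch.nn.functional as F

# Rough sketch (my assumption, not the paper's exact SMHA): patch-wise data
# selection for one attention head. Keys/values are grouped into patches, each
# query scores the patches via their mean key, and attends only over the top-k
# selected patches instead of the full sequence.

def selective_attention(q, k, v, patch_size=16, top_k=4):
    T, d = k.shape
    n_patches = T // patch_size
    k_patches = k[: n_patches * patch_size].view(n_patches, patch_size, d)
    v_patches = v[: n_patches * patch_size].view(n_patches, patch_size, d)

    # Route each query to patches using the patch-mean key as a summary.
    patch_summary = k_patches.mean(dim=1)                  # (n_patches, d)
    scores = q @ patch_summary.T                           # (Tq, n_patches)
    top = scores.topk(top_k, dim=-1).indices               # (Tq, top_k)

    out = torch.zeros(q.shape[0], d)
    for i in range(q.shape[0]):
        sel_k = k_patches[top[i]].reshape(-1, d)           # (top_k * patch_size, d)
        sel_v = v_patches[top[i]].reshape(-1, d)
        attn = F.softmax(q[i] @ sel_k.T / d ** 0.5, dim=-1)
        out[i] = attn @ sel_v
    return out

q, k, v = torch.randn(8, 32), torch.randn(128, 32), torch.randn(128, 32)
print(selective_attention(q, k, v).shape)                  # torch.Size([8, 32])
```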

UoME, on the other hand, is a novel architecture that not only inherits the multi-expert and selective-routing paradigms of existing MoE models but also enables the activated experts to function as a cohesive whole, similar to an MLP of the same scale.
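Here is a tiny hypothetical UoME-style layer built on the same decomposition idea (again my own sketch, not the paper's implementation; the router design and top-k value are assumptions): because each expert is a slice of one shared weight matrix, summing the selected experts' partial outputs yields a coherent sub-network rather than a mixture of independent MLPs.

```python
import torch
import torch.nn as nn

# Hypothetical UoME-style layer (illustration, not the paper's released code):
# experts are slices of one shared MLP, a router picks top_k experts per token,
# and the selected partial outputs are summed into a single cohesive output.

class ToyUoME(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=4, top_k=2):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(d_ff, d_model) * 0.02)
        self.W2 = nn.Parameter(torch.randn(d_model, d_ff) * 0.02)
        self.router = nn.Linear(d_model, n_experts)
        self.n_experts, self.top_k = n_experts, top_k
        self.chunk = d_ff // n_experts

    def forward(self, x):                                          # x: (tokens, d_model)
        top_idx = self.router(x).topk(self.top_k, dim=-1).indices  # (tokens, top_k)
        out = torch.zeros_like(x)
        for e in range(self.n_experts):
            hit = (top_idx == e).any(dim=-1)                       # tokens routed to expert e
            if not hit.any():
                continue
            sl = slice(e * self.chunk, (e + 1) * self.chunk)
            h = torch.relu(x[hit] @ self.W1[sl].T)                 # expert e's hidden slice
            out[hit] = out[hit] + h @ self.W2[:, sl].T             # sum of partial outputs
        return out

print(ToyUoME()(torch.randn(8, 64)).shape)                         # torch.Size([8, 64])
```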

The benefits of applying equivalent decomposition and routing to a complete Transformer model are quite evident. The experiments demonstrate that the UoE model surpasses Full Attention, state-of-the-art MoEs, and efficient transformers (including the architecture of the recently proposed DeepSeek-V3) in several tasks across image and natural-language domains.

thanks

Paper author

Thank you! It's my pleasure to contribute.

