arxiv:2503.02495

Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer

Published on Mar 4
· Submitted by yjyangwork on Mar 7

Abstract

Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well-suited for large-scale applications. However, experts in existing MoE paradigms work as individuals, lacking high-quality expert interactions. Moreover, MoE has not been effectively extended to attention blocks, which constrains further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes a transformer into an equivalent group of experts and then implements dynamic routing over input data and experts. Our approach advances MoE design with four key innovations: (1) We conduct equivalent expert decomposition on both MLP blocks and attention blocks, based on the matrix partition used in tensor parallelism. (2) We develop two routing paradigms, patch-wise data selection and expert selection, to apply routing at different levels. (3) We design the architecture of the UoE model, including Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). (4) We develop a parallel implementation of UoE's routing and computation operations, and optimize efficiency based on hardware processing analysis. The experiments demonstrate that models employing UoE surpass Full Attention, state-of-the-art MoEs, and efficient transformers in several tasks across image and natural-language domains. The source code is available at https://github.com/YujiaoYang-work/UoE.
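To make innovation (1) concrete, here is a minimal PyTorch sketch of the decomposition idea under my own simplifying assumptions (an illustration, not the authors' released code): the MLP's first weight matrix is partitioned column-wise and the second row-wise, as in tensor parallelism, so the sum of all experts' partial outputs reproduces the dense MLP exactly; routing then only decides which partial outputs to compute.

```python
import torch
import torch.nn as nn

# Illustration only (my assumption, not the authors' code): equivalently
# decompose a dense MLP into n_experts by partitioning W1 column-wise and
# W2 row-wise, tensor-parallelism style. Summing every expert's partial
# output reproduces the dense MLP exactly.

d_model, d_ff, n_experts = 64, 256, 4
torch.manual_seed(0)

dense = nn.Sequential(
    nn.Linear(d_model, d_ff, bias=False),
    nn.ReLU(),
    nn.Linear(d_ff, d_model, bias=False),
)
W1, W2 = dense[0].weight, dense[2].weight   # (d_ff, d_model), (d_model, d_ff)
chunk = d_ff // n_experts

def expert_partial(x, i):
    """Partial output of expert i, which owns one slice of the hidden dimension."""
    sl = slice(i * chunk, (i + 1) * chunk)
    h = torch.relu(x @ W1[sl].T)            # (tokens, chunk)
    return h @ W2[:, sl].T                  # (tokens, d_model)

x = torch.randn(8, d_model)
full_out = dense(x)
union_out = sum(expert_partial(x, i) for i in range(n_experts))
print(torch.allclose(full_out, union_out, atol=1e-5))   # True: decomposition is exact
```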

Community

Paper author · Paper submitter

[Figure: model_architecture.png — the UoE model architecture]

This paper proposes a new method: Union-of-Experts (UoE). Compared to existing MoE methods, it builds expert groups by equivalently decomposing a whole model rather than combining multiple individual models. This approach allows the experts to operate as a larger whole instead of a mixture of individuals, which fully leverages the scale effect of the model.

The architecture of the UoE model includes Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). SMHA shares some similarities with the NSA introduced by DeepSeek half a month ago and the MoBA from Moonshot.AI, even though it was developed independently by the author over the course of a year.
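For intuition about the patch-wise data selection in SMHA, below is a rough single-head sketch under my own assumptions (the patch-summary scoring, patch size, and top-k values are illustrative choices; multi-head structure and causal masking are omitted): keys and values are grouped into patches, each query scores the patches with a cheap summary, and attention is computed only over the selected patches.

```python
import torch
import torch.nn.functional as F

# Rough sketch (my assumption, not the paper's exact SMHA): patch-wise data
# selection for one attention head. Keys/values are grouped into patches, each
# query scores the patches via their mean key, and attends only over the top-k
# selected patches instead of the full sequence.

def selective_attention(q, k, v, patch_size=16, top_k=4):
    T, d = k.shape
    n_patches = T // patch_size
    k_patches = k[: n_patches * patch_size].view(n_patches, patch_size, d)
    v_patches = v[: n_patches * patch_size].view(n_patches, patch_size, d)

    # Route each query to patches using the patch-mean key as a summary.
    patch_summary = k_patches.mean(dim=1)                  # (n_patches, d)
    scores = q @ patch_summary.T                           # (Tq, n_patches)
    top = scores.topk(top_k, dim=-1).indices               # (Tq, top_k)

    out = torch.zeros(q.shape[0], d)
    for i in range(q.shape[0]):
        sel_k = k_patches[top[i]].reshape(-1, d)           # (top_k * patch_size, d)
        sel_v = v_patches[top[i]].reshape(-1, d)
        attn = F.softmax(q[i] @ sel_k.T / d ** 0.5, dim=-1)
        out[i] = attn @ sel_v
    return out

q, k, v = torch.randn(8, 32), torch.randn(128, 32), torch.randn(128, 32)
print(selective_attention(q, k, v).shape)                  # torch.Size([8, 32])
```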

UoME, on the other hand, is a novel architecture that not only inherits the multi-expert and selective-routing paradigms of existing MoE models but also enables the activated experts to function as a cohesive whole, similar to an MLP of the same scale.
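Here is a tiny hypothetical UoME-style layer built on the same decomposition idea (again my own sketch, not the paper's implementation; the router design and top-k value are assumptions): because each expert is a slice of one shared weight matrix, summing the selected experts' partial outputs yields a coherent sub-network rather than a mixture of independent MLPs.

```python
import torch
import torch.nn as nn

# Hypothetical UoME-style layer (illustration, not the paper's released code):
# experts are slices of one shared MLP, a router picks top_k experts per token,
# and the selected partial outputs are summed into a single cohesive output.

class ToyUoME(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=4, top_k=2):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(d_ff, d_model) * 0.02)
        self.W2 = nn.Parameter(torch.randn(d_model, d_ff) * 0.02)
        self.router = nn.Linear(d_model, n_experts)
        self.n_experts, self.top_k = n_experts, top_k
        self.chunk = d_ff // n_experts

    def forward(self, x):                                          # x: (tokens, d_model)
        top_idx = self.router(x).topk(self.top_k, dim=-1).indices  # (tokens, top_k)
        out = torch.zeros_like(x)
        for e in range(self.n_experts):
            hit = (top_idx == e).any(dim=-1)                       # tokens routed to expert e
            if not hit.any():
                continue
            sl = slice(e * self.chunk, (e + 1) * self.chunk)
            h = torch.relu(x[hit] @ self.W1[sl].T)                 # expert e's hidden slice
            out[hit] = out[hit] + h @ self.W2[:, sl].T             # sum of partial outputs
        return out

print(ToyUoME()(torch.randn(8, 64)).shape)                         # torch.Size([8, 64])
```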

The benefits of applying equivalent decomposition and routing to a complete Transformer model are quite evident. The experiments demonstrate that the UoE model surpasses Full Attention, state-of-the-art MoEs, and efficient transformers (including the architecture of the recently proposed DeepSeek-V3) in several tasks across image and natural-language domains.

thanks

Paper author

Thank you! It's my pleasure to contribute.

