Abstract
MoE architectures exhibit language-specific routing in early and late layers but cross-lingual alignment in middle layers, which can be enhanced to improve multilingual performance.
Mixture-of-Experts (MoE) architectures have become the key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model's performance in a given language and how similarly its tokens are routed to English in these layers. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and it successfully increases multilingual performance. These 1-2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, especially given that these simple interventions override routers of extensively trained, state-of-the-art LLMs. In comparison, interventions outside of the middle layers or targeting multilingual-specialized experts only yield performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model's ability to leverage language-universal experts in all languages.
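Below is a minimal sketch of the kind of inference-time router steering the abstract describes, assuming a standard top-k softmax MoE router. The expert indices, bias strength, and layer choice are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of middle-layer router steering, assuming a standard
# top-k softmax MoE router. Expert indices, the bias value, and which
# layers to intervene on are illustrative placeholders.
import torch

def steer_router_logits(
    router_logits: torch.Tensor,   # [num_tokens, num_experts] pre-softmax router scores
    promoted_experts: list[int],   # experts frequently activated on English task data
    bias: float = 1.0,             # strength of the intervention
) -> torch.Tensor:
    """Add a fixed bias to the logits of promoted experts before top-k selection."""
    steered = router_logits.clone()
    steered[:, promoted_experts] += bias
    return steered

def route(router_logits: torch.Tensor, promoted_experts: list[int], top_k: int = 2):
    """Return top-k expert indices and renormalized gate weights after steering."""
    steered = steer_router_logits(router_logits, promoted_experts)
    topk_vals, topk_idx = steered.topk(top_k, dim=-1)
    gates = torch.softmax(topk_vals, dim=-1)  # renormalize over the selected experts
    return topk_idx, gates

# Example: 4 tokens routed over 8 experts, promoting experts 2 and 5
# (as one might do only in the middle decoder layers).
logits = torch.randn(4, 8)
idx, gates = route(logits, promoted_experts=[2, 5], top_k=2)
print(idx, gates)
```

In this sketch the intervention happens purely at inference time: the router's learned logits are kept, and a constant bias simply makes the promoted experts more likely to land in the top-k set.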
Community
Multilingual Routing in Mixture-of-Experts LLMs:
This paper presents the first in-depth analysis of how MoE LLMs route multilingual text and reports a number of clear, interpretable patterns and visualizations. For example, cross-lingual routing alignment is much higher in the middle layers, and a language's routing alignment with English strongly correlates with performance in that language (a toy sketch of one such alignment measure follows after this comment).
These findings then motivate a simple, inference-time router-steering method that yields surprisingly consistent multilingual improvements.
TL;DR: better cross-lingual alignment, better multilingual performance.
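For intuition, here is a toy sketch of how cross-lingual routing alignment could be quantified from parallel sentences. The Jaccard-style overlap of per-layer expert usage is an assumed, illustrative metric, not necessarily the paper's exact definition.

```python
# Toy sketch: per-layer overlap between the expert-usage histograms of a
# sentence and its translation. The overlap measure here is an illustrative
# choice, not necessarily the paper's exact alignment metric.
from collections import Counter

def expert_histogram(expert_ids_per_token: list[list[int]]) -> Counter:
    """Count how often each expert is selected across a sentence's tokens."""
    counts = Counter()
    for token_experts in expert_ids_per_token:
        counts.update(token_experts)
    return counts

def routing_overlap(experts_lang_a: list[list[int]], experts_lang_b: list[list[int]]) -> float:
    """Normalized overlap of expert usage between two parallel sentences (0 to 1)."""
    a, b = expert_histogram(experts_lang_a), expert_histogram(experts_lang_b)
    shared = sum(min(a[e], b[e]) for e in a.keys() & b.keys())
    total = max(sum(a.values()), sum(b.values()))
    return shared / total if total else 0.0

# Example: top-2 expert choices per token at one layer for an English sentence
# and its translation; higher overlap ~ more aligned routing at that layer.
en = [[2, 5], [2, 7], [5, 1]]
de = [[2, 5], [5, 7], [3, 1]]
print(routing_overlap(en, de))
```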
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages (2025)
- AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment (2025)
- MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder and LLM Fusion (2025)
- Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms (2025)
- LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts (2025)
- Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models (2025)
- Milco: Learned Sparse Retrieval Across Languages via a Multilingual Connector (2025)