arxiv:2506.21103

Learning to Skip the Middle Layers of Transformers

Published on Jun 26

Submitted by tim-lawson on Jun 27


Authors:

Tim Lawson ,

Abstract

A novel conditional computation architecture for Transformers dynamically skips middle layers based on input and a gating mechanism, but does not outperform dense baselines in reducing computational cost or improving validation performance.

AI-generated summary

Conditional computation is a popular strategy to make Transformers more efficient. Existing methods often target individual modules (e.g., mixture-of-experts layers) or skip layers independently of one another. However, interpretability research has demonstrated that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Guided by these insights, we propose a novel architecture that dynamically skips a variable number of layers from the middle outward. In particular, a learned gating mechanism determines whether to bypass a symmetric span of central blocks based on the input, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a 'sandwich' or 'peri-layernorm' scheme and gate sparsity with an adaptive regularization loss. We had aimed to reduce compute requirements for 'simpler' tokens and potentially foster an emergent multi-level representational hierarchy but, at the scales investigated, our approach does not achieve improvements in the trade-off between validation cross-entropy and estimated FLOPs compared to dense baselines with fewer layers. We release our code at https://github.com/tim-lawson/skip-middle.
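To make the skipping pattern in the abstract concrete, here is a minimal sketch in plain Python. It is not taken from the released code and rests on several assumptions: each block in the first half of the network emits a hypothetical scalar gate in [0, 1] per token, a token starts skipping once its running gate sum reaches 1, and the gated attention is simplified to a hard boolean mask. The function names `active_blocks`, `attention_mask`, and `executed_fraction` are illustrative only.

```python
from typing import List


def active_blocks(gates: List[float], num_layers: int) -> List[bool]:
    """Return which of the num_layers blocks one token executes.

    gates[l] is an assumed scalar in [0, 1] emitted after block l of the
    first half. Once the running sum reaches 1 at block l, the token skips
    every block strictly between l and its mirror, num_layers - 1 - l.
    """
    assert len(gates) <= num_layers // 2
    active = [True] * num_layers
    total = 0.0
    for l, g in enumerate(gates):
        total += g
        if total >= 1.0:
            # Skip the symmetric span of central blocks.
            for k in range(l + 1, num_layers - 1 - l):
                active[k] = False
            break
    return active


def attention_mask(active_at_block: List[bool]) -> List[List[bool]]:
    """Causal mask for one block (hard simplification of gated attention).

    Query i may attend to key j only if j <= i and token j was processed
    by this block. Rows for skipped queries are unused.
    """
    n = len(active_at_block)
    return [[j <= i and active_at_block[j] for j in range(n)] for i in range(n)]


def executed_fraction(per_token_active: List[List[bool]]) -> float:
    """Fraction of (token, block) pairs executed: a crude compute proxy."""
    total = sum(len(a) for a in per_token_active)
    return sum(sum(a) for a in per_token_active) / total


# Three tokens in a 6-block model with made-up gate values.
token_gates = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.0], [1.0, 0.0, 0.0]]
schedule = [active_blocks(g, num_layers=6) for g in token_gates]
mask_block_2 = attention_mask([tok[2] for tok in schedule])
```

In this toy run the first token executes every block, the second skips the two central blocks, and the third skips the four central blocks, so later tokens can only attend to the first token inside block 2. The executed fraction is a rough stand-in for the estimated FLOPs that the abstract compares against dense baselines with fewer layers.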


Community

tim-lawson (paper author, paper submitter), about 12 hours ago

We explore a novel gated Transformer architecture that dynamically skips layers from the middle outward. It is motivated by interpretability research showing that the middle layers are more often redundant, and by growing interest in hierarchical models (e.g., byte-level) and block-level sparsity (e.g., mixture-of-depths).

