arxiv:2510.15262

Robust Layerwise Scaling Rules by Proper Weight Decay Tuning

Published on Oct 17 · Submitted by Yifeng Liu on Oct 20

AI-generated summary

A new weight-decay scaling rule for AdamW is introduced to preserve sublayer gain across widths in modern scale-invariant architectures, enabling zero-shot transfer of learning rate and weight decay.

Abstract

Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization (μP) enables learning-rate transfer across widths by equalizing early-time update magnitudes. However, in modern scale-invariant architectures, training quickly enters an optimizer-governed steady state where normalization layers create backward scale sensitivity and the effective learning rate becomes width dependent, degrading μP transfer. We address this by introducing a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Empirically, the singular-value spectrum of each matrix parameter scales in norm as \sqrt{η/λ} with an approximately invariant shape; under width scaling d, we observe that the top singular value scales approximately as \sqrt{η/λ}·d^{0.75}. Combining this observation with the μP learning-rate rule η_2 ∝ d^{-1} for matrix-like parameters implies an empirical weight-decay scaling rule λ_2 ∝ \sqrt{d} that approximately keeps sublayer gains width invariant. Together with vector-like parameters trained at η_1 = Θ_d(1) and λ_1 = 0, this yields zero-shot transfer of both learning rate and weight decay from proxy to target widths, removing per-width sweeps. We validate the rule on LLaMA-style Transformers and in a minimal synthetic setting, and we provide a simple diagnostic, matching top singular values, to check sublayer-gain invariance. Our results extend μP beyond the near-init regime by explicitly controlling steady-state scales set by the optimizer, offering a practical recipe for width-robust hyperparameter transfer under AdamW.
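
The recipe can be written down directly as AdamW parameter groups. Below is a minimal PyTorch sketch of that setup under the stated rules; the proxy and target widths, the proxy learning rate and weight decay, and the toy model are hypothetical placeholders, not the authors' configuration.

# Minimal sketch (not the authors' released code): zero-shot transfer of AdamW
# hyperparameters from a proxy width to a target width, following the scaling
# rules stated in the abstract. The widths and the values eta2_proxy,
# lambda2_proxy, and eta1 below are hypothetical placeholders.
import torch
import torch.nn as nn

d_proxy, d_target = 256, 2048          # proxy and target widths (assumed)
eta2_proxy, lambda2_proxy = 3e-3, 0.1  # tuned once at the proxy width (assumed)
eta1 = 3e-3                            # vector-like learning rate, width independent

# Matrix-like parameters: eta_2 proportional to 1/d (muP),
# lambda_2 proportional to sqrt(d) (this paper's rule).
eta2_target = eta2_proxy * (d_proxy / d_target)
lambda2_target = lambda2_proxy * (d_target / d_proxy) ** 0.5

# Toy stand-in for one Transformer sublayer at the target width.
model = nn.Sequential(
    nn.Linear(d_target, 4 * d_target),
    nn.GELU(),
    nn.Linear(4 * d_target, d_target),
    nn.LayerNorm(d_target),
)

# Matrix-like parameters (ndim >= 2) get the scaled (lr, weight decay);
# vector-like parameters (biases, norm gains) get eta_1 = Theta_d(1), lambda_1 = 0.
matrix_params = [p for p in model.parameters() if p.ndim >= 2]
vector_params = [p for p in model.parameters() if p.ndim < 2]

optimizer = torch.optim.AdamW([
    {"params": matrix_params, "lr": eta2_target, "weight_decay": lambda2_target},
    {"params": vector_params, "lr": eta1, "weight_decay": 0.0},
])

A companion check, following the diagnostic described in the abstract, would be to compare the top singular values of corresponding matrix parameters at the proxy and target widths (e.g. via torch.linalg.svdvals) once training has settled.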

Community

Paper author · Paper submitter

Our findings point to a potentially novel physics law of μP under the AdamW optimizer: empirically, the singular-value spectrum of each matrix parameter scales in norm as \sqrt{η/λ} with an approximately invariant shape, and under width scaling d the top singular value scales approximately as \sqrt{η/λ}·d^{0.75}. Combining this observation with the μP learning-rate rule that η_2 is proportional to 1/d for matrix-like parameters implies an empirical weight-decay scaling rule λ_2 \propto \sqrt{d} that approximately keeps sublayer gains width invariant. Together with vector-like parameters trained at η_1 = Θ_d(1) and λ_1 = 0, this yields zero-shot transfer of both learning rate and weight decay from proxy to target widths, removing per-width sweeps.
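
For readers who want the intermediate step spelled out, here is a short derivation consistent with the formulas quoted above (a sketch of the reasoning, not copied from the paper), showing why the observed spectrum scaling plus the μP learning-rate rule forces λ_2 ∝ \sqrt{d}:

\begin{align*}
\sigma_{\max}(W) &\approx C\,\sqrt{\frac{\eta_2}{\lambda_2}}\; d^{0.75}
\qquad\text{(empirical steady-state spectrum scaling)}\\
\eta_2 &\propto d^{-1}
\qquad\text{($\mu$P learning-rate rule for matrix-like parameters)}\\
\Longrightarrow\quad
\sigma_{\max}(W) &\propto \sqrt{\frac{d^{-1}}{\lambda_2}}\; d^{0.75}
= \lambda_2^{-1/2}\, d^{1/4}.
\end{align*}
Keeping the sublayer gain, and hence $\sigma_{\max}(W)$, width invariant requires
$\lambda_2^{-1/2}\, d^{1/4} = \Theta_d(1)$, i.e. $\lambda_2 \propto \sqrt{d}$.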

