Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders
Abstract
A new statistical framework and a training algorithm, Group Bias Adaptation, enhance Sparse Autoencoders for recovering monosemantic features in Large Language Models, offering theoretical recovery guarantees and superior empirical performance.
We study the challenge of achieving theoretically grounded feature recovery using Sparse Autoencoders (SAEs) for the interpretation of Large Language Models. Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations such as hyperparameter sensitivity and instability. To address these issues, we first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability by modeling polysemantic features as sparse mixtures of underlying monosemantic concepts. Building on this framework, we introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity. We theoretically prove that this algorithm correctly recovers all monosemantic features when input data is sampled from our proposed statistical model. Furthermore, we develop an improved empirical variant, Group Bias Adaptation (GBA), and demonstrate its superior performance against benchmark methods when applied to LLMs with up to 1.5 billion parameters. This work represents a foundational step in demystifying SAE training by providing the first SAE algorithm with theoretical recovery guarantees, thereby advancing the development of more transparent and trustworthy AI systems through enhanced mechanistic interpretability.
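As a rough illustration of the statistical framework sketched in the abstract, the snippet below generates synthetic "activations" as sparse, non-negative mixtures of a few monosemantic concept directions. All dimensions, distributions, and variable names are placeholder assumptions for illustration, not the paper's exact generative model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder sizes: d-dimensional activations, m underlying monosemantic concepts,
# with each observed activation mixing only `sparsity` of them.
d, m, n_samples, sparsity = 64, 256, 10_000, 4

# Each monosemantic concept is a unit-norm direction in activation space.
concepts = rng.standard_normal((m, d))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

def sample_activation():
    """Draw one polysemantic activation: a sparse mixture of a few concepts."""
    active = rng.choice(m, size=sparsity, replace=False)  # which concepts fire
    coeffs = rng.uniform(0.5, 1.5, size=sparsity)         # their (positive) strengths
    return coeffs @ concepts[active]

X = np.stack([sample_activation() for _ in range(n_samples)])
print(X.shape)  # (10000, 64)
```

Under this kind of model, an SAE trained on `X` would aim to recover the rows of `concepts` (up to permutation and scaling) as its dictionary directions, which is the sense of feature recovery the framework formalizes.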
Community
Existing Sparse Autoencoder (SAE) training algorithms often lack rigorous mathematical guarantees for feature recovery. Empirically, methods such as L1 regularization and TopK activation are sensitive to hyperparameter tuning and can learn inconsistent features across runs. Our work addresses these theoretical and practical issues with the following contributions:
- A novel statistical framework that rigorously formalizes feature recovery by modeling polysemantic features as sparse combinations of underlying monosemantic concepts, and establishes a precise notion of feature identifiability.
- An innovative SAE training algorithm, Group Bias Adaptation (GBA), which adaptively adjusts neural network bias parameters to enforce the desired activation sparsity, allowing distinct groups of neurons to target different activation frequencies (see the sketch after this list).
- The first theoretical guarantee that an SAE training algorithm provably recovers all monosemantic features when the input data is sampled from our proposed statistical model.
- Superior empirical performance on LLMs with up to 1.5B parameters, where GBA achieves the best sparsity-loss trade-off while learning more consistent features than benchmark methods.
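To make the bias-adaptation idea concrete, here is a minimal, hypothetical PyTorch-style sketch of a single bias-update step: each neuron belongs to a group with a target activation frequency, and its encoder bias is nudged down if it fires too often and up if it fires too rarely. The function name, update rule, and step size are illustrative assumptions, not the authors' exact GBA procedure.

```python
import torch

def gba_bias_step(pre_acts, bias, group_ids, target_freqs, step=1e-3):
    """One illustrative bias-adaptation update (not the paper's exact rule).

    pre_acts:     (batch, n_neurons) encoder pre-activations before the bias
    bias:         (n_neurons,)       encoder biases, adapted in place
    group_ids:    (n_neurons,)       integer group index of each neuron
    target_freqs: (n_groups,)        desired activation frequency per group
    """
    with torch.no_grad():
        # Empirical firing rate of each neuron under a ReLU(pre_acts + bias) encoder.
        fired = (pre_acts + bias > 0).float().mean(dim=0)
        target = target_freqs[group_ids]
        # Fires too often -> push bias down; fires too rarely -> push it up.
        bias += step * torch.sign(target - fired)
    return bias

# Example: 512 neurons split into two groups targeting 1% and 10% activation frequency.
bias = torch.zeros(512)
group_ids = torch.arange(512) % 2
target_freqs = torch.tensor([0.01, 0.10])
pre_acts = torch.randn(1024, 512)
bias = gba_bias_step(pre_acts, bias, group_ids, target_freqs)
```

In a full training loop, a step like this would presumably be interleaved with the usual reconstruction-loss gradient updates on the encoder and decoder weights.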
The Librarian Bot found the following similar papers, recommended by the Semantic Scholar API:
- Train One Sparse Autoencoder Across Multiple Sparsity Budgets to Preserve Interpretability and Accuracy (2025)
- Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit (2025)
- Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs (2025)
- Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders (2025)
- Ensembling Sparse Autoencoders (2025)
- SplInterp: Improving our Understanding and Training of Sparse Autoencoders (2025)
- From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit (2025)