arxiv:2506.19382

Measuring and Guiding Monosemanticity

Published on Jun 24
Authors: Ruben Härle, Felix Friedrich, Manuel Brack, Stephan Wäldchen, Björn Deiseroth, Patrick Schramowski, Kristian Kersting

Abstract

Guided Sparse Autoencoders improve feature monosemanticity and control in large language models by conditioning latent representations on labeled concepts.

AI-generated summary

There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce the Feature Monosemanticity Score (FMS), a novel metric for feature monosemanticity in latent representations. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, behavior detection, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs.

Community

Paper author

TL;DR — We propose FMS (a metric for whether one latent ≈ one concept) and G-SAE (a lightly supervised SAE that dedicates a tiny set of latents to labeled concepts). This roughly doubles monosemanticity vs. vanilla SAEs and makes detection and steering trivial, without hurting fluency.

What’s new

  • FMS: scores monosemanticity by combining capacity (best single feature) with local/global disentanglement (drop when removed; diminishing returns when adding features); a minimal sketch follows this list.
  • G-SAE: adds a small conditioning loss so specific latent indices align with known concepts; the corresponding decoder columns act as steering vectors.
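
For intuition, here is a minimal Python sketch of an FMS-style computation. It is illustrative only, not the paper's reference implementation (linked under Resources): it scores single latents with AUROC rather than the DecisionTree/SVM probes the authors ship, approximates the local term by dropping the top column instead of retraining, and aggregates the three terms with a plain average. The names fms_at_p, feature_scores, latents, and labels are ours.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    def feature_scores(latents, labels):
        # AUROC of each single latent as a detector for the binary concept.
        return np.array([roc_auc_score(labels, latents[:, j])
                         for j in range(latents.shape[1])])

    def fms_at_p(latents, labels, p=1):
        # latents: (n_samples, n_features) SAE activations; labels in {0, 1}.
        scores = feature_scores(latents, labels)
        order = np.argsort(scores)[::-1]   # best features first

        capacity = scores[order[0]]        # how good is the single best feature?

        # Local disentanglement: how much is lost once the top feature is
        # gone (the paper retrains the SAE; here we simply drop the column).
        rest = np.delete(latents, order[0], axis=1)
        local_drop = capacity - feature_scores(rest, labels).max()

        # Global disentanglement: marginal gain from adding the next p
        # features on top of the best one, probed with logistic regression.
        def probe_acc(k):
            clf = LogisticRegression(max_iter=1000)
            clf.fit(latents[:, order[:k]], labels)
            return clf.score(latents[:, order[:k]], labels)
        global_gain = probe_acc(1 + p) - probe_acc(1)

        # One possible aggregation: monosemantic = high capacity, large drop
        # on removal, little gain from extra features.
        return (capacity + local_drop + (1.0 - global_gain)) / 3.0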

Key results (Llama-3-8B-base; toxicity, style, privacy)

  • FMS@1: 0.27 → 0.52 overall; up to 0.62 on privacy.
  • Single-feature accuracy: best guided feature 0.86; vanilla needs ~41 features to match.
  • Steering success (vanilla → G-SAE):
    • Toxicity: 0.95 → 0.98
    • Shakespeare: 0.64 → 0.72
    • Mixed (toxicity + Shakespeare): 0.80 → 0.82
    • Privacy (multi-concept): 0.47 → 0.53

Why it matters

  • More monosemantic, controllable features → safer, targeted interventions in LLMs.
  • Works post-hoc on pretrained SAEs; supervision footprint is small and concept-specific.

Method in 5 lines

  • Train an SAE on internal activations (we hook blocks 3 and 11 of Llama-3-8B-base; width ≈ hidden size; Top-K ≈ 2048, ~9% sparsity).
  • G-SAE: reserve a tiny latent block for labeled concepts; apply a BCE loss on those indices during training (see the sketch after this list).
  • Detect by reading the reserved indices; steer by adding the concept’s decoder column to the residual stream with scale α.
  • FMS: compute capacity; retrain without the top feature for the local score; measure marginal gains when adding features for the global score.
  • Aggregate into FMS@p to compare models/tasks.
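
Below is an illustrative PyTorch sketch of this recipe, under our own assumptions rather than the authors' released code: a ReLU-plus-Top-K SAE, the first n_concepts latent indices reserved for labeled concepts, the BCE conditioning loss attached to the pre-sparsity encoder outputs (where exactly it attaches is our guess), and steering done by adding a scaled decoder column to the residual stream. The names GSAE, gsae_loss, lam, and detect are ours.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GSAE(nn.Module):
        def __init__(self, d_model, d_latent, n_concepts, k):
            super().__init__()
            self.enc = nn.Linear(d_model, d_latent)
            self.dec = nn.Linear(d_latent, d_model)
            self.n_concepts, self.k = n_concepts, k

        def forward(self, x):
            pre = self.enc(x)                  # pre-activation latents
            z = F.relu(pre)
            # Top-K sparsity: keep only the k largest latents per example.
            top = torch.topk(z, self.k, dim=-1)
            z = torch.zeros_like(z).scatter(-1, top.indices, top.values)
            return self.dec(z), z, pre

    def gsae_loss(model, x, concept_labels, lam=1.0):
        # concept_labels: (batch, n_concepts) binary indicators.
        x_hat, z, pre = model(x)
        recon = F.mse_loss(x_hat, x)
        # Conditioning loss: push reserved latent i to fire iff concept i
        # is present, so each concept lands on a known index.
        cond = F.binary_cross_entropy_with_logits(
            pre[:, :model.n_concepts], concept_labels.float())
        return recon + lam * cond

    def detect(model, x, threshold=0.0):
        # Detection = reading the reserved indices.
        return model(x)[2][:, :model.n_concepts] > threshold

    def steer(model, resid, concept_idx, alpha):
        # The concept's decoder column doubles as a steering vector.
        v = model.dec.weight[:, concept_idx]   # shape: (d_model,)
        return resid + alpha * v

In this sketch, positive alpha amplifies the concept and negative alpha suppresses it; alpha plays the role of the scale α mentioned above.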

Resources

  • 📄 Paper
  • 🌐 Project page
    • 🧪 Minimal FMS code & examples (DecisionTree/SVM + reference): see “Downloads” on the project page.
  • 💻 Code

Big news!! We were accepted to NeurIPS 2025 as a spotlight! 🎉🎉

@inproceedings{harle2025monosemanticity,
  title     = {Measuring and Guiding Monosemanticity},
  author    = {Ruben H{\"a}rle and Felix Friedrich and Manuel Brack and Stephan W{\"a}ldchen and Bj{\"o}rn Deiseroth and Patrick Schramowski and Kristian Kersting},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  note      = {Spotlight}
}
