Papers
arxiv:2504.00754

Automated Feature Labeling with Token-Space Gradient Descent

Published on Apr 1
Authors:
,

Abstract

Token-space gradient descent optimizes feature labels using language models as discriminators, achieving interpretable single-token labels across various domains.

AI-generated summary

We present a novel approach to feature labeling using gradient descent in token-space. While existing methods typically use language models to generate hypotheses about feature meanings, our method directly optimizes label representations by using a language model as a discriminator to predict feature activations. We formulate this as a multi-objective optimization problem in token-space, balancing prediction accuracy, entropy minimization, and linguistic naturalness. Our proof-of-concept experiments demonstrate successful convergence to interpretable single-token labels across diverse domains, including features for detecting animals, mammals, Chinese text, and numbers. Although our current implementation is constrained to single-token labels and relatively simple features, the results suggest that token-space gradient descent could become a valuable addition to the interpretability researcher's toolkit.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2504.00754 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2504.00754 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2504.00754 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.