(ICML 2025 Poster) SAE-V: Interpreting Multimodal Models for Enhanced Alignment
This repository contains the SAE-V models for our ICML 2025 poster paper "SAE-V: Interpreting Multimodal Models for Enhanced Alignment", including two sparse autoencoders (SAEs) and three sparse autoencoders with vision (SAE-V). See each model folder and the source code for more information.
1. Training Parameters
The training parameters for all five models are listed below:
| Hyper-parameters | SAE and SAE-V of LLaVA-NeXT/Mistral | SAE and SAE-V of Chameleon/Anole |
|---|---|---|
| **Training parameters** | | |
| total training steps | 30000 | 30000 |
| batch size | 4096 | 4096 |
| LR | 5e-5 | 5e-5 |
| LR warmup steps | 1500 | 1500 |
| LR decay steps | 6000 | 6000 |
| adam beta1 | 0.9 | 0.9 |
| adam beta2 | 0.999 | 0.999 |
| LR scheduler name | constant | constant |
| LR coefficient | 5 | 5 |
| seed | 42 | 42 |
| dtype | float32 | float32 |
| buffer batches num | 32 | 64 |
| store batch size prompts | 4 | 16 |
| feature sampling window | 1000 | 1000 |
| dead feature window | 1000 | 1000 |
| dead feature threshold | 1e-4 | 1e-4 |
| **Model parameters** | | |
| hook layer | 16 | 8 |
| input dimension | 4096 | 4096 |
| expansion factor | 16 | 32 |
| feature number | 65536 | 131072 |
| context size | 4096 | 2048 |
The differences in training parameters arise because the LLaVA-NeXT-7B model requires more GPU memory to handle vision input, so fewer batches can be cached in the activation buffer. For the model parameters, we set different hook layers and context sizes to match the distinct architectures of the two models. We also experimented with different feature numbers on both models, but found that only around 30,000 features are actually activated during training. All training runs were conducted until convergence on 8xA800 GPUs. We verified that these parameter variations did not affect the experimental results.
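For concreteness, the LLaVA-NeXT/Mistral column of the table can be written out as a runner configuration. The sketch below is illustrative only: it assumes SAELens-V mirrors the upstream SAELens `LanguageModelSAERunnerConfig`, so the class and field names are assumptions and may differ in SAELens-V.

```python
# Illustrative sketch only: maps the LLaVA-NeXT/Mistral column of the table
# onto a SAELens-style runner config. Class and field names follow upstream
# SAELens and are assumptions about the SAELens-V fork.
from saev_lens import LanguageModelSAERunnerConfig

cfg = LanguageModelSAERunnerConfig(
    # Model parameters
    hook_layer=16,                   # residual-stream layer to hook
    d_in=4096,                       # input dimension
    expansion_factor=16,             # 4096 * 16 = 65536 features
    context_size=4096,
    # Training parameters
    training_tokens=30_000 * 4_096,  # total training steps * batch size
    train_batch_size_tokens=4096,
    lr=5e-5,
    lr_scheduler_name="constant",
    lr_warm_up_steps=1500,
    lr_decay_steps=6000,
    adam_beta1=0.9,
    adam_beta2=0.999,
    l1_coefficient=5,                # assumed reading of the table's "LR coefficient"
    seed=42,
    dtype="float32",
    n_batches_in_buffer=32,          # "buffer batches num" in the table
    store_batch_size_prompts=4,
    feature_sampling_window=1000,
    dead_feature_window=1000,
    dead_feature_threshold=1e-4,
)
```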
2. Quickstart
The SAE and SAE-V models are developed based on SAELens-V. A loading example is as follows:
```python
from saev_lens import SAE

sae = SAE.load_from_pretrained(
    path="./SAEV_LLaVA_NeXT-7b_OBELICS",
    device="cuda:0",
)
```
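Once loaded, the SAE-V can be applied to cached model activations. The snippet below is a minimal sketch assuming SAELens-V keeps SAELens's `encode`/`decode` interface; the random tensor is a stand-in for real residual-stream activations from the hooked multimodal model.

```python
import torch

# Stand-in for a batch of residual-stream activations at the hook layer
# (hidden size 4096 for LLaVA-NeXT/Mistral); in practice these come from
# running the hooked multimodal model.
activations = torch.randn(8, 4096, device="cuda:0")

feature_acts = sae.encode(activations)     # sparse feature activations
reconstruction = sae.decode(feature_acts)  # reconstructed activations

# Reconstruction error is a quick sanity check that the SAE fits the model.
mse = torch.nn.functional.mse_loss(reconstruction, activations)
print(feature_acts.shape, mse.item())
```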
More usage tutorials are presented in SAELens-V.