(ICML 2025 Poster) SAE-V: Interpreting Multimodal Models for Enhanced Alignment
This repository contains the SAE-V models for our ICML 2025 poster paper "SAE-V: Interpreting Multimodal Models for Enhanced Alignment", including two sparse autoencoders (SAEs) and three sparse autoencoders with vision (SAE-V). See each model folder and the source code for more information.
1. Training Parameters
The training parameters for all five models are listed below:
| Hyper-parameters | SAE and SAE-V of LLaVA-NeXT/Mistral | SAE and SAE-V of Chameleon/Anole |
|---|---|---|
| **Training parameters** | | |
| total training steps | 30000 | 30000 |
| batch size | 4096 | 4096 |
| LR | 5e-5 | 5e-5 |
| LR warmup steps | 1500 | 1500 |
| LR decay steps | 6000 | 6000 |
| adam beta1 | 0.9 | 0.9 |
| adam beta2 | 0.999 | 0.999 |
| LR scheduler name | constant | constant |
| LR coefficient | 5 | 5 |
| seed | 42 | 42 |
| dtype | float32 | float32 |
| buffer batches num | 32 | 64 |
| store batch size prompts | 4 | 16 |
| feature sampling window | 1000 | 1000 |
| dead feature window | 1000 | 1000 |
| dead feature threshold | 1e-4 | 1e-4 |
| **Model parameters** | | |
| hook layer | 16 | 8 |
| input dimension | 4096 | 4096 |
| expansion factor | 16 | 32 |
| feature number | 65536 | 131072 |
| context size | 4096 | 2048 |
The differences in training parameters arise because the LLaVA-NeXT-7B model requires more GPU memory to handle vision input, so fewer batches can be cached in the activation buffer. For the model parameters, we set different hook layers and context sizes to match the distinct architectures of the two models. We also experimented with different feature numbers on both models, but found that only around 30,000 features are actually activated during training. All training runs were conducted until convergence on 8xA800 GPUs. We verified that these parameter variations did not affect the experimental results.
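For concreteness, the LLaVA-NeXT/Mistral column of the table can be written out as a runner configuration. The sketch below is illustrative only: it assumes SAELens-V mirrors the upstream SAELens `LanguageModelSAERunnerConfig`, so the class and field names are assumptions and may differ in SAELens-V.

```python
# Illustrative sketch only: maps the LLaVA-NeXT/Mistral column of the table
# onto a SAELens-style runner config. Class and field names follow upstream
# SAELens and are assumptions about the SAELens-V fork.
from saev_lens import LanguageModelSAERunnerConfig

cfg = LanguageModelSAERunnerConfig(
    # Model parameters
    hook_layer=16,                   # residual-stream layer to hook
    d_in=4096,                       # input dimension
    expansion_factor=16,             # 4096 * 16 = 65536 features
    context_size=4096,
    # Training parameters
    training_tokens=30_000 * 4_096,  # total training steps * batch size
    train_batch_size_tokens=4096,
    lr=5e-5,
    lr_scheduler_name="constant",
    lr_warm_up_steps=1500,
    lr_decay_steps=6000,
    adam_beta1=0.9,
    adam_beta2=0.999,
    l1_coefficient=5,                # assumed reading of the table's "LR coefficient"
    seed=42,
    dtype="float32",
    n_batches_in_buffer=32,          # "buffer batches num" in the table
    store_batch_size_prompts=4,
    feature_sampling_window=1000,
    dead_feature_window=1000,
    dead_feature_threshold=1e-4,
)
```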
2. Quickstart
The SAE and SAE-V models are developed based on SAELens-V. A loading example is as follows:
```python
from saev_lens import SAE

sae = SAE.load_from_pretrained(
    path="./SAEV_LLaVA_NeXT-7b_OBELICS",
    device="cuda:0",
)
```
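Once loaded, the SAE-V can be applied to cached model activations. The snippet below is a minimal sketch assuming SAELens-V keeps SAELens's `encode`/`decode` interface; the random tensor is a stand-in for real residual-stream activations from the hooked multimodal model.

```python
import torch

# Stand-in for a batch of residual-stream activations at the hook layer
# (hidden size 4096 for LLaVA-NeXT/Mistral); in practice these come from
# running the hooked multimodal model.
activations = torch.randn(8, 4096, device="cuda:0")

feature_acts = sae.encode(activations)     # sparse feature activations
reconstruction = sae.decode(feature_acts)  # reconstructed activations

# Reconstruction error is a quick sanity check that the SAE fits the model.
mse = torch.nn.functional.mse_loss(reconstruction, activations)
print(feature_acts.shape, mse.item())
```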
More usage tutorials are presented in SAELens-V.