Model Card for MAIRA-2-SAE
This is a collection of sparse autoencoders (SAEs) trained on the residual stream of layer 15 of MAIRA-2, and described in the preprint 'Insights into a radiology-specialised multimodal large language model with sparse autoencoders', presented at the Actionable Interpretability Workshop @ ICML 2025.
In the preprint, we primarily study an SAE with expansion factor 4. Here we also release SAEs with expansion factors 2 and 8 to enable additional analyses. For expansion factors 2 and 4, we also provide LLM-generated interpretations of each feature and their corresponding interpretability scores.
Model Details
A sparse autoencoder is a model that provides two functions:
- Encoding some input (in this case, model activations) into a "latent space" (in this case, one which is higher dimensional than its input)
- Decoding from the "latent space" back into the input space
SAEs are trained such that only a small number of latent dimensions (which we call features) are active for any given input.
Specifically, these are Matryoshka BatchTopK SAEs, as described in Learning Multi-Level Features with Matryoshka Sparse Autoencoders. Importantly, the decoder is linear, so the SAE reconstructs model activations as a linear combination of (putatively) interpretable feature directions.
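As a minimal sketch of this structure (hypothetical random weights, and a simplified per-sample top-k in place of the BatchTopK used here), encoding and decoding look like:
import torch
# hypothetical sizes: activation_dim matches MAIRA-2's residual stream, expansion factor 2
activation_dim, dict_size, k = 4096, 8192, 256
W_enc = torch.randn(activation_dim, dict_size)
W_dec = torch.randn(dict_size, activation_dim)
b_dec = torch.zeros(activation_dim)
x = torch.randn(activation_dim)  # a single residual-stream activation
# encode: project into the latent space and keep only the k strongest features
pre = torch.relu((x - b_dec) @ W_enc)
topk = torch.topk(pre, k)
f = torch.zeros(dict_size).scatter_(0, topk.indices, topk.values)
# decode: reconstruct the activation as a linear combination of decoder directions
x_hat = f @ W_dec + b_dec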
Model Description
- Developed by: Microsoft Research Health Futures
- Model type: Autoencoder
- License: MIT
Uses
These SAEs are shared for research purposes only. Their intended use is interpretability analysis of MAIRA-2. Given MAIRA-2 and a data example (e.g. from MIMIC-CXR), one can retrieve the activation strength of all SAE features. This can be used to ascribe interpretations to SAE features, or to use such feature interpretations to analyse the workings of MAIRA-2.
Direct Use
Use of these SAEs requires access to MAIRA-2 - see the MAIRA-2 model card for details. Assuming one has extracted the residual stream from layer 15 of MAIRA-2, and processed the activations as described in the preprint, the SAE can be used to encode this representation into a higher-dimensional space more suitable for interpretation. We provide a usage example below.
Analyses specifically of the SAEs are also possible, for example by inspecting the learned dictionary elements (the decoder layer). In this case, the provided feature interpretations may be useful; however, we stress that only a subset of features has meaningful interpretations.
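For example, one could compare dictionary elements by the cosine similarity of their decoder directions. This sketch assumes ae is an SAE loaded as in the usage example below, and that the decoder weights are exposed as a linear layer (e.g. ae.decoder.weight; check the dictionary_learning source for the exact attribute name):
import torch.nn.functional as F
# decoder weight is assumed to have shape [activation_dim, dict_size]; each column is a feature direction
directions = F.normalize(ae.decoder.weight.detach().T, dim=-1)
feature_id = 123  # hypothetical feature index
similarities = directions @ directions[feature_id]  # cosine similarity to all other features
print(similarities.topk(5))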
Out-of-Scope Use
These SAEs were trained on MAIRA-2 activations collected from the MIMIC-CXR findings generation subset of the original MAIRA-2 training dataset. Hence, they may not perform well (in the sense of reconstruction quality) on other datasets or tasks, whether within MAIRA-2's training distribution (e.g. PadChest, PadChest-GR) or on datasets MAIRA-2 was not trained on. Any non-research use of these SAEs is out of scope.
Bias, Risks, and Limitations
As above, the SAEs were trained and interpreted using the MIMIC-CXR subset of the MAIRA-2 training data. MIMIC-CXR represents a cohort of patients from a single hospital in the USA. Inferences made about MAIRA-2 using these SAEs will necessarily be limited to concepts which could plausibly be discovered using MIMIC-CXR.
How to Get Started with the Model
Setup
Install dictionary_learning with either pip install dictionary-learning or uv add dictionary-learning.
We used dictionary_learning as a submodule at commit 07975f7, which is version 0.1.0.
Download weights from the hub
Option 1: Download a single SAE with specified expansion factor
from huggingface_hub import hf_hub_download
expansion_factor = 2
model_name = f"layer15_res_matryoshka_k256_ef{expansion_factor}.pt"
# Each expansion factor has its own subfolder
ef_subfolder = f"ef{expansion_factor}"
# Specify your own local download directory here if you want
local_dir = "./"
local_path = hf_hub_download(repo_id="microsoft/maira-2-sae", subfolder=ef_subfolder, filename=model_name, local_dir=local_dir)
Option 2: Download all SAEs
from huggingface_hub import snapshot_download
# Specify your own local download directory here if you want
local_dir = "./"
snapshot_download(repo_id="microsoft/maira-2-sae", local_dir=local_dir)
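Whichever option you use, the weight files follow the naming convention above (one subfolder per expansion factor), so a convenience snippet like the following can locate them:
from pathlib import Path
# e.g. ef2/layer15_res_matryoshka_k256_ef2.pt, ef4/..., ef8/...
sae_paths = sorted(Path(local_dir).glob("ef*/layer15_res_matryoshka_k256_ef*.pt"))
print(sae_paths)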
Use SAE to get activations
import torch
from dictionary_learning.trainers.matryoshka_batch_top_k import MatryoshkaBatchTopKSAE
# local_path is the path to the dictionary weights (.pt file), however you downloaded them
ae = MatryoshkaBatchTopKSAE.from_pretrained(local_path)
# get NN activations using your preferred method: hooks, transformer_lens, nnsight, etc. ...
# for now we'll just use random activations
activation_dim = 4096
activations = torch.randn(64, activation_dim)
features = ae.encode(activations) # get features from activations
reconstructed_activations = ae.decode(features)
# you can also just get the reconstruction ...
reconstructed_activations = ae(activations)
# ... or get the features and reconstruction at the same time
reconstructed_activations, features = ae(activations, output_features=True)
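As a quick sanity check of reconstruction quality on your own activations (an illustrative check, not a metric reported in the preprint), you can compute the fraction of variance left unexplained:
# lower is better; 0 means perfect reconstruction
residual = activations - reconstructed_activations
fvu = residual.pow(2).sum() / (activations - activations.mean(dim=0)).pow(2).sum()
print(f"fraction of variance unexplained: {fvu.item():.3f}")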
Training Details
Training Data
We collected activations from the residual stream of layer 15 of MAIRA-2 using the MIMIC-CXR subset of the MAIRA-2 training/validation set. As detailed in our preprint, we collected activations from all tokens in the sequence, excluding image tokens and boilerplate/templated subsequences. This resulted in 34.7M tokens for training and 1.7M for validation (respecting the splits used to train MAIRA-2). Following Gao et al., we scaled all token activations by a normalization factor of 22.34, the mean l2 norm of the training samples.
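As an illustration, applying this scaling at inference time could look like the sketch below, which assumes the activations are divided by the factor before encoding and rescaled after decoding (check the preprint for the exact preprocessing), with ae and activations as in the usage example above:
NORM_FACTOR = 22.34  # mean l2 norm of the training samples (assumed here to be used as a divisor)
scaled_activations = activations / NORM_FACTOR
features = ae.encode(scaled_activations)
reconstructed_activations = ae.decode(features) * NORM_FACTOR  # map back to the original scale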
Training Procedure
We trained the SAEs with the open-source dictionary_learning library, using the MatryoshkaBatchTopKTrainer.
Training Hyperparameters
- Matryoshka group fractions: [1/2, 1/4, 1/8, 1/16, 1/16]
- k (mean l0 per batch): 256
- Batch size: 8192
- Epochs: 1
- Expansion factors: 2, 4, 8 (multiple models)
Further hyperparameters are listed in the preprint.
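To make the group fractions concrete: they determine the sizes of the nested sub-dictionaries, each prefix of which is trained to reconstruct on its own. A small sketch for expansion factor 4 (the exact grouping used by the trainer may differ slightly):
activation_dim = 4096
expansion_factor = 4
dict_size = activation_dim * expansion_factor  # 16384
group_fractions = [1/2, 1/4, 1/8, 1/16, 1/16]
group_sizes = [int(f * dict_size) for f in group_fractions]  # [8192, 4096, 2048, 1024, 1024]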
Automated Interpretation
For SAEs with expansion factors 2 and 4, we also provide automatically-generated interpretations of each feature, again as described in our preprint. These are the files autointerp_layer15_res_matryoshka_k256_ef{2,4}.csv.
These interpretations were generated by showing GPT-4o data samples selected based on the activation strength for that feature. Note that we did not show GPT-4o the images, so these interpretations are necessarily limited. We did not run full automated interpretation on expansion factor 8 due to the large number of features (32,768).
We scored the quality of the interpretations using the detection scoring approach from Automatically Interpreting Millions of Features in Large Language Models, wherein the interpretation is provided to an LLM judge (again, GPT-4o) to predict whether a new sample will activate the feature. We provide binary classification metrics (accuracy, precision, recall, and F1) for each feature, for both the 'train' samples (those used to generate the interpretation) and 'validation' samples (held-out), as a measure of interpretability. We also provide statistics on how often each feature was observed to activate in a random subset of the training set (n), to facilitate further analyses.
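For example, these files could be used to select features whose interpretations generalise well to held-out samples. The column names below are assumptions for illustration; inspect the CSV header for the actual names:
import pandas as pd
df = pd.read_csv("autointerp_layer15_res_matryoshka_k256_ef4.csv")
print(df.columns)  # check the real column names first
# hypothetical columns: keep features with high held-out detection F1
# that also activated often enough (n) to be reliably scored
well_interpreted = df[(df["f1_validation"] > 0.8) & (df["n"] > 50)]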
Citation
BibTeX:
@article{maira2sae,
title={Insights into a radiology-specialised multimodal large language model with sparse autoencoders},
author={Kenza Bouzid and Shruthi Bannur and Felix Meissen and Daniel Coelho de Castro and Anton Schwaighofer and Javier Alvarez-Valle and Stephanie L. Hyland},
journal={Actionable Interpretability Workshop @ ICML 2025},
year={2025},
url={https://arxiv.org/abs/2507.12950}
}
APA:
Bouzid, K., Bannur, S., Meissen, F., Coelho de Castro, D., Schwaighofer, A., Alvarez-Valle, J., & Hyland, S. L. (2025). Insights into a radiology-specialised multimodal large language model with sparse autoencoders. Actionable Interpretability Workshop @ ICML 2025. arXiv.
Model Card Contact
- Stephanie Hyland ([email protected])
- Kenza Bouzid ([email protected])