metadata
license: mit
datasets:
  - charlieoneill/csLG
  - JSALT2024-Astro-LLMs/astro_paper_corpus
language:
  - en
tags:
  - sparse-autoencoder
  - embeddings
  - interpretability
  - scientific-nlp

Sparse Autoencoders for Scientific Paper Embeddings

This repository contains a collection of Sparse Autoencoders (SAEs) trained on embeddings from scientific papers in two domains: Computer Science (cs.LG) and Astrophysics (astro.PH). These SAEs are designed to disentangle semantic concepts in dense embeddings while maintaining semantic fidelity.

Model Description

Overview

The SAEs are trained on embeddings of arXiv paper abstracts from two categories: cs.LG (Computer Science, Machine Learning) and astro.PH (Astrophysics, listed on arXiv as astro-ph). Each SAE takes a dense text embedding derived from a large language model and decomposes it into a sparse set of interpretable features.

Model Architecture

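Each SAE uses a top-k architecture: for every input embedding, only the k largest latent activations are kept and all others are set to zero. Two hyperparameters vary across the released models:

  • k: number of active latents per input (16, 32, 64, or 128)
  • n: total number of latents (3072, 4608, 6144, 9216, or 12288)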
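
To illustrate how a top-k SAE with k active latents out of n behaves, here is a minimal pure-Python sketch of a single forward pass. The function name `topk_sae_forward`, the weight layout, the bias terms, and the ReLU applied to the kept values are assumptions made for illustration, not a description of the released checkpoints. A practical implementation would use a tensor library (for example `torch.topk`) rather than Python lists.

```python
import random


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep only the k largest pre-activations, then decode.

    x:     input embedding (list of d floats)
    W_enc: n rows of d floats, one encoder row per latent
    W_dec: n rows of d floats, one decoder direction per latent
    """
    # One pre-activation score per latent.
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(W_enc, b_enc)]
    # Indices of the k largest pre-activations.
    top = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    # Sparse latent vector: zero everywhere except the top-k positions.
    z = [0.0] * len(pre)
    for i in top:
        z[i] = max(pre[i], 0.0)  # ReLU on kept values (an assumption)
    # Reconstruction: decoder bias plus the weighted sum of active decoder rows.
    x_hat = list(b_dec)
    for i in top:
        for j in range(len(x_hat)):
            x_hat[j] += z[i] * W_dec[i][j]
    return z, x_hat


# Toy sizes for illustration only; the released models use n in the thousands.
random.seed(0)
d, n, k = 8, 24, 4
x = [random.gauss(0, 1) for _ in range(d)]
W_enc = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
W_dec = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
b_enc = [0.0] * n
b_dec = [0.0] * d

z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
active = [i for i, v in enumerate(z) if v != 0.0]
print(len(active), "of", n, "latents are active")
```

Because at most k entries of `z` are non-zero, each reconstruction is a combination of at most k decoder directions, which is what makes the individual latents candidates for interpretation.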

The naming convention for the models is: {domain}_{k}_{n}_{batch_size}.pth

For example, csLG_128_3072_256.pth represents an SAE trained on cs.LG data with k=128, n=3072, and a batch size of 256.
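
When working with several checkpoints, it can help to recover the hyperparameters from the filename. The sketch below parses the `{domain}_{k}_{n}_{batch_size}.pth` pattern; the names `SAEConfig` and `parse_sae_filename` are hypothetical helpers and not part of this repository, and the pattern assumes the domain tag contains no underscore.

```python
import re
from typing import NamedTuple


class SAEConfig(NamedTuple):
    domain: str
    k: int           # number of active latents
    n: int           # total number of latents
    batch_size: int


# Matches {domain}_{k}_{n}_{batch_size}.pth, assuming no underscore in the domain tag.
_PATTERN = re.compile(
    r"^(?P<domain>[^_]+)_(?P<k>\d+)_(?P<n>\d+)_(?P<batch_size>\d+)\.pth$"
)


def parse_sae_filename(filename: str) -> SAEConfig:
    """Split a checkpoint filename into its hyperparameters."""
    m = _PATTERN.match(filename)
    if m is None:
        raise ValueError(f"unexpected checkpoint name: {filename!r}")
    return SAEConfig(m["domain"], int(m["k"]), int(m["n"]), int(m["batch_size"]))


cfg = parse_sae_filename("csLG_128_3072_256.pth")
print(cfg)  # SAEConfig(domain='csLG', k=128, n=3072, batch_size=256)
```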

Intended Uses & Limitations

These SAEs are primarily intended for:

  1. Extracting interpretable features from dense embeddings of scientific texts
  2. Enabling fine-grained control over semantic search in scientific literature
  3. Studying the structure of semantic spaces in specific scientific domains

Limitations:

  • The models are domain-specific (cs.LG and astro.PH) and may not generalize well to other domains
  • Performance may vary depending on the quality and domain-specificity of the input embeddings

Training Data

The SAEs were trained on embeddings of abstracts from:

  • cs.LG: 153,000 papers
  • astro.PH: 272,000 papers

Training Procedure

The SAEs were trained using a custom loss function combining reconstruction loss, sparsity constraints, and an auxiliary loss. For detailed training procedures, please refer to our paper (link to be added upon publication).

Evaluation Results

Performance metrics for various configurations:

| k | n | Domain | MSE | Log FD | Act Mean |
|---|---|---|---|---|---|
| 16 | 3072 | astro.PH | 0.2264 | -2.7204 | 0.1264 |
| 16 | 3072 | cs.LG | 0.2284 | -2.7314 | 0.1332 |
| 64 | 9216 | astro.PH | 0.1182 | -2.4682 | 0.0539 |
| 64 | 9216 | cs.LG | 0.1240 | -2.3536 | 0.0545 |
| 128 | 12288 | astro.PH | 0.0936 | -2.7025 | 0.0399 |
| 128 | 12288 | cs.LG | 0.0942 | -2.0858 | 0.0342 |

  • MSE: Normalised Mean Squared Error
  • Log FD: Mean log density of feature activations
  • Act Mean: Mean activation value across non-zero features
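
The exact formulas behind these metrics are not given here, so the following is only a sketch of one plausible reading. It assumes that the MSE is normalised by the error of always predicting the mean embedding, that feature density is the fraction of inputs on which a latent is non-zero, and that the logarithm is base 10. The base-10 assumption is at least consistent with the table: a top-k SAE activates at most k of n latents per input, so for k=16 and n=3072 the average density is at most 16/3072 ≈ 0.0052, whose log10 is about -2.28, and the mean of the logs can only be lower than that. The function names and the `eps` guard for dead latents are illustrative choices.

```python
import math


def normalised_mse(inputs, recons):
    """Reconstruction error divided by the error of predicting the mean input."""
    d = len(inputs[0])
    mean = [sum(x[j] for x in inputs) / len(inputs) for j in range(d)]
    err = sum((a - b) ** 2 for x, r in zip(inputs, recons) for a, b in zip(x, r))
    base = sum((a - m) ** 2 for x in inputs for a, m in zip(x, mean))
    return err / base


def mean_log_feature_density(latents, eps=1e-10):
    """Mean over latents of log10(fraction of inputs where the latent is non-zero)."""
    n = len(latents[0])
    dens = [sum(1 for z in latents if z[i] != 0.0) / len(latents) for i in range(n)]
    # eps keeps log10 finite for latents that never fire (an assumption).
    return sum(math.log10(p + eps) for p in dens) / n


def mean_nonzero_activation(latents):
    """Mean of all non-zero activation values."""
    vals = [v for z in latents for v in z if v != 0.0]
    return sum(vals) / len(vals)


# Toy data: 4 inputs of dimension 2, and 4 latent vectors with 2 latents each.
inputs = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
recons = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
latents = [[1.0, 0.0], [0.0, 0.0], [3.0, 0.0], [0.0, 2.0]]

print(normalised_mse(inputs, recons))       # 0.5
print(mean_log_feature_density(latents))
print(mean_nonzero_activation(latents))     # 2.0
```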

For full results, please refer to our paper (link to be added upon publication).

Ethical Considerations

While these models are designed to improve interpretability, users should be aware that:

  1. The extracted features may reflect biases present in the scientific literature used for training
  2. Interpretations of the features should be validated carefully, especially when used for decision-making processes

Citation

If you use these models in your research, please cite our paper (citation to be added upon publication).

Additional Information

For more details on the methodology, feature families, and applications in semantic search, please refer to our full paper (link to be added upon publication).