license: mit
datasets:
- charlieoneill/csLG
- JSALT2024-Astro-LLMs/astro_paper_corpus
language:
- en
tags:
- sparse-autoencoder
- embeddings
- interpretability
- scientific-nlp
Sparse Autoencoders for Scientific Paper Embeddings
This repository contains a collection of Sparse Autoencoders (SAEs) trained on embeddings from scientific papers in two domains: Computer Science (cs.LG) and Astrophysics (astro.PH). These SAEs are designed to disentangle semantic concepts in dense embeddings while maintaining semantic fidelity.
Model Description
Overview
The SAEs in this repository are trained on embeddings of scientific paper abstracts from arXiv, specifically from the cs.LG (Computer Science - Machine Learning) and astro.PH (Astrophysics, arXiv category astro-ph) categories. They are designed to extract interpretable features from dense text embeddings derived from large language models.
Model Architecture
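Each SAE uses a top-k architecture: for every input embedding, only the k largest latent activations are kept and all others are set to zero. Two hyperparameters vary across the models: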
- k: number of active latents (16, 32, 64, or 128)
- n: total number of latents (3072, 4608, 6144, 9216, or 12288)
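The top-k mechanism can be illustrated with a minimal NumPy sketch. It is not the released implementation: the parameter names (`W_enc`, `b_enc`, `W_dec`, `b_dec`), the subtraction of the decoder bias before encoding, the ReLU on the kept values, and the toy sizes are all assumptions chosen for illustration.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep the k largest pre-activations per row, then decode.

    Assumed (hypothetical) shapes: x (batch, d), W_enc (d, n), W_dec (n, d).
    """
    pre = (x - b_dec) @ W_enc + b_enc              # (batch, n) pre-activations
    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    latents = np.zeros_like(pre)
    # Keep only the selected entries; ReLU makes active latents non-negative.
    latents[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    recon = latents @ W_dec + b_dec                # (batch, d) reconstruction
    return latents, recon

# Toy sizes for illustration only; the released models use n in the thousands.
rng = np.random.default_rng(0)
d, n, k = 8, 32, 4
x = rng.normal(size=(5, d))
W_enc = rng.normal(size=(d, n)) / np.sqrt(d)
W_dec = W_enc.T.copy()
b_enc = np.zeros(n)
b_dec = np.zeros(d)

latents, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
print(latents.shape, recon.shape)
print((latents != 0).sum(axis=1))  # at most k active latents per input
```

Because the sparsity level is fixed by k rather than tuned through a penalty weight, each input is described by at most k latents regardless of n.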
The naming convention for the models is:
{domain}_{k}_{n}_{batch_size}.pth
For example, csLG_128_3072_256.pth represents an SAE trained on cs.LG data with k=128, n=3072, and a batch size of 256.
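The naming convention above is regular enough to parse mechanically. The sketch below uses only the Python standard library; `SAEConfig` and `parse_sae_filename` are hypothetical helper names, not part of this repository.

```python
import re
from typing import NamedTuple

class SAEConfig(NamedTuple):
    domain: str
    k: int
    n: int
    batch_size: int

# Matches {domain}_{k}_{n}_{batch_size}.pth as described above.
_PATTERN = re.compile(
    r"^(?P<domain>[^_]+)_(?P<k>\d+)_(?P<n>\d+)_(?P<batch_size>\d+)\.pth$"
)

def parse_sae_filename(filename: str) -> SAEConfig:
    """Split a checkpoint filename into its hyperparameters."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {filename!r}")
    return SAEConfig(
        domain=match["domain"],
        k=int(match["k"]),
        n=int(match["n"]),
        batch_size=int(match["batch_size"]),
    )

config = parse_sae_filename("csLG_128_3072_256.pth")
print(config)  # SAEConfig(domain='csLG', k=128, n=3072, batch_size=256)
```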
Intended Uses & Limitations
These SAEs are primarily intended for:
- Extracting interpretable features from dense embeddings of scientific texts
- Enabling fine-grained control over semantic search in scientific literature
- Studying the structure of semantic spaces in specific scientific domains
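One way such control over semantic search could work is to encode a query embedding into SAE latents, change the value of a single latent, decode back to embedding space, and rank documents by cosine similarity. The sketch below illustrates that idea under stated assumptions: the function name `steer_and_search`, the decoder parameters `W_dec` and `b_dec`, and the toy data are hypothetical, and the procedure is not necessarily the one used in the paper.

```python
import numpy as np

def steer_and_search(query_latents, W_dec, b_dec, doc_embeddings,
                     feature, value, top_n=3):
    """Set one latent to `value`, decode, and rank documents by cosine similarity."""
    steered = query_latents.copy()
    steered[feature] = value                      # intervene on a single feature
    query = steered @ W_dec + b_dec               # back to embedding space
    q = query / np.linalg.norm(query)
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    scores = docs @ q                             # cosine similarity per document
    order = np.argsort(-scores)[:top_n]           # best matches first
    return order, scores[order]

# Toy setup: 3 latents, 3-dimensional embeddings, identity decoder.
W_dec = np.eye(3)
b_dec = np.zeros(3)
doc_embeddings = np.eye(3)                        # document i aligns with latent i
query_latents = np.array([1.0, 0.0, 0.0])

order, scores = steer_and_search(
    query_latents, W_dec, b_dec, doc_embeddings, feature=2, value=5.0
)
print(order)   # document 2 now ranks first
```

Any such intervention should be checked against the unmodified query, since changing one latent can shift the decoded embedding in ways that affect unrelated results.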
Limitations:
- The models are domain-specific (cs.LG and astro.PH) and may not generalize well to other domains
- Performance may vary depending on the quality and domain-specificity of the input embeddings
Training Data
The SAEs were trained on embeddings of abstracts from:
- cs.LG: 153,000 papers
- astro.PH: 272,000 papers
Training Procedure
The SAEs were trained using a custom loss function combining reconstruction loss, sparsity constraints, and an auxiliary loss. For detailed training procedures, please refer to our paper (link to be added upon publication).
Evaluation Results
Performance metrics for various configurations:
| k | n | Domain | MSE | Log FD | Act Mean |
|---|---|---|---|---|---|
| 16 | 3072 | astro.PH | 0.2264 | -2.7204 | 0.1264 |
| 16 | 3072 | cs.LG | 0.2284 | -2.7314 | 0.1332 |
| 64 | 9216 | astro.PH | 0.1182 | -2.4682 | 0.0539 |
| 64 | 9216 | cs.LG | 0.1240 | -2.3536 | 0.0545 |
| 128 | 12288 | astro.PH | 0.0936 | -2.7025 | 0.0399 |
| 128 | 12288 | cs.LG | 0.0942 | -2.0858 | 0.0342 |
- MSE: Normalised Mean Squared Error
- Log FD: Mean log density of feature activations
- Act Mean: Mean activation value across non-zero features
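The three metrics can be computed along the following lines. This is a sketch under assumptions, since the exact definitions are given in the paper rather than here: the MSE is normalised by the error of always predicting the mean embedding, feature density is the fraction of inputs on which a latent is non-zero, the logarithm is base 10, and `eps` guards against latents that never fire. The function name `sae_metrics` is hypothetical.

```python
import numpy as np

def sae_metrics(x, recon, latents, eps=1e-10):
    """Return normalised MSE, mean log10 feature density, and mean active value."""
    # Reconstruction error relative to a baseline that predicts the mean embedding.
    mse = np.mean((recon - x) ** 2)
    baseline = np.mean((x - x.mean(axis=0)) ** 2)
    normalised_mse = mse / baseline
    # Fraction of inputs on which each latent is active, then mean of the logs.
    density = (latents != 0).mean(axis=0)
    log_fd = np.mean(np.log10(density + eps))
    # Mean over non-zero activations only.
    active = latents[latents != 0]
    act_mean = active.mean() if active.size else 0.0
    return {"mse": normalised_mse, "log_fd": log_fd, "act_mean": act_mean}

# Tiny hand-checkable example.
x = np.array([[0.0, 0.0], [2.0, 2.0]])
recon = np.array([[0.0, 0.0], [1.0, 1.0]])
latents = np.array([[1.0, 0.0], [3.0, 2.0]])

metrics = sae_metrics(x, recon, latents)
print(metrics)
```

In this example the normalised MSE is 0.5 (error 0.5 against a baseline of 1.0) and the mean active value is 2.0 (the mean of 1, 3, and 2).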
For full results, please refer to our paper (link to be added upon publication).
Ethical Considerations
While these models are designed to improve interpretability, users should be aware that:
- The extracted features may reflect biases present in the scientific literature used for training
- Interpretations of the features should be validated carefully, especially when used for decision-making processes
Citation
If you use these models in your research, please cite our paper (citation to be added upon publication).
Additional Information
For more details on the methodology, feature families, and applications in semantic search, please refer to our full paper (link to be added upon publication).