charlieoneill/embedding-saes

+---
+license: mit
+datasets:
+- charlieoneill/csLG
+- JSALT2024-Astro-LLMs/astro_paper_corpus
+language:
+- en
+tags:
+- sparse-autoencoder
+- embeddings
+- interpretability
+- scientific-nlp
+---
+# Sparse Autoencoders for Scientific Paper Embeddings
+This repository contains a collection of Sparse Autoencoders (SAEs) trained on embeddings from scientific papers in two domains: Computer Science (cs.LG) and Astrophysics (astro.PH). These SAEs are designed to disentangle semantic concepts in dense embeddings while maintaining semantic fidelity.
+## Model Description
+### Overview
+The SAEs are trained on embeddings of arXiv paper abstracts from two categories: cs.LG (Computer Science - Machine Learning) and astro.PH (Astrophysics, the arXiv `astro-ph` archive). They extract interpretable features from dense text embeddings produced by large language models.
+### Model Architecture
+Each SAE uses a top-k architecture, in which only the k largest latent activations are kept for each input. The models differ in two hyperparameters:
+- k: number of active latents (16, 32, 64, or 128)
+- n: total number of latents (3072, 4608, 6144, 9216, or 12288)
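As a rough illustration of what a top-k activation does, the sketch below keeps the k largest pre-activation values for a single input and sets the rest to zero. It is a pure-Python toy, not the code used to train these models: the function name `top_k_activation` is hypothetical, and details such as whether a ReLU is applied to the kept values are not specified in this README.

```python
import heapq


def top_k_activation(pre_activations, k):
    """Keep the k largest pre-activations and set all others to zero.

    Hypothetical helper for illustration only; the repository's own
    implementation may differ.
    """
    if not 0 < k <= len(pre_activations):
        raise ValueError("k must be between 1 and the number of latents")
    # Indices of the k largest pre-activation values.
    top_indices = set(
        heapq.nlargest(
            k, range(len(pre_activations)), key=lambda i: pre_activations[i]
        )
    )
    return [v if i in top_indices else 0.0 for i, v in enumerate(pre_activations)]


# Toy example with n = 6 latents and k = 2 active latents.
pre_acts = [0.3, -1.2, 2.5, 0.0, 1.1, 0.7]
latents = top_k_activation(pre_acts, k=2)
print(latents)  # [0.0, 0.0, 2.5, 0.0, 1.1, 0.0]
```

In the released models the same idea applies at a larger scale, for example k=64 active latents out of n=9216.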
+
+The naming convention for the models is:
+`{domain}_{k}_{n}_{batch_size}.pth`
+For example, `csLG_128_3072_256.pth` represents an SAE trained on cs.LG data with k=128, n=3072, and a batch size of 256.
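The naming convention can be unpacked programmatically. The following is a minimal sketch that assumes the domain part contains only letters and that k, n, and the batch size are plain integers; `SAEConfig` and `parse_sae_filename` are hypothetical names introduced here, not part of the repository.

```python
import re
from typing import NamedTuple


class SAEConfig(NamedTuple):
    """Hyperparameters encoded in a checkpoint filename (hypothetical type)."""
    domain: str
    k: int
    n: int
    batch_size: int


# Assumed pattern for `{domain}_{k}_{n}_{batch_size}.pth`.
_FILENAME_PATTERN = re.compile(
    r"^(?P<domain>[A-Za-z]+)_(?P<k>\d+)_(?P<n>\d+)_(?P<batch_size>\d+)\.pth$"
)


def parse_sae_filename(filename: str) -> SAEConfig:
    """Split a checkpoint filename into its domain, k, n, and batch size."""
    match = _FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Filename does not follow the convention: {filename!r}")
    return SAEConfig(
        domain=match["domain"],
        k=int(match["k"]),
        n=int(match["n"]),
        batch_size=int(match["batch_size"]),
    )


config = parse_sae_filename("csLG_128_3072_256.pth")
print(config)  # SAEConfig(domain='csLG', k=128, n=3072, batch_size=256)
```

A filename that does not match the assumed pattern raises a `ValueError` rather than returning partial results.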
+## Intended Uses & Limitations
+These SAEs are primarily intended for:
+1. Extracting interpretable features from dense embeddings of scientific texts
+2. Enabling fine-grained control over semantic search in scientific literature
+3. Studying the structure of semantic spaces in specific scientific domains
+
+Limitations:
+- The models are domain-specific (cs.LG and astro.PH) and may not generalize well to other domains
+- Performance may vary depending on the quality and domain-specificity of the input embeddings
+## Training Data
+The SAEs were trained on embeddings of abstracts from:
+- cs.LG: 153,000 papers
+- astro.PH: 272,000 papers
+## Training Procedure
+The SAEs were trained using a custom loss function combining reconstruction loss, sparsity constraints, and an auxiliary loss. For detailed training procedures, please refer to our paper (link to be added upon publication).
+## Evaluation Results
+Performance metrics for various configurations:
+| k   | n     | Domain   | MSE    | Log FD  | Act Mean |
+|-----|-------|----------|--------|---------|----------|
+| 16  | 3072  | astro.PH | 0.2264 | -2.7204 | 0.1264   |
+| 16  | 3072  | cs.LG    | 0.2284 | -2.7314 | 0.1332   |
+| 64  | 9216  | astro.PH | 0.1182 | -2.4682 | 0.0539   |
+| 64  | 9216  | cs.LG    | 0.1240 | -2.3536 | 0.0545   |
+| 128 | 12288 | astro.PH | 0.0936 | -2.7025 | 0.0399   |
+| 128 | 12288 | cs.LG    | 0.0942 | -2.0858 | 0.0342   |
+
+- MSE: normalised mean squared error
+- Log FD: mean log density of feature activations
+- Act Mean: mean activation value across non-zero features
+
+For full results, please refer to our paper (link to be added upon publication).
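To make the three metrics concrete, here is a small pure-Python sketch under explicitly assumed definitions, since this README does not give the exact formulas: normalised MSE is taken as the reconstruction error divided by the error of always predicting the per-dimension mean, and Log FD is taken as the base-10 log of the fraction of inputs on which a feature is non-zero, averaged over features that fire at least once. Base 10 is a guess; it is at least consistent with the table, since an average density of k/n = 128/12288 gives a log10 of about -1.98. All function names are hypothetical.

```python
import math


def normalised_mse(inputs, reconstructions):
    """Reconstruction error relative to predicting the per-dimension mean (assumed definition)."""
    dims = len(inputs[0])
    means = [sum(x[d] for x in inputs) / len(inputs) for d in range(dims)]
    error = sum(
        (x[d] - r[d]) ** 2 for x, r in zip(inputs, reconstructions) for d in range(dims)
    )
    baseline = sum((x[d] - means[d]) ** 2 for x in inputs for d in range(dims))
    return error / baseline


def mean_log_feature_density(latents):
    """Mean log10 firing frequency over features that fire at least once (assumed definition)."""
    n_inputs = len(latents)
    logs = []
    for j in range(len(latents[0])):
        density = sum(1 for z in latents if z[j] != 0) / n_inputs
        if density > 0:  # skip features that never fire to avoid log(0)
            logs.append(math.log10(density))
    return sum(logs) / len(logs)


def mean_nonzero_activation(latents):
    """Average value of all non-zero latent activations."""
    active = [v for z in latents for v in z if v != 0]
    return sum(active) / len(active)


# Toy data: 4 inputs of dimension 2, and 4 latent vectors with 3 features.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
reconstructions = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0], [0.0, 0.0]]
latents = [[0.2, 0.0, 0.0], [0.4, 0.0, 0.0], [0.0, 0.6, 0.0], [0.0, 0.0, 0.0]]

print(normalised_mse(inputs, reconstructions))
print(mean_log_feature_density(latents))
print(mean_nonzero_activation(latents))
```

The definitions used for the reported numbers may differ, for example in how features that never fire are handled.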
+## Ethical Considerations
+While these models are designed to improve interpretability, users should be aware that:
+1. The extracted features may reflect biases present in the scientific literature used for training
+2. Interpretations of the features should be validated carefully, especially when used for decision-making processes
+## Citation
+If you use these models in your research, please cite our paper (citation to be added upon publication).
+## Additional Information
+For more details on the methodology, feature families, and applications in semantic search, please refer to our full paper (link to be added upon publication).