--- |
|
license: mit |
|
datasets: |
|
- charlieoneill/csLG |
|
- JSALT2024-Astro-LLMs/astro_paper_corpus |
|
language: |
|
- en |
|
tags: |
|
- sparse-autoencoder |
|
- embeddings |
|
- interpretability |
|
- scientific-nlp |
|
--- |
|
|
|
# Sparse Autoencoders for Scientific Paper Embeddings |
|
|
|
This repository contains a collection of Sparse Autoencoders (SAEs) trained on embeddings of scientific papers from two arXiv domains: Computer Science - Machine Learning (cs.LG) and Astrophysics (astro.PH). The SAEs are designed to disentangle dense embeddings into interpretable semantic concepts while preserving the semantic content of the original embeddings.
|
|
|
## Model Description |
|
|
|
### Overview |
|
|
|
The SAEs in this repository are trained on embeddings of scientific paper abstracts from arXiv, specifically from the cs.LG (Computer Science - Machine Learning) and astro.PH (Astrophysics) categories. They are designed to extract interpretable features from dense text embeddings derived from large language models. |
|
|
|
### Model Architecture |
|
|
|
Each SAE follows a top-k architecture with varying hyperparameters: |
|
- `k`: number of active latents per embedding (16, 32, 64, or 128)

- `n`: total number of latents (3072, 4608, 6144, 9216, or 12288)
|
|
|
The naming convention for the models is: |
|
`{domain}_{k}_{n}_{batch_size}.pth` |
|
|
|
For example, `csLG_128_3072_256.pth` represents an SAE trained on cs.LG data with k=128, n=3072, and a batch size of 256. |
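As a rough sketch of what such a top-k SAE looks like, here is a minimal, assumed PyTorch implementation; the class name, layer layout, bias handling, and state-dict keys of the released `.pth` checkpoints may differ, so treat this as an illustration rather than a loading recipe.

```python
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    """Minimal top-k sparse autoencoder: keep only the k largest latent activations."""

    def __init__(self, d_embed: int, n_latents: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_embed, n_latents)
        self.decoder = nn.Linear(n_latents, d_embed)

    def forward(self, x: torch.Tensor):
        pre_acts = torch.relu(self.encoder(x))
        # Zero out everything except the k largest activations per input
        topk = torch.topk(pre_acts, self.k, dim=-1)
        latents = torch.zeros_like(pre_acts).scatter_(-1, topk.indices, topk.values)
        recon = self.decoder(latents)
        return recon, latents


# Hypothetical loading of csLG_128_3072_256.pth (k=128, n=3072); d_embed must match
# the embedding model that produced the inputs, and the checkpoint's key names may
# not match this sketch.
# sae = TopKSAE(d_embed=1536, n_latents=3072, k=128)
# sae.load_state_dict(torch.load("csLG_128_3072_256.pth"))
```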
|
|
|
## Intended Uses & Limitations |
|
|
|
These SAEs are primarily intended for: |
|
1. Extracting interpretable features from dense embeddings of scientific texts |
|
2. Enabling fine-grained control over semantic search in scientific literature |
|
3. Studying the structure of semantic spaces in specific scientific domains |
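For illustration, a hypothetical snippet covering uses 1 and 2 with the `TopKSAE` sketch above (the feature-editing step is an assumed workflow, not an official API; the random tensor stands in for a real abstract embedding):

```python
import torch

x = torch.randn(1, 1536)  # stand-in for an abstract embedding (dimension depends on the embedding model)
recon, latents = sae(x)   # sae: a loaded TopKSAE as in the sketch above

# 1. Interpretable features: indices of the k latents active for this abstract
active = torch.nonzero(latents[0]).squeeze(-1).tolist()

# 2. Fine-grained semantic search: damp one feature, decode back to embedding space,
#    and run nearest-neighbour search with the modified embedding as usual.
edited = latents.clone()
edited[0, active[0]] = 0.0
x_edited = sae.decoder(edited)
```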
|
|
|
Limitations: |
|
- The models are domain-specific (cs.LG and astro.PH) and may not generalize well to other domains |
|
- Performance may vary depending on the quality and domain-specificity of the input embeddings |
|
|
|
## Training Data |
|
|
|
The SAEs were trained on embeddings of abstracts from: |
|
- cs.LG: 153,000 papers |
|
- astro.PH: 272,000 papers |
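Both corpora are linked in the card metadata; a minimal way to pull them with the Hugging Face `datasets` library (the `train` split and the column layout are assumptions to verify against each dataset card):

```python
from datasets import load_dataset

# Dataset IDs taken from this card's metadata; check each dataset card for the exact schema.
cs_lg = load_dataset("charlieoneill/csLG", split="train")
astro = load_dataset("JSALT2024-Astro-LLMs/astro_paper_corpus", split="train")
print(cs_lg)
```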
|
|
|
## Training Procedure |
|
|
|
The SAEs were trained using a custom loss function combining reconstruction loss, sparsity constraints, and an auxiliary loss. For detailed training procedures, please refer to our paper (link to be added upon publication). |
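The exact formulation is given in the paper; purely as an illustration of the kind of objective commonly used for top-k SAEs (an assumed form, not necessarily the loss used here):

```python
import torch


def sae_objective(x, recon, aux_recon=None, aux_coef=1.0 / 32):
    """Illustrative top-k SAE objective (assumed form; see the paper for the exact loss).

    Sparsity is enforced architecturally by the top-k activation, so there is no explicit
    L1 term here; the auxiliary term is a second reconstruction loss, e.g. one aimed at
    reviving latents that rarely activate.
    """
    norm = ((x - x.mean(dim=0)) ** 2).mean()
    loss = ((recon - x) ** 2).mean() / norm  # normalised reconstruction MSE
    if aux_recon is not None:
        loss = loss + aux_coef * ((aux_recon - x) ** 2).mean() / norm
    return loss
```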
|
|
|
## Evaluation Results |
|
|
|
Performance metrics for selected configurations:
|
|
|
| k   | n     | Domain   | MSE    | Log FD  | Act Mean |
|-----|-------|----------|--------|---------|----------|
| 16  | 3072  | astro.PH | 0.2264 | -2.7204 | 0.1264   |
| 16  | 3072  | cs.LG    | 0.2284 | -2.7314 | 0.1332   |
| 64  | 9216  | astro.PH | 0.1182 | -2.4682 | 0.0539   |
| 64  | 9216  | cs.LG    | 0.1240 | -2.3536 | 0.0545   |
| 128 | 12288 | astro.PH | 0.0936 | -2.7025 | 0.0399   |
| 128 | 12288 | cs.LG    | 0.0942 | -2.0858 | 0.0342   |
|
|
|
* __MSE__: Normalised Mean Squared Error |
|
* __Log FD__: Mean log density of feature activations |
|
* __Act Mean__: Mean activation value across non-zero features |
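For reference, one way these metrics could be computed from a batch of inputs, reconstructions, and latent activations (assumed definitions chosen to match the descriptions above, including a base-10 log; the paper's exact formulas may differ):

```python
import torch


def evaluation_metrics(x, recon, latents, eps=1e-10):
    # Normalised MSE: reconstruction error relative to the scale of the inputs
    mse = ((recon - x) ** 2).mean() / ((x - x.mean(dim=0)) ** 2).mean()
    # Log FD: mean log feature density, i.e. the fraction of inputs on which each latent fires
    density = (latents > 0).float().mean(dim=0)
    log_fd = torch.log10(density + eps).mean()
    # Act Mean: mean activation value over the non-zero latent activations
    act_mean = latents[latents > 0].mean()
    return mse.item(), log_fd.item(), act_mean.item()
```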
|
|
|
For full results, please refer to our paper (link to be added upon publication). |
|
|
|
## Ethical Considerations |
|
|
|
While these models are designed to improve interpretability, users should be aware that: |
|
1. The extracted features may reflect biases present in the scientific literature used for training |
|
2. Feature interpretations should be validated carefully, especially when they inform decision-making processes
|
|
|
## Citation |
|
|
|
If you use these models in your research, please cite our paper (citation to be added upon publication). |
|
|
|
## Additional Information |
|
|
|
For more details on the methodology, feature families, and applications in semantic search, please refer to our full paper (link to be added upon publication). |