--- |
|
license: mit |
|
datasets: |
|
- charlieoneill/csLG |
|
- JSALT2024-Astro-LLMs/astro_paper_corpus |
|
language: |
|
- en |
|
tags: |
|
- sparse-autoencoder |
|
- embeddings |
|
- interpretability |
|
- scientific-nlp |
|
--- |
|
|
|
# Sparse Autoencoders for Scientific Paper Embeddings |
|
|
|
This repository contains a collection of Sparse Autoencoders (SAEs) trained on embeddings of scientific papers from two arXiv domains: Computer Science - Machine Learning (cs.LG) and Astrophysics (astro.PH). The SAEs are designed to disentangle dense embeddings into interpretable semantic concepts while preserving the semantic content of the original embeddings.
|
|
|
## Model Description |
|
|
|
### Overview |
|
|
|
The SAEs in this repository are trained on embeddings of scientific paper abstracts from arXiv, specifically from the cs.LG (Computer Science - Machine Learning) and astro.PH (Astrophysics) categories. They are designed to extract interpretable features from dense text embeddings derived from large language models. |
|
|
|
### Model Architecture |
|
|
|
Each SAE follows a top-k architecture with varying hyperparameters: |
|
- `k`: number of active latents per embedding (16, 32, 64, or 128)

- `n`: total number of latents (3072, 4608, 6144, 9216, or 12288)
|
|
|
The naming convention for the models is: |
|
`{domain}_{k}_{n}_{batch_size}.pth` |
|
|
|
For example, `csLG_128_3072_256.pth` represents an SAE trained on cs.LG data with k=128, n=3072, and a batch size of 256. |
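As a rough sketch of what such a top-k SAE looks like, here is a minimal, assumed PyTorch implementation; the class name, layer layout, bias handling, and state-dict keys of the released `.pth` checkpoints may differ, so treat this as an illustration rather than a loading recipe.

```python
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    """Minimal top-k sparse autoencoder: keep only the k largest latent activations."""

    def __init__(self, d_embed: int, n_latents: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_embed, n_latents)
        self.decoder = nn.Linear(n_latents, d_embed)

    def forward(self, x: torch.Tensor):
        pre_acts = torch.relu(self.encoder(x))
        # Zero out everything except the k largest activations per input
        topk = torch.topk(pre_acts, self.k, dim=-1)
        latents = torch.zeros_like(pre_acts).scatter_(-1, topk.indices, topk.values)
        recon = self.decoder(latents)
        return recon, latents


# Hypothetical loading of csLG_128_3072_256.pth (k=128, n=3072); d_embed must match
# the embedding model that produced the inputs, and the checkpoint's key names may
# not match this sketch.
# sae = TopKSAE(d_embed=1536, n_latents=3072, k=128)
# sae.load_state_dict(torch.load("csLG_128_3072_256.pth"))
```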
|
|
|
## Intended Uses & Limitations |
|
|
|
These SAEs are primarily intended for: |
|
1. Extracting interpretable features from dense embeddings of scientific texts |
|
2. Enabling fine-grained control over semantic search in scientific literature |
|
3. Studying the structure of semantic spaces in specific scientific domains |
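For illustration, a hypothetical snippet covering uses 1 and 2 with the `TopKSAE` sketch above (the feature-editing step is an assumed workflow, not an official API; the random tensor stands in for a real abstract embedding):

```python
import torch

x = torch.randn(1, 1536)  # stand-in for an abstract embedding (dimension depends on the embedding model)
recon, latents = sae(x)   # sae: a loaded TopKSAE as in the sketch above

# 1. Interpretable features: indices of the k latents active for this abstract
active = torch.nonzero(latents[0]).squeeze(-1).tolist()

# 2. Fine-grained semantic search: damp one feature, decode back to embedding space,
#    and run nearest-neighbour search with the modified embedding as usual.
edited = latents.clone()
edited[0, active[0]] = 0.0
x_edited = sae.decoder(edited)
```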
|
|
|
Limitations: |
|
- The models are domain-specific (cs.LG and astro.PH) and may not generalize well to other domains |
|
- Performance may vary depending on the quality and domain-specificity of the input embeddings |
|
|
|
## Training Data |
|
|
|
The SAEs were trained on embeddings of abstracts from: |
|
- cs.LG: 153,000 papers |
|
- astro.PH: 272,000 papers |
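Both corpora are linked in the card metadata; a minimal way to pull them with the Hugging Face `datasets` library (the `train` split and the column layout are assumptions to verify against each dataset card):

```python
from datasets import load_dataset

# Dataset IDs taken from this card's metadata; check each dataset card for the exact schema.
cs_lg = load_dataset("charlieoneill/csLG", split="train")
astro = load_dataset("JSALT2024-Astro-LLMs/astro_paper_corpus", split="train")
print(cs_lg)
```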
|
|
|
## Training Procedure |
|
|
|
The SAEs were trained using a custom loss function combining reconstruction loss, sparsity constraints, and an auxiliary loss. For detailed training procedures, please refer to our paper (link to be added upon publication). |
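The exact formulation is given in the paper; purely as an illustration of the kind of objective commonly used for top-k SAEs (an assumed form, not necessarily the loss used here):

```python
import torch


def sae_objective(x, recon, aux_recon=None, aux_coef=1.0 / 32):
    """Illustrative top-k SAE objective (assumed form; see the paper for the exact loss).

    Sparsity is enforced architecturally by the top-k activation, so there is no explicit
    L1 term here; the auxiliary term is a second reconstruction loss, e.g. one aimed at
    reviving latents that rarely activate.
    """
    norm = ((x - x.mean(dim=0)) ** 2).mean()
    loss = ((recon - x) ** 2).mean() / norm  # normalised reconstruction MSE
    if aux_recon is not None:
        loss = loss + aux_coef * ((aux_recon - x) ** 2).mean() / norm
    return loss
```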
|
|
|
## Evaluation Results |
|
|
|
Performance metrics for selected configurations:
|
|
|
| k   | n     | Domain   | MSE    | Log FD  | Act Mean |
|-----|-------|----------|--------|---------|----------|
| 16  | 3072  | astro.PH | 0.2264 | -2.7204 | 0.1264   |
| 16  | 3072  | cs.LG    | 0.2284 | -2.7314 | 0.1332   |
| 64  | 9216  | astro.PH | 0.1182 | -2.4682 | 0.0539   |
| 64  | 9216  | cs.LG    | 0.1240 | -2.3536 | 0.0545   |
| 128 | 12288 | astro.PH | 0.0936 | -2.7025 | 0.0399   |
| 128 | 12288 | cs.LG    | 0.0942 | -2.0858 | 0.0342   |
|
|
|
* __MSE__: Normalised Mean Squared Error |
|
* __Log FD__: Mean log density of feature activations |
|
* __Act Mean__: Mean activation value across non-zero features |
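For reference, one way these metrics could be computed from a batch of inputs, reconstructions, and latent activations (assumed definitions chosen to match the descriptions above, including a base-10 log; the paper's exact formulas may differ):

```python
import torch


def evaluation_metrics(x, recon, latents, eps=1e-10):
    # Normalised MSE: reconstruction error relative to the scale of the inputs
    mse = ((recon - x) ** 2).mean() / ((x - x.mean(dim=0)) ** 2).mean()
    # Log FD: mean log feature density, i.e. the fraction of inputs on which each latent fires
    density = (latents > 0).float().mean(dim=0)
    log_fd = torch.log10(density + eps).mean()
    # Act Mean: mean activation value over the non-zero latent activations
    act_mean = latents[latents > 0].mean()
    return mse.item(), log_fd.item(), act_mean.item()
```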
|
|
|
For full results, please refer to our paper (link to be added upon publication). |
|
|
|
## Ethical Considerations |
|
|
|
While these models are designed to improve interpretability, users should be aware that: |
|
1. The extracted features may reflect biases present in the scientific literature used for training |
|
2. Feature interpretations should be validated carefully, especially when they inform decision-making processes
|
|
|
## Citation |
|
|
|
If you use these models in your research, please cite our paper (citation to be added upon publication). |
|
|
|
## Additional Information |
|
|
|
For more details on the methodology, feature families, and applications in semantic search, please refer to our full paper (link to be added upon publication). |